Working with Regular Expressions
Regular expressions (regex) define search patterns that certain functions use to search for strings.
Regular expressions provide a flexible way to match all characters or portions of text in a string. The functions that accept regular expressions use syntax that is based on Java Regular Expressions.
Regex Basics
At a high level, a regular expression (regex) is a description of the characters you want to match. If you do not supply any reserved, special characters, a regex looks for an exact match, so ‘Cat’
matches only Cat
and not cat
.
A few examples of the reserved characters that you can include in an expression to make matching more flexible follow.
.
denotes any character except a line break, so'.at'
matchescat
,hat
,#at
,9at
, and so on.^
asserts the beginning of a line, so‘^Large’
matchesLarge shirts
but notSize: Large
.|
denotes a logical OR. This serves as a dividing point for your regex string.‘this|that’
matchesthis
orthat
.*
indicates zero or more matches, and+
indicates one or more matches, so the regex'a*b+'
matchesab
,baby
,babby
, andbbbbb
, but notaaaaa
orarb
.\
is the escape character you use to invoke the reserved character that follows it in a different way.
Tips for Tamr Core Regex
Escape characters
In Tamr Core regex, you use two backslash characters (\\
) to escape a reserved character. In Java regex, you only use one. Examples of this difference follow.
To match this literal | Tamr Core regex | Java Regex |
---|---|---|
. | \\. | \. |
| | \| | | |
any digit0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 | \\d | \d |
' (Tamr reserved character, not a Java reserved character) | \' | ' |
In Tamr Core regex, a single quote is a reserved character: you must enclose each regex clause with single quote '
characters. As a result, to match a single, literal quote character, you supply the regex '\''
.
Additional Tips
- For the most part, Tamr Core regex allows the same patterns as Java regex. For example, you can use character groups like
'\\p{IsLatin}'
for non-accented lowercase or uppercase letters or'\\p{IsAlphabetic}'
for lowercase or uppercase letters with or without diacritical marks. - Nulls are aggressive. Any
null
argument to most functions results in anull
output. - Many of the functions that use regular expressions allow you to capture groups of characters, treating multiple characters as a single unit. To capture groups of characters, enclose them in parentheses as in Java. For example, the regex
‘(\\d+)ml’
matches10ml
and captures10
. For more information, see capturing groups.
Functions that Use Regex
Examples of using regex in each of these functions follow.
extract
matches a string against a regex, and returns the specified group.extract_all
matches a string against a regex, and returns results for all groups.matches
matches a string against a regex. Returns true if it matches the entire input string.replace_all
replaces all matches of the target regex in the given string with the replacement.
Note: Thereplace()
function does not take a regex.split
splits a string into an array of strings based on where the provided regex matches.
For additional, task-oriented examples see Example Tasks for Transformation Scripts.
extract(str, regex, group)
extract(str, regex, group)
Extract matches the entire input against a regex, and returns the portion of the input that matches the specified group. Returns NULL if the input is NULL, or if there are fewer matching groups than specified.
For example, you can use extract()
to get the house number from an address: extract(address, '(^\\d+).*', 1)
Notes:
- Groups are counted from 1. A group_index of 0 returns the match on the entire pattern.
- The entire input must match the regex.
Value | Regex | Group | Result |
---|---|---|---|
"This and that." | 'This (and|or) that\\.' | 1 | and |
"This or that." | 'This (and|or) that\\.' | 1 | or |
"This and that." | 'This (and|or) that\\.' | 2 | NULL |
"This and that." | 'This ((a.*)|(o.*)) that\\.' | 2 | and |
extract_all(str, regex)
extract_all(str, regex)
Extract_all extracts all matches for all capturing groups into an array.
For example, extract_all(colname, '(\\d+)')
extracts all sequences of one or more digits into an array. If the input has no digits, the output is an empty array. If the input is NULL, the output is NULL.
Notes:
- Regex must be within a capture group
( )
for most use cases. - The input column is not modified.
- Nested groups result in portions of the input being copied to more than one part of the output.
- Matches are non-overlapping.
Value | Regex | Result | Notes |
---|---|---|---|
"This 123 and 456 that." | '(\d+)' | ["123","456"] | Multiple matches |
"This and that." | '(\d+)' | [] | Result is an empty array |
NULL | '(\d+)' | NULL | |
"This 123 and 456 that." | 'and (\d+) that' | ["456"] | Regular expression has non-capturing portions surrounding the capturing group |
"This and that." | '(This and|and that|that)' | ["This and"] | Only the first overlapping match is returned. |
"This 123 and 456 that." | '(\d+)|([A-Za-z]+)' | ["This","123","and","456","that."] | Two capturing groups, but only one matches at a time. |
"This 123 and 456 that." | '(\d(\d*))' | ["123","23","456","56"] | Two nested capturing groups; all capturing groups present in the output. |
matches(regex, str)
matches(regex, str)
Matches returns a Boolean: true
if the input matches the expression, false
if it does not, and NULL if the input is NULL.
For example, matches('(?i).*street.*', myCol)
returns true
if myCol
contains the string "street" (case-insensitive).
Notes:
- The order of arguments is reversed compared to other functions, with the regex first.
- The entire input must match the expression. Think of the expression as having implicit
^$
around it.
replace_all(str, regex, replacement)
replace_all(str, regex, replacement)
Replace_all returns a copy of the input, with all occurrences of the entire regex replaced with the supplied value. If there are no matches, the input is copied unmodified. If the input is NULL, the result is NULL.
For example, this script removes all non-alphanumeric characters, including spaces, from a field:
SELECT *,
replace_all(to_string(description), ‘([^a-zA-Z\\d])’, ‘’)
AS description_trigram;
Notes:
- Regex must be within a capture group
( )
for most use cases. - The entire regex is matched, not groups within the regex.
split(str, regex)
split(str, regex)
Split creates an array from the input by splitting it into values delimited by strings that match the regex. The portions of the input that match the regex are omitted from the output.
For example, this script takes the space-separated values (words) in a description field and outputs them into an array:
SELECT *,
split(to_string(description), ‘\\ ’)
AS description_words;
Updated about 2 years ago