Working with Regular Expressions
Use Java-based regular expressions in Tamr transformation functions.
Overview
Regular expressions provide a way to match all characters or portions of text. The functions that use regular expressions use the syntax from the Java Regular Expressions. At a high level, a regular expression (or "regex") describes the characters that must match, using *
to indicate zero or more matches, and +
to indicate one or more matches.
For example, the regex a*b+
matches ab
, baby
, babby
, and bbbbb
, but not aaaaa
or arb
.
Many of the functions with regular expressions allow you to capture groups of characters, treating multiple characters as a single unit. To capture groups of characters, enclose them in parentheses. For more information, see capturing groups.
When writing regular expressions, use the following tips:
- Enclose regular expressions in single quotes because they are of type
string
. - Use the backslash (
\
) character twice because it has a special meaning in strings. For example, to include this expression(\d+)
inside your regex statement, use:'(\\d+)'
. To escape the backslash (\
) itself, use'\\\\'
.
Examples
extract(str, regex, group)
extract(str, regex, group)
This function will match the entire input against a regex, and return the portion of the input that matches the specified group. Returns NULL if the input is NULL, or if there are fewer matching groups than specified.
Notes:
- Groups are counted from 1. Group 0 is a pseudonym for a copy of the entire input.
- the entire input must match the regex
Value | Regex | Group | Result |
---|---|---|---|
"This and that." | This (and|or) that. | 1 | and |
"This or that." | This (and|or) that. | 1 | or |
"This and that." | This (and|or) that. | 2 | NULL |
"This and that." | This ((a.*)|(o.*)) that. | 2 | and |
extract_all(str, regex)
extract_all(str, regex)
This function will extract all matches for all capturing groups into an array.
For example, extract_all(colname, '(\\d+)')
will extract all sequences of one or more digits into an array.
If the input has no digits, the output is an empty array. If the input is NULL, the output is NULL.
Notes:
- the input column is not modified
- nested groups will result in portions of the input being copied to more than one part of the output
- matches are non-overlapping
Value | Regex | Result | Notes |
---|---|---|---|
"This 123 and 456 that." | (\d+) | ["123","456"] | Multiple matches |
"This and that." | (\d+) | [] | Result is an empty array |
NULL | (\d+) | NULL | |
"This 123 and 456 that." | and (\d+) that | ["456"] | Regular expression has non-capturing portions surrounding the capturing group |
"This and that." | (This and|and that|that) | ["This and"] | Only the first overlapping match will be returned. |
"This 123 and 456 that." | (\d+)|([A-Za-z]+) | ["This","123","and","456","that."] | Two capturing groups, but only one matches at a time. |
"This 123 and 456 that." | (\d(\d*)) | ["123","23","456","56"] | Two nested capturing groups; all capturing groups present in the output. |
matches(regex, str)
matches(regex, str)
This function returns true
if the input matches the expression, false
if it does not, and NULL if the input is NULL.
Notes:
- The entire input must match the expression. Think of the expression as having implicit
^$
around it.
For example, matches('(?i).*street.*', myCol)
returns true
if myCol
contains the word "street" (case- insensitve).
replace_all(str, regex, replacement)
replace_all(str, regex, replacement)
This function returns a copy of the input, with all occurrences of the entire regex replaced with the supplied replacement. If there are no matches, the input is copied unmodified. If the input is NULL, the result is NULL.
Notes:
- The entire regex is matched, not groups within the regex.
split(str, regex)
split(str, regex)
Creates an array from the input by splitting it into values delimited by strings that match the regex. The portions of the input that match the regex are omitted from the output.
Updated almost 5 years ago