HomeTamr Core GuidesTamr Core API Reference
Tamr Core GuidesTamr Core API ReferenceTamr Core TutorialsEnrichment API ReferenceSupport Help CenterLog In

Working with Regular Expressions

Regular expressions (regex) define search patterns that certain functions use to search for strings.

Regular expressions provide a flexible way to match all characters or portions of text in a string. The functions that accept regular expressions use syntax that is based on Java Regular Expressions.

Regex Basics

At a high level, a regular expression (regex) is a description of the characters you want to match. If you do not supply any reserved, special characters, a regex looks for an exact match, so ‘Cat’ matches only Cat and not cat.

A few examples of the reserved characters that you can include in an expression to make matching more flexible follow.

  • . denotes any character except a line break, so '.at' matches cat, hat, #at, 9at, and so on.
  • ^ asserts the beginning of a line, so ‘^Large’ matches Large shirts but not Size: Large.
  • | denotes a logical OR. This serves as a dividing point for your regex string. ‘this|that’ matches this or that.
  • * indicates zero or more matches, and
  • + indicates one or more matches, so the regex 'a*b+' matches ab, baby, babby, and bbbbb, but not aaaaa or arb.
  • \ is the escape character you use to invoke the reserved character that follows it in a different way.

Tips for Tamr Core Regex

Escape characters

In Tamr Core regex, you use two backslash characters (\\) to escape a reserved character. In Java regex, you only use one. Examples of this difference follow.

To match this literal

Tamr Core regex

Java Regex

.

\\.

\.

|

\\|

\|

any digit
0, 1, 2, 3, 4, 5, 6, 7, 8, 9

\\d

\d

'
(Tamr reserved character, not a Java reserved character)

\'

'

In Tamr Core regex, a single quote is a reserved character: you must enclose each regex clause with single quote ' characters. As a result, to match a single, literal quote character, you supply the regex '\'' .

Additional Tips

  • For the most part, Tamr Core regex allows the same patterns as Java regex. For example, you can use character groups like '\\p{IsLatin}' for non-accented lowercase or uppercase letters or '\\p{IsAlphabetic}' for lowercase or uppercase letters with or without diacritical marks.
  • Nulls are aggressive. Any null argument to most functions results in a null output.
  • Many of the functions that use regular expressions allow you to capture groups of characters, treating multiple characters as a single unit. To capture groups of characters, enclose them in parentheses as in Java. For example, the regex ‘(\\d+)ml’ matches 10ml and captures 10. For more information, see capturing groups.

Functions that Use Regex

Examples of using regex in each of these functions follow.

  • extract matches a string against a regex, and returns the specified group.
  • extract_all matches a string against a regex, and returns results for all groups.
  • matches matches a string against a regex. Returns true if it matches the entire input string.
  • replace_all replaces all matches of the target regex in the given string with the replacement.
    Note: The replace() function does not take a regex.
  • split splits a string into an array of strings based on where the provided regex matches.

For additional, task-oriented examples see Example Tasks for Transformation Scripts.

extract(str, regex, group)

Extract matches the entire input against a regex, and returns the portion of the input that matches the specified group. Returns NULL if the input is NULL, or if there are fewer matching groups than specified.

For example, you can use extract() to get the house number from an address: extract(address, '(^\\d+).*', 1)

Notes:

  • Groups are counted from 1. A group_index of 0 returns the match on the entire pattern.
  • The entire input must match the regex.

Value

Regex

Group

Result

"This and that."

'This (and|or) that.'

1

and

"This or that."

'This (and|or) that.'

1

or

"This and that."

'This (and|or) that.'

2

NULL

"This and that."

'This ((a.\*)|(o.\*)) that.'

2

and

extract_all(str, regex)

Extract_all extracts all matches for all capturing groups into an array.

For example, extract_all(colname, '(\\d+)') extracts all sequences of one or more digits into an array. If the input has no digits, the output is an empty array. If the input is NULL, the output is NULL.

Notes:

  • Regex must be within a capture group ( ) for most use cases.
  • The input column is not modified.
  • Nested groups result in portions of the input being copied to more than one part of the output.
  • Matches are non-overlapping.

Value

Regex

Result

Notes

"This 123 and 456 that."

'(\d+)'

["123","456"]

Multiple matches

"This and that."

'(\d+)'

[]

Result is an empty array

NULL

'(\d+)'

NULL

"This 123 and 456 that."

'and (\d+) that'

["456"]

Regular expression has non-capturing portions surrounding the capturing group

"This and that."

'(This and|and that|that)'

["This and"]

Only the first overlapping match is returned.

"This 123 and 456 that."

'(\d+)|([A-Za-z]+)'

["This","123","and","456","that."]

Two capturing groups, but only one matches at a time.

"This 123 and 456 that."

'(\d(\d*))'

["123","23","456","56"]

Two nested capturing groups; all capturing groups present in the output.

matches(regex, str)

Matches returns a Boolean: true if the input matches the expression, false if it does not, and NULL if the input is NULL.

For example, matches('(?i).*street.*', myCol) returns true if myCol contains the string "street" (case-insensitive).

Notes:

  • The order of arguments is reversed compared to other functions, with the regex first.
  • The entire input must match the expression. Think of the expression as having implicit ^$ around it.

replace_all(str, regex, replacement)

Replace_all returns a copy of the input, with all occurrences of the entire regex replaced with the supplied value. If there are no matches, the input is copied unmodified. If the input is NULL, the result is NULL.

For example, this script removes all non-alphanumeric characters, including spaces, from a field:

SELECT *, 
replace_all(to_string(description), ‘([^a-zA-Z\\d])’, ‘’) 
AS description_trigram;

Notes:

  • Regex must be within a capture group ( ) for most use cases.
  • The entire regex is matched, not groups within the regex.

split(str, regex)

Split creates an array from the input by splitting it into values delimited by strings that match the regex. The portions of the input that match the regex are omitted from the output.

For example, this script takes the space-separated values (words) in a description field and outputs them into an array:

SELECT *, 
split(to_string(description), ‘\\ ’) 
AS description_words;

Did this page help you?