Working with Regular Expressions
Overview
Regular expressions provide a way to match all or portions of text. The functions that use regular expressions follow the Java regular expression syntax. At a high level, a regular expression (or "regex") describes the characters that need to match, using *
to indicate zero-or-more matches, and +
to indicate one-or-more matches.
For example, the regex a*b+
will match ab
, baby
, babby
, and bbbbb
, but not aaaaa
or arb
.
Many of these functions work with capturing groups; capturing groups are enclosed in parentheses.
Some things to note when writing regular expressions:
- Regular expressions are strings, and as such need to be enclosed in single quotes
- Regular expressions use the backslash (
\
) character extensively. Because this character has special meaning in strings, it needs to be doubled when writing a regex.
For example, the regex(\d+)
would need to be written as a string like this:'(\\d+)'
.
Example Usage
extract(str, regex, group)
extract(str, regex, group)
This function will match the entire input against a regex, and return the portion of the input that matches the specified group. Returns NULL if the input is NULL, or if there are fewer matching groups than specified.
Notes:
- Groups are counted from 1. Group 0 is a pseudonym for a copy of the entire input.
- the entire input must match the regex
Value | Regex | Group | Result |
---|---|---|---|
"This and that." | This (and|or) that. | 1 | and |
"This or that." | This (and|or) that. | 1 | or |
"This and that." | This (and|or) that. | 2 | NULL |
"This and that." | This ((a.*)|(o.*)) that. | 2 | and |
extract_all(str, regex)
extract_all(str, regex)
This function will extract all matches for all capturing groups into an array.
For example, extract_all(colname, '(\\d+)')
will extract all sequences of one or more digits into an array.
If the input has no digits, the output is an empty array. If the input is NULL, the output is NULL.
Notes:
- the input column is not modified
- nested groups will result in portions of the input being copied to more than one part of the output
- matches are non-overlapping
Value | Regex | Result | Notes |
---|---|---|---|
"This 123 and 456 that." | (\d+) | ["123","456"] | Multiple matches |
"This and that." | (\d+) | [] | Result is an empty array |
NULL | (\d+) | NULL | |
"This 123 and 456 that." | and (\d+) that | ["456"] | Regular expression has non-capturing portions surrounding the capturing group |
"This and that." | (This and|and that|that) | ["This and"] | Only the first overlapping match will be returned. |
"This 123 and 456 that." | (\d+)|([A-Za-z]+) | ["This","123","and","456","that."] | Two capturing groups, but only one matches at a time. |
"This 123 and 456 that." | (\d(\d*)) | ["123","23","456","56"] | Two nested capturing groups; all capturing groups present in the output. |
matches(regex, str)
matches(regex, str)
This function will return true if the input matches the expression, false if it does not, and NULL if the input is NULL.
Notes:
- The entire input must match the expression. Think of the expression as having implicit
^$
around it.
replace_all(str, regex, replacement)
replace_all(str, regex, replacement)
This function returns a copy of the input, with all occurrences of the entire regex replaced with the supplied replacement. If there are no matches, the input is copied unmodified. If the input is NULL, the result is NULL.
Notes:
- The entire regex is matched, not groups within the regex.
split(str, regex)
split(str, regex)
Creates an array from the input by splitting it into values delimited by strings that match the regex. The portions of the input that match the regex are omitted from the output.
Updated over 5 years ago