Example Tasks for Transformation Scripts
These example scripts can help you address issues in, and apply standards to, your data.
Unless otherwise noted, these examples assume that all existing columns are string type. If your column is another type, such as array<string>
, you can use data type conversion functions to convert to a string type.
Tip: Tamr recommends that you minimize type conversions by completing them once at the start of transformations.
For more information about Tamr regular expressions (regex), see Working with Regular Expressions.
The code blocks below demonstrate transformations in Tamr Core, but are presented in this guide with SQL syntax highlighting.
Cleanse accents, whitespace, or delimiters
//Replace GERMAN-STYLE Diacritic characters in givenName
SELECT *,
replace_all(givenName ,'ä', 'ae') AS tamr_givenName;
//Strip out accents
SELECT *,
strip_accents(givenName) AS tamr_givenName;
//Clean lone pipe delimiter with possible whitespaces in tamr_addressLines
SELECT *,
replace_all(
tamr_addressLines,
'^\\s*\\|\\s*$',
''
) AS tamr_addressLines;
The replace_all function accepts regular expressions (regex) as an argument. See Working with Regular Expressions.
Cleanse and concatenate fields
Scripts like these can be useful for combining addresses or names.
/* Concatenate givenName and familyName, remove whitespace,
and accommodate null values */
SELECT *,
concat(
coalesce(trim(givenName),''),
' ',
coalesce(trim(familyName),'')
) AS tamr_fullName;
The following example provides a simpler alternative to the example above. It does not trim white space, however.
//Concatenate First Name and Last Name to Full Name
SELECT *,
concat(
"First Name",
' ',
"Last Name"
) AS “tamr_Full_Name”;
//Concatenate addressLine1 and addressLine2 to tamr_addressLines
SELECT *,
concat(
addressLine1,
'|',
addressLine2
) AS tamr_addressLines;
//Standardize to 5-digit US ZIP codes by removing the +4 from any ZIP+4 values
SELECT *,
substr2(Zip_Code, 0, 5) as tamr_Zip_Code;
Validate, clean, and manipulate email addresses
//Remove invalid emails
SELECT *,
CASE
WHEN valid_email(customer_email) THEN customer_email
END AS tamr_customer_email;
In the following example, customer_emails
is an array<string>
type:
//Remove invalid emails from an array
SELECT *,
filter(valid_email, customer_emails) AS tamr_customer_email;
In the following example, customer_emails
is an array<string>
type and the "custom definition" of an invalid email is a format that contains an '@' and does not contain the words 'noreply', 'dummy', or 'fake', case-insensitive:
//Remove invalid emails from an array with a custom definition of invalid
SELECT *,
filter(
lambda a_email:
matches('.*@.*', a_email) AND NOT
matches('.*(?i)(noreply|dummy|fake).*', a_email),
customer_emails
) AS tamr_customer_emails;
//Extract local and domain from email address to separate attributes
SELECT *,
extract(
customer_email,
'^(.+)\\@.*',1
) AS tamr_customer_email_local,
extract(
customer_email,
'.*\\@(.+)$',1
) AS tamr_customer_email_domain;
//Extract emails from a description attribute
SELECT *, extract_all(to_string(Description), '([\\w\\.-]+\\@[\\w\\.-]+)') as Email;
The matches, extract, and extract_all functions accept regular expressions (regex) as arguments. See Working with Regular Expressions.
Validate and cleanse telephone numbers
//Split an array field where the strings contain a list of entries
SELECT *,
split("customer_phone", ',') AS "tamr_customer_phone";
// Split a string of pipe-separated values into an array of strings
SELECT *,
split("Phone"), '\\|') AS "tamr_customer_phone";
In the following example, customer_phones
is an array<string>
type:
//Remove any non-digit characters from telephone number
SELECT *,
map(
lambda x: replace_all(x, '[^0-9]+', ''),
customer_phones
) AS tamr_customer_phones;
/* Remove junk repeated digits from telephone number
array<string> attribute customer_phones */
SELECT *,
map(
lambda x:
replace_all(
x,
'^(0+|1+|2+|3+|4+|5+|6+|7+|8+|9+)$',
''
),
customer_phones
) AS tamr_customer_phones;
The split and replace_all functions accept regular expressions (regex) as an argument. See Working with Regular Expressions.
Increase uniformity or extract key fields
/* Using case expressions to map from shorthand variants
to long form for colours in description */
SELECT *,
CASE
WHEN matches(
'(?i).*(Lime(\\s)?Green|GRN).*',
product_description
) THEN 'Green'
WHEN matches(
'(?i).*(Banana(\\s)?Yellow|YLW).*',
product_description
) THEN 'Yellow'
WHEN matches(
'(?i).*(Plum(\\s)?Purple|PPL).*',
product_description
) THEN 'Purple'
ELSE ''
END AS tamr_colour;
The matches function accepts regular expressions (regex) as an argument. See Working with Regular Expressions.
Remove records from the unified dataset
//Filter out records where a field is empty
FILTER "Supplier Name" IS NOT EMPTY;
Create a desired cluster name
/* Generating a cluster ID column that takes parent name
(if present) otherwise takes supplier name */
SELECT *,
CASE
WHEN "parent_name" IS NOT EMPTY THEN "parent_name"
WHEN "parent_name" IS EMPTY THEN "supplier_name"
END AS tamr_cluster_name;
Split a string to an array and remove any empty values
//Split string to array<string>, trim whitespaces and remove any empty elements
SELECT *, split(get(Alias, 0), ',') as Alias;
SELECT *, MAP(Lambda x: trim(x), Alias) as Alias;
SELECT *, FILTER(is_not_empty, Alias) as Alias;
Concatenate values from multiple array[string] columns into a single array, removing null/empty values
// Concatenate values from multiple array<string> columns (NameX) into a single array, removing null/empty values
// similar to just array.concat, but this will return the non-null values even if some inputs are null
// in array.concat, null is returned if any of the inputs are null
SELECT *,
array.concat(...array.non_empties(array(Name1, Name2, Name3, Name4))) AS All_Names;
// version that keeps only distinct values
SELECT *,
array.distinct(array.concat(...array.non_empties(array(Name1, Name2, Name3, Name4)))) AS All_Names;
Updated over 2 years ago