Transformations let you shape the final output of a unified dataset without having to transform each source individually before ingesting it into Tamr. Transformations can be applied to all records in the unified dataset or only to records from specific input datasets. In either case, the input datasets themselves are not changed. Locate transformations on the Unified Dataset tab.
Transformations operate on one dataset at a time and each transformation produces a new dataset from the current dataset.
On the Unified Dataset page, there is a Show Transformations button on the right-hand side. Click it to display a pull-out menu where transformations can be added.
To begin adding transformations, click the Add Transformation button.
Transformations can be applied to records from specific input datasets, or to records from the unified dataset after the input datasets have been unioned. When the input datasets are unioned, the attributes are converted to the data type associated with that unified attribute (default of array[string]).
You can collapse or expand the two sections (one for input datasets, one for the unified dataset) and add a transformation directly to either section. You can also drag a transformation between open sections to change its scope. If you add a transformation in the Input Datasets section, "For records from __ datasets" appears in its bottom left corner. Click this text to choose which datasets' records the transformation applies to. You can choose one or more datasets.
Once a transformation is written, you can use the Preview all button in the transformations panel to view a preview of the transformation. If the transformation violates any rules, you will receive an error message explaining why it failed. This preview is a way to test transformations without affecting the entire dataset; you can iterate and improve your transformations quickly, viewing the results as you go before you save any changes.
Once you are satisfied with the results, you can save the changes. The Save button has a small number on it, indicating how many changes have been made since the last time you saved changes. If you don't save transformations, they will not persist if you navigate away from the page. The Save button is disabled if you have any transformations with errors.
Clicking Cancel changes reverts transformations to their last saved state, undoing all of the changes counted in the orange badge on the Save button.
You can also preview a set of transformations by clicking Preview on each individual transformation. Subsequent transformations will be grayed out to signify that they are not included in the preview, but can still be edited and reordered. Pressing Save changes will save all changes, not only those made to the previewed set of transformations.
Transformations only have local effects and can be reordered freely. Reordering is permitted at any time, but it may change the output.
To reorder a transformation, click and hold the icon with two horizontal lines at its top left, then drag the transformation to the desired location.
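Because each transformation produces a new dataset from the output of the one before it, the same transformations in a different order can give a different result. The Python sketch below models this idea outside of Tamr. The functions `uppercase_name`, `keep_lowercase_names`, and `run_pipeline` are hypothetical stand-ins for illustration and are not part of the transformations language.

```python
# Illustrative model only: each "transformation" maps a list of records
# to a new list of records and never modifies its input.

def uppercase_name(records):
    """Return new records with the 'name' value upper-cased."""
    return [{**r, "name": r["name"].upper()} for r in records]


def keep_lowercase_names(records):
    """Return only the records whose 'name' is entirely lower case."""
    return [r for r in records if r["name"].islower()]


def run_pipeline(records, transformations):
    """Apply transformations in order; each one sees the previous output."""
    for transform in transformations:
        records = transform(records)
    return records


records = [{"name": "acme"}, {"name": "Globex"}]

# Same two transformations, two different orders.
filter_then_upper = run_pipeline(records, [keep_lowercase_names, uppercase_name])
upper_then_filter = run_pipeline(records, [uppercase_name, keep_lowercase_names])
```

In this model, filtering first keeps one record and then upper-cases it, while upper-casing first leaves nothing for the filter to keep. The original `records` list is left untouched in both cases, mirroring the fact that transformations do not alter input datasets.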
In order for transformations to be saved as part of the data pipeline, you must use the Update Unified Dataset button found on the Unified Dataset page and Schema Mapping page to apply the transformations to the unified dataset.
The Save changes button in the transformations panel keeps your work so that you can navigate away and come back to it later. To apply the saved work to all records in your Unified Dataset, export the transformed Unified Dataset, or have the transformed Unified Dataset continue in your data pipeline, use the Update Unified Dataset button.
The primary key of a dataset is a specific minimal set of attributes that uniquely identifies a record.
For example, when you upload a source dataset to Tamr, Tamr selects the dataset's primary key as the dataset column that uniquely identifies its records. If no such column exists, Tamr adds a column and populates it using a generated value that is guaranteed to be unique. See Uploading a Dataset.
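The behavior described above can be pictured with a small Python model. This is only a sketch of the idea, not how Tamr selects or generates keys: the function names, the `generated_key` column name, and the use of UUIDs for generated values are all assumptions made for illustration.

```python
import uuid


def find_unique_column(records):
    """Return the first column whose values are present and distinct
    in every record, or None if no column qualifies."""
    if not records:
        return None
    for column in records[0]:
        values = [r.get(column) for r in records]
        if None not in values and len(set(values)) == len(values):
            return column
    return None


def ensure_primary_key(records, generated_column="generated_key"):
    """Return (key_column, records). If no existing column is unique,
    add a column of generated unique values (assumption: UUID strings)."""
    column = find_unique_column(records)
    if column is not None:
        return column, records
    keyed = [{**r, generated_column: str(uuid.uuid4())} for r in records]
    return generated_column, keyed
```

For example, a dataset with a distinct `id` value per record would keep `id` as its key, while a dataset in which every column contains repeated values would receive the extra generated column.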
All non-input (or non-source) datasets in Tamr are known as derived datasets. Like input datasets, derived datasets have a primary key column.
When working with unified datasets (for example, in a Mastering or Categorization project), which may be derived from one or more input datasets, Tamr automatically generates the additional column tamr_id to uniquely identify the records of the unified dataset.
When you use transformations to manipulate source or derived datasets, the tamr_id column can be created and populated either automatically or manually.
The automatic generation and populating of tamr_id ensures that records are always uniquely identified and is particularly convenient when working with transformations such as GROUP BY and JOIN, which intentionally change the uniqueness of records.
The manual generation and populating of tamr_id also ensures that records are always uniquely identified. In addition, unlike the Tamr-generated values, it lets you take into account how stable the values are, so that the same record keeps the same identifier over time.
To manually manage the primary key:
- Use transformations to directly populate the attribute
- Use the transformations hint
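To illustrate what a stable, manually managed key can look like, the Python sketch below derives an identifier from values that do not change between runs. It is an assumption-laden model, not Tamr's method: the function `stable_key` is hypothetical, and the choice of hashing the source name and source record identifier with SHA-256 is made only for this example.

```python
import hashlib
import json


def stable_key(source_name, entity_id):
    """Derive a deterministic string key from a record's source dataset
    name and its identifier within that source (both assumed stable)."""
    # JSON-encode the pair so that ("ab", "c") and ("a", "bc") cannot collide.
    payload = json.dumps([source_name, entity_id]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first_run = stable_key("customers.csv", "1001")
second_run = stable_key("customers.csv", "1001")
other_record = stable_key("customers.csv", "1002")
```

Because the key depends only on its inputs, the same record receives the same key every time the pipeline runs, which is the stability property that generated values may not provide.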
Unify Transformations supports multiple data types. The data types are
Some functions accept values of any data type, denoted as type any. You can convert between data types using casting functions such as to_integer(), which casts values of any type to type integer. Values that fail to convert become null. Some functions only accept values of certain data types. For example,
upper() can only be used on data with a type
The complex data type array[ ] supports each primitive data type, e.g. array[integer]. Additionally, it supports nesting, e.g.
The Unify-generated attributes, such as tamr_id, have a default data type of string and must be of type string when used in the unified dataset. Attributes that are not generated by Unify have a default data type of array[string] and may be of any type when used in the unified dataset.
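The rule that a failed cast yields null rather than an error can be modeled in a few lines of Python. This sketch only mirrors that one rule. The helper `to_integer_or_none` is hypothetical, and Python's `int()` parsing is an assumption that may accept or reject different inputs than to_integer() does.

```python
def to_integer_or_none(value):
    """Return value converted to int, or None when conversion fails."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # Mirror the documented behavior: a failed cast becomes null.
        return None


raw_values = ["12", "abc", None, 7]
converted = [to_integer_or_none(v) for v in raw_values]
```

A practical consequence is that a cast never stops a transformation partway through a dataset, so it is worth checking for unexpected nulls in the preview after casting.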
Transformations can be enabled in Categorization and Mastering projects during project creation, or after a project is already created by clicking the pencil icon on the project card. Once enabled, transformations cannot be disabled.
If you are writing transformations in a Categorization or Mastering project, or plan to use a Unified Dataset containing transformations in a second project, it is important that the Unify-generated columns origin_source_name and origin_entity_id meet certain conditions. Transformations can be used to maintain these conditions:
- origin_source_name must be a string. Each string should be the name of one of the input datasets.
- origin_entity_id must be a string.
Additionally, the Unify-generated column tamr_id must be a unique string, since it is a primary key. However, this is automatically managed for users by default.
A list of all supported functions is maintained here.
Attributes can be referenced with or without double quotes (attribute and "attribute" both work). However, any attribute name containing spaces or escaped characters must be wrapped in double quotes. An attribute name that itself contains double quotes can be referenced by escaping each double quote with a second one. For example, this is an "attribute name" becomes "this is an ""attribute name""".
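The quoting rule can be expressed as a small helper, for instance when generating transformation text from a list of attribute names. The Python sketch below is illustrative: both function names are hypothetical, and the pattern that decides when quotes can be omitted (letters, digits, and underscores only) is a conservative assumption rather than the documented grammar.

```python
import re

# Assumption: names made only of letters, digits, and underscores, and not
# starting with a digit, are safe to write without quotes.
_BARE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_attribute(name):
    """Wrap name in double quotes, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def reference_attribute(name):
    """Return name unquoted when it is a simple identifier, else quoted."""
    if _BARE_NAME.fullmatch(name):
        return name
    return quote_attribute(name)


quoted = quote_attribute('this is an "attribute name"')
```

Quoting is always valid according to the rule above, so `quote_attribute` can be used unconditionally. `reference_attribute` only exists to keep simple names readable.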
Attribute names in transformations are case sensitive.
Dataset names follow the same pattern and must be wrapped in double quotes if they include spaces or escaped characters, e.g. USE "myData.csv"; or USE my_data;. See join for an example referencing a source dataset.
Text wrapped in single quotes is interpreted as a string literal.
For transformations such as Script and Formula, pressing the tab key gives a list of suggested input, including functions and attributes.
Hints also autocomplete with tab in the code editor.