You can apply transformations to all records in the unified dataset, or to records from specific input datasets. As a best practice, Tamr recommends that you apply transformations to the unified dataset whenever possible.
- Applying transformations to the unified dataset is, in most cases, more efficient than applying them to the input datasets, so processing completes faster.
- You can use the Tamr-created column `origin_source_name` to apply transformations to specific input datasets, even when they are run on the unified dataset. For example, you can use a `case` expression that checks the value of `origin_source_name` before applying a data cleaning transformation. Alternatively, you can use `origin_source_name` as part of the condition of a `JOIN` statement to apply the `JOIN` to a subset of input datasets. See Managing Primary Keys and Join.
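The idea of checking `origin_source_name` before cleaning a value can be modeled outside Tamr. The following Python sketch is not Tamr's transformation language; it only illustrates the logic. The record layout (a list of dictionaries), the dataset names `vendors.csv` and `customers.csv`, and the `phone` attribute are hypothetical.

```python
def clean_phone(value):
    """Keep only the digits of a phone number string."""
    return "".join(ch for ch in value if ch.isdigit())


def transform_unified(records, target_source):
    """Return new records, cleaning "phone" only for rows from target_source."""
    result = []
    for record in records:
        updated = dict(record)  # copy, so the input records are never modified
        if record["origin_source_name"] == target_source:
            updated["phone"] = clean_phone(record["phone"])
        result.append(updated)
    return result


# Hypothetical unified dataset built from two input datasets.
unified = [
    {"origin_source_name": "vendors.csv", "phone": "(555) 010-2000"},
    {"origin_source_name": "customers.csv", "phone": "555.010.3000"},
]
cleaned = transform_unified(unified, "vendors.csv")
```

Only the record whose `origin_source_name` matches the target is cleaned. The other record, and the original `unified` list, are left as they were.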
Transformations operate at a project level and produce a single output dataset based on one or more input datasets. Transformations never change input datasets.
You can add transformations to schema mapping, categorization, and mastering projects. See Enabling Transformations in Categorization and Mastering projects.
To access transformations:
- On the Unified Dataset page, select Show Transformations. The transformation editor appears with drop-down menus for adding transformations.
- Choose Add Transformation.
If you add a transformation to the Input Datasets section, the note "For records from <name> datasets" appears in the bottom left corner of the transformation panel. Select this note to specify the datasets affected by this transformation. By default, the transformation is applied to all input datasets.
To change the scope of a transformation from the input datasets to the unified dataset or vice versa, open both sections in the transformation editor and drag the transformation between the sections.
After you write a transformation, select Preview all in the transformations panel to preview its results. If the transformation violates any rules, Tamr issues an error message explaining why it failed. Previewing lets you test transformations without affecting the entire dataset, so you can iterate quickly and view the results as you go before saving any changes.
Once you are satisfied with the transformation results, save the changes. The number on the Save button indicates how many changes have been made since the last time you saved changes. If you don't save transformations, they will not persist if you navigate away from the page. The Save button is disabled if you have any transformations with errors and when no transformations have changed.
To revert transformations and go back to the last time they were saved, choose Cancel Changes. This reverts all changes that weren't saved.
You can also preview a set of transformations by clicking Preview on each individual transformation. Subsequent transformations are then grayed out to signify that they are not included in the preview, but you can still edit and reorder them. To save changes, select Save changes. This saves all changes, and not only those made to the previewed set of transformations.
To preview your data before any transformations are applied, choose Preview at the top of the Input Datasets section.
Transformations only have local effects and you can reorder them at any time. Reordering may change the output.
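To see why reordering may change the output, consider two simple steps applied in different orders. This Python sketch is an illustration only, not Tamr syntax; the `trim` and `fill_missing` steps and the `name` attribute are hypothetical.

```python
def trim(record):
    """Strip leading and trailing whitespace from the name."""
    return {**record, "name": record["name"].strip()}


def fill_missing(record):
    """Replace an empty name with the placeholder 'UNKNOWN'."""
    name = record["name"]
    return {**record, "name": name if name else "UNKNOWN"}


def run(record, steps):
    """Apply each step in order, feeding each result into the next step."""
    for step in steps:
        record = step(record)
    return record


row = {"name": "   "}
trim_first = run(row, [trim, fill_missing])
fill_first = run(row, [fill_missing, trim])
```

When `trim` runs first, the whitespace-only name becomes empty and is then filled with the placeholder. When `fill_missing` runs first, the name is not yet empty, so it is left alone and then trimmed to an empty string.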
To reorder transformations, select and hold the icon with two horizontal lines at the top left of the transformations panel and then drag it to the desired location within the transformations script.
To apply transformations so that they become part of the data pipeline for your unified dataset, choose Update Unified Dataset on the Unified Dataset and Schema Mapping pages. This applies transformations to the unified dataset.
The Save changes button on the transformation panel keeps your work in case you navigate away or want to come back to it later. To apply this saved work to all records in the unified dataset, export the transformed unified dataset. To include the transformed unified dataset in your data pipeline, select Update Unified Dataset.
Tamr transformations support multiple data types. See Data Types and Transformations.
You can enable transformations in Categorization and Mastering projects during project creation or after a project is already created. See Transformations. Once enabled, transformations cannot be disabled.
If you are writing transformations in a Categorization or Mastering project, or plan to use a unified dataset containing transformations in a second project, the Tamr-generated columns `origin_source_name` and `origin_entity_id` must meet certain conditions. You can use transformations to maintain these conditions:
- `origin_source_name` must be a string type. Each string should be the name of one of the input datasets.
- `origin_entity_id` must be a string type.
Additionally, the Tamr-generated column `tamr_id` must be a unique string type, because it is a primary key that Tamr manages for you.
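The conditions above can be expressed as a small check. The following Python sketch is not part of Tamr; it assumes records exported as a list of dictionaries, and the function name `check_generated_columns` is hypothetical.

```python
def check_generated_columns(records, input_dataset_names):
    """Return a list of problems found in the Tamr-generated columns."""
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        # origin_source_name: a string naming one of the input datasets.
        source = record.get("origin_source_name")
        if not isinstance(source, str) or source not in input_dataset_names:
            problems.append(f"row {index}: origin_source_name is not an input dataset name")
        # origin_entity_id: a string.
        if not isinstance(record.get("origin_entity_id"), str):
            problems.append(f"row {index}: origin_entity_id is not a string")
        # tamr_id: a string that is unique across all records.
        tamr_id = record.get("tamr_id")
        if not isinstance(tamr_id, str):
            problems.append(f"row {index}: tamr_id is not a string")
        elif tamr_id in seen_ids:
            problems.append(f"row {index}: tamr_id is duplicated")
        else:
            seen_ids.add(tamr_id)
    return problems


# Hypothetical records: the second row breaks all three conditions.
rows = [
    {"origin_source_name": "a.csv", "origin_entity_id": "1", "tamr_id": "x1"},
    {"origin_source_name": "c.csv", "origin_entity_id": 2, "tamr_id": "x1"},
]
issues = check_generated_columns(rows, {"a.csv", "b.csv"})
```

An empty result means every record meets the stated conditions.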
You can reference attributes in a transformation script with or without double quotes (`attribute` and `"attribute"` both work). However, any attribute name containing spaces or escaped characters must be wrapped in double quotes. To reference an attribute name that itself contains double quotes, escape each double quote by doubling it. For example, `this is an "attribute name"` becomes `"this is an ""attribute name"""`.
Attributes in transformations are case sensitive.
Dataset names follow the same pattern as attributes. Wrap dataset names in double quotes if they include spaces or escaped characters, as in `USE "myData.csv";`. Simple names can be left unquoted, as in `USE my_data;`. See join for an example referencing an input dataset.
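If you generate transformation scripts programmatically, the quoting rule for attribute and dataset names can be captured in a small helper. This Python sketch is an illustration, not a Tamr utility. The name `quote_name` is hypothetical, and the test for which names are safe to leave unquoted (letters, digits, and underscores only, not starting with a digit) is a conservative assumption rather than Tamr's documented rule.

```python
import re

# Assumption: only names of this simple form are left unquoted.
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_name(name):
    """Return name as it could be written in a transformation script."""
    if _SIMPLE_NAME.fullmatch(name):
        return name
    # Escape embedded double quotes by doubling them, then wrap in quotes.
    return '"' + name.replace('"', '""') + '"'


quoted_attribute = quote_name('this is an "attribute name"')
quoted_dataset = quote_name("myData.csv")
plain_dataset = quote_name("my_data")
```

Quoting a name that does not strictly need it is harmless, since both quoted and unquoted references work, so the helper quotes whenever it is unsure.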
Text in single quotes is interpreted as a string literal.
For transformations such as Script and Formula, pressing the `tab` key provides a list of suggested inputs, including functions and attributes.
Hints autocomplete with `tab` in the code editor.