Tamr Documentation

Overview

Transformations let you shape the final output of a unified dataset without having to transform each input dataset individually before loading it into Tamr.

You can apply transformations to all records in the unified dataset, or only to records from specific input datasets. Transformations are located on the Unified Dataset tab. Each transformation operates on one dataset at a time and produces a new dataset from the current dataset; the input datasets themselves are never changed.

Accessing Transformations

To access transformations:

  1. On the Unified Dataset page, select Show Transformations. A pull-out menu displays where you can start adding transformations.
  2. Choose Add Transformation.

Changing Scope

You can apply transformations to records from specific input datasets, or to records from the unified dataset after the input datasets have been unioned. When the input datasets are unioned, each attribute is converted to the data type associated with its unified attribute (array[string] by default).

The transformations panel has two sections, one for input datasets and one for the unified dataset. You can collapse or expand each section and add a transformation directly to either one. You can also drag a transformation between the open sections to change its scope. If you add a transformation in the Input Datasets section, Tamr shows a note, "For records from __ datasets", in the bottom left corner of the transformations panel. Select this text to choose which datasets' records the transformation applies to. You can also choose additional datasets.

Displaying and Previewing Transformations

Once you write your transformation, you can use Preview all in the transformations panel to preview a transformation. If the transformation violates any rules, Tamr issues an error message explaining why it failed. This preview allows you to test transformations without affecting the entire dataset. You can iterate and improve your transformations quickly, viewing the results as you go before saving any changes.

Once you are satisfied with the transformation results, save the changes. The number on the Save button indicates how many changes have been made since the last time you saved changes. If you don't save transformations, they will not persist if you navigate away from the page. The Save button is disabled if you have any transformations with errors and when no transformations have changed.

To revert transformations to their last saved state, choose Cancel Changes. This reverts all unsaved changes.

You can also preview a subset of transformations by clicking Preview on an individual transformation. Subsequent transformations are then grayed out to signify that they are not included in the preview, but you can still edit and reorder them. To save changes, select Save changes. This saves all changes, not only those made to the previewed set of transformations.

To preview your data before any transformations are applied, choose Preview at the top of the Input Datasets section.

Reordering Transformations

Transformations only have local effects and you can reorder them at any time. Reordering may change the output.
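Why reordering can change the output is easiest to see with a small model. The following Python sketch is not Tamr code: the two transformations (append_suffix and uppercase_name) and the run_script helper are hypothetical, and the only thing it models is that each transformation receives the output of the one before it.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Transformation = Callable[[Record], Record]


def append_suffix(record: Record) -> Record:
    # Hypothetical transformation: add a lowercase suffix to "name".
    return {**record, "name": record["name"] + "_x"}


def uppercase_name(record: Record) -> Record:
    # Hypothetical transformation: uppercase "name".
    return {**record, "name": record["name"].upper()}


def run_script(record: Record, script: List[Transformation]) -> Record:
    # Each transformation produces a new record from the current one.
    for transformation in script:
        record = transformation(record)
    return record


record = {"name": "abc"}
first_order = run_script(record, [append_suffix, uppercase_name])
second_order = run_script(record, [uppercase_name, append_suffix])

print(first_order["name"])   # ABC_X
print(second_order["name"])  # ABC_x
```

The same two steps give different values depending on their order, which is why it is worth previewing after a reorder.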

To reorder transformations, select and hold the icon with two horizontal lines at the top left of the transformations panel and then drag it to the desired location within the transformations script.

Saving and Applying Transformations

Applying Transformations

To apply transformations so that they become part of the data pipeline for your unified dataset, choose Update Unified Dataset on the Unified Dataset and Schema Mapping pages. This applies transformations to the unified dataset.

The Save changes button on the transformation panel keeps your work in case you navigate away or want to come back to it later. To apply this saved work to all records in the unified dataset, export the transformed unified dataset. To include the transformed unified dataset in your data pipeline, select Update Unified Dataset.

Primary Key Management

The primary key of a dataset is a specific minimal set of attributes that uniquely identify a record.

For example, when you upload a source dataset to Tamr, Tamr selects the dataset's primary key as the dataset column that uniquely identifies its records. If no such column exists, Tamr adds a column and populates it using a generated value that is guaranteed to be unique. See Uploading a Dataset.

All non-input (or non-source) datasets in Tamr are known as derived datasets. Derived datasets also have a primary key column.

Unified datasets, for example in a Mastering or Categorization project, may be derived from one or more input datasets. In such cases, Tamr automatically generates the additional column tamr_id to uniquely identify the records of the unified dataset.

When you use transformations to manipulate input or derived datasets, the column tamr_id can be created and populated automatically or manually:

  • Automatic generation and population of tamr_id ensures that records are always uniquely identified. It is convenient when working with transformations such as PIVOT, GROUP BY, and JOIN, which change the uniqueness of records. Automatic primary key management was introduced in Tamr v.2019.014.1.
  • Manual generation and population of tamr_id also lets you ensure that records are always uniquely identified. In contrast to the Tamr-generated values, it additionally lets you control how stable the values are, that is, whether the same record keeps the same identifier over time.
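As an illustration of what a stable manual key could look like, the sketch below derives an identifier from values that do not change between runs. It is written in Python rather than as a Tamr transformation, and the scheme itself (hashing origin_source_name together with origin_entity_id) is an assumption made for this example, not the method Tamr uses to generate tamr_id.

```python
import hashlib


def stable_tamr_id(origin_source_name: str, origin_entity_id: str) -> str:
    # Hypothetical scheme: combine the two values with a separator that is
    # unlikely to appear in either, then hash the result.
    raw = f"{origin_source_name}\x1f{origin_entity_id}".encode("utf-8")
    # Truncated to 16 hex characters for readability in this sketch.
    return hashlib.sha256(raw).hexdigest()[:16]


# The same source record yields the same key on every run.
key_today = stable_tamr_id("customers.csv", "17")
key_tomorrow = stable_tamr_id("customers.csv", "17")
other_key = stable_tamr_id("customers.csv", "18")
```

A key built this way only stays meaningful while one output record still corresponds to one source record; after a transformation that changes record granularity, the key needs a different basis.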

User feedback is linked to tamr_id

All types of user feedback, including record categorizations, record pair labels, record locks, and record comments, are linked to the tamr_id of the unified dataset of the project. If you add or change transformations that change the value of tamr_id, user feedback, such as labels, will be lost.
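The effect can be pictured with a lookup table keyed by tamr_id. This Python sketch is only a model: the labels_by_tamr_id dictionary and the identifier values are hypothetical and say nothing about how Tamr stores feedback internally.

```python
from typing import Dict, Optional

# Hypothetical feedback store: one label recorded against a tamr_id.
labels_by_tamr_id: Dict[str, str] = {"src_a:17": "match"}


def find_label(tamr_id: str) -> Optional[str]:
    # Feedback is found only if the identifier is exactly the same.
    return labels_by_tamr_id.get(tamr_id)


old_id = "src_a:17"
# The same record after a transformation changed how tamr_id is built.
new_id = "SRC_A:17"

label_before = find_label(old_id)  # the label is found
label_after = find_label(new_id)   # None: the label no longer attaches
```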

When to Disable Automatic Primary Key Management

If you started using Tamr after v.2019.014.1, Tamr manages primary keys automatically. You do not need to turn this feature off for any project unless a specific workflow requires you to specify your own keys.

If you have created projects before Tamr v.2019.014.1, then you may want to temporarily disable automatic assignment of primary keys for workflow stability between versions. For example, this might be useful if you don't want to lose your labels after an upgrade. In some cases, you may also want to always create primary keys manually. In this case, you can disable automatic management of keys using the methods listed in the following procedure.

You can add USE HINT statements manually or with a script.

  • Manual option. The USE HINT statements apply a hint to the current transformation in the editor and to all of the subsequent transformations in that project.
  • Script option. The <unify-zip>/tamr/libs/transform-tools.jar script includes an option that automates disabling primary key assignments after an upgrade.

To manually manage the primary key tamr_id:

  1. Use transformations to populate the attribute tamr_id directly, and disable automatic primary key management with one of these methods:
    a. Use the transformations USE and HINT, and specify pkmanagement.manual. For example, to disable automatic primary key management by Tamr in a particular project, add USE HINT(pkmanagement.manual); to the first transformation.
    b. Use the option in the <unify-zip>/tamr/libs/transform-tools.jar script that adds a HINT to the project's transformations. To learn how to use it, run java -jar transform-tools.jar or java -jar transform-tools.jar pk-mgmt-disabler.

Data Types

Transformations in Tamr support multiple data types. The data types are:

Data type    Literal representation
integer      1
long         1L
double       1.0
string       'example'
boolean      true
array[ ]     NA

Some functions accept values of any data type, denoted as type any. You can convert between data types using casting functions such as to_integer(), which casts values of any type to type integer. Any value that cannot be converted becomes null. Other functions accept only values of certain data types. For example, you can use upper() only on data of type string.
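The null-on-failure behavior of casting can be modeled in a few lines. This Python sketch is an analogy, not Tamr's to_integer(): the function name to_integer_sketch is hypothetical, Python's int() rules stand in for Tamr's conversion rules (which are not described here), and None stands in for null.

```python
from typing import Any, Optional


def to_integer_sketch(value: Any) -> Optional[int]:
    # Return an integer when conversion succeeds, otherwise None ("null").
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        return None


converted = to_integer_sketch("12")       # 12
not_a_number = to_integer_sketch("abc")   # None
missing = to_integer_sketch(None)         # None
```

Because a failed cast yields null rather than an error, it is worth previewing the result to check how many values were lost in conversion.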

The complex data type array[ ] can hold any primitive data type, for example array[integer]. It also supports nesting, for example array[array[string]].

To view the data type of a unified attribute at that point within a transformation script, hover over the name of a unified attribute in the script.

Attributes generated by Tamr, such as origin_entity_id, origin_source_name, and tamr_id, have a default data type of string and must be of type string when you include them in the unified dataset. Attributes that aren't generated by Tamr have a default data type of array[string] and may be of any type when used in the unified dataset.

See also Geospatial Data Types.

Enabling Transformations in Categorization and Mastering Projects

You can enable transformations in Categorization and Mastering projects during project creation, or after a project is already created by choosing the pencil icon on the project card. Once enabled, transformations cannot be disabled.

If you are writing transformations in a Categorization or Mastering project, or plan to use a unified dataset containing transformations in a second project, it is important that the Tamr-generated columns origin_source_name and origin_entity_id meet certain conditions. Transformations can be used to maintain these conditions:

  • origin_source_name must be a string type. Each string should be a name of one of the input datasets.
  • origin_entity_id must be a string type.

Additionally, the column tamr_id generated by Tamr must be a unique string type, since it is a primary key that Tamr manages for you.
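The conditions above can be expressed as a simple check. The Python sketch below is an illustration only: check_generated_columns is a hypothetical helper that runs outside Tamr on records represented as dictionaries, and it treats an origin_source_name that is not one of the input dataset names as a problem.

```python
from typing import Dict, Iterable, List, Set


def check_generated_columns(
    records: Iterable[Dict[str, object]],
    input_dataset_names: Set[str],
) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems: List[str] = []
    seen_ids: Set[str] = set()
    for index, record in enumerate(records):
        source = record.get("origin_source_name")
        if not isinstance(source, str):
            problems.append(f"record {index}: origin_source_name is not a string")
        elif source not in input_dataset_names:
            problems.append(f"record {index}: unknown input dataset {source!r}")

        if not isinstance(record.get("origin_entity_id"), str):
            problems.append(f"record {index}: origin_entity_id is not a string")

        tamr_id = record.get("tamr_id")
        if not isinstance(tamr_id, str):
            problems.append(f"record {index}: tamr_id is not a string")
        elif tamr_id in seen_ids:
            problems.append(f"record {index}: duplicate tamr_id {tamr_id!r}")
        else:
            seen_ids.add(tamr_id)
    return problems


good_records = [
    {"origin_source_name": "a.csv", "origin_entity_id": "1", "tamr_id": "x"},
    {"origin_source_name": "a.csv", "origin_entity_id": "2", "tamr_id": "y"},
]
bad_records = [
    {"origin_source_name": "a.csv", "origin_entity_id": "1", "tamr_id": "x"},
    {"origin_source_name": "a.csv", "origin_entity_id": 2, "tamr_id": "x"},
]
```

Calling check_generated_columns(good_records, {"a.csv"}) returns an empty list, while bad_records produces one problem for the non-string origin_entity_id and one for the duplicate tamr_id.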

Additional Information

Functions List

For a full list of all supported functions, see column-producing functions.

Referencing Attributes

To reference an attribute in a transformation script, use its name with or without double quotes (attribute and "attribute" both work). Quotes are required only when the attribute name contains spaces or escaped characters. If the attribute name itself contains double quotes, escape each one by doubling it. For example, this is an "attribute name" becomes "this is an ""attribute name""".
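If you generate transformation scripts programmatically, the quoting rule can be applied mechanically. The following Python helper is a minimal sketch; quote_attribute is a hypothetical name and not part of any Tamr tool. It always adds the surrounding quotes, which the rule above permits even when they are not required.

```python
def quote_attribute(name: str) -> str:
    # Double every embedded double quote, then wrap the name in double quotes.
    return '"' + name.replace('"', '""') + '"'


print(quote_attribute('this is an "attribute name"'))
# "this is an ""attribute name"""
print(quote_attribute("first name"))
# "first name"
```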

Attributes in transformations are case sensitive.

Referencing Datasets

Dataset names follow the same pattern as attributes. Wrap a dataset name in double quotes if it includes spaces or escaped characters, as in USE "myData.csv";. Otherwise the quotes are optional, as in USE my_data;. See join for an example referencing an input dataset.

Using Single Quotes

Text wrapped in single quotes is interpreted as a string literal, for example 'string'.

Tab Autocomplete

For transformations such as Script and Formula, pressing the tab key provides a list of suggested inputs, including functions and attributes.

