Tamr Documentation

Managing Primary Keys

When you work with transformations, the uniqueness of the primary key value in the Tamr-generated tamr_id attribute is essential. For categorization and mastering projects, the tamr_id must also remain stable over time.

Each time you add a dataset to Tamr, you can specify the attribute that contains the primary key. If you do not specify a primary key, Tamr generates one and automatically populates it with the row number. See Uploading a Dataset.

When you create the unified dataset for your project, Tamr automatically creates the following attributes with data type string:

  • origin_source_name, which Tamr populates with the name of the input dataset
  • origin_entity_id, which Tamr populates with the primary key value from the input dataset
  • tamr_id, which Tamr populates with a hash of origin_source_name and origin_entity_id to create the primary key for the unified dataset
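
As a rough illustration of how a key can be derived from these two attributes, the following Python sketch hashes the source dataset name together with the source primary key. The function name make_tamr_id, the SHA-256 digest, the separator character, and the dataset names are all assumptions made for illustration; the hash that Tamr actually uses is not described on this page.

```python
import hashlib

def make_tamr_id(origin_source_name: str, origin_entity_id: str) -> str:
    """Hypothetical stand-in for the derived key; not Tamr's actual hash."""
    # A separator keeps ("ab", "c") and ("a", "bc") from producing the same input.
    payload = f"{origin_source_name}\x1f{origin_entity_id}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Two input datasets that both use "1" as a primary key value.
rows = [
    ("customers_us.csv", "1"),
    ("customers_eu.csv", "1"),
]

unified = [
    {
        "origin_source_name": source,
        "origin_entity_id": entity,
        "tamr_id": make_tamr_id(source, entity),
    }
    for source, entity in rows
]
```

Because the dataset name is part of the hashed input, records that share a primary key value in different input datasets still receive distinct keys in the unified dataset.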

At the end of your project’s transformations, these attributes must be present with the data type string. Tamr automatically manages the tamr_id primary key throughout transformations to ensure that it remains a unique string for every row. However, in some cases the result of this management might not align with your goals for transformations that affect training data. See When to Include a Transformation for tamr_id.

In categorization and mastering projects, Tamr uses the tamr_id to store all user feedback on the categorization or matching and clustering of records.

Important: If your categorization or mastering project includes a transformation that can either explicitly or unintentionally change the value of the tamr_id, all prior user feedback can be lost. As a result, when you design the transformations for a categorization or mastering project you may need to take extra steps to ensure the stability of tamr_id over time. See When to Include a Transformation for tamr_id.

When upgrading from early versions of Tamr, you may need to disable automatic primary key management in order to keep the same primary key values across the upgrade. See When to Disable Automatic Primary Key Management.

When to Include a Transformation for tamr_id

Transformations can either explicitly or unintentionally modify the Tamr-generated attributes.

  • Linear transformations (transformations that do not combine or add rows in a dataset) do not change the Tamr-generated attributes unless you explicitly modify their values, for example with upper(origin_source_name) AS origin_source_name.
  • Non-linear transformations modify tamr_id when automatic primary key management is enabled. Changes to these transformations can result in unintentional changes to tamr_id. Non-linear transformations include the EXPLODE, GROUP BY, JOIN, MERGE, PIVOT, UNION ALL, and UNPIVOT statements. Examples of changes that can unintentionally modify tamr_id include updates to your MERGE or GROUP BY keys, JOIN conditions, or EXPLODE array values, or adding or deleting input dataset rows when an UNPIVOT is included. See the description for each statement type for additional detail.
    If automatic primary key management is disabled, non-linear transformations can produce rows with duplicate tamr_id values. Duplicates violate the uniqueness requirement for a primary key at the end of transformations, and only one row per tamr_id is kept in the final output dataset. With automatic primary key management enabled, Tamr guarantees that no data is lost but cannot guarantee the stability of tamr_id.
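
The data loss described for duplicate keys can be sketched in Python. This is a simplified model rather than Tamr's implementation: the record values are invented, explode_phones mimics an EXPLODE that copies tamr_id unchanged, and keep_one_per_key assumes the last row seen for a key is the one kept (the page only states that one row per tamr_id is kept).

```python
# Each input row carries a unique tamr_id (hypothetical values).
records = [
    {"tamr_id": "a1", "name": "Acme", "phones": ["555-0100", "555-0101"]},
    {"tamr_id": "b2", "name": "Bolt", "phones": ["555-0199"]},
]

def explode_phones(rows):
    """Emit one output row per phone number, copying tamr_id unchanged."""
    return [
        {"tamr_id": r["tamr_id"], "name": r["name"], "phone": p}
        for r in rows
        for p in r["phones"]
    ]

def keep_one_per_key(rows):
    """Model a uniqueness constraint: a later row replaces an earlier one with the same key."""
    by_key = {}
    for r in rows:
        by_key[r["tamr_id"]] = r
    return list(by_key.values())

exploded = explode_phones(records)   # 3 rows, but only 2 distinct tamr_id values
final = keep_one_per_key(exploded)   # 2 rows: one of Acme's phone rows is dropped
```

In this model, three exploded rows collapse to two because both Acme rows share the key a1.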

To enforce the stability of tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id in the last transformation.

To set tamr_id in a transformation:

  1. Identify a subset of attributes in the unified dataset that uniquely identify individual records after all transformations.
  2. In your project’s last transformation, include a hash function to generate a hash code from those attributes: hash(col1, col2, col3) AS tamr_id. See hash.
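
The two steps above can be sketched in Python to show why a key built from identifying attributes stays stable between runs. The column names, sample rows, and the SHA-256 digest are assumptions for illustration; in a Tamr project the key would come from the hash function in the last transformation.

```python
import hashlib

def stable_id(record, key_attributes):
    """Hash the chosen attributes in a fixed order (stand-in for hash(col1, col2, col3))."""
    parts = [str(record[a]) for a in key_attributes]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

KEY_ATTRIBUTES = ("supplier", "part_number")  # hypothetical identifying columns

run_1 = [
    {"supplier": "Acme", "part_number": "A-1", "price": 10},
    {"supplier": "Bolt", "part_number": "B-7", "price": 12},
]
# A later run: a new row arrives first and an existing price changes.
run_2 = [
    {"supplier": "Zinc", "part_number": "Z-3", "price": 5},
    {"supplier": "Acme", "part_number": "A-1", "price": 11},
    {"supplier": "Bolt", "part_number": "B-7", "price": 12},
]

ids_1 = {stable_id(r, KEY_ATTRIBUTES) for r in run_1}
ids_2 = {stable_id(r, KEY_ATTRIBUTES) for r in run_2}

# Every key from the first run still exists in the second run.
assert ids_1 <= ids_2
# The chosen attributes must be unique per record, or keys would collide.
assert len(ids_2) == len(run_2)
```

The key depends only on the identifying attributes, so reordering rows, adding rows, or changing non-key values such as price leaves existing keys untouched.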

When to Disable Automatic Primary Key Management

The automatic primary key management feature was introduced in Tamr v2019.014.1.

  • If you started using Tamr after v2019.014.1, Tamr automatically manages primary keys. You do not need to turn this feature off for any project unless a specific workflow requires you to specify your own keys.
  • If you created projects before Tamr v2019.014.1, you may want to disable automatic assignment of primary keys to keep workflows stable between versions, for example to avoid losing your labels after an upgrade. Automatic assignment of primary keys should then remain permanently disabled for these projects.

You can disable automatic primary key management in either of the following ways.

  • Manual option. Insert USE HINT(pkmanagement.manual) statements to apply a hint to the current transformation in the editor and to all subsequent transformations in that section. Note that a hint applied to the input dataset transformations section is not automatically carried over to the unified dataset transformations section. See Hint.
  • Script option. Run ./unify-admin.sh maintenance --script DisablePKManagement to disable automatic primary key assignment after an upgrade. This script automatically adds the hint described above to each project in your instance.

Summary of Changes to Primary Key Management

In previous releases, the following changes were made to primary key (PK) management:

  • Starting with v2019.014.1, Tamr automatically assigns unique primary keys (tamr_id) if you have not manually assigned a tamr_id to your records. After you upgrade to this version, automatic primary key management is turned on. This can change primary key (tamr_id) values, which may cause pair labels to no longer be marked as verified, and you may need to re-run pair label verification for the affected records. If this is an issue, adding use hint(pkmanagement.manual); in front of all of your transformations restores your previous primary keys. Running the DisablePKManagement maintenance script, described later in this section, disables automatic assignment of primary keys for all your projects.
  • Starting with Tamr v2019.023, Tamr provides a script that applies the use hint(pkmanagement.manual); statement to each transformation in each project. The script disables automatic management of primary keys and allows you to manually update your transformations after you upgrade from Tamr v2019.014.1 or later to this version or any subsequent version.
  • Starting with v2020.007, Tamr added a command to the unify-admin.sh utility: ./unify-admin.sh maintenance --script DisablePKManagement. This command disables primary key management for all unified datasets where transformations are used.
  • Starting with Tamr v2020.016, Tamr automatically assigns primary keys for all LOOKUP statements with non-equality join conditions that you add in this version or in subsequent versions. This means that Tamr changes primary keys (tamr_id) for such LOOKUP statements. For information about primary key management with LOOKUP statements, see Lookup.

Note: When primary keys change, all downstream labels, such as pair labels or labels in a categorization project, still refer to records by the previous set of primary keys. The labels no longer link to any existing records, because those records now have new primary keys. This means that all previously assigned labels are lost and you need to reassign them.
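
The effect described in this note can be modeled with a small Python sketch. The identifiers, record values, and category names are invented for illustration, and the dictionary lookup is only a simplified picture of how labels are associated with records through tamr_id.

```python
# Labels are stored against tamr_id values (hypothetical ids and categories).
labels = {"id-001": "Fasteners", "id-002": "Adhesives"}

records_before = {"id-001": "hex bolt M8", "id-002": "epoxy resin"}
# After a transformation change, the same records receive new keys.
records_after = {"id-9a1": "hex bolt M8", "id-7c4": "epoxy resin"}

def attached_labels(labels, records):
    """Return only the labels whose key still matches an existing record."""
    return {key: label for key, label in labels.items() if key in records}

before = attached_labels(labels, records_before)  # both labels attach
after = attached_labels(labels, records_after)    # no label attaches
```

The record contents are identical in both runs, but because the labels are matched by key alone, none of them attach once the keys change.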

See also Upgrading Tamr.

For information about other maintenance utilities that are available for unify-admin.sh, see Utilities for Validation and System-Wide Processes.

Updated 4 months ago
