Managing Primary Keys

When you work with transformations, the uniqueness of the primary key value in the system-generated tamr_id attribute is essential. For categorization and mastering projects, the tamr_id must also remain stable over time.

Each time you upload a dataset to a project, you have the option to specify the attribute that contains the primary key. If you do not specify the primary key, Tamr Core generates one and populates it with the row number automatically. See Uploading a Dataset.

When you create the unified dataset for your project, Tamr Core automatically creates the following attributes with a data type of string:

  • origin_source_name, populated with the name of the input dataset.
  • origin_entity_id, populated with the primary key value from the input dataset.
  • tamr_id, populated with a hash of origin_source_name and origin_entity_id to create the primary key for the unified dataset.
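
As a rough illustration of how these three attributes relate, the following Python sketch derives them for a single input row. The function name derive_system_attributes, the sample dataset names, and the use of SHA-256 are assumptions made for illustration; the hash function that Tamr Core actually uses is not described here.

```python
import hashlib


def derive_system_attributes(source_name: str, entity_id: str) -> dict:
    """Return the three system attributes for one input row (illustrative only)."""
    # Assumption: SHA-256 over a delimited pair stands in for Tamr Core's own hash.
    payload = f"{source_name}\x1f{entity_id}".encode("utf-8")
    return {
        "origin_source_name": source_name,
        "origin_entity_id": str(entity_id),
        "tamr_id": hashlib.sha256(payload).hexdigest(),
    }


# The same primary key value from two different input datasets
# produces two different tamr_id values.
row_a = derive_system_attributes("customers.csv", "42")
row_b = derive_system_attributes("vendors.csv", "42")
```

Because the key depends only on the source name and the source primary key, re-deriving it for the same input row gives the same string every time, which is the stability property that the rest of this page relies on.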

At the end of your project’s transformations, these attributes must be present with the data type string. Tamr Core automatically manages the tamr_id primary key throughout transformations to ensure that it remains a unique string for every row. However, in some cases the result of this management might not align with your goals for transformations that affect training data.

In categorization and mastering projects, Tamr Core uses the tamr_id to store all user feedback on the categorization or matching and clustering of records.

Important: If your categorization or mastering project includes a transformation that changes the value of tamr_id, whether explicitly or unintentionally, all prior user feedback, including pair labels and categorization labels, can be lost. The labels remain linked to the old primary keys and no longer match any existing records, which now have new primary keys. As a result, when you design the transformations for a categorization or mastering project, you may need to take extra steps to ensure the stability of tamr_id over time; see When to Include a Transformation for tamr_id.
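
The following Python sketch shows, under simplified assumptions, why a changed key orphans feedback. The labels dictionary, the attach_labels function, and the sample key values are hypothetical and do not reflect how Tamr Core stores feedback internally; they only model labels that are looked up by tamr_id.

```python
# Hypothetical feedback store: categorization labels keyed by tamr_id.
labels = {"id-001": "Hardware", "id-002": "Software"}


def attach_labels(records, labels):
    """Return (labeled_records, orphaned_label_ids) for a list of records."""
    record_ids = {record["tamr_id"] for record in records}
    labeled = [
        dict(record, label=labels[record["tamr_id"]])
        for record in records
        if record["tamr_id"] in labels
    ]
    # Labels whose key no longer matches any record are effectively lost.
    orphaned = sorted(set(labels) - record_ids)
    return labeled, orphaned


before = [
    {"tamr_id": "id-001", "name": "bolt"},
    {"tamr_id": "id-002", "name": "ide"},
]
# The same two rows after a transformation changed the key of the second row.
after = [
    {"tamr_id": "id-001", "name": "bolt"},
    {"tamr_id": "id-9f3", "name": "ide"},
]
```

Calling attach_labels(before, labels) matches both labels, while attach_labels(after, labels) matches only the first record and reports "id-002" as orphaned, even though the underlying row still exists.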

When to Include a Transformation for tamr_id

Transformations can either explicitly or unintentionally modify the system-generated attributes.

  • Linear transformations (transformations that do not combine or add rows in a dataset) do not change the internal attributes unless their values are explicitly modified, such as upper(origin_source_name) AS origin_source_name.

  • Non-linear transformations modify tamr_id when automatic primary key management is enabled (default). Changes to these transformations can result in unintentional changes to tamr_id. Non-linear transformations are EXPLODE, GROUP BY, JOIN, MERGE, PIVOT, UNION ALL, and UNPIVOT statements. Examples of changes that can unintentionally modify tamr_id include updates to your MERGE or GROUP BY keys, JOIN conditions, or EXPLODE array values, or adding or deleting input dataset rows when an UNPIVOT is included. See the description for each of these statements for additional detail.

    If your system administrator disables automatic primary key management, non-linear transformations can result in rows with duplicate values for tamr_id. If there are duplicates in tamr_id, the uniqueness requirement of a primary key at the end of transformations is violated and only one data row per tamr_id is kept in the final output dataset. Automatic primary key management guarantees that no data is lost, but cannot guarantee the stability of tamr_id.

  • For LOOKUP statements with non-equality join conditions, the system automatically changes the primary keys (tamr_id). For information about primary key management with LOOKUP statements, see Lookup.
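
The trade-off described for non-linear transformations can be sketched in Python. This is a simplified model, not Tamr Core's implementation: the functions explode, keep_one_per_key, and rekey_by_position are hypothetical, the choice of which duplicate row survives is an assumption, and the position-based key is only one possible way to make keys unique.

```python
def explode(rows, array_field):
    """Emit one row per array element, copying the parent's tamr_id unchanged."""
    out = []
    for row in rows:
        for position, value in enumerate(row[array_field]):
            new_row = dict(row)
            new_row[array_field] = value
            new_row["_position"] = position
            out.append(new_row)
    return out


def keep_one_per_key(rows):
    """Keep a single row per tamr_id (assumption: the first row seen wins)."""
    kept = {}
    for row in rows:
        kept.setdefault(row["tamr_id"], row)
    return list(kept.values())


def rekey_by_position(rows):
    """Make keys unique by appending the element position (illustrative scheme)."""
    return [dict(row, tamr_id=f"{row['tamr_id']}:{row['_position']}") for row in rows]


source = [
    {"tamr_id": "a1", "phones": ["555-0100", "555-0101"]},
    {"tamr_id": "b2", "phones": ["555-0199"]},
]
exploded = explode(source, "phones")
```

Here exploded contains three rows, but keep_one_per_key(exploded) returns only two because two rows share the key "a1". Applying rekey_by_position first preserves all three rows. The cost is that the new keys depend on the array contents: if another phone number is later inserted at the start of the first array, the key for "555-0100" shifts from "a1:0" to "a1:1".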

To enforce the stability of tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id in the last transformation.

To set tamr_id in a transformation:

  1. Identify a subset of attributes in the unified dataset that uniquely identify individual records after all transformations.
  2. In your project’s last transformation, include a hash function to generate a hash code from those attributes: hash(col1, col2, col3) AS tamr_id. See hash.
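
Before committing to a set of attributes in step 1, it can help to confirm on a data sample that they really are unique. The Python sketch below is one way to run that check outside Tamr Core; the function names duplicate_keys and stable_key, the phone column, and the SHA-256 hash are assumptions for illustration and are not the behavior of the hash transformation function itself.

```python
import hashlib
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur on more than one row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def stable_key(row, key_columns):
    """Hash the chosen attribute values into a single string key."""
    # Assumption: SHA-256 over delimited values stands in for hash(col1, col2, ...).
    parts = [str(row[c]) for c in key_columns]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


rows = [
    {"origin_source_name": "customers.csv", "origin_entity_id": "42", "phone": "555-0100"},
    {"origin_source_name": "customers.csv", "origin_entity_id": "42", "phone": "555-0101"},
    {"origin_source_name": "vendors.csv", "origin_entity_id": "42", "phone": "555-0199"},
]

too_few = ["origin_source_name", "origin_entity_id"]
enough = ["origin_source_name", "origin_entity_id", "phone"]
```

In this sample, duplicate_keys(rows, too_few) reports one repeated key, so those two attributes alone cannot serve as the basis for tamr_id after the rows were split. duplicate_keys(rows, enough) returns an empty list, and stable_key(row, enough) then yields a distinct value per row that depends only on the attribute values, so it stays the same as long as those values do.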

How to Disable Automatic Primary Key Management

The system automatically manages primary keys. You do not need to disable this feature for any project unless you have a specific workflow that requires you to specify your own keys.

You can disable automatic primary key management using either of the following options.

  • Manual option. Insert USE HINT(pkmanagement.manual) statements to apply a hint to the current transformation in the editor and to all subsequent transformations in that section.

    Note: A hint applied to the input dataset transformations section is not automatically carried over to the unified dataset transformations section. See Hint.

  • Script option. Use the ./ maintenance --script DisablePKManagement option to disable primary key assignments after an upgrade. This script automatically adds the HINT described above to each project in your instance. See Upgrading Tamr Core.

    For information about other available maintenance utilities, see Utilities for Validation and System-Wide Processes.
