Managing Primary Keys
When you work with transformations, the uniqueness of the primary key value in the system-generated tamr_id attribute is essential. For categorization and mastering projects, the tamr_id must also remain stable over time.
Each time you upload a dataset to a project, you have the option to specify the attribute that contains the primary key. If you do not specify the primary key, Tamr Core generates one and populates it with the row number automatically. See Uploading a Dataset.
When you create the unified dataset for your project, Tamr Core automatically creates the following attributes with a data type of string:
- origin_source_name, populated with the name of the input dataset.
- origin_entity_id, populated with the primary key value from the input dataset.
- tamr_id, populated with a hash of origin_source_name and origin_entity_id to create the primary key for the unified dataset.
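The relationship between these three attributes can be sketched in Python. This is an illustration only: the function names, the sample dataset, and the use of SHA-256 are assumptions, and Tamr Core's actual hash function is not described here.

```python
import hashlib

def make_tamr_id(origin_source_name: str, origin_entity_id: str) -> str:
    """Hypothetical stand-in for Tamr Core's internal hash; returns a string key."""
    # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
    payload = "".join(
        f"{len(part)}:{part}" for part in (origin_source_name, origin_entity_id)
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def to_unified(source_name, rows, primary_key):
    """Add the three system attributes (all strings) to each input row."""
    unified = []
    for row in rows:
        entity_id = str(row[primary_key])
        unified.append({
            **row,
            "origin_source_name": source_name,
            "origin_entity_id": entity_id,
            "tamr_id": make_tamr_id(source_name, entity_id),
        })
    return unified

# Hypothetical input dataset whose primary key is the "id" column.
customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
unified = to_unified("customers.csv", customers, primary_key="id")
```

Because the key depends only on the source name and the source primary key, the same input row yields the same tamr_id on every load, as long as neither value changes.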
At the end of your project’s transformations, these attributes must be present with the data type string. Tamr Core automatically manages the tamr_id primary key throughout transformations to ensure that it remains a unique string for every row. However, in some cases the result of this management might not align with your goals for transformations that affect training data.
In categorization and mastering projects, Tamr Core uses the tamr_id to store all user feedback on the categorization or matching and clustering of records.
Important: If your categorization or mastering project includes a transformation that can either explicitly or unintentionally change the value of the tamr_id, all prior user feedback, including pair labels and categorization labels, can be lost. The labels no longer link to any existing records, because those records now have new primary keys. As a result, when you design the transformations for a categorization or mastering project, you may need to take extra steps to ensure the stability of tamr_id over time; see When to Include a Transformation for tamr_id.
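A minimal Python sketch of why feedback is lost when keys change. The label store, record layout, and key values are hypothetical and do not reflect how Tamr Core stores feedback internally; the point is only that labels keyed by tamr_id cannot find records whose keys were regenerated.

```python
def linked_labels(labels, records):
    """Return the labels whose tamr_id still matches a record."""
    current_ids = {record["tamr_id"] for record in records}
    return {
        tamr_id: label
        for tamr_id, label in labels.items()
        if tamr_id in current_ids
    }

# Hypothetical categorization feedback, keyed by tamr_id.
labels = {"a1": "Fasteners", "b2": "Adhesives"}

records_before = [
    {"tamr_id": "a1", "description": "hex bolt"},
    {"tamr_id": "b2", "description": "epoxy resin"},
]
# The same rows after a transformation change that regenerated the keys.
records_after = [
    {"tamr_id": "x9", "description": "hex bolt"},
    {"tamr_id": "y8", "description": "epoxy resin"},
]

kept_before = linked_labels(labels, records_before)
kept_after = linked_labels(labels, records_after)
print(len(kept_before), len(kept_after))  # prints "2 0"
```

The record content is identical in both versions, yet no label survives, because the lookup is by key rather than by content.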
When to Include a Transformation for tamr_id
Transformations can either explicitly or unintentionally modify the system-generated attributes.
- Linear transformations (transformations that do not combine or add rows in a dataset) do not change the internal attributes unless their values are explicitly modified, such as with upper(origin_source_name) AS origin_source_name.
- Non-linear transformations modify tamr_id when automatic primary key management is enabled (default). Changes to these transformations can result in unintentional changes to tamr_id. Non-linear transformations are EXPLODE, GROUP BY, JOIN, MERGE, PIVOT, UNION ALL, and UNPIVOT statements. Examples of changes that can unintentionally modify tamr_id include updates to your MERGE or GROUP BY keys, JOIN conditions, or EXPLODE array values, or adding or deleting input dataset rows when an UNPIVOT is included. See the description for each of these statements for additional detail.
  If your system administrator disables automatic primary key management, non-linear transformations can result in rows with duplicate values for tamr_id. If there are duplicates in tamr_id, the uniqueness requirement of a primary key at the end of transformations is violated and only one data row per tamr_id is kept in the final output dataset. Automatic primary key management guarantees that no data is lost, but cannot guarantee the stability of tamr_id.
- The system automatically modifies the primary keys (tamr_id) for all LOOKUP statements with non-equality join conditions. For information about primary key management with LOOKUP statements, see Lookup.
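The duplicate-key case with automatic primary key management disabled can be sketched in Python. The helper names and sample row are hypothetical, and which duplicate row is retained is an assumption (the first one here); the text above states only that one row per tamr_id is kept.

```python
def explode(rows, array_column):
    """Emit one row per array element; each output row copies the parent's tamr_id."""
    out = []
    for row in rows:
        for value in row[array_column]:
            out.append({**row, array_column: value})
    return out

def keep_one_per_key(rows):
    """Keep a single row per tamr_id (assumption: the first row encountered)."""
    kept = {}
    for row in rows:
        kept.setdefault(row["tamr_id"], row)
    return list(kept.values())

rows = [{"tamr_id": "a1", "phones": ["555-0100", "555-0101"]}]
exploded = explode(rows, "phones")   # two rows, both with tamr_id "a1"
final = keep_one_per_key(exploded)   # one row remains; the other is dropped
```

With automatic management enabled, the exploded rows would instead receive distinct keys, so no row is dropped, but those keys can change when the array values change.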
To enforce the stability of tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id in the last transformation.
To set tamr_id in a transformation:
- Identify a subset of attributes in the unified dataset that uniquely identify individual records after all transformations.
- In your project’s last transformation, include a hash function to generate a hash code from those attributes: hash(col1, col2, col3) AS tamr_id. See hash.
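The effect of this procedure can be illustrated with a short Python sketch. The column names, sample rows, and SHA-256 stand-in for the hash function are assumptions made for illustration; in a project, the key is set with the hash expression shown in the step above.

```python
import hashlib

def stable_id(row, key_columns):
    """Hypothetical stand-in for hash(col1, col2, col3) AS tamr_id."""
    # Length-prefix each value so adjacent columns cannot blur together.
    payload = "".join(f"{len(str(row[c]))}:{row[c]}" for c in key_columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

day1 = [
    {"supplier": "Acme", "part_no": "B-100", "site": "Boston"},
    {"supplier": "Globex", "part_no": "G-7", "site": "Austin"},
]
# Next load: a new row arrives first, so every row number shifts by one.
day2 = [{"supplier": "Initech", "part_no": "I-1", "site": "Dallas"}] + day1

key_columns = ["supplier", "part_no", "site"]
ids_day1 = {stable_id(r, key_columns) for r in day1}
ids_day2 = {stable_id(r, key_columns) for r in day2}

# Row-number keys for comparison: "0" is Acme on day 1 but Initech on day 2.
row_number_day1 = {str(i): r["supplier"] for i, r in enumerate(day1)}
row_number_day2 = {str(i): r["supplier"] for i, r in enumerate(day2)}
```

Every key from the first load is still present after the second load, so feedback attached to those keys would still resolve. This holds only if the chosen attributes uniquely identify each record and are not themselves changed by later edits to the transformations.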
How to Disable Automatic Primary Key Management
The system automatically manages primary keys. You do not need to disable this feature for any project unless you have a specific workflow that requires you to specify your own keys.
You can disable automatic primary key management using either of the following options.
- Manual option. Insert USE HINT(pkmanagement.manual) statements to apply a hint to the current transformation in the editor and to all subsequent transformations in that section.
  Note: A hint applied to the input dataset transformations section is not automatically carried over to the unified dataset transformations section. See Hint.
- Script option. Use the ./unify-admin.sh maintenance --script DisablePKManagement option to disable primary key assignments after an upgrade. This script automatically adds the HINT described above to each project in your instance. See Upgrading Tamr Core.
  For information about other maintenance utilities that are available for unify-admin.sh, see Utilities for Validation and System-Wide Processes.