Managing Primary Keys
When you work with transformations, the uniqueness of the primary key value in the system-generated tamr_id
attribute is essential. For categorization and mastering projects, the tamr_id
must also remain stable over time.
Each time you upload a dataset to a project, you have the option to specify the attribute that contains the primary key. If you do not specify the primary key, Tamr Core generates one and populates it with the row number automatically. See Uploading a Dataset.
When you create the unified dataset for your project, Tamr Core automatically creates the following attributes with a data type of string:
origin_source_name
, populated with the name of the input dataset.origin_entity_id
, populated with the primary key value from the input dataset.tamr_id
, populated with a hash oforigin_source_name
andorigin_entity_id
to create the primary key for the unified dataset.
At the end of your project’s transformations, these attributes must be present with the data type string
. Tamr Core automatically manages the tamr_id
primary key throughout transformations to ensure that it remains a unique string for every row. However, in some cases the result of this management might not align with your goals for transformations that affect training data.
In categorization and mastering projects, Tamr Core uses the tamr_id
to store all user feedback on the categorization or matching and clustering of records.
Important: If your categorization or mastering project includes a transformation that can either explicitly or unintentionally change the value of the
tamr_id
, all prior user feedback, including pair labels and categorization labels, can be lost. The labels no longer link to any existing records that have new primary keys. As a result, when you design the transformations for a categorization or mastering project you may need to take extra steps to ensure the stability oftamr_id
over time; see When to Include a Transformation for tamr_id.
When to Include a Transformation for tamr_id
Transformations can either explicitly or unintentionally modify the system-generated attributes.
-
Linear transformations (transformations that do not combine or add rows in a dataset) do not change the internal attributes unless their values are explicitly modified, such as
upper(origin_source_name) AS origin_source_name
. -
Non-linear transformations modify
tamr_id
when automatic primary key management is enabled (default). Changes to these transformations can result in unintentional changes totamr_id
. Non-linear transformations areEXPLODE
,GROUP BY
,JOIN
,MERGE
,PIVOT
,UNION ALL
, andUNPIVOT
statements. Examples of changes that can unintentionally modifytamr_id
include updates to yourMERGE
orGROUP BY
keys,JOIN
conditions, orEXPLODE
array values, or adding or deleting input dataset rows when anUNPIVOT
is included. See the description for each of these statements for additional detail.If your system administrator disables automatic primary key management, non-linear transformations can result in rows with duplicate values for
tamr_id
. If there are duplicates intamr_id
, the uniqueness requirement of a primary key at the end of transformations is violated and only one data row pertamr_id
is kept in the final output dataset. Automatic primary key management guarantees that no data is lost, but cannot guarantee the stability oftamr_id
. -
The system automatically modifies the primary keys to all
LOOKUP
statements with non-equality join conditions. This means that the system changes primary keys (tamr_id) for suchLOOKUP
statements. For information about primary key management withLOOKUP
statements, see Lookup.
To enforce the stability of tamr_id
in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id
in the last transformation.
To set tamr_id
in a transformation:
- Identify a subset of attributes in the unified dataset that uniquely identify individual records after all transformations.
- In your project’s last transformation, include a hash function to generate a hash code from those attributes:
hash(col1, col2, col3) AS tamr_id
. See hash.
How to Disable Automatic Primary Key Management
The system automatically manages primary keys. You do not need to disable this feature, for any projects, unless you have a specific workflow that requires you to specify your own keys.
You can disable automatic primary key management using the following procedures.
- Manual option. Insert
USE HINT(pkmanagement.manual)
statements to apply a hint to the current transformation in the editor and to all subsequent transformations in that section.
Note: A hint applied to the input dataset transformations section is not automatically carried over to the unified dataset transformations section. See Hint. - Script option. Use the
./unify-admin.sh maintenance --script DisablePKManagement
option to disable primary key assignments after an upgrade. This script automatically adds theHINT
described above to each project in your instance. See Upgrading Tamr Core.
For information about other maintenance utilities that are available forunify-admin.sh
, see Utilities for Validation and System-Wide Processes.
Manually-Managing Tamr IDs
Manually managing Tamr IDs means that we have more control over how Tamr Core links human feedback to records, which can be important when changing sources and transformations over time.
Example Scenario A:
Say you are developing a mastering project with subject matter experts (SME’s), and you have performed several iterations of pair labelling and cluster validation. The SMEs then realize that the dataset needs to be joined with another dataset.
With auto primary key management:
The join transformation causes the Tamr IDs of some or all of the unified dataset to change, losing the link between records and corresponding pair labels and cluster verifications.
With manual primary key management:
Each record’s Tamr ID behaves according to the logic you set. You may set up new logic for generating a Tamr ID for any new records created by the join, whilst any pre-existing records retain the Tamr ID based on the logic you originally set. Tamr Core retains links between original records and feedback.
Example Scenario B:
Say you imported a csv dataset into Tamr Core, and began to train the mastering model. During the project development process it is decided that in production the data should be read directly from the database, rather than extracted from a tool as a csv.
The database data is slightly different to the original csv, and a new source dataset is created in Tamr Core to hold this data. You then map the new dataset into the existing project and remove the old csv.
With auto primary key management:
The csv dataset and the database dataset have different names, and even if the source primary key is consistent, the Tamr IDs for these two datasets are different. When you replace the old csv with the new dataset, because of these different Tamr IDs, links between the records and feedback is lost.
With manual primary key management:
If your Tamr ID is generated based on a field (or fields) that are unique and consistent across the old and new datasets, then, when you swap the csv for the new dataset, the Tamr ID can be managed to remain consistent. This retains links between records and feedback.
Best Practices for Manually-Managing Tamr IDs:
- Set
USE HINT(pkmanagement.manual)
in the very first transformation in both the source and unified transformations, and set the Tamr ID in the very last transformation in the project. - Set the Tamr ID using either the tamr_id() or the hash() functions, for example,
select *, tamr_id(origin_source_name, origin_entity_id) as tamr_id
creates a Tamr ID equivalent to the default. - After you have performed all transformations, ensure that the primary key field(s) used to generate the Tamr ID are unique across all source datasets. For example, if there are two source datasets and both contain a record
ABC001
in the field feeding the Tamr ID, then it should also include a dataset identifier.
Note: If you are completely certain that there will not be duplicate primary keys across sources, then there is no need to include a dataset identifier in the Tamr ID. If however, you are not certain, then include a dataset identifier. - Consider whether any of the transformations will result in additional records in the unified dataset, and choose the primary key field(s) feeding the Tamr ID in line with the final state of the unified dataset.
Updated about 2 years ago