Each time you upload a dataset to a project you have the option to specify the attribute that contains the primary key. If you do not specify the primary key, Tamr Core generates one and populates it with the row number automatically. See Uploading a Dataset.
When you create the unified dataset for your project, Tamr Core automatically creates the following attributes with a data type of string:
- origin_source_name, populated with the name of the input dataset.
- origin_entity_id, populated with the primary key value from the input dataset.
- tamr_id, populated with a hash of origin_entity_id to create the primary key for the unified dataset.
At the end of your project’s transformations, these attributes must be present with the data type string. Tamr Core automatically manages the tamr_id primary key throughout transformations to ensure that it remains a unique string for every row. However, in some cases the result of this management might not align with your goals for transformations that affect training data.
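As an illustration only, the following Python sketch models how the three system-generated attributes might be derived for one input row. The helper name unify_row and the use of MD5 through hashlib are assumptions made for this example; the hash function that Tamr Core actually uses is not specified here.

```python
import hashlib

def unify_row(source_name, primary_key):
    """Build the three system-generated string attributes for one input row."""
    origin_entity_id = str(primary_key)
    # Stand-in hash (assumption): Tamr Core's real hash function may differ.
    tamr_id = hashlib.md5(origin_entity_id.encode("utf-8")).hexdigest()
    return {
        "origin_source_name": source_name,
        "origin_entity_id": origin_entity_id,
        "tamr_id": tamr_id,
    }

row = unify_row("customers.csv", 42)
# All three attributes are strings, as required at the end of transformations.
assert all(isinstance(value, str) for value in row.values())
```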
In categorization and mastering projects, Tamr Core uses tamr_id to store all user feedback on the categorization or matching and clustering of records.
Important: If your categorization or mastering project includes a transformation that can either explicitly or unintentionally change the value of tamr_id, all prior user feedback can be lost. As a result, when you design the transformations for a categorization or mastering project, you may need to take extra steps to ensure the stability of tamr_id over time. See When to Include a Transformation for tamr_id.
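The effect of a changed key on stored feedback can be sketched in a few lines of Python. This is a simplified model, not Tamr Core's storage format: the labels dictionary, the sample key values, and the labels_for helper are all hypothetical.

```python
# Feedback is stored against tamr_id (modeled here as a plain dict).
labels = {"id-001": "Hardware", "id-002": "Software"}

def labels_for(records, labels):
    """Return the label attached to each record, or None if none matches."""
    return {r["tamr_id"]: labels.get(r["tamr_id"]) for r in records}

before = [{"tamr_id": "id-001"}, {"tamr_id": "id-002"}]
# A transformation that rewrites tamr_id, here by upper-casing it.
after = [{"tamr_id": r["tamr_id"].upper()} for r in before]

matched_before = labels_for(before, labels)
matched_after = labels_for(after, labels)
```

In this model every record in matched_after maps to None: the labels still exist, but no current record carries the key they were stored under.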
When upgrading from early versions of Tamr Core, you may need to disable automatic primary key management in order to keep the same primary key values across the upgrade. See When to Disable Automatic Primary Key Management.
Transformations can either explicitly or unintentionally modify the system-generated attributes.
- Linear transformations (transformations that do not combine or add rows in a dataset) do not change the system-generated attributes unless their values are explicitly modified, such as with upper(origin_source_name) AS origin_source_name.
- Non-linear transformations modify tamr_id when automatic primary key management is enabled. Changes to these transformations can result in unintentional changes to tamr_id. Non-linear transformations are performed by statements such as EXPLODE, UNION ALL, and UNPIVOT. Examples of changes that can unintentionally modify tamr_id include updates to your EXPLODE array values, or adding or deleting input dataset rows when an UNPIVOT is included. See the description for each of these statements for additional detail.
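To see why a change to exploded array values can alter keys, consider the following Python sketch of an explode-style transformation. How Tamr Core derives new key values is not described here; the scheme below, which hashes the parent key together with the exploded value, is purely hypothetical, as are the function names and sample data.

```python
import hashlib

def derive_id(parent_id, value):
    # Hypothetical scheme: new key = hash of the parent key plus the exploded value.
    return hashlib.sha256(f"{parent_id}|{value}".encode("utf-8")).hexdigest()[:16]

def explode(rows, array_column):
    """Emit one output row per array element, each with a derived tamr_id."""
    out = []
    for row in rows:
        for value in row[array_column]:
            new_row = dict(row)
            new_row[array_column] = value
            new_row["tamr_id"] = derive_id(row["tamr_id"], value)
            out.append(new_row)
    return out

original = [{"tamr_id": "a1", "phones": ["555-0100", "555-0101"]}]
edited = [{"tamr_id": "a1", "phones": ["555-0100", "555-0199"]}]

ids_original = [r["tamr_id"] for r in explode(original, "phones")]
ids_edited = [r["tamr_id"] for r in explode(edited, "phones")]
```

Under this assumed scheme, editing one array value changes the key of the affected output row only, so any feedback stored against the old key would no longer match a record.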
If your system administrator disables automatic primary key management, non-linear transformations can result in rows with duplicate values for tamr_id. If there are duplicates in tamr_id, the uniqueness requirement of a primary key at the end of transformations is violated and only one data row per tamr_id is kept in the final output dataset. Automatic primary key management guarantees that no data is lost, but cannot guarantee the stability of tamr_id.
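The data loss described above can be modeled with a short Python sketch. The function names are hypothetical, and the choice to keep the first row for each key is an assumption of this example; the text does not say which row is retained.

```python
def explode_keep_id(rows, array_column):
    """Explode an array column without reassigning tamr_id (management disabled)."""
    return [
        {**row, array_column: value}
        for row in rows
        for value in row[array_column]
    ]

def enforce_primary_key(rows):
    """Keep one row per tamr_id. Keeping the first is an assumption of this sketch."""
    kept = {}
    for row in rows:
        kept.setdefault(row["tamr_id"], row)
    return list(kept.values())

rows = [{"tamr_id": "a1", "phones": ["555-0100", "555-0101"]}]
exploded = explode_keep_id(rows, "phones")
final = enforce_primary_key(exploded)
```

Here exploded holds two rows that share the key a1, and final holds only one of them.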
To enforce the stability of tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id in the last transformation.
To set tamr_id in a transformation:
- Identify a subset of attributes in the unified dataset that uniquely identify individual records after all transformations.
- In your project’s last transformation, include a hash function to generate a hash code from those attributes: hash(col1, col2, col3) AS tamr_id. See hash.
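The two steps above can be sketched in Python as a way to check a candidate set of key attributes before relying on it. This is not the Tamr hash function: SHA-256, the separator character, the helper names, and the sample rows are assumptions chosen for illustration.

```python
import hashlib

def stable_tamr_id(row, key_columns):
    """Hash a fixed, ordered subset of attributes into a string key."""
    # Unit separator avoids ("ab", "c") colliding with ("a", "bc").
    joined = "\x1f".join(str(row[c]) for c in key_columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def assign_ids(rows, key_columns):
    """Set tamr_id on every row and fail if the chosen columns are not unique."""
    out = [{**row, "tamr_id": stable_tamr_id(row, key_columns)} for row in rows]
    ids = [r["tamr_id"] for r in out]
    if len(set(ids)) != len(ids):
        raise ValueError("key columns do not uniquely identify every record")
    return out

rows = [
    {"col1": "acme", "col2": "NY", "col3": "10001", "notes": "first load"},
    {"col1": "acme", "col2": "CA", "col3": "94105", "notes": "first load"},
]
keyed = assign_ids(rows, ["col1", "col2", "col3"])
```

Because the key depends only on col1, col2, and col3, changes to other attributes such as notes leave tamr_id unchanged, while a set of key columns that fails to distinguish every record is rejected.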
Tamr introduced the automatic primary key management feature in v2019.014.1.
- If you started using Tamr Core after v2019.014.1, the system automatically manages primary keys and you don't need to turn this feature off for any projects, unless you have a specific workflow that requires you to specify your own keys.
- If you created projects before v2019.014.1, then you may want to temporarily disable automatic assignment of primary keys for workflow stability between versions. For example, this might be useful if you don't want to lose your labels after an upgrade. Automatic assignment of primary keys should remain permanently disabled for these projects.
You can disable automatic primary key management using the following procedures.
- Manual option. Insert USE HINT(pkmanagement.manual) statements to apply a hint to the current transformation in the editor and to all subsequent transformations in that section.
Note: A hint applied to the input dataset transformations section is not automatically carried over to the unified dataset transformations section. See Hint.
- Script option. Run ./unify-admin.sh maintenance --script DisablePKManagement to disable primary key assignments after an upgrade. This script automatically adds the HINT described above to each project in your instance.
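The scope rule for the manual option (a hint affects the transformation that contains it and every later transformation in the same section, but not other sections) can be modeled as follows. This Python sketch only illustrates the rule as stated above; the manual_pk_flags helper is hypothetical, and the "SELECT *;" strings are placeholders rather than verified transformation syntax.

```python
import re

# Matches the hint statement from the text, ignoring case and spacing.
HINT = re.compile(r"use\s+hint\s*\(\s*pkmanagement\.manual\s*\)", re.IGNORECASE)

def manual_pk_flags(section):
    """For one section, flag each transformation that runs with the hint in effect."""
    flags, active = [], False
    for transformation in section:
        # Once a hint appears, it applies from that transformation onward.
        active = active or bool(HINT.search(transformation))
        flags.append(active)
    return flags

input_section = [
    "SELECT *;",
    "USE HINT(pkmanagement.manual); SELECT *;",
    "SELECT *;",
]
unified_section = ["SELECT *;"]

# Each section is evaluated separately, so the hint does not carry over.
input_flags = manual_pk_flags(input_section)
unified_flags = manual_pk_flags(unified_section)
```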
In previous releases, the following changes were made to primary key (PK) management:
- Starting with v2019.023, Tamr provides a script that can apply the use hint(pkmanagement.manual); statement to each transformation in each project. This disables automatic management of primary keys and allows you to manually update your transformations after you upgrade from v2019.014.1 or greater to this version or any subsequent version.
- Starting with v2020.007, Tamr added the ./unify-admin.sh maintenance --script DisablePKManagement command. It disables primary key management for all unified datasets where transformations are used.
- Starting with v2020.016, the system automatically assigns primary keys to all LOOKUP statements with non-equality join conditions that you add in this version or in subsequent versions. This means that the system changes primary keys (tamr_id) for such LOOKUP statements. For information about primary key management with LOOKUP statements, see Lookup.
Note: When primary keys change, all downstream labels, such as pair labels or labels in a categorization project, continue to refer to records by the previous set of primary keys. Because the existing records now have new primary keys, the labels no longer link to any of them. This means that all previously assigned labels are lost and you need to re-assign them.
See also Upgrading Tamr Core.
For information about other maintenance utilities that are available for unify-admin.sh, see Utilities for Validation and System-Wide Processes.