Each time you add a dataset to Tamr you have the option to specify the attribute that contains the primary key. If you do not specify the primary key, Tamr generates one and populates it with the row number automatically. See Uploading a Dataset.
When you create the unified dataset for your project, Tamr automatically creates the following attributes with data type string:
- origin_source_name, which Tamr populates with the name of the input dataset
- origin_entity_id, which Tamr populates with the primary key value from the input dataset
- tamr_id, which Tamr populates with a hash of origin_entity_id to create the primary key for the unified dataset
At the end of your project’s transformations, these attributes must be present with the data type string. Tamr automatically manages the tamr_id primary key throughout transformations to ensure that it remains a unique string for every row. However, in some cases the result of this management might not align with your goals for transformations that affect training data. See below.
In categorization and mastering projects, Tamr uses the tamr_id to store all user feedback on the categorization or matching and clustering of records.
Important: If your categorization or mastering project includes a transformation that can either explicitly or unintentionally change the value of the tamr_id, all prior user feedback can be lost. As a result, when you design the transformations for a categorization or mastering project, you may need to take extra steps to ensure the stability of tamr_id over time. See When to Include a Transformation for tamr_id.
When upgrading from early versions of Tamr, you may need to disable automatic primary key management in order to keep the same primary key values across the upgrade. See When to Disable Automatic Primary Key Management.
Transformations can either explicitly or unintentionally modify the Tamr-generated attributes.
- Linear transformations (transformations that do not combine or add rows in a dataset) do not change the Tamr-generated attributes unless their values are explicitly modified, for example with upper(origin_source_name) AS origin_source_name.
- Non-linear transformations modify tamr_id when automatic primary key management is enabled. Changes to these transformations can result in unintentional changes to tamr_id. Non-linear transformations include statements such as UNION ALL and UNPIVOT. Examples of changes that can unintentionally modify tamr_id include updates to your EXPLODE array values, or adding or deleting input dataset rows when an UNPIVOT is included. See the description for each statement type for additional detail.
If automatic primary key management is disabled, non-linear transformations can result in rows with duplicate values for tamr_id. Duplicate tamr_id values violate the uniqueness requirement of a primary key at the end of transformations, and only one data row per tamr_id is kept in the final output dataset. Automatic primary key management guarantees that no data is lost, but cannot guarantee the stability of tamr_id.
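The data loss described above can be illustrated with a small conceptual model. This is a sketch in plain Python, not Tamr transformation code: the column names phone_home and phone_work, the helper functions, and the "first row wins" rule are all assumptions made for illustration (the source does not say which duplicate row is kept).

```python
# Conceptual model only: plain Python dicts stand in for dataset rows.
rows = [
    {"tamr_id": "a1", "phone_home": "555-0100", "phone_work": "555-0199"},
    {"tamr_id": "b2", "phone_home": "555-0111", "phone_work": None},
]


def unpivot_phones(rows):
    """Emit one output row per non-empty phone value, copying tamr_id unchanged.

    Copying the key unchanged models a non-linear transformation that runs
    with automatic primary key management disabled.
    """
    out = []
    for row in rows:
        for column in ("phone_home", "phone_work"):
            if row[column] is not None:
                out.append({"tamr_id": row["tamr_id"], "phone": row[column]})
    return out


def keep_one_row_per_key(rows):
    """Enforce primary-key uniqueness by keeping a single row per tamr_id."""
    by_key = {}
    for row in rows:
        # Assumption: the first row seen for a key is the one that survives.
        by_key.setdefault(row["tamr_id"], row)
    return list(by_key.values())


unpivoted = unpivot_phones(rows)          # three rows, two of them share "a1"
final = keep_one_row_per_key(unpivoted)   # one row per key, so one row is dropped
```

In this model the work phone for record a1 disappears from the final output, which is the kind of silent loss that automatic primary key management is designed to prevent.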
To enforce the stability of tamr_id in mastering and categorization projects, Tamr recommends as a best practice that you explicitly set the tamr_id in the last transformation.
To set tamr_id in a transformation:
- Identify a subset of attributes in the unified dataset that uniquely identify individual records after all transformations.
- In your project’s last transformation, include a hash function to generate a hash code from those attributes: hash(col1, col2, col3) AS tamr_id. See hash.
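The reason a hash of identifying attributes stays stable can be sketched outside Tamr. The following Python model is an illustration only: it uses SHA-256 as a stand-in because the source does not describe how Tamr's hash function is implemented, and the phone_type column and stable_key helper are hypothetical.

```python
import hashlib


def stable_key(row, key_columns):
    """Hash the chosen attribute values into a string key.

    SHA-256 is an assumption for this sketch; it is not Tamr's hash function.
    """
    # Join with a separator unlikely to appear in values, so that
    # ("ab", "c") and ("a", "bc") do not collide.
    joined = "\x1f".join(str(row[column]) for column in key_columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Attributes assumed to uniquely identify a record after all transformations.
KEY_COLUMNS = ("origin_source_name", "origin_entity_id", "phone_type")

rows = [
    {"origin_source_name": "crm", "origin_entity_id": "17", "phone_type": "home"},
    {"origin_source_name": "crm", "origin_entity_id": "17", "phone_type": "work"},
]
before = {stable_key(row, KEY_COLUMNS) for row in rows}

# Adding a new input row does not change the keys of the existing rows,
# because each key depends only on that row's own attribute values.
rows.insert(0, {"origin_source_name": "crm", "origin_entity_id": "9", "phone_type": "home"})
after = {stable_key(row, KEY_COLUMNS) for row in rows}
```

Because each key is derived only from the row's own values, inserting, deleting, or reordering other rows leaves existing keys unchanged. The key does change if any of the chosen attribute values changes, so the subset should consist of attributes that are not expected to be edited.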
The automatic primary key management feature was introduced in Tamr v2019.014.1.
- If you started using Tamr after v2019.014.1, Tamr automatically manages primary keys. You do not need to turn this feature off for any project unless you have a specific workflow that requires you to specify your own keys.
- If you created projects before Tamr v2019.014.1, you may want to disable automatic assignment of primary keys for workflow stability between versions, for example to avoid losing your labels after an upgrade. Once disabled, automatic assignment of primary keys should remain permanently disabled for these projects.
You can disable automatic primary key management using the following procedures.
- Manual option. Insert USE HINT(pkmanagement.manual) statements to apply a hint to the current transformation in the editor and to all subsequent transformations in that section. Note that a hint applied to the input dataset transformations section is not automatically carried over to the unified dataset transformations section. See Hint.
- Script option. Use the ./unify-admin.sh maintenance --script DisablePKManagement option to disable primary key assignments after an upgrade. This script automatically adds the HINT described above to each project in your instance.
In previous releases, the following changes were made to primary key (PK) management:
- Starting with v2019.014.1, Tamr automatically assigns unique primary keys (tamr_id) if you have not assigned a tamr_id manually to your records. After you upgrade to this version, automatic primary key management is turned on. This changes primary keys (tamr_id values), which can cause pair labels to no longer be marked as verified. You may need to re-run pair label verification for such records. If this is an issue, adding use hint(pkmanagement.manual); in front of all of your transformations restores your previous primary keys. Running the DisablePKManagement maintenance script described in this section automatically disables automatic assignment of primary keys for all your projects.
- Starting with Tamr v2019.023, Tamr provides a script that can apply the use hint(pkmanagement.manual); statement to each transformation in each project. The script disables automatic management of primary keys and allows you to manually update your transformations after you upgrade from Tamr v2019.014.1 or later to this version or any subsequent version.
- Starting with v2020.007, Tamr added the command ./unify-admin.sh maintenance --script DisablePKManagement. It disables primary key management for all unified datasets where transformations are used.
- Starting with Tamr v2020.016, Tamr automatically assigns primary keys to all LOOKUP statements with non-equality join conditions that you add in this version or in subsequent versions. This means that Tamr changes primary keys (tamr_id) for such LOOKUP statements. For information about primary key management with LOOKUP statements, see Lookup.
Note: When primary keys change, all downstream labels, such as pair labels or labels in a categorization project, continue to refer to the previous set of primary keys. The labels no longer link to any existing records, because those records now have new primary keys. This means that all previously assigned labels are lost and you need to re-assign them.
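The effect in the note above can be modeled in a few lines. This is a conceptual Python sketch, not a description of how Tamr stores feedback internally: the key values, category names, and the labeled and orphaned helpers are all made up for illustration.

```python
# Feedback is modeled as a mapping from tamr_id to a label.
labels = {"k-001": "Office Supplies", "k-002": "Hardware"}

records_before = [
    {"tamr_id": "k-001", "name": "stapler"},
    {"tamr_id": "k-002", "name": "hammer"},
]

# The same records after a change that regenerated their primary keys.
records_after = [
    {"tamr_id": "n-7f3", "name": "stapler"},
    {"tamr_id": "n-9c1", "name": "hammer"},
]


def labeled(records, labels):
    """Return the records whose tamr_id still has a label attached."""
    return [record for record in records if record["tamr_id"] in labels]


def orphaned(records, labels):
    """Return label keys that no longer match any existing record."""
    current = {record["tamr_id"] for record in records}
    return sorted(key for key in labels if key not in current)
```

With records_before, both records resolve to their labels. With records_after, labeled returns an empty list and orphaned returns both old keys, even though the underlying data is unchanged. Comparing key sets in this way before and after an upgrade is one possible check for whether labels would be affected.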
See also Upgrading Tamr.
For information about other maintenance utilities that are available for unify-admin.sh, see Utilities for Validation and System-Wide Processes.