Managing Primary Keys
The primary key of a dataset is a minimal set of attributes that uniquely identifies a record. When you upload a source dataset to Tamr, Tamr uses as the primary key the attribute that uniquely identifies the records in that dataset. Once loaded, the dataset becomes an input dataset. If no such attribute exists, Tamr adds an attribute and populates it with generated values that are guaranteed to be unique. See Uploading a Dataset.
All non-input datasets in Tamr are known as derived datasets. Derived datasets also have a primary key attribute.
Unified datasets in a Mastering or Categorization project may be derived from one or more input datasets. In such cases, Tamr automatically generates the additional column tamr_id to uniquely identify the records of the unified dataset.
When you work with transformations to manipulate the input of derived datasets, the tamr_id column can be created and populated with values automatically or manually:
- When Tamr adds primary keys automatically, records are always uniquely identified. This is convenient when working with transformations such as PIVOT, GROUP BY, and JOIN, which change the uniqueness of records.
- When you add primary keys manually, records are also always uniquely identified. In addition, you can take into account the stability of the record values over time.
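To make the two options concrete, the following Python sketch (not Tamr transformation syntax) shows why a join changes the uniqueness of records and how a manually derived key can stay stable across runs. The record layout, the `join_records` helper, and the `stable_key` hashing scheme are illustrative assumptions and do not describe how Tamr generates tamr_id internally.

```python
import hashlib

# Hypothetical input records, each with a unique source key "id".
customers = [{"id": "c1", "name": "Acme"}, {"id": "c2", "name": "Globex"}]
orders = [
    {"order_id": "o1", "customer_id": "c1"},
    {"order_id": "o2", "customer_id": "c1"},
    {"order_id": "o3", "customer_id": "c2"},
]


def join_records(left, right):
    """Inner join on customer id; one customer can yield several rows."""
    return [
        {**c, **o}
        for c in left
        for o in right
        if o["customer_id"] == c["id"]
    ]


def stable_key(*parts):
    """Derive a key from values that identify the row, not from its position."""
    joined_parts = "\x1f".join(parts)  # unit separator avoids ambiguous concatenation
    return hashlib.sha256(joined_parts.encode("utf-8")).hexdigest()[:16]


joined = join_records(customers, orders)

# After the join, the customer "id" repeats, so it no longer identifies a row.
ids = [row["id"] for row in joined]
print(len(ids), len(set(ids)))

# A key built from both source keys is unique and does not depend on row order.
keys = [stable_key(row["id"], row["order_id"]) for row in joined]
print(len(set(keys)) == len(joined))
```

Because the key in this sketch depends only on the source identifiers, re-running the join with the inputs in a different order yields the same set of keys. That is the kind of stability the manual option lets you plan for.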
Important: User feedback is linked to the primary keys (tamr_id). All types of user feedback, including record categorizations, record pair labels, record locks, and record comments, are linked to the tamr_id of the unified dataset of the project. If you add or change transformations that change the value of the primary key (tamr_id), user feedback, such as labels, is lost.
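The effect of a key change on feedback can be sketched in Python. This is an illustration only: the dictionaries, the sample tamr_id values, and the `split_feedback` helper are hypothetical and do not reflect how Tamr stores feedback.

```python
def split_feedback(feedback, current_ids):
    """Separate feedback that still matches a record from feedback that does not."""
    kept = {k: v for k, v in feedback.items() if k in current_ids}
    orphaned = {k: v for k, v in feedback.items() if k not in current_ids}
    return kept, orphaned


# Hypothetical category labels stored against tamr_id values.
feedback = {"101": "Hardware", "102": "Software", "103": "Services"}

# Keys before a transformation change: every label finds its record.
ids_before = {"101", "102", "103"}
# Keys after a change that regenerates two of the three tamr_id values.
ids_after = {"101", "9f2a", "77bc"}

kept_before, orphaned_before = split_feedback(feedback, ids_before)
kept_after, orphaned_after = split_feedback(feedback, ids_after)

print(len(kept_before), len(orphaned_before))
print(len(kept_after), sorted(orphaned_after))
```

In this sketch the records themselves are unchanged, yet two of the three labels no longer resolve to any record, because the lookup is by key rather than by record content.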
When to Disable Automatic Primary Key Management
If you started using Tamr after v2019.014.1, Tamr manages primary keys automatically and you do not need to turn this feature off for any project, unless a specific workflow requires you to specify your own keys.
If you created projects before Tamr v2019.014.1, you may want to temporarily disable automatic assignment of primary keys for workflow stability between versions. For example, this can be useful if you do not want to lose your labels after an upgrade. After you upgrade, you can re-enable automatic assignment of primary keys.
In some cases, you may also want to always create primary keys manually. In this case, you can disable automatic management of keys using the following procedure.
To disable automatic primary key management, you can add USE HINT statements to your transformations manually or with a script.
- Manual option. Use the USE HINT(pkmanagement.manual); statement to apply a hint to the current transformation in the editor and to all of the subsequent transformations in that project. See Labels, Hints, and Scope.
- Script option. Use the ./unify-admin.sh maintenance --script DisablePKManagement command to disable primary key assignments after an upgrade.
To manually manage primary keys (tamr_id) in transformations:
- Use transformations to directly populate the attribute tamr_id, and use one of these methods:
   a. Use the statements USE and HINT, and specify pkmanagement.manual. For example, to disable automatic primary key management by Tamr in a particular project, add USE HINT(pkmanagement.manual); in the first transformation, or
   b. Use a command in the unify-admin.sh utility, ./unify-admin.sh maintenance --script DisablePKManagement. It disables primary key management for all unified datasets where transformations are used.
Summary of Changes to Primary Key Management
In previous releases, the following changes were made to primary key (PK) management:
- Starting with v2019.014.1, Tamr automatically assigns unique primary keys (tamr_id) if you have not assigned a tamr_id manually to your records. After you upgrade to this version, automatic primary key management is turned on. This changes primary keys (tamr_id values), which may cause pair labels to no longer be marked as verified. You may need to re-run pair label verification for such records. If this is an issue, adding use hint(pkmanagement.manual); in front of all of your transformations restores your previous primary keys. Running the DisablePKManagement maintenance script, as noted later in this section, disables automatic assignment of primary keys for all your projects.
- Starting with Tamr v2019.023, Tamr provides a script that can apply the use hint(pkmanagement.manual); statement to each transformation in each project. This disables automatic management of primary keys and allows you to manually update your transformations after you upgrade from Tamr v2019.014.1 or greater to this version or any subsequent version.
- Starting with v2020.007, Tamr includes a command in the unify-admin.sh utility, ./unify-admin.sh maintenance --script DisablePKManagement. It disables primary key management for all unified datasets where transformations are used.
- Starting with Tamr v2020.016, Tamr automatically assigns primary keys to all LOOKUP statements with non-equality join conditions that you add in this version or in subsequent versions. This means that Tamr changes primary keys (tamr_id) for such LOOKUP statements. For information about primary key management with LOOKUP statements, see Lookup.
Note: When primary keys change, all downstream labels, such as pair labels or labels in a categorization project, still refer to records by the previous set of primary keys. The labels no longer link to any existing records, because those records now have new primary keys. This means that all previously assigned labels are lost and you need to re-assign them.
See also Upgrading Tamr.