Understanding Primary Keys
What is a Primary Key?
A primary key is a single field, or a combination of fields, that uniquely identifies a record in a dataset.
Primary keys are unique and stable over time:
- Unique: each primary key appears only once in the dataset.
- Stable: the key for a given record does not arbitrarily change over time.
Tamr recommends that the primary key be meaningful to the data, because this reduces the likelihood of breaking changes upstream.
For example, if the source system has a designated primary key, it may be best to use that key as the primary key, rather than another unique key.
Why is the Tamr ID Important?
- The Tamr ID acts as the primary key for the unified dataset. This means that every record must have a unique Tamr ID, and if two records somehow end up with the same Tamr ID, the unified dataset retains only one of them.
- All human feedback is linked either directly to a record’s Tamr ID or to something derived from the Tamr ID. This means that if users apply feedback to a record, and that record’s Tamr ID later changes, the feedback is no longer associated with that record.
When is the Primary Key of a Dataset Important?
Tamr Core uses primary keys to identify records, keep track of cluster verification actions, and map golden records or categorizations to source records.
Users interact with source datasets and output datasets differently. For example, you can directly change the content of source datasets via the API, which is not possible for output datasets. When deleting, adding, or changing records in a source dataset, you identify each record by its primary key. Therefore, if you are using an incremental workflow in which the contents of your source dataset may change over time (delta processing), you need stable, ideally meaningful, primary keys.
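To illustrate why delta processing depends on stable primary keys, here is a minimal sketch that models a source dataset as an in-memory dictionary keyed by primary key. It does not use the Tamr Core API; the `apply_delta` helper, the record fields, and the key values are all hypothetical and exist only for this example.

```python
from typing import Dict, Iterable, Optional, Tuple

Record = Dict[str, str]
# A change is (primary key, new record); a record of None means "delete this key".
Change = Tuple[str, Optional[Record]]


def apply_delta(dataset: Dict[str, Record], changes: Iterable[Change]) -> Dict[str, Record]:
    """Return a copy of `dataset` with adds, updates, and deletes applied by primary key."""
    updated = dict(dataset)
    for key, record in changes:
        if record is None:
            updated.pop(key, None)  # delete; keys that are absent are ignored
        else:
            updated[key] = record   # add a new record or overwrite an existing one
    return updated


source = {
    "cust-001": {"name": "Acme Corp", "city": "Boston"},
    "cust-002": {"name": "Globex", "city": "Springfield"},
}
delta = [
    ("cust-002", {"name": "Globex", "city": "Shelbyville"}),  # update
    ("cust-003", {"name": "Initech", "city": "Austin"}),      # add
    ("cust-001", None),                                       # delete
]

result = apply_delta(source, delta)
print(sorted(result))  # ['cust-002', 'cust-003']
```

Every change is addressed by key alone. If the key for "Globex" were different in the next delta, the update would be treated as a new record and the old one would remain untouched.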
Primary Keys for Source Datasets
When you upload a source dataset to Tamr Core, you must identify the primary key for that dataset. You can either select a field present in the dataset, or select No Primary Key, which creates a numeric index for the dataset.
For source datasets, Tamr Core does not perform validation on the field selected as the primary key. A field containing duplicate values can be selected as the primary key and the upload will succeed; however, only one record will exist for each unique primary key value. Tamr suggests verifying the uniqueness of your target primary key field prior to uploading.
Note: You cannot select a combination of fields when uploading source datasets via the UI. See Uploading a Dataset into a Project.
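One way to verify uniqueness before uploading is to count how often each candidate key value occurs. The following is a minimal sketch that assumes the data has already been loaded as a list of dictionaries; the `find_duplicate_keys` helper and the sample field names (`id`, `region`, `name`) are hypothetical and not part of Tamr Core.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def find_duplicate_keys(
    rows: List[Dict[str, str]], key_fields: Sequence[str]
) -> Dict[Tuple[str, ...], int]:
    """Return each key value that occurs more than once, with its count."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


rows = [
    {"id": "1", "region": "US", "name": "Acme"},
    {"id": "2", "region": "US", "name": "Globex"},
    {"id": "1", "region": "EU", "name": "Acme GmbH"},
]

dups_single = find_duplicate_keys(rows, ["id"])
dups_combined = find_duplicate_keys(rows, ["id", "region"])

print(dups_single)    # {('1',): 2}
print(dups_combined)  # {}
```

In this sample, `id` alone is not a safe primary key: uploading with it would leave only one of the two records with `id` 1. The combination of `id` and `region` is unique, but as noted above, a combination of fields cannot be selected in the UI.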
Primary Keys for Output Datasets
All output datasets have a primary key. In unified datasets, the primary key is the tamr_id. In other output datasets, attributes such as entityId (equivalent to the origin_entity_id in the unified dataset) and clusterId are the primary key. The primary key depends on the content of the dataset and is not configurable.
How are Tamr IDs Generated?
There are three Tamr-generated attributes present in every unified dataset:
- origin_source_name: The name of the source dataset.
- origin_entity_id: The primary key of the source dataset.
- tamr_id: Generated by hashing origin_source_name and origin_entity_id, to create a primary key for the unified dataset.
By default, Tamr Core performs auto primary key management. This means that tamr_id is generated using origin_source_name and origin_entity_id as explained above. Some transformations may alter a record's tamr_id in potentially unexpected ways. See Managing Primary Keys.
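The idea of deriving an ID by hashing the source name and the source primary key can be sketched as follows. This is an illustration only: the hash function Tamr Core actually uses is not specified here, so the sketch assumes SHA-256 over a delimiter-joined string, and the `sketch_tamr_id` function and sample values are hypothetical.

```python
import hashlib


def sketch_tamr_id(origin_source_name: str, origin_entity_id: str) -> str:
    """Derive a deterministic ID from the source name and source primary key.

    Assumption: SHA-256 with a unit-separator delimiter. Tamr Core's real
    hashing scheme may differ.
    """
    payload = f"{origin_source_name}\x1f{origin_entity_id}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


id_a = sketch_tamr_id("customers.csv", "cust-001")
id_b = sketch_tamr_id("customers.csv", "cust-001")  # same inputs
id_c = sketch_tamr_id("customers.csv", "CUST-001")  # source key changed

# Feedback stored against the original ID.
feedback = {id_a: "verified"}

print(id_a == id_b)      # True
print(id_c in feedback)  # False
```

The same inputs always yield the same ID, so a record keeps its ID across reloads. Any change to the source name or the source primary key yields a different ID, which is why feedback attached to the old ID is no longer found for that record.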