Each time you upload a dataset to a project, you have the option to specify the attribute that contains the primary key. If you do not specify the primary key, Tamr Core generates one and populates it with the row number automatically. See Uploading a Dataset.
When you create the unified dataset for your project, Tamr Core automatically creates the following attributes with a data type of string:
origin_source_name, populated with the name of the input dataset.
origin_entity_id, populated with the primary key value from the input dataset.
tamr_id, populated with a hash of
origin_entity_idto create the primary key for the unified dataset.
At the end of your project’s transformations, these attributes must be present with the data type
string. Tamr Core automatically manages the
tamr_id primary key throughout transformations to ensure that it remains a unique string for every row. However, in some cases the result of this management might not align with your goals for transformations that affect training data.
In categorization and mastering projects, Tamr Core uses the
tamr_id to store all user feedback on the categorization or matching and clustering of records.
Important: If your categorization or mastering project includes a transformation that can either explicitly or unintentionally change the value of the
tamr_id, all prior user feedback, including pair labels and categorization labels, can be lost. The labels no longer link to any existing records that have new primary keys. As a result, when you design the transformations for a categorization or mastering project you may need to take extra steps to ensure the stability of
tamr_idover time; see When to Include a Transformation for tamr_id.
Transformations can either explicitly or unintentionally modify the system-generated attributes.
Linear transformations (transformations that do not combine or add rows in a dataset) do not change the internal attributes unless their values are explicitly modified, such as
upper(origin_source_name) AS origin_source_name.
Non-linear transformations modify
tamr_idwhen automatic primary key management is enabled (default). Changes to these transformations can result in unintentional changes to
tamr_id. Non-linear transformations are
UNION ALL, and
UNPIVOTstatements. Examples of changes that can unintentionally modify
tamr_idinclude updates to your
EXPLODEarray values, or adding or deleting input dataset rows when an
UNPIVOTis included. See the description for each of these statements for additional detail.
If your system administrator disables automatic primary key management, non-linear transformations can result in rows with duplicate values for
tamr_id. If there are duplicates in
tamr_id, the uniqueness requirement of a primary key at the end of transformations is violated and only one data row per
tamr_idis kept in the final output dataset. Automatic primary key management guarantees that no data is lost, but cannot guarantee the stability of
The system automatically modifies the primary keys to all
LOOKUPstatements with non-equality join conditions. This means that the system changes primary keys (tamr_id) for such
LOOKUPstatements. For information about primary key management with
LOOKUPstatements, see Lookup.
To enforce the stability of
tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the
tamr_id in the last transformation.
tamr_id in a transformation:
- Identify a subset of attributes in the unified dataset that uniquely identify individual records after all transformations.
- In your project’s last transformation, include a hash function to generate a hash code from those attributes:
hash(col1, col2, col3) AS tamr_id. See hash.
The system automatically manages primary keys. You do not need to disable this feature, for any projects, unless you have a specific workflow that requires you to specify your own keys.
You can disable automatic primary key management using the following procedures.
- Manual option. Insert
USE HINT(pkmanagement.manual)statements to apply a hint to the current transformation in the editor and to all subsequent transformations in that section.
Note: A hint applied to the input dataset transformations section is not automatically carried over to the unified dataset transformations section. See Hint.
- Script option. Use the
./unify-admin.sh maintenance --script DisablePKManagementoption to disable primary key assignments after an upgrade. This script automatically adds the
HINTdescribed above to each project in your instance. See Upgrading Tamr Core.
For information about other maintenance utilities that are available for
unify-admin.sh, see Utilities for Validation and System-Wide Processes.
Manually managing Tamr IDs means that we have more control over how Tamr Core links human feedback to records, which can be important when changing sources and transformations over time.
Say you are developing a mastering project with subject matter experts (SME’s), and you have performed several iterations of pair labelling and cluster validation. The SMEs then realize that the dataset needs to be joined with another dataset.
With auto primary key management:
The join transformation causes the Tamr IDs of some or all of the unified dataset to change, losing the link between records and corresponding pair labels and cluster verifications.
With manual primary key management:
Each record’s Tamr ID behaves according to the logic you set. You may set up new logic for generating a Tamr ID for any new records created by the join, whilst any pre-existing records retain the Tamr ID based on the logic you originally set. Tamr Core retains links between original records and feedback.
Say you imported a csv dataset into Tamr Core, and began to train the mastering model. During the project development process it is decided that in production the data should be read directly from the database, rather than extracted from a tool as a csv.
The database data is slightly different to the original csv, and a new source dataset is created in Tamr Core to hold this data. You then map the new dataset into the existing project and remove the old csv.
With auto primary key management:
The csv dataset and the database dataset have different names, and even if the source primary key is consistent, the Tamr IDs for these two datasets are different. When you replace the old csv with the new dataset, because of these different Tamr IDs, links between the records and feedback is lost.
With manual primary key management:
If your Tamr ID is generated based on a field (or fields) that are unique and consistent across the old and new datasets, then, when you swap the csv for the new dataset, the Tamr ID can be managed to remain consistent. This retains links between records and feedback.
USE HINT(pkmanagement.manual)in the very first transformation in both the source and unified transformations, and set the Tamr ID in the very last transformation in the project.
- Set the Tamr ID using either the tamr_id() or the hash() functions, for example,
select *, tamr_id(origin_source_name, origin_entity_id) as tamr_idcreates a Tamr ID equivalent to the default.
- After you have performed all transformations, ensure that the primary key field(s) used to generate the Tamr ID are unique across all source datasets. For example, if there are two source datasets and both contain a record
ABC001in the field feeding the Tamr ID, then it should also include a dataset identifier.
Note: If you are completely certain that there will not be duplicate primary keys across sources, then there is no need to include a dataset identifier in the Tamr ID. If however, you are not certain, then include a dataset identifier.
- Consider whether any of the transformations will result in additional records in the unified dataset, and choose the primary key field(s) feeding the Tamr ID in line with the final state of the unified dataset.
Updated over 1 year ago