How to Obtain a Primary Key
The primary key is the field or combination of fields (columns or attributes) that uniquely identifies an entity. (See Understanding Primary Keys for more information.) Tamr Core uses the primary key to identify the record, to track labels and cluster verification actions, and to map golden records or categories to source records. In short, to keep your data up to date, Tamr Core needs stable primary keys that do not change in downstream processes or systems. You can obtain a primary key in one of these ways in Tamr Core:
- Use an existing primary key as previously identified in your dataset.
- Use a Tamr Core-generated primary key per record.
- Use a combination of columns to define a unique record in the dataset.
This article describes how and when to use each of these methods, using the UI and APIs.
Using an Existing Primary Key
You can use an existing primary key from your datasets. The examples in this section are based on the following dataset, key_tutorial.csv:
Importing in the Tamr Core UI
In the dataset above, notice that the transactionid values are unique. You can specify the transactionid column as the primary key when importing the dataset in the UI, as shown:
After you add the dataset, you can view it in the Dataset Catalog. Notice that the transactionid key is shown as the ID Field for this dataset in the Dataset Catalog:
Importing with the Tamr Core API
You can also specify a dataset's primary key when importing it using the Tamr Core API. In the example below, notice that the primaryKey is set to transactionid:
After you import the dataset, you can view it in the Dataset Catalog. Notice that the transactionid key is shown as the ID Field for this dataset in the Dataset Catalog:
Using a Generated Primary Key
When adding a dataset, you can select No Primary Key as the Primary Key column, as shown below. In this case, Tamr Core generates a primary key column for the dataset, using line numbers as the primary key values.
Using a primary key generated by Tamr Core is the least preferred method of obtaining a primary key. If the input dataset is truncated (meaning that all records are deleted) and reloaded, then any pair labels, categorizations, cluster verification actions, and any other subject matter expert work will be lost.
To prevent data loss, you can identify a field of interest for your project (for example, linking transactions to vendors) and then use transformations to merge by transactionid, with the understanding that every record in the unified dataset represents one transactionid, as shown below:
This transformation results in the following unified dataset, key_tutorial_preprocessing_merge_unified_dataset:
At this point, the unified dataset for transactions can be used in any future projects.
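The fragility of position-based keys can be illustrated outside Tamr Core with a small Python sketch. It does not model Tamr Core's internals; the sample rows, the helper names (`key_by_line_number`, `key_by_column`, `label_target`), and the reload order are all assumptions made for illustration. It only shows that a label stored against a line-number key can end up attached to a different record after a reload, while a label stored against transactionid stays with the same record.

```python
# Hypothetical sample rows: column names follow the article, values are made up.
first_load = [
    {"transactionid": "T-100", "vendor": "Acme"},
    {"transactionid": "T-200", "vendor": "Globex"},
    {"transactionid": "T-300", "vendor": "Initech"},
]
# The same records reloaded after a truncate, arriving in a different order.
reload = [first_load[2], first_load[0], first_load[1]]


def key_by_line_number(rows):
    """Assign keys from row position, starting at 1."""
    return {str(i): row for i, row in enumerate(rows, start=1)}


def key_by_column(rows, column):
    """Assign keys from the values of an existing column."""
    return {row[column]: row for row in rows}


def label_target(keyed_rows, key):
    """Return the transactionid of the record a label's key points at, or None."""
    row = keyed_rows.get(key)
    return row["transactionid"] if row else None


# A reviewer labels the Globex transaction during the first load.
line_key = "2"        # its generated, position-based key
stable_key = "T-200"  # its transactionid

before_line = label_target(key_by_line_number(first_load), line_key)
after_line = label_target(key_by_line_number(reload), line_key)
before_stable = label_target(key_by_column(first_load, "transactionid"), stable_key)
after_stable = label_target(key_by_column(reload, "transactionid"), stable_key)

print(before_line, after_line)      # label keyed by position
print(before_stable, after_stable)  # label keyed by transactionid
```

In this sketch the position-based label moves from T-200 to T-100 after the reload, whereas the transactionid-based label still resolves to T-200.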
Using a Combination of Columns as the Primary Key
If your dataset does not include an existing primary key, you can create a primary key using a combination of columns. The examples in this section are based on the following dataset: key_tutorial_with_duplicates.csv. In this dataset, the column transactionid is not unique, but the combination of the columns transactionid and account_no is unique.
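Before choosing key columns, it can help to confirm which combinations are actually unique. The following is a minimal Python sketch of such a check, run outside Tamr Core. The rows are made-up stand-ins for key_tutorial_with_duplicates.csv (only the column names transactionid and account_no come from this article), and `duplicate_keys` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical rows standing in for key_tutorial_with_duplicates.csv.
rows = [
    {"transactionid": "1001", "account_no": "A-1", "amount": "25.00"},
    {"transactionid": "1001", "account_no": "A-2", "amount": "40.00"},
    {"transactionid": "1002", "account_no": "A-1", "amount": "12.50"},
]


def duplicate_keys(rows, columns):
    """Return the key tuples that occur more than once for the given columns."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


single = duplicate_keys(rows, ["transactionid"])
composite = duplicate_keys(rows, ["transactionid", "account_no"])

print(single)     # [('1001',)]  transactionid alone repeats
print(composite)  # []           the pair of columns is unique
```

For a real file, the rows could be read with `csv.DictReader` from the Python standard library instead of being written inline.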
Importing with the Tamr Core UI
When adding this dataset in the UI, you can select No Primary Key as the Primary Key column, as shown below. In this case, Tamr Core generates a primary key column, primaryKey, for the dataset, using line numbers as the primary key values.
This dataset now appears in the Dataset Catalog, and the ID field is set to the generated primaryKey column.
You can now use transformations to specify the primary key. In the example below, a transformation is added in a schema mapping project to preprocess the data:
This transformation results in the following unified dataset, which can now be used in any project while preserving primary key fidelity:
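The idea of deriving one key value from several columns can also be sketched in plain Python. This is not Tamr Core transformation syntax; the `composite_key` helper, the `|` separator, and the sample rows are assumptions chosen for illustration.

```python
# Hypothetical rows: column names follow the article, values are made up.
rows = [
    {"transactionid": "1001", "account_no": "A-1"},
    {"transactionid": "1001", "account_no": "A-2"},
    {"transactionid": "1002", "account_no": "A-1"},
]


def composite_key(row, columns, sep="|"):
    """Join the selected column values into a single key string."""
    return sep.join(str(row[c]) for c in columns)


keys = [composite_key(r, ["transactionid", "account_no"]) for r in rows]
print(keys)  # ['1001|A-1', '1001|A-2', '1002|A-1']

# Without a separator, different value pairs can collapse into the same key.
a = composite_key({"x": "12", "y": "3"}, ["x", "y"], sep="")
b = composite_key({"x": "1", "y": "23"}, ["x", "y"], sep="")
print(a == b)  # True: both become '123'
```

A separator that cannot appear in the column values keeps distinct column combinations from producing the same key, which is why the sketch does not simply concatenate the values.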
Importing with the Tamr Core API
You can also import a dataset using the Tamr Core API without specifying the primary key. In the example below, notice that the primaryKey parameter is not included:
For datasets imported using the API, the generated primary key field is TAMRSEQ. This primary key can be specified in a preprocessing project. In the image below, TAMRSEQ is shown as the ID Field for this dataset in the Dataset Catalog:
You can also specify two keys programmatically when importing to Tamr Core, as shown below. Note, however, that these datasets cannot be used in projects other than to facilitate joins and lookups.
In the image below, account_no, transactionid is set as the ID Field for this dataset in the Dataset Catalog: