How to Obtain a Primary Key

The primary key is the field or combination of fields (columns or attributes) that uniquely identifies an entity. (See Understanding Primary Keys for more information.) Tamr Core uses the primary key to identify each record, to track labels and cluster verification actions, and to map golden records or categories back to source records. To keep your data up to date, Tamr Core therefore needs stable primary keys that do not change in downstream processes or systems. You can obtain a primary key in Tamr Core in one of these ways:

  1. Use an existing primary key as previously identified in your dataset.
  2. Use a Tamr Core-generated primary key per record.
  3. Use a combination of columns to define a unique record in the dataset.

This article describes how and when to use each of these methods, using the UI and APIs.

Using an Existing Primary Key

You can use an existing primary key from your datasets. The examples in this section are based on the following dataset, key_tutorial.csv:

key_tutorial.csv dataset

Importing in the Tamr Core UI

In the dataset above, notice that the transactionid values are unique. You can specify the transactionid column as the primary key when importing the dataset in the UI, as shown:

Primary Key column is set to transactionid
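
Before you choose a column as the primary key, it can help to confirm that its values really are unique. The following is a minimal sketch in Python and is not part of Tamr Core. The sample rows and the helper name duplicate_values are invented for illustration; only the column names come from this article.

```python
import csv
import io
from collections import Counter

# Invented sample rows; only the column names come from the article.
SAMPLE_CSV = """transactionid,account_no,amount
1001,A-17,250.00
1002,A-17,75.50
1003,B-42,19.99
"""


def duplicate_values(csv_text, column):
    """Return the values of `column` that are empty or occur more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    return sorted(value for value, n in counts.items() if n > 1 or value == "")


# An empty list means the column is a usable primary key candidate.
print(duplicate_values(SAMPLE_CSV, "transactionid"))  # []
print(duplicate_values(SAMPLE_CSV, "account_no"))     # ['A-17']
```

A column that returns an empty list here has no repeated or blank values in the sample, which is the property a primary key needs.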

After you add the dataset, you can view it in the Dataset Catalog. Notice that the transactionid key is shown as the ID Field for this dataset in the Dataset Catalog:

Importing with the Tamr Core API

You can also specify a dataset's primary key when importing it using the Tamr Core API. In the example below, notice that the primaryKey is set to transactionid:

Importing a dataset using the API

After you import the dataset, you can view it in the Dataset Catalog. Notice that the transactionid key is shown as the ID Field for this dataset in the Dataset Catalog:

Primary key field in the Dataset Catalog

Using a Generated Primary Key

When adding a dataset, you can select No Primary Key as the Primary Key column, as shown below. In this case, Tamr Core generates a primary key column for the dataset, using line numbers as the primary key values.

Autogenerating a primary key

Using a primary key generated by Tamr Core is the least preferred method of obtaining a primary key. If the input dataset is truncated (that is, all of its records are deleted) and then reloaded, any pair labels, categorizations, cluster verification actions, and other subject matter expert work are lost.
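
One general way to see why positional keys are fragile is sketched below in Python. It does not model Tamr Core's internals. It only illustrates that a key derived from a line number identifies a position in a file rather than a record, so a reload can leave the same key pointing at different data. The rows and the helper name with_line_number_keys are invented for illustration.

```python
def with_line_number_keys(rows):
    """Assign 1-based line numbers as keys, imitating a position-based key."""
    return {line_no: row for line_no, row in enumerate(rows, start=1)}


# Invented rows in the order they appear at first load.
first_load = ["1001,A-17,250.00", "1002,A-17,75.50", "1003,B-42,19.99"]
# The same records reloaded after a new row was added at the top of the file.
second_load = ["1000,C-03,12.00"] + first_load

before = with_line_number_keys(first_load)
after = with_line_number_keys(second_load)

# Expert work recorded against generated key 2.
labels = {2: "verified"}

for key, label in labels.items():
    print(label, "was attached to:", before[key])
    print(label, "now points at:  ", after[key])
```

After the reload, key 2 refers to a different transaction, which is why a key taken from the data itself (such as transactionid) is the safer choice.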

To prevent this loss, identify the field that matters for your project (for example, transactionid when you are linking transactions to vendors) and then use transformations to merge by that field, with the understanding that every record in the unified dataset represents one transactionid, as shown below:

Primary key transformation

This transformation results in the following unified dataset, key_tutorial_preprocessing_merge_unified_dataset:

Transformed unified dataset

At this point, the unified dataset for transactions can be used in any future projects.

Using a Combination of Columns as the Primary Key

If your dataset does not include an existing primary key, you can create a primary key using a combination of columns. The examples in this section are based on the following dataset: key_tutorial_with_duplicates.csv. In this dataset, the transactionid column alone is not unique, but the combination of the transactionid and account_no columns is unique.

Dataset without a primary key
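
The difference between a single column and a combination of columns can be checked with a short script. This is a sketch in Python and is not part of Tamr Core. The rows and the helper name is_unique are invented for illustration; only the column names come from this article.

```python
from collections import Counter

# Invented rows; only the column names come from the article.
rows = [
    {"transactionid": "1001", "account_no": "A-17"},
    {"transactionid": "1001", "account_no": "B-42"},
    {"transactionid": "1002", "account_no": "A-17"},
]


def is_unique(rows, columns):
    """Return True when no two rows share the same values across `columns`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return all(n == 1 for n in counts.values())


print(is_unique(rows, ["transactionid"]))                # False
print(is_unique(rows, ["transactionid", "account_no"]))  # True
```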

Importing with the Tamr Core UI

When adding this dataset in the UI, you can select No Primary Key as the Primary Key column, as shown below. In this case, Tamr Core generates a primary key column, primaryKey, for the dataset, using line numbers as the primary key values.

This dataset now appears in the Dataset Catalog, and the ID field is set to the generated primaryKey column.

Dataset Catalog

You can now use transformations to specify the primary key. In the example below, a transformation is added in a schema mapping project to preprocess the data:

Primary key transformation

This transformation results in the following unified dataset, which you can now use in any project while preserving primary key fidelity:

Unified dataset
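
Outside Tamr Core, the idea of deriving one stable key from several columns can be sketched in Python as follows. This is not the transformation shown in the image above. The helper name composite_key, the separator, and the use of SHA-256 are assumptions chosen for illustration.

```python
import hashlib


def composite_key(row, columns, sep="|"):
    """Join the key columns with a separator and hash the joined string."""
    joined = sep.join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Invented rows; only the column names come from the article.
row_a = {"transactionid": "1001", "account_no": "A-17", "amount": "250.00"}
row_b = {"transactionid": "1001", "account_no": "B-42", "amount": "19.99"}

key_columns = ["transactionid", "account_no"]
print(composite_key(row_a, key_columns))
print(composite_key(row_b, key_columns))
```

Because the key depends only on the values of the chosen columns, it stays the same when rows are reordered or the file is reloaded. The separator keeps values such as ("12", "3") and ("1", "23") from producing the same joined string; if the data itself can contain the separator, choose a different one.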

Importing with the Tamr Core API

You can also import a dataset using the Tamr Core API without specifying the primary key. In the example below, notice that the primaryKey parameter is not included:

Importing a dataset without a primary key using the API

For datasets imported using the API, the generated primary key field is TAMRSEQ. This primary key can be specified in a preprocessing project. In the image below, TAMRSEQ is shown as the ID Field for this dataset in the Dataset Catalog:

Dataset Catalog

You can also specify two key columns programmatically when importing to Tamr Core, as shown below. Note, however, that datasets imported this way can be used in projects only to facilitate joins and lookups.

Dataset import using API

In the image below, account_no, transactionid is set as the ID Field for this dataset in the Dataset Catalog:

Dataset Catalog