Examples of Cluster ID Changes
Running jobs for record pairs and clusters assigns and updates the persistent IDs of clusters.
In a mastering project, you work to refine cluster membership by merging clusters together and moving records between clusters. The first time you run Apply feedback and update results or Update results only on the Pairs page, Tamr Core assigns persistent IDs to all clusters. Each time you run Review and publish clusters on the Clusters page after that, Tamr Core assigns persistent IDs to any newly-created clusters and deletes the IDs of any empty clusters.
Using the persistent IDs, you can:
- Merge the records in each cluster to form a single, merged record that describes an entity. See Golden Records Projects.
- Use the ID as a key in other systems.
- Track cluster changes over time using metrics and system-generated datasets of cluster membership.
See Publishing Clusters.
How Do Cluster IDs Change Over Time?
As you create, update, and delete clusters, their IDs can change. A resulting cluster can retain the ID of a previous cluster, and new clusters receive new IDs.
This section explains how Tamr Core handles cluster IDs and what you can do to review the cluster changes that take place over time:
- Surviving is the retention of an existing cluster ID. A cluster ID survives when a new cluster retains the ID of a cluster that existed previously.
- Minting is the creation of a new cluster ID. Tamr Core mints cluster IDs when new clusters are created.
- Retiring is the removal of a cluster ID. Tamr Core retires cluster IDs when existing clusters are deleted or emptied.
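The three terms can be illustrated by comparing the set of cluster IDs before a change with the set after it. The following is a minimal sketch, not Tamr Core code: the function name `classify_cluster_ids` and the single-letter IDs are hypothetical and exist only for illustration.

```python
def classify_cluster_ids(published_ids, current_ids):
    """Label each cluster ID as survived, minted, or retired.

    Hypothetical helper for illustration; not part of any Tamr Core API.
    """
    published = set(published_ids)
    current = set(current_ids)
    return {
        "survived": published & current,  # present before and after the change
        "minted": current - published,    # present only after the change
        "retired": published - current,   # present only before the change
    }


# Clusters A and B existed before; afterwards only A and a new cluster C exist.
changes = classify_cluster_ids({"A", "B"}, {"A", "C"})
```

Here `changes["survived"]` contains A, `changes["minted"]` contains C, and `changes["retired"]` contains B.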
The following examples explain how IDs evolve over time.
Example 1: Surviving and Minting of Cluster IDs
Consider a published cluster A that splits into two clusters. If one of the two clusters keeps the cluster ID A and the other cluster obtains a newly created cluster ID B, then cluster ID A "survives" and cluster ID B is "minted".
Cluster IDs are unique. Tamr Core never re-issues a previously minted cluster ID.
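One way to picture the uniqueness guarantee is an ID source that only ever moves forward. The sketch below is an assumption for illustration: the class `ClusterIdMinter` and the `cluster-N` ID format are hypothetical and do not reflect how Tamr Core actually generates IDs.

```python
import itertools


class ClusterIdMinter:
    """Hypothetical ID source that never hands out the same ID twice."""

    def __init__(self):
        self._counter = itertools.count(1)  # monotonically increasing
        self.issued = set()                 # every ID minted so far

    def mint(self):
        new_id = f"cluster-{next(self._counter)}"  # assumed format
        self.issued.add(new_id)
        return new_id


minter = ClusterIdMinter()
id_a = minter.mint()  # ID of the original published cluster
# The cluster splits: one part keeps id_a (it survives),
# and the other part receives a freshly minted ID.
id_b = minter.mint()
```

Because the counter never resets, an ID that has been minted once cannot be produced again, even if the cluster that held it no longer exists.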
Example 2: Surviving and Retiring of Cluster IDs
Consider two published clusters A and B that merge into one cluster. If the merged cluster keeps the ID A, cluster A's ID survives and cluster B's is retired.
When you merge clusters using Actions > Merge, Tamr Core survives the cluster ID of the cluster with the largest number of records. In this example, cluster A's ID survives because A is the larger of the two clusters.
Tamr Core does not re-issue retired cluster IDs.
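The merge rule for Actions > Merge (the largest cluster's ID survives, with ties going to the cluster listed first) can be sketched as follows. This is an illustrative assumption, not Tamr Core's implementation: the function `surviving_id_for_merge` and the `(cluster_id, record_count)` tuples are hypothetical.

```python
def surviving_id_for_merge(clusters):
    """Pick the surviving ID when merging clusters.

    clusters: list of (cluster_id, record_count) tuples in display order.
    Returns (surviving_id, retired_ids). Hypothetical helper for illustration.
    """
    if not clusters:
        raise ValueError("need at least one cluster to merge")
    # max() returns the first maximal item, so a tie goes to the
    # cluster that appears earliest in the list.
    surviving_id, _ = max(clusters, key=lambda cluster: cluster[1])
    retired_ids = [cid for cid, _ in clusters if cid != surviving_id]
    return surviving_id, retired_ids


# Cluster A has 5 records and cluster B has 3, so A's ID survives.
survivor, retired = surviving_id_for_merge([("A", 5), ("B", 3)])
```

In this sketch `survivor` is A and `retired` contains B, matching Example 2.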
Automatic and Manual Survivorship of Cluster IDs
Tamr Core automatically survives, mints, and retires cluster IDs between the latest published clusters and the current clusters:
- There are two ways to merge clusters in the Tamr Core UI: by selecting Actions > Merge or by drag and drop.
  - When you merge clusters via Actions > Merge, the cluster ID of the cluster with the largest number of records automatically survives on the merged cluster. In the event of a tie, Tamr Core survives the ID of the first cluster (the one that appears first in the left navigation).
  - When you merge clusters via drag and drop, the cluster ID of the destination cluster survives, while the cluster IDs of the dragged clusters are retired. See Merging Using Drag and Drop.
- When a cluster splits into two or more clusters, the cluster ID of the cluster that is split automatically survives on the new cluster with the largest number of records. In the event of a tie, the ID survives on the first of the split clusters. Tamr Core mints new cluster IDs for the other resulting clusters.
- When you split clusters via Move to new, the cluster ID of the given cluster survives on the remaining records, while a new cluster ID is minted for the cluster resulting from the moved records. See Moving Records to a New Cluster.
- Users with the curator or admin role can override this automatic behavior and specify which ID persists when merging and splitting clusters.
- When you publish clusters, Tamr Core publishes the resulting cluster IDs and they become the latest published clusters.
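The automatic split rule above can be sketched in the same spirit: the original ID stays with the largest resulting cluster (the first one on a tie), and every other resulting cluster gets a newly minted ID. The function `assign_ids_after_split`, its parameters, and the record IDs are hypothetical names chosen for illustration, not Tamr Core code.

```python
def assign_ids_after_split(original_id, parts, mint):
    """Assign cluster IDs to the clusters produced by a split.

    original_id: ID of the cluster being split.
    parts: list of record lists, one per resulting cluster, in order.
    mint: callable that returns a fresh, never-used cluster ID.
    Returns a list of (cluster_id, records) tuples. Hypothetical helper.
    """
    if not parts:
        raise ValueError("a split must produce at least one cluster")
    # Index of the largest part; max() keeps the first one on a tie.
    largest = max(range(len(parts)), key=lambda i: len(parts[i]))
    result = []
    for i, records in enumerate(parts):
        cluster_id = original_id if i == largest else mint()
        result.append((cluster_id, records))
    return result


# Cluster A splits three ways; the two-record part keeps ID A.
fresh_ids = iter(["B", "C"])  # stand-in for a real ID source
split_result = assign_ids_after_split(
    "A", [["r1"], ["r2", "r3"], ["r4"]], lambda: next(fresh_ids)
)
```

In this sketch the part containing r2 and r3 keeps ID A, while the other two parts receive the minted IDs B and C.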
Reporting Cluster Changes
You can access historical information about all past cluster versions through RESTful APIs:
- Fetch cluster history. See Get all versions of one or more published cluster.
- Fetch record history. See Get all versions of one or more published cluster given identifiers of records in them.
- Configure the time-to-live for clusters. See Update published clusters configuration.
- Obtain only the latest version of published clusters. See Using Low Latency Match with Published Clusters.