Tamr Documentation

Publishing Clusters

Publishing reassigns persistent IDs to clusters to reflect your changes and is a key process in the clustering workflow. You can use recall and precision metrics to guide when to publish.

Publishing in the Cluster Workflow

The first time that you initiate an Apply feedback and update results or Update results only job in a mastering project, Tamr “publishes” the initial set of clusters by assigning persistent IDs.

When you use filters to find high-impact and representative clusters to do an initial review, you are likely to discover these types of issues:

  • Too many different clusters are created for the same entity. These clusters need to be merged together.
  • Clusters contain records that are really for different entities. These clusters need to be split into separate clusters, or records moved to the correct cluster.

You can also go back and update the blocking model or verify more record pairs.

As you work with clusters, you choose when to manually republish by running a Review and publish clusters job; this job assigns persistent IDs to any new clusters and deletes any empty clusters. Each time you republish, Tamr saves a snapshot of the clusters and recomputes recall and precision metrics.

The cluster workflow is a flexible, iterative process with many possible variations. Understanding how the timing of publication affects your workflow can reduce unnecessary or unanticipated rework.

Precision and Recall Metrics for Clusters

Tamr computes precision and recall metrics for clusters.

  • Precision is a measure of the effectiveness of finding relationships among data, expressed by how many matched candidates were correct. It is measured by (True Positives)/(True Positives + False Positives). For clusters in mastering projects, precision is the ratio of assignments that are verified in the current cluster to all assignments suggested by Tamr.
  • Recall is a measure of the effectiveness of finding relationships among data, expressed by how many real matches were returned. It is measured by (True Positives)/(True Positives + False Negatives). For clusters in mastering projects, recall is the ratio of correct Tamr assignments to all correct expert assignments.
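
The two formulas above can be illustrated with a short Python sketch. The function names and the example counts are hypothetical and chosen only for illustration; they are not part of Tamr, and Tamr estimates these metrics from test records rather than from exact counts.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Return TP / (TP + FP), or 0.0 when there are no positive predictions."""
    denominator = true_positives + false_positives
    return true_positives / denominator if denominator else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Return TP / (TP + FN), or 0.0 when there are no real matches."""
    denominator = true_positives + false_negatives
    return true_positives / denominator if denominator else 0.0


# Hypothetical counts: 80 correct assignments, 20 incorrect assignments,
# and 10 correct assignments that were missed.
tp, fp, fn = 80, 20, 10
p = precision(tp, fp)  # 80 / 100
r = recall(tp, fn)     # 80 / 90
print(f"precision={p:.3f} recall={r:.3f}")
```

With these counts, precision is lower than recall: a fifth of the suggested assignments were wrong, while only a ninth of the real matches were missed.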

For more information about these metrics, see Precision and Recall.

Reviewing Precision and Recall for Clusters

Tamr computes cluster metrics from test records. To ensure accurate estimates, this feature is available only after you verify records in a sufficient number of high-impact clusters (10 or more). Tamr automatically enables the Estimate cluster metrics option when you reach the minimum number of verifications necessary for the computation. Tamr recommends using the High-impact filter and curating all of the resulting clusters to meet this minimum.

To review the most recently computed precision and recall metrics:

  1. Navigate to the Clusters page of a mastering project.
  2. In the top-right corner, select Estimate cluster metrics. The option is disabled while Tamr estimates metrics. When the estimate is available, the option changes to View cluster metrics.
    Note: If the Estimate cluster metrics option is disabled, verify additional high-impact clusters. Tamr enables the option when you reach the minimum number of verifications necessary for the computation.
  3. Select View cluster metrics. A dialog box opens with a graph showing changes in these metrics over time, with confidence intervals, followed by the current percentages.

Current values for precision and recall, trend over time, and confidence intervals

  4. Select Close when your review is complete.

Re-estimating Precision and Recall Metrics

As with other Tamr jobs, you must initiate the re-computation of the precision and recall metrics.

To re-estimate precision and recall:

  1. Navigate to the Clusters page of a mastering project.
  2. In the top-right corner, select View cluster metrics. A dialog box opens.
  3. Select Re-estimate. Tamr starts the "Compute clusters accuracy" job.

Publishing Clusters

To publish clusters:

  1. Navigate to the Clusters page of a mastering project.
  2. In the top-right corner, select Review and publish clusters. A dialog box opens with metrics showing changes in your clusters over time.

Change metrics for clusters

All cluster change and record movement metrics are compared to the most recent published version of clusters.

  3. Click Update published clusters.

Reviewing All Cluster Change Metrics

Note: Cluster change metrics are available only after clusters have been published for the first time.

The cluster change metrics are counts of the number of clusters that had different types of changes.

Modified: The number of clusters that have had membership changes since the last time clusters were published. For example, merging two clusters together yields a count of 2 modified clusters (the count includes both the empty cluster and the surviving cluster).

New: The number of clusters that were newly created and that did not exist in the previous set of published clusters.

Empty: Empty clusters are clusters that contained records in the previous set of published clusters, but now have no member records. These clusters and their published IDs will be archived the next time you update published clusters.
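
To make the three counts concrete, here is a minimal Python sketch that derives them from two snapshots of cluster membership. It is an illustration of the definitions above, not Tamr's implementation: the `Snapshot` type, the function name, and the sample IDs are all hypothetical, and the sketch assumes that an emptied cluster remains in the current snapshot with no members until the next publish.

```python
from typing import Dict, Set

# Hypothetical representation: cluster ID -> set of member record IDs.
Snapshot = Dict[str, Set[str]]


def cluster_change_counts(published: Snapshot, current: Snapshot) -> Dict[str, int]:
    """Count modified, new, and empty clusters relative to the published snapshot."""
    modified = new = empty = 0
    for cluster_id, members in current.items():
        previous = published.get(cluster_id)
        if previous is None:
            # The cluster did not exist in the previous published set.
            new += 1
            continue
        if members != previous:
            # Membership changed since the last publish.
            modified += 1
        if previous and not members:
            # The cluster had records before and has none now.
            empty += 1
    return {"modified": modified, "new": new, "empty": empty}


published = {"c1": {"r1", "r2"}, "c2": {"r3"}, "c3": {"r4"}}
# c2 was merged into c1, and c4 was newly created for record r5.
current = {"c1": {"r1", "r2", "r3"}, "c2": set(), "c3": {"r4"}, "c4": {"r5"}}

counts = cluster_change_counts(published, current)
print(counts)
```

In this example the merge of c2 into c1 contributes 2 to the modified count (the surviving cluster c1 and the now-empty cluster c2), which matches the merge example given for Modified.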

Reviewing a Specific Cluster's Change Metrics

To review change metrics for a specific cluster:

  1. Navigate to the Clusters page of a mastering project.
  2. Select a cluster from the left-hand panel.
  3. Select Open details. A panel with information about the cluster, beginning with its change metrics, opens on the right side.

Reviewing Record Movement Metrics

The record movement metrics are counts of the number of records that had different types of changes.

New: The number of new records in the unified dataset since the last published clusters.

Moved: The number of records moved to another cluster from their last published clusters.

Stayed: The number of records that did not move to another cluster from their last published clusters.

Deleted: The number of records deleted from the unified dataset since the last published clusters.
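
The four record movement counts can be sketched the same way by comparing each record's cluster at the last publish with its cluster now. This is an illustrative sketch under stated assumptions, not Tamr's implementation: the `Assignment` type, the function name, and the sample IDs are hypothetical.

```python
from typing import Dict

# Hypothetical representation: record ID -> cluster ID.
Assignment = Dict[str, str]


def record_movement_counts(published: Assignment, current: Assignment) -> Dict[str, int]:
    """Count new, moved, stayed, and deleted records relative to the last publish."""
    counts = {"new": 0, "moved": 0, "stayed": 0, "deleted": 0}
    for record_id, cluster_id in current.items():
        if record_id not in published:
            # The record was added to the unified dataset after the last publish.
            counts["new"] += 1
        elif published[record_id] != cluster_id:
            # The record is in a different cluster than at the last publish.
            counts["moved"] += 1
        else:
            counts["stayed"] += 1
    # Records that were published but are no longer in the unified dataset.
    counts["deleted"] = sum(1 for record_id in published if record_id not in current)
    return counts


published = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c3"}
# r3 moved from c2 to c1, r4 was deleted, and r5 is a new record.
current = {"r1": "c1", "r2": "c1", "r3": "c1", "r5": "c4"}

movement = record_movement_counts(published, current)
print(movement)
```

Every record in the current unified dataset falls into exactly one of New, Moved, or Stayed, so those three counts sum to the current record count.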

Reviewing a Specific Cluster's Record Movement

Counts of the number of records added to, or removed from, a specific cluster appear below the change metrics when you select Open details for the cluster. The following alternative procedure shows record details.

To review records added or removed for a specific cluster:

  1. Navigate to the Clusters page of a mastering project.
  2. Select a cluster from the left-hand panel.
  3. Select View changes. This option applies color-coding to the table of records in the center of the page. Records added to the cluster are highlighted in blue and records removed from the cluster are shown in red.

Reviewing a Specific Record's Cluster Membership Over Time

To review a specific record's cluster membership over time:

  1. Navigate to the Clusters page of a mastering project.
  2. Select a record from the central table of records.
  3. Click Open details. A panel with information about the record, beginning with current cluster membership and membership at last publication, opens on the right side.

Reviewing a Specific Record's Verification History

To review a specific record's verification history over time:

  1. Navigate to the Clusters page of a mastering project.
  2. Select a record from the records panel.
  3. Click Open details. A panel with information about the record, including a section for Verification & History, opens on the right side.

The verification history includes:

  • Whether the record has ever been verified.
  • If verified, the user who verified the record and the timestamp of the verification.
  • The current Tamr suggestion for cluster membership, or a message that suggestions will either be auto-accepted or have been disabled.
  • If verification was removed, the user who removed the verification with the timestamp, and the cluster in which the record was previously verified.

For more information on the clustering workflow, see Assigning Clusters and Verifying Clusters.

Updated about a month ago


