Publishing Clusters
Publishing reassigns persistent IDs to clusters to reflect your changes and is a key process in the clustering workflow. You can use recall and precision metrics to guide when to publish.
Publishing in the Cluster Workflow
The first time that you initiate an Apply feedback and update results or Update results only job in a mastering project, Tamr Core "publishes" the initial set of clusters by assigning persistent IDs.
When you use filters to find high-impact and representative clusters to do an initial review, you are likely to discover these types of issues:
- Too many different clusters are created for the same entity. These clusters need to be merged together.
- Clusters contain records that are really for different entities. These clusters need to be split into separate clusters, or moved to the correct cluster.
You can also go back and update the blocking model or verify more record pairs.
As you work with clusters, you choose when to manually republish by running a Review and publish clusters job; this job assigns persistent IDs to any new clusters and deletes any empty clusters. Each time you republish, Tamr Core saves a snapshot of the clusters and recomputes recall and precision metrics.
The cluster workflow is a flexible, iterative process with many other variations possible. Understanding the importance of when you choose to publish during your workflow can reduce unnecessary or unanticipated rework.
Precision and Recall Metrics for Clusters
Tamr Core computes precision and recall metrics for clusters:
- Precision is a measure of the effectiveness of finding relationships among data, expressed by how many of the pairs returned as matches were correct. It is measured by (True Positives)/(True Positives + False Positives). For clusters in mastering projects, precision is the ratio of assignments that are verified in the current cluster to all assignments suggested.
- Recall is a measure of the effectiveness of finding relationships among data, expressed by how many real matching pairs were returned. It is measured by (True Positives)/(True Positives + False Negatives). For clusters in mastering projects, recall is the ratio of correct assignments to all correct expert assignments.
For more information about these metrics, see Precision and Recall.
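The two formulas above can be illustrated with a minimal Python sketch. The function names and the counts are hypothetical and chosen only for illustration; this is not how Tamr Core computes its estimates, which are derived from test records and reported with confidence intervals.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """TP / (TP + FP). Returns 0.0 when there are no positive predictions."""
    denominator = true_positives + false_positives
    return true_positives / denominator if denominator else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """TP / (TP + FN). Returns 0.0 when there are no real matches."""
    denominator = true_positives + false_negatives
    return true_positives / denominator if denominator else 0.0


# Hypothetical counts: 90 correct assignments, 10 incorrect assignments,
# and 30 correct assignments that were missed.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp))  # 0.9
print(recall(tp, fn))     # 0.75
```

With these counts, 90% of the suggested assignments are correct (precision), while only 75% of the correct assignments were found (recall).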
Reviewing Precision and Recall for Clusters
Tamr Core computes cluster metrics from test records. To ensure accurate estimates, this feature is available only after you verify records in a sufficient number of high-impact clusters (10 or more). Tamr Core automatically enables the Estimate cluster metrics option when you reach the minimum number of verifications necessary for the computation. Tamr recommends using the high-impact filter and curating all of the resulting clusters to meet this minimum.
To review the most recently computed precision and recall metrics:
- In a mastering project, select the Clusters page.
- In the top-right corner, select Estimate cluster metrics. The Estimate cluster metrics option is disabled while Tamr Core estimates metrics. When the estimate is available, the option changes to View cluster metrics.
Note: If the Estimate cluster metrics option is disabled, verify additional high-impact clusters. Tamr Core enables the option when you reach the minimum number of verifications necessary for the computation. Only clusters verified using Verify and Enable Suggestions are considered when computing metrics.
- Select View cluster metrics. A dialog box opens with a graph showing changes in these metrics over time, with confidence intervals, followed by the current percentages.
- Select Close when your review is complete.
Re-estimating Precision and Recall Metrics
As with other Tamr Core jobs, precision and recall metrics are not recomputed automatically; you must initiate the re-computation yourself.
To re-estimate precision and recall:
- In a mastering project, select the Clusters page.
- In the top-right corner, select View cluster metrics. A dialog box opens.
- Select Re-estimate. Tamr Core starts the "Compute clusters accuracy" job. See Monitoring Job Status.
Publishing Clusters
To publish clusters:
- In a mastering project, select the Clusters page.
- In the top-right corner, select Review and publish clusters. A dialog box opens with metrics showing changes in your clusters over time.
All cluster change and record movement metrics are compared to the most recent published version of clusters.
- Click Update published clusters.
Reviewing All Cluster Change Metrics
Note: Cluster change metrics are only available after the first execution of publishing clusters.
The cluster change metrics are counts of the number of clusters that had different types of changes:
- Modified: The number of clusters that have had membership changes since the last time clusters were published. For example, merging two clusters together yields a count of 2 modified clusters (the count includes both the emptied cluster and the surviving cluster).
- New: The number of clusters that were newly created and that did not exist in the previous set of published clusters.
- Empty: Empty clusters are clusters that contained records in the previous set of published clusters, but now have no member records. These clusters and their published IDs are archived the next time you publish clusters.
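The three counts above can be sketched in Python by comparing two snapshots of cluster membership. Everything here is an assumption made for illustration: the `cluster_change_counts` function, the dictionary representation (cluster ID mapped to a set of record IDs), and the sample IDs are hypothetical and do not reflect Tamr Core's internal implementation or API.

```python
from typing import Dict, Set

# Hypothetical representation: cluster ID -> set of member record IDs.
Clusters = Dict[str, Set[str]]


def cluster_change_counts(published: Clusters, current: Clusters) -> Dict[str, int]:
    """Count modified, new, and empty clusters relative to the last publish."""
    # Modified: a published cluster whose membership differs now
    # (this includes clusters that have become empty).
    modified = sum(
        1 for cid, members in published.items()
        if current.get(cid, set()) != members
    )
    # New: a non-empty cluster whose ID was not in the published set.
    new = sum(
        1 for cid, members in current.items()
        if cid not in published and members
    )
    # Empty: a cluster that had records when published but has none now.
    empty = sum(
        1 for cid, members in published.items()
        if members and not current.get(cid)
    )
    return {"modified": modified, "new": new, "empty": empty}


published = {"c1": {"r1", "r2"}, "c2": {"r3"}, "c3": {"r4"}}
# Since the last publish: c2 was merged into c1, and r5 formed a new cluster c4.
current = {"c1": {"r1", "r2", "r3"}, "c2": set(), "c3": {"r4"}, "c4": {"r5"}}

print(cluster_change_counts(published, current))
# {'modified': 2, 'new': 1, 'empty': 1}
```

The merge of c2 into c1 contributes 2 to the modified count, matching the example given for Modified: both the emptied cluster and the surviving cluster changed membership.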
Reviewing a Specific Cluster's Change Metrics
To review change metrics for a specific cluster:
- In a mastering project, select the Clusters page.
- Select a cluster from the left side panel.
- Select Open details. A panel with information about the cluster, beginning with its change metrics, opens on the right side.
Reviewing Record Movement Metrics
The record movement metrics are counts of the number of records that had different types of changes:
- New: The number of new records in the unified dataset since last publish.
- Moved: The number of records moved to another cluster since last publish.
- Stayed: The number of records that did not move to another cluster since last publish.
- Deleted: The number of records deleted from the unified dataset since last publish.
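As a rough illustration of these four counts, the following Python sketch compares each record's cluster at the last publish with its cluster now. The `record_movement_counts` function, the record-to-cluster dictionaries, and the sample IDs are hypothetical, and the sketch is not a description of how Tamr Core computes these metrics.

```python
from typing import Dict

# Hypothetical representation: record ID -> ID of the cluster it belongs to.
Assignments = Dict[str, str]


def record_movement_counts(published: Assignments, current: Assignments) -> Dict[str, int]:
    """Count new, moved, stayed, and deleted records since the last publish."""
    # New: present now, absent at the last publish.
    new = sum(1 for rid in current if rid not in published)
    # Deleted: present at the last publish, absent now.
    deleted = sum(1 for rid in published if rid not in current)
    # Moved: present in both snapshots, but in a different cluster.
    moved = sum(
        1 for rid, cid in current.items()
        if rid in published and published[rid] != cid
    )
    # Stayed: present in both snapshots, in the same cluster.
    stayed = sum(1 for rid, cid in current.items() if published.get(rid) == cid)
    return {"new": new, "moved": moved, "stayed": stayed, "deleted": deleted}


published = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c3"}
# Since the last publish: r3 moved to c1, r4 was deleted, r5 was added.
current = {"r1": "c1", "r2": "c1", "r3": "c1", "r5": "c4"}

print(record_movement_counts(published, current))
# {'new': 1, 'moved': 1, 'stayed': 2, 'deleted': 1}
```

In this sketch every record that exists in both snapshots is counted exactly once, as either moved or stayed.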
Reviewing a Specific Cluster's Record Movement
Counts of the number of records added to, or removed from, a specific cluster appear below the change metrics when you select Open details for the cluster. The following alternative procedure shows record details.
To review records added or removed for a specific cluster:
- In a mastering project, select the Clusters page.
- Select a cluster from the left side panel.
- Select View changes. This option applies color-coding to the table of records in the center of the page. Records added to the cluster are highlighted in blue and records removed from the cluster are shown in red.
Reviewing a Specific Record's Cluster Membership Over Time
To review a specific record's cluster membership over time:
- In a mastering project, select the Clusters page.
- Select a record from the central table of records.
- Click Open details. A panel with information about the record, beginning with current cluster membership and membership at last publication, opens on the right side.
Reviewing a Specific Record's Verification History
To review a specific record's verification history over time:
- In a mastering project, select the Clusters page.
- Select a record from the records panel.
- Choose Open details. A panel with information about the record, including a section for verification and history, opens on the right side.
The verification history includes:
- Whether the record has ever been verified.
- If verified, the user who verified it with the timestamp.
- The current system suggestion for cluster membership, or a message that suggestions are auto-accepted or have been disabled.
- If verification was removed, the user who removed the verification with the timestamp, and the cluster in which the record was previously verified.
For more information on the clustering workflow, see Assigning Clusters and Verifying Clusters.