When you apply feedback and update record pairs, Tamr also generates the first iteration of record clusters. A cluster can contain one or more records, all of which should represent the same distinct entity. In a given data mastering project, cluster size can range from a single record, known as a singleton cluster, to tens of thousands of records.
To achieve one record cluster for each entity, containing all records for that entity and only records for that entity, you and your experts review a small number of important clusters and take the following actions to improve the Tamr model:
- Merge any clusters that contain records for the same entity.
- Move records that were placed in a cluster for a different entity into the existing cluster for the correct entity.
- Separate records into a new cluster for an entity that does not already have a cluster.
- Verify records as correctly belonging to a cluster. When you verify a record's membership in a cluster, you can choose whether Tamr can use that verified membership to make suggestions about future cluster members.
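To make the four actions above concrete, here is a minimal Python sketch that models clusters as a mapping from a cluster ID to a set of record IDs. It is an illustration only: the function names (`merge`, `move`, `separate`, `verify`), the `allow_suggestions` flag, and the record and cluster IDs are all hypothetical and are not part of Tamr's API or its internal data model.

```python
# Toy model of cluster curation (hypothetical; not Tamr's API).
# clusters maps a cluster ID to the set of record IDs it contains.

def merge(clusters: dict, target: str, source: str) -> None:
    """Merge two clusters that contain records for the same entity."""
    if target == source:
        return
    clusters[target].update(clusters.pop(source))


def move(clusters: dict, record: str, target: str) -> None:
    """Move a record from whichever cluster holds it into an existing cluster."""
    for members in clusters.values():
        members.discard(record)
    clusters[target].add(record)


def separate(clusters: dict, records: set, new_id: str) -> None:
    """Split records out into a new cluster for an entity that has none yet."""
    if new_id in clusters:
        raise ValueError(f"cluster {new_id!r} already exists")
    for members in clusters.values():
        members.difference_update(records)
    clusters[new_id] = set(records)


def verify(clusters: dict, verified: dict, record: str, cluster_id: str,
           allow_suggestions: bool = True) -> None:
    """Record that a reviewer confirmed this record belongs to this cluster."""
    if record not in clusters[cluster_id]:
        raise ValueError(f"{record!r} is not a member of {cluster_id!r}")
    verified[record] = (cluster_id, allow_suggestions)


# Example review session on made-up data.
clusters = {"c1": {"r1", "r2"}, "c2": {"r3"}, "c3": {"r4", "r5"}}
verified = {}

merge(clusters, "c1", "c2")        # c1 and c2 describe the same entity
move(clusters, "r5", "c1")         # r5 was clustered with the wrong entity
separate(clusters, {"r3"}, "c4")   # r3 is an entity with no cluster yet
verify(clusters, verified, "r1", "c1", allow_suggestions=False)
```

In this sketch a move can leave a cluster with fewer members, and in principle with none. Cleaning up empty clusters is deliberately left out of these functions.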
After you review high-impact clusters to verify member records and make other changes, you can generate precision and recall metrics to help you track model accuracy over time.
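This page does not describe how Tamr computes these metrics. As a rough illustration of what precision and recall mean for clusters, the sketch below uses the pairwise definition that is common in entity resolution: a pair of records counts as a match when both records share a cluster. The function names and sample data are hypothetical, and the pairwise definition itself is an assumption rather than Tamr's documented method.

```python
from itertools import combinations


def same_cluster_pairs(clusters: dict) -> set:
    """Return every unordered pair of records that share a cluster."""
    pairs = set()
    for members in clusters.values():
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


def pairwise_precision_recall(predicted: dict, truth: dict) -> tuple:
    """Pairwise precision and recall of predicted clusters against verified ones."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    # With no pairs to judge, treat the score as perfect (a convention, not a rule).
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Made-up example: the model over-merged r3 and missed the r3-r4 match.
predicted = {"a": {"r1", "r2", "r3"}, "b": {"r4"}}
truth = {"x": {"r1", "r2"}, "y": {"r3", "r4"}}

precision, recall = pairwise_precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the predicted clusters imply three record pairs, of which one is correct, and the verified clusters imply two pairs, of which one was found. Low precision points to clusters that need records moved out, and low recall points to clusters that should be merged.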
Tip: The first time that you initiate an Apply feedback and update results or Update results only job in a mastering project, Tamr “publishes” the initial set of clusters by assigning persistent IDs. As you work with clusters, you choose when to manually republish by running a Review and publish clusters job; this job assigns persistent IDs to any new clusters and deletes any empty clusters. Each time you republish, Tamr saves a snapshot of the clusters and recomputes recall and precision metrics.
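The publish behavior described in this tip can be pictured with a small Python sketch. It is a toy model under stated assumptions: the `publish` function, the `pid-N` ID scheme, and the in-memory snapshot list are hypothetical and do not reflect how Tamr stores or names persistent IDs. Recomputing metrics is omitted.

```python
import copy


def publish(clusters: dict, persistent_ids: dict, snapshots: list) -> None:
    """Toy publish step: drop empty clusters, assign IDs to new ones, snapshot."""
    # Delete clusters that no longer contain any records.
    for cluster_id in [cid for cid, members in clusters.items() if not members]:
        del clusters[cluster_id]

    # Clusters that already have a persistent ID keep it; new ones get the next ID.
    for cluster_id in clusters:
        if cluster_id not in persistent_ids:
            persistent_ids[cluster_id] = f"pid-{len(persistent_ids) + 1}"

    # Save an independent copy of the published state.
    snapshots.append(copy.deepcopy(clusters))


# Made-up state: c1 was published earlier, c2 is now empty, c3 is new.
clusters = {"c1": {"r1"}, "c2": set(), "c3": {"r2"}}
persistent_ids = {"c1": "pid-1"}
snapshots = []

publish(clusters, persistent_ids, snapshots)
```

After this call, `c2` is gone, `c1` keeps its existing ID, `c3` receives a new one, and `snapshots` holds one saved copy of the clusters.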
The iterative curation of important clusters allows Tamr to accurately cluster all records into distinct entities.
Both curators and verifiers can review, filter, assign, and verify clusters. See the following: