In a mastering project, you evaluate pairs of records to train Tamr Core to accurately and efficiently identify duplicate records.
In the first deduplication stage, you specify attributes that, when identical, reliably identify unique entities. Tamr Core groups multiple records together only if they have identical values for all of the specified attributes. See Grouping Obvious Duplicates.
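As a rough illustration of the grouping rule, the following sketch places records in the same group only when every grouping attribute has an identical value. This is not Tamr Core code: the function name `group_records`, the record layout, and the attribute names are hypothetical, and the treatment of missing values here is an assumption.

```python
from collections import defaultdict


def group_records(records, key_attributes):
    """Group record ids whose values are identical for every key attribute.

    Hypothetical helper for illustration only; not part of Tamr Core.
    """
    groups = defaultdict(list)
    for record in records:
        # The grouping key is the tuple of values for all key attributes,
        # so two records share a group only if every value is identical.
        key = tuple(record.get(attr) for attr in key_attributes)
        groups[key].append(record["id"])
    return list(groups.values())


# Made-up sample data.
records = [
    {"id": 1, "name": "Acme Corp", "tax_id": "12-345"},
    {"id": 2, "name": "Acme Corp", "tax_id": "12-345"},
    {"id": 3, "name": "Acme Corp", "tax_id": "98-765"},
]

groups = group_records(records, ["name", "tax_id"])
print(groups)
```

In this sample, records 1 and 2 form one group, while record 3 stays on its own because its `tax_id` differs even though its `name` matches.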
Although record grouping is an optional stage, it reduces the number of possibly matching pairs that must be evaluated and increases the quality of the clusters in the later stages. This makes both machine learning jobs and the efforts of your team of experts more efficient.
In a mastering project that includes record grouping, the pair generation process finds potential matches that consist of:
- a record and a record
- a group and a group
- a record and a group
For information about how groups appear on the Pairs page, see Interpreting Lists of Values.
The goal of the blocking model is to help you efficiently identify matching pairs. A blocking model is composed of one or more blocking terms, which define matching conditions for unified attribute values. See Adding a Blocking Clause.
When you create a blocking model, you can estimate the number of pairs that Tamr Core will generate as you add blocking terms. For example, in a real-world dataset of 1M customer account records, the model can typically find 50M potentially matching pairs.
Estimating pair counts allows you to iterate quickly when defining blocking terms and see the effect of adjusting thresholds, tokenizers, and similarity functions before you run a full pair generation job.
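To make the idea of estimating pair counts concrete, the following sketch counts candidate pairs on a small sample as blocking conditions are added. It is a simplification under stated assumptions: each condition is an exact match on a derived key rather than a similarity threshold, a pair qualifies if any one condition matches, and the names `candidate_pairs`, `first_name_token`, and `zip_code` are hypothetical rather than Tamr Core functions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_keys):
    """Return the set of record-id pairs that share a key under any condition.

    Assumption: exact-key blocking stands in for the thresholds, tokenizers,
    and similarity functions that a real blocking term would use.
    """
    pairs = set()
    for key_fn in blocking_keys:
        blocks = defaultdict(list)
        for record in records:
            key = key_fn(record)
            if key:  # skip records with no usable key
                blocks[key].append(record["id"])
        for ids in blocks.values():
            # Every pair of records within a block is a candidate pair.
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def first_name_token(record):
    return record["name"].split()[0].lower()


def zip_code(record):
    return record["zip"]


# Made-up sample data.
records = [
    {"id": 1, "name": "Acme Corp", "zip": "02139"},
    {"id": 2, "name": "Acme Corporation", "zip": "02139"},
    {"id": 3, "name": "Apex Ltd", "zip": "02139"},
    {"id": 4, "name": "Acme Inc", "zip": "94105"},
]

name_only = candidate_pairs(records, [first_name_token])
name_or_zip = candidate_pairs(records, [first_name_token, zip_code])
print(len(name_only), len(name_or_zip))
```

With four records there are six possible pairs. Blocking on the first name token alone yields three candidates, and adding the ZIP code condition raises the count to five, which shows how each added condition changes the volume of pairs to review.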
Initially, a verifier, curator, or admin selects a handful of arbitrary pairs and labels them as Match or No Match. These initial matching and non-matching pairs provide the model with feedback to learn from so it can begin suggesting match and no match labels for pairs. See Training Initial Pairs and Reviewing Record Pairs.
When you run the model again, it suggests labels and identifies high-impact pairs for prioritized review. You assign high-impact and other pairs to reviewers for their feedback on the accuracy of Tamr Core’s match or no match suggestions. Curators and verifiers then verify the reviewer feedback and repeat the process to update results and improve system accuracy.
Both curators and verifiers can assign and verify pair labels. See the following:
The iterative curation of high-impact, learned, and other types of pairs allows Tamr Core to accurately classify all pairs as matching or non-matching.
Each time you make changes to the blocking model or apply verified feedback to update results, the pairs that appear on the Pairs page update. All previously reviewed and labeled pairs appear with the following exception: if updates are made to the source data and a record is no longer present in the unified dataset, pairs that include that record no longer appear regardless of whether a pair was previously reviewed and labeled. If you update the blocking model, the reviewer’s label is retained even if one or both records in the pair no longer satisfy the new blocking terms.
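The retention rules above can be sketched as a small set operation. This is an illustrative model only, not how Tamr Core stores or computes pairs: the function `pairs_to_display`, the label strings, and the data structures are hypothetical.

```python
def pairs_to_display(generated_pairs, labels, unified_record_ids):
    """Combine newly generated pairs with previously labeled pairs.

    Hypothetical model: a labeled pair is kept even if it no longer satisfies
    the blocking model, unless one of its records has left the unified dataset.
    """
    retained = {
        pair
        for pair in labels
        if all(record_id in unified_record_ids for record_id in pair)
    }
    return set(generated_pairs) | retained


# Made-up sample data: record 3 was removed from the source data.
labels = {(1, 2): "MATCH", (2, 3): "NO_MATCH"}
unified_record_ids = {1, 2, 4}
generated_pairs = {(1, 4)}  # pairs produced by the updated blocking model

displayed = pairs_to_display(generated_pairs, labels, unified_record_ids)
print(sorted(displayed))
```

Here the pair `(1, 2)` remains visible with its label even though the updated blocking model did not generate it, while `(2, 3)` drops out because record 3 is no longer in the unified dataset.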
Updating grouping keys or making changes to clusters can also affect the labels and pairs that appear on the Pairs page.
- When you update grouping keys, the next pair generation job creates pairs with the newly formed groups. These pairs are likely to be entirely different from the pairs generated before. To preserve label verification efforts made by your team, Tamr Core tracks the previous group assignment for every record and translates any prior verified label to each record’s new group. These translated labels are called inferred pair labels.
- Cluster verification actions can also result in new and changed pair labels. Tamr Core can learn and apply these labels to the model for you when you enable learned pairs.
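One way to picture the translation of verified labels to new groups is the sketch below. It is a guess at the general idea, not Tamr Core's actual algorithm: the function `infer_pair_labels`, the group identifiers, and the choice to collect labels in a set (leaving conflict resolution open) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def infer_pair_labels(verified_labels, old_group_of, new_group_of):
    """Carry verified labels on old group pairs over to new group pairs.

    Hypothetical sketch: each record in a labeled old group passes the label
    to the new group it now belongs to.
    """
    # Index the members of each old group.
    members = defaultdict(list)
    for record_id, group in old_group_of.items():
        members[group].append(record_id)

    inferred = defaultdict(set)
    for (group_a, group_b), label in verified_labels.items():
        for rec_a, rec_b in product(members[group_a], members[group_b]):
            new_a, new_b = new_group_of[rec_a], new_group_of[rec_b]
            if new_a != new_b:  # records now in one group no longer form a pair
                inferred[tuple(sorted((new_a, new_b)))].add(label)
    return dict(inferred)


# Made-up sample data: old group g1 splits into n1 and n2 after a key change.
old_group_of = {1: "g1", 2: "g1", 3: "g2"}
new_group_of = {1: "n1", 2: "n2", 3: "n3"}
verified_labels = {("g1", "g2"): "MATCH"}

inferred = infer_pair_labels(verified_labels, old_group_of, new_group_of)
print(inferred)
```

In this example, the single verified label on the old pair of groups becomes two inferred labels, one for each new group that inherited a record from `g1`.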