
Working with Record Pairs

Understand record matching and how to curate and review high-impact record pairs in a mastering project.

A mastering project allows you to find similar records by generating record pairs from all datasets mapped to the unified schema. A record pair is defined as two records that form a potential match.


The iterative curation of record pairs allows the model to accurately classify all record pairs as matching or non-matching.

Generating Record Pairs

Blocking Model

After you create the unified dataset, the next step is to create a blocking model to generate record pairs that are a potential match. See Defining the Blocking Model.

The goal of the blocking model is to help you efficiently identify matching record pairs. A blocking model is composed of one or more blocking terms, which define matching conditions for unified attribute values. See Adding a Blocking Clause.
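As a rough illustration of the general idea behind blocking, the following Python sketch groups records by a key and only pairs records that share that key. It is not Tamr Core's implementation: the `blocking_key` function, the `name` and `zip` attributes, and the sample records are all hypothetical, chosen only to show how a matching condition on attribute values limits which pairs are generated.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking term: first three letters of the lowercased
    # name, combined with the postal code. Not a Tamr Core API.
    return (record["name"].strip().lower()[:3], record["zip"])


def generate_pairs(records):
    # Group record IDs into blocks by their blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])

    # Only records in the same block become candidate pairs.
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Made-up sample data for illustration only.
records = [
    {"id": 1, "name": "Acme Corp", "zip": "02139"},
    {"id": 2, "name": "ACME Corporation", "zip": "02139"},
    {"id": 3, "name": "Acme Ltd", "zip": "10001"},
    {"id": 4, "name": "Globex", "zip": "02139"},
]

candidate_pairs = generate_pairs(records)
print(candidate_pairs)
```

In this toy data, only records 1 and 2 share a key, so a single candidate pair is produced instead of the six pairs that comparing every record with every other record would yield.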

Estimating Pair Counts

When you create a blocking model, it helps to estimate the number of record pairs that Tamr Core will generate. For example, in a real-world dataset of 1M customer account records, the model can typically find 50M potentially matching record pairs.

Estimating record pair counts allows you to iterate quickly when discovering blocking terms and see the effect of adjusting thresholds, tokenizers and similarity functions.

See Estimating Pair Counts and Blocks.
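To put the example numbers in perspective, the arithmetic below compares 50M candidate pairs with the number of all possible pairs among 1M records. The `estimate_pair_count` helper is a hypothetical sketch of how a pair count could be estimated from block sizes for a single blocking key; it is not the estimation method Tamr Core uses, and with several blocking terms the per-term sums would overlap and only give an upper bound.

```python
from collections import Counter


def total_possible_pairs(n):
    # Number of unordered pairs among n records: n * (n - 1) / 2.
    return n * (n - 1) // 2


def estimate_pair_count(block_keys):
    # Hypothetical estimate for one blocking key: a block of size s
    # contributes s * (s - 1) / 2 candidate pairs.
    sizes = Counter(block_keys)
    return sum(s * (s - 1) // 2 for s in sizes.values())


n = 1_000_000
all_pairs = total_possible_pairs(n)
fraction = 50_000_000 / all_pairs

print(all_pairs)            # 499999500000
print(f"{fraction:.4%}")    # roughly 0.01% of all possible pairs

# Toy block assignment: three blocks of sizes 3, 2, and 1.
print(estimate_pair_count(["a", "a", "a", "b", "b", "c"]))  # 4
```

Under these figures, the 50M candidate pairs are about one ten-thousandth of the nearly 500 billion pairs that exist among 1M records.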

Curating and Reviewing Record Pairs

Initially, a verifier, curator, or admin selects a handful of arbitrary record pairs and classifies each as Match or No Match. These initial matching and non-matching record pairs provide the model with the first feedback required to begin learning and allow the curator to initialize the entity resolution model. See Training Initial Pairs and Reviewing Record Pairs.

When you run the model again, it identifies high-impact record pairs for prioritized review. You assign high-impact and other record pairs to reviewers for their feedback about the system's accuracy in identifying each pair as match or no match. Curators and verifiers then verify the match/no match labels provided by the reviewers and repeat the process to update results and improve system accuracy.

Both curators and verifiers can assign and verify pair labels. See the following:

How Iteration Can Affect Record Pairs

The iterative curation of high-impact, learned, and other record pairs allows Tamr Core to accurately classify all record pairs as matching or non-matching.

Each time you make changes to the blocking model or apply verified feedback to update results, the pairs that appear on the Pairs page update. All previously reviewed and labeled pairs appear with the following exception: if updates are made to the source data and a record is no longer present in the unified dataset, pairs that include that record no longer appear regardless of whether a pair was previously reviewed and labeled.

Note: A pair that is reviewed and labeled will continue to appear on the Pairs page even if updates to the blocking model mean that one or both records in the pair no longer satisfy the new blocking terms.
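The rules above can be summarized in a small Python sketch. This is only a reading of the behavior described here, not Tamr Core code: the function name, its parameters, and the assumption that unlabeled pairs must satisfy the current blocking terms to appear are all illustrative.

```python
def pairs_on_page(pairs, labeled, unified_ids, satisfies_blocking):
    """Return the pairs expected to appear after an update.

    pairs: iterable of (id_a, id_b) tuples known before the update
    labeled: set of pairs that were reviewed and labeled
    unified_ids: set of record IDs still present in the unified dataset
    satisfies_blocking: callable (id_a, id_b) -> bool for the new
        blocking model (hypothetical stand-in)
    """
    visible = set()
    for a, b in pairs:
        # A pair disappears if either record left the unified dataset,
        # even when the pair was previously labeled.
        if a not in unified_ids or b not in unified_ids:
            continue
        # Labeled pairs stay regardless of the new blocking terms;
        # unlabeled pairs are assumed to need to satisfy them.
        if (a, b) in labeled or satisfies_blocking(a, b):
            visible.add((a, b))
    return visible


# Toy scenario: record 4 was removed from the source data, and the new
# blocking model no longer matches any of the old pairs.
result = pairs_on_page(
    pairs={(1, 2), (2, 3), (3, 4)},
    labeled={(1, 2), (3, 4)},
    unified_ids={1, 2, 3},
    satisfies_blocking=lambda a, b: False,
)
print(result)  # {(1, 2)}
```

Here the labeled pair (1, 2) remains although it fails the new blocking terms, while the labeled pair (3, 4) is dropped because record 4 is no longer in the unified dataset.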

Each time you make changes to clusters, those changes can also result in new and changed pair labels. Tamr Core can learn and apply these labels to the model for you when you enable learned pairs.