Defining the Blocking Model
Create a blocking model so that Tamr Core can generate pairs of possibly matching records and groups.
The purpose of the blocking model is to identify attributes that can help Tamr Core identify duplicate records. Typically, you use record grouping to identify obvious matches (records that you are 100% certain do match) first. You then use the blocking model to define similarity thresholds and comparison methods that find pairs that are likely to be a match, but could possibly be non-matching (less than 100% certainty). The blocking model helps make Tamr Core efficient by including all possible matching pairs of records (maximizing recall), while allowing some non-matching pairs to be included for experts to label.
Building a blocking model is an iterative process that involves:
- Adding one or more blocking clauses and configuring their terms
- Estimating the number of pairs and blocks that result
- (Optional) Excluding one or more datasets from pair generation, clustering, or both
- Generating pairs
Adding a Blocking Clause
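A blocking clause consists of one or more terms that are logically connected by AND: a pair of records satisfies the clause only if it satisfies every term in it. Clauses are connected to each other by OR, so a pair is generated if it satisfies at least one complete clause.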
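The AND/OR structure can be illustrated with a minimal sketch. This is not Tamr Core's implementation: the `Term` type, the `exact_match` helper, and the sample records are hypothetical, and real terms use similarity thresholds rather than exact equality.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# Hypothetical representation: a term is any predicate over a pair of records.
Term = Callable[[Record, Record], bool]


def exact_match(attribute: str) -> Term:
    """Illustrative term: both records carry the same non-empty value."""
    def term(a: Record, b: Record) -> bool:
        return bool(a.get(attribute)) and a.get(attribute) == b.get(attribute)
    return term


def passes_blocking_model(clauses: List[List[Term]], a: Record, b: Record) -> bool:
    """A pair passes if ANY clause holds; a clause holds if ALL of its terms hold."""
    return any(all(term(a, b) for term in clause) for clause in clauses)


model = [
    [exact_match("name"), exact_match("city")],  # clause 1: name AND city
    [exact_match("phone")],                       # OR clause 2: phone
]

r1 = {"name": "Acme", "city": "Boston", "phone": "555-0100"}
r2 = {"name": "Acme", "city": "Austin", "phone": "555-0100"}
r3 = {"name": "Acme", "city": "Austin", "phone": "555-0199"}

print(passes_blocking_model(model, r1, r2))  # True: the phone clause holds
print(passes_blocking_model(model, r1, r3))  # False: no clause holds
print(passes_blocking_model(model, r2, r3))  # True: name AND city both hold
```

Adding a term to a clause can only make that clause stricter, while adding a clause can only admit more pairs.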
To add a blocking clause:
- In a mastering project, select the Pairs page.
- Choose Manage pair generation. A template for the first term appears. See Configuring a Blocking Term.
- Select Add another row to add another term to the clause.
- To create a different clause, move your cursor over the area between two existing terms and select the "OR" separator that appears.
- To move a term within or between clauses, click its handle at the far left, then drag and drop it.
Configuring a Blocking Term
To configure a blocking term, you select an attribute and then define a similarity threshold and a similarity function. For text values, you also select a tokenizer and choose whether to use IDF weighting. While you also make these choices for unified attributes on the Schema Mapping page, the values you set here apply only to the blocking model. This gives you the flexibility to set up two clauses with the same attribute and then set one to use Cosine similarity and the other to use Jaccard, or one to use the Default tokenizer and the other to use Bigram, and so on.
Note: As a best practice, limit the number of terms per clause to two or three. Adding more than four terms results in slower pair generation.
To configure a blocking term:
- In a mastering project, select the Pairs page.
- Select Manage pair generation.
- Select the unified attribute name.
- Select a similarity threshold (%) and similarity function.
Tip: Avoid low similarity thresholds (that is, less than 0.5 for text similarity).
- For text attributes, select a tokenizer and choose a token weighting option.
- Optionally, change the setting for Is Null. Is Null defines whether or not to generate pairs from records that have an empty value for a text attribute.
  - To generate pairs even when one or both of the records have an empty value for the attribute, toggle Is Null on.
  - To generate pairs based on similarity only between non-empty values, toggle Is Null off. The model uses your specified similarity threshold.
See Tokenizers and Similarity Functions.
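To make the interaction between tokenizer, similarity threshold, and Is Null more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not Tamr Core's code: `word_tokens` is a whitespace splitter standing in for a word-level tokenizer, `bigram_tokens` produces character bigrams, `term_matches` is a hypothetical helper, and the Is Null handling reflects one reading of the toggle described above.

```python
from typing import Callable, Optional, Set

Tokenizer = Callable[[str], Set[str]]


def word_tokens(value: str) -> Set[str]:
    """Whitespace tokenizer (a stand-in for a word-level tokenizer)."""
    return set(value.lower().split())


def bigram_tokens(value: str) -> Set[str]:
    """Character bigrams of the lower-cased value."""
    text = value.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: |A & B| / |A | B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def term_matches(left: Optional[str], right: Optional[str],
                 tokenizer: Tokenizer, threshold: float,
                 is_null: bool = False) -> bool:
    """Hypothetical term check: empty values pass only when is_null is on."""
    if not left or not right:
        return is_null
    return jaccard(tokenizer(left), tokenizer(right)) >= threshold


a, b = "Jon Smith", "John Smith"
print(term_matches(a, b, word_tokens, 0.6))                  # False: 1 of 3 word tokens shared
print(term_matches(a, b, bigram_tokens, 0.6))                # True: 7 of 10 bigrams shared
print(term_matches(a, None, bigram_tokens, 0.6))             # False: empty value, Is Null off
print(term_matches(a, None, bigram_tokens, 0.6, is_null=True))  # True: empty value, Is Null on
```

The same pair of values fails a 0.6 threshold with word tokens but passes it with bigrams, which is one reason finer-grained tokenizers tend to produce more candidate pairs.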
Estimating Record Pair Counts and Blocks
Tamr Core uses the blocking model to arrange records into blocks so that potential duplicate records belong to the same block. A record can belong to more than one block. To help determine the efficiency of your model, you can estimate the number of records per block and the number of blocks per record. By default, the estimate uses a sample of 20M pairs or about 6,300 records.
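The relationship between the two sample figures follows from counting unordered pairs: n records yield n(n-1)/2 possible pairs. The sketch below only checks that arithmetic; the function names are illustrative, and how Tamr Core actually draws its sample is not described here.

```python
from math import comb, isqrt


def pairs_among(record_count: int) -> int:
    """Number of unordered pairs that can be formed from record_count records."""
    return comb(record_count, 2)


def records_for_pairs(target_pairs: int) -> int:
    """Smallest record count whose unordered pair count reaches target_pairs."""
    n = (1 + isqrt(1 + 8 * target_pairs)) // 2
    while comb(n, 2) < target_pairs:
        n += 1
    return n


print(pairs_among(6_300))             # 19841850, just under 20M
print(records_for_pairs(20_000_000))  # 6326, that is, about 6,300 records
```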
The efficiency of a model is affected by the following:
- Number of terms per clause in the blocking model: blocks per record increase with the number of terms.
- Tokenizer type: trigram and bigram increase the number of blocks per record.
- Token weighting type: no weighting increases the number of blocks per record compared to IDF weighting.
- Similarity function: for longer text attributes (more than 10 tokens per value), Jaccard can be a better option than Cosine.
- Similarity thresholds (%): lower thresholds increase the number of blocks per record.
To estimate pair counts and blocks:
- In a mastering project, select the Pairs page.
- Select Manage pair generation.
- Configure the blocking terms in one or more clauses.
- Select Estimate Counts. Estimate counts display for each clause. Overall estimates for the blocking model display at the bottom of the screen.
The format of the estimate counts is:
<x> from <y>
<n> blocks per record
where:
- <x> is the estimated number of pairs that are expected to meet the clause or blocking model. These pairs will be written to the Pairs page for expert review and verification.
- <y> is the estimated number of pairs that Tamr Core will evaluate in order to find the <x> pairs that meet the clause or blocking model. In other words, <y> represents the amount of compute work that Tamr Core will perform in order to identify <x> pairs.
- <n> is the average number of blocks per record. Tamr Core uses the blocking model to organize records into “blocks”, so that potential duplicates belong to the same block. When evaluating pairs, Tamr Core compares only records within the same block. A record can belong to more than one block; the amount of computation required increases when records belong to more than one block.
Guidelines for Interpreting Estimates
- An efficient model should result in no more than 100 blocks per record.
- For maximum efficiency, the model should result in no more than 10 blocks per record.
- The number of comparisons should be less than the number of records multiplied by 100.
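These guidelines can be applied mechanically to an estimate. The sketch below is a hypothetical helper, not part of Tamr Core; it assumes that "number of comparisons" corresponds to the evaluated-pairs figure in the estimate, and the input values are made up for illustration.

```python
from typing import List


def check_estimate(evaluated_pairs: int, blocks_per_record: float,
                   record_count: int) -> List[str]:
    """Return warnings for an estimate, using the guideline thresholds above."""
    warnings = []
    if blocks_per_record > 100:
        warnings.append("inefficient: more than 100 blocks per record")
    elif blocks_per_record > 10:
        warnings.append("acceptable, but above the 10 blocks per record target")
    if evaluated_pairs >= record_count * 100:
        warnings.append("too many comparisons: at or above records x 100")
    return warnings


# 1M records, 40M comparisons, 8 blocks per record: within all guidelines.
print(check_estimate(40_000_000, 8, 1_000_000))
# 1M records, 250M comparisons, 35 blocks per record: two warnings.
print(check_estimate(250_000_000, 35, 1_000_000))
```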
Guidelines for Improving Estimates
If an Estimate Counts job takes a long time to run, review your blocking model for these possible causes and solutions. Making any of these changes can reduce the estimated number of pairs or improve system efficiency without an unacceptable decrease in recall.
Data Quality Issues
Cause 1: An attribute that includes frequently-occurring default values or mock data increases estimates.
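For example, suppose 1 million records in your 10 million record dataset have an identical address value of “123 My Address St”. If your blocking model creates a pair whenever two records match on “address”, the system attempts to create roughly 500 billion pairs from that one value alone (1 million × 999,999 / 2), or about 1 trillion if both orderings of each pair are counted.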
Solution 1: Use your data knowledge to review the blocking model and replace any attributes that rely on default filler values.
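One way to spot such values before building the model is to count how many pairs an exact match on an attribute would produce. This is a rough sketch with hypothetical function names and sample data, run outside Tamr Core on an exported column; k records sharing a value contribute k(k-1)/2 unordered pairs.

```python
from collections import Counter
from math import comb
from typing import Iterable, List, Tuple


def pairs_from_exact_match(values: Iterable[str]) -> int:
    """Unordered pairs produced if records pair up whenever values are equal."""
    counts = Counter(v for v in values if v)
    return sum(comb(c, 2) for c in counts.values())


def heaviest_values(values: Iterable[str], top: int = 3) -> List[Tuple[str, int]]:
    """Most frequent non-empty values, which are candidates for filler data."""
    return Counter(v for v in values if v).most_common(top)


addresses = ["123 My Address St"] * 4 + ["1 Main St", "2 Oak Ave"]
print(pairs_from_exact_match(addresses))  # 6: all from the repeated value
print(heaviest_values(addresses, top=1))  # [('123 My Address St', 4)]
print(comb(1_000_000, 2))                 # 499999500000 pairs for 1M identical values
```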
Cause 2: Generating pairs within a source dataset that is already free of duplicates adds comparisons that cannot yield matches, which reduces efficiency.
Solution 2: By default, Tamr Core generates pairs within each source dataset as well as across the different source datasets. Use your data knowledge to decide whether you can exclude one or more source datasets from intra-dataset pair generation. See Excluding Pair Generation within a Dataset.
Similarity Threshold Setting
Cause: Low thresholds for determining similarity increase estimates.
Solution: For your first Estimate Counts job, set a high similarity threshold for each term to get a baseline of how long the job takes to run and the resulting estimate counts. Then, incrementally lower the similarity thresholds in one or more terms to include more pairs.
Tokenizer Selection
Cause 1: Selecting a tokenizer other than Default increases processing time.
Solution 1: Limit use of Bigram or Trigram to one term in a blocking model and to attributes with a limited maximum length. You can use a small data subset to evaluate the results of using Default compared to Bigram or Trigram before proceeding to the full dataset.
Cause 2: Selecting Bigram or Trigram on both the Schema Mapping and Pairs pages reduces efficiency.
Solution 2: The machine learning tokenizer you specify for an attribute on the Schema Mapping page affects system training after pairs are generated (that is, after experts identify matching and non-matching pairs). It does not affect how the system generates initial pairs. Try using Bigram or Trigram as the machine learning setting, and Default on the Pairs page for the same attribute, and then adjust the similarity threshold to maximize for speed and recall. See Creating the Unified Dataset for Mastering.
Excluding Pair Generation within a Dataset
You can exclude certain source datasets from being searched internally for matching and non-matching pairs. When you exclude a dataset, the blocking model does not generate any pairs in which both records come from that source. It still generates pairs between a record from that source and a record from another source. For example, you can use this option to exclude a dataset that is known to be free of duplicate records. You can also exclude a dataset from clustering.
To exclude pair generation within a dataset:
- In a mastering project, select the Pairs page.
- Select Manage pair generation.
- Select Open exclusions.
- In the Exclude Pairs Within These Sources section, select + add source and choose a dataset to exclude.
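The effect of this exclusion on candidate pairs can be sketched as follows. This is an illustration only: the `candidate_pairs` function and the source names are hypothetical, and for simplicity every pair is treated as passing the blocking model.

```python
from itertools import combinations
from typing import Iterator, List, Set, Tuple

# Each record is (record_id, source_dataset); names are made up for illustration.
Record = Tuple[str, str]


def candidate_pairs(records: List[Record],
                    excluded_sources: Set[str]) -> Iterator[Tuple[str, str]]:
    """Yield record-id pairs, skipping pairs that fall within an excluded source."""
    for (id_a, src_a), (id_b, src_b) in combinations(records, 2):
        if src_a == src_b and src_a in excluded_sources:
            continue  # both records come from the same excluded dataset
        yield (id_a, id_b)


records = [("a1", "crm"), ("a2", "crm"), ("b1", "erp"), ("b2", "erp")]

print(list(candidate_pairs(records, excluded_sources=set())))
# all 6 pairs, including ('a1', 'a2')
print(list(candidate_pairs(records, excluded_sources={"crm"})))
# 5 pairs: ('a1', 'a2') is skipped, cross-source pairs and ('b1', 'b2') remain
```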
Limiting Clustering for Records within a Dataset
You can prevent records from the same source dataset from being clustered together. When you exclude a dataset from clustering, each cluster contains at most one record from that dataset. For example, if a dataset is known to be free of duplicate records, this option ensures that records from that dataset are placed in different clusters even when they pass the blocking model. You can also exclude a dataset from pair generation.
Tip: If you exclude an input dataset from clustering, that dataset is automatically excluded from record grouping and vice versa. See Grouping Obvious Duplicates.
To exclude clustering within a dataset:
- In a mastering project, select the Pairs page.
- Select Manage pair generation.
- Select Open exclusions.
- In the Exclude Clustering Within These Sources section, select + add source and choose a dataset to exclude.
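The resulting constraint (at most one record per excluded dataset in any cluster) can be expressed as a simple check. The helper below is hypothetical and is meant for reasoning about exported cluster results, not as a description of how Tamr Core enforces the rule.

```python
from collections import Counter
from typing import Iterable, Set


def violates_clustering_exclusion(cluster_sources: Iterable[str],
                                  excluded_sources: Set[str]) -> bool:
    """True if a cluster holds more than one record from an excluded dataset."""
    counts = Counter(cluster_sources)
    return any(counts[source] > 1 for source in excluded_sources)


# Source dataset of each record in a cluster; dataset names are made up.
print(violates_clustering_exclusion(["crm", "erp", "erp"], {"crm"}))  # False
print(violates_clustering_exclusion(["crm", "crm", "erp"], {"crm"}))  # True
```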
Generating Record Pairs
Note: After making any changes to your blocking model, be sure to re-estimate pair counts before you generate pairs.
To generate pairs:
- In a mastering project, select the Pairs page.
- Select Manage pair generation.
- Select Generate Pairs. See Monitoring Job Status.
Both curators and verifiers can contribute to training initial pairs and then assign and verify Tamr Core's suggested labels for pairs.