Defining the Blocking Model

Create a blocking model so that Tamr Core can generate matching and non-matching record pairs.

The purpose of the blocking model is to select attributes that can help Tamr Core identify duplicate records. The blocking model keeps pair generation efficient by limiting which records are compared: it should include all possible matching pairs of records (maximizing recall), while allowing some non-matching pairs to be included for experts to label.

Building a blocking model is an iterative process that involves:

  1. Adding one or more blocking clauses and configuring their terms
  2. Estimating the number of record pairs and blocks that result
  3. (Optional) Excluding one or more datasets from pair generation, clustering, or both
  4. Generating record pairs

Adding a Blocking Clause

A blocking clause consists of one or more terms that are logically connected by AND. Clauses are connected to each other by OR, so a record pair is generated when it satisfies every term in at least one clause.
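The AND/OR structure can be sketched in a few lines of Python. This is only an illustration of the logic, not how Tamr Core implements blocking: the `pair_passes` and `same` helpers, the attribute names, and the exact-match test (standing in for a real similarity comparison) are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# A term is modeled here as a hypothetical predicate over a pair of records.
Term = Callable[[Record, Record], bool]


def pair_passes(model: List[List[Term]], a: Record, b: Record) -> bool:
    """Return True if at least one clause has all of its terms satisfied."""
    return any(all(term(a, b) for term in clause) for clause in model)


def same(attr: str) -> Term:
    """Hypothetical exact-match term, standing in for a similarity test."""
    return lambda a, b: a.get(attr) is not None and a.get(attr) == b.get(attr)


# Two clauses: (name AND city) OR (phone)
model = [[same("name"), same("city")], [same("phone")]]

r1 = {"name": "Acme", "city": "Boston", "phone": "555-0100"}
r2 = {"name": "Acme", "city": "Austin", "phone": "555-0100"}
r3 = {"name": "Acme", "city": "Austin", "phone": "555-0199"}

print(pair_passes(model, r1, r2))  # True: the phone clause is satisfied
print(pair_passes(model, r1, r3))  # False: no clause is fully satisfied
```

Adding a term to a clause can only make that clause stricter, while adding a clause can only admit more pairs.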

To add a blocking clause:

  1. Navigate to the Pairs page.
  2. Choose Manage pair generation. A template for the first term appears. See Configuring a Blocking Term.
  3. Select Add another row to add another term to the clause.
  4. To create a different clause, move your cursor over the area between two existing terms and select the "OR" separator that appears.
  5. To move a term within or between clauses, click its drag handle at the far left, then drag and drop it.

Separating two terms in a single clause to make two clauses.

Configuring a Blocking Term

To configure a blocking term, you select an attribute and then define a similarity threshold and function. For text values, you also define a tokenizer and whether to use IDF weighting. While you also make these choices for unified attributes on the Schema Mapping page, the values you set here apply only to the blocking model. This gives you the flexibility to set up two clauses with the same attribute, and then set one to use Cosine similarity and the other Jaccard, or one to use the Default tokenizer and the other Bigram, and so on.

Note: As a best practice, limit the number of terms per clause to two or three. Adding more than four terms results in slower pair generation.

To configure a blocking term:

  1. Navigate to the Pairs page.
  2. Select Manage pair generation.
  3. Select the unified attribute name.
  4. Select a similarity threshold (%) and similarity function.
    Tip: Avoid low similarity thresholds (that is, less than 0.5, or 50%, for text similarity).
  5. For text attributes, select a tokenizer and choose a token weighting option.
  6. Optionally, change the setting for Is Null. Is Null defines whether or not to generate pairs from records that have a null or empty value for a text attribute.
    • To generate pairs even when one or both of the records have a null or empty value for the attribute, toggle Is Null on.
    • To generate pairs based on similarity between non-null values, toggle Is Null off. The model uses your specified similarity threshold.

See Tokenizers and Similarity Functions.
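To see how the tokenizer and similarity function interact with a threshold, the following sketch uses textbook definitions of Jaccard and cosine similarity over unweighted tokens. It assumes a word tokenizer that splits on non-word characters and a Bigram tokenizer that emits overlapping 2-character tokens; Tamr Core's own tokenizers and weighting may differ, and all function names here are hypothetical.

```python
import math
import re
from collections import Counter


def word_tokens(text):
    """Assumed word tokenizer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(text, n):
    """Assumed n-gram tokenizer: overlapping n-character tokens."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def jaccard(a, b):
    """Size of the token-set intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine of the angle between raw token-count vectors (no IDF)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


left, right = "Jon Smith", "John Smith"
threshold = 0.5  # corresponds to a 50% similarity threshold

word_score = jaccard(word_tokens(left), word_tokens(right))
bigram_score = jaccard(char_ngrams(left, 2), char_ngrams(right, 2))

print(round(word_score, 3), word_score >= threshold)
print(round(bigram_score, 3), bigram_score >= threshold)
```

With word tokens, the one-letter difference makes "jon" and "john" entirely different tokens, so the pair falls below the threshold. With character bigrams, most tokens still overlap and the pair passes, at the cost of producing more tokens per value.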

Estimating Record Pair Counts and Blocks

Tamr Core uses the blocking model to arrange records into blocks so that duplicate records belong to the same block. A record can belong to more than one block. To help determine the efficiency of your model, you can estimate the number of records per block and the number of blocks per record. By default, the estimate uses a sample of 20 million pairs, which corresponds to about 6,300 records.
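The relationship between the two sample figures follows from counting unordered pairs: n records yield n(n-1)/2 pairs. The sketch below checks that arithmetic, assuming the sample consists of every unordered pair among the sampled records. That assumption matches the stated numbers but is not confirmed here, and the helper names are hypothetical.

```python
import math


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def records_for_pairs(target_pairs: int) -> int:
    """Largest n whose unordered pair count does not exceed target_pairs."""
    # Solve n * (n - 1) / 2 <= target_pairs for n using integer arithmetic.
    return (1 + math.isqrt(1 + 8 * target_pairs)) // 2


print(pair_count(6_300))              # 19841850, just under 20 million
print(records_for_pairs(20_000_000))  # 6325
```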

The efficiency of a model is affected by the following:

  • Number of terms per clause in the blocking model: blocks per record increase with the number of terms.
  • Tokenizer type: trigram and bigram increase the number of blocks per record.
  • Token weighting type: no weighting increases the number of blocks per record compared to IDF weighting.
  • Similarity function: for longer text attributes (more than 10 tokens per value), Jaccard can be a better option than Cosine.
  • Similarity thresholds (%): lower thresholds increase the number of blocks per record.

To estimate record pair counts and blocks:

  1. Navigate to the Pairs page.
  2. Select Manage pair generation.
  3. Configure the blocking terms in one or more clauses.
  4. Select Estimate Counts. Estimate counts display for each clause. Overall estimates for the blocking model display at the bottom of the screen.

The format of the estimate counts is:

<x> from <y>
<n> blocks per record

where:

  • <x> is the estimated number of pairs that are expected to meet the clause or blocking model. These pairs will be written to the Pairs page for subject matter review and verification.
  • <y> is the estimated number of pairs that Tamr Core will evaluate in order to find the <x> pairs that meet the clause or blocking model. In other words, <y> represents the amount of compute work that Tamr Core will perform in order to identify <x> pairs.
  • <n> is the average number of blocks per record. Tamr Core uses the blocking model to group records into “blocks”, so that duplicate records belong to the same block. When evaluating record pairs, Tamr Core compares only records within the same block. A record can belong to more than one block; the amount of computation required increases when records belong to more than one block.

Guidelines for Interpreting Estimates

  • An efficient model should result in no more than 100 blocks per record.
  • For maximum efficiency, the model should result in no more than 10 blocks per record.
  • The number of comparisons should be less than the number of records multiplied by 100.
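As a quick way to apply these guidelines to a set of estimate counts, the following sketch encodes the three thresholds. It assumes that "comparisons" corresponds to the <y> value in the estimate output. The function name, return shape, and rating labels are hypothetical, and the input numbers are made up for illustration.

```python
def assess_estimate(blocks_per_record: float, comparisons: int, record_count: int) -> dict:
    """Compare one set of estimate counts against the guidelines above."""
    if blocks_per_record <= 10:
        blocks_rating = "maximum efficiency"
    elif blocks_per_record <= 100:
        blocks_rating = "efficient"
    else:
        blocks_rating = "too many blocks per record"
    return {
        "blocks_rating": blocks_rating,
        # Assumption: "comparisons" corresponds to <y> in the estimate output.
        "comparisons_ok": comparisons < record_count * 100,
    }


# Illustrative values only, not taken from a real estimate.
result = assess_estimate(blocks_per_record=42, comparisons=30_000_000, record_count=1_000_000)
print(result)
```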

Guidelines for Improving Estimates

If an Estimate Counts job takes a long time to run, review your blocking model for these possible causes and solutions. Making any of these changes can reduce the estimated number of pairs or improve system efficiency without an unacceptable decrease in recall.

Data Quality Issues

Cause 1: An attribute that includes frequently occurring default values or mock data increases estimates.

For example, suppose that 1 million records in your 10 million record dataset share the identical address value “123 My Address St”. If your blocking model creates a pair whenever two records match on “address”, the system attempts to create on the order of half a trillion pairs (1 million × 999,999 / 2) from that one value alone.

Solution 1: Use your data knowledge to review the blocking model and replace any attributes that rely on default filler values.
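One way to spot such filler values before building the model is to profile an attribute outside Tamr Core. The sketch below counts how often each value occurs and how many unordered pairs an exact match on that value alone would produce. The `heavy_values` helper and its `min_share` cutoff are hypothetical, and the sample data is made up.

```python
from collections import Counter


def heavy_values(values, min_share=0.05):
    """List values that account for at least min_share of all records.

    Returns (value, count, pairs) tuples, where pairs is the number of
    unordered record pairs that share that exact value.
    """
    counts = Counter(v for v in values if v)  # skip null or empty values
    total = len(values)
    result = []
    for value, count in counts.most_common():
        if count / total >= min_share:
            result.append((value, count, count * (count - 1) // 2))
    return result


# Made-up sample: one filler address repeated across most records.
addresses = ["123 My Address St"] * 4 + ["1 Elm St", "9 Oak Ave"]
print(heavy_values(addresses, min_share=0.5))
```

Because the pair count grows with the square of the value's frequency, a handful of heavy values can dominate the estimate.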

Cause 2: Generating record pairs within a source dataset that is already free of duplicates affects efficiency.

Solution 2: By default, Tamr Core generates pairs within each source dataset as well as across the different source datasets. Use your data knowledge to decide whether you can exclude one or more source datasets from intra-dataset pair generation. See Excluding Pair Generation within a Dataset.

Similarity Threshold Setting

Cause: Low thresholds for determining similarity increase estimates.

Solution: For your first Estimate Counts job, set a high similarity threshold for each term to get a baseline of how long the job takes to run and the resulting estimate counts. Then, incrementally lower the similarity thresholds in one or more terms to include more pairs.
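The effect of lowering a threshold step by step can be illustrated with a small sketch. It assumes you already have similarity scores for a sample of candidate pairs. The scores and the `pairs_at_thresholds` helper are hypothetical and are not part of the Estimate Counts job itself.

```python
def pairs_at_thresholds(scores, thresholds):
    """Count the pairs whose score meets each threshold, highest first."""
    return {
        t: sum(1 for s in scores if s >= t)
        for t in sorted(thresholds, reverse=True)
    }


# Made-up similarity scores for six candidate pairs.
scores = [0.95, 0.82, 0.8, 0.61, 0.55, 0.3]
print(pairs_at_thresholds(scores, [0.9, 0.8, 0.6]))
```

Each step down can only keep or add pairs, which is why starting high gives a cheap baseline before you widen the model.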

Tokenizer Selection

Cause 1: Selecting a tokenizer other than Default increases processing time.

Solution 1: Limit use of Bigram or Trigram to one term in a blocking model and to attributes with a limited maximum length. You can use a small data subset to evaluate the results of using Default compared to Bigram or Trigram before proceeding to the full dataset.

Cause 2: Selecting Bigram or Trigram on both the Schema Mapping and Pairs pages affects efficiency.

Solution 2: The machine learning tokenizer you specify for an attribute on the Schema Mapping page affects system training after pairs are generated (that is, after experts identify matching and non-matching pairs). It does not affect how the system generates pairs. Try using Bigram or Trigram for machine learning and Default on the Pairs page for the same attribute, and then adjust the similarity threshold to balance speed and recall.

Excluding Pair Generation within a Dataset

You can choose to exclude certain source datasets from being searched for matching and non-matching pairs. When you exclude a dataset, the blocking model does not generate any record pairs from within that source. It still generates pairs that combine a record from that source with a record from another source. For example, you can use this option to exclude a dataset that is known to be free of duplicate records. You can also exclude a dataset from clustering.

To exclude pair generation within a dataset:

  1. Navigate to the Pairs page.
  2. Select Manage pair generation.
  3. Select Open exclusions.
  4. In the Exclude Pairs Within These Sources section, select + add source and choose a dataset to exclude.
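The effect of this exclusion on candidate pairs can be sketched as a simple filter. For brevity the example enumerates every pair and ignores the blocking terms; the record layout (`id`, `source`) and the `candidate_pairs` helper are hypothetical.

```python
from itertools import combinations


def candidate_pairs(records, excluded_sources):
    """Yield ID pairs, skipping pairs that fall within an excluded source."""
    for a, b in combinations(records, 2):
        if a["source"] == b["source"] and a["source"] in excluded_sources:
            continue  # both records come from the same excluded dataset
        yield a["id"], b["id"]


records = [
    {"id": "a1", "source": "A"},
    {"id": "a2", "source": "A"},
    {"id": "b1", "source": "B"},
    {"id": "b2", "source": "B"},
]

# Source A is excluded: (a1, a2) is dropped, cross-source pairs remain.
pairs = list(candidate_pairs(records, excluded_sources={"A"}))
print(pairs)
```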

Limiting Clustering for Records within a Dataset

You can choose to exclude records from certain source datasets from being clustered together. When you exclude a dataset from clustering, each cluster will have a maximum of one record from that dataset. For example, if a dataset is known to be free of duplicate records, this option ensures that records from that dataset are placed in different clusters, even when they pass the blocking model. You can also exclude a dataset from pair generation.

To exclude clustering within a dataset:

  1. Navigate to the Pairs page.
  2. Select Manage pair generation.
  3. Select Open exclusions.
  4. In the Exclude Clustering Within These Sources section, select + add source and choose a dataset to exclude.
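The resulting constraint, at most one record per excluded dataset in each cluster, can be expressed as a small check. This sketch only validates a given cluster and does not show how Tamr Core enforces the rule during clustering; the record layout and the `violations` helper are hypothetical.

```python
from collections import Counter


def violations(cluster, excluded_sources):
    """Return excluded sources that contribute more than one record."""
    counts = Counter(
        r["source"] for r in cluster if r["source"] in excluded_sources
    )
    return {source: n for source, n in counts.items() if n > 1}


cluster = [
    {"id": "a1", "source": "A"},
    {"id": "a2", "source": "A"},
    {"id": "b1", "source": "B"},
]

print(violations(cluster, excluded_sources={"A"}))  # {'A': 2}
print(violations(cluster, excluded_sources={"B"}))  # {}
```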

Generating Record Pairs

Note: After making any changes to your blocking model, be sure to re-estimate record pair counts before you generate record pairs.

To generate record pairs:

  1. Navigate to the Pairs page.
  2. Select Manage pair generation.
  3. Select Generate Pairs. See Monitoring Job Status.

Both curators and verifiers can contribute to training initial pairs and then assign and verify record pairs.