Tamr Documentation

Defining the Blocking Model

Create a blocking model so that Tamr can generate pairs of records that match and pairs of records that don't match.

Adding a Blocking Clause

A blocking clause consists of one or more terms joined by a logical AND. Clauses are joined to one another by a logical OR, so a pair of records is generated when it satisfies every term in at least one clause.
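The AND/OR structure described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Tamr's implementation: the `Term` type, the `same_value` helper, and the sample records are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# Hypothetical representation: a term is a yes/no test on a pair of records.
Term = Callable[[Record, Record], bool]
Clause = List[Term]  # terms within one clause are joined by AND


def pair_passes(model: List[Clause], a: Record, b: Record) -> bool:
    """Return True if at least one clause has all of its terms satisfied."""
    return any(all(term(a, b) for term in clause) for clause in model)


def same_value(attribute: str) -> Term:
    """Hypothetical term: both records hold the same non-empty value."""
    def term(a: Record, b: Record) -> bool:
        return bool(a.get(attribute)) and a.get(attribute) == b.get(attribute)
    return term


# Two clauses joined by OR: (name AND city) OR (phone)
model = [
    [same_value("name"), same_value("city")],
    [same_value("phone")],
]

r1 = {"name": "Acme", "city": "Boston", "phone": "555-0100"}
r2 = {"name": "Acme", "city": "Boston", "phone": "555-0199"}
r3 = {"name": "Acme Corp", "city": "Denver", "phone": "555-0100"}

first_clause_hit = pair_passes(model, r1, r2)   # name and city both agree
second_clause_hit = pair_passes(model, r1, r3)  # only phone agrees
no_clause_hit = pair_passes(model, r2, r3)      # no clause is fully satisfied
```

In this sketch, adding a term to a clause makes that clause stricter, while adding a clause gives a pair one more way to qualify.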

To add a blocking clause:

  1. Navigate to the Pairs page of a mastering project.
  2. Select Manage pair generation. A template for the first term appears. To fill it in, see Configuring a Blocking Term.
  3. Select Add another row to add another term to the clause.
  4. To start a new clause, hover over the area between two existing terms and click the "OR" separator that appears. The terms on either side of the separator then belong to separate clauses.
Separate two terms in a single clause to make two clauses

Configuring a Blocking Term

Note: As a best practice, limit the number of terms per clause to two or three. Adding more terms can lead to slower pair generation.

To configure a blocking term:

  1. Navigate to the Pairs page of a mastering project.
  2. Select Manage pair generation.
  3. Select the unified attribute name.
  4. Select a similarity threshold (%) and similarity function.
  5. For text attributes, select a tokenizer.

See Tokenizers and Similarity Functions.
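To show how a tokenizer, a similarity function, and a threshold interact in a single term, here is a minimal Python sketch. It assumes Jaccard similarity and simple word and trigram tokenizers purely for illustration; it makes no claim about which functions Tamr offers or how Tamr computes them, and all function names are hypothetical.

```python
from typing import Callable, Set


def word_tokens(text: str) -> Set[str]:
    """Split on whitespace after lowercasing."""
    return set(text.lower().split())


def trigram_tokens(text: str) -> Set[str]:
    """Every run of three consecutive characters, after lowercasing."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared tokens divided by all distinct tokens; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def term_matches(x: str, y: str,
                 tokenizer: Callable[[str], Set[str]],
                 threshold_pct: float) -> bool:
    """Hypothetical term: similarity (as a percentage) meets the threshold."""
    return jaccard(tokenizer(x), tokenizer(y)) * 100 >= threshold_pct


# Same values and threshold, different tokenizers.
with_words = term_matches("Jon Smith", "John Smith", word_tokens, 40)
with_trigrams = term_matches("Jon Smith", "John Smith", trigram_tokens, 40)
```

With these inputs the word tokenizer shares one of three tokens (about 33%), so the term fails at a 40% threshold, while the trigram tokenizer shares five of ten tokens (50%), so the term passes. A character-level tokenizer is more forgiving of small spelling differences.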

Estimating Record Pair Counts and Blocks

You can view the estimated number of record pairs and blocks to help you fine-tune the performance of your blocking model.

For optimal performance, we recommend no more than 100 blocks per record. The number of blocks per record is affected by the following:

  • Number of entries per clause in the blocking model; the number of blocks per record increases with the number of clauses.
  • Tokenizer type; the trigram and bigram tokenizers increase the number of blocks per record.
  • Token weighting type; using no weighting increases the number of blocks per record compared to IDF weighting.
  • Similarity thresholds (%); lower thresholds increase the number of blocks per record.
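The tokenizer effect in the list above can be made concrete with a small Python sketch. It assumes, for illustration only, that each distinct token of a value contributes one candidate block; this is a simplification and not a description of how Tamr builds blocks. The helper names and the sample value are hypothetical.

```python
from typing import Set


def word_tokens(text: str) -> Set[str]:
    """Split on whitespace after lowercasing."""
    return set(text.lower().split())


def ngram_tokens(text: str, n: int) -> Set[str]:
    """Every run of n consecutive characters, after lowercasing."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def within_recommendation(blocks_per_record: int, limit: int = 100) -> bool:
    """Check a count against the recommended maximum of 100 blocks per record."""
    return blocks_per_record <= limit


value = "Acme Widget Company"

# Assumption: one candidate block per distinct token.
word_count = len(word_tokens(value))
bigram_count = len(ngram_tokens(value, 2))
trigram_count = len(ngram_tokens(value, 3))
```

For this one short value, the character-level tokenizers yield several times as many tokens as the word tokenizer, which is why they tend to push the number of blocks per record upward.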

To estimate record pair counts and blocks:

  1. Navigate to the Pairs page of a mastering project.
  2. Select Manage pair generation.
  3. Configure the blocking terms in one or more clauses.
  4. Select Estimate Counts. Estimated counts are displayed for each clause, and overall estimates for the blocking model are displayed at the bottom of the screen.

Excluding Pair Generation Within a Dataset

You can exclude specific source datasets from being searched internally for match/no match pairs. When you exclude a dataset, record pairs are not generated between two records from that source; pairs are still generated between that source and other sources. For example, if a dataset is known to be free of duplicate records, you can indicate that the blocking model should exclude that dataset.
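The effect of an exclusion can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Tamr's pair generation: it enumerates all pairs without any blocking, and the `SourcedRecord` type, the function name, and the dataset names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple

# Hypothetical record shape: (source dataset name, record id)
SourcedRecord = Tuple[str, str]


def candidate_pairs(
    records: Sequence[SourcedRecord],
    excluded_sources: Set[str],
) -> List[Tuple[SourcedRecord, SourcedRecord]]:
    """Pair up records, skipping pairs that sit inside an excluded source."""
    pairs = []
    for a, b in combinations(records, 2):
        same_source = a[0] == b[0]
        if same_source and a[0] in excluded_sources:
            continue  # no pairs from within an excluded dataset
        pairs.append((a, b))
    return pairs


records = [("crm", "c1"), ("crm", "c2"), ("erp", "e1"), ("erp", "e2")]

all_pairs = candidate_pairs(records, excluded_sources=set())
without_erp_internal = candidate_pairs(records, excluded_sources={"erp"})
```

Here the only pair dropped by the exclusion is the one between the two "erp" records. Pairs between "erp" and "crm" records, and the pair within "crm", are still produced.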

To exclude pair generation within a dataset:

  1. Navigate to the Pairs page of a mastering project.
  2. Select Manage pair generation.
  3. Select Open exclusions.
  4. Select + add source and choose a dataset to exclude.

Generating Record Pairs

Note: After making any changes to your blocking model, be sure to re-estimate record pair counts before you generate record pairs.

To generate record pairs:

  1. Navigate to the Pairs page of a mastering project.
  2. Select Manage pair generation.
  3. Select Generate Pairs.

Updated 2 months ago


