Using the Blocking Model Learner

The problem the blocking-model learner solves

It is hard to hand-craft a blocking model that generates a manageable number of pairs yet still lets through every pair that it should let through (i.e. a blocking model with high recall).

What the blocking model learner does

The blocking model learner will find the most efficient blocking model (i.e. the one that generates the fewest pairs) while also maximizing the number of labeled matches that pass through the blocking model.
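
To make the two quantities concrete, the sketch below scores two candidate blocking rules by how many pairs they generate and by their recall on labeled matches. It is a toy illustration, not Tamr's learning algorithm: the records, field names, and the helpers same_city, shared_name_token, and evaluate are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy records; field names are illustrative only.
records = {
    1: {"name": "acme corp", "city": "boston"},
    2: {"name": "acme corporation", "city": "boston"},
    3: {"name": "apex ltd", "city": "austin"},
    4: {"name": "apex limited", "city": "austin"},
    5: {"name": "zenith inc", "city": "boston"},
}

# Labeled match pairs, stored as (smaller id, larger id).
labeled_matches = {(1, 2), (3, 4)}


def same_city(a, b):
    """Candidate rule: two records pass if their city is identical."""
    return a["city"] == b["city"]


def shared_name_token(a, b):
    """Candidate rule: two records pass if their names share a token."""
    return bool(set(a["name"].split()) & set(b["name"].split()))


def evaluate(block, records, labeled_matches):
    """Return (pairs generated, recall on labeled matches) for one rule."""
    generated = {
        (i, j)
        for i, j in combinations(sorted(records), 2)
        if block(records[i], records[j])
    }
    recall = len(generated & labeled_matches) / len(labeled_matches)
    return len(generated), recall


for rule in (same_city, shared_name_token):
    print(rule.__name__, evaluate(rule, records, labeled_matches))
```

On this toy data both rules let through every labeled match, but shared_name_token generates fewer pairs, so it is the more efficient of the two. With only a handful of labeled matches, a rule can look perfect by chance, which is why a large label set matters.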

You will only get a good blocking model from the learner if you have many labeled match pairs: the more match pairs, the better the learned blocking model. Start with at least 100 match pairs; 1,000 to 10,000 match pairs will give a better blocking model. You can upload these pairs via the API, or you can create a very permissive blocking model, label pairs in the UI, and then run the blocking model learner.

Labeling more pairs does not increase the runtime of the blocking model learner; more pairs might even reduce it. A typical runtime is 1 hour.

Note that the blocking model learner only considers the fields that are marked as ML on the schema mapping page, using the specified tokenizer / similarity function. To reduce the runtime of the learning process, turn off this flag for attributes that are not useful for binning.

How to use the blocking model learner (Instructions are for v2019.004-v2020.005)

You do not need an existing blocking model to use the blocking model learner, because you can upload pair labels via the API.
Decide which match pairs to label. These pairs should not be high-impact pairs. Instead, they should be a random sample of all pairs, so that the recall value that is computed and optimized is not biased. The pairs should cover the different types of matches that you expect to see. Label as many match pairs as possible (at least 100; 1,000 to 10,000 is better). The number of non-matches doesn’t matter at this point, since the blocking model learner only uses matches.
Label the pairs. There are two ways to do this:
1. Upload the pairs via API: PUT /api/dedup/pairs/label/{dataset}.
2. Or create a very permissive blocking model and label at least 100 pairs as matches in the UI. The permissive blocking model should ideally let through every pair that is a match. When you run “Estimate counts”, you may find that this permissive blocking model is predicted to generate too many pairs for your machine. In that case, you can choose to generate only a sample of pairs while you refine the blocking model (see below). If you’re having trouble creating a permissive blocking model, you can still label pairs that don’t pass the blocking model by using the endpoint above.
Run the blocking model learner:
1. Call the API: POST /api/recipe/recipes/{recipeId}/run/learnDnf
2. Set recipeId to the ID of the dedup recipe for that project.
The new blocking model will be auto-populated. You can still access the old blocking model in a field called “previousDnf” in the DedupInfo in the dedup recipe metadata.
Once you have an adequate blocking model, you can train the classifier. Note that the pairs you should label to train a good classifier are different from the pairs labeled for the blocking model learner (see note below). It is therefore advisable to create a separate project for training the classifier: use the blocking model that Tamr learned, then start labeling pairs from scratch.
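
The two API calls in the steps above can be sketched as follows. This only builds the HTTP requests and does not send them. The base URL is a placeholder, the helper names are hypothetical, and both the label payload format and the authentication scheme are not covered in this article, so they are left for you to supply.

```python
import urllib.parse
import urllib.request

# Assumption: placeholder address; replace with your own Tamr instance.
BASE_URL = "http://tamr.example.com"


def label_pairs_request(dataset, body, base_url=BASE_URL):
    """Build the PUT request that uploads pair labels for a dataset.

    `body` is the already-encoded payload as bytes. Its format is not
    described in this article, so nothing is assumed about it here.
    """
    name = urllib.parse.quote(str(dataset), safe="")
    url = f"{base_url}/api/dedup/pairs/label/{name}"
    return urllib.request.Request(url, data=body, method="PUT")


def learn_dnf_request(recipe_id, base_url=BASE_URL):
    """Build the POST request that runs the blocking model learner.

    `recipe_id` is the ID of the dedup recipe for the project.
    Assumption: the call is made without a request body.
    """
    rid = urllib.parse.quote(str(recipe_id), safe="")
    url = f"{base_url}/api/recipe/recipes/{rid}/run/learnDnf"
    return urllib.request.Request(url, method="POST")


req = learn_dnf_request(12)
print(req.get_method(), req.full_url)
```

To actually run the learner, add whatever authentication headers your deployment requires with req.add_header(...) and pass the request to urllib.request.urlopen(req).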

Differences in ideal pair populations between the blocking model learner and the pairwise classifier

Ideal pairs for the blocking model learner:
- Lots of match pairs, no non-match pairs required, random sample.
Ideal pairs for the classifier:
- An equal number of match and non-match pairs. Most pairs should be high-impact pairs; fewer than 100 should be non-high-impact pairs.
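
As a quick sanity check before running either step, a label set can be summarized against the guidance above. This is a minimal sketch: the label strings "MATCH" and "NON_MATCH" and the helper summarize_labels are hypothetical, and the threshold of 100 is the minimum suggested in this article.

```python
from collections import Counter

# Minimum number of match pairs suggested above for the learner.
MIN_MATCHES_FOR_LEARNER = 100


def summarize_labels(labels):
    """Summarize an iterable of "MATCH" / "NON_MATCH" label strings."""
    counts = Counter(labels)
    matches = counts["MATCH"]
    non_matches = counts["NON_MATCH"]
    total = matches + non_matches
    return {
        "matches": matches,
        "non_matches": non_matches,
        # The learner only uses matches, so only their count matters here.
        "enough_for_learner": matches >= MIN_MATCHES_FOR_LEARNER,
        # For the classifier, a share near 0.5 means a balanced label set.
        "match_share": matches / total if total else None,
    }


summary = summarize_labels(["MATCH"] * 120 + ["NON_MATCH"] * 40)
print(summary)
```

In this example the label set is large enough for the blocking model learner, but a match share of 0.75 means it is not balanced enough to be ideal for the classifier.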