If an Estimate Pairs job is running extremely slowly, the cause is usually one of two things:
(a) the pair generation rules are not strict enough; or
(b) there are data quality problems that, in combination with the existing pair rules, are “blowing up” the pair space.
This article focuses on (a). (For an example of (b), imagine a rule that creates a pair whenever two records match exactly on "address." If 1M of the 10M records in our dataset share the address value "123 Fake Street," that one value alone generates roughly half a trillion pairs (1M × (1M − 1) / 2), far more than the system can handle. Here the problem is not the rule but the data quality, and this is a common scenario.)
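The blow-up in the example above is just the handshake formula n(n − 1)/2: every record sharing the hot value pairs with every other. A quick sanity check you can run on any suspiciously frequent value (plain Python, not part of the product):

```python
def intra_value_pairs(n: int) -> int:
    """Unordered candidate pairs among n records that share the same value."""
    return n * (n - 1) // 2

# 1M records all holding the address "123 Fake Street":
print(intra_value_pairs(1_000_000))  # 499999500000 — roughly half a trillion pairs
```

Profiling value frequencies like this before running the job is a cheap way to spot data-quality hot spots that will blow up the pair space.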
Best Practice #1. When experimenting with pair generation, start with a stricter set of rules (i.e., a higher similarity threshold) to get a sense of how many pairs are generated and how long the estimate takes. Then gradually loosen those rules to include more pairs.
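Starting strict is cheap because the candidate-pair count can only grow as the threshold drops: every pair that clears a strict threshold also clears a looser one. A toy illustration with hypothetical similarity scores (not the product's estimator):

```python
# Hypothetical similarity scores for candidate record pairs.
scores = [0.95, 0.91, 0.88, 0.84, 0.79, 0.72, 0.65, 0.51]

def pairs_at(threshold: float) -> int:
    """Count candidate pairs at or above a similarity threshold."""
    return sum(score >= threshold for score in scores)

for t in (0.9, 0.8, 0.7):
    print(t, pairs_at(t))  # counts grow monotonically as the threshold loosens
```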
Best Practice #2. If possible, keep the default tokenizer. Running bigrams or trigrams (i.e., splitting words into 2- or 3-character chunks) can take much more time than the default tokenizer, which does not break words into smaller tokens. The length of the values also affects the duration of the estimate: the longer the value, the more bigrams to compare across records. Having multiple clauses that use bigrams increases that duration further, so this is generally not recommended except at small data volumes. The default tokenizer is often sufficient for pair generation.
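The cost difference is easy to see: a whitespace-style tokenizer yields one token per word, while a bigram tokenizer yields roughly one token per character, so longer values mean many more token comparisons per pair. A rough illustration (plain Python, not the product's actual tokenizers):

```python
def whitespace_tokens(value: str) -> list[str]:
    """Default-style tokenizer: split on whitespace, no further splitting."""
    return value.split()

def bigram_tokens(value: str) -> list[str]:
    """Bigram tokenizer: every overlapping 2-character chunk."""
    return [value[i:i + 2] for i in range(len(value) - 1)]

value = "123 Fake Street"
print(len(whitespace_tokens(value)))  # 3 tokens
print(len(bigram_tokens(value)))      # 14 tokens — roughly len(value) - 1
```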
Best Practice #3. The tokenizers used for machine learning (set up on the schema mapping page) are separate from those defined on the pair generator page. So, for example, setting the tokenizer to bigrams on the schema mapping page affects the ML applied to pairs that pass through the pair generator, but does not affect how the pair generator itself runs. If you have already set the tokenizer to bigram on the schema mapping page for an attribute, try using the default tokenizer on the pair generation page with a somewhat looser similarity setting to maximize speed and recall.
Best Practice #4. Consider whether it is necessary or desirable to perform pair generation within every source. When you are confident that a specific data source is already deduplicated/mastered (or you are simply not interested in that analysis), it makes sense to exclude searching for matches within that source. Pairs will still be generated between records of that source and other sources, as well as intra-source pairs for all non-excluded sources. This can dramatically reduce the pairs estimate while staying aligned with the business goals of the project.
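The potential savings are easy to estimate in the worst case: intra-source pairs grow quadratically with source size, so excluding a large, already-mastered source drops its n(n − 1)/2 term while keeping all cross-source pairs. A back-of-the-envelope sketch (hypothetical source names and sizes):

```python
from itertools import combinations

def worst_case_pairs(sizes: dict[str, int], exclude_intra: set[str] = frozenset()) -> int:
    """Upper bound on candidate pairs: all cross-source pairs, plus
    intra-source pairs for every source not excluded."""
    total = sum(a * b for a, b in combinations(sizes.values(), 2))  # cross-source
    total += sum(n * (n - 1) // 2                                   # intra-source
                 for name, n in sizes.items() if name not in exclude_intra)
    return total

sizes = {"crm": 5_000_000, "erp": 1_000_000}  # hypothetical sources
print(worst_case_pairs(sizes))                         # includes crm's ~12.5T intra-source pairs
print(worst_case_pairs(sizes, exclude_intra={"crm"}))  # dramatically smaller upper bound
```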
Following these best practices should help speed up the process of estimating pair counts. If the job still fails to run, there may be another issue, and you should contact [email protected] for assistance.