Tokenizers and Similarity Functions
Tokenizers and similarity functions identify similarities and differences between data values.
The supervised learning models in Tamr projects use tokenizers and similarity functions to evaluate different types of data and to compare values.
- Tokenizers preprocess values with a data type of string to separate text into discrete pieces, or tokens, that are machine readable. Tamr machine learning models compare the tokens, rather than the original strings. Tokenizers can reduce the effect of misspellings, abbreviations, errors, and other irregularities. You define tokenizers in both categorization and mastering projects for unified attributes that are included in machine learning. See Tamr Tokenizers.
- Similarity functions give the Tamr machine learning model a way to score how alike or different two values are so that it can find duplicates. Different similarity functions are available for values with a string, integer, or geospatial data type. You define similarity functions in mastering projects for unified attributes that are included in machine learning. See Tamr Similarity Functions.
In a mastering project, you can specify the tokenizer or similarity function to use for attributes in the unified schema and for attributes in the blocking model. In a categorization project, you can specify the tokenizer to use for attributes in the unified schema.
Tamr Tokenizers
To accommodate different types of string values, Tamr provides a set of different tokenizers.
| Tokenizer | Suited for |
| --- | --- |
| Default | Data values with few or no errors or typos |
| Stemming (English) | Data values in English that represent free text |
| Bigram / Trigram | Data values in free text fields that contain misspellings or international characters |
| Bi-Word | Data values that have a meaningful word order |
| Spaces and special chars | Similar to Default, which is preferred |
Note: Tamr currently supports only English language tokenizers. Tamr corporate services can consult on matching within and across non-English languages, including French, Chinese, Korean, and Japanese.
Default Tokenizer
The Default tokenizer splits blocks of text into tokens on spaces and special characters (with the exception of underscores (_)), and then lowercases the results.
The Default tokenizer recognizes and does not split:
- Numbers that contain decimal points
- URLs such as www.tamr.com
The Default tokenizer is useful for data values with few or no errors or typos in individual words, but that may have additional or missing words, such as company names.
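Tamr does not publish the Default tokenizer's implementation, but a minimal sketch of the behavior described above might look like the following, assuming a regex that captures decimal numbers and URLs whole before splitting everything else on spaces and special characters (keeping underscores):

```python
import re

# Sketch of Default-style tokenization (not Tamr's actual implementation).
# Decimal numbers and URLs are captured whole; everything else splits on
# spaces and special characters, with underscores preserved, then lowercased.
TOKEN_PATTERN = re.compile(
    r"""
    (?:https?://)?(?:\w+\.)+\w+(?:/\S*)?   # URLs such as www.tamr.com
    | \d+\.\d+                             # numbers containing decimal points
    | \w+                                  # words (underscores preserved)
    """,
    re.VERBOSE,
)

def default_tokenize(text: str) -> list[str]:
    return [match.lower() for match in TOKEN_PATTERN.findall(text)]

print(default_tokenize("Visit www.tamr.com for version 2.5 of part_A!"))
# ['visit', 'www.tamr.com', 'for', 'version', '2.5', 'of', 'part_a']
```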
Stemming (English) Tokenizer
The Stemming (English) tokenizer creates tokens in the same way as the Default tokenizer, and then reduces inflected or derived words to their root form. For example, original data values of Fishing, fished, and FISHER are all converted to the same token, fish.
The Stemming (English) tokenizer is useful for values in English that represent free text such as “description” and “review” fields. Fields that contain the names of people or companies are less likely to benefit from stemming.
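Tamr's stemming algorithm is not specified in this documentation. The sketch below uses NLTK's Porter stemmer purely to illustrate the tokenize-then-stem pipeline; the stems it produces may not match Tamr's exactly.

```python
from nltk.stem import PorterStemmer  # assumes: pip install nltk

stemmer = PorterStemmer()

def stemming_tokenize(text: str) -> list[str]:
    # Tokenize (simplified here to a whitespace split), then reduce
    # each token to a root form.
    return [stemmer.stem(token) for token in text.lower().split()]

print(stemming_tokenize("Fishing fished"))  # ['fish', 'fish']
```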
Bigram and Trigram Tokenizers
The Bigram and Trigram tokenizers are n-gram tokenizers: they split blocks of text into tokens on spaces, split those tokens into sets of consecutive characters, and finally lowercase the results.
- The Bigram tokenizer results in sets of one or two consecutive characters.
- The Trigram tokenizer results in sets of one, two, or three consecutive characters.
For example, the Trigram tokens for "Google" are g, go, goo, oog, ogl, gle, le, e. If another record contained the word "goggle", several of its trigram tokens would match the ones for Google. In contrast, if you used the Default tokenizer, the tokens google and goggle would not match. The local order of pairs or triples of letters can help diminish the impact of misspellings or accented vs. unaccented characters.
The Bigram and Trigram tokenizers are useful for free text fields that contain misspellings or international characters such as à, ç, é, ñ, ö, and so on.
Note: These tokenizers produce significantly more tokens than other options. The performance cost of the additional processing can be a consideration.
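The token list for "Google" above is consistent with character n-grams computed over each word after padding both ends with boundary markers, then stripping the markers. The sketch below reproduces that list under this assumption; it is an illustration, not Tamr's implementation.

```python
def ngram_tokenize(text: str, n: int = 3) -> list[str]:
    # Character n-grams with boundary padding (n=2 bigram, n=3 trigram).
    tokens = []
    for word in text.lower().split():                   # split on spaces first
        padded = "#" * (n - 1) + word + "#" * (n - 1)   # boundary padding
        for i in range(len(padded) - n + 1):
            gram = padded[i : i + n].strip("#")         # drop padding chars
            if gram:
                tokens.append(gram)
    return tokens

print(ngram_tokenize("Google"))
# ['g', 'go', 'goo', 'oog', 'ogl', 'gle', 'le', 'e']
```

Under the same scheme, "goggle" yields g, go, gog, ogg, ggl, gle, le, e, sharing five of its eight tokens with "Google".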
Bi-Word Tokenizer
The Bi-Word tokenizer creates tokens in the same way as the Default tokenizer, and then also pairs consecutive words into tokens. Each individual word and each bi-word becomes a token.
For example, suppose three fields contain the strings "Unsalted Almonds", "Almond Butter", and "Unsalted butter". The Bi-Word tokenizer creates the tokens almond, almond butter, almonds, butter, unsalted, unsalted almonds, unsalted butter. The bi-word tokens cannot be mistaken for each other: "unsalted almonds" and "unsalted butter" are not the same. In contrast, if you used the Default tokenizer, only individual words would be emitted, and half of the words in "Unsalted Almonds" would match "Unsalted butter".
The Bi-Word tokenizer is useful when word order is important, such as for descriptions of parts or products.
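A minimal sketch of bi-word token generation, again assuming a simple Default-style word split:

```python
def biword_tokenize(text: str) -> list[str]:
    words = text.lower().split()                        # simplified word split
    biwords = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + biwords                              # words plus bi-words

print(biword_tokenize("Unsalted Almonds"))
# ['unsalted', 'almonds', 'unsalted almonds']
```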
Spaces and special chars Tokenizer
The Spaces and special chars tokenizer splits blocks of text into tokens on spaces and special characters as the Default tokenizer does, but also splits on underscores (_), on periods used as decimal points in numbers, and on the periods that separate the domain, subdomain, and so on in URLs, before lowercasing the results.
This option is available primarily to address a backward compatibility issue. In most cases, the Default tokenizer produces more useful results.
Tamr Similarity Functions
To evaluate how similar or different data values of different types are, Tamr offers the following similarity functions.
| Similarity Function | Suited for |
| --- | --- |
| Cosine | Text values of 100 words or fewer |
| Jaccard | Text values of more than 100 words |
| Absolute Diff | Numeric values |
| Relative Diff | Numeric values |
| Hausdorff | Geospatial values |
Cosine Similarity Function
This function applies to text values and represents the cosine similarity between two "bags of words", with a similarity range of [0, 1]. When used in a blocking model, it generates all pairs with a similarity greater than or equal to the specified threshold with no missing pairs.
Cosine uses TF-IDF for word weighting, with binary TF: the TF component is 1 for tokens that are present and 0 for tokens that are absent. As a result, Tamr ignores the frequency of terms beyond a single mention.
Cosine is the default similarity function for text values.
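Because TF is binary, each value reduces to the set of its tokens, and each token contributes only its IDF weight. A sketch of cosine similarity under that weighting, with illustrative (made-up) IDF values supplied by the caller:

```python
import math

def cosine_similarity(tokens_a: set[str], tokens_b: set[str],
                      idf: dict[str, float]) -> float:
    # Binary TF: each present token contributes its IDF weight exactly once.
    dot = sum(idf[t] ** 2 for t in tokens_a & tokens_b)
    norm_a = math.sqrt(sum(idf[t] ** 2 for t in tokens_a))
    norm_b = math.sqrt(sum(idf[t] ** 2 for t in tokens_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

idf = {"grand": 0.2, "st": 0.1, "street": 0.15}  # hypothetical weights
print(cosine_similarity({"grand", "st"}, {"grand", "street"}, idf))  # ~0.72
```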
Jaccard Similarity Function
This function applies to text values and represents the weighted Jaccard similarity, with a similarity range of [0, 1]. Like Cosine, Jaccard also uses TF-IDF with binary TF for word weighting.
When specified for an attribute in a blocking model, the allowed thresholds are in the range [0.4, 1], with a less than 10% chance of missing a pair.
Jaccard is useful for very long text fields (more than 100 words), as Cosine is typically slower for these values.
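With the same binary-TF weighting, weighted Jaccard reduces to the sum of IDF weights over the shared tokens divided by the sum over all tokens in either value. A sketch, using the same hypothetical weights as above:

```python
def weighted_jaccard(tokens_a: set[str], tokens_b: set[str],
                     idf: dict[str, float]) -> float:
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    shared = sum(idf[t] for t in tokens_a & tokens_b)
    return shared / sum(idf[t] for t in union)

idf = {"grand": 0.2, "st": 0.1, "street": 0.15}  # hypothetical weights
print(weighted_jaccard({"grand", "st"}, {"grand", "street"}, idf))  # ~0.44
```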
Absolute Diff Similarity Function
This function applies to numeric values and represents the absolute difference between two numbers, |X - Y|, with a similarity range of [0, infinity].
When you specify this function for an attribute in the unified dataset, Tamr automatically casts strings to floats or doubles.
Relative Diff Similarity Function
This function applies to numeric values and represents the relative difference between two numbers, 1 - |X - Y| / max(|X|, |Y|), with a similarity range of [0, 1]. When X and Y are both zero, the similarity is 1.
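Both numeric functions follow directly from the formulas above; this sketch restates them in code, including the both-zero rule:

```python
def absolute_diff(x: float, y: float) -> float:
    # Range [0, infinity]; smaller values mean the numbers are more alike.
    return abs(x - y)

def relative_diff_similarity(x: float, y: float) -> float:
    # Range [0, 1]; 1 means identical. Both zero is defined as similarity 1.
    if x == 0 and y == 0:
        return 1.0
    return 1 - abs(x - y) / max(abs(x), abs(y))

print(absolute_diff(10.0, 7.5))             # 2.5
print(relative_diff_similarity(10.0, 7.5))  # 0.75
```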
Hausdorff Similarity Functions
These functions apply to geospatial values and represent the distance between two objects. See Similarity Metrics for Geospatial Data.
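As a rough illustration only (Tamr's geospatial implementation is described in Similarity Metrics for Geospatial Data and may differ), the classic Hausdorff distance between two point sets is the largest of the nearest-neighbor distances from each set to the other:

```python
import math

def hausdorff_distance(a: list[tuple[float, float]],
                       b: list[tuple[float, float]]) -> float:
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def directed(src, dst):
        # Farthest that any point in src sits from its nearest point in dst.
        return max(min(dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
shifted = [(2, 0), (2, 1), (3, 0), (3, 1)]
print(hausdorff_distance(square, shifted))  # 2.0
```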