Tokenizers and Similarity Functions
Tamr Core uses tokenizers and similarity functions to identify similarities and differences between data values. The supervised learning models in its projects rely on tokenizers and similarity functions to evaluate different types of data and compare values.
- Tokenizers preprocess values with a data type of string to separate text into discrete pieces, or tokens, that are machine readable. Machine learning models compare the tokens, rather than the original strings. Tokenizers can reduce the effect of misspellings, abbreviations, errors, and other irregularities. You define tokenizers in both categorization and mastering projects for unified attributes that are included in machine learning. See Tamr Core Tokenizers.
- Similarity functions give the machine learning model a way to score how different or alike two values are to each other so that it can find duplicates. Different similarity functions are available for values with a string, integer, or geospatial data type. You define similarity functions in mastering projects for unified attributes that are included in machine learning. See Tamr Core Similarity Functions.
In a mastering project, you can specify the tokenizer or similarity function to use for each attribute in the unified schema and, separately, for attributes in the blocking model. In a categorization project, you can specify the tokenizer to use for each attribute in the unified schema.
Tip: Schema mapping projects use Regex tokenization only, which removes all special characters including hyphens.
Tamr Core Tokenizers
To accommodate different types of string values, Tamr Core provides a set of tokenizers.
Tokenizer | Suited for |
---|---|
Default | Data values with few or no errors or typos |
Stemming (English) | Data values in English that represent free text |
Bigram / Trigram | Data values in free text fields that contain misspellings or international characters |
Bi-Word | Data values that have a meaningful word order |
Spaces and special chars | Similar to Default, which is preferred |
Note: Tamr Core currently supports only English language tokenizers. Tamr corporate services can consult on matching within and across non-English languages, including French, Chinese, Korean, and Japanese.
Default Tokenizer
The Default tokenizer splits blocks of text into tokens on spaces and special characters (with the exception of underscore (_) characters), removes the special characters, and then lowercases the results.
Special characters are: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | } ~
Whitespace characters are spaces and: \t \n \x0B \f \r
The Default tokenizer recognizes and does not split:
- Numbers that contain decimal points
- URLs such as www.tamr.com
The Default tokenizer is useful for data values with few or no errors or typos in individual words, but that may have additional or missing words, such as company names.
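The following Python sketch approximates this behavior. Tamr Core's implementation is internal, so the regular expressions here are assumptions chosen to match the description above.

```python
import re

def default_tokenize(text: str) -> list[str]:
    """Approximate the Default tokenizer: split on whitespace and special
    characters (but not underscores), drop the separators, and lowercase."""
    tokens = []
    for chunk in text.split():
        # Keep decimal numbers (3.14) and URLs (www.tamr.com) whole, as
        # described above. These patterns are illustrative assumptions.
        if re.fullmatch(r"\d+\.\d+", chunk) or re.fullmatch(r"[\w-]+(\.[\w-]+)+", chunk):
            tokens.append(chunk.lower())
        else:
            # \W matches any non-word character; \w includes the underscore,
            # which the Default tokenizer preserves.
            tokens.extend(t.lower() for t in re.split(r"\W+", chunk) if t)
    return tokens

print(default_tokenize("Tamr_Core revenue: 3.14 www.tamr.com"))
# ['tamr_core', 'revenue', '3.14', 'www.tamr.com']
```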
Stemming (English) Tokenizer
The Stemming (English) tokenizer creates tokens in the same way as the Default tokenizer, and then reduces inflected or derived words to their root form. For example, original data values of Fishing, fished, and FISHER are all converted to the same token, fish.
The Stemming (English) tokenizer is useful for values in English that represent free text such as “description” and “review” fields. Fields that contain the names of people or companies are less likely to benefit from stemming.
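As an illustration, the widely used Porter stemmer (here via NLTK, not Tamr Core's internal stemmer) produces similar reductions, though individual stemmers differ on words like "fisher":

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# stem() lowercases its input before reducing it to a root form.
for word in ["Fishing", "fished", "FISHER"]:
    print(word, "->", stemmer.stem(word))
# Fishing -> fish
# fished -> fish
# FISHER -> fisher  (Porter keeps "fisher"; the example above assumes a
#                    stemmer that also reduces it to "fish")
```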
Bigram and Trigram Tokenizers
The Bigram and Trigram tokenizers are character n-gram tokenizers: they split blocks of text into tokens on spaces, then split each token into sets of consecutive characters, and finally lowercase the results.
- The Bigram tokenizer results in sets of one or two consecutive characters.
- The Trigram tokenizer results in sets of one, two, or three consecutive characters.
For example, the Trigram tokens for "Google" are g, go, goo, oog, ogl, gle, le, e. If another record contained the word "goggle", several of its trigram tokens would match the ones for Google. In contrast, if you used the Default tokenizer, the tokens google and goggle would not match. The local order of pairs or triples of letters can help diminish the impact of misspellings or accented vs. unaccented characters.
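A short sketch that reproduces the "Google" example above follows; the handling of shorter tokens at word boundaries is inferred from that example rather than taken from Tamr Core's code.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Emit the n-character windows of a word plus the shorter
    prefixes and suffixes at its boundaries."""
    word = word.lower()
    prefixes = [word[:k] for k in range(1, n)]                   # g, go
    windows = [word[i:i + n] for i in range(len(word) - n + 1)]  # goo, oog, ogl, gle
    suffixes = [word[-k:] for k in range(n - 1, 0, -1)]          # le, e
    return prefixes + windows + suffixes

print(char_ngrams("Google"))  # ['g', 'go', 'goo', 'oog', 'ogl', 'gle', 'le', 'e']
print(char_ngrams("goggle"))  # shares 'g', 'go', 'gle', 'le', 'e' with "Google"
```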
The Bigram and Trigram tokenizers are useful for free text fields that contain misspellings or international characters such as à, ç, é, ñ, ö, and so on.
Important: These tokenizers produce significantly more tokens than other options. The performance cost of the additional processing can be a consideration. When configuring the blocking model, estimate the number of pairs before regenerating pairs to understand the performance and processing impact.
Bi-Word Tokenizer
The Bi-Word tokenizer creates tokens in the same way as the Default tokenizer, and then also pairs consecutive words into tokens. Each individual word and each bi-word becomes a token.
For example, three fields contain the strings "Unsalted Almonds", "Almond Butter", and "Unsalted Butter". The Bi-Word tokenizer creates the tokens almond, almond butter, almonds, butter, unsalted, unsalted almonds, and unsalted butter. The bi-word tokens cannot be mistaken for each other: "unsalted almonds" and "unsalted butter" are not the same. In contrast, the Default tokenizer emits only individual words, so half of the tokens for "Unsalted Almonds" match those for "Unsalted Butter".
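A minimal sketch of bi-word tokenization (using plain whitespace splitting rather than the full Default tokenizer) follows:

```python
def biword_tokenize(text: str) -> list[str]:
    """Emit each word plus each pair of consecutive words."""
    words = [w.lower() for w in text.split()]
    biwords = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + biwords

print(biword_tokenize("Unsalted Almonds"))  # ['unsalted', 'almonds', 'unsalted almonds']
print(biword_tokenize("Unsalted Butter"))   # ['unsalted', 'butter', 'unsalted butter']
# The bi-word tokens differ even though the word "unsalted" matches.
```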
The Bi-Word tokenizer is useful when word order is important, such as for descriptions of parts or products.
Spaces and special chars Tokenizer
The Spaces and special chars tokenizer splits blocks of text into tokens on spaces and special characters, as the Default tokenizer does. Unlike the Default tokenizer, it also splits on underscore (_) characters, on periods used as decimal points in numbers, and on periods that separate the domain, subdomain, and so on in URLs. It then lowercases the results.
Special characters are: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | } ~ _
Whitespace characters are spaces and: \t \n \x0B \f \r
This option is available primarily to address a backward compatibility issue. In most cases, the Default tokenizer produces more useful results.
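For comparison with the Default sketch above, a rough approximation that also splits on underscores and all periods might look like this:

```python
import re

def spaces_special_tokenize(text: str) -> list[str]:
    """Approximate the Spaces and special chars tokenizer: split on every
    non-alphanumeric character, including underscores and periods."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

print(spaces_special_tokenize("Tamr_Core 3.14 www.tamr.com"))
# ['tamr', 'core', '3', '14', 'www', 'tamr', 'com']
# The Default sketch instead yields ['tamr_core', '3.14', 'www.tamr.com'].
```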
Tamr Core Similarity Functions
To evaluate how similar or different data values of different types are, Tamr Core offers the following similarity functions.
Similarity Function | Suited for |
---|---|
Cosine | Text values of 100 words or fewer |
Absolute Cosine | Text values of 100 words or fewer |
Jaccard | Text values of more than 100 words |
Absolute Diff | Numeric values |
Relative Diff | Numeric values |
Hausdorff | Geospatial values |
Cosine Similarity Function
This function applies to text values and represents the cosine similarity between two "bags of words", with a similarity range of [0, 1]. When used in a blocking model, it generates all pairs with a similarity greater than or equal to the specified threshold with no missing pairs.
You can select whether to use IDF (inverse document frequency) for word weighting, or to leave the tokens equally weighted.
- IDF is the default, with binary TF. That is, TF is 1 for tokens that are present and 0 for tokens that are absent, so the machine learning model ignores how often a term occurs beyond a single mention.
- Equal weighting is most effective with tokenizers that produce tokens that are less specific than full words, such as Bigram and Trigram.
The next section includes examples that compare the results of cosine similarity to absolute cosine similarity.
Cosine is the default similarity function for text values.
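A minimal sketch of cosine similarity over binary-TF "bags of words" follows; the optional IDF weights are hypothetical values, and Tamr Core's exact weighting formula is not shown here.

```python
import math

def cosine(tokens_a: set[str], tokens_b: set[str],
           idf: dict[str, float] | None = None) -> float:
    """Cosine similarity with binary TF: each token counts at most once,
    weighted by IDF when provided, or equally weighted otherwise."""
    w = (lambda t: idf.get(t, 1.0)) if idf else (lambda t: 1.0)
    dot = sum(w(t) ** 2 for t in tokens_a & tokens_b)
    norm_a = math.sqrt(sum(w(t) ** 2 for t in tokens_a))
    norm_b = math.sqrt(sum(w(t) ** 2 for t in tokens_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine({"corp"}, {"globalmegacorp", "corp"}))                  # ≈ 0.707
print(cosine({"globalmegacorp", "corp"}, {"globalmegacorp", "corp"}))  # 1.0
```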
Absolute Cosine Similarity Function
Like cosine similarity, this function applies to text values and represents the similarity between two "bags of words". However, this function does not normalize the resulting feature vectors, so the similarity range is [0, infinity).
As with cosine similarity, you can select whether to leave the tokens equally weighted or use IDF for word weighting.
Examples of how cosine and absolute cosine assign different scores to matching tokens follow.
Example: Equally Weighted Tokens
An example of how cosine and absolute cosine score the same tokens differently when you weight tokens equally follows.
When you use cosine, a match on a common and uninformative token like “Corp” receives a score of 1 (100% similar) when it is the only token in the string. Other exact matches, such as “GlobalMegaCorp Inc”, receive the same score.
Matching Tokens | Function | Similarity |
---|---|---|
“Corp” and “Corp” | cosine | 1 |
“Corp” and “Corp” | absolute cosine | 1 |
“GlobalMegaCorp Inc” and “GlobalMegaCorp Inc” | cosine | 1 |
“GlobalMegaCorp Inc” and “GlobalMegaCorp Inc” | absolute cosine | 2 |
Because absolute cosine does not normalize the calculation, the computed similarity is equal to the number of common tokens. As a result, an exact match to “GlobalMegaCorp Inc” receives a score that reflects that it is more similar than an exact match to “Corp”.
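A minimal sketch of this unnormalized variant (reusing the binary-TF conventions from the cosine sketch above) makes the contrast concrete:

```python
def absolute_cosine(tokens_a: set[str], tokens_b: set[str],
                    idf: dict[str, float] | None = None) -> float:
    """Unnormalized dot product of the weighted binary-TF vectors.
    With equal weights this is just the count of common tokens."""
    w = (lambda t: idf.get(t, 1.0)) if idf else (lambda t: 1.0)
    return sum(w(t) ** 2 for t in tokens_a & tokens_b)

print(absolute_cosine({"corp"}, {"corp"}))                                # 1.0
print(absolute_cosine({"globalmegacorp", "inc"},
                      {"globalmegacorp", "inc"}))                         # 2.0
```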
Example: IDF Word Weighting
An example of how cosine and absolute cosine score the same tokens differently when you apply IDF weighting follows.
For a city attribute, the California city of “Los Angeles” is a string that is likely to occur frequently in your data, while the Massachusetts populated place “Myricks” might occur rarely.
Matching Tokens | Weight | Function | Similarity |
---|---|---|---|
“Los Angeles” and “Los Angeles” | 0.5 and 1.2 | cosine | (0.5^2 + 1.2^2) / (0.5^2 + 1.2^2) = 1 |
“Los Angeles” and “Los Angeles” | 0.5 and 1.2 | absolute cosine | 0.5^2 + 1.2^2 = 1.69 |
“Myricks” and “Myricks” | 2.3 | cosine | (2.3^2) / (2.3^2) = 1 |
“Myricks” and “Myricks” | 2.3 | absolute cosine | 2.3^2 = 5.29 |
For these values, the result of using absolute cosine with IDF weighting is to mitigate the similarity of common tokens and amplify the similarity of rare tokens.
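Reusing the sketches above with these hypothetical IDF weights reproduces the scores in the table:

```python
idf = {"los": 0.5, "angeles": 1.2, "myricks": 2.3}

print(round(absolute_cosine({"los", "angeles"}, {"los", "angeles"}, idf), 2))  # 1.69
print(round(absolute_cosine({"myricks"}, {"myricks"}, idf), 2))                # 5.29
print(round(cosine({"los", "angeles"}, {"los", "angeles"}, idf), 2))           # 1.0
```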
Jaccard Similarity Function
This function applies to text values and represents the weighted Jaccard similarity, with a similarity range of [0, 1]. As with Cosine, you can specify whether to use TF-IDF with binary TF for word weighting (the default) or leave words equally weighted.
When specified for an attribute in a blocking model, the allowed thresholds are in the range [0.4, 1], with less than a 10% chance of missing a pair.
Jaccard is useful for very long text fields (more than 100 words), because Cosine is typically slower for these values.
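A sketch of weighted Jaccard over binary token sets follows; as above, the weighting details are an illustration rather than Tamr Core's exact formula.

```python
def weighted_jaccard(tokens_a: set[str], tokens_b: set[str],
                     idf: dict[str, float] | None = None) -> float:
    """Weighted Jaccard: shared token weight divided by total token weight.
    With equal weights this reduces to |A ∩ B| / |A ∪ B|."""
    w = (lambda t: idf.get(t, 1.0)) if idf else (lambda t: 1.0)
    union = sum(w(t) for t in tokens_a | tokens_b)
    shared = sum(w(t) for t in tokens_a & tokens_b)
    return shared / union if union else 0.0

print(weighted_jaccard({"unsalted", "almonds"}, {"unsalted", "butter"}))  # ≈ 0.33
```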
Absolute Diff Similarity Function
This function applies to numeric values and represents the absolute difference between two numbers, |X - Y|, with a range of [0, infinity).
When you specify this function for an attribute in the unified dataset, Tamr Core automatically casts strings to floats or doubles.
Relative Diff Similarity Function
This function applies to numeric values and represents the relative difference between two numbers, 1 - |X - Y| / max(|X|, |Y|), with a similarity range of [0, 1]. When X and Y are both zero, the similarity is 1.
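Both numeric functions are direct computations of the formulas above; a minimal sketch follows.

```python
def absolute_diff(x: float, y: float) -> float:
    """Absolute difference |X - Y|: 0 for identical values, growing
    without bound as the values diverge."""
    return abs(x - y)

def relative_diff_similarity(x: float, y: float) -> float:
    """Relative difference similarity: 1 - |X - Y| / max(|X|, |Y|),
    defined as 1 when both values are zero."""
    denom = max(abs(x), abs(y))
    return 1.0 if denom == 0 else 1.0 - abs(x - y) / denom

print(absolute_diff(102.0, 100.0))             # 2.0
print(relative_diff_similarity(102.0, 100.0))  # ≈ 0.98
print(relative_diff_similarity(0.0, 0.0))      # 1.0
```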
Hausdorff Similarity Functions
These functions apply to geospatial values and represent the distance between two objects. See Similarity Metrics for Geospatial Data.
Similarity and Null Values
Tamr Core handles null values, such as pairs in a mastering project where the values being compared for an attribute are either (null, null) or (value, null), separately from cases in which the values being compared are not null.
The training that verifiers provide to supervised learning models determines how these cases are handled. For example:
- If you label a pair that has (null, null) or (value, null) for an attribute as a matching pair, the model predicts similar cases as matches.
- If you label a pair that has (null, null) or (value, null) for an attribute as a non-matching pair, the model predicts similar cases as non-matches.
Note: The model does not distinguish between (null, null) and (value, null).
See Working with Pairs, Training Initial Pairs, Viewing and Verifying Pairs, and Curating Pairs.
Similarity and Arrays
Tamr Core computes the similarity between each element in the first array and every element in the second array, then takes the maximum of those similarities.
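A sketch of this "maximum over all cross-pairs" rule, using the relative-difference similarity from the earlier sketch as the element-level function:

```python
def array_similarity(values_a: list, values_b: list, sim) -> float:
    """Compare every element of the first array with every element of the
    second, then keep the best score."""
    return max((sim(a, b) for a in values_a for b in values_b), default=0.0)

# relative_diff_similarity is defined in the numeric sketch above.
print(array_similarity([10.0, 250.0], [251.0, 999.0],
                       relative_diff_similarity))  # ≈ 0.996, from 250 vs 251
```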