The supervised learning models in projects use tokenizers and similarity functions to evaluate different types of data, make comparisons, and identify similarities and differences between data values.
- Tokenizers preprocess values with a data type of string to separate text into discrete pieces, or tokens, that are machine readable. Machine learning models compare the tokens, rather than the original strings. Tokenizers can reduce the effect of misspellings, abbreviations, errors, and other irregularities. You define tokenizers in both categorization and mastering projects for unified attributes that are included in machine learning. See Tamr Core Tokenizers.
- Similarity functions give the machine learning model a way to score how different or alike two values are to each other so that it can find duplicates. Different similarity functions are available for values with a string, integer, or geospatial data type. You define similarity functions in mastering projects for unified attributes that are included in machine learning. See Tamr Core Similarity Functions.
In a mastering project, you can specify the tokenizer or similarity function to use for attributes in the unified schema and for attributes in the blocking model. In a categorization project, you can specify the tokenizer to use for attributes in the unified schema.
To accommodate different types of string values, Tamr Core provides a set of tokenizers.
Data values with few or no errors or typos
Data values in English that represent free text
Data values in free text fields that contain misspellings or international characters
Data values that have a meaningful word order
Similar to Default, which is preferred
Note: Tamr Core currently supports only English language tokenizers. Tamr corporate services can consult on matching within and across non-English languages, including French, Chinese, Korean, and Japanese.
The Default tokenizer splits blocks of text into tokens on spaces and special characters (with the exception of underscores (_), and then lowercases the results.
The Default tokenizer recognizes and does not split:
- Numbers that contain decimal points
- URLs such as www.tamr.com
The Default tokenizer is useful for data values with few or no errors or typos in individual words, but that may have additional or missing words, such as company names.
The Stemming (English) tokenizer creates tokens in the same way as the Default tokenizer, and then reduces inflected or derived words to their root form. For example, original data values of Fishing, fished, and FISHER are all converted to the same token, fish.
The Stemming (English) tokenizer is useful for values in English that represent free text such as “description” and “review” fields. Fields that contain the names of people or companies are less likely to benefit from stemming.
The Bigram and Trigram tokenizers are n-grams that split blocks of text into tokens on spaces and then into sets of consecutive characters, and finally lowercases the results.
- The Bigram tokenizer results in sets of one or two consecutive characters.
- The Trigram tokenizer results in sets of one, two, or three consecutive characters.
For example, the Trigram tokens for "Google" are g, go, goo, oog, ogl, gle, le, e. If another record contained the word "goggle", several of its trigram tokens would match the ones for Google. In contrast, if you used the Default tokenizer, the tokens google and goggle would not match. The local order of pairs or triples of letters can help diminish the impact of misspellings or accented vs. unaccented characters.
The Bigram and Trigram tokenizers are useful for free text fields that contain misspellings or international characters such as à, ç, é, ñ, ö, and so on.
Note: These tokenizers produce significantly more tokens than other options. The performance cost of the additional processing can be a consideration.
The Bi-Word tokenizer creates tokens in the same way as the Default tokenizer, and then also pairs consecutive words into tokens. Each individual word and each bi-word becomes a token.
For example, three fields contain the strings "Unsalted Almonds", "Almond Butter", "Unsalted butter". The Bi-word tokenizer creates the tokens almond, almond butter, almonds, butter, unsalted, unsalted almonds, unsalted butter. The bi-word tokens cannot be mistaken for each other: "unsalted almonds" and "unsalted butter" are not the same. In contrast, if you used the Default tokenizer, only individual words are emitted, making half of the words in the string "Unsalted almonds" match "Unsalted butter".
The Bi-Word tokenizer is useful when word order is important, such as for descriptions of parts or products.
Like the Default tokenizer, the Spaces and special chars tokenizer splits blocks of text into tokens on spaces and special characters, and then also splits on underscore characters (_), periods used as decimal points in numbers, and periods used to separate the domain, subdomain, and so on in URLs, before lowercasing the results.
This option is available primarily to address a backward compatibility issue. In most cases, the Default tokenizer produces more useful results.
To evaluate how similar or different data values of different types are, Tamr Core offers the following similarity functions.
This function applies to text values and represents the cosine similarity between two "bags of words", with a similarity range of [0, 1]. When used in a blocking model, it generates all pairs with a similarity greater than or equal to the specified threshold with no missing pairs.
You can select whether to use TF-IDF for word weighting, or to leave the tokens equally weighted.
- IDF is the default, with binary TF. That is, TF can only be 1 (for mentioned tokens) and 0 (for absent tokens). As a result, the machine learning model ignores the frequency of terms beyond a single mention.
- Equal weighting is most effective with tokenizers that produce tokens that are less specific than full words, such as Bigram and Trigram.
Cosine is the default similarity function for text values.
This function applies to text values and represents the weighted Jaccard similarity, with a similarity range of [0, 1]. Like Cosine, you can specify whether to use TF-IDF with binary TF for word weighting (the default) or leave words equally weighted.
When specified for an attribute in a blocking model, the allowed thresholds are in range [0.4, 1] with less than 10% chance of missing a pair.
Jaccard is useful for very long text fields (>100 words) as using Cosine for these values is typically slower.
This function applies to numeric values and represents the absolute difference between two numbers (X - Y) with a similarity range of [0, infinity].
When you specify this function for an attribute in the unified dataset, Tamr Core automatically casts strings to floats or doubles.
This function applies to numeric values and represents the relative difference between two numbers (1 - | X - Y | / max(|X|, |Y|)) with a similarity range of [0, 1]. When x and y are both zero, similarity is 1.
These functions apply to geospatial values and represent the distance between two objects. See Similarity Metrics for Geospatial Data.
Tamr Core handles null values, such as record pairs in a mastering project with either (null, null) or (value, null) separately from cases in which the values being compared are different or similar non-null values.
The training that curators provide to supervised learning models determines how these cases are handled. For example:
- If you label a record pair with (null, null) or (value, null) as a matching pair, the model predicts similar cases as matches.
- If you label a record pair with (null, null) or (value, null) as a non-matching pair, the model predicts similar cases as no matches.
Note: The model does not distinguish between (null , null) and (value, null).
Updated 2 months ago