Glossary
A
Absolute Difference
Refers to the absolute difference signal on numeric values. Used to find numbers that are "close" to one another in an absolute sense, for example, within a certain, preset range.
Attribute
An attribute is a data field in a record, such as age, name, or state. Attributes have values. One record has many attributes, and attributes can be used in machine learning.
B
Binning
Binning is the process by which Tamr reduces the number of pairwise comparisons. Instead of comparing all records to all records, Tamr uses a model to divide records into K bins and compares only bins where matches are possible. The goal is to find all true matches and prune as many non-matches as possible.
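The idea of comparing only records that share a bin can be illustrated with a minimal sketch. It does not reproduce Tamr's binning model: the `bin_key` rule (the first three letters of a name) and the record layout are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import combinations

def bin_key(record):
    # Hypothetical binning rule: first three letters of the lowercased name.
    return record["name"].strip().lower()[:3]

def candidate_pairs(records):
    """Yield pairs of record ids that fall into the same bin."""
    bins = defaultdict(list)
    for rec in records:
        bins[bin_key(rec)].append(rec["id"])
    for ids in bins.values():
        yield from combinations(sorted(ids), 2)

records = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Globex"},
    {"id": 4, "name": "Globex Inc"},
]
pairs = sorted(candidate_pairs(records))
print(pairs)  # [(1, 2), (3, 4)]
```

Comparing all four records to each other would produce six pairs. With this rule only two pairs remain, and both are plausible matches.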
C
Category
A category is one element in a taxonomy. For example, a taxonomy about food might include the categories fruit, meat, bread. Each of those categories might contain categories such as apple, orange, steak, or bagel.
Categorization
Categorization is matching records to a taxonomy. It identifies one category in the taxonomy as the "best" match for each record. In Tamr, categorization is learned exclusively from human-provided labels.
Classification
This term is the same as categorization.
Cluster
A consolidation of records and their attributes across your datasets that Tamr suggests are matches. Clusters may also be called Consolidated Entities.
Consolidated Entity
This term means the same as Cluster.
Cosine Similarity
Refers to the cosine similarity signal on text values. Cosine binning on tokens measures the angle between two vectors, where each vector is made up of tokens. Tokens are weighted with IDF (inverse document frequency) weighting. The cost of cosine binning grows exponentially with the number of tokens, but it is guaranteed to have perfect recall (with respect to the DNF).
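As a rough illustration of cosine similarity over IDF-weighted tokens, the sketch below uses the common weighting log(N / document frequency). The exact weighting Tamr applies is not stated here, so the formula and the function names are assumptions.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Map each token to log(N / number of documents containing it)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n / count) for tok, count in df.items()}

def cosine_similarity(a, b, idf):
    """Cosine of the angle between two IDF-weighted token-count vectors."""
    va = {t: c * idf.get(t, 0.0) for t, c in Counter(a).items()}
    vb = {t: c * idf.get(t, 0.0) for t, c in Counter(b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no weighted tokens to compare
    return dot / (norm_a * norm_b)

docs = [
    ["acme", "corp"],
    ["acme", "corporation"],
    ["globex", "corp"],
    ["globex", "inc"],
]
idf = idf_weights(docs)
print(cosine_similarity(docs[0], docs[1], idf))  # about 0.316
print(cosine_similarity(docs[0], docs[3], idf))  # 0.0, no shared tokens
```

The rare token "corporation" receives a higher weight than the common token "acme", which pulls the two vectors apart even though they share a token.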
Cross Validation
A method in which each record is used the same number of times for training and exactly once for testing. To illustrate this method, suppose we partition the data into two equal-sized subsets. First, we choose one of the subsets for training and the other for testing. We then swap the roles of the subsets so that the previous training set becomes the test set and vice versa. This approach is called two-fold cross-validation. The total error is obtained by summing the errors for both runs. In this example, each record is used exactly once for training and once for testing. The k-fold cross-validation method generalizes this approach by segmenting the data into k equal-sized partitions. During each run, one of the partitions is chosen for testing, while the rest are used for training. This procedure is repeated k times so that each partition is used for testing exactly once. Again, the total error is found by summing the errors for all k runs.
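The k-fold procedure described above can be sketched as follows. The majority-label "model" and the helper names are placeholders chosen for this example. When the number of records is not divisible by k, this sketch makes the first folds one record larger.

```python
from collections import Counter

def k_fold_indices(n_records, k):
    """Split indices 0..n_records-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_records, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(records, k, train_fn, error_fn):
    """Return the total error summed over k train/test runs."""
    total_error = 0
    for test_idx in k_fold_indices(len(records), k):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(records) if i not in held_out]
        test = [records[i] for i in test_idx]
        total_error += error_fn(train_fn(train), test)
    return total_error

def train_majority(train):
    # Toy model: always predict the most common label in the training data.
    return Counter(label for _, label in train).most_common(1)[0][0]

def count_errors(model, test):
    return sum(1 for _, label in test if label != model)

records = list(enumerate(["a", "a", "b"] * 3))
total = cross_validate(records, 3, train_majority, count_errors)
print(total)  # 3: one misclassified "b" in each of the three folds
```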
D
Dedup/Dedupe/De-dupe
Short for deduplication, another term for dataset mastering. This term means the same as Entity Consolidation.
Disjunctive Normal Form (DNF)
DNF is a logical formula that is a disjunction of conjunctive clauses. The binning model in Tamr is written in DNF.
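To make the shape of a DNF formula concrete, here is a small sketch that represents a formula as a list of clauses, where each clause is a list of predicates that must all hold. The predicates and thresholds are hypothetical and are not taken from any Tamr binning model.

```python
def evaluate_dnf(clauses, pair):
    """True if at least one clause has all of its predicates satisfied."""
    return any(all(pred(pair) for pred in clause) for clause in clauses)

# Hypothetical predicates over a dict of per-attribute similarity scores.
def name_close(p):
    return p["name_sim"] >= 0.8

def city_close(p):
    return p["city_sim"] >= 0.9

def phone_equal(p):
    return p["phone_sim"] == 1.0

# (name_close AND city_close) OR (phone_equal)
rule = [[name_close, city_close], [phone_equal]]

print(evaluate_dnf(rule, {"name_sim": 0.85, "city_sim": 0.95, "phone_sim": 0.0}))  # True
print(evaluate_dnf(rule, {"name_sim": 0.85, "city_sim": 0.50, "phone_sim": 0.0}))  # False
print(evaluate_dnf(rule, {"name_sim": 0.10, "city_sim": 0.10, "phone_sim": 1.0}))  # True
```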
E
Entity
A group of one or more records that correspond to the same real-world entity, such as a person or organization.
Entity Consolidation
The process by which Tamr combines similar records from your input datasets into clusters of related records to give you a unified view of your data.
External Data
External data is data that Tamr uses only for training. It is not included in the final output.
F
G
H
Hausdorff Distance
The maximum distance from a point in one set to the nearest point in the other set. The closer two geometry objects are based on the Hausdorff distance, the more likely it is that they are similar. When creating pairs for matching geospatial type attributes, you can select these metric types on the Pairs Generation page:
- Hausdorff Distance measures how far two objects are from each other, within a metric space. This metric represents the absolute Hausdorff distance in meters between two geometries.
- Relative Hausdorff measures the ratio of the Hausdorff distance between two objects divided by the minimum of their diameters. This metric represents the degree of similarity between the two objects. It is useful when you need to determine possible similarity between two geographic objects that have different scales or sizes, such as small or large buildings. Possible values are between 0 and 1. Identical objects, such as objects of the same size that completely overlap, have a Relative Hausdorff value of 1.0. Use the Relative Hausdorff metric for lineStrings and polygons. Do not use it for attributes with the point geospatial type.
Once you specify the metric, Tamr generates record pairs with similarity above a specified threshold. Tamr uses 64-bit double precision for its calculations on geospatial data.
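For intuition, the absolute Hausdorff distance between two finite point sets can be computed as in the sketch below. It assumes planar coordinates and Euclidean distance, and it treats a shape as its list of vertices. A full geospatial implementation would work in meters on the Earth's surface and consider every point on a line or polygon boundary, which this sketch does not attempt.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

square = [(0, 0), (0, 1), (1, 1), (1, 0)]
shifted = [(3, 0), (3, 1), (4, 1), (4, 0)]

print(hausdorff_distance(square, square))   # 0.0
print(hausdorff_distance(square, shifted))  # 3.0
```

The measure takes the maximum over both directions because the directed distance is not symmetric: a single point can be close to a large set while parts of that set are far from the point.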
High Impact Feedback
Feedback provided on the pairs where Tamr is least confident. Each such pair is high impact because it teaches Tamr something new.
I
J
Jaccard Similarity
Refers to the Jaccard similarity signal on text values. Jaccard similarity scales better to longer values but tends to miss some pairs. Recall for Jaccard similarity is tunable, but is not perfect initially. The other similarity measures are guaranteed to have perfect recall (with respect to the binning model), but unlike them, the cost of Jaccard similarity does not grow exponentially with the number of tokens.
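The standard Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. The sketch below shows that definition only. Returning 1.0 for two empty inputs is a convention chosen for this example, and nothing here reflects how Tamr tunes recall.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """|A intersect B| / |A union B| over the distinct tokens of each value."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention for this sketch: two empty values are identical
    return len(a & b) / len(a | b)

print(jaccard_similarity(["acme", "corp"], ["acme", "corp"]))   # 1.0
print(jaccard_similarity(["acme", "corp"], ["globex", "inc"]))  # 0.0
print(jaccard_similarity(["acme", "corp"], ["acme", "corporation"]))  # one shared token out of three
```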
K
L
Leaf Category
A category with no children.
M
Mastering
A mastering process creates clusters of matching records across multiple input datasets. Each cluster corresponds to a unique real-world entity.
Metadata
Data that describes or gives information about other data. In a Tamr project, common metadata consists of: which input dataset contained a particular record, who labeled the record, or profiling information.
N
O
P
Pair
A list of similarities between corresponding attributes in two records. The Tamr mastering model classifies pairs as a Match or No Match.
Pairwise Classification
In contrast to clustering, the pairwise classification (or matching) process labels individual pairs of records as a "Match" or "Distinct". Each attribute in a pair is assigned a similarity score. Tamr uses these scores to learn how to classify records.
Precision
A measure of the effectiveness of finding relationships among data, expressed by how many matched candidates were correct. It is measured by (True Positives)/(True Positives + False Positives).
Primary Key
The unique ID for each record in the dataset. See Primary Key Management.
Q
R
Recall
A measure of the effectiveness of finding relationships among data, expressed by how many real matches were returned. It is measured by (True Positives)/(True Positives + False Negatives).
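The Precision and Recall formulas can be checked with a short worked example. The sets of predicted and actual matching pairs below are invented for illustration, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from two sets of pairs labeled as matches."""
    tp = len(predicted & actual)  # true positives
    fp = len(predicted - actual)  # false positives
    fn = len(actual - predicted)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = {(1, 2), (3, 4), (5, 6), (7, 8)}
actual = {(1, 2), (3, 4), (9, 10)}

precision, recall = precision_recall(predicted, actual)
print(precision)  # 0.5: 2 of the 4 predicted matches are correct
print(recall)     # 2 of the 3 real matches were found
```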
Record
A record is a collection of values (also known as a row in a table) that represent one particular item in a table. For example, in the table of customers, a record is a list of values about one specific customer.
Regression
Statistical process for estimating the relationships among variables. Regression is used to understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Relative Difference
Refers to the relative difference signal on numeric values. Used to find numbers that are "close" to one another using a quotient measure, for example, numbers within an order of magnitude of each other.
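The contrast between absolute and relative difference is easiest to see on numbers of different magnitude. The quotient used below, |a - b| divided by the larger magnitude, is one common definition and is an assumption here; the exact formula Tamr uses is not given in this glossary.

```python
def absolute_difference(a, b):
    """Distance between two numbers on the number line."""
    return abs(a - b)

def relative_difference(a, b):
    """Assumed quotient measure: |a - b| / max(|a|, |b|), or 0.0 if both are 0."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest

# Both pairs differ by 10 in absolute terms, but not in relative terms.
print(absolute_difference(1000, 1010), relative_difference(1000, 1010))
print(absolute_difference(1, 11), relative_difference(1, 11))
```

Under this definition, 1000 and 1010 are relatively close (about 1% apart), while 1 and 11 are not (about 91% apart), even though both pairs have the same absolute difference.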
S
Schema Mapping
Different input datasets may have different attribute names for attributes that are essentially the same. For example, an attribute may be named First Name in one dataset and Given Name in another. Schema mapping helps map attributes from the input datasets to the unified schema.
Signal
Measures of similarity between data values of the same type. Examples of signals are the numeric difference between two numbers, or the fuzzy string distance between two text values.
Similarity Measures
Methods used by Tamr for pair generation and for comparing values within unified attributes. The similarity measures that you can choose from are Cosine, Jaccard, Absolute Difference, and Relative Difference.
Source
Also known as an input dataset. A source is an independent collection of data that has been connected to Tamr.
Source Attribute
The name of the column of data found in an input dataset.
Source Record
A row of data values found in an input dataset.
T
Taxonomy
A hierarchical collection of categories. A taxonomy describes the possible categories in your classification project.
Test Set
A set of records whose class labels are unknown.
Tokenizer
Tokenizers are used to split a string of characters into sections. All text attributes are tokenized by Tamr. There are a number of tokenizers that you can choose for each text attribute. These are English stemming, bigrams, trigrams, bi-words, and regular expressions.
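A few of these tokenizer styles can be sketched in plain Python. The functions below are illustrative only: whether Tamr's bigrams and trigrams operate on characters (as assumed here) and how it normalizes case are not specified in this glossary. English stemming is omitted because it requires a stemming library.

```python
import re

def char_ngrams(text, n):
    """Overlapping character n-grams (n=2 for bigrams, n=3 for trigrams)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def bi_words(text):
    """Pairs of adjacent whitespace-separated words."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def regex_tokens(text, pattern=r"[A-Za-z0-9]+"):
    """All non-overlapping matches of a regular expression."""
    return re.findall(pattern, text)

print(char_ngrams("Tamr", 2))        # ['ta', 'am', 'mr']
print(char_ngrams("Tamr", 3))        # ['tam', 'amr']
print(bi_words("New York City"))     # ['new york', 'york city']
print(regex_tokens("A-12, B/7"))     # ['A', '12', 'B', '7']
```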
Training Set
A set of records whose class labels are known.
U
Unified Attribute
A user-defined category of columns found across one or more input datasets.
Unified Schema
A set of attributes that are known to have the same meaning and representation. You use Tamr to create a unified schema to get a common view of an entity, such as a person or organization, and its attributes across all input datasets.
V
Validation
A method in which the original training data is divided into two smaller subsets. One subset is used for training, while the other, known as the validation set, is used for estimating the generalization error. Typically, two thirds of the training set is reserved for model building, while the remaining one third is used for error estimation.
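A two-thirds / one-third holdout split like the one described above might look like this sketch. The shuffle, the fixed seed, and the function name are choices made for the example rather than anything prescribed by the definition.

```python
import random

def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records, then split them into training and validation subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, validation = holdout_split(range(9))
print(len(train), len(validation))  # 6 3
```

Shuffling before the split avoids a validation set that only reflects the order in which records happened to be stored.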