Glossary
A
Absolute Difference
A signal on numeric values that is used to find numbers that are "close" to one another in an absolute sense, for example, within a certain, preset range.
Attribute
A data field in a record, such as age, name, or state. Attributes have values. One record has many attributes, and attributes can be used in machine learning.
B
Binning
See Blocking.
Blocking
The process by which Tamr reduces the number of pairwise comparisons. Instead of comparing all records to all records, Tamr uses a model to divide records into K blocks and only compares blocks where matches are possible. The goal is to find all true matches and prune as many non-matches as possible.
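The effect of blocking on the number of comparisons can be illustrated with a small Python sketch. This is not Tamr's blocking model: it swaps in a simple hand-written blocking key (first letter of the name plus country), and the names `candidate_pairs` and `name_country_key` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, block_key):
    """Group record IDs by a blocking key and pair records only within a block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def name_country_key(record):
    """Hypothetical blocking key: first letter of the name plus the country."""
    return (record["name"][:1].lower(), record["country"])

records = {
    1: {"name": "Acme Corp", "country": "US"},
    2: {"name": "ACME Corporation", "country": "US"},
    3: {"name": "Globex", "country": "DE"},
    4: {"name": "Acme GmbH", "country": "DE"},
}

all_pairs = set(combinations(sorted(records), 2))      # every record against every other
blocked_pairs = candidate_pairs(records, name_country_key)
print(len(all_pairs), len(blocked_pairs))
```

With these four records, the exhaustive approach yields six pairs, while the blocked approach keeps only the pair of records 1 and 2. A key that is too strict would also prune true matches, which is why a blocking model aims to keep every true match while discarding as many non-matches as it can.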
C
Category
An element in a taxonomy. For example, a taxonomy about food might include the categories fruit, meat, and bread. Each of those categories might contain categories such as apple, orange, steak, or bagel. See also node.
Categorization
The process of assigning records or entities to meaningful groups by matching them to entries in a taxonomy. Categorization identifies a single entry, or node, as the "best" match for each record. In a taxonomy with hierarchical levels, the matching node can be at any hierarchical level. In Tamr Categorization projects, this process is aided by Tamr machine learning algorithms that use supervised learning to categorize records. For more information, see Categorization.
Classification
A type of machine learning algorithm and model. For example, in Tamr Categorization projects, expert reviewers assign a small set of records to the nodes that match best in a taxonomy. Using these training examples, the model iteratively uses "classification" to propose nodes for more records. Through this iterative, supervised learning process you reach matches that meet a predefined level of accuracy.
Cluster
A consolidation of records and their attributes across your datasets that Tamr suggests are matches. Clusters may also be called Consolidated Entities.
Consolidated Entity
See Cluster.
Cosine Similarity
A signal on text values that Tamr uses as a metric when running its matching machine learning model. Cosine similarity measures the angle between two vectors, where each vector is made up of tokens. Tokens are weighted with IDF (inverse document frequency) weighting. The cost of using cosine similarity grows exponentially with the number of tokens, but cosine similarity is guaranteed to have perfect recall (with respect to the DNF). See Tokenizers and Similarity Functions.
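As an illustration of the general technique, the following Python sketch computes the cosine similarity of IDF-weighted token vectors. The smoothed IDF formula, the function names, and the sample values are assumptions made for this example; they are not taken from Tamr's implementation.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1 (an assumed formula)."""
    n = len(documents)
    df = Counter(token for doc in documents for token in set(doc))
    return {token: math.log((1 + n) / (1 + count)) + 1.0 for token, count in df.items()}

def weighted_vector(tokens, idf):
    """Sparse vector mapping each token to (term count * IDF weight)."""
    counts = Counter(tokens)
    return {token: count * idf.get(token, 0.0) for token, count in counts.items()}

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * vec_b.get(token, 0.0) for token, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

corpus = [
    ["acme", "industrial", "supply"],
    ["acme", "industrial", "supplies"],
    ["globex", "corporation"],
]
idf = idf_weights(corpus)
vectors = [weighted_vector(doc, idf) for doc in corpus]
sim_close = cosine_similarity(vectors[0], vectors[1])  # two shared tokens
sim_far = cosine_similarity(vectors[0], vectors[2])    # no shared tokens
print(round(sim_close, 3), sim_far)
```

Rare tokens receive larger weights than common ones, so a shared rare token raises the similarity more than a shared common token does.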
Cross validation
Cross validation is the principle where each record is used the same number of times for training and exactly once for testing. To illustrate this method, suppose we partition the data into two equal-sized subsets. First, we choose one of the subsets for training and the other for testing. We then swap the roles of the subsets so that the previous training set becomes the test set and vice versa. This approach is called a two-fold cross-validation. The total error is obtained by summing up the errors for both runs.
In this example, each record is used exactly once for training and once for testing. The k-fold cross-validation method generalizes this approach by segmenting the data into k equal-sized partitions. During each run, one of the partitions is chosen for testing, while the rest of them are used for training. This procedure is repeated k times so that each partition is used for testing exactly once. The total error is found by summing up the errors for all k runs.
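The k-fold procedure described above can be sketched in Python as follows. The function names, the toy labels, and the majority-class "model" are hypothetical and serve only to make the example self-contained.

```python
from collections import Counter

def k_fold_indices(n_records, k):
    """Yield (train, test) index lists so that each record is tested exactly once."""
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and the number of records")
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    indices = list(range(n_records))
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def majority_errors(labels, train, test):
    """Toy model: predict the most common training label, count test mistakes."""
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(1 for i in test if labels[i] != majority)

labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
folds = list(k_fold_indices(len(labels), k=5))
total_error = sum(majority_errors(labels, train, test) for train, test in folds)
print(total_error)
```

Each of the five runs trains on eight records and tests on the remaining two, and the total error is the sum of the errors from all five runs.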
D
Deduplication
Dataset mastering, also called Dedup/Dedupe/De-dupe or Entity Consolidation. It is the machine learning algorithm Tamr uses to combine similar records from your input datasets into clusters of related records to provide you with a unified view of your data.
Disjunctive Normal Form (DNF)
A type of Boolean expression summarized as “an OR of ANDs”. Tamr Mastering projects use rules in DNF to define characteristics that make it possible for pairs of records to match: a pair of records may be a match if values for (a AND b) OR (c AND d AND e) match.
Example: In a dataset of suppliers, a pair of records may be a match if values for (Vendor Name AND D-U-N-S) OR (Vendor Name AND Country AND City) match.
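The supplier example can be expressed as a short Python sketch, in which a rule is a list of clauses and each clause is a tuple of attribute names. The helper names and the notion of a value "match" used here (case-insensitive equality of non-empty values) are assumptions for illustration, not Tamr's rule syntax.

```python
def values_match(a, b):
    """Assumed match test: both values present and equal ignoring case and padding."""
    if a is None or b is None:
        return False
    return str(a).strip().lower() == str(b).strip().lower()

def satisfies_dnf(record_a, record_b, clauses):
    """True if ANY clause has ALL of its attributes matching (an OR of ANDs)."""
    return any(
        all(values_match(record_a.get(attr), record_b.get(attr)) for attr in clause)
        for clause in clauses
    )

SUPPLIER_RULE = [
    ("Vendor Name", "D-U-N-S"),
    ("Vendor Name", "Country", "City"),
]

a = {"Vendor Name": "Acme Corp", "D-U-N-S": "123456789", "Country": "US", "City": "Boston"}
b = {"Vendor Name": "acme corp", "D-U-N-S": None, "Country": "US", "City": "Boston"}
c = {"Vendor Name": "Acme Corp", "D-U-N-S": "987654321", "Country": "US", "City": "Chicago"}

print(satisfies_dnf(a, b, SUPPLIER_RULE))  # second clause holds
print(satisfies_dnf(a, c, SUPPLIER_RULE))  # neither clause holds
```

Records a and b qualify as a possible match through the second clause even though one D-U-N-S value is missing, while a and c fail both clauses.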
E
Entity
A group of one or more records that correspond to a single real-world business object such as a customer, product, employee, material, supplier, or vendor.
Entity Consolidation
A machine learning algorithm that Tamr uses to combine similar records from your input datasets into clusters of related records to provide you with a unified view of your data.
External Data
Data that Tamr uses only for training. It is not included in the final output.
F
G
H
Hausdorff Distance
The greatest of all distances from a point in one set to the nearest point in the other set. The closer two geometry objects are based on the Hausdorff distance, the more likely it is that they are similar. When creating pairs for matching geospatial type attributes, you can select one of these metric types on the Pairs Generation page:
- Hausdorff Distance measures how far two objects are from each other, within a metric space. This metric represents the absolute Hausdorff distance in meters between two geometries.
- Relative Hausdorff measures the ratio of the Hausdorff distance between two objects to the minimum of their diameters. This metric represents the degree of similarity between the two objects. It is useful when you need to determine possible similarity between two geographic objects that have different scales or sizes, such as small or large buildings. Possible values are between 0 and 1. Identical objects, such as objects of the same size that completely overlap, have a Relative Hausdorff value of 1.0. Use the Relative Hausdorff metric for lineStrings and polygons. Do not use it for attributes with the point geospatial type.
Once you specify the metric, Tamr generates record pairs with similarity above a specified threshold. Tamr uses 64-bit double precision for its calculations on geospatial data.
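For intuition, the Hausdorff distance between two finite point sets can be computed with the Python sketch below. It is a simplification under stated assumptions: it works on planar coordinates with Euclidean distance rather than on geographic coordinates in meters, it considers only the listed points (for example, vertices) rather than every point on a line or polygon, and the function names are hypothetical. It does not cover the Relative Hausdorff metric.

```python
import math

def directed_hausdorff(points_a, points_b):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(p, q) for q in points_b) for p in points_a)

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(
        directed_hausdorff(points_a, points_b),
        directed_hausdorff(points_b, points_a),
    )

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 0.0), (4.0, 0.0)]

print(directed_hausdorff(a, b))  # 1.0: every point of a is within 1 of b
print(directed_hausdorff(b, a))  # 3.0: (4, 0) is 3 away from its nearest point in a
print(hausdorff(a, b))           # 3.0
```

The two directed distances differ, which is why the symmetric distance takes the larger of the two.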
High Impact Feedback
Feedback that you provide on the pairs where Tamr is least confident. Each of these pairs has a high impact because it teaches Tamr something new.
I
Input dataset
The source data that you upload into a Tamr project. Typically, input datasets contain non-transactional data that is stored in flat, delimited data formats such as CSV (comma-separated values). Transactional data in JSON format can also be uploaded to Tamr (current support for this format is through an API).
J
Jaccard similarity index
The metric for text values that Tamr uses to determine similarity between two records in a pair. Jaccard similarity scales better to longer values but tends to miss some pairs. Recall for Jaccard similarity is tunable, but is not perfect initially. The other similarity functions are guaranteed to have perfect recall (with respect to the blocking model), but the cost of using Jaccard similarity does not grow exponentially with the number of tokens. See Tokenizers and Similarity Functions.
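The standard Jaccard index of two token sets is the size of their intersection divided by the size of their union. The Python sketch below shows that calculation; the function name and the convention of returning 1.0 for two empty inputs are assumptions for this example, and the sketch does not reproduce how Tamr tunes recall.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard index: |A intersect B| / |A union B| over the distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty values count as identical
    return len(a & b) / len(a | b)

left = "acme industrial supply".split()
right = "acme industrial supplies".split()

print(jaccard(left, right))  # 0.5: two shared tokens out of four distinct tokens
```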
K
L
Label
An action taken by an expert to train a supervised learning system. For example, in a Tamr Categorization project reviewers who assign records to the nodes in a taxonomy are labeling the records.
Leaf
A node (or category) in a hierarchical taxonomy that has no children. Leaf nodes are the most specific nodes in the taxonomy.
M
Machine learning (ML)
A type of AI (artificial intelligence) system that finds and uses patterns in data to achieve a goal. Unsupervised learning ML models apply the same algorithms consistently: an example in Tamr is the algorithm that groups similar records into clusters for Mastering projects. Supervised learning ML models apply human guidance to improve the algorithm for future iterations: an example in Tamr is the ML classification model used in Categorization projects.
Mastering
The process of creating clusters of matching records across multiple input datasets. Each cluster corresponds to a unique real-world entity. For more information, see Mastering.
Metadata
Metadata describes or gives information about other data. In a Tamr project, common metadata consists of information about the input dataset that contains a particular record, the name of the reviewer who labeled the record, or profiling information. For more information, see Profiling a Dataset.
N
Node
A term in a taxonomy. Each record in a dataset can be assigned to one and only one node in a taxonomy. The node can be at any level in the taxonomy. See also category.
O
P
Pair
A list of similarities between corresponding attributes in two records. The Tamr mastering model classifies pairs as a Match or No Match.
Pairwise Classification
In contrast to clustering, the pairwise classification (or matching) process labels individual pairs of records as a "Match" or "Distinct". Each attribute in a pair is assigned a similarity score. Tamr uses these scores to learn how to classify records.
Precision
A measure of the effectiveness of finding relationships among data, expressed by how many matched candidates were correct. It is measured by (True Positives)/(True Positives + False Positives).
Primary Key
The unique ID for each record in a dataset. See Primary Key Management.
Projects
Tamr feature sets that address master data management tasks. Schema mapping, Categorization, Mastering, and Golden Records are Tamr project types.
Q
R
Recall
A measure of the effectiveness of finding relationships among data, expressed by how many real matches were returned. It is measured by (True Positives)/(True Positives + False Negatives).
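The formulas in the Precision and Recall entries can be worked through with a short Python sketch. The function name, the sample pairs, and the choice to return 0.0 when a denominator is zero are assumptions for illustration.

```python
def precision_recall(predicted_matches, true_matches):
    """Compute precision and recall from sets of predicted and true matching pairs."""
    true_positives = len(predicted_matches & true_matches)
    false_positives = len(predicted_matches - true_matches)
    false_negatives = len(true_matches - predicted_matches)
    predicted_total = true_positives + false_positives
    actual_total = true_positives + false_negatives
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / actual_total if actual_total else 0.0
    return precision, recall

predicted = {("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")}
actual = {("r1", "r2"), ("r3", "r4"), ("r9", "r10")}

precision, recall = precision_recall(predicted, actual)
print(precision, round(recall, 3))
```

Here two of the four predicted pairs are correct, so precision is 2/4, and two of the three real matches were found, so recall is 2/3.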
Record
A collection of one or more values corresponding to a row in a relational database table. A record contains information about a single item; for example, in the customer table each record contains values about one specific customer.
Regression
A statistical process for estimating the relationships among variables. Regression is used to understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Relative Difference
A signal for numeric values, used to find numbers that are "close" to one another using a quotient measure: for example, numbers that are within an order of magnitude of each other.
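The contrast between this signal and the one in the Absolute Difference entry can be shown with a small Python sketch. The exact formulas Tamr uses are not given here, so the quotient of magnitudes, the thresholds, and the function names below are assumptions.

```python
import math

def within_range(a, b, max_gap):
    """Absolute sense: the numbers differ by no more than a preset range."""
    return abs(a - b) <= max_gap

def magnitude_ratio(a, b):
    """Quotient sense: larger magnitude divided by smaller magnitude (assumed form)."""
    small, large = sorted((abs(a), abs(b)))
    if large == 0:
        return 1.0       # both values are zero
    if small == 0:
        return math.inf  # one value is zero, so no finite ratio exists
    return large / small

def within_order_of_magnitude(a, b):
    return magnitude_ratio(a, b) <= 10

# Both pairs differ by 500 in absolute terms, but only the first is close as a quotient.
print(within_range(1_000_000, 1_000_500, 1000), within_order_of_magnitude(1_000_000, 1_000_500))
print(within_range(5, 505, 1000), within_order_of_magnitude(5, 505))
```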
S
Schema Mapping
The process of assigning attributes in input datasets to attributes in the unified schema. This is a required step in data mastering projects. Schema mapping allows you to identify attributes that are essentially the same but have different names in different input datasets. For example, in one dataset, an attribute name is First Name, and in another it is Given Name. For more information, see Schema Mapping.
Signal
A measure of similarity between data values of the same type. Examples of signals are the numeric difference between two numbers, or the fuzzy string distance between two text values.
Similarity Measures
Methods used by Tamr for pair generation and comparing values within unified attributes. See Tokenizers and Similarity Functions.
Source
See input dataset. An independent collection of data that has been connected to Tamr.
Source Attribute
A source attribute is the name of the column of data found in an input dataset.
Source Record
A source record, or a record in an input dataset, is a row of data values found in an input dataset.
T
Taxonomy
A set of standardized terms, arranged into a hierarchical, tree-like structure, that you use to organize records. Using a taxonomy to categorize records improves data standardization for use in downstream systems and enhances business intelligence capabilities. For more information, see Taxonomy.
Test Set
A test set represents records whose class labels are unknown.
Tokenizer
Tokenizers are used to split a string of characters into sections. All text attributes are tokenized by Tamr. See Tokenizers and Similarity Functions.
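Two common ways of splitting a string into sections are sketched below in Python: splitting on word boundaries and splitting into overlapping character n-grams. These are generic illustrations with hypothetical function names, not the tokenizers that Tamr provides.

```python
import re

def word_tokens(text):
    """Lowercase the text and split it into runs of letters, digits, or underscores."""
    return re.findall(r"\w+", text.lower())

def char_ngrams(text, n=3):
    """Overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(word_tokens("Acme Industrial-Supply, Inc."))  # ['acme', 'industrial', 'supply', 'inc']
print(char_ngrams("Tamr", 3))                       # ['tam', 'amr']
```

Word tokens suit values where whole words carry meaning, while character n-grams tolerate small spelling differences at the cost of producing more tokens.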
Training Set
In a Tamr Categorization project, the set of records that are initially assigned to categories in a taxonomy, as the first stage of training the supervised learning model.
U
Unified Attribute
A user-defined set of columns found across one or more input datasets.
Unified dataset
A dataset that combines all input data sources within the Tamr data mastering solution. The goal of a Schema Mapping project is a unified dataset. The unified dataset can then be used in a Categorization or Mastering project.
Unified Schema
A set of attributes that are known to have the same meaning and representation, and that include all of the data you need to solve problems and answer questions. You use Tamr to create a unified schema to get a common view of an entity, such as a person or organization, and its attributes across all input datasets.
V
Validation
A process in which original training data is divided into two smaller subsets. One of the subsets is used for training, while the other, known as the validation set, is used for estimating the generalization error. Typically, two thirds of the training set are reserved for model building, while the remaining one third is used for error estimation.
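The two-thirds and one-third split described above can be sketched in Python as follows. The function name, the shuffling step, and the fixed seed are assumptions chosen to keep the example reproducible.

```python
import random

def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records, then split it into training and validation subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(9))
train, validation = holdout_split(records)
print(len(train), len(validation))  # 6 3
```

The model is built on the training subset only, and the validation subset is used to estimate how well it generalizes to records it has not seen.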