Glossary
A
Absolute Difference
A signal on numeric values that is used to find numbers that are "close" to one another in an absolute sense, for example, within a certain, preset range.
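As a rough illustration only (not Tamr Core's implementation), the idea can be sketched in Python. The function name `within_absolute_range` and the threshold of 5 are assumptions made for this example.

```python
def within_absolute_range(a: float, b: float, max_diff: float) -> bool:
    """Return True when a and b differ by at most max_diff in absolute terms."""
    return abs(a - b) <= max_diff


# With a preset range of 5, 100 and 103 are "close"; 100 and 110 are not.
print(within_absolute_range(100, 103, 5))  # True
print(within_absolute_range(100, 110, 5))  # False
```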
Admin
A user role for Tamr Core. Admins are the project leaders who manage deliverables and user access through policies and roles. See User Roles and Tamr Core Documentation and Admin Tasks and Responsibilities.
Attribute
A data field in a record, such as age, name, or state. Attributes have values. One record has many attributes, and attributes can be used in machine learning.
Author
A role that can be assigned to users or groups in policies. Authors can create projects and add them to policies, and upload datasets and add them to policies. See User Roles and Tamr Core Documentation and Author Tasks and Responsibilities.
B
Binning
See Blocking.
Blocking
The process by which a mastering project reduces the number of pairwise comparisons it makes. Instead of comparing every record to every other record, the project uses a blocking model to divide records into K blocks, and compares only the blocks where matches are possible. The goal of blocking is to find all true matches while pruning as many non-matching pairs as possible.
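The effect of blocking on the number of comparisons can be sketched as follows. This is a simplified illustration, not how Tamr Core builds its blocking model: it assumes records are grouped by a single exact key (here a hypothetical `zip` attribute), and the function name `candidate_pairs` is invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Group records by block_key and return pairs only within each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"id": 1, "name": "Acme Corp", "zip": "02139"},
    {"id": 2, "name": "ACME Corporation", "zip": "02139"},
    {"id": 3, "name": "Globex", "zip": "10001"},
    {"id": 4, "name": "Globex Inc", "zip": "10001"},
]

pairs = candidate_pairs(records, block_key=lambda r: r["zip"])
compared = [(a["id"], b["id"]) for a, b in pairs]
print(compared)  # [(1, 2), (3, 4)]
```

Four records would produce six pairs if every record were compared to every other record; with the blocks above, only two pairs remain.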
Bootstrapping
The process of creating a unified attribute with the same name as the input attribute. See Approaches to Creating a Unified Schema.
Tip: Bootstrapping in Tamr Core does not refer to statistical sampling.
C
Category
An element in a taxonomy. For example, a taxonomy about food might include the categories fruit, meat, and bread. Each of those categories might contain categories such as apple, orange, steak, or bagel. See also node.
Categorization
The process of assigning records or entities to meaningful groups by matching them to entries in a taxonomy. Categorization identifies a single entry, or node, as the "best" match for each record. In a taxonomy with hierarchical levels, the matching node can be at any hierarchical level. In categorization projects, this process is aided by machine learning models that use supervised learning to categorize records. For more information, see Categorization.
Classification
A type of machine learning algorithm and model. For example, in categorization projects, expert reviewers assign a small set of records to the nodes that match best in a taxonomy. Using these training examples, the model iteratively uses "classification" to propose nodes for more records. Through this iterative, supervised learning process you reach matches that meet a predefined level of accuracy.
Cluster
A consolidation of records and their attributes across your datasets that a mastering project suggests are matches. Clusters can also be called consolidated entities. Tamr Core computes precision and recall metrics for clusters.
For more information, see Precision and Recall.
Consolidated Entity
See Cluster.
Cosine Similarity
A signal on text values that Tamr Core uses as a metric when running its matching machine learning model. Cosine similarity measures the angle between two vectors, where each vector is made up of tokens. Tokens are weighted with IDF (inverse document frequency) weighting by default, or you can leave the tokens equally weighted. The cost of cosine similarity grows exponentially with the number of tokens, but it is guaranteed to have perfect recall (when used in the blocking model of a mastering project). See Tokenizers and Similarity Functions.
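The calculation can be sketched in Python as follows. This is a minimal illustration rather than Tamr Core's implementation: the function names are invented for the example, and the smoothed IDF formula used here is one common variant chosen as an assumption.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Compute a smoothed inverse document frequency for every token."""
    n = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    # Assumed smoothing: log((1 + n) / (1 + df)) + 1, so weights stay positive.
    return {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}


def cosine_similarity(tokens_a, tokens_b, weights=None):
    """Cosine of the angle between two (optionally weighted) token vectors."""
    weights = weights or {}  # no weights means every token counts equally
    vec_a = {t: c * weights.get(t, 1.0) for t, c in Counter(tokens_a).items()}
    vec_b = {t: c * weights.get(t, 1.0) for t, c in Counter(tokens_b).items()}
    dot = sum(vec_a[t] * vec_b.get(t, 0.0) for t in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [["acme", "corp"], ["acme", "inc"], ["globex", "corp"]]
weights = idf_weights(docs)
unweighted = cosine_similarity(docs[0], docs[1])
weighted = cosine_similarity(docs[0], docs[1], weights)
print(round(unweighted, 3))   # 0.5
print(weighted < unweighted)  # True
```

With IDF weighting, the shared token "acme" is common across the sample and the unshared token "inc" is rare, so the weighted score is lower than the equally weighted one.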
Cross Validation
A technique in which each record is used the same number of times for training and exactly once for testing. For example, suppose you partition the data into two equal-sized subsets. First, you choose one of the subsets for training and the other for testing. You then swap the roles of the subsets so that the previous training set becomes the test set and vice versa. This approach is called two-fold cross-validation. The total error is obtained by summing up the errors for both runs.
In this example, each record is used exactly once for training and once for testing. The k-fold cross-validation method generalizes this approach by segmenting the data into k equal-sized partitions. During each run, one of the partitions is chosen for testing, while the rest of them are used for training. This procedure is repeated k times so that each partition is used for testing exactly once. The total error is found by summing up the errors for all k runs.
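The partitioning described above can be sketched as follows. This is a minimal illustration with an invented function name, `k_fold_splits`; it assigns records to folds in order and does not shuffle them first, which a real workflow usually would.

```python
def k_fold_splits(n_records, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_splits(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each record appears in exactly one test partition and in k - 1 training partitions.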
Curator
A user role for projects. Curators compose the blocking model, provide initial data expertise for mastering projects, and assign review tasks to subject matter experts in projects. See User Roles and Tamr Core Documentation and Curator Tasks and Responsibilities.
D
Deduplication
Also called dataset mastering, dedup, dedupe, de-dupe, or entity consolidation. Deduplication is the machine learning algorithm that a mastering project uses to combine similar records from your input datasets into clusters of related records to provide you with a unified view of your data.
Disjunctive Normal Form (DNF)
See Blocking.
E
Entity
A group of one or more records that correspond to a single real-world business object such as a customer, product, employee, material, supplier, or vendor.
Entity Consolidation
A machine learning algorithm that a mastering project uses to combine similar records from your input datasets into clusters of related records to provide you with a unified view of your data.
External Data
Data that Tamr Core uses only for training. It is not included in the final output.
F
G
Golden Record
A golden record is the goal of data deduplication: each entity is represented by a single data record that contains the best data values available. See Golden Records Projects.
Groups
See Record Grouping.
H
Hausdorff Distance
The maximum distance from a point in one set to the nearest point in the other set. The closer two geometry objects are based on the Hausdorff distance, the more likely it is that they are similar. When creating pairs for matching geospatial type attributes, you can select these metric types on the Pairs Generation page:
- Hausdorff Distance measures how far two objects are from each other, within a metric space. This metric represents the absolute Hausdorff distance in meters between two geometries.
- Relative Hausdorff measures the ratio of the Hausdorff distance between two objects divided by the minimum of their diameters. This metric represents the degree of similarity between the two objects. It is useful when you need to determine the possible similarity between two geographic objects that have different scales or sizes, such as small or large buildings. Possible values are between 0 and 1. Identical objects, such as objects of the same size that completely overlap, have a Relative Hausdorff value of 1.0. Use the Relative Hausdorff metric for lineStrings and polygons. Do not use it for attributes with the point geospatial type.
Once you specify the metric, the machine learning model generates pairs with similarity above a specified threshold. Tamr Core uses 64-bit double precision for its calculations on geospatial data.
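The absolute Hausdorff distance can be sketched for small point sets as follows. This is an illustration under stated assumptions, not Tamr Core's geospatial implementation: it treats each geometry as a finite list of vertices in planar coordinates and uses Euclidean distance rather than meters on the Earth's surface. The function names are invented for the example, and Relative Hausdorff is not covered.

```python
import math


def directed_hausdorff(set_a, set_b):
    """Largest distance from any point in set_a to its nearest point in set_b."""
    return max(min(math.dist(a, b) for b in set_b) for a in set_a)


def hausdorff_distance(set_a, set_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(set_a, set_b), directed_hausdorff(set_b, set_a))


square = [(0, 0), (0, 1), (1, 0), (1, 1)]
shifted = [(3, 0), (3, 1), (4, 0), (4, 1)]  # the same square moved 3 units right

print(hausdorff_distance(square, square))   # 0.0
print(hausdorff_distance(square, shifted))  # 3.0
```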
High-Impact Feedback
High-impact feedback can be provided in places where a machine learning model is least confident in suggested pairs, clusters, or categorizations, and therefore each expert response is high-impact in terms of teaching the model something new.
I
Input Dataset
The source data that you upload into a project. Typically, input datasets contain non-transactional data that is stored in flat, delimited data formats such as CSV (comma-separated value). Transactional data in JSON format can also be uploaded to Tamr Core (current support for this format is through an API).
J
Jaccard Similarity Index
The metric for text values that machine learning models use to determine the similarity between two records. Jaccard similarity scales better to longer values but tends to miss some pairs. Recall for Jaccard similarity is tunable, but is not perfect initially. The other similarity functions are guaranteed to have perfect recall (with respect to the blocking model), but Jaccard similarity does not grow exponentially with the number of tokens. See Tokenizers and Similarity Functions.
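The index itself is the size of the intersection of two token sets divided by the size of their union. A minimal sketch follows; the function name and the treatment of two empty inputs are assumptions made for the example.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token intersection divided by the size of the token union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0  # assumption: with nothing to compare, score the pair as 0
    return len(set_a & set_b) / len(set_a | set_b)


a = "acme corp boston".split()
b = "acme corp cambridge".split()
print(jaccard_similarity(a, b))  # 0.5
```

Here two of the four distinct tokens are shared, so the score is 2 / 4.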
K
L
Label
An action taken by an expert to train a supervised learning system. For example, in a categorization project reviewers who assign records to the nodes in a taxonomy are labeling the records. In a mastering project, identifying a pair as Match or No Match labels that pair.
Leaf
A leaf node (or leaf category) in a hierarchical taxonomy is the most specific and has no children.
Learned pairs
These are pairs and pair labels that Tamr Core learns from your expert cluster verifications, reducing the need to manually label pairs. Tamr Core does not override any expert feedback already provided on the Pairs page. See Learned pairs.
M
Machine Learning (ML)
A type of AI (artificial intelligence) system that finds and uses patterns in data to achieve a goal. Unsupervised learning ML models apply the same algorithms consistently: an example is the model that groups similar records into clusters for mastering projects. Supervised learning ML models apply human guidance to improve the algorithm for future iterations: an example is the ML classification model used in categorization projects.
Mastering
The process of creating clusters of matching records across multiple input datasets. Each cluster corresponds to a unique real-world entity. For more information, see Mastering.
Materialize
A unified dataset is materialized when you commit the changes made by schema mapping or transformations to the unified dataset by selecting Update Unified Dataset. Tamr Core stores the materialized unified dataset in HBase.
Metadata
Metadata describes or gives information about other data. In a project, common metadata consists of information about the input dataset that contains a particular record, the name of the reviewer who labeled the record, or profiling information. For more information, see Profiling a Dataset.
N
Node
A term in a taxonomy. Each record in a dataset can be assigned to one and only one node in a taxonomy. The node can be at any level in the taxonomy. See also category.
O
P
Pair
A list of similarities and differences between corresponding attributes in two records. A mastering model identifies pairs as matching or non-matching, and learns to identify duplicate records from expert feedback. Tamr Core computes precision and recall metrics for pairs.
For more information, see Precision and Recall.
Pairwise Classification
In contrast to clustering, the pairwise classification (or matching) process assigns labels to individual pairs of records as matching or non-matching. Each attribute in a pair is assigned a similarity score. Tamr uses these scores to learn how to classify records.
Precision
Mastering projects compute precision metrics for pairs and clusters. See pair or cluster.
Pregroupby
See Record Grouping.
Primary Key
The unique ID for each record in a dataset. See Primary Key Management and Understanding Primary Keys.
Projects
Tamr Core feature sets that address master data management tasks. The project types are schema mapping, categorization, mastering, and golden records.
Q
R
Recall
Mastering projects compute recall metrics for pairs and clusters. See pair or cluster.
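For pairs, the standard definitions are: precision is the share of suggested matches that are true matches, and recall is the share of true matches that were suggested. The sketch below illustrates these definitions only; it is not Tamr Core's metric code, and the function name and sample pairs are invented.

```python
def precision_recall(predicted_matches, true_matches):
    """Return (precision, recall) for a set of predicted matching pairs."""
    predicted, actual = set(predicted_matches), set(true_matches)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


predicted = {(1, 2), (3, 4), (5, 6), (7, 8)}  # pairs the model labeled as matches
actual = {(1, 2), (3, 4), (9, 10)}            # pairs experts confirmed as matches

print(precision_recall(predicted, actual))  # (0.5, 0.6666666666666666)
```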
Record
A collection of one or more values corresponding to a row in a relational database table. A record contains information about a single item; for example, in the customer table each record contains values about one specific customer.
Record Grouping
An optional stage in the mastering workflow that identifies obviously-matching records and groups them together, which can make the remaining stages of the mastering workflow more efficient.
Regression
A statistical process for estimating the relationships among variables. Regression is used to understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Relative Difference
A signal for numeric values, used to find numbers that are "close" to one another using a quotient measure, for example, numbers that are within an order of magnitude of each other.
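A quotient-based check can be sketched as follows. This is an illustration only, not Tamr Core's implementation; the function name, the choice to compare magnitudes and ignore signs, and the handling of zero are all assumptions.

```python
def within_relative_range(a: float, b: float, max_ratio: float) -> bool:
    """Return True when the larger magnitude is at most max_ratio times the smaller."""
    low, high = sorted((abs(a), abs(b)))
    if low == 0:
        return high == 0  # assumption: zero is only "close" to zero
    return high / low <= max_ratio


# A max_ratio of 10 corresponds to "within an order of magnitude".
print(within_relative_range(200, 1500, 10))  # True: 7.5 times apart
print(within_relative_range(200, 5000, 10))  # False: 25 times apart
```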
Reviewer
A user role for projects. Reviewers have deep knowledge of project data and can act as subject matter experts when completing assigned tasks. See User Roles and Tamr Core Documentation and Reviewer Tasks and Responsibilities.
S
Schema
The organization of the data in a dataset: a blueprint of how the dataset is constructed.
Schema Mapping
The process of assigning attributes in input datasets to attributes in the unified schema. This is a required step in data mastering projects. Schema mapping allows you to identify attributes that are essentially the same but have different names in different input datasets. For example, in one dataset, an attribute name is First Name, and in another it is Given Name. For more information, see Schema Mapping.
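Conceptually, a mapping renames each source attribute to its unified attribute. The sketch below illustrates that idea with plain dictionaries; the function name `apply_mapping`, the unified attribute name `first_name`, and the sample records are hypothetical and do not represent Tamr Core's API.

```python
def apply_mapping(record, mapping):
    """Rename a record's source attributes to unified attribute names."""
    return {mapping[name]: value for name, value in record.items() if name in mapping}


# Two input datasets use different names for the same attribute.
crm_mapping = {"First Name": "first_name"}
erp_mapping = {"Given Name": "first_name"}

unified = [
    apply_mapping({"First Name": "Ada"}, crm_mapping),
    apply_mapping({"Given Name": "Grace"}, erp_mapping),
]
print(unified)  # [{'first_name': 'Ada'}, {'first_name': 'Grace'}]
```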
Signal
A measure of similarity between data values of the same type. Examples of signals are the numeric difference between two numbers, or the fuzzy string distance between two text values.
Similarity Measures
Methods used for pair generation and comparing values within unified attributes. See Tokenizers and Similarity Functions.
Source
See input dataset. An independent collection of data that is input to Tamr Core.
Source Attribute
A source attribute is the name of the column of data found in an input dataset.
Source Record
A source record, or a record in an input dataset, is a row of data values found in an input dataset.
T
Tag
A metadata value that describes an identifying characteristic of input datasets. Tags can help you organize and locate datasets. See Managing Dataset Tags.
Taxonomy
A set of standardized terms, arranged into a hierarchical, tree-like structure, that you use to organize records. Using a taxonomy to categorize records improves data standardization for use in downstream systems and enhances business intelligence capabilities. See Taxonomy.
Test Set
A test set represents records whose class labels are unknown.
Tokenizer
Tokenizers are used to split a string of characters into sections. All text attributes are tokenized by Tamr Core. See Tokenizers and Similarity Functions.
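Two common ways of splitting a string into tokens are shown below as a sketch. The function names and the normalization choices (lowercasing, dropping punctuation) are assumptions for illustration and do not describe the specific tokenizers that Tamr Core provides.

```python
import re


def word_tokens(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def char_ngrams(text, n=3):
    """Split text into overlapping lowercase character sequences of length n."""
    normalized = text.lower()
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


print(word_tokens("Acme Corp., Boston"))  # ['acme', 'corp', 'boston']
print(char_ngrams("acme"))                # ['acm', 'cme']
```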
Training Set
In a categorization project, the set of records that are initially assigned to categories in a taxonomy, as the first stage of training the supervised learning model.
Transformation
Data transformation changes the format, structure, or values of data in a unified dataset. See the Transformations Overview.
U
Unified Attribute
A user-defined set of columns found across one or more input datasets.
Unified Dataset
A dataset that combines all input data sources within the Tamr Core data mastering solution. The goal of a schema mapping project is a unified dataset. The unified dataset can then be used in a categorization or mastering project.
Unified Schema
A set of attributes that are known to have the same meaning and representation, and that include all of the data you need to solve problems and answer questions. You create a unified schema to get a common view of an entity, such as a person or organization, and its attributes across all input datasets.
User-defined Signal
In mastering projects, a customized expression that acts on data for pairs of records to provide additional input to the matching model. For example, you want a pair of records that both have an empty value for a specified attribute to receive a high similarity score of 1 for that attribute (instead of the default low score of 0). Your additional knowledge about this similarity in the data can be written in an expression and calculated for all pairs.
If present, user-defined signals appear as columns on the Pairs page in mastering projects and as a single value for both records in the Compare details window.
Note: This feature is available as a custom service only. For more information, contact your Tamr account representative.
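The empty-value example above can be sketched in Python to show the logic only. User-defined signals are written as expressions in Tamr Core, not as Python, so the function name `both_empty_signal` and the definition of "empty" used here are purely illustrative assumptions.

```python
def both_empty_signal(value_a, value_b):
    """Return 1.0 when both values are empty, otherwise 0.0."""

    def is_empty(value):
        # Assumption: None and whitespace-only strings both count as empty.
        return value is None or str(value).strip() == ""

    return 1.0 if is_empty(value_a) and is_empty(value_b) else 0.0


print(both_empty_signal(None, ""))       # 1.0
print(both_empty_signal("Boston", ""))   # 0.0
```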
V
Validation
A process in which original training data is divided into two smaller subsets. One of the subsets is used for training, while the other, known as the validation set, is used for estimating the generalization error. Typically, two-thirds of the training set are reserved for model building, while the remaining one-third is used for error estimation.
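The two-thirds and one-third split can be sketched as follows. The function name `holdout_split`, the shuffling step, and the fixed seed are assumptions made for the example.

```python
import random


def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle records, then split them into training and validation subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, validation = holdout_split(range(9))
print(len(train), len(validation))  # 6 3
```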
Verifier
A user role for projects. Verifiers contribute data knowledge to projects and manage the assignment of review tasks. See User Roles and Tamr Core Documentation and Verifier Tasks and Responsibilities.