Mastering
Understand the basics of a mastering project, such as curation of record pairs and clusters.

The process of data mastering
A Mastering Project helps you find records that refer to the same entity within and across input datasets. This task is often referred to as data mastering, entity resolution, or record linkage.
Data mastering is one of the major workflows you can use in:
- Data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM)
- Matching projects, such as entity detection and enrichment.
High-Level Steps in a Mastering Workflow
In a mastering project, you create a unified dataset, generate record pairs, run Tamr machine learning to identify matches, and then review the results and publish clusters. Each of these high-level steps is introduced in this topic.
- Step 1: Begin creating a unified dataset
- Step 2: Generate record pairs to identify matches
- Step 3: Curate and review record clusters
- Step 4: Publish clusters
Step 1: Begin Creating a Unified Dataset
The first step in a mastering project is to add one or more input datasets to the project. These are the datasets that will be mastered. A project's input datasets should describe a single logical entity, such as customers or products.
Once you add the datasets to the project, you map them to a single unified dataset and initially configure them for Tamr machine learning to generate attribute recommendations. See Working with the Unified Dataset.
Step 2: Generate Record Pairs to Identify Matches
After you create the unified dataset, create a blocking model to generate record pairs that are potential matches. A “pair” in Tamr is a list of similarities between corresponding attributes in two records. The mastering model classifies pairs as Match or No Match. See Generating Records Pairs.
After a Curator initially classifies a handful of arbitrary record pairs as matching or non-matching, Tamr begins learning and identifies high-impact record pairs for Reviewer feedback. See Curating and Reviewing Record Pairs.
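The idea of blocking and of a pair as a list of attribute similarities can be illustrated with a minimal Python sketch. This is not Tamr's implementation or API: the record fields, the `blocking_key` rule, and the use of `difflib.SequenceMatcher` as a similarity measure are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical unified records; the field names are illustrative only.
records = [
    {"id": "a1", "name": "Acme Corp", "city": "Boston"},
    {"id": "b7", "name": "ACME Corporation", "city": "Boston"},
    {"id": "c3", "name": "Globex Inc", "city": "Chicago"},
]


def blocking_key(record):
    # Toy blocking rule (an assumption): first three letters of the name plus the city.
    return (record["name"][:3].lower(), record["city"].lower())


def similarity(a, b):
    # Case-insensitive string similarity in the range [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_pairs(records, attributes=("name", "city")):
    # Group records by blocking key so only records in the same block are compared.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            # A pair carries one similarity score per corresponding attribute.
            sims = {attr: similarity(left[attr], right[attr]) for attr in attributes}
            pairs.append((left["id"], right["id"], sims))
    return pairs


pairs = generate_pairs(records)
for left_id, right_id, sims in pairs:
    print(left_id, right_id, sims)
```

In this sketch only the two Acme records share a block, so a single candidate pair is produced instead of comparing every record with every other record. A classifier would then label each pair as Match or No Match based on its similarity values.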
Step 3: Curate and Review Record Clusters
The next step in the data mastering process is clustering records. Multiple records may refer to the same real-world entity, such as a customer, supplier, person, or organization, so it is useful to create clusters that hold all matching records. The clustering process:
- Identifies when two or more records refer to the same real-world entity.
- Puts pairs of matching records into clusters.
The end result of the clustering process is clusters of matching records that correspond to unique entities.
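One simple way to picture how matching pairs become clusters is to treat each match as a link and group all records that are linked, directly or indirectly. The sketch below does this with a union-find structure. It assumes matches are transitive, which is a simplification for illustration and not a description of Tamr's clustering algorithm; the function name `cluster_matches` and the record IDs are hypothetical.

```python
def cluster_matches(record_ids, matching_pairs):
    # Each record starts as its own cluster representative.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every matching pair merges the two clusters its records belong to.
    for a, b in matching_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster under its representative.
    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(members) for members in clusters.values())


clusters = cluster_matches(
    ["a1", "b7", "c3", "d4"],
    [("a1", "b7"), ("b7", "d4")],
)
print(clusters)
```

Here `a1` and `d4` are never compared directly, but both match `b7`, so all three end up in one cluster while `c3` remains a cluster of one.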
You can:
- Merge the records in each cluster to form a single, merged record that describes an entity. See Golden Records.
- Use the cluster information as a key in other systems. See Curating and Reviewing Record Clusters.
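As a rough illustration of merging the records in a cluster into a single record, the following sketch picks the most common non-empty value for each attribute. The rule, the function name `merge_cluster`, and the sample fields are assumptions made for this example; the rules Tamr applies are described in Golden Records.

```python
from collections import Counter


def merge_cluster(cluster_records, attributes):
    merged = {}
    for attr in attributes:
        # Ignore missing or empty values when choosing a representative value.
        values = [r[attr] for r in cluster_records if r.get(attr)]
        # Hypothetical rule: keep the most frequent value, or None if there is none.
        merged[attr] = Counter(values).most_common(1)[0][0] if values else None
    return merged


cluster_records = [
    {"name": "Acme Corp", "city": "Boston"},
    {"name": "Acme Corp", "city": ""},
    {"name": "ACME Corporation", "city": "Boston"},
]
merged = merge_cluster(cluster_records, ["name", "city", "phone"])
print(merged)
```

An attribute that no record in the cluster populates, such as `phone` here, is left as `None` rather than guessed.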
Step 4: Publish Clusters
After you generate, curate, and review clusters, you can publish them. Publishing saves the current clusters as the latest version visible to downstream consumers, creating a dataset snapshot. This snapshot allows you to track changes to published clusters over time. Tamr captures cluster change metrics.
You can:
- Obtain the cluster metrics, such as the number of clusters with new members, between the current clustering results and the latest published clusters. See Reviewing Clusters.
- Verify records in the cluster. See Curating Clusters.
- Access historical cluster change metrics through RESTful APIs. See Publishing Clusters.
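To make a metric such as "number of clusters with new members" concrete, the sketch below compares a published snapshot with the current clustering results. It does not call Tamr's API; the dictionary layout (cluster ID mapped to a set of record IDs) and the function names are hypothetical.

```python
def clusters_with_new_members(published, current):
    # Clusters present in both versions that gained at least one record.
    return sorted(
        cluster_id
        for cluster_id, members in current.items()
        if cluster_id in published and members - published[cluster_id]
    )


def new_clusters(published, current):
    # Clusters that exist now but were absent from the published snapshot.
    return sorted(set(current) - set(published))


# Hypothetical snapshots: cluster ID -> set of record IDs.
published = {"c1": {"a1", "b7"}, "c2": {"c3"}}
current = {"c1": {"a1", "b7", "d4"}, "c2": {"c3"}, "c9": {"e5"}}

print(len(clusters_with_new_members(published, current)))
print(new_clusters(published, current))
```

Because each publish creates a snapshot, the same comparison can be repeated between any two versions to follow how clusters change over time.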