Mastering

Understand the basics of a mastering project, such as curation of record pairs and clusters.

The process of data mastering

A Mastering Project helps you find records that refer to the same entity within and across input datasets. This task is often referred to as data mastering, entity resolution, or record linkage.
Data mastering is one of the major workflows you can use in:

  • Data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM).
  • Matching projects, such as entity detection and enrichment.

High-Level Steps in a Mastering Workflow

In a mastering project, you create a unified dataset, generate record pairs, run Tamr machine learning to identify matches, and then review the results and publish clusters. Each of these high-level steps is introduced in this topic.

Step 1: Begin Creating a Unified Dataset

The first step in a mastering project is to add one or more input datasets to the project. These are the datasets that will be mastered. A project's input datasets focus on a single logical entity, such as customers or products.

Once you add the datasets to the project, you map them to a single unified dataset and initially configure them for Tamr machine learning to generate attribute recommendations. See Working with the Unified Dataset.

Step 2: Generate Record Pairs to Identify Matches

After you create the unified dataset, create a blocking model to generate record pairs that are potential matches. A “pair” in Tamr is a list of similarities between corresponding attributes in two records. The mastering model classifies pairs as Match or No Match. See Generating Record Pairs.
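
As an illustration of how blocking and pair generation fit together, here is a minimal Python sketch. It is not Tamr's implementation: the sample records, the `blocking_key` rule, and the use of `difflib.SequenceMatcher` as a similarity measure are all assumptions made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; real input comes from the unified dataset.
records = [
    {"id": 1, "name": "Acme Corp", "city": "Boston"},
    {"id": 2, "name": "ACME Corporation", "city": "Boston"},
    {"id": 3, "name": "Globex", "city": "Springfield"},
]


def blocking_key(record):
    """Hypothetical blocking rule: first three letters of the name, lowercased."""
    return record["name"][:3].lower()


def generate_pairs(records):
    """Group records by blocking key, then pair records within each group only."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


def pair_similarities(a, b, attributes=("name", "city")):
    """Describe a pair as one similarity score (0.0 to 1.0) per shared attribute."""
    return {
        attr: SequenceMatcher(None, a[attr].lower(), b[attr].lower()).ratio()
        for attr in attributes
    }


pairs = generate_pairs(records)
features = [(a["id"], b["id"], pair_similarities(a, b)) for a, b in pairs]
```

With this sample data, only records 1 and 2 share a blocking key, so a single candidate pair is produced instead of all three possible pairs. The per-attribute similarities are the kind of input a classifier could use to label the pair as Match or No Match.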

After a Curator initially classifies a handful of arbitrary record pairs as matching or non-matching, Tamr begins learning and identifies high-impact record pairs for Reviewer feedback. See Curating and Reviewing Record Pairs.

Step 3: Curate and Review Record Clusters

The next step in the dataset mastering process is clustering records. Multiple records may refer to the same real-world entity, such as a customer, supplier, person, or organization, so it is useful to create clusters that hold all matching records. The clustering process:

  • Identifies when two or more records refer to the same real-world entity.
  • Puts pairs of matching records into clusters.

The end result of the clustering process is clusters of matching records that correspond to unique entities.
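
One simple way to picture how matching pairs become clusters is to treat each match as a link and group every record reachable through those links. The Python sketch below does this with a union-find structure. It is a stand-in for illustration only and is not necessarily how Tamr clusters records; the function name `cluster_matches` and the sample IDs are hypothetical.

```python
def cluster_matches(record_ids, matching_pairs):
    """Group records so that any two linked by a chain of matches share a cluster."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matching_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    # Order clusters by their smallest record ID for a stable result.
    return sorted(clusters.values(), key=min)


# Hypothetical input: six records and three pairs classified as Match.
clusters = cluster_matches([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
```

Here records 1, 2, and 3 end up in one cluster even though 1 and 3 were never directly paired, and record 6, which matches nothing, forms a cluster of its own. Pure transitive grouping like this can over-merge when a single pair is misclassified, which is one reason cluster curation and review matter.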

You can:

Step 4: Publish Clusters

After you generate, curate, and review clusters, you can publish them. Publishing saves the current clusters as the latest version visible to downstream consumers, creating a dataset snapshot. This snapshot allows you to track changes to published clusters over time. Tamr captures cluster change metrics.

You can:

  • Obtain the cluster metrics, such as the number of clusters with new members, between the current clustering results and the latest published clusters. See Reviewing Clusters.
  • Verify records in the cluster. See Curating Clusters.
  • Access historical cluster change metrics through RESTful APIs. See Publishing Clusters.
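
To make a metric such as "the number of clusters with new members" concrete, the following Python sketch compares two clustering snapshots. It assumes each snapshot is a mapping from a persistent cluster ID to a set of record IDs; that representation, the function names, and the sample data are hypothetical and do not reflect Tamr's API or storage format.

```python
def clusters_with_new_members(published, current):
    """Count clusters present in both snapshots that gained at least one record."""
    count = 0
    for cluster_id, members in current.items():
        previous = published.get(cluster_id)
        if previous is not None and members - previous:
            count += 1
    return count


def new_clusters(published, current):
    """List cluster IDs that exist in the current results but were never published."""
    return sorted(set(current) - set(published))


# Hypothetical snapshots: cluster ID -> set of record IDs.
published = {"c1": {1, 2}, "c2": {3}}
current = {"c1": {1, 2, 4}, "c2": {3}, "c3": {5}}

changed = clusters_with_new_members(published, current)
added = new_clusters(published, current)
```

In this example, cluster c1 gained record 4, c2 is unchanged, and c3 is new since the last publish.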