A Mastering Project helps you find records that refer to the same entity within and across input datasets. This task is often referred to as data mastering, entity resolution or record linkage.
Data mastering is one of the major workflows you can use in:
- Data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM)
- Matching projects, such as entity detection and enrichment.
In a mastering project, you create a unified dataset, generate pairs, run Tamr machine learning to identify matches, and then review the results and publish clusters. The steps are roughly as follows. Each of these high-level steps is introduced in this topic.
The first step in a mastering project is to add one or more input datasets to the project. These are the datasets that will be mastered. A project's input datasets all describe a single logical entity, such as customers or products.
Once you add the datasets to the project, you map them to a single unified dataset and perform the initial configuration that Tamr machine learning needs to interpret the data. See Working with the Unified Dataset.
After you create the unified dataset, create a binning model to generate record pairs that are potential matches. See Generating Record Pairs.
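The idea behind binning can be sketched in a few lines of Python. This is a minimal illustration, not Tamr's binning model: the `bin_key` rule (first three letters of the name plus the postal code) and the `name` and `zip` attributes are assumptions made up for the example. Only records that land in the same bin are paired, which avoids comparing every record with every other record.

```python
from collections import defaultdict
from itertools import combinations


def bin_key(record):
    # Hypothetical binning rule: first three letters of the lowercased
    # name plus the postal code. A real project defines its own rules.
    return (record["name"].strip().lower()[:3], record["zip"])


def generate_pairs(records):
    # Group record ids by bin key.
    bins = defaultdict(list)
    for rec in records:
        bins[bin_key(rec)].append(rec["id"])

    # Only records that share a bin become candidate pairs.
    pairs = set()
    for ids in bins.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return sorted(pairs)


records = [
    {"id": 1, "name": "Acme Corp", "zip": "02139"},
    {"id": 2, "name": "ACME Corporation", "zip": "02139"},
    {"id": 3, "name": "Bolt Ltd", "zip": "94105"},
    {"id": 4, "name": "Acme Corp", "zip": "10001"},
]
candidate_pairs = generate_pairs(records)
print(candidate_pairs)
```

In this sample, records 1 and 2 share a bin and form the only candidate pair. Record 4 has a similar name but a different postal code, so this rule never pairs it with the others; a looser rule would generate more pairs at a higher comparison cost.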
After a Curator initially classifies a handful of arbitrary record pairs as matching or non-matching, Tamr begins learning and identifies high-impact record pairs for Reviewer feedback.
A “pair” in Tamr is a list of similarities between corresponding attributes in two records. The mastering model classifies pairs as Match or No Match. See Curating and Reviewing Record Pairs.
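To make the notion of a pair concrete, the sketch below builds one similarity value per attribute for two records and then labels the pair. It rests on assumptions: the similarity measure comes from Python's standard `difflib`, the attribute names are invented, and the averaged threshold in `classify` is only a stand-in for the trained mastering model, which learns from Curator and Reviewer feedback rather than from a fixed cutoff.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    # Case-insensitive string similarity in the range 0.0 to 1.0.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(left, right, attributes):
    # A "pair" here is one similarity value per corresponding attribute.
    return {attr: text_similarity(left[attr], right[attr]) for attr in attributes}


def classify(features, threshold=0.8):
    # Stand-in for the learned model: average the similarities and
    # compare against a hypothetical threshold.
    score = sum(features.values()) / len(features)
    return "Match" if score >= threshold else "No Match"


left = {"name": "Acme Corp", "city": "Boston"}
right = {"name": "ACME Corp", "city": "boston"}
other = {"name": "zzz", "city": "qqq"}

same = pair_features(left, right, ["name", "city"])
different = pair_features(left, other, ["name", "city"])
print(classify(same), classify(different))
```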
Mastering a dataset relies on clustering. Because multiple records may refer to the same entity, it is useful to group all matching records into clusters.
Clustering is the process of identifying when two or more records refer to the same real-world entity, such as a customer, supplier, person or organization. The clustering process groups the records from matching pairs into clusters.
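One simple way to turn matching pairs into clusters is to treat each match as a link and collect the connected groups. The sketch below does this with a union-find structure. It is an assumption chosen for illustration; the source does not describe which clustering algorithm Tamr uses, and the function name `cluster_matches` is hypothetical.

```python
def cluster_matches(record_ids, matching_pairs):
    # Each record starts out as its own cluster.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every matching pair merges the clusters of its two records.
    for a, b in matching_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster by root.
    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())


clusters = cluster_matches([1, 2, 3, 4, 5], [(1, 2), (2, 4)])
print(clusters)
```

Records 1, 2 and 4 end up in one cluster even though 1 and 4 were never paired directly, while records 3 and 5 remain single-record clusters. This transitive behavior is a property of the connected-groups approach and is one reason reviewing clusters after they are generated is useful.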
The end result is clusters of matching records that correspond to unique entities. You can merge the records in each cluster to form a single record that describes the entity. See Golden Records.
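Merging a cluster into a single record can be sketched as follows. The rule used here, taking the most common non-empty value for each attribute, is an assumption for illustration only; the actual merge rules are described in Golden Records. The `merge_cluster` function and the sample attributes are hypothetical.

```python
from collections import Counter


def merge_cluster(records, attributes):
    merged = {}
    for attr in attributes:
        # Ignore missing and empty values when choosing a representative.
        values = [r[attr] for r in records if r.get(attr)]
        # Hypothetical rule: keep the most common remaining value.
        merged[attr] = Counter(values).most_common(1)[0][0] if values else None
    return merged


cluster = [
    {"name": "Acme Corp", "phone": ""},
    {"name": "Acme Corp", "phone": "555-0100"},
    {"name": "ACME Corporation", "phone": "555-0100"},
]
golden = merge_cluster(cluster, ["name", "phone", "email"])
print(golden)
```

No record in the sample cluster has an `email` value, so the merged record carries `None` for that attribute rather than an invented value.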
You can also use the cluster information as a key in other systems. See Curating and Reviewing Record Clusters.
After you generate, curate and review clusters, you can publish them. Publishing saves the current clusters as the latest version visible to downstream consumers, creating a dataset snapshot. This snapshot allows you to track changes to published clusters over time. Tamr captures cluster change metrics.
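To illustrate what tracking changes between published versions can look like, the sketch below compares two snapshots, each represented as a mapping from record id to cluster id. The representation, the `cluster_changes` function, and the three reported categories are assumptions for the example; they are not the metrics Tamr reports. The comparison also assumes that cluster ids stay stable from one published version to the next.

```python
def cluster_changes(previous, current):
    # previous and current map record id -> cluster id for two published versions.
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    # Records present in both versions whose cluster assignment changed.
    moved = sorted(
        rid for rid in set(previous) & set(current) if previous[rid] != current[rid]
    )
    return {"added": added, "removed": removed, "moved": moved}


v1 = {"r1": "c1", "r2": "c1", "r3": "c2"}
v2 = {"r1": "c1", "r2": "c2", "r4": "c3"}
changes = cluster_changes(v1, v2)
print(changes)
```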