A Mastering Project solves the task of finding records that refer to the same entity within and across input datasets. This task is often referred to as data mastering, entity resolution or record linkage.
Data mastering is one of the major workflows used in data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM), and matching projects such as entity detection and enrichment.
The first step in a mastering project is to add one or more input datasets to the project. The datasets that you add are the ones that will be mastered. A project's input datasets are focused on a single logical entity, such as customers or products.
Once you add the datasets to the project, you map them to a single unified dataset and perform the initial configuration that Tamr machine learning needs to interpret them. See Working with the Unified Dataset.
After you create the unified dataset, create a binning model to generate record pairs that are potential matches. See Generating Record Pairs.
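To make the idea of binning concrete, here is a minimal sketch in plain Python. It is not Tamr's binning model: the `blocking_key` rule (first three letters of the name plus the postal code), the field names, and the sample records are all hypothetical, chosen only to show how grouping records into bins limits which pairs are generated.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical binning rule: first three letters of the lowercased name
    # plus the postal code. A real binning model is configured in the product.
    return (record["name"].strip().lower()[:3], record["zip"])


def candidate_pairs(records):
    # Group record IDs by bin, then pair up only records that share a bin.
    bins = defaultdict(list)
    for rec in records:
        bins[blocking_key(rec)].append(rec["id"])

    pairs = set()
    for ids in bins.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical sample data.
records = [
    {"id": 1, "name": "Acme Corp", "zip": "02139"},
    {"id": 2, "name": "ACME Corporation", "zip": "02139"},
    {"id": 3, "name": "Globex", "zip": "10001"},
    {"id": 4, "name": "Acme Ltd", "zip": "94105"},
]
pairs = candidate_pairs(records)
```

With these sample records, only records 1 and 2 land in the same bin, so a single candidate pair is generated instead of the six pairs an exhaustive comparison would produce.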
After a Curator initially classifies a handful of arbitrary record pairs as matching or non-matching, Tamr begins learning and identifies high-impact record pairs for Reviewer feedback.
A “pair” in Tamr is a list of similarities between corresponding attributes in two records. The mastering model classifies pairs as Match or No Match.
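The following sketch illustrates what "a list of similarities between corresponding attributes" can look like. It is an illustration under stated assumptions, not Tamr's implementation: the `similarity`, `make_pair`, and `classify` helpers are hypothetical, the similarity function is the standard library's `difflib.SequenceMatcher`, and the fixed-threshold rule in `classify` merely stands in for the learned mastering model.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Case-insensitive string similarity in [0, 1]; None if either value is missing.
    if a is None or b is None:
        return None
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def make_pair(rec_a, rec_b, attributes):
    # A pair is represented here as one similarity score per shared attribute.
    return {attr: similarity(rec_a.get(attr), rec_b.get(attr)) for attr in attributes}


def classify(pair, threshold=0.8):
    # Stand-in for the learned model (assumption): average the available
    # similarities and compare against a hypothetical threshold.
    scores = [s for s in pair.values() if s is not None]
    if not scores:
        return "No Match"
    return "Match" if sum(scores) / len(scores) >= threshold else "No Match"


# Hypothetical sample records.
rec_a = {"name": "Acme Corp", "city": "Boston"}
rec_b = {"name": "acme corp", "city": "Boston"}
rec_c = {"name": "Globex", "city": "Springfield"}

pair_ab = make_pair(rec_a, rec_b, ["name", "city"])
pair_ac = make_pair(rec_a, rec_c, ["name", "city"])
label_ab = classify(pair_ab)
label_ac = classify(pair_ac)
```

The point of the representation is that the classifier sees only attribute-level similarities, not the raw records, so the same model can score any two records in the unified dataset.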
Mastering a dataset relies on clustering: because multiple records may refer to the same entity, it is useful to create clusters that hold all matching records.
Clustering identifies when two or more records refer to the same real-world entity, such as a customer, supplier, person, or organization. The clustering process puts pairs of matching records into clusters.
The end result is clusters of matching records that correspond to unique entities. You can merge the records in each cluster to form a single, merged record that describes an entity.
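As a rough illustration of going from matching pairs to clusters and then to merged records, the sketch below treats each cluster as a connected component of the "Match" pairs, using a small union-find. This is a simplifying assumption, not a description of Tamr's clustering algorithm. The `merge_records` rule (most common non-empty value per attribute) and the sample data are likewise hypothetical.

```python
from collections import Counter, defaultdict


def cluster_matches(record_ids, matching_pairs):
    # Union-find over record IDs: records linked by a chain of matching
    # pairs end up in the same cluster (connected components).
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matching_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the cluster representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = defaultdict(set)
    for rid in record_ids:
        clusters[find(rid)].add(rid)
    return dict(clusters)


def merge_records(records):
    # Hypothetical merge rule: take the most common non-empty value per attribute.
    merged = {}
    attributes = sorted({key for rec in records for key in rec})
    for attr in attributes:
        values = [rec[attr] for rec in records if rec.get(attr) not in (None, "")]
        merged[attr] = Counter(values).most_common(1)[0][0] if values else None
    return merged


clusters = cluster_matches([1, 2, 3, 4, 5], [(1, 2), (2, 3)])

merged = merge_records([
    {"name": "Acme Corp", "phone": ""},
    {"name": "Acme Corp", "phone": "555-0100"},
    {"name": "ACME", "phone": None},
])
```

Here records 1, 2, and 3 form one cluster even though 1 and 3 were never directly paired, while records 4 and 5 remain singleton clusters. The cluster representative (the dictionary key) is the kind of identifier that could serve as a key in other systems.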
You can also use the cluster information as a key in other systems. See Curating and Reviewing Record Clusters.
After you generate, curate, and review clusters, you can publish them. Publishing saves the current clusters as the latest version visible to downstream consumers, creating a dataset snapshot. This snapshot allows you to track changes to published clusters over time.
Tamr captures cluster change metrics, such as the number of clusters with new members, between the current clustering results and the latest published clusters. You can locate these metrics in the clustering Curator and Review workflows.
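The kind of comparison behind such metrics can be sketched as follows. This is an illustration only: the metric names, the `cluster_change_metrics` function, and the assumption that cluster IDs stay stable between the published and current versions are all hypothetical, and the actual metrics Tamr reports may be defined differently.

```python
from collections import defaultdict


def members_by_cluster(assignments):
    # Invert a {record_id: cluster_id} mapping into {cluster_id: set of record_ids}.
    clusters = defaultdict(set)
    for record_id, cluster_id in assignments.items():
        clusters[cluster_id].add(record_id)
    return clusters


def cluster_change_metrics(published, current):
    # Compare the latest published assignments with the current ones.
    # Assumes cluster IDs are stable across versions (hypothetical).
    pub = members_by_cluster(published)
    cur = members_by_cluster(current)
    return {
        "clusters_with_new_members": sum(
            1 for cid in cur if cid in pub and cur[cid] - pub[cid]
        ),
        "clusters_with_removed_members": sum(
            1 for cid in pub if cid in cur and pub[cid] - cur[cid]
        ),
        "new_clusters": len(cur.keys() - pub.keys()),
        "removed_clusters": len(pub.keys() - cur.keys()),
    }


# Hypothetical sample assignments.
published = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c3"}
current = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c3", "r5": "c4"}
metrics = cluster_change_metrics(published, current)
```

In this sample, record `r3` moved into cluster `c1`, which empties `c2`, and a new record `r5` created cluster `c4`, so one cluster gained members, one cluster is new, and one was removed.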
You can access historical cluster change metrics through RESTful APIs. See Publishing Clusters.