Tamr Documentation

Mastering

Understand the basics of a mastering project, such as record pair and cluster curation.

A Mastering Project solves the task of finding records that refer to the same entity within and across data sources. This task, often referred to as data mastering, entity resolution or record linkage, is principally employed in data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM), and matching projects such as entity detection and enrichment.

Working with the Unified Dataset

The first step in a mastering project is to add one or more sources to be mastered from Tamr's registered datasets to the project's datasets. A project's sources focus on a single logical entity, such as customers or products.

Once added to the project, sources are mapped to a single unified dataset, which is then given an initial configuration so that Tamr's machine learning can interpret it (Working with the Unified Dataset).
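As an illustration of what mapping sources to one unified dataset means, here is a minimal sketch in plain Python. It is not Tamr's API: the function to_unified, the attribute names, and the source names "crm" and "erp" are all hypothetical, and the only assumption is that each source attribute maps to exactly one unified attribute.

```python
def to_unified(record, mapping, source_name):
    """Rename a source record's attributes to the unified attribute names."""
    # Keep track of where the record came from (hypothetical "source" field).
    unified = {"source": source_name}
    for source_attr, unified_attr in mapping.items():
        # A missing source attribute becomes None in the unified record.
        unified[unified_attr] = record.get(source_attr)
    return unified

# Hypothetical mappings: two sources describe customers with different column names.
crm_mapping = {"cust_name": "name", "postal": "zip"}
erp_mapping = {"CustomerName": "name", "ZipCode": "zip"}

unified_dataset = [
    to_unified({"cust_name": "Acme Corp", "postal": "02139"}, crm_mapping, "crm"),
    to_unified({"CustomerName": "ACME Corporation", "ZipCode": "02139"}, erp_mapping, "erp"),
]
```

After mapping, records from both sources share the same attribute names, so they can be compared with one another.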

Working with Record Pairs

Once the unified dataset is created, the next step is to create a binning model that generates record pairs that are potential matches (Generating Record Pairs).
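The idea behind binning can be sketched as follows. This is a simplified illustration, not Tamr's binning model: the key function blocking_key (first three letters of the name plus the postal code) and the sample records are hypothetical. Records are placed into bins by key, and only records that share a bin are paired, which avoids comparing every record with every other record.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Hypothetical binning key: first three letters of the name plus the postal code."""
    name = record["name"].strip().lower()
    return (name[:3], record["zip"])

def candidate_pairs(records):
    """Return the set of record-id pairs whose records fall into the same bin."""
    bins = defaultdict(list)
    for rec in records:
        bins[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in bins.values():
        # Pair up every two records within a bin; ids are sorted so each pair is (low, high).
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

records = [
    {"id": 1, "name": "Acme Corp", "zip": "02139"},
    {"id": 2, "name": "ACME Corporation", "zip": "02139"},
    {"id": 3, "name": "Apex Ltd", "zip": "10001"},
    {"id": 4, "name": "Acme Corp", "zip": "94105"},
]
pairs = candidate_pairs(records)
```

With this key, only records 1 and 2 share a bin, so one candidate pair is generated instead of the six pairs that an all-against-all comparison would produce.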

After a Curator has initially classified a handful of arbitrary record pairs as matching or non-matching, Tamr begins learning and identifies high-impact record pairs for Reviewer feedback (Curating and Reviewing Record Pairs).
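One common way to choose pairs whose labels are most informative is uncertainty sampling: ask for feedback on the pairs the current model is least sure about. The sketch below assumes this strategy purely for illustration; the source does not say how Tamr selects high-impact pairs, and the function most_uncertain_pairs and the scores are hypothetical.

```python
def most_uncertain_pairs(scored_pairs, k):
    """Return the k pairs whose match probability is closest to 0.5.

    scored_pairs is a list of ((id_a, id_b), probability) tuples, where the
    probability is a hypothetical model's estimate that the pair is a match.
    """
    return sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))[:k]

scored = [
    ((1, 2), 0.97),  # confident match
    ((1, 3), 0.52),  # very uncertain
    ((2, 4), 0.08),  # confident non-match
    ((3, 4), 0.45),  # uncertain
]
review_queue = most_uncertain_pairs(scored, 2)
```

Here the two pairs with scores near 0.5 are queued for review, while the confident match and the confident non-match are skipped.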

Curating and Reviewing Record Clusters

Mastering a dataset relies on clustering. Clustering identifies when two or more records refer to the same real-world entity, where an entity is a customer, supplier, or other person or organization. Because more than two records may refer to the same entity, pairs of matching records are grouped into clusters that hold all of the records for that entity.
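A simple way to picture how matching pairs become clusters is to treat each matching pair as a link and collect records that are linked directly or indirectly. The sketch below does this with a union-find structure. It is a simplification assumed for illustration (plain connected components), not a description of Tamr's clustering algorithm, and the function cluster_records is hypothetical.

```python
def cluster_records(record_ids, matching_pairs):
    """Group record ids into clusters of records linked by matching pairs."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matching_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())

# Records 1-2 and 2-3 were classified as matching; 4 and 5 match nothing.
clusters = cluster_records([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
```

Records 1 and 3 end up in the same cluster even though they were never compared directly, because both match record 2. Records with no matches form clusters of one.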

The end result is clusters of matching records that correspond to unique entities. The records in each cluster can be merged to form a single, merged record describing an entity; or the cluster information can be used as a key in other systems (Curating and Reviewing Record Clusters).
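Merging the records of a cluster into a single record can be sketched as below. The rule used here (take the most common non-empty value for each attribute) is an assumption chosen for illustration; the function merge_cluster and the sample data are hypothetical, and other merge rules are equally possible.

```python
from collections import Counter

def merge_cluster(records, attributes):
    """Build one merged record from the records in a cluster.

    For each attribute, the most common non-empty value wins; if no record
    has a value for the attribute, the merged value is None.
    """
    merged = {}
    for attr in attributes:
        values = [rec[attr] for rec in records if rec.get(attr)]
        merged[attr] = Counter(values).most_common(1)[0][0] if values else None
    return merged

cluster = [
    {"name": "Acme Corp", "city": "Boston", "phone": ""},
    {"name": "Acme Corp", "city": "", "phone": "555-0100"},
    {"name": "ACME Corporation", "city": "Boston", "phone": "555-0100"},
]
merged = merge_cluster(cluster, ["name", "city", "phone"])
```

The merged record fills gaps in individual records: no single source record has both a city and a phone number alongside the majority spelling of the name, but the merged record does.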