Tamr Documentation


Understand the basics of a mastering project, such as record pair and cluster curation.

A mastering project solves the task of finding records that refer to the same entity within and across data sources. This task, often referred to as data mastering, entity resolution, or record linkage, is principally employed in data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM), and in matching projects, such as entity detection and enrichment.

Working with the Unified Dataset

The first step in a mastering project is to add one or more sources to be mastered from Unify's registered datasets to the project's datasets. All of a project's sources should describe a single logical entity, e.g. customers or products.

Once added to the project, the sources are mapped to a single unified dataset and given an initial configuration so that Unify's machine learning can interpret them (Working with the Unified Dataset).
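As a rough illustration of what mapping sources to a unified dataset means, the sketch below renames attributes from two sources to a shared set of unified attributes. The source names, attribute names, and the `to_unified` helper are all hypothetical and are not part of Unify; in practice the mapping is configured in the product.

```python
# Hypothetical source-to-unified attribute mappings; all names are illustrative only.
MAPPINGS = {
    "crm": {"cust_name": "name", "cust_city": "city"},
    "billing": {"full_name": "name", "town": "city"},
}


def to_unified(source_name, record):
    """Rename a source record's attributes to the unified attribute names."""
    mapping = MAPPINGS[source_name]
    unified = {"source": source_name}
    for source_attr, value in record.items():
        # Attributes without a mapping are left out of the unified record.
        if source_attr in mapping:
            unified[mapping[source_attr]] = value
    return unified


unified_dataset = [
    to_unified("crm", {"cust_name": "Acme Corp", "cust_city": "Boston", "internal_id": 7}),
    to_unified("billing", {"full_name": "ACME Corporation", "town": "Boston"}),
]
```

After mapping, both records expose the same `name` and `city` attributes, so they can be compared with each other even though the sources used different column names.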

Working with Record Pairs

Once the unified dataset is created, the next step is to create a binning model to generate record pairs that are potential matches (Generating Record Pairs).
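The general idea behind binning can be sketched as follows: records are placed into bins by a cheap rule, and only records that share a bin become candidate pairs. This is a minimal sketch under assumed data, not Unify's binning model; the `bin_key` rule (first three letters of the name plus the city) and the record fields are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bin_key(record):
    """Hypothetical binning rule: first three letters of the name plus the city."""
    return (record["name"][:3].lower(), record["city"].lower())


def generate_pairs(records):
    """Return candidate pairs of record ids that fall into the same bin."""
    bins = defaultdict(list)
    for record in records:
        bins[bin_key(record)].append(record["id"])
    pairs = set()
    for ids in bins.values():
        # Only records in the same bin are paired, which avoids comparing every
        # record with every other record.
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": 1, "name": "Acme Corp", "city": "Boston"},
    {"id": 2, "name": "ACME Corporation", "city": "Boston"},
    {"id": 3, "name": "Acme Ltd", "city": "London"},
    {"id": 4, "name": "Globex", "city": "Boston"},
]
candidate_pairs = generate_pairs(records)  # [(1, 2)]
```

Four records would yield six pairs if every record were compared with every other; the binning rule here reduces that to the single pair worth classifying.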

After a Curator has initially classified a handful of arbitrary record pairs as matching or non-matching, Unify begins learning and identifies high-impact record pairs for Reviewer feedback (Curating and Reviewing Record Pairs).
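This page does not describe how Unify decides which pairs are high impact. One common heuristic in active learning, shown here purely as an assumption, is uncertainty sampling: ask for feedback on the pairs whose predicted match score is closest to 0.5. The scores and the `most_uncertain` helper below are hypothetical.

```python
def most_uncertain(pair_scores, k):
    """Return the k pairs whose match score is closest to 0.5 (uncertainty sampling)."""
    # Ties are broken by the pair itself so that the ordering is deterministic.
    ranked = sorted(pair_scores, key=lambda pair: (abs(pair_scores[pair] - 0.5), pair))
    return ranked[:k]


# Hypothetical match probabilities produced by a model trained on Curator labels.
pair_scores = {(1, 2): 0.97, (1, 3): 0.52, (2, 3): 0.41, (3, 4): 0.05}
to_review = most_uncertain(pair_scores, 2)  # [(1, 3), (2, 3)]
```

Pairs scored near 0 or 1 are ones the model is already confident about, so labeling them tends to teach it less than labeling the borderline pairs.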

Curating and Reviewing Record Clusters

The process of mastering a dataset uses clustering. Clustering identifies when two or more records refer to the same real-world entity, where an entity is a customer, supplier, or other person or organization. Because more than two records may refer to the same entity, clustering groups pairs of matching records into clusters that hold all of the records for that entity.

The end result is a set of clusters of matching records, each corresponding to a unique entity. The records in each cluster can be merged to form a single record describing the entity, or the cluster information can be used as a key in other systems (Curating and Reviewing Record Clusters).
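One simple way to turn matching pairs into clusters is to treat each matching pair as a link and take the connected components. The sketch below does this with a union-find structure. It is an illustration of the concept only; the clustering that Unify performs may differ, and the `cluster_pairs` helper and record ids are hypothetical.

```python
def cluster_pairs(record_ids, matching_pairs):
    """Group records into clusters by linking matching pairs (connected components)."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent pointers to the cluster representative, compressing the path.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matching_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the representative so results are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return clusters


clusters = cluster_pairs([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
# {1: [1, 2, 3], 4: [4, 5]}
```

Records 1 and 3 were never classified as a pair, but they end up in the same cluster because both match record 2. The dictionary key plays the role of a cluster identifier that other systems could use as a key.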

Publishing Clusters

After clusters have been generated, curated, and reviewed, they can be published. Publishing saves the current clusters as the latest version visible to downstream consumers, and creates a snapshot of the current state of clusters in Unify. This snapshot is used to track changes over time.

Cluster change metrics, e.g. the number of clusters with new members, are computed dynamically by comparing the current clusters with the latest published clusters, and are presented in the clustering Curator and Review workflows.
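To make the idea of cluster change metrics concrete, the sketch below compares a published snapshot with the current clusters. It assumes that cluster identifiers are stable between versions and that each snapshot is a mapping from cluster id to a set of record ids. The metric names and the `cluster_changes` helper are hypothetical and do not reflect Unify's actual metric definitions or API.

```python
def cluster_changes(published, current):
    """Compare two {cluster_id: set_of_record_ids} snapshots and count changes."""
    # Assumption: a cluster keeps the same id from one version to the next.
    shared = published.keys() & current.keys()
    return {
        "new_clusters": len(current.keys() - published.keys()),
        "removed_clusters": len(published.keys() - current.keys()),
        "clusters_with_new_members": sum(
            1 for cid in shared if current[cid] - published[cid]
        ),
        "clusters_with_lost_members": sum(
            1 for cid in shared if published[cid] - current[cid]
        ),
    }


published = {"c1": {1, 2}, "c2": {3}, "c3": {4, 5}}
current = {"c1": {1, 2, 6}, "c3": {4}, "c4": {3, 5}}
metrics = cluster_changes(published, current)
```

In this example every metric is 1: cluster `c4` is new, `c2` was removed, `c1` gained record 6, and `c3` lost record 5.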

Historical cluster change metrics can be accessed through RESTful APIs (Publishing Clusters).