The process of data mastering
A mastering project helps you find records that refer to the same entity within and across input datasets. This task is often referred to as data mastering, entity resolution, or record linkage.
Data mastering is one of the major workflows you can use in:
- Data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM)
- Matching projects, such as entity detection and enrichment.
In a mastering project, your goal is to create a unified dataset, generate pairs, run Tamr machine learning to identify matches, and then review the results and publish clusters. This topic introduces each of these high-level steps in roughly the order you perform them.
The first step in a mastering project is to add one or more input datasets to the project. These are the datasets that will be mastered. A project's input datasets are focused on a single logical entity, such as customers or products.
Once you add the datasets to the project, you map them to a single unified dataset and initially configure them for Tamr machine learning to generate attribute recommendations. See Working with the Unified Dataset.
After you create the unified dataset, create a binning model to generate record pairs that are potential matches. A “pair” in Tamr is a list of similarities between corresponding attributes in two records. The mastering model classifies pairs as Match or No Match. See Generating Record Pairs.
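To make the idea of a pair concrete, here is a minimal Python sketch that groups records into bins and emits one pair per candidate match, where each pair carries a similarity per attribute. It is not Tamr's implementation: the sample records, the `bin_key` rule, and the use of `difflib.SequenceMatcher` as the similarity function are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records from a unified dataset (illustrative data only).
records = [
    {"id": "a1", "name": "Acme Corp", "city": "Boston"},
    {"id": "b7", "name": "ACME Corporation", "city": "Boston"},
    {"id": "c3", "name": "Globex", "city": "Chicago"},
]

def bin_key(record):
    # Hypothetical binning rule: records land in the same bin when the
    # first three letters of their names agree, ignoring case.
    return record["name"][:3].lower()

def similarity(a, b):
    # Generic string similarity in [0, 1]; a stand-in for whatever
    # similarity functions a real system would apply per attribute.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def generate_pairs(records, attributes=("name", "city")):
    # Group records by bin so only records in the same bin are compared.
    bins = defaultdict(list)
    for record in records:
        bins[bin_key(record)].append(record)

    pairs = []
    for members in bins.values():
        for left, right in combinations(members, 2):
            pairs.append({
                "ids": (left["id"], right["id"]),
                "similarities": {
                    attr: similarity(left[attr], right[attr])
                    for attr in attributes
                },
            })
    return pairs

pairs = generate_pairs(records)
```

Binning keeps the number of comparisons manageable: only records that share a bin are compared, so the Globex record here is never paired with either Acme record.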
After a Curator initially classifies a handful of arbitrary record pairs as matching or non-matching, Tamr begins learning and identifies high-impact record pairs for Reviewer feedback. See Curating and Reviewing Record Pairs.
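One common way to choose which pairs to send for feedback is uncertainty sampling: ask about the pairs the model is least sure of. The sketch below illustrates that heuristic only. How Tamr actually ranks high-impact pairs is not described here, and the `match_probability` field and the sample scores are hypothetical.

```python
# Hypothetical model output: pair ids with a predicted match probability.
scored_pairs = [
    {"ids": ("a1", "b7"), "match_probability": 0.97},
    {"ids": ("a1", "c3"), "match_probability": 0.52},
    {"ids": ("b7", "d9"), "match_probability": 0.08},
    {"ids": ("c3", "d9"), "match_probability": 0.41},
]

def most_uncertain(pairs, k):
    # Pairs whose probability is closest to 0.5 are the ones the model
    # is least sure about, so a label on them is assumed to be most useful.
    ranked = sorted(pairs, key=lambda p: abs(p["match_probability"] - 0.5))
    return ranked[:k]

review_queue = most_uncertain(scored_pairs, k=2)
```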
The next step in the data mastering process is clustering records. Multiple records may refer to the same real-world entity, such as a customer, supplier, person, or organization, so it is useful to create clusters that hold all matching records. The clustering process:
- Identifies when two or more records refer to the same real-world entity.
- Puts pairs of matching records into clusters.
The end result of the clustering process is clusters of matching records that correspond to unique entities.
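As a rough illustration of turning matching pairs into clusters, the following Python sketch treats every Match pair as an edge and groups records into connected components with a union-find structure. This is a simple stand-in technique, not a description of the clustering algorithm Tamr uses, and the record ids are made up.

```python
def cluster_records(record_ids, matching_pairs):
    # Each record starts as its own cluster root.
    parent = {record_id: record_id for record_id in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of the two records in every matching pair.
    for a, b in matching_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect records by their final root.
    clusters = {}
    for record_id in record_ids:
        clusters.setdefault(find(record_id), set()).add(record_id)
    return list(clusters.values())

clusters = cluster_records(
    ["a1", "b7", "c3", "d9"],
    [("a1", "b7"), ("b7", "d9")],
)
```

Note that connected components are transitive: `a1` and `d9` end up in the same cluster even though they were never directly classified as a match, which is one reason clusters still benefit from human review.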
After you generate, curate, and review clusters, you can publish them. Publishing saves the current clusters as the latest version visible to downstream consumers, creating a dataset snapshot. This snapshot allows you to track changes to published clusters over time. Tamr captures cluster change metrics, and you can:
- Obtain the cluster metrics, such as the number of clusters with new members, between the current clustering results and the latest published clusters. See Reviewing Clusters.
- Verify records in the cluster. See Curating Clusters.
- Access historical cluster change metrics through RESTful APIs. See Publishing Clusters.
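To show what a cluster change metric such as "clusters with new members" measures, here is a small Python sketch that compares two snapshots held in memory. It does not call any Tamr API. The snapshot layout (cluster id mapped to a set of record ids), the metric names, and the assumption that cluster ids stay stable between versions are all hypothetical.

```python
# Hypothetical snapshots: cluster id -> set of member record ids.
published = {
    "k1": {"a1", "b7"},
    "k2": {"c3"},
}
current = {
    "k1": {"a1", "b7", "d9"},
    "k2": {"c3"},
    "k3": {"e5"},
}

def cluster_change_metrics(published, current):
    # Clusters present in both snapshots can be compared member by member.
    shared = published.keys() & current.keys()
    return {
        "new_clusters": len(current.keys() - published.keys()),
        "removed_clusters": len(published.keys() - current.keys()),
        "clusters_with_new_members": sum(
            1 for cid in shared if current[cid] - published[cid]
        ),
        "clusters_with_removed_members": sum(
            1 for cid in shared if published[cid] - current[cid]
        ),
    }

metrics = cluster_change_metrics(published, current)
```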