In a mastering project, you "deduplicate" the unified dataset by identifying characteristics that indicate record similarity or difference, then review and validate pairs and clusters of records. This effort is often referred to as data mastering, entity resolution, or record linkage.
Data mastering is one of the major workflows you can use in:
- Data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM).
- Matching projects, such as entity detection and enrichment.
In a mastering project, you create a unified dataset and generate an initial set of record pairs that you identify as matching or non-matching pairs. You then iteratively run the machine learning model to improve the accuracy of the resulting pairs and clusters, and then refine the clusters.
This topic introduces these high-level steps:
- Step 1: Begin creating a unified dataset
- Step 2: Generate initial record pairs to identify matches
- Step 3: Train the matching model by labelling pairs
- Step 4: Curate and review record clusters
Note: If enabled, you can also include data enrichment as part of the mastering workflow. See Managing Enrichment Projects.
The initial stage of a mastering project is similar to that of a Schema Mapping Projects: an admin creates the project and uploads one or more input datasets. As in a schema mapping project, curators then map attributes in the input datasets to attributes in the unified schema.
Curators then complete additional configuration in the unified schema that is specific to mastering projects:
- Identify the single logical entity, such as people, customers, or products, for mastering.
- Indicate which of the unified attributes contribute to finding similarities and differences among the records.
- Specify the tokenizers and similarity functions for machine learning models to use when comparing data values and finding similarities and differences.
- Optionally set up transformations for the data in the unified dataset.
Curators use their knowledge of the data to create a "blocking model." A blocking model uses one or more of the unified attributes to filter out pairs of records that obviously do not match each other. An example for mastering data for individuals might be,
last_name must be at least 90% similar and
phone must be at least 75% similar for two records to be a potential match. See Defining the Blocking Model.
From the blocking model, Tamr Core generates an initial set of record pairs that pass the filter. A verifier or curator evaluates a random sample of the pairs and labels them as being either a Match or No Match. This labeling effort provides additional input for the machine learning model to use in finding similarities between records. See Training Initial Pairs.
After a curator provides the initial training sample and updates the model with this feedback, the machine learning model generates additional sets of record pairs that include a match or no match suggestion and a level of confidence for that suggestion. Curators or verifiers assign these record pairs to reviewers so that they can contribute their expertise in identifying correct and incorrect matching pairs and correct and incorrect non-matching pairs. Curators and verifiers then validate the work of the reviewers and accept or reject their input. See Curating Record Pairs (curators) and Viewing and Verifying Record Pairs (verifiers).
The first time it runs to apply the initial training sample and generate suggestions for record pairs, the machine learning model also generates record clusters. Tamr Core automatically assigns a baseline cluster ID to each cluster. Step 4 can begin while steps 2 and 3 are ongoing.
Because more than two records can refer to the same real-world entity, the next step in a mastering project uses all of the input about pairs of records to cluster all matching records together. See Working with Clusters.
The clustering process:
- Identifies records that refer to the same real-world entity.
- Puts all of those records, and only those records, into a cluster.
Curators and verifiers review and validate clusters, merging them together if they contain records for the same entity, or splitting them if they contain records for different entities. The result is clusters of records that each correspond to a different unique entity. See Verifying Clusters.
Curators periodically “publish” clusters to reflect changes to the blocking model, pair labels, or the clusters themselves. This process updates the cluster IDs and allows for comparison to the initial, baseline clusters and subsequent published snapshots. Curators can use cluster metrics for precision and recall to evaluate whether the machine learning model is improving as a result of cluster curation. See Precision and Recall Metrics.
When you are confident that your mastering project groups records into clusters accurately, you can:
- Create a single record for each unique entity from the best available data in records with the same cluster ID. See Golden Records Projects.
- Use the cluster information as a key in other systems.
For more information, see User Roles and Tamr Core Documentation.
Updated 6 months ago