In a Tamr mastering project, you "deduplicate" the unified dataset by identifying characteristics that indicate record similarity or difference, then review and validate pairs and clusters of records. This effort is often referred to as data mastering, entity resolution, or record linkage.
Data mastering is one of the major workflows you can use in:
- Data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM).
- Matching projects, such as entity detection and enrichment.
In a mastering project, you create a unified dataset, generate an initial set of record pairs and label them as matches or non-matches, iteratively run the Tamr matching model to improve the accuracy of the resulting pairs and clusters, and then refine the clusters.
Each of these high-level steps is introduced in this topic.
- Step 1: Begin creating a unified dataset
- Step 2: Generate initial record pairs to identify matches
- Step 3: Train the matching model by labeling pairs
- Step 4: Curate and review record clusters
Note: If data enrichment is enabled, you can also include it as part of the mastering workflow. See Managing Enrichment Projects.
The initial stage of a mastering project is similar to that of a schema mapping project: an admin creates the project and uploads one or more input datasets. As in a schema mapping project, curators then map attributes in the input datasets to attributes in the unified schema.
Curators then complete additional configuration in the unified schema that is specific to mastering projects:
- Identify the single logical entity, such as people, customers, or products, for mastering.
- Indicate which of the unified attributes contribute to finding similarities and differences among the records.
- Specify the tokenizers and similarity functions for Tamr's supervised learning models to use when comparing data values and finding similarities and differences.
- Optionally set up transformations for the data in the unified dataset.
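To make the idea of a tokenizer and a similarity function concrete, the following is a minimal sketch in plain Python. It is not Tamr's implementation: the names `word_tokens` and `jaccard` are hypothetical, and Jaccard similarity over word tokens is just one common choice used here for illustration.

```python
import re


def word_tokens(value: str) -> set:
    """Hypothetical tokenizer: lowercase, then split on non-alphanumeric characters."""
    return {token for token in re.split(r"[^0-9a-z]+", value.lower()) if token}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: shared tokens / all distinct tokens."""
    tokens_a, tokens_b = word_tokens(a), word_tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0  # assumption: two empty values are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# "Acme Corp." and "ACME corp" tokenize to the same set, so they score 1.0.
same = jaccard("Acme Corp.", "ACME corp")
# "Acme Corp" and "Acme Inc" share one of three distinct tokens.
partial = jaccard("Acme Corp", "Acme Inc")
```

The choice of tokenizer changes what counts as "similar": a word tokenizer like this one ignores punctuation and case, while a character-level tokenizer would be more tolerant of typos within a word.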
Curators use their knowledge of the data to create a "blocking model." A blocking model uses one or more of the unified attributes to filter out pairs of records that obviously do not match each other. For example, a blocking model for mastering data about individuals might require that last_name be at least 90% similar and phone be at least 75% similar for two records to be a potential match. See Defining the Blocking Model.
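The example rule above can be sketched in plain Python as a filter over candidate pairs. This is an illustration only, not how Tamr evaluates a blocking model: `similarity` and `candidate_pairs` are hypothetical names, the standard library's `difflib.SequenceMatcher` ratio stands in for whatever similarity function the project is configured with, and comparing every pair of records is practical only for a small sample.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(records, rules):
    """Yield ID pairs for which every (attribute, minimum similarity) rule passes."""
    for left, right in combinations(records, 2):
        if all(
            similarity(left[attr], right[attr]) >= minimum
            for attr, minimum in rules.items()
        ):
            yield (left["id"], right["id"])


# Hypothetical sample records.
records = [
    {"id": 1, "last_name": "Anderson", "phone": "555-0100"},
    {"id": 2, "last_name": "ANDERSON", "phone": "555 0100"},
    {"id": 3, "last_name": "Baker", "phone": "555-0199"},
]

# last_name must be at least 90% similar and phone at least 75% similar.
rules = {"last_name": 0.90, "phone": 0.75}
pairs = list(candidate_pairs(records, rules))
```

Only records 1 and 2 pass both thresholds here. Pairs that fail the filter are never shown to reviewers or scored by the matching model, which is why the thresholds should exclude only obvious non-matches.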
From the blocking model, Tamr generates an initial set of record pairs that pass the filter. A curator evaluates a random sample of the pairs and labels each one as either a match or a no-match. This labeling effort provides additional input for the Tamr matching model to use in finding similarities between records. See Generating Record Pairs.
After a curator provides the initial training sample and updates the model with this feedback, the Tamr matching model generates additional sets of record pairs that include a match or no-match suggestion and a level of confidence for that suggestion. Curators or verifiers assign these record pairs to reviewers so that they can contribute their expertise in identifying correct and incorrect matches and correct and incorrect no-matches. Curators and verifiers then validate the work of the reviewers and accept or reject their input. See Curating and Reviewing Record Pairs (curators) and Viewing and Verifying Record Pairs (verifiers).
The first time the Tamr matching model runs, to apply the initial training sample and generate suggestions for record pairs, it also generates record clusters. Tamr automatically assigns a baseline cluster ID to each cluster. Step 4 can begin while Steps 2 and 3 are ongoing.
Because more than two records can refer to the same real-world entity, the next step in a mastering project uses all of the input about pairs of records to cluster all matching records together. See Working with Clusters.
The clustering process:
- Identifies records that refer to the same real-world entity.
- Puts all of those records, and only those records, into a cluster.
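As a rough illustration of how pairwise match decisions can be turned into clusters, the sketch below groups records by following match links transitively (connected components, using a union-find structure). This is an assumption made for illustration: the source does not describe Tamr's clustering algorithm, and `cluster_records` is a hypothetical name.

```python
def cluster_records(record_ids, matched_pairs):
    """Group record IDs so that any two records linked by a chain of matches share a cluster."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the root, shortening the path along the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values())


# Records 1-2 and 2-3 were labeled as matches; 4 and 5 match nothing.
clusters = cluster_records([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
```

Records 1 and 3 end up in the same cluster even though that pair was never labeled directly. Purely transitive grouping can also chain unrelated records together through a single wrong match, which is one reason the review step that follows matters.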
Curators and verifiers review and validate clusters, merging them together if they contain records for the same entity, or splitting them if they contain records for different entities. The result is clusters of records that each correspond to a different unique entity. See Verifying Clusters.
Curators periodically "publish" clusters to reflect changes to the blocking model, pair labels, or the clusters themselves. This process updates the cluster IDs and allows for comparison to the initial, baseline clusters and subsequent published snapshots. Curators can use cluster metrics for precision and recall to evaluate whether the Tamr matching model is improving as a result of cluster curation. See Precision and Recall Metrics.
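One common way to define precision and recall for clusters is pairwise: count the pairs of records that share a cluster in the predicted clusters and compare them with the pairs that share a cluster in a verified reference. The sketch below uses that definition as an assumption; it may differ from how Tamr computes its metrics, and the function names are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(assignment):
    """Return the set of record-ID pairs that share a cluster ID."""
    by_cluster = {}
    for record_id, cluster_id in assignment.items():
        by_cluster.setdefault(cluster_id, []).append(record_id)
    pairs = set()
    for members in by_cluster.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_precision_recall(predicted, reference):
    """Compare predicted cluster assignments with verified reference assignments."""
    predicted_pairs = co_clustered_pairs(predicted)
    reference_pairs = co_clustered_pairs(reference)
    true_positives = len(predicted_pairs & reference_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 1.0
    recall = true_positives / len(reference_pairs) if reference_pairs else 1.0
    return precision, recall


# Hypothetical assignments: record ID -> cluster ID.
predicted = {1: "a", 2: "a", 3: "a", 4: "b"}
reference = {1: "x", 2: "x", 3: "y", 4: "y"}
precision, recall = pairwise_precision_recall(predicted, reference)
```

In this example the predicted clusters put three pairs together, only one of which is correct, and they miss one of the two reference pairs. Low precision suggests clusters that need splitting; low recall suggests clusters that need merging.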
When you are confident that your mastering project is accurately grouping records into clusters, you can:
- Create a single record for each unique entity from the best available data in records with the same cluster ID. See Golden Records Projects.
- Use the cluster information as a key in other systems.
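As a sketch of what "a single record for each unique entity" can look like, the example below groups records by cluster ID and keeps the most common non-empty value for each attribute. The most-common-value rule, the `cluster_id` field name, and the function names are assumptions for illustration; a golden records project defines its own rules.

```python
from collections import Counter


def consolidate(records, attributes):
    """Pick the most common non-empty value per attribute; None if no value exists."""
    golden = {}
    for attr in attributes:
        values = [r[attr] for r in records if r.get(attr)]
        golden[attr] = Counter(values).most_common(1)[0][0] if values else None
    return golden


def golden_records(records, attributes):
    """Build one consolidated record per cluster ID."""
    by_cluster = {}
    for record in records:
        by_cluster.setdefault(record["cluster_id"], []).append(record)
    return {cid: consolidate(members, attributes) for cid, members in by_cluster.items()}


# Hypothetical clustered records.
records = [
    {"cluster_id": "c1", "name": "Acme Corp", "phone": None},
    {"cluster_id": "c1", "name": "Acme Corp", "phone": "555-0100"},
    {"cluster_id": "c1", "name": "ACME", "phone": ""},
    {"cluster_id": "c2", "name": "Baker Ltd", "phone": "555-0199"},
]
golden = golden_records(records, ["name", "phone"])
```

The resulting dictionary is keyed by cluster ID, which is also the kind of value other systems can store as a key to refer back to the mastered entity.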
For more information, see User Roles and Tamr Documentation.