In a Tamr mastering project, you "de-duplicate" the unified dataset by identifying characteristics that indicate record similarity or difference, then review and validate pairs and clusters of records. This effort is often referred to as data mastering, entity resolution, or record linkage.
Data mastering is one of the major workflows you can use in:
- Data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM)
- Matching projects, such as entity detection and enrichment
In a mastering project, you create a unified dataset, generate an initial set of record pairs and label them as matches or non-matches, iteratively run Tamr machine learning to improve the accuracy of the resulting pairs and clusters, and then publish and refine the clusters.
The process of data mastering
This topic introduces each of these high-level steps:
- Step 1: Begin creating a unified dataset
- Step 2: Generate record pairs to identify matches
- Step 3: Curate and review record clusters
- Step 4: Publish clusters
The initial stage of a mastering project is similar to that of a schema mapping project: an admin creates the project and uploads one or more input datasets. As in a schema mapping project, curators then map attributes in the input datasets to attributes in the unified schema.
Curators then complete additional configuration in the unified schema that is specific to mastering projects:
- Identify the single logical entity, such as people, customers, or products, that is being mastered.
- Indicate which of the unified attributes will contribute to finding similarities and differences among the records.
- Specify the tokenizers and similarity functions for Tamr's supervised learning models to use when comparing data values and finding similarities and differences.
- Optionally set up transformations for the data in the unified dataset.
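The configuration choices above can be pictured as a simple settings structure. The names below are illustrative only, not Tamr's actual API; they are a hypothetical sketch of what a curator decides for the unified schema.

```python
# Hypothetical sketch of the choices a curator records for a mastering
# project's unified schema. All keys and values here are illustrative,
# not Tamr's actual configuration format.
unified_schema_config = {
    # The single logical entity being mastered (people, customers, products).
    "entity_type": "customer",
    # Unified attributes that contribute to finding similarities/differences.
    "ml_attributes": ["first_name", "last_name", "phone", "address"],
    # Tokenizer and similarity function chosen per attribute.
    "attribute_settings": {
        "last_name": {"tokenizer": "default", "similarity": "cosine"},
        "phone": {"tokenizer": "bigram", "similarity": "jaccard"},
    },
}

print(unified_schema_config["entity_type"])  # customer
```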
Curators use their knowledge of the data to create a "blocking model". A blocking model uses one or more of the unified attributes to filter out pairs of records that obviously do not match each other. An example of a blocking model for mastering data about individuals might require that, for two records to be a potential match:
- last_name must be at least 90% similar, and
- phone must be at least 75% similar.

See Defining the Blocking Model.
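A blocking rule of this kind can be sketched in a few lines. This is a simplified stand-in, not Tamr's implementation: it uses Python's `difflib` ratio as a rough similarity function and hard-codes the example thresholds.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; a stand-in for Tamr's
    configurable similarity functions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def passes_blocking(rec_a: dict, rec_b: dict) -> bool:
    """Example blocking rule: last_name at least 90% similar AND
    phone at least 75% similar for the pair to remain a candidate."""
    return (similarity(rec_a["last_name"], rec_b["last_name"]) >= 0.90
            and similarity(rec_a["phone"], rec_b["phone"]) >= 0.75)


a = {"last_name": "Smith", "phone": "555-0142"}
b = {"last_name": "Smith", "phone": "555-0142"}
c = {"last_name": "Jones", "phone": "999-7777"}
print(passes_blocking(a, b))  # True: identical values pass the filter
print(passes_blocking(a, c))  # False: clearly different records are filtered out
```

Pairs that fail the rule are discarded before any machine learning runs, which keeps the number of candidate pairs manageable.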
From the blocking model, Tamr generates an initial set of record pairs that pass the filter. A curator evaluates a random sample of the pairs and labels them as being either a match or no-match. This labeling effort provides additional input for the Tamr matching model to use in finding similarities between records. See Generating Record Pairs.
The Tamr matching model can then generate additional sets of record pairs that include a match or no-match suggestion and a level of confidence for that suggestion. Curators assign these record pairs to reviewers so that they can contribute their expertise in identifying correct and incorrect matches and correct and incorrect no-matches. Curators then validate the work of the reviewers and accept or reject their input. See Curating and Reviewing Record Pairs.
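Conceptually, each generated pair carries a suggestion plus a confidence for reviewers to confirm or overturn. The scoring below is a toy stand-in for Tamr's supervised model, shown only to illustrate the suggestion/confidence shape.

```python
# Toy stand-in for the matching model's output on a candidate pair.
# The confidence formula is illustrative, not Tamr's actual computation.
def suggest(pair_score: float, threshold: float = 0.5) -> dict:
    """Turn a model score in [0, 1] into a match/no-match suggestion
    with a confidence that grows as the score moves away from the
    decision threshold."""
    suggestion = "match" if pair_score >= threshold else "no-match"
    confidence = abs(pair_score - threshold) / max(threshold, 1 - threshold)
    return {"suggestion": suggestion, "confidence": round(confidence, 2)}


print(suggest(0.95))  # {'suggestion': 'match', 'confidence': 0.9}
print(suggest(0.10))  # {'suggestion': 'no-match', 'confidence': 0.8}
```

Reviewers label the suggested pairs, and those verified labels feed back into the model on the next training run.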
The Tamr matching model also generates record clusters, so step 3 can begin while step 2 is ongoing.
Because more than two records can refer to the same real-world entity, the next step in a mastering project uses all of the input about what makes a pair of records match or not to find and cluster all matching records together. See Curating and Reviewing Record Clusters.
The clustering process:
- Identifies records that refer to the same real-world entity.
- Puts all of those records, and only those records, into a cluster.
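One way to picture this grouping step: treat verified "match" pairs as edges and collect records into connected components. This is a deliberate simplification of Tamr's clustering, sketched here with a plain union-find structure.

```python
# Simplified clustering sketch (not Tamr's algorithm): records linked by
# any chain of matched pairs end up in the same cluster.
def cluster(record_ids, matched_pairs):
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the set representative,
        # compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # union the two sets

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), set()).add(r)
    return sorted(groups.values(), key=min)


print(cluster([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
# -> [{1, 2, 3}, {4, 5}]
```

Note how records 1 and 3 land in the same cluster even though they were never directly paired; the match between each of them and record 2 links them transitively.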
Curators review and validate clusters, merging them together if they contain records for the same entity, or splitting them if they contain records for different entities. The result is a set of clusters, each corresponding to a distinct entity. See Curating Clusters.
Curators can use cluster metrics for precision and recall to evaluate whether the Tamr matching model is improving as a result of cluster curation. See Precision and Recall Metrics.
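In this context, precision asks "of the record pairs the model placed together, how many truly match?" and recall asks "of the truly matching pairs, how many did the model place together?" The counts below are made up for illustration.

```python
# Standard precision/recall definitions, applied to pairwise match decisions.
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of pairs clustered together that truly match."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of truly matching pairs that were clustered together."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts: 8 correct pairings, 2 spurious ones, and 2 true
# matches the model missed.
print(precision(8, 2))  # 0.8
print(recall(8, 2))     # 0.8
```

Rising precision and recall across curation rounds indicate that the feedback is improving the model; a drop in either signals that recent labels or cluster edits should be revisited.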
Publishing assigns persistent IDs to clusters, so step 4 can be initiated concurrently with step 3 or at any time afterward. If you change the blocking model, record pair labels, or cluster verification or membership before you publish clusters for the first time, any changes made to the initial clusters are lost. See Publishing Clusters.
Publishing saves the current clusters as the latest version visible to downstream consumers, creating a dataset snapshot. This snapshot allows you to track changes to published clusters over time. After publishing, you can:
- Merge the records in each cluster to form a single, merged record that describes an entity. See Golden Records.
- Use the cluster information as a key in other systems.
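The first option, merging each cluster into a single record, can be pictured with a simple survivorship rule. The rule below (most common non-empty value per attribute) is a hypothetical stand-in; real golden-record rules are configurable and vary by attribute.

```python
from collections import Counter


def merge_cluster(records: list[dict]) -> dict:
    """Collapse one cluster into a single 'golden' record by keeping the
    most common non-empty value for each attribute. Illustrative only."""
    golden = {}
    for attr in sorted({k for r in records for k in r}):
        values = [r[attr] for r in records if r.get(attr)]
        if values:
            golden[attr] = Counter(values).most_common(1)[0][0]
    return golden


cluster = [
    {"name": "Ann Smith", "phone": "555-0142"},
    {"name": "Ann Smith", "phone": ""},
    {"name": "A. Smith", "phone": "555-0142"},
]
print(merge_cluster(cluster))
# -> {'name': 'Ann Smith', 'phone': '555-0142'}
```

The persistent cluster ID assigned at publish time is what lets downstream systems key on the entity rather than on any one source record.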
For more information, see User Roles and the Tamr Documentation.