A Mastering Project solves the task of finding records that refer to the same entity within and across data sources. This task, often referred to as data mastering, entity resolution or record linkage, is principally employed in data unification projects, such as Customer Data Integration (CDI) and Master Data Management (MDM), and matching projects such as entity detection and enrichment.
The first step in a mastering project is to add one or more sources to be mastered from Unify's registered datasets to the project's datasets. All of a project's sources describe a single logical entity, e.g. customers or products.
Once added to the project, sources are mapped to a single unified dataset and given an initial configuration so that Unify's machine learning can interpret them (Working with the Unified Dataset).
Once the unified dataset is created, the next step is to create a binning model that generates record pairs that are potential matches (Generating Records Pairs).
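The idea behind generating candidate pairs can be illustrated with a minimal sketch. It assumes a very simple binning rule (records share a bin when the first three letters of their lowercased name agree); the `bin_key` and `candidate_pairs` functions and the record fields are hypothetical and do not represent Unify's actual binning model.

```python
from collections import defaultdict
from itertools import combinations


def bin_key(record):
    """Hypothetical binning rule: first three letters of the lowercased name."""
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Return the sorted pairs of record ids that fall into the same bin."""
    bins = defaultdict(list)
    for record in records:
        bins[bin_key(record)].append(record["id"])

    pairs = set()
    for ids in bins.values():
        # Only records within the same bin are paired, which avoids
        # comparing every record against every other record.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return sorted(pairs)


# Illustrative records, not taken from any real dataset.
records = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Globex"},
    {"id": 4, "name": "Acme Inc."},
]
pairs = candidate_pairs(records)
```

Here the three "Acme" records share a bin and yield three candidate pairs, while "Globex" is never paired with them. A rule that is too strict misses true matches; one that is too loose generates more pairs than can be classified efficiently.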
After a Curator has classified an initial handful of arbitrary record pairs as matching or non-matching, Unify begins learning and identifies high-impact record pairs for Reviewer feedback (Curating and Reviewing Record Pairs).
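One common way to choose which pairs to send for feedback is uncertainty sampling: prefer the pairs whose predicted match probability is closest to 0.5, since a label on those pairs tells the model the most. The sketch below assumes that approach purely for illustration; how Unify actually ranks high-impact pairs is not described here, and the function name and scores are hypothetical.

```python
def most_uncertain_pairs(scored_pairs, k):
    """Return the k pairs whose match probability is closest to 0.5.

    scored_pairs is a list of ((id_a, id_b), match_probability) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [pair for pair, _probability in ranked[:k]]


# Hypothetical model scores for four candidate pairs.
scored = [
    ((1, 2), 0.97),  # confident match
    ((1, 4), 0.55),  # uncertain
    ((2, 4), 0.40),  # uncertain
    ((3, 5), 0.03),  # confident non-match
]
to_review = most_uncertain_pairs(scored, 2)
```

The two confident predictions are skipped, and the two uncertain pairs are selected for review.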
Mastering a dataset relies on clustering: identifying when two or more records refer to the same real-world entity, where an entity is a customer, supplier, or other person or organization. Because multiple records may refer to the same entity, clustering groups pairs of matching records into clusters, each of which holds all of the records that match one another.
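A minimal sketch of how matching pairs can be grouped into clusters follows. It assumes the simplest possible rule, transitive closure: if A matches B and B matches C, all three land in one cluster. This is implemented with a union-find structure and is only an illustration of the concept, not Unify's clustering algorithm; `cluster_records` is a hypothetical name.

```python
def cluster_records(record_ids, matching_pairs):
    """Group record ids into clusters by taking the transitive closure of matches."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matching_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())


# Records 1, 2 and 4 are linked by two matching pairs; 3 and 5 match nothing.
clusters = cluster_records([1, 2, 3, 4, 5], [(1, 2), (2, 4)])
```

Records 1 and 4 were never directly paired, yet they end up in the same cluster through record 2. Records without any match form singleton clusters.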
The end result is a set of clusters of matching records, each corresponding to a unique entity. The records in each cluster can be merged to form a single record describing the entity, or the cluster information can be used as a key in other systems (Curating and Reviewing Record Clusters).
After clusters have been generated, curated, and reviewed, they can be published. Publishing saves the current clusters as the latest version visible to downstream consumers, creating a dataset snapshot used in tracking changes to published clusters over time.
Cluster change metrics, e.g. the number of clusters with new members, are computed dynamically by comparing the current clusters with the latest published clusters. These metrics are presented in the clustering Curator and Review workflows.
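The comparison between current and published clusters can be sketched as follows. The example assumes each clustering is available as a mapping from a stable cluster ID to a set of record IDs; apart from "clusters with new members", the metric names are hypothetical and chosen for illustration rather than taken from Unify.

```python
def cluster_change_metrics(published, current):
    """Compare two clusterings given as {cluster_id: set_of_record_ids}."""
    shared = [cid for cid in current if cid in published]
    return {
        # Clusters that exist now but were not in the published version.
        "new_clusters": sum(1 for cid in current if cid not in published),
        # Clusters that were published but no longer exist.
        "removed_clusters": sum(1 for cid in published if cid not in current),
        # Existing clusters that gained at least one record.
        "clusters_with_new_members": sum(
            1 for cid in shared if current[cid] - published[cid]
        ),
        # Existing clusters that lost at least one record.
        "clusters_with_removed_members": sum(
            1 for cid in shared if published[cid] - current[cid]
        ),
    }


# Hypothetical published and current clusterings.
published = {"c1": {1, 2}, "c2": {3}, "c3": {4}}
current = {"c1": {1, 2, 5}, "c2": {3}, "c4": {6}}
metrics = cluster_change_metrics(published, current)
```

In this example, cluster `c1` gained record 5, `c3` disappeared, and `c4` is new, so each of those three counts is 1 and no cluster lost members.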
Historical cluster change metrics can be accessed through RESTful APIs (Publishing Clusters).