
Creating the Unified Dataset for Mastering

Working with the unified dataset of a mastering project.

The unified dataset of a mastering project represents the logical entity you wish to master, such as customers, products, contacts, or other entities important to your business. The unified attributes of the dataset are the attributes that best describe this entity across all input datasets.

The unified dataset accomplishes three important tasks for mastering:

Creating a Unified Dataset

Once you upload data, you can create your unified dataset.

To Create a Unified Dataset:

  1. Navigate to the Schema Mapping page.
  2. On the right side of the screen, enter a name for your unified dataset in the white box, then select Create Unified Dataset.

Next Steps

Once you create your unified dataset, you can continue on to schema mapping. To commit changes to the unified dataset, in the top right corner, select Update Unified Dataset.

Important: You must run the Update Unified Dataset job to apply your changes and view the unified dataset.

After you run Update Unified Dataset for the first time, you can:

Defining a Single Schema

The unified dataset defines a single schema across all source datasets. The unified schema is the largest set of attributes common to all input datasets in the project. The task in schema mapping is to map each common source attribute to this single schema.
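As an illustration of what mapping source attributes to a single schema means, here is a minimal Python sketch. The dataset names, attribute names, and the `SCHEMA_MAPPING` structure are hypothetical and only stand in for the mappings you define on the Schema Mapping page; this is not how Tamr Core stores or applies mappings.

```python
# Hypothetical source records from two input datasets with different column names.
crm_records = [{"cust_name": "Acme Corp", "zip": "02139", "crm_id": "17"}]
erp_records = [{"CompanyName": "ACME Corporation", "PostalCode": "02139"}]

# Hypothetical mapping: source dataset -> {source attribute: unified attribute}.
SCHEMA_MAPPING = {
    "crm": {"cust_name": "name", "zip": "postal_code"},
    "erp": {"CompanyName": "name", "PostalCode": "postal_code"},
}


def to_unified(dataset_name, record):
    """Rename mapped source attributes to unified attributes; drop unmapped ones."""
    mapping = SCHEMA_MAPPING[dataset_name]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Every record now shares the same attribute names, whatever its source.
unified = [to_unified("crm", r) for r in crm_records] + [
    to_unified("erp", r) for r in erp_records
]
```

After the rename, both records expose `name` and `postal_code`, so they can be compared attribute by attribute.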

Profiling Datasets

Profiling is an optional step that you can run at any point in the workflow. Profiling a dataset or an individual attribute creates derived metadata about that dataset, which can help you understand the data at multiple points in the project. Useful metadata includes counts and histograms of distinct values, as well as inferred data types. For more information, see Profiling a Dataset.
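To make the kind of metadata described here concrete, the following Python sketch computes a record count, a distinct-value histogram, and a rough inferred type for one attribute. The function names and the type labels are assumptions made for illustration; the actual profiling job may compute different or additional statistics.

```python
from collections import Counter


def infer_type(values):
    """Guess a type label for string values: integer, number, or string."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return "unknown"
    for label, caster in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                caster(v)
            return label
        except ValueError:
            continue
    return "string"


def profile_attribute(values):
    """Return simple derived metadata for a list of string values."""
    histogram = Counter(values)
    return {
        "record_count": len(values),
        "distinct_count": len(histogram),
        "histogram": dict(histogram),
        "inferred_type": infer_type(values),
    }


profile = profile_attribute(["10", "20", "10", ""])
```

In this sketch an empty string counts as its own distinct value in the histogram but is ignored when guessing the type.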

Configuring Machine Learning

Toggle Inclusion in Machine Learning

To the right of each unified attribute on the Schema Mapping page is a machine learning icon that you can deselect if you do not want that attribute to be included in machine learning. By default, this setting is on, so the attribute is used in machine learning. The setting determines whether the unified attribute is used to train the model. If you toggle it off, the attribute is still displayed as a column on the records page, but it is not used by the machine learning algorithms.

If you change the attributes included in machine learning, you must select Update Unified Dataset to apply the changes.


The machine learning icons are surrounded by a red box. Blue icons indicate that the attribute is included in machine learning, while white icons indicate that it is not.

Choosing a Tokenizer or Similarity Function

To compare values in a unified attribute when determining potential record matches, Tamr Core uses tokenizers and similarity functions.

To specify a tokenizer or similarity function for a unified attribute:

  1. Open the Schema Mapping page.
  2. Locate a unified attribute that is included in machine learning.
  3. Select More > Advanced.
  4. Use the dropdown to change the Similarity function.
  5. Use the dropdown to change the Tokenizer.

Two dropdowns appear for similarity settings after you select More > Advanced for a unified attribute.
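To show how a tokenizer and a similarity function work together, here is a generic Python sketch that scores two values with Jaccard similarity under two different tokenizers. The whitespace and bigram tokenizers and the Jaccard function are common textbook choices used here as assumptions; they are not a list of the options in the dropdowns, and the code does not reproduce Tamr Core's implementation.

```python
def whitespace_tokens(value):
    """Split a value into lowercase word tokens."""
    return set(value.lower().split())


def bigram_tokens(value):
    """Split a value into overlapping two-character tokens."""
    text = value.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b, tokenizer):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = tokenizer(a), tokenizer(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty values are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


word_score = jaccard("Acme Corp", "ACME Corporation", whitespace_tokens)
bigram_score = jaccard("Acme Corp", "ACME Corporation", bigram_tokens)
```

With word tokens, "corp" and "corporation" do not match at all, so only one of three distinct tokens is shared. With bigrams, the shared prefix contributes to the score, which is why the choice of tokenizer matters for abbreviated or misspelled values.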

Identifying Required Attributes

Configuring the Cluster Name Unified Attribute

This setting allows you to aggregate values for the unified attribute by the most common value in the set of records. The name is defined at project creation to answer the question "What are you mastering?" and can reflect sites, suppliers, or customers.


Selecting the cluster name attribute to be required.

To configure a unified attribute as the cluster name attribute:

  1. Select More.
  2. From the dropdown, choose the cluster identifier. For example, Customer.
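Aggregating by the most common value can be sketched in a few lines of Python. The function name and the handling of empty values and ties are assumptions made for illustration; the product may resolve those cases differently.

```python
from collections import Counter


def cluster_name(values):
    """Return the most common non-empty value among a cluster's records."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    # On a tie, Counter.most_common keeps the value that appeared first.
    return Counter(non_empty).most_common(1)[0][0]


# Hypothetical name values from the records in one cluster.
name = cluster_name(["Acme", "Acme Corp", "Acme", ""])
```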

Configuring the Spend Unified Attribute

The Spend setting allows you to specify a unified attribute of numeric data type to be aggregated per cluster. This makes it possible to navigate the clusters by Spend.

To configure a unified attribute as Spend:

  1. Select More.
  2. From the dropdown, choose Spend.

Configuring the Spend unified attribute.
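The following Python sketch shows one way a numeric attribute can be aggregated per cluster and then used to rank clusters. It assumes the aggregation is a sum; the field names `cluster_id` and `spend` and the sample figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical records, each already assigned to a cluster.
records = [
    {"cluster_id": "c1", "spend": 1200.0},
    {"cluster_id": "c2", "spend": 300.0},
    {"cluster_id": "c1", "spend": 800.0},
]


def spend_per_cluster(rows):
    """Sum the spend attribute over the records in each cluster."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["cluster_id"]] += row["spend"]
    return dict(totals)


totals = spend_per_cluster(records)
# Rank clusters from highest to lowest total spend.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```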

Configuring User Preferences

Configure Sorting and Searching Preferences

Select More to toggle whether the attribute is treated as a numeric value or as a string for sorting and searching purposes.
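The difference between the two treatments is easiest to see when sorting. This short Python sketch uses made-up values to illustrate the general behavior of string versus numeric ordering; it is not tied to the product's implementation.

```python
values = ["10", "9", "100", "25"]

# Treated as strings: compared character by character.
as_strings = sorted(values)

# Treated as numbers: compared by numeric magnitude.
as_numbers = sorted(values, key=float)
```

Sorted as strings, the values come out as 10, 100, 25, 9. Sorted as numbers, they come out as 9, 10, 25, 100.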