Tamr Documentation

Working with the Unified Dataset

Understanding and working with the unified dataset of a mastering project.

The unified dataset of a mastering project represents the logical entity you wish to master, such as customers, products, contacts, or other entities important to your business. The unified attributes of the dataset are the attributes that best describe this entity across all input datasets.

The unified dataset accomplishes three important tasks for mastering. It:

  • Defines a single schema, also known as a data format, across many (potentially thousands of) dissimilar input datasets. This schema is used for processing.
  • Powers entity resolution by Tamr machine learning with attribute-specific configurations.
  • Allows team members to provide feedback on the record pairs and record clusters that Tamr suggests.

Defining a Single Schema

The unified dataset defines a single schema across all source datasets to be mastered. The unified schema is the largest set of attributes common to all input datasets in the project. The task in schema mapping is to map each common source attribute to this single schema.
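As a rough illustration of what schema mapping produces, the sketch below renames attributes from two source datasets into one unified schema. The dataset names, attribute names, and the `to_unified` helper are hypothetical and are not part of Tamr; in Tamr the mapping is done on the Schema Mapping page.

```python
# Hypothetical mapping: (source dataset, source attribute) -> unified attribute.
ATTRIBUTE_MAPPING = {
    ("crm", "cust_name"): "name",
    ("crm", "zip"): "postal_code",
    ("billing", "CustomerName"): "name",
    ("billing", "PostCode"): "postal_code",
}


def to_unified(dataset, record, mapping=ATTRIBUTE_MAPPING):
    """Rename mapped source attributes; unmapped attributes are dropped."""
    unified = {}
    for source_attr, value in record.items():
        target = mapping.get((dataset, source_attr))
        if target is not None:
            unified[target] = value
    return unified


# Two dissimilar source records end up with the same set of attributes.
crm_row = to_unified("crm", {"cust_name": "Acme Corp", "zip": "02139", "owner": "jlee"})
billing_row = to_unified("billing", {"CustomerName": "ACME Corporation", "PostCode": "02139"})
```

After mapping, both records expose only `name` and `postal_code`, so they can be compared attribute by attribute.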

Dataset Profiling

Profiling is an optional step that you can run at any point in the workflow. Profiling attributes can help you understand the data at multiple points in the project. Profiling a dataset or individual attributes creates derived metadata about a given dataset. Useful metadata includes:

  • Counts and histograms of distinct values.
  • Inferred data types.

For more information, see Profiling a Dataset.
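To make the two kinds of metadata concrete, here is a minimal sketch of profiling a single attribute: it counts distinct values and guesses a data type. The `profile` and `infer_type` helpers and the type names are assumptions made for this example and do not reflect how Tamr computes profiles.

```python
from collections import Counter


def infer_type(values):
    """Guess 'integer', 'float', or 'string' for a list of raw string values."""
    def all_parse(cast):
        for value in values:
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if not values:
        return "string"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def profile(values):
    """Return distinct-value counts and an inferred type, ignoring empty values."""
    non_empty = [v for v in values if v not in (None, "")]
    counts = Counter(non_empty)
    return {
        "distinct_count": len(counts),
        "histogram": dict(counts),
        "inferred_type": infer_type(non_empty),
    }


amount_profile = profile(["120", "120", "75", ""])
name_profile = profile(["Acme Corp", "Globex"])
```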

Configuring Machine Learning

Toggle Inclusion in Machine Learning

To the right of each unified attribute is a diagonal bar icon that you can deselect if you do not want the attribute to be included in machine learning. By default, the icon is selected and the attribute is used in machine learning. This setting tells Tamr whether to use the unified attribute when training a model. If you toggle it off, the attribute still appears as a column on the records page, but Tamr does not use it in its machine learning algorithms.

The machine learning icon is surrounded by a red box. The blue icons indicate that the attribute will be included in machine learning, while the white icon indicates that it will not.

Choose a Tokenizer or Similarity Function

To compare values in a unified attribute when determining potential record matches, Tamr uses tokenizers and similarity functions.
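The sketch below shows, in general terms, why the choice of tokenizer matters when two values are compared. It uses a whitespace tokenizer, a character-bigram tokenizer, and Jaccard similarity as generic stand-ins; these are assumptions chosen for illustration and are not necessarily the options or implementations that Tamr offers.

```python
def word_tokens(text):
    """Split on whitespace after lowercasing."""
    return set(text.lower().split())


def bigram_tokens(text):
    """Overlapping two-character substrings after lowercasing."""
    lowered = text.lower()
    return {lowered[i:i + 2] for i in range(len(lowered) - 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


left, right = "Acme Corp", "Acme Corporation"
word_score = jaccard(word_tokens(left), word_tokens(right))
bigram_score = jaccard(bigram_tokens(left), bigram_tokens(right))
```

With word tokens, "corp" and "corporation" do not match at all, so only one of three distinct tokens is shared. With bigrams, the abbreviation still shares most of its tokens with the full word, so the same pair scores higher.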

To specify a tokenizer or similarity function for a unified attribute:

  1. Open the Schema Mapping page.
  2. Locate a unified attribute that is included in machine learning.
  3. Select the More menu (⁝ tricolon icon).
  4. Select Advanced.
  5. Use the drop-down list to change the Similarity function.
  6. Use the drop-down list to change the Tokenizer.

Two drop down lists appear for similarity settings after you select More and then Advanced for a unified attribute

Tip: You can also change how Tamr [sorts attribute values][doc:creating-unified-attributes-admin#section-changing-how-a-unified-attribute-is-sorted] in tables on subsequent pages of the project.

Identifying Required Attributes

Configure the Cluster Name Unified Attribute

The Supplier setting allows you to aggregate values for the unified attribute by the most common value in the set of records.
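As an illustration of aggregating by the most common value, the sketch below picks the most frequent non-empty value of an attribute within each cluster. The `cluster_id` and `supplier` field names and the `most_common_by_cluster` helper are hypothetical, and the tie-breaking rule shown (first value encountered wins) is an assumption, not documented Tamr behavior.

```python
from collections import Counter, defaultdict


def most_common_by_cluster(records, attribute):
    """Map each cluster ID to its most frequent non-empty attribute value."""
    values = defaultdict(list)
    for record in records:
        value = record.get(attribute)
        if value:
            values[record["cluster_id"]].append(value)
    # Counter.most_common keeps first-seen order among ties.
    return {cid: Counter(vals).most_common(1)[0][0] for cid, vals in values.items()}


supplier_records = [
    {"cluster_id": "c1", "supplier": "Acme Corp"},
    {"cluster_id": "c1", "supplier": "ACME"},
    {"cluster_id": "c1", "supplier": "Acme Corp"},
    {"cluster_id": "c2", "supplier": "Globex"},
]
cluster_names = most_common_by_cluster(supplier_records, "supplier")
```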

To configure a unified attribute as Supplier:

  1. Select More (⁝), the three vertical dots icon.
  2. Choose Supplier from the dropdown menu.

Configure the Spend Unified Attribute

The Spend setting allows you to specify a unified attribute of numeric data type to be aggregated per cluster. This makes it possible to navigate the clusters by Spend.
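The following sketch shows one way a numeric attribute can be aggregated per cluster and then used to rank clusters. It assumes the aggregation is a sum, which this page does not state, and the `cluster_id` and `spend` field names and the `spend_by_cluster` helper are hypothetical.

```python
from collections import defaultdict


def spend_by_cluster(records, attribute="spend"):
    """Sum a numeric attribute per cluster (assumes aggregation means sum)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["cluster_id"]] += float(record[attribute])
    return dict(totals)


spend_records = [
    {"cluster_id": "c1", "spend": 100.0},
    {"cluster_id": "c1", "spend": 250.0},
    {"cluster_id": "c2", "spend": 900.0},
]
totals = spend_by_cluster(spend_records)

# Rank clusters from highest to lowest total spend.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```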

To configure a unified attribute as Spend:

  1. Select More (⁝), the three vertical dots icon.
  2. Choose Spend from the dropdown menu.

Configuring the Spend unified attribute.

Configuring User Preferences

Configure Sorting and Searching Preferences

Select More (⁝) to toggle whether the attribute is treated as a numeric value or a string for sorting and searching purposes.
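The difference between the two treatments is easiest to see with a small example. The sketch below sorts the same values as strings and as numbers in plain Python; it illustrates the general behavior only and is not Tamr code.

```python
values = ["100", "9", "25"]

# String sorting compares character by character, so "100" sorts before "25".
sorted_as_strings = sorted(values)

# Numeric sorting compares magnitudes.
sorted_as_numbers = sorted(values, key=float)
```

Here `sorted_as_strings` is `['100', '25', '9']`, while `sorted_as_numbers` is `['9', '25', '100']`.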

Updated 2 months ago
