Working with the Unified Dataset

Learn about the unified dataset and how to work with it in a categorization project.

In a categorization project, you begin by following the Schema Mapping Workflow to add input datasets and create a unified dataset. The records in the unified dataset contain information about the logical entity you wish to categorize, such as customers, products, parts, or another entity that is important in your business.

The attributes in the unified dataset are those that best describe this entity across all input datasets.

The unified dataset accomplishes three tasks in the categorization workflow. It:

  • Defines a single schema across a large number of dissimilar datasets. This schema is used for Tamr processing.
  • Powers the Tamr categorization machine learning process with attribute-specific configurations.
  • Allows reviewers and verifiers to provide feedback on the categorizations Tamr suggests.

Defining a Single Schema

The unified dataset defines a single schema across all input datasets that will be categorized. Because of this, the unified schema is often thought of as the largest set of attributes common to all input datasets in the project. The task is then to map each common input dataset attribute to this single schema. For more information, see Unified Dataset Management.
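As a rough illustration of mapping input attributes to one schema, the sketch below renames attributes from two dissimilar records to shared unified names. The dataset names, attribute names, and the `to_unified` helper are all hypothetical; in Tamr the mapping is configured on the Schema Mapping page, not written as code.

```python
# Hypothetical input records from two dissimilar datasets.
vendor_a = {"ItemName": "Hex bolt M8", "Cost": "0.42", "Warehouse": "W1"}
vendor_b = {"product_desc": "Hex bolt, 8 mm", "unit_price": "0.40"}

# Hypothetical mappings: input attribute -> unified attribute.
MAPPINGS = {
    "vendor_a": {"ItemName": "description", "Cost": "price"},
    "vendor_b": {"product_desc": "description", "unit_price": "price"},
}


def to_unified(dataset_name, record):
    """Rename mapped input attributes to unified names; drop unmapped ones."""
    mapping = MAPPINGS[dataset_name]
    return {
        unified: record[source]
        for source, unified in mapping.items()
        if source in record
    }


unified_records = [
    to_unified("vendor_a", vendor_a),
    to_unified("vendor_b", vendor_b),
]
```

Both resulting records share the attributes `description` and `price`, while the unmapped `Warehouse` attribute is left out of the unified schema.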

Dataset Profiling

Profiling a dataset or individual attributes creates derived metadata about a given dataset. The metadata includes counts and histograms of distinct values, and inferred data types. Profiling can be run at any point in the workflow to help you understand the data at multiple points in the project. For more information, see Profiling a Dataset.
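To make the kind of derived metadata concrete, here is a minimal sketch that computes a distinct-value count, a histogram, and a guessed data type for one attribute. It is not Tamr's profiling implementation; the `profile` and `infer_type` functions and the type labels are assumptions made for illustration.

```python
from collections import Counter


def infer_type(values):
    """Guess a type label ('integer', 'float', or 'string') for non-empty values."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return "unknown"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(str(v))  # parse the text form so 3.5 is not accepted as an integer
            return label
        except ValueError:
            continue
    return "string"


def profile(records, attribute):
    """Return simple derived metadata for one attribute of a list of records."""
    values = [r.get(attribute) for r in records]
    counts = Counter(v for v in values if v not in (None, ""))
    return {
        "distinct_count": len(counts),
        "histogram": dict(counts),
        "inferred_type": infer_type(values),
    }


# Hypothetical sample data.
sample = [{"price": "0.42"}, {"price": "0.40"}, {"price": "0.42"}, {"price": ""}]
price_profile = profile(sample, "price")
```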

Configuring Inclusion of Attributes in Machine Learning

Tamr relies on attributes from the unified dataset when it runs its classification model to assign records to categories in the taxonomy. You decide which attributes are used in machine learning and which are excluded.

  • If an attribute is included in machine learning, Tamr uses the attribute to compare records that have verified categorizations with records that do not, and to generate suggestions. By default, all attributes in the unified dataset are included in machine learning. This is indicated by a blue barbell or diagonal bar icon to the right of each attribute.
  • If an attribute is excluded from machine learning, Tamr does not use it in its machine learning algorithms. To exclude an attribute from machine learning, de-select the barbell or diagonal bar icon to the right of that attribute. The attribute continues to display as a column on the categorized records page.
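Conceptually, the include/exclude setting acts as a per-attribute filter on what the model sees. The sketch below assumes a hypothetical `ML_ENABLED` table and `ml_features` helper; it only mirrors the behavior described above (included by default, excluded attributes still present on the record) and is not how Tamr stores the setting.

```python
# Hypothetical per-attribute settings; attributes not listed default to included.
ML_ENABLED = {"description": True, "supplier": True, "internal_id": False}


def ml_features(record):
    """Return only the attributes that would be passed to machine learning."""
    return {
        name: value
        for name, value in record.items()
        if ML_ENABLED.get(name, True)
    }


record = {
    "description": "Hex bolt M8",
    "supplier": "Acme",
    "internal_id": "X-1",
    "notes": "bulk order",
}
features = ml_features(record)
# `record` itself is unchanged, so an excluded attribute can still be displayed.
```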

In the following screenshot, the "include/exclude" icons are indicated by the box. The blue icons indicate that the attribute is included in machine learning, while the white icon indicates that the attribute is excluded from it.

Including and excluding attributes from machine learning.

Choosing a Tokenizer

To compare text values, Tamr uses tokenizers. Based on your knowledge of the data, you can choose the most appropriate tokenizer for Tamr to use for each unified attribute.
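The effect of tokenizer choice can be seen with two generic tokenizers and a set-overlap score. These are textbook examples chosen for illustration, not the tokenizers Tamr ships or the similarity measure it uses; `word_tokens`, `bigram_tokens`, and `jaccard` are hypothetical helper names.

```python
import re


def word_tokens(text):
    """Split on anything that is not a lowercase letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bigram_tokens(text):
    """Return overlapping two-character tokens, with whitespace normalized."""
    s = re.sub(r"\s+", " ", text.lower().strip())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


left, right = "hex bolt", "hexbolt"
word_score = jaccard(word_tokens(left), word_tokens(right))
bigram_score = jaccard(bigram_tokens(left), bigram_tokens(right))
```

Here the two values share no whole-word tokens but most of their character bigrams, so a word-based tokenizer treats them as unrelated while a bigram tokenizer treats them as similar. Which behavior is preferable depends on the attribute, which is why the choice is made per unified attribute.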

To specify a tokenizer in a categorization project:

  1. Open the Schema Mapping page.
  2. Locate a unified attribute that is included in machine learning.
  3. Select the More menu (⁝ tricolon icon).
  4. Select Advanced.
  5. Use the drop-down list to change the Tokenizer.

The Tokenizer dropdown appears after you select More and then Advanced.

Tip: You can also change how Tamr sorts attribute values in tables on subsequent pages of the project.

Identifying Required Attributes

The Spend setting allows you to specify a unified attribute with a numeric data type to be aggregated per taxonomy node. This is useful, for example, if your records represent monetary transactions and you would like to navigate the transactions by the amount spent.

Tip: Identifying an attribute as Spend is optional.
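As a sketch of what aggregating a numeric attribute per taxonomy node can look like, the example below sums a spend amount for each node. It assumes that amounts roll up to every ancestor of the assigned category, which is an assumption of this example rather than documented Tamr behavior; the category paths, amounts, and `spend_per_node` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical categorized records: (taxonomy path, spend amount).
records = [
    (("Hardware", "Fasteners", "Bolts"), 120.0),
    (("Hardware", "Fasteners", "Nuts"), 80.0),
    (("Hardware", "Tools"), 50.0),
]


def spend_per_node(records):
    """Sum spend for every taxonomy node, crediting each ancestor as well."""
    totals = defaultdict(float)
    for path, amount in records:
        # path[:1] is the top-level node, path[:len(path)] is the assigned node.
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += amount
    return dict(totals)


totals = spend_per_node(records)
```

With totals like these, transactions can be browsed from the top of the taxonomy down toward the nodes where most of the money is spent.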

To configure an attribute as Spend:

  1. Select the More menu (⁝ tricolon icon).
  2. Choose Spend from the dropdown menu.

Configuring the Spend unified attribute.

