Working with the Unified Dataset

The unified dataset is the logical entity you wish to categorize, such as customers, products, parts, or another entity that is important in your business.
The attributes of the unified dataset are those attributes from multiple input datasets that best describe this entity across all input datasets.

The unified dataset accomplishes three tasks in the categorization workflow. It:

Defines a single schema across a large number of dissimilar datasets. This schema is used for Tamr processing.
Powers the Tamr categorization machine learning process with attribute-specific configurations.
Allows Curators and Reviewers to provide feedback on the categorizations that Tamr produces.

Defining a Single Schema

The unified dataset defines a single schema across all input datasets that will be categorized. Because of this, the unified schema is often thought of as the largest set of attributes common to all input datasets in the project. The task is then to map each common input dataset attribute to this single schema. For more information, see Unified Dataset Management.
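The mapping idea can be sketched in a few lines of Python. This is only an illustration of projecting differently named input attributes onto one schema; the dataset names, attribute names, and the `MAPPINGS` structure are hypothetical and do not represent Tamr's internal format or API.

```python
# Hypothetical input datasets with differently named attributes.
suppliers_a = [{"part_name": "Hex bolt M8", "unit_cost": "0.12"}]
suppliers_b = [{"description": "Hex nut M8", "price": "0.05", "warehouse": "B2"}]

# Hypothetical mapping: input dataset -> {input attribute: unified attribute}.
# Attributes left unmapped (such as "warehouse") are absent from the unified records.
MAPPINGS = {
    "suppliers_a": {"part_name": "name", "unit_cost": "price"},
    "suppliers_b": {"description": "name", "price": "price"},
}


def unify(dataset_name, records):
    """Project each input record onto the unified schema."""
    mapping = MAPPINGS[dataset_name]
    return [
        {unified: record.get(source) for source, unified in mapping.items()}
        for record in records
    ]


unified = unify("suppliers_a", suppliers_a) + unify("suppliers_b", suppliers_b)
```

After this runs, every record in `unified` has the same two attributes, `name` and `price`, regardless of which input dataset it came from.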

Dataset Profiling

Profiling your input datasets is often helpful when you begin a categorization project, which assigns records to categories within a taxonomy.

Profiling a dataset or individual attributes creates derived metadata about a given dataset. Useful metadata includes counts and histograms of distinct values, and inferred data types. Profiling is an optional step that you can run at any point in the workflow. Profiling attributes can help you understand the data at multiple points in the project. For more information, see Profiling a Dataset.
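To make the kinds of metadata mentioned above concrete, here is a minimal sketch that computes a distinct-value count, a histogram, and a rough inferred type for one attribute. The `profile` and `infer_type` functions and the sample records are hypothetical; Tamr's own profiling may compute different or additional statistics.

```python
from collections import Counter


def infer_type(values):
    """Guess 'numeric' if every non-empty value parses as a float, else 'string'."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return "unknown"
    try:
        for v in non_empty:
            float(v)
    except (TypeError, ValueError):
        return "string"
    return "numeric"


def profile(records, attribute):
    """Return simple derived metadata for one attribute."""
    values = [r.get(attribute) for r in records]
    histogram = Counter(v for v in values if v not in (None, ""))
    return {
        "distinct_count": len(histogram),
        "histogram": dict(histogram),
        "inferred_type": infer_type(values),
    }


# Hypothetical sample records.
records = [
    {"name": "Hex bolt M8", "price": "0.12"},
    {"name": "Hex nut M8", "price": "0.05"},
    {"name": "Hex bolt M8", "price": ""},
]
name_profile = profile(records, "name")
price_profile = profile(records, "price")
```

A profile like this can reveal, for example, that an attribute is mostly empty or that a column expected to be numeric contains free text.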

Configuring Inclusion of Attributes in Machine Learning

Tamr relies on attributes from the unified dataset when it runs its classification model for categorizing records into categories in the taxonomy. You can decide which attributes should be used in machine learning and which should be excluded.

If an attribute is included in machine learning, Tamr uses the attribute for training when generating suggestions for categorizations. By default, all attributes in the unified dataset are included in machine learning. This is indicated by a blue diagonal bar icon to the right of each attribute.
If an attribute is excluded from machine learning, Tamr does not use it in its machine learning algorithms. To exclude an attribute from machine learning, deselect the diagonal bar icon to the right of that attribute. The attribute continues to display as a column on the records page.

In the following screenshot, the "inclusion into machine learning icon" is surrounded by a box. The blue icons indicate that the attribute is included in machine learning, while the white icon indicates that the attribute is excluded from it.
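Conceptually, the include/exclude setting acts as a per-attribute flag that filters what the model sees, while leaving the record itself unchanged. The sketch below illustrates that idea only; the `include_in_ml` flag, the `ATTRIBUTES` structure, and the attribute names are hypothetical and are not Tamr configuration keys.

```python
# Hypothetical per-attribute flag mirroring the include/exclude icon.
ATTRIBUTES = {
    "name": {"include_in_ml": True},
    "price": {"include_in_ml": True},
    "internal_id": {"include_in_ml": False},
}


def model_features(record):
    """Keep only the attributes flagged for machine learning."""
    return {
        key: value
        for key, value in record.items()
        if ATTRIBUTES.get(key, {}).get("include_in_ml", False)
    }


record = {"name": "Hex bolt M8", "price": "0.12", "internal_id": "A-001"}
features = model_features(record)
```

Here `features` omits `internal_id`, but `record` still carries it, which mirrors how an excluded attribute remains visible as a column.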

Choose a Tokenizer

To compare text values, Tamr uses tokenizers.

To specify tokenizers in a categorization project:

Open the Schema Mapping page.
Locate a unified attribute that is included in machine learning.
Select the More menu (⁝ tricolon icon).
Select Advanced.
Use the drop-down list to change the Tokenizer.

The Tokenizer dropdown appears after you select More and then Advanced.
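To see why the choice of tokenizer matters when comparing text values, consider the following sketch. It compares two strings using a word tokenizer and a character-bigram tokenizer, scored with Jaccard similarity. Both tokenizers and the similarity measure are generic illustrations chosen for this example; they are not the options in Tamr's Tokenizer list or the comparison Tamr performs.

```python
def word_tokens(text):
    """Split on whitespace after lowercasing."""
    return set(text.lower().split())


def bigram_tokens(text):
    """Produce overlapping two-character tokens after lowercasing."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


left, right = "hex bolt", "hex bolts"
word_score = jaccard(word_tokens(left), word_tokens(right))
bigram_score = jaccard(bigram_tokens(left), bigram_tokens(right))
```

With word tokens, "bolt" and "bolts" do not match at all, so the two values look quite different. With bigrams, they share most of their tokens and score much higher. A finer-grained tokenizer is generally more tolerant of small spelling variations.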

Configuring User Preferences

You can configure the following characteristics for the attributes in the unified dataset:

Attribute type can be numeric or string. This is useful for sorting and searching purposes.
A numeric type attribute can be configured as Spend. The Spend setting indicates to Tamr that the numeric type attribute is required and must be aggregated per taxonomy node. This is useful, for example, if your records represent transactions and you would like to navigate the transactions by Spend.

To configure the type of the attribute in the unified dataset:

Select the vertical three dots icon.
Choose one of the two types: numeric or string.

To configure the attribute as Spend:

Select the vertical three dots icon.
Choose Spend from the dropdown menu.

Configuring the *Spend* unified attribute.
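The per-node aggregation that the Spend setting enables can be pictured with a short sketch. It assumes each record carries a category path and that spend rolls up to every ancestor node in the taxonomy. The record layout, the category names, and the roll-up behavior are assumptions made for illustration, not a description of how Tamr computes the totals.

```python
from collections import defaultdict

# Hypothetical categorized transactions; "category" is a path in the taxonomy.
records = [
    {"category": ("Hardware", "Fasteners", "Bolts"), "spend": 120.0},
    {"category": ("Hardware", "Fasteners", "Nuts"), "spend": 50.0},
    {"category": ("Hardware", "Tools"), "spend": 30.0},
]


def spend_per_node(records):
    """Sum spend at each taxonomy node, including all ancestor nodes."""
    totals = defaultdict(float)
    for record in records:
        path = record["category"]
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += record["spend"]
    return dict(totals)


totals = spend_per_node(records)
```

In this example, the `("Hardware", "Fasteners")` node totals the spend of both its child categories, and the top-level `("Hardware",)` node totals all three transactions.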