Creating the Unified Dataset for Categorization
Learn about the unified dataset and how to work with it in a categorization project.
In a categorization project, you begin by following the Schema Mapping Workflow to add input datasets and create a unified dataset. The records in the unified dataset contain information about the logical entity you wish to categorize, such as customers, products, parts, or another entity that is important in your business.
The attributes of the unified dataset are populated from the input dataset attributes that best describe this entity.
The unified dataset accomplishes three tasks in the categorization workflow. It:
- Defines a single schema across a large number of dissimilar datasets.
- Powers the categorization machine learning process with attribute-specific configurations.
- Allows reviewers and verifiers to provide feedback on the categorizations the model suggests.
Defining a Single Schema
The unified dataset defines a single schema across all input datasets to be categorized. Because of this, the unified schema is often thought of as the largest set of attributes common to all input datasets in the project. The task is then to map each common input dataset attribute to this single schema. For more information, see Mapping Unified Attributes.
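As an illustration of the idea only, the following minimal sketch renames differently named input attributes to a shared set of unified attribute names. It does not use any Tamr Core API; the dataset names, attribute names, and the `to_unified` helper are all hypothetical.

```python
# Hypothetical input datasets whose attributes use different names.
supplier_a = [{"part_desc": "hex bolt M8", "amt": 12.5}]
supplier_b = [{"description": "steel washer", "total_cost": 3.0}]

# Hypothetical mappings: input attribute name -> unified attribute name.
mappings = {
    "supplier_a": {"part_desc": "description", "amt": "spend"},
    "supplier_b": {"description": "description", "total_cost": "spend"},
}


def to_unified(dataset_name, records, mappings):
    """Rename mapped input attributes to unified names; drop unmapped ones."""
    mapping = mappings[dataset_name]
    return [
        {
            unified: record[source]
            for source, unified in mapping.items()
            if source in record
        }
        for record in records
    ]


# All records now share one schema: "description" and "spend".
unified = (
    to_unified("supplier_a", supplier_a, mappings)
    + to_unified("supplier_b", supplier_b, mappings)
)
```

In the product itself, these mappings are created on the Schema Mapping page rather than in code.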
Profiling Datasets
Profiling a dataset or individual attributes creates derived metadata about a given dataset. The metadata includes counts and histograms of distinct values, as well as inferred data types. You can run profiling at any point in the workflow to help you understand your data. For more information, see Profiling a Dataset.
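To make the kind of metadata described here concrete, the sketch below computes value counts, a histogram of distinct values, and a naive inferred type for a single attribute. It is an assumption-laden illustration, not Tamr Core's profiling logic; the function names and type labels are hypothetical.

```python
from collections import Counter


def _parses_as(caster, value):
    """Return True if caster(value) succeeds without a ValueError."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Guess a simple type label for a list of raw string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    if all(_parses_as(int, v) for v in non_empty):
        return "integer"
    if all(_parses_as(float, v) for v in non_empty):
        return "number"
    return "string"


def profile_attribute(values):
    """Build derived metadata for one attribute's values."""
    histogram = Counter(values)
    return {
        "total": len(values),
        "distinct": len(histogram),
        "histogram": dict(histogram),
        "inferred_type": infer_type(values),
    }


profile = profile_attribute(["10", "20", "10", "7.5"])
```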
Configuring Inclusion of Attributes in Machine Learning
Tamr Core relies on attributes from the unified dataset when it runs its model to assign records to categories in the taxonomy. You decide which attributes are used in machine learning and which are excluded.
Configure inclusion of attributes in machine learning on the Schema Mapping page. If you change the attributes included in machine learning, you must select Update the Unified Dataset to apply the changes.
- If an attribute is included in machine learning, Tamr Core uses the attribute to compare records that have verified categorizations with records that do not and generate suggestions. By default, all attributes in the unified dataset are included in machine learning. This is indicated by a Machine learning attribute icon to the right of each attribute.
- If an attribute is excluded from machine learning, Tamr Core does not use it in its machine learning algorithms. To exclude an attribute from machine learning, click the Machine learning attribute icon to toggle inclusion off. The attribute continues to display as a column on the categorized records page.
In the following screenshot, the "include/exclude" icons are indicated by the box. The blue icons indicate that the attribute is included in machine learning, while the white icon indicates that the attribute is excluded from it.
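Conceptually, the toggle acts as a per-attribute filter on what the model sees. The sketch below shows that idea under stated assumptions: the `ml_enabled` flag, the settings structure, and the `ml_view` helper are hypothetical and are not Tamr Core configuration keys.

```python
# Hypothetical per-attribute settings; "ml_enabled" is an illustrative flag name.
attribute_settings = {
    "description": {"ml_enabled": True},
    "supplier": {"ml_enabled": True},
    "internal_id": {"ml_enabled": False},
}


def ml_view(record, settings):
    """Keep only the attributes flagged for machine learning.

    Attributes without an explicit setting are kept, mirroring the
    documented default that all attributes are included.
    """
    return {
        name: value
        for name, value in record.items()
        if settings.get(name, {}).get("ml_enabled", True)
    }


record = {"description": "hex bolt M8", "supplier": "Acme", "internal_id": "A-001"}
features = ml_view(record, attribute_settings)
```

The excluded attribute stays in the record itself; it is only left out of the view passed to the model, which matches the behavior that excluded attributes still display as columns.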
Choosing a Tokenizer
To compare text values, Tamr Core uses tokenizers. Based on your knowledge of the data, you can choose the most appropriate tokenizer to use for each unified attribute.
To specify a tokenizer:
- Open the Schema Mapping page.
- Locate a unified attribute that is included in machine learning.
- Select More > Advanced.
- Use the dropdown to change the tokenizer.
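To show why the choice of tokenizer matters, the sketch below compares two generic tokenization strategies using Jaccard similarity. The two tokenizers and the similarity measure are assumptions chosen for illustration; they are not necessarily the options in the dropdown or the comparison Tamr Core performs.

```python
import re


def word_tokens(text):
    """Split text into lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def char_bigrams(text):
    """Split text into overlapping two-character tokens."""
    compact = re.sub(r"\s+", " ", text.lower().strip())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def jaccard(a, b):
    """Share of tokens that two token sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


left, right = "hex bolt", "hexbolt"
word_score = jaccard(word_tokens(left), word_tokens(right))      # no shared words
bigram_score = jaccard(char_bigrams(left), char_bigrams(right))  # many shared bigrams
```

With word tokens, the two spellings share nothing, while character bigrams still capture most of the overlap. Values with inconsistent spacing or spelling may therefore benefit from a character-based tokenizer, whereas clean descriptive text may be better served by word-based tokens.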
Identifying Required Attributes
The Spend setting allows you to specify a unified attribute with a numeric data type to be aggregated per taxonomy node. This is useful, for example, if your records represent monetary transactions and you would like to navigate the transactions by the amount spent.
Tip: Identifying an attribute as Spend is optional.
To configure an attribute as Spend:
- Select More.
- Choose Spend from the dropdown menu.
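The aggregation that Spend enables can be pictured as summing a numeric attribute for each taxonomy node. The sketch below is a minimal illustration with hypothetical records and category paths; rolling each amount up to every ancestor node is an assumption made here, not confirmed Tamr Core behavior.

```python
from collections import defaultdict

# Hypothetical categorized records; each category is a path in the taxonomy.
records = [
    {"category": ("Hardware", "Fasteners"), "spend": 120.0},
    {"category": ("Hardware", "Fasteners"), "spend": 30.0},
    {"category": ("Hardware", "Tools"), "spend": 50.0},
    {"category": ("Services",), "spend": 200.0},
]


def spend_per_node(records):
    """Sum spend for each taxonomy node, including all ancestor nodes."""
    totals = defaultdict(float)
    for record in records:
        path = record["category"]
        # Add the amount to the node itself and to each of its ancestors.
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += record["spend"]
    return dict(totals)


totals = spend_per_node(records)
```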