Categorization

Typical examples of categorization projects include designing a taxonomy, improving an existing taxonomy, and expert sourcing.

To begin a categorization project, add some data and an existing taxonomy, in any order.

The categorization workflow consists of the following stages:

Create a unified dataset.
Load a taxonomy and tweak it based on reviewer feedback.
Launch the Tamr machine learning process for categorization.
Revise the resulting taxonomy, curate the feedback from reviewers, and re-launch Tamr to improve the categorization.

We describe each stage in this topic.

Step 1: Create a Unified Dataset for the Categorization Project

You can begin a categorization project by first creating a unified dataset.
To create a unified dataset:

Add one or more input datasets to the project. Each input dataset typically represents a single logical entity, such as customers or products. Tamr converts the datasets registered with Tamr into the project's datasets. This process is known as indexing.
Map attributes from the schemas of the input datasets to attributes in a single unified dataset. When you are ready, you can launch a Tamr machine learning process. The machine learning jobs consume your schema mappings to produce results. For detailed steps, see Working with the Unified Dataset.
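The mapping idea above can be sketched in a few lines of Python. This is only an illustration of what a schema mapping does conceptually, not the Tamr API: the function unify, the dataset names, and the attribute names are all hypothetical.

```python
# Hypothetical illustration only: this is not the Tamr API.
from typing import Dict, List

# For each input dataset: input attribute name -> unified attribute name.
SchemaMapping = Dict[str, Dict[str, str]]


def unify(datasets: Dict[str, List[dict]], mapping: SchemaMapping) -> List[dict]:
    """Project every input record onto the unified attribute names."""
    unified = []
    for name, records in datasets.items():
        attr_map = mapping.get(name, {})
        for record in records:
            # Keep track of where each unified row came from.
            row = {"source_dataset": name}
            for input_attr, unified_attr in attr_map.items():
                if input_attr in record:
                    row[unified_attr] = record[input_attr]
            unified.append(row)
    return unified


# Two made-up input datasets that describe the same logical entity (parts).
datasets = {
    "erp_parts": [{"part_desc": "1 inch turbine bolts", "vendor": "Acme"}],
    "plant_inventory": [{"description": "hex nut M8", "supplier_name": "Bolt Co"}],
}
mapping = {
    "erp_parts": {"part_desc": "description", "vendor": "supplier"},
    "plant_inventory": {"description": "description", "supplier_name": "supplier"},
}

unified_rows = unify(datasets, mapping)
for row in unified_rows:
    print(row)
```

In this sketch, differently named input attributes (part_desc and description, vendor and supplier_name) end up under one unified name, which is what lets a later categorization step treat records from both sources alike.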

Step 2: Load and Tweak the Taxonomy

Alternatively, you could begin a categorization project by loading your existing taxonomy into the project.
As you make changes to the taxonomy and periodically launch a Tamr machine learning process, Tamr improves the categorization results. The process repeats as Tamr classifies records with increasing confidence and reviewers add feedback to the existing taxonomy. For detailed steps, see Working with the Taxonomy.

Step 3: Launch Tamr to Put Records into Categories in the Taxonomy

The next step in the categorization project is to begin categorizing records into the taxonomy. Once Tamr has categorized a minimum of five records, Tamr can begin to identify matches between words or tokens contained within values of each dataset record and words or tokens already associated with each category of the taxonomy. This enables Tamr to suggest a classification for each record based on the initial categorization model it generates. For more information, see Categorizing Records.
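To make the token-matching idea concrete, here is a minimal sketch in Python. It is a deliberately simple token-overlap scorer, not Tamr's actual model: the functions tokenize, train, and suggest, the scoring rule, and the sample records are all assumptions made for illustration. The five labeled records mirror the minimum mentioned above.

```python
# Hypothetical illustration only: a simple token-overlap scorer,
# not the model that Tamr builds.
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a value into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(labeled):
    """Collect the tokens seen in each category from (text, category) pairs."""
    model = defaultdict(Counter)
    for text, category in labeled:
        model[category].update(tokenize(text))
    return model


def suggest(model, text):
    """Return (category, score); score is the share of the record's
    distinct tokens that are already associated with the category."""
    tokens = set(tokenize(text))
    best_category, best_score = None, 0.0
    if not tokens:
        return best_category, best_score
    for category, counts in model.items():
        score = sum(1 for t in tokens if t in counts) / len(tokens)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score


# Five already-categorized records (made-up data).
labeled = [
    ("hex bolt 1 inch", "Bolt"),
    ("turbine bolts steel", "Bolt"),
    ("carriage bolt 2 inch", "Bolt"),
    ("hex nut M8", "Nut"),
    ("lock nut steel", "Nut"),
]
model = train(labeled)

print(suggest(model, "1 inch turbine bolts"))  # ('Bolt', 1.0)
print(suggest(model, "lock washer"))           # ('Nut', 0.5)
print(suggest(model, "copper wire"))           # (None, 0.0)
```

The score doubles as a rough confidence value: a record whose tokens are all known to a category scores 1.0, while a record with no known tokens gets no suggestion at all.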

Step 4: Include Reviewer Feedback and Curate the Results

When Tamr runs its machine learning process to identify matches between records and their categories, it produces simple, high-impact questions for reviewers to answer. Tamr asks reviewers whether certain records are categorized appropriately. Tamr asks these questions about a representative subset of records, drawn from a large portion of the records in the unified dataset.

For example, if Tamr has low confidence about whether a record pertaining to “1 inch turbine bolts” belongs to the “Bolt” category within the organization’s taxonomy, Tamr asks a reviewer for feedback.
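The selection of low-confidence records can be sketched as follows. This is a hypothetical Python illustration, not how Tamr chooses its questions: the Suggestion class, the review_questions function, the 0.6 threshold, the limit of two questions, and the confidence values are all assumptions.

```python
# Hypothetical illustration only: not how Tamr selects its questions.
from dataclasses import dataclass
from typing import List


@dataclass
class Suggestion:
    record: str
    category: str
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)


def review_questions(suggestions: List[Suggestion],
                     threshold: float = 0.6,
                     limit: int = 2) -> List[str]:
    """Turn the least confident suggestions into yes/no questions."""
    low = [s for s in suggestions if s.confidence < threshold]
    low.sort(key=lambda s: s.confidence)  # least confident first
    return [
        f'Does "{s.record}" belong in category "{s.category}"?'
        for s in low[:limit]
    ]


# Made-up suggestions with made-up confidence values.
suggestions = [
    Suggestion("1 inch turbine bolts", "Bolt", 0.41),
    Suggestion("hex nut M8", "Nut", 0.93),
    Suggestion("steel washer", "Nut", 0.35),
    Suggestion("lock nut", "Nut", 0.58),
]

questions = review_questions(suggestions)
for question in questions:
    print(question)
```

Asking about the least confident suggestions first is one simple way to keep the number of questions small while focusing reviewer effort where an answer is most likely to change the result.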

As you rerun the categorization process and regenerate the categorization model, Tamr incorporates the reviewers' feedback into the dataset and its model. This improves the accuracy of the categorization model, so that Tamr can automatically categorize large numbers of records with high accuracy. For more information, see Curator and Reviewer Categorizations.