Tamr Documentation

Categorization

Understand the basics of a categorization project, such as taxonomy design and expert sourcing.

A categorization project places records into categories. It is a top-down organizational project that classifies individual records into a hierarchical collection of categories, referred to as a taxonomy.
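To make the idea of a hierarchical taxonomy concrete, here is a minimal sketch of one way to represent it as a tree of named categories. The `Category` class, its methods, and the example category names are hypothetical illustrations, not Tamr's internal representation or API.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node in a hypothetical taxonomy tree (illustrative only)."""
    name: str
    children: dict = field(default_factory=dict)

    def add_path(self, path):
        """Insert a path such as ["Hardware", "Fasteners", "Bolt"],
        creating any intermediate categories that do not exist yet."""
        node = self
        for name in path:
            node = node.children.setdefault(name, Category(name))
        return node

    def leaf_paths(self, prefix=()):
        """Yield the full path to every leaf (most specific) category."""
        for child in self.children.values():
            path = prefix + (child.name,)
            if child.children:
                yield from child.leaf_paths(path)
            else:
                yield path


# Example category names are assumptions made for this sketch.
taxonomy = Category("root")
taxonomy.add_path(["Hardware", "Fasteners", "Bolt"])
taxonomy.add_path(["Hardware", "Fasteners", "Nut"])
taxonomy.add_path(["Services", "Maintenance"])

leaf_paths = list(taxonomy.leaf_paths())
for path in leaf_paths:
    print(" > ".join(path))
```

In this sketch a record would be assigned to one path through the tree, so a record placed in "Bolt" also belongs to its ancestors "Fasteners" and "Hardware".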

You can start a project by adding data or by adding a taxonomy; the two tasks can be completed in either order.

Working with the Unified Dataset

One possible first step in a categorization project is to add one or more of Tamr's registered datasets to the project as sources to be categorized. A project's sources focus on a single logical entity, e.g. customers or products. Once added to the project, sources must be mapped to a single unified dataset and given an initial configuration that Tamr's machine learning can work with (see Working with the Unified Dataset).
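As a rough illustration of what mapping several sources to a single unified dataset means, the sketch below renames each source's attributes to a shared set of unified attributes. The source names, attribute names, the `origin_source` field, and the `unify` function are all assumptions made for this example; in Tamr the mapping is configured in the project rather than written by hand like this.

```python
# Hypothetical source datasets describing one logical entity (products).
source_a = [{"part_desc": "1 inch turbine bolts", "vendor": "Acme"}]
source_b = [{"description": "hex nut M8", "supplier_name": "Globex"}]

# Hypothetical mapping: source attribute -> unified attribute.
mappings = {
    "source_a": {"part_desc": "description", "vendor": "supplier"},
    "source_b": {"description": "description", "supplier_name": "supplier"},
}


def unify(sources, mappings):
    """Combine records from every source into one list that shares
    a single set of attribute names."""
    unified = []
    for source_name, records in sources.items():
        mapping = mappings[source_name]
        for record in records:
            # Keep track of where each unified record came from.
            row = {"origin_source": source_name}
            for source_attr, unified_attr in mapping.items():
                row[unified_attr] = record.get(source_attr)
            unified.append(row)
    return unified


unified_dataset = unify({"source_a": source_a, "source_b": source_b}, mappings)
for row in unified_dataset:
    print(row)
```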

Working with the Taxonomy

The alternative first step is to load the target taxonomy into the project. Understanding and working with the taxonomy is crucial to the success of the project, as Tamr classifies records with increasing confidence and reviewers provide more and more feedback (see Working with the Taxonomy).

Categorizing Records

The third step is to begin categorizing records into the taxonomy. Once at least 5 records have been categorized, Tamr can begin to identify matches between the words or tokens contained in the values of each dataset record and the words or tokens already associated with each category of the taxonomy. This enables Tamr to suggest a classification for each record based on the initial model it generates (see Categorizing Records).
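The token-matching idea described above can be sketched with a simple overlap count. This is only an illustration under stated assumptions: Tamr's actual model is not described here, and the `tokenize`, `build_token_profiles`, and `suggest` functions, the scoring rule, and the example records are all hypothetical. Only the minimum of 5 categorized records comes from the text.

```python
import re
from collections import Counter, defaultdict

MIN_LABELED = 5  # minimum number of categorized records stated in the text


def tokenize(text):
    """Split a value into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_token_profiles(labeled):
    """Collect the tokens seen in the records already placed in each category."""
    if len(labeled) < MIN_LABELED:
        raise ValueError(f"need at least {MIN_LABELED} categorized records")
    profiles = defaultdict(Counter)
    for text, category in labeled:
        profiles[category].update(tokenize(text))
    return profiles


def suggest(text, profiles):
    """Suggest the category whose known tokens overlap most with the record.
    Confidence here is just the winner's share of all matches (an assumption)."""
    tokens = set(tokenize(text))
    scores = {
        category: sum(1 for token in tokens if token in counts)
        for category, counts in profiles.items()
    }
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    return best, confidence


# Hypothetical records that a curator has already categorized.
labeled = [
    ("1 inch turbine bolts", "Bolt"),
    ("hex bolt M8 steel", "Bolt"),
    ("M8 hex nut zinc", "Nut"),
    ("lock nut nylon insert", "Nut"),
    ("flat washer steel", "Washer"),
]
profiles = build_token_profiles(labeled)
suggestion, confidence = suggest("steel carriage bolt", profiles)
print(suggestion, round(confidence, 2))
```

A real model would weight tokens rather than count them equally; the sketch only shows why a handful of categorized records is enough to start producing suggestions.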

Curator and Reviewer Categorizations

Tamr then produces simple, high-impact questions about whether certain records that are representative of a large portion of the unified dataset are categorized appropriately. For example, if Tamr has low confidence that a record pertaining to “1 inch turbine bolts” belongs in the “Bolt” category of the organization’s taxonomy, it asks a reviewer for feedback, which drives accuracy and enhances future automation. The reviewer's feedback is then incorporated into the dataset and Tamr’s models (see Curator and Reviewer Categorizations).
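One way to picture a "high-impact question" is as a suggestion that is both uncertain and representative of many records. The sketch below ranks suggestions on that basis. The `Suggestion` fields, the `impact` formula, and the example numbers are assumptions chosen for illustration; how Tamr actually selects questions is not specified in this overview.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    record: str
    category: str
    confidence: float      # hypothetical model confidence in [0, 1]
    similar_records: int   # hypothetical count of records this one represents


def impact(suggestion):
    """Assumed scoring rule: uncertainty times the number of similar records."""
    return (1.0 - suggestion.confidence) * suggestion.similar_records


def pick_questions(suggestions, k):
    """Return the k suggestions whose review would likely help the most."""
    return sorted(suggestions, key=impact, reverse=True)[:k]


# Example values are invented for this sketch.
suggestions = [
    Suggestion("1 inch turbine bolts", "Bolt", 0.55, 400),
    Suggestion("hex nut M8", "Nut", 0.95, 900),
    Suggestion("misc hardware", "Washer", 0.30, 10),
]
questions = pick_questions(suggestions, k=2)
for q in questions:
    print(f'Is "{q.record}" correctly categorized as "{q.category}"?')
```

Under this rule the "misc hardware" record is skipped despite its low confidence, because confirming or correcting it would affect very few other records.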