Categorization projects help you realize value from organizing equivalent data records into distinct, unique categories. An example use is spend management, where categorizing records can help you gain insights into the volume of spend and price variance on specific parts or types of parts, across suppliers or from the same supplier.
Typical examples of categorization projects include designing a new taxonomy, improving an existing one, and expert sourcing.
Categorization project workflow
The categorization workflow consists of the following steps:
1. Create a unified dataset.
2. Load a taxonomy and optionally tweak it based on reviewer feedback. (Steps 1 and 2 can be completed in any order.)
3. Launch the Tamr machine learning model for categorization.
4. Curate the feedback from reviewers and re-launch Tamr to improve record categorization.
This topic provides additional detail about each step.
The initial stage of a categorization project is similar to that of a schema mapping project: an admin creates the project and uploads one or more input datasets. As in a schema mapping project, curators then map attributes in the input datasets to attributes in the unified schema. See Working with the Unified Dataset.
Curators then complete additional configuration in the unified schema that is specific to categorization projects:
- Indicate which of the unified attributes will contribute to finding similarities and differences among the records.
- Specify the tokenizers for Tamr's supervised learning models to use when comparing text values and finding similarities and differences.
- Optionally, identify a numeric unified attribute as the "spend" attribute. Tamr uses this attribute to compute a total amount per category, which you can use to sort project data.
- Optionally, set up transformations for the data in the unified dataset.
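To illustrate what a per-category total on the "spend" attribute looks like, here is a minimal Python sketch. The field names `category` and `spend` and the function `spend_by_category` are hypothetical, chosen for illustration only; this is not Tamr's implementation or API.

```python
from collections import defaultdict


def spend_by_category(records, spend_attr="spend", category_attr="category"):
    """Sum a numeric attribute per category and return (category, total)
    pairs sorted from largest to smallest total."""
    totals = defaultdict(float)
    for record in records:
        value = record.get(spend_attr)
        if value is None:
            continue  # records without a spend value do not contribute
        totals[record[category_attr]] += float(value)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical records from a unified dataset.
records = [
    {"category": "Bolt", "spend": 120.0},
    {"category": "Washer", "spend": 15.5},
    {"category": "Bolt", "spend": 80.0},
    {"category": "Nut", "spend": None},
]
ranked = spend_by_category(records)
```

In this sketch, a category whose records all lack a spend value does not appear in the result; whether such categories should be listed with a zero total is a design choice.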
After creating the unified dataset, an admin loads an existing taxonomy file into the project. Curators can make additions and other changes to tweak the taxonomy and adapt it to meet your organization's needs.
As curators iteratively change the taxonomy to incorporate reviewer feedback and periodically launch the Tamr machine learning process, the Tamr model suggests categories for records with increasing confidence. See Working with the Taxonomy.
The next step in the categorization project is to begin labeling records with the appropriate "node" in the taxonomy. Curators provide this initial training, finding and labeling at least one record for every node in the taxonomy. The result is a representative sample for the Tamr model to use.
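As a rough illustration of the "at least one record for every node" guideline, the following Python sketch lists taxonomy nodes that have no labeled record yet. The node names, the `(record_text, node)` pair format, and the function `unlabeled_nodes` are assumptions made for this example and do not reflect how Tamr stores labels.

```python
def unlabeled_nodes(taxonomy_nodes, labeled_records):
    """Return taxonomy nodes with no labeled record, in taxonomy order.

    labeled_records is a list of (record_text, node) pairs.
    """
    covered = {node for _, node in labeled_records}
    return [node for node in taxonomy_nodes if node not in covered]


# Hypothetical taxonomy nodes and initial curator labels.
taxonomy_nodes = ["Bolt", "Nut", "Washer"]
labeled_records = [
    ("1 inch turbine bolts", "Bolt"),
    ("M8 flat washer", "Washer"),
]
missing = unlabeled_nodes(taxonomy_nodes, labeled_records)
```

An empty result would indicate that every node has at least one training example.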
The Tamr model identifies matches between words or tokens in dataset records and words or tokens already associated with each category of the taxonomy. This enables Tamr to suggest a classification for each record based on the initial categorization model it generates. See Categorizing Records.
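The idea of matching record tokens against tokens associated with each category can be sketched with a simple overlap score. This is a deliberately simplified stand-in, not Tamr's actual model: the tokenizer, the scoring rule, and all function names below are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_category_tokens(labeled_records):
    """Collect the tokens seen in labeled records, grouped by category."""
    tokens = defaultdict(Counter)
    for text, category in labeled_records:
        tokens[category].update(tokenize(text))
    return tokens


def suggest_category(text, category_tokens):
    """Return (category, score), where score is the fraction of the record's
    tokens already associated with that category. Returns (None, 0.0) when
    no category shares a token with the record."""
    record_tokens = tokenize(text)
    if not record_tokens:
        return None, 0.0
    best, best_score = None, 0.0
    for category, counts in category_tokens.items():
        overlap = sum(1 for t in record_tokens if t in counts)
        score = overlap / len(record_tokens)
        if score > best_score:
            best, best_score = category, score
    return best, best_score


# Hypothetical training labels and an unlabeled record.
labeled = [("hex bolt steel", "Bolt"), ("flat washer zinc", "Washer")]
category_tokens = build_category_tokens(labeled)
suggestion = suggest_category("1 inch turbine bolt", category_tokens)
```

Because this sketch matches exact tokens only, "bolts" would not match "bolt"; the choice of tokenizer (for example, one that applies stemming) directly affects which records look similar.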
After curators complete initial training and the Tamr model generates suggested categories for records, curators assign records to one or more experts with the reviewer role for their feedback.
Reviewers can upvote or downvote labels proposed by Tamr or by other reviewers, and propose different labels to categorize records more accurately. To make this effort more efficient, Tamr identifies certain suggestions as high-impact. For example, if Tamr has low confidence about whether a record pertaining to “1 inch turbine bolts” belongs to the “Bolt” category in the taxonomy, Tamr marks that record as high-impact so that a reviewer can provide feedback.
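One simple way to picture this prioritization is to flag suggestions whose confidence falls below a cutoff and review the least confident first. The sketch below assumes a `confidence` value between 0 and 1 and a hypothetical threshold of 0.6; how Tamr actually determines high-impact records is not specified here.

```python
def high_impact(suggestions, threshold=0.6):
    """Return suggestions whose confidence is below the threshold,
    ordered from least to most confident."""
    flagged = [s for s in suggestions if s["confidence"] < threshold]
    return sorted(flagged, key=lambda s: s["confidence"])


# Hypothetical model suggestions.
suggestions = [
    {"record": "1 inch turbine bolts", "category": "Bolt", "confidence": 0.41},
    {"record": "M8 hex nut", "category": "Nut", "confidence": 0.93},
    {"record": "steel spacer ring", "category": "Washer", "confidence": 0.22},
]
review_queue = high_impact(suggestions)
```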
Work done by reviewers does not change the Tamr model until after a curator has verified and accepted the labels and re-run the categorization model.
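The separation between proposed and verified labels can be sketched as a filter applied before retraining. The `Label` class, its fields, and `training_labels` are hypothetical names used only to illustrate the rule; they are not part of Tamr.

```python
from dataclasses import dataclass


@dataclass
class Label:
    record_id: str
    category: str
    proposed_by: str
    verified: bool = False  # set to True once a curator accepts the label


def training_labels(labels):
    """Return (record_id, category) pairs for curator-verified labels only.
    Unverified reviewer proposals are left out of the next model run."""
    return [(label.record_id, label.category) for label in labels if label.verified]


# Hypothetical reviewer proposals; only the first has been verified.
labels = [
    Label("r1", "Bolt", "reviewer_a", verified=True),
    Label("r2", "Nut", "reviewer_b"),
]
ready = training_labels(labels)
```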
As curators validate reviewer feedback and relaunch the categorization model, Tamr adds the feedback to its model. Each cycle improves the accuracy of the categorization model, so that Tamr can automatically categorize large numbers of records with high accuracy. See Curator and Reviewer Categorizations.
For more information, see User Roles and the Tamr Documentation.