Categorization projects help you realize value by organizing similar data records into distinct categories. An example use case is spend management, where categorizing records can help you gain insight into the volume of spend on specific parts or types of parts, either across suppliers or from a single supplier.
Typical examples of categorization projects include designing a new taxonomy, improving an existing taxonomy, and expert sourcing.
The categorization workflow consists of the following steps:
- Create a unified dataset.
- Load a taxonomy and optionally adjust it based on reviewer feedback. (Steps 1 and 2 can be completed in any order.)
- Label an initial set of records, then launch the Tamr model for categorization.
- Review the categories suggested by Tamr, provide feedback, and re-launch Tamr to improve record categorization.
The initial stage of a categorization project is similar to that of a schema mapping project: an admin creates the project and uploads one or more input datasets. As in a schema mapping project, curators then map attributes in the input datasets to attributes in the unified schema. See Working with the Unified Dataset.
Curators then complete additional configuration in the unified schema that is specific to categorization projects:
- Specify the tokenizers for Tamr's supervised learning models to use when comparing text values.
- Optionally, identify a numeric unified attribute as the "spend" attribute. Tamr uses this attribute to compute a total amount per category, which you can use to sort project data.
- Optionally, set up transformations for the data in the unified dataset.
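Conceptually, the per-category "spend" rollup described above is a sum grouped by category. The following sketch illustrates the idea with hypothetical records and field names; it is not Tamr's API or internal implementation:

```python
from collections import defaultdict

# Hypothetical records: each carries a suggested category and a numeric
# "spend" value (the attribute a curator designated as the spend attribute).
records = [
    {"category": "Bolts", "spend": 125.00},
    {"category": "Bolts", "spend": 310.50},
    {"category": "Gaskets", "spend": 89.99},
]

# Total spend per category -- the kind of figure used to sort project data.
totals = defaultdict(float)
for record in records:
    totals[record["category"]] += record["spend"]

print(dict(totals))  # {'Bolts': 435.5, 'Gaskets': 89.99}
```

Sorting categories by this total surfaces the highest-spend categories first, which is typically where categorization insight matters most.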
After creating the unified dataset, an admin loads an existing taxonomy file into the project. Curators can make additions and other changes to adjust the taxonomy and adapt it to meet your organization's needs.
As curators iteratively make changes to the taxonomy to incorporate feedback, the Tamr model suggests categories for records with increasing confidence. See Navigating a Taxonomy.
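A taxonomy can be thought of as a tree of category nodes, with each record ultimately labeled by a node. The sketch below uses an invented example hierarchy and a simple nested-dictionary representation; it is only an illustration of the concept, not Tamr's data model:

```python
# Illustrative taxonomy: a tree of category nodes (names are hypothetical).
taxonomy = {
    "Hardware": {
        "Fasteners": {"Bolts": {}, "Nuts": {}},
        "Seals": {"Gaskets": {}},
    }
}

def node_paths(tree, path=()):
    """Yield the full path to every node in the taxonomy tree."""
    for name, children in tree.items():
        node = path + (name,)
        yield node
        yield from node_paths(children, node)

for p in node_paths(taxonomy):
    print(" > ".join(p))
```

Walking the tree this way enumerates every node a curator could add, rename, or move while adapting the taxonomy.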
The next step in a categorization project is to label a small set of records with the appropriate "node" in the taxonomy. Verifiers and curators provide this initial training by finding and labeling at least one record for every node in the taxonomy. When curators apply this feedback and update results, Tamr compares the values associated with each category in this representative sample against the values in other records, and uses the similarities it finds to suggest categories. See Updating Categorization Results.
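One way to reason about the "at least one record per node" requirement is as a coverage check of labels against taxonomy nodes. This is a hypothetical sketch, not a Tamr feature:

```python
# Illustrative coverage check (not part of Tamr): confirm that the initial
# training set labels at least one record for every taxonomy node.
taxonomy_nodes = {"Bolts", "Nuts", "Gaskets"}   # hypothetical leaf nodes
labeled = [("rec-1", "Bolts"), ("rec-2", "Gaskets")]  # (record id, node)

covered = {node for _, node in labeled}
missing = taxonomy_nodes - covered
print("Unlabeled nodes:", sorted(missing))  # Unlabeled nodes: ['Nuts']
```

Any node left in `missing` still needs a labeled example before the model has a representative sample for that category.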
After curators and verifiers complete initial training and the Tamr model generates suggested categories, curators and verifiers assign records to one or more experts with the reviewer (or other) role for their feedback. See Assigning Records in Categorization Projects.
Reviewers upvote or downvote the categories proposed by Tamr or by other team members, and can propose different categories to label records more accurately. See Reviewing Categorizations. To make this effort more efficient, Tamr identifies certain suggestions as high-impact. For example, if Tamr has low confidence about whether a record pertaining to “1 inch turbine bolts” belongs to the “Bolt” category, Tamr marks that record as high-impact so that a reviewer can provide feedback. Team members with any user role can upvote or downvote labels; use whichever roles make the most sense for your organization. See Categorizing Records.
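The high-impact idea can be pictured as filtering model suggestions whose confidence falls below some threshold so reviewers see them first. The threshold, field names, and values below are assumptions made for illustration; they do not reflect how Tamr computes impact:

```python
# Hypothetical sketch: surface low-confidence suggestions for review.
# The 0.6 cutoff is an assumed value, not a documented Tamr setting.
CONFIDENCE_THRESHOLD = 0.6

suggestions = [
    {"record": "1 inch turbine bolts", "category": "Bolt", "confidence": 0.42},
    {"record": "brass hex nut", "category": "Nut", "confidence": 0.93},
]

high_impact = [s for s in suggestions if s["confidence"] < CONFIDENCE_THRESHOLD]
for s in high_impact:
    print(f"Review needed: {s['record']!r} -> {s['category']}")
```

Feedback on these uncertain records teaches the model the most per review, which is why they are prioritized.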
Work done by reviewers does not change the Tamr model until a verifier or curator verifies the categorizations for one or more records and a curator updates the Tamr model with the results. Tamr adds the verified expert feedback to its model, improving its accuracy and its ability to automatically categorize large numbers of records. See Verifying Record Categorizations (verifiers) and Updating Categorization Results (curators).
For more information, see User Roles and Tamr Documentation.