Categorization Pipeline

Run Tamr continuously from dataset update through categorization and export.

Automate the running of Tamr jobs and common user tasks in a Categorization project.

Note: You can also use the Tamr Toolbox to automate running Tamr jobs and common user tasks. Tamr Toolbox is a Python library created to provide a simple interface for common interactions with Tamr and common data workflows that include Tamr. See the Tamr Toolbox documentation.

Categorization Pipeline

  1. Find the id of your input dataset: GET /v1/datasets
    When this API is used with a filter parameter such as name==my-dataset.csv, it returns the dataset definition of the named dataset. From the API response, capture the numeric value after datasets/ in the id of the desired dataset; use it as the datasetId in subsequent steps.
  2. (Optional) Delete records in the input dataset: DELETE /v1/datasets/{datasetId}/records
    Take this optional step when you want to remove all records currently in your input dataset and keep only the source records added in the next step. Use the datasetId you obtained in step 1.
  3. Update existing records and add new records to the input dataset: POST /v1/datasets/{datasetId}:updateRecords?header=false
    Update the records of the dataset {datasetId}, using the CREATE command for new records and record updates and the DELETE command to delete individual records. Use the datasetId you obtained in step 1. A minimal Python sketch of steps 1-3 follows this list.
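
The following is a minimal sketch of steps 1-3 in Python using the requests library. The host, credentials, dataset name, and sample records are placeholders, and the BasicCreds authorization header and the newline-delimited command body format are assumptions to verify against your instance's API spec.

```python
import base64
import json
import requests

# Assumed base URL and credentials for a Tamr instance -- replace with your own.
TAMR = "http://localhost:9100/api/versioned/v1"
creds = base64.b64encode(b"admin:password").decode()
headers = {"Authorization": f"BasicCreds {creds}"}

# Step 1: look up the input dataset by name and capture its numeric id.
resp = requests.get(f"{TAMR}/datasets",
                    params={"filter": "name==my-dataset.csv"}, headers=headers)
resp.raise_for_status()
dataset_id = resp.json()[0]["id"].split("/")[-1]  # id ends in ".../datasets/<number>"

# Step 2 (optional): delete all records currently in the input dataset.
requests.delete(f"{TAMR}/datasets/{dataset_id}/records",
                headers=headers).raise_for_status()

# Step 3: upsert records; the body is one JSON command per line (format assumed).
commands = [
    {"action": "CREATE", "recordId": "1", "record": {"name": "widget", "price": "9.99"}},
    {"action": "DELETE", "recordId": "2"},
]
resp = requests.post(f"{TAMR}/datasets/{dataset_id}:updateRecords",
                     params={"header": "false"},
                     headers={**headers, "Content-Type": "application/json"},
                     data="\n".join(json.dumps(c) for c in commands))
resp.raise_for_status()
```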

After updating your input datasets:

  1. Find the id and unified dataset name of your categorization project: GET /v1/projects
    When this API is used with a filter parameter such as name==my-project, it returns the project definition of the named project. From the API response, capture the numeric value after projects/ in the id to use as the project id in subsequent steps, and capture the unifiedDatasetName of the project.
  2. Refresh unified dataset: POST /v1/projects/{project}/unifiedDataset:refresh
    Update the unified dataset of the project using its project id {project}, obtained in step 1. Additionally, capture the id of the submitted operation from the response.
  3. Wait for operation: GET /v1/operations/{operationId}
    Using the id captured in the previous step, poll the status of the operation until status.state=SUCCEEDED is returned.
  4. (Optional) Train categorization model: POST /v1/projects/{project}/categorizations/model:refresh
    If users added manual categorizations and you would like to update the categorization model to incorporate this information, run this step. If you would prefer to use a previously trained model, skip this step. Use the project id obtained in step 1.
    Additionally, capture the id of the submitted operation from the response.
  5. (Optional) Wait for operation: GET /v1/operations/{operationId}
    If you completed the previous step, complete this step too. Using the id captured in the previous step, poll the status of the operation until status.state=SUCCEEDED is returned.
  6. Categorize records: POST /v1/projects/{project}/categorizations:refresh
    Apply the categorization model for the project {project}. Additionally, capture the id of the submitted operation from the response. Use the project id obtained in step 1.
  7. Wait for operation: GET /v1/operations/{operationId}
    Using the id captured in the previous step, poll the status of the operation until status.state=SUCCEEDED is returned.
  8. Find the id of the output dataset: GET /v1/datasets
    When this API is used with a filter parameter such as name==my-dataset.csv, it returns the dataset definition of the named dataset. For the export dataset, find the dataset with a name that matches <unified_dataset_name>_classifications_with_data, using the unified dataset name of your project found in step 1. From the API response, capture the numeric value after datasets/ in the id of the desired dataset to use as the output dataset id in subsequent steps.
  9. Refresh the output dataset: POST /v1/datasets/{datasetId}:refresh
    To refresh the final output dataset, use the dataset id you captured in the previous step.
  10. Wait for operation: GET /v1/operations/{operationId}
    Using the id captured in the previous step, poll the status of the operation until status.state=SUCCEEDED is returned.
  11. Stream the records of the output dataset: GET /v1/datasets/{datasetId}/records
    Obtain the records of the output dataset as JSON or AVRO, using the id of the output dataset captured in step 8. A Python sketch of the full sequence above follows this list.
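
Below is a minimal end-to-end sketch of steps 1-11, reusing the TAMR base URL and headers defined in the earlier snippet. The project name and polling interval are placeholders, and the terminal failure states checked by the polling helper are assumptions; verify field names against the API spec for your release.

```python
import time
import requests

def get_json(path, **params):
    """GET a versioned-API path and return the parsed JSON body."""
    resp = requests.get(f"{TAMR}/{path}", params=params, headers=headers)
    resp.raise_for_status()
    return resp.json()

def post_json(path):
    """POST to a versioned-API path and return the submitted operation."""
    resp = requests.post(f"{TAMR}/{path}", headers=headers)
    resp.raise_for_status()
    return resp.json()

def wait_for_operation(op, interval=5):
    """Poll GET /v1/operations/{operationId} until the job finishes."""
    op_id = op["id"].split("/")[-1]
    while True:
        state = get_json(f"operations/{op_id}")["status"]["state"]
        if state == "SUCCEEDED":
            return
        if state in ("FAILED", "CANCELED"):  # assumed terminal failure states
            raise RuntimeError(f"operation {op_id} ended in state {state}")
        time.sleep(interval)

# Step 1: find the project id and unified dataset name.
project = get_json("projects", filter="name==my-project")[0]
project_id = project["id"].split("/")[-1]
unified_name = project["unifiedDatasetName"]

# Steps 2-3: refresh the unified dataset and wait for it to finish.
wait_for_operation(post_json(f"projects/{project_id}/unifiedDataset:refresh"))

# Steps 4-5 (optional): retrain the model on any new manual categorizations.
wait_for_operation(post_json(f"projects/{project_id}/categorizations/model:refresh"))

# Steps 6-7: apply the categorization model and wait.
wait_for_operation(post_json(f"projects/{project_id}/categorizations:refresh"))

# Step 8: locate the export dataset <unified_dataset_name>_classifications_with_data.
export = get_json("datasets", filter=f"name=={unified_name}_classifications_with_data")[0]
export_id = export["id"].split("/")[-1]

# Steps 9-10: refresh the export dataset and wait.
wait_for_operation(post_json(f"datasets/{export_id}:refresh"))

# Step 11: stream the categorized records as newline-delimited JSON.
with requests.get(f"{TAMR}/datasets/{export_id}/records", headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))
```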

Alternate Methods for Data Ingestion and Export

One method each for data ingestion and export has been shown here. For other methods, see Exporting a Dataset to a Local File System and Uploading a Dataset into a Project.
