Mastering Pipeline

Automate running Tamr Core jobs and common user tasks in a mastering project.

Before You Begin

Verify the following before completing the procedures in this topic:

Updating the Input Datasets

Complete this procedure for each input dataset in your project.

To update an input dataset:

  1. Find the id of your input dataset: GET /v1/datasets
    When this endpoint is called with a filter parameter such as name==my-dataset.csv, it returns the dataset definition of the named dataset. From the API response, capture the numeric value that follows datasets/ in the id field of the desired dataset; use this value as the datasetId in subsequent steps.
  2. (Optional) Delete (truncate) records in the input dataset: DELETE /v1/datasets/{datasetId}/records
    Complete this step if you want to remove all records currently in your input dataset and only include source records added during the next step. Use the datasetId you obtained in step 1.
  3. Update dataset: POST /v1/datasets/{datasetId}:updateRecords
    Update the records of the dataset {datasetId}, using the CREATE command for new records and record updates, and the DELETE command to delete individual records. Use the datasetId you obtained in step 1. If you completed step 2, all records in this step are effectively inserted; no updates occur. A Python sketch of this procedure appears after this list.
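
For example, the following Python sketch runs these three steps end to end. The base URL, credentials, and record payload are placeholders (deployments differ in host and authentication scheme), and the newline-delimited CREATE/DELETE payload shown is an assumption drawn from the command names above; confirm the exact command schema in the API Reference before using this as more than a sketch.

  import json
  import requests

  BASE = "https://tamr.example.com/api/versioned/v1"  # placeholder; adjust host and base path for your deployment
  AUTH = ("username", "password")  # placeholder; substitute your deployment's authentication

  # Step 1: look up the dataset definition by name and capture its numeric id.
  resp = requests.get(f"{BASE}/datasets", params={"filter": "name==my-dataset.csv"}, auth=AUTH)
  resp.raise_for_status()
  # Assumes the filter matches exactly one dataset; take the value after "datasets/".
  dataset_id = resp.json()[0]["id"].split("datasets/")[-1]

  # Step 2 (optional): truncate the dataset so only the records sent below remain.
  requests.delete(f"{BASE}/datasets/{dataset_id}/records", auth=AUTH).raise_for_status()

  # Step 3: send CREATE/DELETE commands (hypothetical records for illustration).
  commands = [
      {"action": "CREATE", "recordId": "1", "record": {"name": "Alice", "city": "Boston"}},
      {"action": "DELETE", "recordId": "2"},
  ]
  resp = requests.post(
      f"{BASE}/datasets/{dataset_id}:updateRecords",
      data="\n".join(json.dumps(c) for c in commands),
      headers={"Content-Type": "application/json"},
      auth=AUTH,
  )
  resp.raise_for_status()

Because step 2 removes every existing record, running the full sketch leaves the dataset containing exactly the records sent in the final call.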

Generating, Publishing, and Exporting Clusters

Complete this procedure after updating your input datasets.

To generate, publish, and export clusters:

  1. Find the id of your project: GET /v1/projects
    When this endpoint is called with a filter parameter such as name==my-project, it returns the project definition of the named project. From the API response, capture the numeric value that follows projects/ in the id field to use as {project} in subsequent steps, and capture the unifiedDatasetName of the desired project.
  2. Update unified dataset: POST /v1/projects/{project}/unifiedDataset:refresh
  3. Wait for operation: GET /v1/operations/{operationId}
    Using the operation id captured from the response in step 2, poll the operation until status.state=SUCCEEDED is returned. (A polling helper appears in the sketch after this list.)
  4. Generate pairs: POST /v1/projects/{project}/recordPairs:refresh
    Generate pairs for the project {project}.
  5. (Optional) Generate high-impact pairs: POST /v1/projects/{project}/highImpactPairs:refresh
  6. Wait for operation: GET /v1/operations/{operationId}
  7. Predict matching pairs: POST /v1/projects/{project}/recordPairsWithPredictions:refresh
  8. Wait for operation: GET /v1/operations/{operationId}
  9. Generate record clusters: POST /v1/projects/{project}/recordClusters:refresh
    Apply the latest mastering model for the project {project} and generate clusters.
  10. Wait for operation: GET /v1/operations/{operationId}
  11. (Optional) To continuously monitor model performance:
    a. Generate test records and clusters for users to curate: POST /v1/projects/{project}/testRecords:refresh
    Tamr Core uses the cluster verification provided on these test records to compute cluster accuracy metrics.
    b. Wait for operation: GET /v1/operations/{operationId}
    c. Generate high-impact training clusters for curation: POST /v1/projects/{project}/trainingClusters:refresh
    In the UI, users can filter to high-impact clusters to review and verify them.
    d. Wait for operation: GET /v1/operations/{operationId}
    e. Compute cluster accuracy metrics: POST /v1/projects/{project}/clustersAccuracy:refresh
    Use this endpoint after users provide feedback on most of the test records. (Learn more about cluster accuracy metrics.)
    f. Wait for operation: GET /v1/operations/{operationId}
  12. Publish clusters: POST /v1/projects/{project}/publishedClusters:refresh
  13. Wait for operation: GET /v1/operations/{operationId}
  14. Download cluster records: GET /v1/projects/{project}/publishedClustersWithData/records
    Download as CSV if text/csv is specified for the “Accept” header. Other supported types are avro/binary to get an Avro file, or application/json to get JSON.
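
As a sketch of the full sequence, the following Python chains the required refresh jobs and polls each resulting operation until status.state=SUCCEEDED, then downloads the published clusters as CSV. The base URL, credentials, project id, and poll interval are placeholders, and the optional jobs (steps 5 and 11) are omitted for brevity; substitute their endpoints into the same post-and-wait pattern if you use them.

  import time
  import requests

  BASE = "https://tamr.example.com/api/versioned/v1"  # placeholder; adjust for your deployment
  AUTH = ("username", "password")  # placeholder authentication
  PROJECT = "1"  # numeric project id captured in step 1

  def wait_for(operation_id):
      """Poll GET /v1/operations/{operationId} until the job reaches a terminal state."""
      while True:
          op = requests.get(f"{BASE}/operations/{operation_id}", auth=AUTH)
          op.raise_for_status()
          state = op.json()["status"]["state"]
          if state == "SUCCEEDED":
              return
          if state in ("FAILED", "CANCELED"):
              raise RuntimeError(f"Operation {operation_id} ended in state {state}")
          time.sleep(10)  # poll interval is arbitrary; tune for your job sizes

  # Steps 2-13: run each required job in order, waiting for each before starting the next.
  jobs = [
      "unifiedDataset:refresh",              # step 2
      "recordPairs:refresh",                 # step 4
      "recordPairsWithPredictions:refresh",  # step 7
      "recordClusters:refresh",              # step 9
      "publishedClusters:refresh",           # step 12
  ]
  for job in jobs:
      resp = requests.post(f"{BASE}/projects/{PROJECT}/{job}", auth=AUTH)
      resp.raise_for_status()
      # Tolerates either a bare operation id or a resource path such as operations/123.
      wait_for(resp.json()["id"].split("/")[-1])

  # Step 14: download the published cluster records as CSV.
  resp = requests.get(
      f"{BASE}/projects/{PROJECT}/publishedClustersWithData/records",
      headers={"Accept": "text/csv"},
      auth=AUTH,
  )
  resp.raise_for_status()
  with open("published_clusters.csv", "wb") as f:
      f.write(resp.content)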

Alternate Methods of Data Ingestion and Export

This topic describes one method of data ingestion and data export. For other methods, see Exporting a Dataset to a Local File System and Uploading a Dataset into a Project.