
Mastering Pipeline

Automate running Tamr Core jobs and common user tasks in a mastering project.

Before You Begin

Verify the following before completing the procedures in this topic:

Updating the Input Datasets

Complete this procedure for each input dataset in your project.

To update an input dataset:

  1. Find the id of your input dataset: GET /v1/datasets
    When this API is used with a filter parameter such as name==my-dataset.csv, it returns the dataset definition of the named dataset. From the API response, capture the numeric value that follows datasets/ in the id field of the desired dataset; use it as the datasetId in subsequent steps.
  2. (Optional) Delete records in the input dataset: DELETE /v1/datasets/{datasetId}/records
    Complete this step if you want to remove all records currently in your input dataset and only include source records added during the next step. Use the datasetId you obtained in step 1.
  3. Update the dataset: POST /v1/datasets/{datasetId}:updateRecords
    Update the records of dataset {datasetId}, using the CREATE command for new records and record updates and the DELETE command to delete individual records. Use the datasetId you obtained in step 1. If you completed step 2, all records in this step are effectively inserted; that is, no updates occur.
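The three steps above can be sketched in Python using only the standard library. The base URL, authorization header, and the newline-delimited JSON shape of the updateRecords payload are assumptions for illustration; confirm them against the API reference for your Tamr Core version.

```python
import json
import re
import urllib.request

BASE = "https://tamr.example.com/api/versioned"  # hypothetical host
HEADERS = {"Authorization": "BasicCreds <credentials>"}  # placeholder auth

def numeric_dataset_id(resource_id: str) -> str:
    """Capture the numeric value that follows datasets/ in a dataset's id field."""
    match = re.search(r"datasets/(\d+)", resource_id)
    if match is None:
        raise ValueError(f"no dataset id found in {resource_id!r}")
    return match.group(1)

def update_commands(creates, delete_ids=()):
    """Serialize updateRecords commands as newline-delimited JSON:
    CREATE for inserts and updates, DELETE for individual removals.
    (The exact command schema is an assumption.)"""
    lines = [json.dumps({"action": "CREATE", "recordId": rid, "record": rec})
             for rid, rec in creates]
    lines += [json.dumps({"action": "DELETE", "recordId": rid})
              for rid in delete_ids]
    return "\n".join(lines)

def call(method, path, body=None):
    """Minimal HTTP helper around urllib (no third-party dependencies)."""
    req = urllib.request.Request(BASE + path, method=method,
                                 data=body.encode() if body else None,
                                 headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

def update_input_dataset(name, creates, delete_ids=(), clear_first=False):
    # Step 1: find the dataset's numeric id by filtering on its name.
    listing = json.loads(call("GET", f"/v1/datasets?filter=name=={name}"))
    dataset_id = numeric_dataset_id(listing[0]["id"])
    # Step 2 (optional): remove all records currently in the dataset.
    if clear_first:
        call("DELETE", f"/v1/datasets/{dataset_id}/records")
    # Step 3: apply CREATE/DELETE commands to the dataset.
    call("POST", f"/v1/datasets/{dataset_id}:updateRecords",
         body=update_commands(creates, delete_ids))
```

Because the id parsing and command serialization are pure functions, they can be reused and tested independently of any live Tamr instance.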

Generating, Publishing, and Exporting Clusters

Complete this procedure after updating your input datasets.

To generate, publish, and export clusters:

  1. Find the id of your project dataset: GET /v1/projects
    When this API is used with a filter parameter such as name==my-project, it returns the project definition of the named project. From the API response, capture the numeric value that follows projects/ in the id field (used as {project} in subsequent steps) and the project's unifiedDatasetName.
  2. Update Unified Dataset: POST /v1/projects/{project}/unifiedDataset:refresh
  3. Wait for operation: GET /v1/operations/{operationId}
    Using the operation id captured from the response in step 2, poll the operation until status.state=SUCCEEDED is returned.
  4. Generate Record Pairs: POST /v1/projects/{project}/recordPairs:refresh
    Generate record pairs for the project {project}.
  5. Wait for operation: GET /v1/operations/{operationId}
  6. Predict matching pairs: POST /v1/projects/{project}/recordPairsWithPredictions:refresh
  7. Wait for operation: GET /v1/operations/{operationId}
  8. Generate record clusters: POST /v1/projects/{project}/recordClusters:refresh
    Apply the latest mastering model for the project {project} and generate clusters.
  9. Wait for operation: GET /v1/operations/{operationId}
  10. Publish clusters: POST /v1/projects/{project}/publishedClusters:refresh
  11. Wait for operation: GET /v1/operations/{operationId}
  12. Download cluster records: GET /v1/projects/{project}/publishedClustersWithData/records
    Download as CSV if text/csv is specified for the “Accept” header. Other supported types are avro/binary to get an Avro file, or application/json to get JSON.
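Every :refresh call in the procedure above returns an operation that must finish before the next step starts (steps 3, 5, 7, 9, and 11). A minimal polling helper is sketched below; the status fetcher is injected so the same logic serves any of the wait steps, and the terminal states, polling interval, and timeout are assumptions.

```python
import time

def wait_for_operation(fetch_state, interval_s=5.0, timeout_s=3600.0):
    """Poll fetch_state() -- e.g. a call to GET /v1/operations/{operationId}
    that returns the operation's status.state -- until a terminal state is
    reached; return that final state string."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()  # e.g. "PENDING", "RUNNING", "SUCCEEDED"
        if state in ("SUCCEEDED", "FAILED", "CANCELED"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation still {state} after {timeout_s}s")
        time.sleep(interval_s)
```

A driver for the full pipeline would then alternate POST :refresh calls with wait_for_operation, stopping if any step returns a state other than SUCCEEDED.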

Alternate Methods of Data Ingestion and Export

This topic provides one method of data ingestion and data export. For other methods, see Exporting a Dataset to a Local File System and Uploading a Dataset into a Project.
