Mastering Pipeline
Automate running Tamr Core jobs and common user tasks in a mastering project.
Before You Begin
Verify the following before completing the procedures in this topic:
- At least one mastering project exists (Creating a Project).
- The project includes at least one dataset, and you have performed schema mapping on the project’s unified dataset (Adding a Dataset).
- You have run the Update Unified Dataset, Generate Pairs, and Apply Feedback and Update Results jobs at least once.
- You have run the Review and Update Clusters job at least once.
Updating the Input Datasets
Complete this procedure for each input dataset in your project.
To update an input dataset:
- Find the id of your input dataset: GET /v1/datasets
When this API is used with a filter parameter such as name==my-dataset.csv, it returns the dataset definition of the named dataset. From the API response, capture the numeric value after datasets/ from the id of the desired dataset to use as the datasetId in subsequent steps.
- (Optional) Delete (truncate) records in the input dataset: DELETE /v1/datasets/{datasetId}/records
Complete this step if you want to remove all records currently in your input dataset and only include source records added during the next step. Use the datasetId you obtained in step 1.
- Update dataset: POST /v1/datasets/{datasetId}:updateRecords
Update the records of the dataset {datasetId} using the command CREATE for new records and record updates, and the command DELETE to delete individual records. Use the datasetId you obtained in step 1. If you completed step 2, all records in this step are effectively inserted. In other words, no updates occur.
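The id lookup and URL construction in the steps above can be sketched as follows. This is a minimal sketch rather than Tamr's client library: the helper names (`numeric_id`, `dataset_urls`), the base URL, and the sample `id` value are hypothetical, and the id string in a real response may look different. The code assumes only that the numeric value follows `datasets/`.

```python
import re


def numeric_id(resource_id: str, collection: str) -> str:
    """Return the numeric value that follows '<collection>/' in an id string."""
    match = re.search(rf"{re.escape(collection)}/(\d+)", resource_id)
    if match is None:
        raise ValueError(f"no numeric id after '{collection}/' in {resource_id!r}")
    return match.group(1)


def dataset_urls(base_url: str, dataset_id: str) -> dict:
    """Build the URLs for steps 2 and 3 from the datasetId found in step 1."""
    root = f"{base_url.rstrip('/')}/v1/datasets/{dataset_id}"
    return {
        "truncate": f"{root}/records",      # DELETE, optional step 2
        "update": f"{root}:updateRecords",  # POST, step 3
    }


# Hypothetical id value and base URL, for illustration only.
sample_id = "unify://unified-data/v1/datasets/42"
dataset_id = numeric_id(sample_id, "datasets")
urls = dataset_urls("http://localhost:9100", dataset_id)
print(dataset_id)
print(urls["truncate"])
print(urls["update"])
```

The same `numeric_id` helper can be reused for projects by passing `"projects"` as the collection name.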
Generating, Publishing, and Exporting Clusters
Complete this procedure after updating your input datasets.
To generate, publish, and export clusters:
- Find the id of your project: GET /v1/projects
When this API is used with a filter parameter such as name==my-project, it returns the project definition of the named project. From the API response, capture the numeric value after projects/ from the id to use as the project id in subsequent steps, and the unifiedDatasetName of the desired project.
- Update unified dataset: POST /v1/projects/{project}/unifiedDataset:refresh
- Wait for operation: GET /v1/operations/{operationId}
Using the operation id captured from the response in step 2, poll the state of the operation until status.state=SUCCEEDED is received.
- Generate record pairs: POST /v1/projects/{project}/recordPairs:refresh
Generate record pairs using the project id {project}.
- (Optional) Generate high-impact pairs: POST /v1/projects/{project}/highImpactPairs:refresh
- Wait for operation: GET /v1/operations/{operationId}
- Predict matching pairs: POST /v1/projects/{project}/recordPairsWithPredictions:refresh
- Wait for operation: GET /v1/operations/{operationId}
- Generate record clusters: POST /v1/projects/{project}/recordClusters:refresh
Apply the latest mastering model for the project {project} and generate clusters.
- Wait for operation: GET /v1/operations/{operationId}
- (Optional) To continuously monitor model performance:
a. Generate test records and clusters for users to curate: POST /v1/projects/{project}/testRecords:refresh
Tamr Core uses the cluster verification on these test records to compute cluster accuracy metrics.
b. Wait for operation: GET /v1/operations/{operationId}
c. Generate high-impact training clusters for curation: POST /v1/projects/{project}/trainingClusters:refresh
Users can filter to high-impact clusters in the UI to review and verify them.
d. Wait for operation: GET /v1/operations/{operationId}
e. Compute cluster accuracy metrics: POST /v1/projects/{project}/clustersAccuracy:refresh
Use this endpoint after users provide feedback on most of the test records. (Learn more about cluster accuracy metrics).
f. Wait for operation: GET /v1/operations/{operationId}
- Publish clusters: POST /v1/projects/{project}/publishedClusters:refresh
- Wait for operation: GET /v1/operations/{operationId}
- Download cluster records: GET /v1/projects/{project}/publishedClustersWithData/records
Download as CSV if text/csv is specified for the “Accept” header. Other supported types are avro/binary to get an Avro file, or application/json to get JSON.
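The submit-then-wait pattern in the steps above can be sketched as a small loop. This is an illustrative sketch under stated assumptions, not an official client: `post_json` and `get_json` are hypothetical callables that wrap your HTTP client and return parsed JSON, the operation id is assumed to be in the `id` field of each POST response, and the `RUNNING`, `FAILED`, and `CANCELED` state names are assumptions (only `SUCCEEDED` is named in this topic). The optional steps are omitted, and in-memory stand-ins replace the HTTP calls so the example runs without a server.

```python
import time

# Refresh jobs in the order listed above; the optional steps are left out.
JOB_PATHS = [
    "unifiedDataset:refresh",
    "recordPairs:refresh",
    "recordPairsWithPredictions:refresh",
    "recordClusters:refresh",
    "publishedClusters:refresh",
]

# Accept header values listed for the download step.
ACCEPT_TYPES = {"csv": "text/csv", "avro": "avro/binary", "json": "application/json"}


def wait_for_operation(get_json, operation_id, poll_seconds=5.0, sleep=time.sleep):
    """Poll GET /v1/operations/{operationId} until status.state is SUCCEEDED."""
    while True:
        operation = get_json(f"/v1/operations/{operation_id}")
        state = operation["status"]["state"]
        if state == "SUCCEEDED":
            return operation
        if state in ("FAILED", "CANCELED"):  # assumed terminal states
            raise RuntimeError(f"operation {operation_id} ended in state {state}")
        sleep(poll_seconds)


def run_pipeline(post_json, get_json, project_id, **wait_options):
    """Submit each job in order and wait for it to succeed before the next one."""
    for path in JOB_PATHS:
        operation = post_json(f"/v1/projects/{project_id}/{path}")
        wait_for_operation(get_json, operation["id"], **wait_options)
    # Path to download from once the clusters are published.
    return f"/v1/projects/{project_id}/publishedClustersWithData/records"


# In-memory stand-ins for the HTTP calls (hypothetical, for illustration only).
calls = []
polls = {}


def fake_post(path):
    calls.append(path)
    return {"id": str(len(calls))}


def fake_get(path):
    polls[path] = polls.get(path, 0) + 1
    state = "SUCCEEDED" if polls[path] >= 2 else "RUNNING"
    return {"status": {"state": state}}


download_path = run_pipeline(fake_post, fake_get, "7", poll_seconds=0, sleep=lambda _: None)
download_headers = {"Accept": ACCEPT_TYPES["csv"]}
print(calls)
print(download_path, download_headers)
```

Passing the HTTP functions in as arguments keeps the polling logic independent of any particular client library, so the same loop can be exercised against a real server by supplying wrappers that add the base URL and authentication.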
Alternate Methods of Data Ingestion and Export
This topic provides one method of data ingestion and data export. For other methods, see Exporting a Dataset to a Local File System and Uploading a Dataset into a Project.