How to Access the Out-Sample Model Statistics?

In addition to the in-sample pairwise model statistics, you can compute and track out-sample model statistics in Tamr using a set of test records. This article describes how to access the out-sample precision and recall metrics, either through the Tamr UI or programmatically through the Tamr API. For more information on how these metrics are computed, see the Tamr documentation on Precision and Recall Metrics for Clusters.


Note:

  • The out-sample precision and recall metrics are available in the Tamr UI from v2020.020.1.
  • The example code uses non-versioned APIs and is therefore not guaranteed to remain stable across future versions of Tamr.

Prerequisite:

  • A curator must review and verify at least one set of high-impact clusters before the metrics become available.

How to:

To compute and review the out-sample precision and recall metrics in the Tamr UI, see the Tamr documentation on Reviewing Precision and Recall for Clusters. The latest metric is captured in the dataset <unified_dataset_name>_dedup_cluster_accuracy_metrics, which can be exported from the UI.

Alternatively, you can use the Tamr API to access the full history of these metrics by following these steps:

  1. Navigate to the Tamr Swagger docs interface at <hostname>:<port>/docs.
  2. Click on the dataset service.
  3. Click to expand the POST /datasets/{name}/recordVersions API endpoint and specify the name and body fields. Set name to ud_dedup_cluster_accuracy_metrics, replacing ud with the name of your unified dataset. For body, the Python example later in this article sends {"id": 0}.
  4. Click Try Out!
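
For reference, the request that the Swagger steps above submit can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not an official Tamr snippet: the function name build_record_versions_request and the base URL http://localhost:9100 are hypothetical placeholders, the /api/dataset path prefix is taken from the Python client example later in this article, and authentication headers are deliberately left out because they depend on your deployment. The sketch only builds the request; it does not send it.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_record_versions_request(base_url, unified_dataset_name):
    """Build (but do not send) the POST request for the metrics history.

    Hypothetical helper for illustration; add your own authentication
    headers before sending the request to a real Tamr instance.
    """
    # The metrics dataset name is derived from the unified dataset name.
    metrics_dataset_name = f"{unified_dataset_name}_dedup_cluster_accuracy_metrics"
    # Percent-encode the dataset name so it is safe to place in the URL path.
    encoded_name = quote(metrics_dataset_name, safe="")
    url = f"{base_url.rstrip('/')}/api/dataset/datasets/{encoded_name}/recordVersions"
    # Same body as the Python client example in this article.
    body = json.dumps({"id": 0}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed placeholder values; replace with your own host, port, and dataset name.
request = build_record_versions_request("http://localhost:9100", "ud")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it easy to check the URL and body against what the Swagger interface shows before any call is made.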

Alternatively, use the Tamr Python Client:

You may also choose to use the Tamr Python Client to retrieve these metrics programmatically. To do so:

  1. Create an authentication provider and an authenticated Tamr client.
import os

from tamr_unify_client.auth import UsernamePasswordAuth
from tamr_unify_client.client import Client

username = os.environ['TAMR_USERNAME'] # replace with your username environment variable name
password = os.environ['TAMR_PASSWORD'] # replace with your password environment variable name
host = "localhost"                     # replace with your host

auth = UsernamePasswordAuth(username, password)
tamr = Client(auth, host=host)

For more details on passing your credentials securely, see the Tamr Python Client documentation.

  2. Retrieve the mastering project.
project_id = "my_project_id"           # replace with your project ID
project = tamr.projects.by_resource_id(project_id)
project = project.as_mastering()
  3. Retrieve the metrics.
unified_dataset_name = project.unified_dataset().name
metrics_dataset_name = f"{unified_dataset_name}_dedup_cluster_accuracy_metrics"
response = tamr.post(
    f"/api/dataset/datasets/{metrics_dataset_name}/recordVersions", json={"id": 0}
)
response.raise_for_status()            # fail early on an HTTP error
metrics = response.json()["versions"]  # history of the metrics dataset
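
The retrieval step above can also be wrapped in a small reusable helper. The following is a minimal sketch and not part of the Tamr Python Client: the function name fetch_cluster_accuracy_metrics is hypothetical, and it assumes only that the client exposes post(endpoint, json=...) returning a requests-style response, and that the history sits under the "versions" key as in the snippet above.

```python
def fetch_cluster_accuracy_metrics(client, unified_dataset_name):
    """Return the recorded versions of the cluster accuracy metrics dataset.

    Hypothetical helper: `client` is any object with a
    `post(endpoint, json=...)` method that returns a requests-style
    response, such as the authenticated Tamr client created in step 1.
    """
    metrics_dataset_name = f"{unified_dataset_name}_dedup_cluster_accuracy_metrics"
    endpoint = f"/api/dataset/datasets/{metrics_dataset_name}/recordVersions"
    response = client.post(endpoint, json={"id": 0})
    # Surface HTTP errors instead of trying to parse an error body.
    response.raise_for_status()
    # Assumption: the history is stored under "versions"; fall back to an
    # empty list if the key is missing.
    return response.json().get("versions", [])


# Usage with the objects from the steps above (needs a running Tamr instance):
# metrics = fetch_cluster_accuracy_metrics(tamr, project.unified_dataset().name)
# print(f"Retrieved {len(metrics)} metric versions")
```

Passing the client in as an argument keeps the helper free of credentials and lets it be exercised with a stand-in client in tests.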