How to Interact Programmatically with Tamr using Tamr Toolbox

This article describes how to get started using Tamr Toolbox to interact programmatically with a Tamr instance.

Connecting To Tamr

Start by creating a YAML file with connection information, and set the environment variable TAMR_PASSWORD to your user's Tamr password:

tamr_instance:
 host: <host>
 protocol: "http"
 port: "9100"
 username: <username>
 password: $TAMR_PASSWORD

Next, import tamr_toolbox, load the YAML config file, and use it to create an authenticated client:

import tamr_toolbox as tbox

config = tbox.utils.config.from_yaml("<path to yaml file>")
tamr = tbox.utils.client.create(**config["tamr_instance"])
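
The `$TAMR_PASSWORD` placeholder in the YAML file is meant to be filled in from the environment when the config is loaded. As a rough illustration of that idea only (this is not the toolbox's own implementation), the sketch below expands environment variables inside a config dictionary using the standard library. The helper name `expand_env` and the sample values are hypothetical.

```python
import os


def expand_env(value):
    """Recursively expand $VAR references in strings inside dicts and lists."""
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    if isinstance(value, str):
        # os.path.expandvars leaves references to unset variables unchanged
        return os.path.expandvars(value)
    return value


# Hypothetical config mirroring the structure of the YAML file above
raw_config = {
    "tamr_instance": {
        "host": "localhost",
        "protocol": "http",
        "port": "9100",
        "username": "admin",
        "password": "$TAMR_PASSWORD",
    }
}

# Use a dummy value if the variable is not already set (illustration only)
os.environ.setdefault("TAMR_PASSWORD", "example-password")
resolved = expand_env(raw_config)
print(resolved["tamr_instance"]["username"])
```

Keeping the password in an environment variable means the YAML file can be shared or committed without exposing credentials.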

Streaming Out Datasets from Tamr

Tamr Toolbox provides functions to create a pandas DataFrame from a Tamr dataset. By default, all columns that are arrays in Tamr are converted to lists in the DataFrame. Alternatively, you can flatten the dataset, converting each array into a single string joined by a delimiter you specify.

# Retrieve the dataset object from Tamr by name
dataset = tamr.datasets.by_name("my_tamr_dataset_name")
# default will stream all rows and not apply any flattening
df = tbox.data_io.dataframe.from_dataset(dataset)
# get with lists flattened to strings and a subset of columns and rows
flattened_df = tbox.data_io.dataframe.from_dataset(
  dataset, flatten_delimiter="|", columns=["tamr_id", "last_name", "first_name"], nrows=5
)
# if the Tamr dataset is not streamable, pass this option to allow refreshing it
refreshed_df = tbox.data_io.dataframe.from_dataset(dataset, nrows=5, allow_dataset_refresh=True)

# a dataframe can also be flattened after creation
# default will attempt to flatten all columns
flattened_all_df = tbox.data_io.dataframe.flatten(df)
# flatten only a subset of columns, and force non-string inner array types to strings
flattened_last_name_df = tbox.data_io.dataframe.flatten(df, delimiter="|", columns=["last_name"], force=True)
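
To make the flattening behavior concrete without a running Tamr instance, here is a minimal plain-Python sketch of the same idea applied to rows stored as dictionaries. It is not the toolbox's implementation: the helper names `flatten_value` and `flatten_rows` are hypothetical, and raising an error for non-string elements unless `force=True` is an assumption made for this sketch.

```python
def flatten_value(value, delimiter="|", force=False):
    """Join a list into one delimited string; pass other values through."""
    if not isinstance(value, list):
        return value
    if force:
        # Convert every element to a string before joining
        return delimiter.join(str(item) for item in value)
    if not all(isinstance(item, str) for item in value):
        raise TypeError("non-string element found; pass force=True to convert")
    return delimiter.join(value)


def flatten_rows(rows, delimiter="|", columns=None, force=False):
    """Flatten the selected columns (all columns when `columns` is None)."""
    flattened = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if columns is None or name in columns:
                new_row[name] = flatten_value(value, delimiter, force)
            else:
                new_row[name] = value
        flattened.append(new_row)
    return flattened


# Hypothetical sample row with one string array and one integer array
rows = [{"tamr_id": 1, "last_name": ["Smith", "Smyth"], "ages": [41, 42]}]
print(flatten_rows(rows, columns=["last_name"]))
print(flatten_rows(rows, delimiter=",", force=True))
```

Flattening is mostly useful when the result is written to a format without native list support, such as CSV.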

Interacting with Tamr Projects

Tamr Toolbox makes it easy to add datasets to a Tamr project and perform schema mapping. The helper functions let you either bootstrap attributes or provide your own mappings.

To Bootstrap:

import tamr_toolbox as tbox

# grab project and source dataset using the authenticated client created above
# (project_name and source_dataset_name are placeholders for your own values)
project = tamr.projects.by_name(project_name)
source_dataset = tamr.datasets.by_name(source_dataset_name)
tbox.project.mastering.schema.bootstrap_dataset(
    project, source_dataset=source_dataset, force_add_dataset_to_project=True
)

Or provide mappings:

import tamr_toolbox as tbox

# grab project and source dataset using the authenticated client created above
# (project_name and source_dataset_name are placeholders for your own values)
project = tamr.projects.by_name(project_name)
source_dataset = tamr.datasets.by_name(source_dataset_name)

# semicolon-separated pairs of "source attribute,unified attribute"
mappings = "source_attr1,unified_attr1;source_attr2,unified_attr2"

mapping_tuples = [(x.split(",")[0], x.split(",")[1]) for x in mappings.split(";")]
for source_attr, unified_attr in mapping_tuples:
    tbox.project.mastering.schema.map_attribute(
        project,
        source_attribute_name=source_attr,
        source_dataset_name=source_dataset.name,
        unified_attribute_name=unified_attr,
    )
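
The one-line parsing of the mappings string above fails with an unhelpful IndexError if an entry is malformed. A slightly more defensive parser could look like the sketch below. The helper name `parse_mappings` is hypothetical and not part of the toolbox; it only prepares the (source, unified) pairs that would then be passed to `map_attribute`.

```python
def parse_mappings(mappings):
    """Parse 'src1,uni1;src2,uni2' into a list of (source, unified) tuples."""
    pairs = []
    for entry in mappings.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing or doubled semicolon
        parts = [part.strip() for part in entry.split(",")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected 'source,unified' but got {entry!r}")
        pairs.append((parts[0], parts[1]))
    return pairs


mapping_tuples = parse_mappings(
    "source_attr1,unified_attr1; source_attr2,unified_attr2;"
)
for source_attr, unified_attr in mapping_tuples:
    print(f"{source_attr} -> {unified_attr}")
```

Validating the string up front means a typo is reported before any attribute has been mapped in the project.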

Tamr Workflows

Running Project Workflows

Tamr Toolbox is built to make common actions in Tamr simple and flexible from Python, so users can build custom workflows that interact with Tamr through the toolbox. The most common task is running a project's workflow, and the toolbox provides a simple call for this for every project type.

Schema Mapping:

# Retrieve the project using the authenticated client created above
my_project = tamr.projects.by_resource_id(schema_mapping_project_id)

# Run all jobs in project workflow
operations = tbox.project.schema_mapping.jobs.run(my_project)

Categorization:

# Retrieve the project using the authenticated client created above
my_project = tamr.projects.by_resource_id(categorization_project_id)
my_project = my_project.as_categorization()

# Run all jobs in project workflow
operations = tbox.project.categorization.jobs.run(my_project, run_apply_feedback=False)

Mastering:

# Retrieve the project using the authenticated client created above
my_project = tamr.projects.by_resource_id(mastering_project_id)
my_project = my_project.as_mastering()

# Run all jobs in project workflow
operations = tbox.project.mastering.jobs.run(
    my_project, run_apply_feedback=False, run_estimate_pair_counts=False
)

Golden Records:

# Retrieve the project using the authenticated client created above
my_project = tamr.projects.by_resource_id(golden_records_project_id)

# Run all jobs in project workflow
operations = tbox.project.golden_records.jobs.run(my_project)

For more complex workflows, or to run a project step by step, refer to the toolbox documentation.
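
Each run call above returns a collection of operations, and a custom workflow will often want to summarize them before moving on to the next project. The sketch below shows one way to do that. It assumes each operation exposes a `succeeded()` method and a `description` attribute, which is how operation objects in the underlying Python client appear to behave; treat both as assumptions and check them against your installed version. `FakeOperation` and `summarize_operations` are hypothetical names used only so the example runs offline.

```python
from dataclasses import dataclass


@dataclass
class FakeOperation:
    """Stand-in for a Tamr operation, used only so this sketch runs offline."""

    description: str
    state: str

    def succeeded(self):
        # State string is an assumption made for this illustration
        return self.state == "SUCCEEDED"


def summarize_operations(operations):
    """Return (all_succeeded, descriptions of operations that did not succeed)."""
    failed = [op.description for op in operations if not op.succeeded()]
    return (not failed, failed)


# Hypothetical results standing in for the list returned by a jobs.run call
operations = [
    FakeOperation("Update unified dataset", "SUCCEEDED"),
    FakeOperation("Update clusters", "FAILED"),
]
all_ok, failed = summarize_operations(operations)
print(all_ok, failed)
```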

Backup and Restore

Tamr Toolbox provides utilities to create and clean up backups and to restore Tamr from a backup. These are helpful for building a workflow that creates a backup and cleans up old backups on a schedule.

To create a backup:

op = tbox.workflow.backup.initiate_backup(tamr)
backup_id = op.json()["relativeId"]
state = op.json()["state"]
print(f"Completed backup with state {state} and relative ID {backup_id}")

To restore from a backup:

backup_id = "1"  # update with the relativeId of your desired backup file
op = tbox.workflow.backup.initiate_restore(tamr, backup_id)
state = op.json()["state"]
print(f"Completed restore to backup file with ID {backup_id} with state {state}")
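
For the scheduled cleanup mentioned at the start of this section, the core decision is which old backups to remove. The sketch below illustrates one possible retention rule in plain Python and does not call Tamr or the toolbox. The function name `backups_to_delete`, its parameters, and the sample records are hypothetical; the `relativeId` and `state` keys follow the backup response shown above, while the `"SUCCEEDED"` and `"FAILED"` state values are assumptions.

```python
def backups_to_delete(backups, keep_successful=2, keep_failed=1):
    """Pick relative IDs to delete, keeping the newest backups of each kind.

    `backups` is assumed to be ordered oldest to newest. Anything whose state
    is not "SUCCEEDED" is treated as failed, which is a simplification.
    """
    successful = [b["relativeId"] for b in backups if b["state"] == "SUCCEEDED"]
    failed = [b["relativeId"] for b in backups if b["state"] != "SUCCEEDED"]
    # A slice ending at -0 would be empty, so handle keep == 0 explicitly
    old_successful = successful[:-keep_successful] if keep_successful else successful
    old_failed = failed[:-keep_failed] if keep_failed else failed
    return old_successful + old_failed


# Hypothetical backup records, oldest first
backups = [
    {"relativeId": "b1", "state": "SUCCEEDED"},
    {"relativeId": "b2", "state": "FAILED"},
    {"relativeId": "b3", "state": "SUCCEEDED"},
    {"relativeId": "b4", "state": "FAILED"},
    {"relativeId": "b5", "state": "SUCCEEDED"},
]
print(backups_to_delete(backups))
```

Keeping at least one failed backup around can help when diagnosing why a scheduled backup did not complete.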

Conclusion

Tamr Toolbox provides a simple interface for common interactions with Tamr and for common data workflows that include Tamr.

For further details on any of the steps above, or on the toolbox itself, please refer to the public docs.