Backup Configuration

To back up Tamr instances, create a backup directory, and back up various parts of the product.

Preparing for a Backup

Note: Server snapshots are not a replacement for Tamr application backups. Therefore, do not take server snapshots with the intention of using them as Tamr backups. Server snapshots do not provide the correct backups of Tamr configuration. Additionally, if Tamr is running, taking a server snapshot can lead to a corrupt HBase configuration if you later attempt to restore from the snapshot. Instead, take Tamr application backups before introducing any changes.

By default, Tamr stores backups in the local filesystem directory ${TAMR_UNIFY_HOME}/tamr/backups.

Before you run a backup, decide where to store the backup files and which components to back up. A backup location in Tamr can be a directory on the local filesystem, GCS, AWS S3, or HDFS. Tamr recommends using a distributed filesystem instead of the local filesystem for storing the backup files, so that you do not need to manually copy the backup files to the destination server on which you restore from a backup.

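You can configure Tamr to store the backup in a local filesystem, GCS, AWS S3, or HDFS. For information about configuring a backup location, see the following topics in this section:

  • Configuring a Filesystem Backup Location
  • Configuring a Google Cloud Storage (GCS) Backup Location
  • Configuring an AWS S3 Backup Location
  • Configuring an HDFS or ADLS Backup Location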

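Additionally, you can configure backups for PostgreSQL, Elasticsearch, and some configuration variables used in Tamr. See these topics in this section:

  • Configuring Postgres Backup and Restore Binaries
  • Configuring Elasticsearch Backup
  • Configuring Additional Configuration Variables for Backup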

Finally, in cloud environments, Tamr can use cloud-native APIs to make the backup process faster and more efficient. See Configuring GCP native backup for details related to Google Cloud.

Configuring a Filesystem Backup Location

To configure a local filesystem backup location:

Set the value of the configuration variable TAMR_UNIFY_BACKUP_URI to a local filesystem directory using the Tamr administration utility. See Creating or Updating a Configuration Variable.

Configuring a Google Cloud Storage (GCS) Backup Location

To configure a GCS backup location:

  1. Set TAMR_UNIFY_BACKUP_URI to the path to the backup and restore directory in this format: gs://<bucket>/<path/to/backup>, such as: gs://backup-bucket/backup1.
  2. Set TAMR_GOOGLE_APPLICATION_CREDENTIALS to an absolute local path to the service account credentials JSON file, such as: /tmp/gcs/creds.json. For more information, see Creating or Updating a Configuration Variable.
  3. Restart Tamr and its dependencies. See Restart Tamr and its dependencies.
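
Steps 1 and 2 above can be sketched as shell commands. This is an illustration under stated assumptions, not a definitive procedure: the config:set NAME=value form follows the administration utility example shown later on this page, the bucket and credentials paths are the sample values from the steps above, and is_gcs_uri is a hypothetical helper that is not part of Tamr. The restart in step 3 is still required and is not shown.

```shell
# Sample values from the steps above; replace them with your own bucket and key file.
BACKUP_URI="gs://backup-bucket/backup1"
CREDENTIALS="/tmp/gcs/creds.json"

# Hypothetical check: succeed only for URIs shaped like gs://<bucket>/<path>.
is_gcs_uri() {
  case "$1" in
    gs://?*/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Administration utility path as used elsewhere on this page.
ADMIN="${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh"

if ! is_gcs_uri "$BACKUP_URI"; then
  echo "Not a gs://<bucket>/<path> URI: ${BACKUP_URI}" >&2
elif [ -x "$ADMIN" ]; then
  # Steps 1 and 2: set one variable per call.
  "$ADMIN" config:set "TAMR_UNIFY_BACKUP_URI=${BACKUP_URI}"
  "$ADMIN" config:set "TAMR_GOOGLE_APPLICATION_CREDENTIALS=${CREDENTIALS}"
else
  echo "Administration utility not found at: ${ADMIN}" >&2
fi
```

Checking the URI shape first avoids storing a value that lacks a bucket or path component.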

Configuring an AWS S3 Backup Location

See AWS Backup and Restore.

Configuring an HDFS or ADLS Backup Location

See Azure Backup and Restore.

Configuring Postgres Backup and Restore Binaries

To configure Postgres backup and restore binaries:

  1. Set TAMR_PG_DUMP_BINARY to /usr/pgsql-12/bin/pg_dump and TAMR_PG_RESTORE_BINARY to /usr/pgsql-12/bin/pg_restore. See Creating or Updating a Configuration Variable.
  2. Restart Tamr and its dependencies. See Restart Tamr and its dependencies.
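
Before setting these variables, it can help to confirm that the binaries exist at the paths given in step 1. The following is a minimal sketch: check_binary is a hypothetical helper for this illustration, and the commented config:set lines assume the same form as the administration utility example shown later on this page.

```shell
# Paths from step 1; adjust them if PostgreSQL is installed elsewhere.
PG_DUMP="/usr/pgsql-12/bin/pg_dump"
PG_RESTORE="/usr/pgsql-12/bin/pg_restore"

# Hypothetical helper: succeed only if the given path is an executable file.
check_binary() {
  [ -f "$1" ] && [ -x "$1" ]
}

for binary in "$PG_DUMP" "$PG_RESTORE"; do
  if check_binary "$binary"; then
    echo "found: $binary"
  else
    echo "missing or not executable: $binary" >&2
  fi
done

# Once both binaries are found, the variables could be set like this:
#   ${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh config:set TAMR_PG_DUMP_BINARY=/usr/pgsql-12/bin/pg_dump
#   ${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh config:set TAMR_PG_RESTORE_BINARY=/usr/pgsql-12/bin/pg_restore
```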

Configuring Elasticsearch Backup

To configure Elasticsearch backup:

  1. Configure the TAMR_UNIFY_BACKUP_ES configuration variable using the Tamr administration utility. See Creating or Updating a Configuration Variable.
  • If set to true (default), the generated backup file includes a complete copy of all data in the Tamr Elasticsearch instance. Upon restore, the Elasticsearch instance is automatically restored from this copy.
  • If set to false, the generated backup file does not include a copy of data in the Tamr Elasticsearch instance. Upon restore, the Elasticsearch instance is not automatically restored. Restoring Elasticsearch requires running the re-indexing process, which may take several hours. Consult the Tamr knowledge base for details on re-indexing Elasticsearch.
  2. Restart Tamr and its dependencies. See Restart Tamr and its dependencies.

Cloud native backup

GCP native backup

When running on GCP services, Tamr uses native features to power its backup/restore function. This applies specifically to data stored in Bigtable, Cloud SQL, and Google Cloud Storage. Details about configuration for each service are below.

Bigtable

When Tamr is configured to run on Cloud Bigtable, it can use Bigtable's native backup API. When the amount of data managed by Tamr is large, the native backup API performs significantly faster than the export-based alternative.

The native backup API has limitations:

  • Backups may only be restored into the same Bigtable instance
  • Backups expire after a set period, maximum 30 days
  • Backups must be restored into new tables

If these limitations are unacceptable, native backup can be disabled by setting TAMR_BIGTABLE_BACKUP_NATIVE_ENABLED to false (default is true). The shelf-life of native backups can be configured via the variable TAMR_BIGTABLE_BACKUP_NATIVE_TTL, which is counted in days, has a minimum of 1, a maximum of 30, and a default of 14.
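
As an illustration of the TTL bounds described above, the following sketch rejects values outside the range of 1 to 30 days before they are applied. valid_ttl is a hypothetical helper, not part of Tamr, and the commented config:set line assumes the same form as the administration utility example shown later on this page.

```shell
# Hypothetical helper: succeed only for a whole number of days from 1 to 30.
valid_ttl() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 30 ]
}

TTL_DAYS=14   # the documented default

if valid_ttl "$TTL_DAYS"; then
  echo "TAMR_BIGTABLE_BACKUP_NATIVE_TTL=${TTL_DAYS}"
  # To apply:
  #   ${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh config:set TAMR_BIGTABLE_BACKUP_NATIVE_TTL=14
else
  echo "TTL must be between 1 and 30 days: ${TTL_DAYS}" >&2
fi
```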

Note that because backups are restored into new tables, Tamr restores into a new "namespace" and automatically updates TAMR_HBASE_NAMESPACE accordingly. The old namespace is left untouched, so the previous state (and its backups) remains available as a fallback. To avoid additional storage costs, however, you need to clean up the old namespace manually when appropriate. In addition, if you use a YAML file to set Tamr configuration, be sure to update the value of TAMR_HBASE_NAMESPACE in that file (if it is set there at all) before re-applying the configuration from the file.

Cloud SQL

When Tamr is configured to run on Cloud SQL PostgreSQL, it can use Cloud SQL's native Admin API to perform backup. Backup and restore operations are typically faster with this API than with pg_dump, and the API does not require the pg_dump binary to be available.

If you prefer, Cloud SQL native backups can be disabled in favor of pg_dump by setting TAMR_BACKUP_CLOUD_SQL_ENABLED to false (default true).

Google Cloud Storage

When using GCS for Tamr's filesystem and/or as the backup filesystem, Tamr will use gsutil to copy files efficiently. The benefits of gsutil are parallelism and direct copying between GCS locations (without downloading/uploading data via an intermediary).

To be used, gsutil must be present on the Tamr VM and on the PATH of Tamr services. It can be disabled by setting TAMR_BACKUP_GSUTIL_ENABLED to false (default true).

If necessary, you can pass command line options to gsutil by setting TAMR_BACKUP_GSUTIL_EXTRA_ARGS.
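
Because gsutil must be on the PATH of Tamr services, a quick check such as the following can catch a missing installation. This is a sketch only: on_path is a hypothetical helper, and the check reflects the PATH of the shell that runs it, which may differ from the PATH that Tamr services see.

```shell
# Hypothetical helper: succeed if the named command can be resolved in this shell.
on_path() {
  command -v "$1" >/dev/null 2>&1
}

if on_path gsutil; then
  echo "gsutil found at: $(command -v gsutil)"
else
  echo "gsutil not found on PATH; install it or add its directory to PATH" >&2
fi
```

Running the check as the same user that runs Tamr services gives a closer approximation of what those services can resolve.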

Configuring Additional Configuration Variables for Backup

Tamr restores values for configuration variables that have one of these settings:

  • The variable has the Tamr-supplied setting of machineSpecific: false. See the Configuration Variable Reference.
  • The variable is included in the list stored by the TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS configuration variable. You can use this variable to specify additional configuration variables to restore from a backup. See below.

Note: Contact Tamr Support if you are not sure whether you need to back up any additional configuration variables.

To configure additional configuration variables for backup:

  1. Using the administration utility, set the configuration variable TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS to a bracketed, comma-separated list of the quoted names of the Tamr configuration variables that you want to back up. For example, TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS can be set to ["TAMR_DEDUP_NUM_QUESTIONS", "TAMR_ES_MAX_CLAUSE_COUNT"]. See Creating or Updating a Configuration Variable.
  2. Restart Tamr and its dependencies. See Restart Tamr and its dependencies.

${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh config:set TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS='["TAMR_DEDUP_NUM_QUESTIONS"]'
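
When the list contains several variables, quoting the bracketed value by hand is error-prone. The sketch below builds the value from plain names. props_list is a hypothetical helper for illustration, not part of Tamr, and the two variable names are the ones from the example in step 1.

```shell
# Hypothetical helper: turn variable names into a bracketed, quoted, comma-separated list.
props_list() {
  list=""
  for name in "$@"; do
    if [ -n "$list" ]; then list="${list}, "; fi
    list="${list}\"${name}\""
  done
  printf '[%s]\n' "$list"
}

VALUE="$(props_list TAMR_DEDUP_NUM_QUESTIONS TAMR_ES_MAX_CLAUSE_COUNT)"
echo "$VALUE"

# To apply the value with the administration utility:
#   "${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh" config:set "TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS=${VALUE}"
```

Passing the whole NAME=value pair inside double quotes keeps the shell from splitting the value at the space after the comma.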

Configuration Variables That Are Always Restored

For up-to-date information about which configuration variables Tamr always restores, see the Configuration Variable Reference for variables that have machineSpecific: false.

