Configuring Tamr Core Backup

To back up Tamr Core instances, create a backup directory and then back up various parts of the product.

Important: Server snapshots are not a replacement for Tamr Core application backups. Therefore, do not take server snapshots with the intention of using them as Tamr Core backups. Server snapshots do not provide the correct backups of Tamr Core configuration. Additionally, if Tamr Core is running, taking a server snapshot can lead to a corrupt HBase configuration if you later attempt to restore from the snapshot. Instead, take Tamr Core application backups before introducing any changes.

Selecting a Backup and Restore Approach

The following options are available for backing up and restoring Tamr Core instances.

  • To restore to an instance on the same hosting platform and deployment modality, verify that the TAMR_STORAGE_DRIVER_DATA_STORE_BACKUP_ENABLED configuration variable on the instance you are backing up is set to its default value of false.
  • To do a datastore-agnostic backup for the purpose of migrating from a single-node instance hosted on GCP to a scaled-out instance hosted on GCP, set TAMR_STORAGE_DRIVER_DATA_STORE_BACKUP_ENABLED to true on the instance you are backing up.
    No other changes to your backup configuration are needed, and you continue to use the same API calls.

See Setting Configuration Variables and Migrating to a GCP Scale-Out Instance.
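
As a rough sketch, the flag can be inspected and changed with the administration utility's config:get and config:set commands, which appear elsewhere in this guide. The ADMIN variable and both helper functions are hypothetical names used only for illustration. The sketch assumes that TAMR_UNIFY_HOME is set in your environment.

```shell
#!/bin/sh
# Sketch only: ADMIN and the two helpers below are illustrative names, not part of Tamr Core.
# Assumes TAMR_UNIFY_HOME is set unless ADMIN is overridden.
ADMIN="${ADMIN:-${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh}"
FLAG="TAMR_STORAGE_DRIVER_DATA_STORE_BACKUP_ENABLED"

# Print whatever config:get reports for the flag.
show_datastore_backup_flag() {
  "$ADMIN" config:get "$FLAG"
}

# Set the flag; only "true" or "false" is accepted.
set_datastore_backup_flag() {
  case "$1" in
    true|false) "$ADMIN" config:set "$FLAG=$1" ;;
    *) echo "expected true or false, got: $1" >&2; return 1 ;;
  esac
}
```

For a migration backup, you would call set_datastore_backup_flag true on the source instance. For a same-platform restore, show_datastore_backup_flag lets you confirm that the value is still false.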

Selecting a Backup Location

By default, Tamr Core stores backup files in the local filesystem directory: ${TAMR_UNIFY_HOME}/tamr/backups. Depending on your deployment, you can choose to store the backup files on the local filesystem, Google Cloud Platform (GCP), or AWS S3.

Tamr recommends using a distributed filesystem instead of the local filesystem for storing the backup files. In this way, you will not need to manually copy the backup files to the destination server on which you restore from a backup.

See Configuring a Backup Location, below, for instructions.

Selecting Components to Back Up

In addition to the Tamr Core application, you can configure backups for:

Backup Options for Deployments on GCP

In GCP cloud environments, Tamr Core can use cloud-native APIs to make the backup process faster and more efficient. See GCP Native Backup.

Configuring a Backup Location

Depending on your deployment type, configure a backup location on one of the following: a local filesystem, an Azure ADLS filesystem, or Google Cloud Storage (GCS).

Configuring a Filesystem Backup Location

To configure a local filesystem backup location:

Set the value of the configuration variable TAMR_UNIFY_BACKUP_URI to a local filesystem directory using the administration utility. See Setting Configuration Variables.
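
The following is a minimal sketch of that step. It creates the directory first and then points TAMR_UNIFY_BACKUP_URI at it. The ADMIN variable and the set_local_backup_dir helper are hypothetical names, and the absolute-path check is an assumption of this sketch rather than a documented requirement.

```shell
#!/bin/sh
# Sketch only: ADMIN and set_local_backup_dir are illustrative names, not part of Tamr Core.
ADMIN="${ADMIN:-${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh}"

# Create the backup directory if needed, then record it in TAMR_UNIFY_BACKUP_URI.
set_local_backup_dir() {
  dir="$1"
  case "$dir" in
    /*) ;;  # assumption of this sketch: require an absolute path
    *) echo "use an absolute path: $dir" >&2; return 1 ;;
  esac
  mkdir -p "$dir" || return 1
  "$ADMIN" config:set "TAMR_UNIFY_BACKUP_URI=$dir"
}
```

Make sure that the directory is writable by the user that runs Tamr Core.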

Configuring an Azure ADLS Filesystem Backup Location and Settings

To configure ADLS filesystem backup settings:

  1. Create a YAML file at <tamr-home-dir>/custom-conf/config.yaml based on the example below. Replace each placeholder in angle brackets (such as <storage account name>) with the appropriate value for your deployment.
# -- ADLS Filesystem --
TAMR_UNIFY_BACKUP_URI: "https://<storage account name>.dfs.core.windows.net/<path to backups>"
TAMR_BACKUP_FS_EXTRA_CONFIG: "\nadls.gen2.container.name: tamrteamcity\nadls.gen2.client.secret:\
  \ <service account secret>\nadls.gen2.client.id: <service account id>\n\
  adls.gen2.account.name: <storage account name>\nadls.gen2.tenant.id: <storage account tenant id>\n"
TAMR_UNIFY_BACKUP_HDINSIGHT_STORAGE_ACCOUNT_NAME: "<storage account name>"
TAMR_UNIFY_BACKUP_HDINSIGHT_STORAGE_ACCOUNT_KEY: "<storage account key>"
  2. Upload the resulting config.yaml configuration file to Tamr Core:
    <tamr-home-dir>/tamr/utils/unify-admin.sh config:set --file <tamr-home-dir>/custom-conf/config.yaml

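Before uploading the file, it can help to confirm that no angle-bracket placeholders remain in it. The helper below is a hypothetical convenience, not a Tamr Core tool. It simply lists any line that still contains text of the form <...>.

```shell
#!/bin/sh
# Sketch only: find_placeholders is an illustrative helper, not part of Tamr Core.
# Prints, with line numbers, every line that still contains an <angle-bracket> placeholder.
# Returns 0 if at least one placeholder was found and 1 if the file is clean.
find_placeholders() {
  grep -n '<[^>]*>' "$1"
}
```

If find_placeholders prints anything for your config.yaml, replace the listed values before you run config:set --file.
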
Configuring a Google Cloud Storage (GCS) Backup Location

To configure a GCS backup location:

  1. Set TAMR_UNIFY_BACKUP_URI to the path to the backup and restore directory in this format: gs://<bucket>/<path/to/backup>, such as: gs://backup-bucket/backup1.
  2. Set TAMR_GOOGLE_APPLICATION_CREDENTIALS to an absolute local path to the service account credentials JSON file, such as: /tmp/gcs/creds.json. For more information, see Setting Configuration Variables.
  3. Restart Tamr Core and its dependencies. See Restarting Tamr Core.
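
Steps 1 and 2 can be sketched as a small script that checks both values before setting them. The ADMIN variable and the helper names are hypothetical. Calling config:set once per variable is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch only: ADMIN and both helpers are illustrative names, not part of Tamr Core.
ADMIN="${ADMIN:-${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh}"

# Returns 0 if $1 has the form gs://<bucket>/<path>, 1 otherwise.
is_gcs_backup_uri() {
  case "$1" in
    gs://?*/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Validate the URI and the credentials path, then set both configuration variables.
configure_gcs_backup() {
  uri="$1"
  creds="$2"
  is_gcs_backup_uri "$uri" || { echo "not a gs://<bucket>/<path> URI: $uri" >&2; return 1; }
  case "$creds" in
    /*) ;;
    *) echo "credentials path must be absolute: $creds" >&2; return 1 ;;
  esac
  "$ADMIN" config:set "TAMR_UNIFY_BACKUP_URI=$uri" &&
  "$ADMIN" config:set "TAMR_GOOGLE_APPLICATION_CREDENTIALS=$creds"
}
```

For example, configure_gcs_backup gs://backup-bucket/backup1 /tmp/gcs/creds.json sets both variables. The restart in step 3 is still required afterward.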

Configuring PostgreSQL Backup and Restore Binaries

To configure PostgreSQL backup and restore binaries:

  1. Set TAMR_PG_DUMP_BINARY to /usr/pgsql-12/bin/pg_dump and TAMR_PG_RESTORE_BINARY to /usr/pgsql-12/bin/pg_restore. See Setting Configuration Variables.
  2. Restart Tamr Core and its dependencies. See Restarting Tamr Core.
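
Before setting these variables, you may want to confirm that the binaries exist at the paths you intend to use. The check_binaries helper below is a hypothetical convenience for that check, not part of Tamr Core.

```shell
#!/bin/sh
# Sketch only: check_binaries is an illustrative helper, not part of Tamr Core.
# Returns 0 only if every path given is an existing, executable, non-directory file.
check_binaries() {
  for bin in "$@"; do
    if [ ! -x "$bin" ] || [ -d "$bin" ]; then
      echo "missing or not executable: $bin" >&2
      return 1
    fi
  done
}
```

For example, check_binaries /usr/pgsql-12/bin/pg_dump /usr/pgsql-12/bin/pg_restore reports the first path that is missing or not executable.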

Configuring Elasticsearch Backup

To configure Elasticsearch backup:

  1. Configure the TAMR_UNIFY_BACKUP_ES configuration variable using the Tamr administration utility. See Setting Configuration Variables.
  • If set to true (default), the generated backup file includes a complete copy of all data in the Tamr Elasticsearch instance. Upon restore, the Elasticsearch instance is automatically restored from this copy.
  • If set to false, the generated backup file does not include a copy of the data in the Tamr Elasticsearch instance. Upon restore, the Elasticsearch instance is not automatically restored. Restoring Elasticsearch then requires running the re-indexing process, which may take several hours. Consult the Tamr Core Help Center for details on re-indexing Elasticsearch.
  2. Restart Tamr Core and its dependencies. See Restarting Tamr Core.

Configuring Additional Configuration Variables for Backup

When restoring from backup, Tamr Core always restores variables that have the Tamr-supplied setting of machineSpecific: false. For up-to-date information about which configuration variables have this setting, see the Configuration Variable Reference.

You can specify additional configuration variables to restore from backup, using the TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS configuration variable.

Note: Contact Tamr Support if you are not sure whether you need to back up any additional configuration variables.

To configure additional configuration variables for backup:

  1. Using the administration utility, set the value of the configuration variable TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS to a comma-separated list of the configuration variables that you want to back up, as shown in the example below. See Setting Configuration Variables.
  2. Restart Tamr Core and its dependencies. See Restarting Tamr Core.

Example:

${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh \
  config:set \
  TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS='["TAMR_DEDUP_NUM_QUESTIONS", "TAMR_ES_MAX_CLAUSE_COUNT"]'

GCP Native Backup

When running on GCP services, Tamr Core uses native features to power its backup/restore function. This applies specifically to data stored in Bigtable, Cloud SQL, and Google Cloud Storage. Details about configuration for each service are below.

Bigtable

When Tamr Core is configured to run on Cloud Bigtable, it can use Bigtable's native backup API. When Tamr Core manages a large amount of data, the native backup API performs significantly faster than the export-based alternative.

The native backup API has the following limitations:

  • Backups can only be restored into the same Bigtable instance.
  • Backups expire after a set period of at most 30 days.
  • Backups must be restored into new tables.

If needed, disable native backup by setting TAMR_BIGTABLE_BACKUP_NATIVE_ENABLED to false (default is true).

You can configure the expiration time (in days) of each backup using the variable TAMR_BIGTABLE_BACKUP_NATIVE_TTL. The minimum allowed is 1 day, and the maximum is 30 days. The default is 14 days.
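
As an illustration of the allowed range, the hypothetical helper below checks a candidate value before you set TAMR_BIGTABLE_BACKUP_NATIVE_TTL. It assumes that the variable takes a whole number of days.

```shell
#!/bin/sh
# Sketch only: is_valid_backup_ttl is an illustrative helper, not part of Tamr Core.
# Returns 0 if $1 is a whole number of days between 1 and 30 inclusive, 1 otherwise.
is_valid_backup_ttl() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # reject empty input and anything that is not all digits
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 30 ]
}
```

For example, is_valid_backup_ttl 14 succeeds, while 0 and 31 are rejected.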

Important: Because backups are restored into new tables, Tamr Core restores into a new "namespace" and automatically updates TAMR_HBASE_NAMESPACE accordingly. The old namespace is left alone, so the previous state (and its backups) remains available as a fallback. To avoid additional storage costs, clean up the old namespace manually when appropriate. In addition, if you use a YAML file to set Tamr Core configuration, be sure to update the value of TAMR_HBASE_NAMESPACE (if set) before re-applying configuration from the file.

Cloud SQL

When Tamr Core is configured to run on Cloud SQL PostgreSQL, it can use Cloud SQL's native Admin API to perform backup. Backup and restore operations are typically faster with this API than with pg_dump, and the API does not require the pg_dump binary to be available.

If needed, disable Cloud SQL native backups in favor of pg_dump by setting TAMR_BACKUP_CLOUD_SQL_ENABLED to false (default true).

Google Cloud Storage

When using GCS for the Tamr Core filesystem and/or backup filesystem, Tamr Core uses gsutil to copy files efficiently. gsutil provides parallelism and allows direct copying between GCS locations (without downloading/uploading data via an intermediary).

To use gsutil, make sure that it is installed on the Tamr Core VM and is on the PATH of the Tamr Core services.

By default, gsutil is disabled. Enable gsutil by setting TAMR_BACKUP_GSUTIL_ENABLED to true.

If necessary, you can pass command line options to gsutil by setting TAMR_BACKUP_GSUTIL_EXTRA_ARGS.
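
A minimal sketch of enabling gsutil only after confirming that it can be found is shown below. The ADMIN variable and the helper names are hypothetical. Note that this checks the PATH of the shell that runs the script, which may differ from the PATH of the Tamr Core services.

```shell
#!/bin/sh
# Sketch only: ADMIN and both helpers are illustrative names, not part of Tamr Core.
ADMIN="${ADMIN:-${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh}"

# Returns 0 if the named command can be found on the current PATH.
require_on_path() {
  command -v "$1" >/dev/null 2>&1 || { echo "$1 not found on PATH" >&2; return 1; }
}

# Enable gsutil-based copying only when gsutil is available.
enable_gsutil_backup() {
  require_on_path gsutil || return 1
  "$ADMIN" config:set "TAMR_BACKUP_GSUTIL_ENABLED=true"
}
```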

Migrating to a GCP Scale-Out Instance

The following procedure outlines the migration process from a GCP single-node source instance to a scale-out destination instance that uses GCP native services. See GCP Native Backup.

To use backup and restore to migrate from GCP single-node to GCP scale-out:

  1. On the source instance, set TAMR_STORAGE_DRIVER_DATA_STORE_BACKUP_ENABLED=true. See Configuring Tamr Core Backup.
  2. Restart the source Tamr Core instance, and then create the backup file for it.
  3. Verify that the backup manifest has a tamrStorageDriverBackup entry, and then copy the backup file to a location that is accessible from the destination instance and unzip it.
  4. On the destination instance, create the service infrastructure including Bigtable, Google Cloud Storage, and Cloud SQL.
  5. Start the destination Tamr Core instance and its dependencies and validate that Tamr Core is up and running.
  6. Find the value for the TAMR_PERSISTENCE_DB_URL. For example:
    tamr/utils/unify-admin.sh config:get TAMR_PERSISTENCE_DB_URL
    returns a value like
    jdbc:postgresql://google/doit?sslmode=disable&socketFactory=com.google.cloud.sql.postgres.SocketFactory&cloudSqlInstance=tamr-internal-scaletest:us-east1:brt2-test-3
  7. Change the setting for the TAMR_PERSISTENCE_DB_URL by changing the "jdbc:postgresql://google" hostname to "jdbc:postgresql://localhost". For example:
    tamr/utils/unify-admin.sh config:set 'TAMR_PERSISTENCE_DB_URL=jdbc:postgresql://localhost/doit?sslmode=disable&socketFactory=com.google.cloud.sql.postgres.SocketFactory&cloudSqlInstance=tamr-internal-scaletest:us-east1:brt2-test-3'
    Note: Be sure to enclose TAMR_PERSISTENCE_DB_URL=<new value> in single quotation marks ('). Otherwise, the shell interprets the & characters in the URL.
  8. Using your own login (that is, not the tamr functional user), stop the local PostgreSQL service:
    sudo service postgresql stop
  9. Disable the local PostgreSQL service so that it doesn't start up with other dependencies:
    sudo service postgresql disable
  10. Restart the destination Tamr Core instance and use the unify.log to verify that it started successfully and is connecting to the Cloud SQL service.
  11. Download the Cloud SQL proxy and then make it executable:
    wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
    chmod +x cloud_sql_proxy
  12. Start the Cloud SQL proxy, pointing it at the same instance as in your TAMR_PERSISTENCE_DB_URL:
    ./cloud_sql_proxy -instances=tamr-internal-scaletest:us-east1:brt2-test-3=tcp:5432 >proxy.log 2>&1 &
  13. Restore Tamr Core from backup by running POST /v1/instance/restore. See Initiate an asynchronous restore operation.
    Note: The destination instance will now have the same username and password as the old instance. This can cause problems with some workflows.
  14. On the destination instance, verify that the values for the following configuration variables point to the new resources:
  • TAMR_BIGTABLE_CLUSTER_ID
  • TAMR_BIGTABLE_INSTANCE_ID
  • TAMR_FS_URI
  • TAMR_PERSISTENCE_DB_URL
  • TAMR_UNIFY_BACKUP_URI
  • TAMR_UNIFY_DATA_DIR
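
The hostname change in step 7 can be sketched as a small text substitution. The to_localhost_db_url helper is a hypothetical name. It rewrites only a leading jdbc:postgresql://google/ prefix and leaves the rest of the URL, including the cloudSqlInstance parameter, untouched.

```shell
#!/bin/sh
# Sketch only: to_localhost_db_url is an illustrative helper, not part of Tamr Core.
# Replaces the "google" host at the start of the JDBC URL with "localhost".
# URLs that do not start with jdbc:postgresql://google/ are printed unchanged.
to_localhost_db_url() {
  printf '%s\n' "$1" | sed 's|^jdbc:postgresql://google/|jdbc:postgresql://localhost/|'
}
```

When you pass the result to config:set, keep the whole TAMR_PERSISTENCE_DB_URL=<new value> argument quoted so that the shell does not interpret the & characters.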