Configuring Tamr Core Backup
To back up Tamr Core instances, create a backup directory and then back up various parts of the product.
Important: Server snapshots are not a replacement for Tamr Core application backups. Therefore, do not take server snapshots with the intention of using them as Tamr Core backups. Server snapshots do not provide the correct backups of Tamr Core configuration. Additionally, if Tamr Core is running, taking a server snapshot can lead to a corrupt HBase configuration if you later attempt to restore from the snapshot. Instead, take Tamr Core application backups before introducing any changes.
Selecting a Backup and Restore Approach
The following options are available for backing up and restoring Tamr Core instances.
- To restore to an instance on the same hosting platform and deployment modality, verify that the TAMR_STORAGE_DRIVER_DATA_STORE_BACKUP_ENABLED configuration variable is set to its default value of false on the instance you are backing up.
- To do a datastore-agnostic backup for the purpose of migrating from any single-node instance to a scale-out instance hosted on GCP, change the setting of TAMR_STORAGE_DRIVER_DATA_STORE_BACKUP_ENABLED to true on the instance you are backing up.
No other changes need to be made to your backup configuration, and you continue to use the same API calls.
See Setting Configuration Variables and Migrating to a Scale-out GCP Instance.
Selecting a Backup Location
By default, Tamr Core stores backup files in the local filesystem directory ${TAMR_UNIFY_HOME}/tamr/backups. Depending on your deployment, you can choose to store the backup files on the local filesystem, Google Cloud Platform (GCP), or AWS S3.
Tamr recommends using a distributed filesystem instead of the local filesystem for storing the backup files. In this way, you will not need to manually copy the backup files to the destination server on which you restore from a backup.
See Configuring a Backup Location, below, for instructions.
Selecting Components to Back Up
In addition to the Tamr Core application, you can configure backups for:
- PostgreSQL
- (Optional) Elasticsearch
- (Optional) Additional Configuration Variables
Backup Options for Deployments on GCP
In GCP cloud environments, Tamr Core can use cloud-native APIs to make the backup process faster and more efficient. See GCP Native Backup.
Configuring a Backup Location
Depending on your deployment type, configure a backup location on one of the following: a local filesystem, an Azure ADLS filesystem, or Google Cloud Storage (GCS).
Configuring a Filesystem Backup Location
To configure a local filesystem backup location:
Set the value of the configuration variable TAMR_UNIFY_BACKUP_URI to a local filesystem directory using the administration utility. See Setting Configuration Variables.
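As a minimal sketch of the step above, the following shell snippet creates a backup directory and points TAMR_UNIFY_BACKUP_URI at it. The directory /tmp/tamr-backups-example is a hypothetical path, and passing a plain directory path as the value is an assumption based on the default location. The config:set VAR=value form mirrors the administration utility examples used elsewhere in this document. If the administration utility is not found, the snippet only prints the command it would run.

```shell
# Hypothetical backup directory; replace with a path that suits your deployment.
backup_dir="${BACKUP_DIR:-/tmp/tamr-backups-example}"
mkdir -p "$backup_dir"

# Assumption: TAMR_UNIFY_HOME is set in the environment of the user running this.
admin="${TAMR_UNIFY_HOME:-}/tamr/utils/unify-admin.sh"

if [ -x "$admin" ]; then
  "$admin" config:set "TAMR_UNIFY_BACKUP_URI=$backup_dir"
else
  # Dry run: show the command without changing any configuration.
  echo "would run: $admin config:set TAMR_UNIFY_BACKUP_URI=$backup_dir"
fi
```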
Configuring an Azure ADLS Filesystem Backup Location and Settings
To configure ADLS filesystem backup settings:
- Create a YAML file at <tamr-home-dir>/custom-conf/config.yaml based on the example below. Replace each placeholder in angle brackets with the appropriate value for your deployment.
# -- ADLS Filesystem --
TAMR_UNIFY_BACKUP_URI: "https://<storage account name>.dfs.core.windows.net/<path to backups>"
TAMR_BACKUP_FS_EXTRA_CONFIG: "\nadls.gen2.container.name: tamrteamcity\nadls.gen2.client.secret:\
\ <service account secret>\nadls.gen2.client.id: <service account id>\n\
adls.gen2.account.name: <storage account name>\nadls.gen2.tenant.id: <storage account tenant id>\n"
TAMR_UNIFY_BACKUP_HDINSIGHT_STORAGE_ACCOUNT_NAME: "<storage account name>"
TAMR_UNIFY_BACKUP_HDINSIGHT_STORAGE_ACCOUNT_KEY: "<storage account key>"
- Upload the resulting config.yaml configuration file to Tamr Core:
<tamr-home-dir>/tamr/utils/unify-admin.sh config:set --file <tamr-home-dir>/custom-conf/config.yaml
Configuring a Google Cloud Storage (GCS) Backup Location
To configure a GCS backup location:
- Set TAMR_UNIFY_BACKUP_URI to the path to the backup and restore directory in the format gs://<bucket>/<path/to/backup>, such as gs://backup-bucket/backup1.
- Set TAMR_GOOGLE_APPLICATION_CREDENTIALS to an absolute local path to the service account credentials JSON file, such as /tmp/gcs/creds.json. For more information, see Setting Configuration Variables.
- Restart Tamr Core and its dependencies. See Restarting Tamr Core.
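Before setting the variable, it can help to confirm that the value follows the gs://<bucket>/<path/to/backup> format described above. The following is a minimal sketch. The helper name is_gcs_backup_uri is hypothetical, and the check is purely syntactic: it does not verify that the bucket exists or that the credentials can reach it.

```shell
# Hypothetical helper: succeeds when the argument looks like gs://<bucket>/<path>.
# Note: rest, bucket, and path are plain (global) shell variables.
is_gcs_backup_uri() {
  case "$1" in
    gs://*/*)
      rest=${1#gs://}      # strip the scheme
      bucket=${rest%%/*}   # text before the first slash
      path=${rest#*/}      # text after the first slash
      # Reject an empty bucket (gs:///path) or an empty path (gs://bucket/).
      [ -n "$bucket" ] && [ -n "$path" ]
      ;;
    *)
      return 1
      ;;
  esac
}

uri="gs://backup-bucket/backup1"
if is_gcs_backup_uri "$uri"; then
  echo "OK: $uri"
else
  echo "Not a gs://<bucket>/<path> URI: $uri" >&2
fi
```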
Configuring PostgreSQL Backup and Restore Binaries
To configure PostgreSQL backup and restore binaries:
- Set TAMR_PG_DUMP_BINARY to /usr/pgsql-12/bin/pg_dump and TAMR_PG_RESTORE_BINARY to /usr/pgsql-12/bin/pg_restore. See Setting Configuration Variables.
- Restart Tamr Core and its dependencies. See Restarting Tamr Core.
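A quick pre-check can catch a wrong path before Tamr Core is restarted. The sketch below assumes the binaries live at the paths named above. The helper check_binaries is hypothetical and only tests that each file exists and is executable by the current user.

```shell
# Hypothetical helper: fails on the first path that is missing or not executable.
check_binaries() {
  for bin in "$@"; do
    if [ ! -x "$bin" ]; then
      echo "missing or not executable: $bin" >&2
      return 1
    fi
  done
  return 0
}

if check_binaries /usr/pgsql-12/bin/pg_dump /usr/pgsql-12/bin/pg_restore; then
  echo "PostgreSQL backup and restore binaries found"
else
  echo "Adjust TAMR_PG_DUMP_BINARY and TAMR_PG_RESTORE_BINARY to match your installation"
fi
```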
Configuring Elasticsearch Backup
Important: Elasticsearch backup is not supported when backing up to an Azure ADLS filesystem. In that case, the TAMR_UNIFY_BACKUP_ES configuration variable must be set to false.
To configure Elasticsearch backup:
- Configure the TAMR_UNIFY_BACKUP_ES configuration variable using the Tamr administration utility. See Setting Configuration Variables.
  - If set to true (default), the generated backup file includes a complete copy of all data in the Tamr Elasticsearch instance. Upon restore, the Elasticsearch instance is automatically restored from this copy.
  - If set to false, the generated backup file does not include a copy of the data in the Tamr Elasticsearch instance. Upon restore, the Elasticsearch instance is not automatically restored. Restoring Elasticsearch then requires running the re-indexing process, which may take several hours. Consult the Tamr Core Help Center for details on re-indexing Elasticsearch.
- Restart Tamr Core and its dependencies. See Restarting Tamr Core.
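The ADLS restriction described in this section can be checked before changing the setting. This is a minimal sketch. The helper es_backup_supported is hypothetical, and it assumes that an ADLS backup location is recognizable by the https://<storage account name>.dfs.core.windows.net/ URI form shown in the ADLS example.

```shell
# Hypothetical helper: succeeds unless the backup URI points at Azure ADLS.
es_backup_supported() {
  case "$1" in
    https://*.dfs.core.windows.net/*) return 1 ;;  # ADLS: Elasticsearch backup unsupported
    *) return 0 ;;
  esac
}

backup_uri="gs://backup-bucket/backup1"   # example value
if es_backup_supported "$backup_uri"; then
  echo "TAMR_UNIFY_BACKUP_ES may be true or false for $backup_uri"
else
  echo "Set TAMR_UNIFY_BACKUP_ES to false for $backup_uri"
fi
```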
Configuring Additional Configuration Variables for Backup
When restoring from backup, Tamr Core always restores variables that have the Tamr-supplied setting of machineSpecific: false. For up-to-date information about which configuration variables have this setting, see the Configuration Variable Reference.
You can specify additional configuration variables to restore from backup using the TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS configuration variable.
Note: If you are not sure whether you need to back up any additional configuration variables, contact Tamr Support.
To configure additional configuration variables for backup:
- Using the administration utility, set the value of the configuration variable TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS to a comma-separated list of the configuration variables that you want to back up, as shown in the example below. See Setting Configuration Variables.
- Restart Tamr Core and its dependencies. See Restarting Tamr Core.
Example:
${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh config:set TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS='["TAMR_DEDUP_NUM_QUESTIONS", "TAMR_ES_MAX_CLAUSE_COUNT"]'
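When the list of variables is long, the bracketed, quoted value can be generated rather than typed by hand. The sketch below is illustrative only. The helper extra_props_value is hypothetical, and it simply reproduces the value format used in the example above without validating that the names are real configuration variables.

```shell
# Hypothetical helper: prints its arguments as ["A", "B", ...].
extra_props_value() {
  out=""
  for name in "$@"; do
    if [ -z "$out" ]; then
      out="\"$name\""
    else
      out="$out, \"$name\""
    fi
  done
  printf '[%s]\n' "$out"
}

value=$(extra_props_value TAMR_DEDUP_NUM_QUESTIONS TAMR_ES_MAX_CLAUSE_COUNT)
# Print the argument to pass to config:set (wrap it in single quotes on the command line).
echo "TAMR_UNIFY_BACKUP_EXTRA_CONFIG_PROPS=$value"
```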
GCP Native Backup
When running on GCP services, Tamr Core uses native features to power its backup/restore function. This applies specifically to data stored in Bigtable, Cloud SQL, and Google Cloud Storage. Details about configuration for each service are below.
Bigtable
When Tamr Core is configured to run on Cloud Bigtable, it can use Bigtable's native backup API. Tamr supports only native backup for Bigtable. When Tamr Core manages a large amount of data, the native backup API performs significantly faster than the export-based alternative.
The native backup API has the following limitations:
- Backups can only be restored into the same Bigtable instance.
- Backups expire after a set period, maximum 30 days.
- Backups must be restored into new tables.
You can configure the expiration time (in days) of each backup using the variable TAMR_BIGTABLE_BACKUP_NATIVE_TTL. The minimum allowed is 1 day, and the maximum is 30 days. The default is 14 days.
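The allowed range can be enforced in a deployment script before the value is applied. This is a minimal sketch: the helper is_valid_bigtable_ttl is hypothetical and encodes only the 1 to 30 day bounds stated above.

```shell
# Hypothetical helper: succeeds when the argument is a whole number from 1 to 30.
is_valid_bigtable_ttl() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 30 ]
}

ttl_days=14   # the documented default
if is_valid_bigtable_ttl "$ttl_days"; then
  echo "TAMR_BIGTABLE_BACKUP_NATIVE_TTL=$ttl_days is within the allowed range"
else
  echo "TTL must be between 1 and 30 days: $ttl_days" >&2
fi
```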
Important: Because backups are restored into new tables, Tamr Core restores into a new "namespace" and automatically updates TAMR_HBASE_NAMESPACE accordingly. The old namespace is left alone. In this way, the previous state (and its backups) remains present as a fallback. To avoid the additional storage costs, clean up the old namespace manually when appropriate. In addition, if you are using a YAML file to set Tamr Core configuration, be sure to update the value of TAMR_HBASE_NAMESPACE (if set) before re-applying configuration from the file.
Cloud SQL
When Tamr Core is configured to run on Cloud SQL PostgreSQL, it can use Cloud SQL's native Admin API to perform backup. Backup and restore operations are typically faster with this API than with pg_dump, and the API does not require the pg_dump binary to be available.
If needed, disable Cloud SQL native backups in favor of pg_dump by setting TAMR_BACKUP_CLOUD_SQL_ENABLED to false (default: true).
Google Cloud Storage
When using GCS for the Tamr Core filesystem and/or backup filesystem, Tamr Core uses gsutil to copy files efficiently. gsutil provides parallelism and allows direct copying between GCS locations (without downloading/uploading data via an intermediary).
Tamr Core can use gsutil only if it is present on the Tamr Core VM and on the PATH of Tamr Core services.
By default, gsutil is disabled. Enable gsutil by setting TAMR_BACKUP_GSUTIL_ENABLED to true.
If necessary, you can pass command line options to gsutil by setting TAMR_BACKUP_GSUTIL_EXTRA_ARGS.
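A simple way to confirm the PATH requirement is to look the command up from a shell. The sketch below uses the standard command -v lookup. The helper require_on_path is hypothetical, and the check reflects the PATH of the shell that runs it, which is assumed (not guaranteed) to match the PATH seen by Tamr Core services.

```shell
# Hypothetical helper: succeeds when the named command resolves on the current PATH.
require_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "$1 not found on PATH" >&2
  return 1
}

if require_on_path gsutil; then
  echo "gsutil found at $(command -v gsutil)"
else
  echo "Install gsutil or extend PATH before setting TAMR_BACKUP_GSUTIL_ENABLED to true"
fi
```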
Migrating to a GCP Scale-Out Instance
The following procedure outlines the migration process from a GCP single-node source instance to a scale-out destination instance that uses GCP native services. See GCP Native Backup.
To use backup and restore to migrate from GCP single-node to GCP scale-out:
- On the source instance, set TAMR_STORAGE_DRIVER_DATA_STORE_BACKUP_ENABLED=true. See Configuring Tamr Core Backup.
- Restart the source Tamr Core instance, and then create the backup file for it.
- Verify that the backup manifest has a tamrStorageDriverBackup entry, and then copy the backup file to a location that is accessible from the destination instance and unzip it.
- On the destination instance, create the service infrastructure, including Bigtable, Google Cloud Storage, and Cloud SQL.
- Start the destination Tamr Core instance and its dependencies and validate that Tamr Core is up and running.
- Find the value of TAMR_PERSISTENCE_DB_URL. For example:
tamr/utils/unify-admin.sh config:get TAMR_PERSISTENCE_DB_URL
returns a value like
jdbc:postgresql://google/doit?sslmode=disable&socketFactory=com.google.cloud.sql.postgres.SocketFactory&cloudSqlInstance=tamr-internal-scaletest:us-east1:brt2-test-3
- Change the setting of TAMR_PERSISTENCE_DB_URL by changing the "jdbc:postgresql://google" hostname to "jdbc:postgresql://localhost". For example:
tamr/utils/unify-admin.sh config:set 'TAMR_PERSISTENCE_DB_URL=jdbc:postgresql://localhost/doit?sslmode=disable&socketFactory=com.google.cloud.sql.postgres.SocketFactory&cloudSqlInstance=tamr-internal-scaletest:us-east1:brt2-test-3'
Note: Be sure to use single quotation marks (') around 'TAMR_PERSISTENCE_DB_URL=<new value>'. Otherwise, the shell interprets the & characters in the URL as command separators.
- Set the value of TAMR_STORAGE_DRIVER_DATA_STORE_BACKUP_WORKERS to a number greater than 0. This is the number of threads used for backup or restore operations on record data (datasets). One quarter to one half of the cores on the system is a reasonable range.
- Using your own login (that is, not the tamr functional user), stop the local PostgreSQL service:
sudo service postgresql stop
- Disable the local PostgreSQL service so that it doesn't start up with other dependencies:
sudo service postgresql disable
- Restart the destination Tamr Core instance and use the unify.log to verify that it started successfully and is connecting to the Cloud SQL service.
- Download the Cloud SQL proxy and then make it executable:
wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
chmod +x cloud_sql_proxy
- Start the Cloud SQL proxy, pointing it at the same instance as in your TAMR_PERSISTENCE_DB_URL:
./cloud_sql_proxy -instances=tamr-internal-scaletest:us-east1:brt2-test-3=tcp:5432 >proxy.log 2>&1 &
- Restore Tamr Core from backup by running POST /v1/instance/restore. See Initiate an asynchronous restore operation.
Note: The destination instance will now have the same username and password as the old instance. This can cause problems with some workflows.
- On the destination instance, verify that the values for the following configuration variables point to the new resources:
  - TAMR_BIGTABLE_CLUSTER_ID
  - TAMR_BIGTABLE_INSTANCE_ID
  - TAMR_FS_URI
  - TAMR_PERSISTENCE_DB_URL
  - TAMR_UNIFY_BACKUP_URI
  - TAMR_UNIFY_DATA_DIR
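Several steps in the procedure above involve deriving one value from another: rewriting the host in TAMR_PERSISTENCE_DB_URL, reusing its cloudSqlInstance value for the Cloud SQL proxy, and picking a worker count from the number of cores. The following sketch shows one way to script those derivations. The helper names are hypothetical, the URL is the example value from this page, and the 16-core figure is illustrative. The snippet only prints values and does not change any configuration.

```shell
# Swap the "google" host for "localhost" in a Cloud SQL JDBC URL.
to_localhost_url() {
  printf '%s\n' "$1" | sed 's#^jdbc:postgresql://google/#jdbc:postgresql://localhost/#'
}

# Print the value of the cloudSqlInstance query parameter (empty if absent).
cloud_sql_instance() {
  printf '%s\n' "$1" | sed -n 's#.*[?&]cloudSqlInstance=\([^&]*\).*#\1#p'
}

# Print "<low> <high>": one quarter and one half of the given core count.
worker_range() {
  cores=$1
  low=$((cores / 4))
  high=$((cores / 2))
  [ "$low" -ge 1 ] || low=1            # the setting must be greater than 0
  [ "$high" -ge "$low" ] || high=$low
  echo "$low $high"
}

# Single quotes keep the shell from interpreting the & characters.
url='jdbc:postgresql://google/doit?sslmode=disable&socketFactory=com.google.cloud.sql.postgres.SocketFactory&cloudSqlInstance=tamr-internal-scaletest:us-east1:brt2-test-3'

echo "config:set value: TAMR_PERSISTENCE_DB_URL=$(to_localhost_url "$url")"
echo "proxy argument:   -instances=$(cloud_sql_instance "$url")=tcp:5432"
echo "worker range for 16 cores: $(worker_range 16)"
```

On a Linux host, the core count passed to worker_range could come from the nproc command instead of a fixed number.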