Maintaining System and Tamr Software Health and Performance

The following list suggests ways to improve system and Tamr software performance on both the back end and the front end. These are considered best practices for administering Tamr software.

  1. As with most systems and software, regularly restarting Tamr and its dependencies helps clear memory, which improves back-end performance. Likewise, most operating systems are designed to be restarted periodically, so occasionally restarting the system that runs the Tamr software can also help system performance.

  2. The Tamr home page checks all datasets across projects to ensure that all projects are up to date. If you have a large number of datasets that are no longer in use, removing them can improve performance. Removing a dataset also removes it from HBase, which can reduce disk space usage. Likewise, deleting projects that are no longer in use can improve page load performance, because the health of those projects no longer needs to be checked. See the steps for deleting datasets and projects.

  3. Periodically deleting backups that are no longer needed can reduce disk space usage. If you still need the backups, you can zip (or otherwise compress) the backup directories to reduce their size. You can find the backup location by checking the TAMR_UNIFY_BACKUP_URI variable. For details on the commands, see How To Free Up Disk Space.

  4. By default, Tamr stores logs for 30 days, as configured by the TAMR_LOG_RETENTION_DAYS variable. You can set a smaller number of days for retaining logs. If you prefer to keep 30 days of logs, make sure to regularly delete old logs or move them to a long-term storage solution. Tamr does not clean out the logs directory, so that a full history of usage remains available for auditing. This means the user is responsible for deciding when a log is no longer needed locally. See the How To Free Up Disk Space article for a list of other items that can be deleted besides Tamr logs, along with the commands to delete them.

  5. The current dataset table in Postgres is used frequently. If you are experiencing general system slowness, we have found that creating an index on the datasets table can improve front-end performance. This index already exists in version 2019.012.0 and later, but you can apply it in earlier versions as follows:

    create index b on dataset.dataset_ns_current USING gin ((data -> 'upstreamDatasets') jsonb_path_ops);

  6. If you have a large turnover of data (many deletions of old records, or of projects and datasets that are no longer needed), cleaning old documents from Elasticsearch can result in major front-end performance improvements. In Tamr version 2019.015.1 and later, you can do this using the following API endpoint:

Otherwise, in older versions (v2019.014.1 and earlier), you can do this with a curl command sent directly to Elasticsearch:

`curl -XPOST 'localhost:9200/_all/_forcemerge?only_expunge_deletes=true'`

  7. Use the dataset cleanup utility: you can use the CleanupIncompletelyDeletedProjects maintenance utility to convert derived datasets for deleted projects into source datasets. Tamr administrators can then either delete these datasets or materialize them for use in other projects in the Tamr UI. For more information, see Dataset Cleanup.
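
The backup and log housekeeping described in items 3 and 4 can be sketched in shell. This is only an illustration under stated assumptions, not a Tamr utility: it assumes the backups and logs live on a local filesystem (TAMR_UNIFY_BACKUP_URI may point elsewhere in your deployment), and the function names `list_old_logs` and `compress_backup` are hypothetical. The first function only lists candidates and deletes nothing. The second leaves the original backup directory in place so that you can check the archive before removing anything.

```shell
# Hypothetical helpers for the disk housekeeping in items 3 and 4.
# All paths are supplied by the caller; nothing here is a Tamr command.

# Print regular files under a log directory that are older than <days>
# full days (find's -mtime +N rounds a file's age down to whole days).
# This only lists candidates; it does not delete or move anything.
list_old_logs() {
  local log_dir="$1"
  local days="${2:-30}"
  find "$log_dir" -type f -mtime +"$days" -print
}

# Pack one backup directory into <dir>.tar.gz next to it, check that the
# archive can be read back, and print the archive path on success.
# The original directory is left untouched.
compress_backup() {
  local backup_dir="${1%/}"            # drop a trailing slash, if any
  local archive="${backup_dir}.tar.gz"
  tar -czf "$archive" -C "$(dirname "$backup_dir")" "$(basename "$backup_dir")" &&
    tar -tzf "$archive" > /dev/null &&
    printf '%s\n' "$archive"
}
```

For example, `list_old_logs /path/to/tamr/logs 30` prints the log files that are candidates for deletion or archiving, and `compress_backup /path/to/backups/some-backup` writes `/path/to/backups/some-backup.tar.gz`. Both paths are placeholders. Use the actual log directory and the location reported by TAMR_UNIFY_BACKUP_URI on your system.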