How to Clear and Reindex Elasticsearch

This article is relevant for Tamr versions 2019.023 and above.

Level of complexity: Complex (for advanced users)

Use case: Clearing and reindexing Elasticsearch (ES) ensures that the data in Elasticsearch has the correct content and structure and is in sync with what is stored in Postgres and/or HBase. These systems can fall out of sync when the data content or structure changes dramatically (for example, during an upgrade, or while troubleshooting complex issues).

Note: The only upgrade that requires reindexing everything is the one that moves ES to 6.8.2. Elasticsearch was upgraded to 6.8.2 in v2019.019.0, so for upgrade purposes you only need to clear and reindex ES if you are upgrading from a Tamr version older than v2019.019.0.

We recommend taking a Tamr backup before applying the steps below.

How-to-guide for clearing and reindexing Elasticsearch

Step 1. Stop Tamr (do not stop dependencies)

1.a If you want to wipe all ES data, run the following command to manually delete all ES indexes. This assumes that ES is running on the default port; if it is not, replace 9200 with the value of the TAMR_ES_API_PORT configuration variable:

curl -X DELETE localhost:9200/_all

Note: This command leaves the Tamr UI in an unusable state until ES is repopulated via the reindex APIs below. It only works on single-node ES clusters. It cannot be run in a cloud-native deployment (e.g. Tamr on GCP) where a shared ES cluster is used.
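Before deleting everything, it can help to see which indexes exist. The following is a minimal sketch, not an official Tamr script. It assumes ES is reachable on localhost and that you have copied the value of the TAMR_ES_API_PORT configuration variable into a shell variable of the same name (the helper function names are made up for illustration). It uses the standard Elasticsearch _cat/indices endpoint for the read-only listing.

```shell
#!/usr/bin/env bash

# Build the ES address, falling back to the default port 9200 when
# TAMR_ES_API_PORT is unset or empty. Assumption: ES runs on localhost.
es_url() {
  echo "localhost:${TAMR_ES_API_PORT:-9200}"
}

# Read-only: list the existing indexes so you can review what would be removed.
list_es_indexes() {
  curl -s "$(es_url)/_cat/indices?v"
}

# Destructive: delete every index. Run only with Tamr stopped and a backup taken.
delete_all_es_indexes() {
  curl -X DELETE "$(es_url)/_all"
}
```

Defining the calls as functions means nothing is deleted when the file is sourced; you would run list_es_indexes first and delete_all_es_indexes only after reviewing the output.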

1.b You can also clean up an individual project's ES index by supplying the specific index name to the delete command. The format of the index name is tamr_project_<id> (replace <id> with the actual numeric ID of the project). For example, if the project ID is 4, the command would be:

curl -X DELETE localhost:9200/tamr_project_4

If you choose to clean up ES on a per-project basis, then step 3 is to rerun the project's pipelines instead of calling the reindexing APIs.
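The per-project cleanup can be sketched as a small helper that derives the index name from the project ID, following the tamr_project_4 example above. This is an illustrative sketch only: the function names are hypothetical, and it assumes ES on localhost with the port taken from a TAMR_ES_API_PORT shell variable (default 9200).

```shell
#!/usr/bin/env bash

# Build the index name for a numeric project ID, e.g. 4 -> tamr_project_4.
# Rejects empty or non-numeric input so a typo cannot produce a bad index name.
project_index() {
  case "$1" in
    ''|*[!0-9]*)
      echo "project id must be numeric: '$1'" >&2
      return 1
      ;;
  esac
  echo "tamr_project_$1"
}

# Destructive: delete the ES index of a single project.
delete_project_index() {
  local index
  index="$(project_index "$1")" || return 1
  curl -X DELETE "localhost:${TAMR_ES_API_PORT:-9200}/${index}"
}
```

For example, delete_project_index 4 issues the same request as the command shown in step 1.b.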

Step 2. Start Tamr

Step 3. Run the following two reindexing APIs to repopulate Elasticsearch

Note: Ensure that you run ‘reindex/all-datascale’ first and ‘reindex/all-humanscale’ second.

Order is important: The ‘reindex/all-datascale’ API triggers a set of jobs in the Tamr UI. Trigger the ‘reindex/all-humanscale’ API only after all of those jobs have completed successfully. This ordering matters for v2020.004 and earlier versions.
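The ordering can be sketched as a script that pauses between the two calls. Several details here are assumptions rather than facts from this article: the base URL (TAMR_API_BASE), the credentials header (TAMR_AUTH_HEADER), and the use of POST are hypothetical placeholders, so check the API documentation for your Tamr version before using anything like this.

```shell
#!/usr/bin/env bash

# Hypothetical placeholder: the base URL under which your Tamr instance serves
# the reindex endpoints. It is not given in this article; set it yourself.
TAMR_API_BASE="${TAMR_API_BASE:-http://localhost:PORT/PREFIX}"

# Join the base URL (minus any trailing slash) with the reindex path.
reindex_url() {
  echo "${TAMR_API_BASE%/}/reindex/$1"
}

# Assumption: the endpoint is triggered with POST and needs credentials,
# supplied here through the hypothetical TAMR_AUTH_HEADER variable.
run_reindex() {
  curl -f -X POST -H "Authorization: ${TAMR_AUTH_HEADER}" "$(reindex_url "$1")"
}

# Run datascale first, wait for a manual confirmation, then run humanscale.
reindex_in_order() {
  run_reindex all-datascale || return 1
  # The call only triggers jobs; confirm in the Tamr UI that they all succeeded.
  read -r -p "Press Enter once all datascale jobs have completed successfully... "
  run_reindex all-humanscale
}
```

The manual pause is deliberate: the article only says that job completion is visible in the Tamr UI, so the sketch does not try to poll for it.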

Step 4. Update Golden Records

For all Golden Records projects, follow the instructions to update golden records.