How to Clear and Reindex Elasticsearch?

This article is relevant for Tamr versions 2019.023 and above.

Level of complexity: Complex (for advanced users)

Use case: Clearing and reindexing Elasticsearch (ES) ensures that the data in Elasticsearch has the correct content and structure and is in sync with what is in Postgres and/or HBase. These systems can fall out of sync when the data content or structure changes dramatically (for example, during an upgrade or while troubleshooting complex issues).

Note: If you are unsure whether or not clearing out ES is applicable to your instance, reach out to [email protected] before proceeding.

We recommend taking a Tamr backup before applying the resolution below.

How-to-guide for clearing and reindexing Elasticsearch

Step 1. Stop Tamr (do not stop dependencies)

1.a To wipe all ES data, run the following command to manually delete all ES indexes. This assumes that ES is running on the default port; if it is not, replace 9200 with the value of the TAMR_ES_API_PORT configuration variable:

curl -X DELETE localhost:9200/_all

Note: This command leaves the Tamr UI in an unusable state until ES is repopulated via the reindex APIs below. It only works on single-node ES clusters and cannot be run in a cloud-native deployment (e.g. Tamr on GCP), where a shared ES cluster is used.
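
Before wiping everything, it can help to list the indexes that exist. The following is a minimal sketch, assuming a single-node ES reachable on localhost. ES_PORT, DRY_RUN and the helper functions are hypothetical names introduced here for illustration, not Tamr settings. By default the script only prints the curl commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only. ES_PORT and DRY_RUN are illustrative variables, not Tamr settings.
ES_PORT="${ES_PORT:-9200}"   # use the value of TAMR_ES_API_PORT if it differs
DRY_RUN="${DRY_RUN:-1}"      # 1 = print commands only, 0 = execute them

# Build a URL for the local ES node.
es_url() {
  printf 'http://localhost:%s/%s' "$ES_PORT" "$1"
}

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. List the existing indexes so you know what will be removed.
run curl -s "$(es_url '_cat/indices?v')"

# 2. Delete all indexes (the same request as the command above).
run curl -X DELETE "$(es_url '_all')"
```

Printing the commands first is a deliberate choice for a destructive operation: you can confirm the port and target before anything is deleted.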

1.b You can also clean up the ES index of an individual project by supplying the specific index name to the delete command. The index name has the format tamr_project_<id>, where <id> is the numeric ID of the project. For example, if the project ID is 4, the command is:

curl -X DELETE localhost:9200/tamr_project_4

If you clean up ES on a per-project basis, then in step 3 rerun the pipelines of the affected projects instead of calling the reindexing APIs.
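
When several projects need cleaning, the per-project delete can be wrapped in a small helper. This is a sketch under the same assumptions as the command above (ES on localhost, default port 9200). ES_PORT, DRY_RUN, the function names and the project IDs in the loop are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Sketch only. ES_PORT, DRY_RUN and the helper names are illustrative.
ES_PORT="${ES_PORT:-9200}"   # use the value of TAMR_ES_API_PORT if it differs
DRY_RUN="${DRY_RUN:-1}"      # 1 = print commands only, 0 = execute them

# Map a numeric project ID to its index name, e.g. 4 -> tamr_project_4.
project_index() {
  case "$1" in
    ''|*[!0-9]*) echo "project ID must be numeric: '$1'" >&2; return 1 ;;
    *) printf 'tamr_project_%s' "$1" ;;
  esac
}

# Delete the index of one project (prints the command when DRY_RUN=1).
delete_project_index() {
  index="$(project_index "$1")" || return 1
  url="http://localhost:${ES_PORT}/${index}"
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl -X DELETE $url"
  else
    curl -X DELETE "$url"
  fi
}

# Example: clean up projects 4 and 7 (hypothetical IDs).
for id in 4 7; do
  delete_project_index "$id"
done
```

Rejecting non-numeric IDs guards against accidentally passing a wildcard or _all to the delete request.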

Step 2. Start Tamr

Step 3. Run the following two reindexing APIs to repopulate Elasticsearch

Note: Ensure that you run ‘reindex/all-datascale’ first and ‘reindex/all-humanscale’ second.

Order is important: The ‘reindex/all-datascale’ API triggers a set of jobs on the Tamr UI. Trigger the ‘reindex/all-humanscale’ API only after all of these jobs have completed successfully. This order is important for versions v2020.004 and earlier.
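
One way to respect this ordering is to run each stage as a separate invocation and check the jobs in the Tamr UI in between. The sketch below is illustrative only: TAMR_API_BASE is a placeholder that you must replace with the base URL of the reindex APIs on your instance, and the POST method and the absence of authentication options are assumptions to verify against the Tamr API documentation. By default it only prints the request.

```shell
#!/bin/sh
# Sketch only. TAMR_API_BASE is a placeholder, not a real Tamr URL.
# The POST method and the missing auth options are assumptions.
TAMR_API_BASE="${TAMR_API_BASE:-http://tamr.example.com/api}"
DRY_RUN="${DRY_RUN:-1}"      # 1 = print the request only, 0 = send it

# Trigger one reindex stage: "datascale" or "humanscale".
reindex_stage() {
  case "$1" in
    datascale|humanscale) ;;
    *) echo "usage: reindex_stage datascale|humanscale" >&2; return 1 ;;
  esac
  url="${TAMR_API_BASE}/reindex/all-$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl -X POST $url"
  else
    curl -X POST "$url"
  fi
}

# Run one stage per invocation: first "datascale", then, once all of its
# jobs have completed successfully in the Tamr UI, "humanscale".
if [ $# -gt 0 ]; then
  reindex_stage "$1"
fi
```

Saved under a hypothetical name such as reindex.sh, this would be run as `sh reindex.sh datascale`, followed by `sh reindex.sh humanscale` once the jobs have finished.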

