What to Do When HBase Region Servers Are Unhealthy

Problem: HBase loses its connection to ZooKeeper while running a job.

Most common error message:

2021-10-22 20:00:50,636 WARN  [main] zookeeper.ZKUtil: clean znode for master0x0, quorum=localhost:2181, baseZNode=/hbase Unable to get data of znode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Cause: The most common cause is under-provisioned nodes, which leads HBase to lose its connection to ZooKeeper.

Resolution: The recommended option is to deploy a cloud-native install of Tamr. See the public docs for more information about cloud-native deployments, and reach out to your Tamr point of contact or [email protected] to discuss options.

Alternative resolution: Increase HBase's ZooKeeper timeout settings

If a cloud-native deployment is not an option, the first alternative resolution is to increase HBase's ZooKeeper timeout settings.

You can increase these settings by adding the snippet below to ${TAMR_HOME}/hbase-1.3.1/conf/hbase-site.xml.j2.

Some values in this file are hardcoded, and others are rendered from template variables that come from the Tamr configuration.

Depending on your version of Tamr, some of these properties may already be present but hardcoded to a different value, or they may be missing from the file entirely. In the former case, update the value to match what is shown below; in the latter case, add the entire <property>...</property> block to the file.

    <property>
        <name>hbase.rpc.timeout</name>
        <value>600000</value>
    </property>
    <property>
        <name>hbase.client.scanner.timeout.period</name>
        <value>600000</value>
    </property>
    <property>
        <name>zookeeper.session.timeout</name>
        <value>2400000</value>
    </property>
    <property>
        <name>hbase.hstore.blockingStoreFiles</name>
        <value>200</value>
    </property>
    <property>
        <name>zookeeper.session.timeout</name>
        <value>600000</value>
    </property>
    <property>
        <name>zookeeper.recovery.retry</name>
        <value>5</value>
    </property>

Two things to note when implementing this:

  • Ensure you are editing the Jinja template file (the one with the .j2 extension) and not hbase-site.xml directly. On startup, Tamr renders the latter file from the former.
  • On upgrade, these Jinja template files are overwritten, so store a copy of your customizations elsewhere and reconcile them with the latest Jinja template file provided in the version to which you are upgrading (a sketch of this backup-and-reconcile workflow follows this list).
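For example, a minimal backup-and-reconcile sketch, assuming the default /opt/tamr install path and an arbitrary backup location (both paths are assumptions; adjust for your environment):

#!/bin/bash
# Sketch only: the install path and backup location below are assumptions.
TEMPLATE=/opt/tamr/hbase-1.3.1/conf/hbase-site.xml.j2
BACKUP=/opt/tamr/backups/hbase-site.xml.j2.bak

# Before the upgrade: keep a copy of the customized template outside the install tree.
mkdir -p "$(dirname "${BACKUP}")"
cp "${TEMPLATE}" "${BACKUP}"

# After the upgrade: compare the new template with the backup, then manually
# re-apply the timeout properties that the upgrade overwrote.
diff "${BACKUP}" "${TEMPLATE}"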

Solution of last resort: Set up a second region server

If increasing the ZooKeeper timeouts did not resolve the issue, try setting up a second region server:

Step 1. Update HBase to use the jemalloc library instead of malloc.

  • Follow the jemalloc installation guide for your operating system.
  • To verify whether the library is installed: ldconfig -p | grep jemal (see the sketch after this list).
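A minimal install-and-verify sketch, assuming common distribution package names (the package names are assumptions; check your OS repositories):

#!/bin/bash
# Sketch only: package names vary by distribution and are assumptions here.
# Debian/Ubuntu:
sudo apt-get install -y libjemalloc-dev
# RHEL/CentOS (jemalloc is typically available via the EPEL repository):
# sudo yum install -y jemalloc

# Confirm the library is visible to the dynamic linker; expect output such as
# "libjemalloc.so.2 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libjemalloc.so.2".
ldconfig -p | grep jemal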

Step 2. Create a second region server in HBase (with 5 GB of memory each) so that, even if one region server dies while a job is running, the job can pick up the second region server. This can be done by replacing tamr-start-hbase.sh.j2 with the changes below.

#! /bin/bash

chmod +x /opt/tamr/hbase-1.3.1/bin/*

/opt/tamr/hbase-1.3.1/bin/hbase-daemon.sh --config "/opt/tamr/hbase-1.3.1/conf" start zookeeper
/opt/tamr/hbase-1.3.1/bin/hbase-daemon.sh --config "/opt/tamr/hbase-1.3.1/conf" start master
# HBASE_IDENT_STRING gets overridden with $USER-$DN in local-regionservers.sh.
# This env var must be kept in sync with the script that stops HBase.
USER="tamr-local" /opt/tamr/hbase-1.3.1/bin/local-regionservers.sh start 1
USER="tamr-local" /opt/tamr/hbase-1.3.1/bin/local-regionservers.sh start 2

Step 3. Add the following to hbase-site.xml.j2 so that Spark retries are long enough to switch over to the second region server.

    <property>
        <name>hbase.client.retries.number</name>
        <value>15</value>
    </property>
    <property>
        <name>hbase.client.pause</name>
        <value>60000</value>
    </property>

Step 4. Update the configuration variables below using unify-admin.sh (a sketch of the command follows the list).

TAMR_HBASE_NUMBER_OF_REGIONS: "10"
TAMR_HBASE_NUMBER_OF_SALT_VALUES: "300"
TAMR_FS_EXTRA_URIS: "/opt/tamr/hbase-1.3.1/conf/hbase-site.xml"
TAMR_HBASE_REGION_SERVER_MEM: "5G"
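A minimal sketch of applying these values, assuming the standard install path and the config:set sub-command of unify-admin.sh (the exact path and invocation may differ by Tamr version):

#!/bin/bash
# Sketch only: path and invocation are assumptions; verify against your Tamr version.
TAMR_HOME=/opt/tamr
"${TAMR_HOME}/tamr/utils/unify-admin.sh" config:set \
    TAMR_HBASE_NUMBER_OF_REGIONS=10 \
    TAMR_HBASE_NUMBER_OF_SALT_VALUES=300 \
    TAMR_FS_EXTRA_URIS=/opt/tamr/hbase-1.3.1/conf/hbase-site.xml \
    TAMR_HBASE_REGION_SERVER_MEM=5G
# Restart Tamr and its dependencies afterwards so the new values take effect.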

Step 5. Create a cron job that checks the HBase region servers at a scheduled interval and restarts the servers in case they are down.
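A minimal sketch of such a check, assuming a hypothetical script at /opt/tamr/check-regionservers.sh and the two local region servers started above (the process check is a simple heuristic, not Tamr tooling):

#!/bin/bash
# Hypothetical /opt/tamr/check-regionservers.sh, run from cron.
# Counts running HRegionServer processes and restarts the local region servers
# if fewer than two are up; "start" is effectively a no-op for one already running.
RUNNING=$(pgrep -f HRegionServer | wc -l)
if [ "${RUNNING}" -lt 2 ]; then
    USER="tamr-local" /opt/tamr/hbase-1.3.1/bin/local-regionservers.sh start 1
    USER="tamr-local" /opt/tamr/hbase-1.3.1/bin/local-regionservers.sh start 2
fi

A crontab entry such as */5 * * * * /opt/tamr/check-regionservers.sh would run the check every five minutes.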