KeyValue Size Too Large Error

Problem: The error "KeyValue size too large" appears when running either an update pairs job or a publish clusters job.

Log Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in stage 2.0 failed 4 times, most recent failure: Lost task 162.3 in stage 2.0 (TID 234, tamr-tamr-dev.c.tamr-cus-staples.internal, executor 1): java.lang.IllegalArgumentException: KeyValue size too large

	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$
Caused by: java.lang.IllegalArgumentException: KeyValue size too large
	at org.apache.hadoop.hbase.client.ConnectionUtils.validatePut(

This error may occur if, after using either the merge transformation or pre-group-by, a record ends up with a long list of values or an array value with too many elements.


Option 1: Reduce the number of elements in any arrays to at most 25 values using either of the following:

  • If you are using the merge transformation,
    add a MultiFormula (with all columns selected) such as array.slice2($COL, 0, 25) AS $COL.
  • If you are using pre-group-by, replace collect_set with collect_subset (k=25).
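The effect of both fixes is the same: keep at most the first 25 elements of each array-valued attribute. A minimal sketch in plain Python of what the cap does (an illustration, not Tamr's implementation):

```python
def cap_array(values, k=25):
    """Keep at most the first k elements of an array-valued attribute."""
    return values[:k]

# A record whose array has grown too large after a merge/group-by
record = {"names": [f"name_{i}" for i in range(100)]}
record["names"] = cap_array(record["names"])
print(len(record["names"]))  # 25
```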

Option 2: Identify the large key (record) that is causing the issue.
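One way to locate the offending records is to scan the source data and flag any record whose serialized size exceeds the HBase default limit of 10485760 bytes (10 MB). A minimal sketch in plain Python, assuming records are available as dictionaries with an "id" key (the field names here are hypothetical):

```python
import json

# HBase default KeyValue size limit: 10 MB
MAX_KEYVALUE_BYTES = 10485760

def oversized_records(records, limit=MAX_KEYVALUE_BYTES):
    """Yield (record_id, size_in_bytes) for records whose serialized
    size exceeds the limit."""
    for rec in records:
        size = len(json.dumps(rec).encode("utf-8"))
        if size > limit:
            yield rec.get("id"), size

# Example: a record with a huge array value trips the check
records = [
    {"id": "ok", "values": ["a"] * 10},
    {"id": "too_big", "values": ["x" * 100] * 200000},  # ~20 MB serialized
]
print([rid for rid, _ in oversized_records(records)])  # ['too_big']
```

On a real dataset this check would run as a Spark job rather than a local loop, but the size test per record is the same.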

Workarounds to increase the threshold (if the large key values cannot be found):

The KeyValue size limit is configurable and can be disabled (by setting the value to 0). It must be set on both the client and the server.

Client configuration

TAMR_HBASE_EXTRA_CONFIG: {"hbase.client.keyvalue.maxsize": "10485760"}

Server configuration

For a single-node deployment, add hbase.server.keyvalue.maxsize to the hbase-site.xml.j2 configuration file.
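The server-side property goes alongside the existing properties in that file. A sketch of the entry, using 10485760 bytes (10 MB, the HBase default) as the value — raise it to match the client setting, or set it to 0 to disable the check:

```xml
<property>
  <name>hbase.server.keyvalue.maxsize</name>
  <value>10485760</value>
</property>
```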


Restart Tamr and its dependencies for the changes to take effect. Note that this configuration file is overwritten on upgrades, so the change must be reapplied after upgrading.