Repartition

Primarily used for performance tuning, a REPARTITION statement manually sets parallelism for transformation processing.


Important: This is an advanced feature. If you are familiar with how data is partitioned on disk or in HBase for your Tamr installation, and you understand the topology of your Spark cluster, you can manually repartition the data to a level of parallelism that is appropriate for your system.

By default, transformations do not repartition data. If you are experiencing slow performance, adding a REPARTITION statement can improve processing speed by giving Spark a user-defined parallelism value to use in subsequent calculations. Although a REPARTITION statement can be placed anywhere in a transformation script, it is most often inserted before a computationally intensive statement. For example, you can add a REPARTITION statement before a JOIN to make more efficient use of system resources and reduce processing time.

REPARTITION 50 BY PART_ID;
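
To build intuition for what a statement like the one above asks for, the following Python sketch models repartitioning as hash partitioning: each row is assigned to one of a fixed number of buckets according to a hash of its key column. This is a conceptual illustration only, not how Tamr or Spark implements it. The function name repartition, the sample rows, and the use of crc32 as the hash are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def repartition(rows, num_partitions, key):
    """Group rows into at most num_partitions buckets by a stable hash of row[key].

    Hypothetical helper for illustration; not part of Tamr or Spark.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions = defaultdict(list)
    for row in rows:
        # crc32 stands in for the engine's hash function (an assumption).
        # It is used here because it is stable across Python processes.
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return dict(partitions)


# Hypothetical sample data: 1000 rows with only 7 distinct PART_ID values.
rows = [{"PART_ID": i % 7, "value": i} for i in range(1000)]
parts = repartition(rows, 50, "PART_ID")

# Rows that share a key always land in the same bucket, so the number of
# non-empty buckets cannot exceed the number of distinct key values.
print(len(parts), "non-empty partitions out of 50 requested")
```

In this model, asking for 50 partitions on a column with only 7 distinct values leaves most partitions empty. If your system behaves similarly, the number of distinct values in the partitioning column is worth checking alongside your cluster size when you choose a parallelism value.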

