Repartition
Primarily used for performance tuning, a REPARTITION statement manually sets parallelism for transformation processing.
Important: This is an advanced feature. If you are familiar with how data is partitioned on disk or in HBase for your Tamr installation and understand the topology of your Spark cluster, you can manually repartition the data to the level of parallelism that is appropriate for your system.
By default, transformations do not repartition data. If you are experiencing slow performance, adding a REPARTITION statement can improve processing speed by providing a user-defined parallelism value for Spark to use when performing subsequent calculations. While a REPARTITION statement can be placed anywhere in a transformation script, it is most often inserted before a computationally intensive statement. You can, for example, add a REPARTITION statement before a JOIN to make more efficient use of system resources and reduce processing time.
REPARTITION 50 BY PART_ID;
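To build intuition for what a statement like the one above asks for, the following is a minimal conceptual sketch in plain Python. It is not Tamr or Spark code. It assumes that repartitioning by a column behaves like hash partitioning, where each row is sent to one of N partitions based on a hash of its key. The `repartition` function and the sample rows are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function the engine actually uses.

```python
import zlib
from collections import defaultdict


def repartition(rows, num_partitions, key):
    """Group rows into at most num_partitions buckets by hashing row[key].

    Conceptual model only (assumption): a stand-in for engine-side
    hash partitioning, not Tamr's or Spark's actual implementation.
    """
    partitions = defaultdict(list)
    for row in rows:
        # crc32 is stable across runs, unlike Python's salted str hash().
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return dict(partitions)


# Hypothetical sample data: 1000 rows with only 7 distinct PART_ID values.
rows = [{"PART_ID": i % 7, "value": i} for i in range(1000)]
parts = repartition(rows, 50, "PART_ID")

# Rows that share a key always land in the same partition, so the number
# of non-empty partitions cannot exceed the number of distinct keys.
non_empty = len(parts)
```

In this model, asking for 50 partitions on a column with only 7 distinct values leaves most partitions empty, which is one reason the choice of column matters as much as the number when tuning parallelism.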