Repartition
Primarily used for performance tuning, a REPARTITION statement manually sets the parallelism for transformation processing.
By default, transformations do not repartition data. If performance is slow, adding a REPARTITION statement can improve processing speed by giving Spark a user-defined parallelism value to use for subsequent calculations. Although a REPARTITION statement can be placed anywhere in a transformation script, it is most often inserted before a computationally intensive statement. For example, you can add a REPARTITION statement before a JOIN to make more efficient use of system resources and reduce processing time.
Important: This is an advanced feature. If you are familiar with the way data is partitioned on disk/HBase for your installation and understand the topology of your Spark cluster, you can use REPARTITION to manually repartition data to the level of parallelism that is appropriate for your system.
An example of a REPARTITION statement follows.
REPARTITION 50 BY PART_ID;
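To make the effect of such a statement more concrete, the following plain-Python sketch illustrates key-based repartitioning in general. It assumes that the statement above requests 50 partitions keyed on PART_ID, which is an interpretation rather than something this page states. The function name repartition_by_key, the sample rows, and the QTY column are hypothetical, and the CRC32 hash is only a stand-in: Spark uses its own hash function, so the actual row placement will differ.

```python
import zlib
from collections import defaultdict


def repartition_by_key(rows, num_partitions, key):
    """Group rows into at most num_partitions buckets by hashing row[key].

    Conceptual illustration only; this is not how the product or Spark
    implements repartitioning internally.
    """
    partitions = defaultdict(list)
    for row in rows:
        # Stable stand-in hash (assumption): Spark uses a different function.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return dict(partitions)


# Hypothetical sample data: 1000 rows spread over 200 distinct PART_ID values.
rows = [{"PART_ID": i % 200, "QTY": i} for i in range(1000)]

# Rough analogue of: REPARTITION 50 BY PART_ID;
parts = repartition_by_key(rows, 50, "PART_ID")

# Every row with the same PART_ID ends up in the same partition, and no
# more than 50 partitions can be processed in parallel.
for index in sorted(parts):
    print(index, len(parts[index]))
```

Because rows that share a key land in the same partition, a later JOIN on that key may need less data movement. For comparison, the PySpark DataFrame API exposes a similar operation as df.repartition(50, "PART_ID"); whether the REPARTITION statement maps to that call is not confirmed by this page.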