By default, transformations do not repartition data. If you are experiencing slow performance, adding a REPARTITION statement can improve processing speed by giving Spark a user-defined parallelism value to use for subsequent calculations. A REPARTITION statement can be placed anywhere in a transformation script, but it is most often inserted before a computationally intensive statement. For example, you can add a REPARTITION statement before a JOIN to make more efficient use of system resources and reduce processing time.
Important: This is an advanced feature. If you are familiar with how data is partitioned on disk or in HBase for your installation and understand the topology of your Spark cluster, you can use REPARTITION to manually repartition data to a degree of parallelism that is appropriate for your system.
An example of a REPARTITION statement follows.
REPARTITION 50 BY PART_ID;
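
To illustrate what a statement like the one above asks for, here is a minimal plain-Python sketch of hash partitioning, assuming that REPARTITION 50 BY PART_ID distributes rows into 50 partitions keyed on PART_ID (roughly what Spark's DataFrame `repartition(50, "PART_ID")` does). The `repartition` helper, the sample rows, and the use of CRC32 as the hash are all hypothetical stand-ins for illustration; this is not the product's or Spark's actual implementation.

```python
import zlib
from collections import defaultdict


def repartition(rows, num_partitions, key):
    """Group rows into at most num_partitions buckets by hashing row[key].

    Hypothetical helper: CRC32 is used only because it is deterministic;
    Spark uses its own hash function internally.
    """
    partitions = defaultdict(list)
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return dict(partitions)


# Made-up sample data: 1000 rows spread over 200 distinct PART_ID values.
rows = [{"PART_ID": i % 200, "QTY": i} for i in range(1000)]

# Analogous in spirit to: REPARTITION 50 BY PART_ID;
parts = repartition(rows, 50, "PART_ID")

# Rows that share a PART_ID always land in the same partition, which is
# why repartitioning on the join key before a JOIN can reduce data movement.
for bucket, bucket_rows in sorted(parts.items())[:3]:
    print(bucket, len(bucket_rows))
```

The number of partitions caps how many tasks can work on the data at once, so a value far below the cluster's available cores underuses it, while a value far above it adds scheduling overhead. That trade-off is why the value should be chosen with the cluster topology in mind.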