If you are experiencing slow performance or problems when migrating datasets between environments that use different versions of Spark, using CHECKPOINT statements can improve system performance.
Important: This is an advanced feature. Using CHECKPOINT incorrectly can decrease system performance instead of improving it. If you are not experiencing performance delays, there is no need to add CHECKPOINT statements.
CHECKPOINT breaks a series of transformations into more manageable chunks. Instead of asking the Tamr Core transformation service to remember all of the transformations you are trying to complete at once, CHECKPOINT tells it to work on a set of transformations and cache the results before moving on to the next set of transformations.
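The caching behavior described above can be illustrated with a small, self-contained Python sketch. This is not Tamr Core or Spark code: the LazyDataset class and the upper_name, add_len, and keep_long functions are hypothetical names invented for this illustration. The sketch assumes a simplified model in which transformations are only recorded until results are requested, and a checkpoint runs the recorded transformations once and keeps the resulting rows.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Transform = Callable[[List[Row]], List[Row]]


class LazyDataset:
    """Toy model of a lazily evaluated dataset (hypothetical, not a Tamr or Spark API)."""

    def __init__(self, source: List[Row]) -> None:
        self.base: List[Row] = source       # most recently materialized rows
        self.pending: List[Transform] = []  # transformations recorded but not yet cached
        self.applied = 0                    # how many transformation runs have happened

    def transform(self, fn: Transform) -> "LazyDataset":
        # Record the transformation; nothing is computed yet.
        self.pending.append(fn)
        return self

    def _run(self) -> List[Row]:
        # Replay every pending transformation on top of the cached rows.
        rows = self.base
        for fn in self.pending:
            rows = fn(rows)
            self.applied += 1
        return rows

    def checkpoint(self) -> "LazyDataset":
        # Compute the pending chunk once, cache the rows, and forget the chunk.
        self.base = self._run()
        self.pending = []
        return self

    def collect(self) -> List[Row]:
        # Only transformations recorded after the last checkpoint are replayed.
        return self._run()


def upper_name(rows: List[Row]) -> List[Row]:
    return [{**r, "name": str(r["name"]).upper()} for r in rows]


def add_len(rows: List[Row]) -> List[Row]:
    return [{**r, "name_len": len(str(r["name"]))} for r in rows]


def keep_long(rows: List[Row]) -> List[Row]:
    return [r for r in rows if r["name_len"] > 3]


source = [{"name": "ada"}, {"name": "grace"}]

ds = LazyDataset(source).transform(upper_name).transform(add_len)
ds.checkpoint()            # first chunk runs once and its rows are cached
ds.transform(keep_long)    # second chunk starts from the cached rows
result = ds.collect()
result_again = ds.collect()
```

In this toy model, each call to collect replays only keep_long, because the first two transformations were cached at the checkpoint. Without the checkpoint, every collect would replay all three. The same trade-off suggests why a badly placed checkpoint can hurt: materializing results has its own cost, which is only repaid when it saves later work.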
Add CHECKPOINT statements between logical chunks of transformation work. Choosing where to add checkpoints takes experience and experimentation. The script to add a checkpoint looks like this:
When you add a CHECKPOINT, you have the option to include a HINT to specify the Spark store behavior as either checkpoint.reliable (the default) or checkpoint.local. See Statement Modifiers.
Note: Depending on the setup of the underlying Spark cluster, checkpointing to a local store can result in better performance. Use this HINT value with caution and consult with Tamr Support at [email protected].
To include a HINT in a CHECKPOINT statement, use the following syntax: