
Can anyone tell me how to optimize Spark jobs?

1 Answer


There are several well-known practices for optimizing the performance of Spark jobs:

  • Using Kryo serialization instead of the default Java serialization wherever possible, since Kryo is considerably faster and more compact
  • Broadcasting small, frequently used datasets to all executors once, rather than shipping them with the tasks of every stage
  • Avoiding user-defined functions (UDFs) in favor of built-in Spark SQL functions, which are predefined and often better optimized for querying use cases
  • Locating the data being processed as close to the computational nodes as possible (even within the nodes in certain cases) to improve data locality and minimize data movement
  • Enabling dynamic allocation so the number of executors scales up or down with the demands of the workload
  • Choosing data structures that ease garbage collection, e.g., arrays instead of linked lists
  • Assigning more cores per executor to achieve higher throughput; around five cores per executor is a common recommendation
  • Scaling up the number of executors to match the size of the workload
  • Increasing the memory per executor, to roughly 12 GB as a reasonable starting point
  • Tuning the number of partitions of the imported data to match the degree of parallelism the job is intended to achieve
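Several of the settings above map directly to standard Spark configuration properties. A minimal spark-defaults.conf sketch, assuming a workload where the rule-of-thumb numbers fit (the values shown are illustrative starting points, not universal defaults):

```
# spark-defaults.conf (illustrative values, tune per workload)
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled  true
spark.executor.cores             5
spark.executor.memory            12g
spark.sql.shuffle.partitions     200
```

The same properties can also be passed per job via spark-submit with --conf, which is often preferable while experimenting.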
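The sizing advice above (cores per executor, number of executors, and partition count) can be sanity-checked with simple arithmetic. A common rule of thumb is to aim for a partition count of a small multiple of the total core count so every core stays busy even when tasks finish unevenly; the helper below is a sketch of that heuristic, and its multiplier is an assumption, not a Spark default:

```python
# Back-of-the-envelope sizing for a Spark job (rule-of-thumb figures).

def suggested_partitions(num_executors: int,
                         cores_per_executor: int = 5,
                         multiplier: int = 3) -> int:
    """Total parallel task slots times a small multiplier, so that
    cores are not left idle when some tasks finish early."""
    total_cores = num_executors * cores_per_executor
    return total_cores * multiplier

# e.g. 10 executors x 5 cores = 50 task slots -> 150 partitions
print(suggested_partitions(10))  # -> 150
```

A value in this range can then be applied with df.repartition(n) or via spark.sql.shuffle.partitions.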

If you are looking for an online course to learn Spark, I recommend this Apache Spark Training program by Intellipaat.
