
Can anyone tell me how to optimize Spark jobs?

1 Answer


There are several well-known practices for optimizing the performance of Spark jobs:

  • Using Kryo serialization instead of the default Java serialization wherever possible, since Kryo is considerably faster and more compact
  • Broadcasting small, frequently used datasets to all executors once, rather than shipping them with the tasks of every stage
  • Avoiding user-defined functions (UDFs) in favor of built-in Spark SQL functions, which are predefined and often better optimized for querying use cases
  • Locating the data being processed as close to the computational nodes as possible (even within the nodes in certain cases) to improve data locality and minimize data movement
  • Enabling dynamic allocation so the number of executors scales up or down with the demands of the workload
  • Choosing data structures that ease garbage collection, e.g., arrays instead of linked lists
  • Assigning more cores per executor to achieve higher throughput; around five cores per executor is a common recommendation
  • Scaling up the number of executors to match the size of the workload
  • Increasing the memory per executor, to roughly 12 GB as a reasonable starting point
  • Tuning the number of partitions of the imported data to match the degree of parallelism the job is intended to achieve
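Several of the settings above map directly to standard Spark configuration properties. A minimal spark-defaults.conf sketch, assuming a workload where the rule-of-thumb numbers fit (the values shown are illustrative starting points, not universal defaults):

```
# spark-defaults.conf (illustrative values, tune per workload)
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled  true
spark.executor.cores             5
spark.executor.memory            12g
spark.sql.shuffle.partitions     200
```

The same properties can also be passed per job via spark-submit with --conf, which is often preferable while experimenting.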
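The sizing advice above (cores per executor, number of executors, and partition count) can be sanity-checked with simple arithmetic. A common rule of thumb is to aim for a partition count of a small multiple of the total core count so every core stays busy even when tasks finish unevenly; the helper below is a sketch of that heuristic, and its multiplier is an assumption, not a Spark default:

```python
# Back-of-the-envelope sizing for a Spark job (rule-of-thumb figures).

def suggested_partitions(num_executors: int,
                         cores_per_executor: int = 5,
                         multiplier: int = 3) -> int:
    """Total parallel task slots times a small multiplier, so that
    cores are not left idle when some tasks finish early."""
    total_cores = num_executors * cores_per_executor
    return total_cores * multiplier

# e.g. 10 executors x 5 cores = 50 task slots -> 150 partitions
print(suggested_partitions(10))  # -> 150
```

A value in this range can then be applied with df.repartition(n) or via spark.sql.shuffle.partitions.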

If you are looking for an online course to learn Spark, I recommend this Apache Spark Training program by Intellipaat.
