in Big Data Hadoop & Spark by (11.4k points)

I am running a Spark Streaming application with 2 workers. The application has a join and a union operation.

All the batches are completing successfully, but I noticed that the shuffle spill metrics are not consistent with the input or output data size (the spill memory is more than 20 times larger).

Please find the Spark stage details in the image below: [screenshot of the Spark UI stage metrics, not reproduced here]

After researching this, I found that:

Shuffle spill happens when there is not sufficient memory for shuffle data.

Shuffle spill (memory) - size of the deserialized form of the data in memory at the time of spilling

Shuffle spill (disk) - size of the serialized form of the data on disk after spilling

Since the deserialized form of the data occupies more space than the serialized form, shuffle spill (memory) is larger.

I also noticed that this spill (memory) size becomes incredibly large with big input data.

How can I optimize this spilling, both to memory and to disk?

1 Answer

by (32.3k points)

Spark 1.4 has better diagnostics and visualization in the web UI, which can help you here.

In summary, you spill when the size of the RDD partitions at the end of the stage exceeds the amount of memory available for the shuffle buffer.

You can:

  • Try to produce smaller partitions from the input by calling repartition() manually with a higher partition count, so each task processes less data.

  • Increase the memory of your executor processes (spark.executor.memory), so that the shuffle buffer gets correspondingly larger.

  • Increase the shuffle buffer by raising the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2; you will have to give back a corresponding amount from spark.storage.memoryFraction.

  • Reduce the ratio of worker threads (SPARK_WORKER_CORES) to executor memory in order to increase the shuffle buffer available per thread (see the configuration sketch below).
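
As a concrete illustration, here is a minimal sketch that combines these suggestions, assuming Spark 1.x with the legacy memory manager. The input paths, key extraction, memory sizes, and the partition count of 400 are hypothetical placeholders, not values taken from the question:

    // A minimal sketch, assuming Spark 1.x with the legacy memory manager.
    // Paths, keys, memory sizes and partition counts below are placeholders;
    // tune them for your own data and cluster.
    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleSpillTuning {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("shuffle-spill-tuning")
          // More executor memory means a larger absolute shuffle buffer.
          .set("spark.executor.memory", "4g")
          // Give the shuffle a bigger slice of executor memory (default 0.2)...
          .set("spark.shuffle.memoryFraction", "0.4")
          // ...and give some of it back from the storage fraction (default 0.6).
          .set("spark.storage.memoryFraction", "0.4")

        val sc = new SparkContext(conf)

        // Keyed RDDs for the join; the first comma-separated field is the key.
        val left  = sc.textFile("hdfs:///data/left").map(l => (l.split(",")(0), l))
        val right = sc.textFile("hdfs:///data/right").map(l => (l.split(",")(0), l))

        // Repartitioning into more, smaller partitions before the wide (join)
        // stage reduces the deserialized data each task holds at spill time.
        val joined = left.repartition(400).join(right.repartition(400))

        joined.take(5).foreach(println)
        sc.stop()
      }
    }

Note that from Spark 1.6 onwards, spark.shuffle.memoryFraction and spark.storage.memoryFraction only take effect when spark.memory.useLegacyMode is enabled; the unified memory manager uses spark.memory.fraction instead. To lower the worker-thread-to-memory ratio, reduce SPARK_WORKER_CORES in conf/spark-env.sh (standalone mode) or pass a smaller --executor-cores to spark-submit.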

