If you're running out of memory during the shuffle, I'd suggest setting spark.sql.shuffle.partitions to 2001.
Whenever the number of partitions is greater than 2000, Spark uses a different data structure for shuffle book-keeping:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  …
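
So crossing the 2000-partition boundary switches the map-output tracking to the more compact HighlyCompressedMapStatus, which stores far less per-partition size metadata on the driver. A minimal sketch of applying the suggestion (assumes an existing SparkSession named spark and a DataFrame df; both are placeholders, not from the original answer):

```scala
// Nudge the partition count just past the 2000 threshold so that
// shuffle bookkeeping uses HighlyCompressedMapStatus.
spark.conf.set("spark.sql.shuffle.partitions", "2001")

// Any wide transformation from here on shuffles into 2001 partitions,
// e.g. a grouped aggregation:
val counts = df.groupBy("key").count()
```

Note this only affects shuffles introduced by Spark SQL / DataFrame operations; RDD-level shuffles take their partition count from the operation itself (or spark.default.parallelism).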
I really wish they would let you configure this independently.
Note: I found this information in a Cloudera slide deck.