Hi, I am using Spark SQL (specifically hiveContext.sql()) to run GROUP BY queries, and I am running into OOM issues. I tried increasing spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping. Please correct me if I am wrong: my understanding is that the shuffle load is spread across these partitions, so the more partitions there are, the less data each one has to hold. Please guide me, I am new to Spark. I am using Spark 1.4.0 and have around 1 TB of uncompressed data to process with hiveContext.sql() GROUP BY queries.

1 Answer


If you're running out of memory on the shuffle, I would suggest setting spark.sql.shuffle.partitions to 2001.
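For reference, here is a minimal sketch of how that setting could be applied from code. It assumes Spark 1.4.x with a HiveContext, as in the question; the application name, table name, and column name are hypothetical placeholders, not taken from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ShufflePartitionsExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical application name; master and memory settings come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-example"))
    val hiveContext = new HiveContext(sc)

    // Raise the number of shuffle partitions used by Spark SQL from the default of 200.
    hiveContext.setConf("spark.sql.shuffle.partitions", "2001")
    // The same setting can also be issued as a SQL statement:
    // hiveContext.sql("SET spark.sql.shuffle.partitions=2001")

    // Hypothetical table and column, standing in for the real GROUP BY query.
    val counts = hiveContext.sql(
      "SELECT some_key, COUNT(*) AS cnt FROM some_table GROUP BY some_key")
    counts.show()

    sc.stop()
  }
}
```

The setting has to be in place before the GROUP BY query runs, since it is read when the shuffle for that query is planned.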

Whenever the number of partitions is greater than 2000, Spark uses a different data structure for shuffle book-keeping:

private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    // One entry per reduce partition: above 2000 partitions, switch representation.
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  // rest of the object omitted
}

I really wish they would let you configure this independently.
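To make the effect of the 2000 threshold concrete, here is a rough, self-contained sketch that mimics the selection logic quoted above without depending on Spark. All names in it (StatusSketch, PerBlockSketch, AveragedSketch, StatusChooser) are hypothetical stand-ins, and the byte counts are illustrative assumptions rather than measurements: as I understand it, the compressed status keeps roughly one size entry per reduce partition, while the highly compressed one keeps an average size plus a record of which blocks are empty.

```scala
// Simplified stand-ins for Spark's internal map status classes (hypothetical names).
sealed trait StatusSketch {
  // Rough book-keeping cost per map task, in bytes (illustrative assumption).
  def bookkeepingBytes: Long
}

// Models one size byte per reduce partition.
final case class PerBlockSketch(numPartitions: Int) extends StatusSketch {
  def bookkeepingBytes: Long = numPartitions.toLong
}

// Models one 8-byte average size plus one bit per partition for "is empty".
final case class AveragedSketch(numPartitions: Int) extends StatusSketch {
  def bookkeepingBytes: Long = 8L + (numPartitions + 7) / 8
}

object StatusChooser {
  // Same cut-off as in the quoted MapStatus.apply.
  val Threshold = 2000

  def choose(numPartitions: Int): StatusSketch =
    if (numPartitions > Threshold) AveragedSketch(numPartitions)
    else PerBlockSketch(numPartitions)
}

object ThresholdDemo {
  def main(args: Array[String]): Unit = {
    // Compare the settings mentioned in this thread, plus the boundary value.
    for (n <- Seq(200, 1000, 2000, 2001)) {
      val status = StatusChooser.choose(n)
      println(s"$n partitions -> ${status.getClass.getSimpleName}, " +
        s"about ${status.bookkeepingBytes} bytes per map task")
    }
  }
}
```

Under these assumptions, going from 2000 to 2001 partitions switches the representation, which is why 2001 is suggested rather than a round number such as 2000.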


Note: I found this information in a Cloudera slide deck.
