Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (19k points)
What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200.

1 Answer

0 votes
by (33.1k points)

The spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations.

The spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. But the spark.default.parallelism seems to only be working for raw RDD and is ignored when working with data frames.

If your task doesn't need a join or aggregation and you are working with data frames then setting these will not have any effect. You set the number of partitions yourself by calling df.repartition(numOfPartitions).

For example:

sqlContext.setConf("spark.sql.shuffle.partitions", "300")

sqlContext.setConf("spark.default.parallelism", "300")

./bin/spark-submit --conf spark.sql.shuffle.partitions=300 --conf spark.default.parallelism=300

Hope this answer helps you!

Learn Spark with this Spark Certification Course by Intellipaat.

You can learn in-depth about SQL statements, queries and become proficient in SQL queries by enrolling in our industry-recognized SQL training online.

Browse Categories