
What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

I have tried setting both of them in Spark SQL, but the number of tasks in the second stage is always 200.

1 Answer


The spark.sql.shuffle.partitions setting configures the number of partitions that are used when shuffling data for joins or aggregations in Spark SQL.
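As a minimal sketch of this behaviour (the app name, local master, and the AQE toggle below are assumptions, not part of the original answer), the shuffle introduced by a groupBy aggregation picks up spark.sql.shuffle.partitions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-sketch")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "300")
  .config("spark.sql.adaptive.enabled", "false")  // keep Spark 3.x AQE from coalescing the shuffle
  .getOrCreate()
import spark.implicits._

// groupBy forces a shuffle, so the result has spark.sql.shuffle.partitions partitions
val counts = spark.range(0, 1000000).toDF("id")
  .groupBy(($"id" % 10).as("bucket"))
  .count()

println(counts.rdd.getNumPartitions)   // 300 instead of the default 200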

The spark.default.parallelism setting is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when the user does not set it explicitly. Note that spark.default.parallelism only applies to raw RDDs and is ignored when working with DataFrames.
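Here is a hedged sketch of that difference (the app name and local master are assumptions): spark.default.parallelism drives the partition count of a raw-RDD shuffle such as reduceByKey when no explicit count is passed.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("default-parallelism-sketch")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "300")
val sc = new SparkContext(conf)

// reduceByKey without an explicit numPartitions argument falls back to spark.default.parallelism
val reduced = sc.parallelize(1 to 1000000)
  .map(x => (x % 10, 1))
  .reduceByKey(_ + _)

println(reduced.getNumPartitions)   // 300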

If your job doesn't involve a join or aggregation and you are working with DataFrames, then setting these will not have any effect. You can set the number of partitions yourself by calling df.repartition(numOfPartitions), as sketched below.
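For instance, given any existing DataFrame df (assumed here), repartition sets the partition count directly:

// repartition shuffles the data into exactly the requested number of partitions,
// independently of spark.sql.shuffle.partitions
val repartitioned = df.repartition(300)
println(repartitioned.rdd.getNumPartitions)   // 300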

For example, spark.sql.shuffle.partitions can be changed at runtime through the SQLContext:

sqlContext.setConf("spark.sql.shuffle.partitions", "300")

spark.default.parallelism, on the other hand, is read when the SparkContext is created, so setting it through sqlContext.setConf has no effect; pass it to spark-submit (or set it on the SparkConf) instead:

./bin/spark-submit --conf spark.sql.shuffle.partitions=300 --conf spark.default.parallelism=300
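If you are on Spark 2.x or later, the same runtime change can be made through the SparkSession instead of the SQLContext (assuming a session named spark):

spark.conf.set("spark.sql.shuffle.partitions", "300")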

Hope this answer helps you!


