
I need to load data from a Hive table using the Spark SQL HiveContext and write it to HDFS. By default, the DataFrame returned by the SQL query has 2 partitions. To get more parallelism, I need more partitions out of the SQL query, but HiveContext has no overloaded method that takes a number-of-partitions parameter.

Repartitioning the resulting RDD causes a shuffle, which adds processing time.


val result = sqlContext.sql("select * from bt_st_ent")


This produces the following log output:

Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
Starting task 1.0 in stage 131.0 (TID 298, aster1.com, partition 1,NODE_LOCAL, 2204 bytes)


Is there any way to increase the number of partitions of the SQL output?

1 Answer


Spark < 2.0:

You can use Hadoop configuration options:

  • mapred.min.split.size

  • mapred.max.split.size

together with the HDFS block size, to control partition size for filesystem-based formats.

// Split sizes in bytes; choose values that suit your data.
val minSplit: Int = ???
val maxSplit: Int = ???

sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
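To get a feel for how the minimum split size, maximum split size, and HDFS block size interact, here is a minimal sketch in plain Scala. It assumes the clamping rule used by Hadoop's newer FileInputFormat (split size = max(min, min(max, block size))). The object and method names are hypothetical, and the split count is a rough ceiling-division estimate that ignores Hadoop's slop factor. Other input formats, and the older mapred API, may apply different rules.

```scala
// Hypothetical helper, not part of Spark or Hadoop.
object SplitSizing {
  private val MB: Long = 1024L * 1024L

  // Clamp the block size between the configured minimum and maximum split sizes.
  def splitSize(minSize: Long, maxSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(maxSize, blockSize))

  // Rough estimate of the number of splits for one file (ceiling division).
  def estimatedSplits(fileSize: Long, split: Long): Long =
    (fileSize + split - 1) / split

  def main(args: Array[String]): Unit = {
    val block = 128 * MB   // assumed HDFS block size
    val file  = 1024 * MB  // assumed size of a single input file

    val defaults = splitSize(1L, Long.MaxValue, block) // no effective min/max
    val capped   = splitSize(1L, 32 * MB, block)       // max split size of 32 MB

    println(s"default: ${estimatedSplits(file, defaults)} splits")
    println(s"capped:  ${estimatedSplits(file, capped)} splits")
  }
}
```

Under these assumptions, a 1 GB file with 128 MB blocks gives roughly 8 splits by default and roughly 32 once the maximum split size is lowered to 32 MB. This is why lowering the maximum can raise parallelism without a shuffle.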

Spark 2.0+:

You can use spark.sql.files.maxPartitionBytes configuration:

spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
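One way to check whether the setting takes effect is to read the data and print the partition count. The following is a sketch only: the input path is hypothetical, the local master is assumed for experimentation, and the setting applies only when the table is read through Spark's native file-based readers (for example Parquet).

```scala
import org.apache.spark.sql.SparkSession

object PartitionBytesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-bytes-sketch")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()

    // Lower the per-partition byte limit from the 128 MB default to 32 MB.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)

    // Hypothetical path; replace with the location of your own data.
    val df = spark.read.parquet("/tmp/bt_st_ent_parquet")

    // Inspect how many partitions the scan produced.
    println(s"partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

If the count does not change, the data source is probably not honouring this option.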

In both cases, a specific data source API may ignore these values, so always check the documentation or implementation details of the format you use.
