Back

Explore Courses Blog Tutorials Interview Questions
0 votes
1 view
in Big Data Hadoop & Spark by (11.4k points)

I am trying to effectively join two DataFrames, one of which is large and the second is a bit smaller.

Is there a way to avoid all this shuffling? I cannot set autoBroadCastJoinThreshold, because it supports only Integers - and the table I am trying to broadcast is slightly bigger than integer number of bytes.

Is there a way to force broadcast ignoring this variable?

1 Answer

0 votes
by (32.3k points)

Broadcast Hash Joins:

In SparkSQL, you can see the type of join being performed by calling queryExecution.executedPlan. As with core Spark, if one of the tables is much smaller than the other you may want a broadcast hash join. You can hint to Spark SQL that a given DF should be broadcast for join by calling method broadcast on the DataFrame before joining it

 

Example:  largedataframe.join(broadcast(smalldataframe), "key")

 

Is there a way to force broadcast ignoring this variable?

Try the below command:

sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")


Another way to hint for a dataframe to be broadcasted is by using

left.join(broadcast(right), ...)

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

28.4k questions

29.7k answers

500 comments

94.1k users

Browse Categories

...