
I am trying to efficiently join two DataFrames, one of which is large and the other a bit smaller.

Is there a way to avoid all this shuffling? I cannot set spark.sql.autoBroadcastJoinThreshold, because it accepts only Integer values, and the table I am trying to broadcast is slightly larger in bytes than the maximum Integer value.

Is there a way to force broadcast ignoring this variable?

1 Answer


Broadcast Hash Joins:

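In Spark SQL, you can see which type of join is being performed by inspecting queryExecution.executedPlan on the joined DataFrame. As with core Spark, if one of the tables is much smaller than the other, you may want a broadcast hash join, which ships the smaller table to every executor so that the larger one does not have to be shuffled. You can hint to Spark SQL that a given DataFrame should be broadcast for the join by wrapping it in the broadcast function (from org.apache.spark.sql.functions) before joining it: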

 

Example:  largedataframe.join(broadcast(smalldataframe), "key")
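To check that the hint took effect, you can build the join and look at the physical plan. The following is only a minimal sketch, assuming Spark 2.x or later (SparkSession) running in local mode; the object name BroadcastHintSketch, the helper hintedPlan, and the toy data are hypothetical and not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical helper object, used only for illustration.
object BroadcastHintSketch {

  // Builds a hinted join on toy data and returns its physical plan as text.
  def hintedPlan(spark: SparkSession): String = {
    import spark.implicits._

    // Stand-ins for the real tables; both share a Long column named "key".
    val largedataframe = spark.range(0, 100000).withColumnRenamed("id", "key")
    val smalldataframe = Seq((1L, "a"), (2L, "b")).toDF("key", "label")

    // The explicit hint asks Spark to broadcast the smaller side.
    val joined = largedataframe.join(broadcast(smalldataframe), "key")

    // The physical plan names the join operator that Spark selected.
    joined.queryExecution.executedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for experimentation only
      .appName("broadcast-hint-sketch")
      .getOrCreate()
    try println(hintedPlan(spark)) finally spark.stop()
  }
}
```

If the hint is honored, the printed plan should contain a BroadcastHashJoin operator rather than a SortMergeJoin.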

 

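Is there a way to force broadcast ignoring this variable?

Yes. The explicit broadcast() hint shown above does not depend on spark.sql.autoBroadcastJoinThreshold, so Spark honors it even when the table is larger than the threshold. The threshold only controls when Spark chooses a broadcast join on its own.

Note that the command below does not force a broadcast. Setting the threshold to -1 disables automatic broadcast joins, so that Spark broadcasts a table only when you hint it explicitly:

sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")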


The explicit hint has the same general form for any pair of DataFrames, and it still applies when the threshold is set to -1:

left.join(broadcast(right), ...)
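The difference between the threshold and the hint can be seen by comparing two plans for the same join. This is a hedged sketch, again assuming Spark 2.x or later with a SparkSession (spark.conf.set is used in place of the sqlContext.sql call); the object name ThresholdVersusHintSketch, the helper plans, and the toy data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical helper object, used only for illustration.
object ThresholdVersusHintSketch {

  // Returns (plan without hint, plan with hint) while automatic broadcasting is disabled.
  def plans(spark: SparkSession): (String, String) = {
    import spark.implicits._

    // -1 turns off automatic broadcast joins; it does not force one.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Toy stand-ins for the real tables, joined on a shared Long column "key".
    val left = spark.range(0, 100000).withColumnRenamed("id", "key")
    val right = Seq((1L, "a"), (2L, "b")).toDF("key", "label")

    // Without a hint, Spark may no longer pick a broadcast join by itself.
    val unhinted = left.join(right, "key").queryExecution.executedPlan.toString

    // With the hint, the right side is still requested as a broadcast.
    val hinted = left.join(broadcast(right), "key").queryExecution.executedPlan.toString

    (unhinted, hinted)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for experimentation only
      .appName("threshold-vs-hint-sketch")
      .getOrCreate()
    try {
      val (unhinted, hinted) = plans(spark)
      println(unhinted)
      println(hinted)
    } finally spark.stop()
  }
}
```

On a default configuration, the first plan would typically show a shuffle-based join such as SortMergeJoin, and the second a BroadcastHashJoin. Whether a very large table can actually be broadcast still depends on available driver and executor memory.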
