0 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

I am trying to effectively join two DataFrames, one of which is large and the second is a bit smaller.

Is there a way to avoid all this shuffling? I cannot set autoBroadCastJoinThreshold, because it supports only Integers - and the table I am trying to broadcast is slightly bigger than integer number of bytes.

Is there a way to force broadcast ignoring this variable?

1 Answer

0 votes
by (31.4k points)

Broadcast Hash Joins:

In SparkSQL, you can see the type of join being performed by calling queryExecution.executedPlan. As with core Spark, if one of the tables is much smaller than the other you may want a broadcast hash join. You can hint to Spark SQL that a given DF should be broadcast for join by calling method broadcast on the DataFrame before joining it


Example:  largedataframe.join(broadcast(smalldataframe), "key")


Is there a way to force broadcast ignoring this variable?

Try the below command:

sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")

Another way to hint for a dataframe to be broadcasted is by using

left.join(broadcast(right), ...)

Welcome to Intellipaat Community. Get your technical queries answered by top developers !