DataFrame join optimization - Broadcast Hash Join


1 Answer

Amit Rawat · Answer 1 · 2019-07-13T07:01:36+0000

Broadcast Hash Joins:

In Spark SQL, you can see which type of join is being performed by inspecting queryExecution.executedPlan (calling explain() on the DataFrame prints the same physical plan). As with core Spark, if one of the tables is much smaller than the other, you may want a broadcast hash join. You can hint to Spark SQL that a given DataFrame should be broadcast for the join by wrapping it in the broadcast function (from org.apache.spark.sql.functions, or pyspark.sql.functions in Python) before joining it.

Example: largedataframe.join(broadcast(smalldataframe), "key")
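To make the idea concrete, here is a minimal sketch in plain Python of what a broadcast hash join does conceptually. It is not Spark's implementation: the function name broadcast_hash_join, the list-of-partitions layout, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of the large side against the small side."""
    # Build phase: hash the small side once. In a cluster, this table is
    # what would be copied ("broadcast") to every executor.
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)

    # Probe phase: each partition is handled on its own, so the large
    # side does not need to be shuffled by key.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in table.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical sample data: two partitions of a "large" table and a small lookup table.
large = [
    [{"key": 1, "amount": 10}, {"key": 2, "amount": 20}],
    [{"key": 1, "amount": 30}, {"key": 3, "amount": 40}],
]
small = [{"key": 1, "name": "a"}, {"key": 2, "name": "b"}]

result = broadcast_hash_join(large, small, "key")
```

The row with key 3 has no partner in the small table, so it is dropped, as in an inner join. The point of the sketch is that only the small side is replicated; the large side stays where it is.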

Is there a way to force a broadcast regardless of the spark.sql.autoBroadcastJoinThreshold setting?

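Note that setting the threshold to -1 disables automatic broadcast joins; it does not force one. The command below turns automatic broadcasting off: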

sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")

To have a DataFrame broadcast regardless of the threshold, hint it explicitly in the join:

left.join(broadcast(right), ...)
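How the threshold and the explicit hint interact can be sketched as a small decision function. This is a simplified model in plain Python, not Spark's planner code: the name uses_broadcast and its parameters are hypothetical, and the real planner also takes the join type and the available statistics into account.

```python
# Spark's documented default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def uses_broadcast(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES, hinted=False):
    """Simplified model: would this side of the join be broadcast?"""
    # An explicit broadcast(...) hint takes precedence over the threshold.
    if hinted:
        return True
    # A negative threshold (for example -1) disables automatic broadcasting.
    if threshold_bytes < 0:
        return False
    # Otherwise broadcast only if the estimated size fits under the threshold.
    return estimated_size_bytes <= threshold_bytes


small_auto = uses_broadcast(1_000_000)                       # under the default threshold
big_auto = uses_broadcast(500_000_000)                       # too large for auto broadcast
big_hinted = uses_broadcast(500_000_000, hinted=True)        # the hint overrides the size check
small_disabled = uses_broadcast(1_000_000, threshold_bytes=-1)
hinted_disabled = uses_broadcast(1_000_000, threshold_bytes=-1, hinted=True)
```

Under this model, setting the threshold to -1 stops Spark from choosing a broadcast on its own, while a hinted DataFrame is still broadcast.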
