Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
4 views
in Big Data Hadoop & Spark by (11.4k points)

I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).

I'd like to filter all the rows from the largeDataFrame whenever the some_identifier column in the largeDataFrame matches one of the rows in the smallDataFrame.

Here's an example:

largeDataFrame

some_idenfitier,first_name
111,bob
123,phil
222,mary
456,sue


smallDataFrame

some_identifier
123
456


desiredOutput

111,bob
222,mary


Here is my ugly solution.

val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))
val desiredOutput = largeDataFrame.join(broadcast(smallDataFrame2), Seq("some_identifier"), "left").filter($"is_bad".isNull).drop("is_bad")


Is there a cleaner solution?

1 Answer

0 votes
by (32.3k points)

I would suggest you to use a left_anti join.

The LeftAntiJoin is the opposite of a LeftSemiJoin.

It filters out data from the right table in the left table according to a given key :

largeDataFrame

   .join(smallDataFrame, Seq("some_identifier"),"left_anti")

   .show

// +---------------+----------+

// |some_identifier|first_name|

// +---------------+----------+

// |            222|      mary|

// |            111|       bob|

// +---------------+----------+

...