I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).
I'd like to remove every row from the largeDataFrame whose some_identifier value matches one of the values in the smallDataFrame.
Here's an example:
largeDataFrame
some_identifier,first_name
111,bob
123,phil
222,mary
456,sue
smallDataFrame
some_identifier
123
456
desiredOutput
111,bob
222,mary
Here is my ugly solution.
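import org.apache.spark.sql.functions.{broadcast, lit}
// Tag every id in the small frame, left-join, then keep only the rows that picked up no tag.
val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))
val desiredOutput = largeDataFrame.join(broadcast(smallDataFrame2), Seq("some_identifier"), "left").filter($"is_bad".isNull).drop("is_bad")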
Is there a cleaner solution?
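One candidate that looks cleaner is Spark's "left_anti" join type, which keeps only the left-side rows that have no match on the right, so the marker column is not needed. Below is a minimal sketch, assuming Spark 2.0 or later and a local session; the object name AntiJoinSketch and the helper dropMatching are made up for illustration, and it has only been reasoned through on the toy data above, not on billions of rows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object AntiJoinSketch {
  // Hypothetical helper: keep the rows of `large` whose some_identifier
  // does NOT appear in `small`. The broadcast hint asks Spark to ship the
  // small frame to every executor instead of shuffling the large one.
  def dropMatching(large: DataFrame, small: DataFrame): DataFrame =
    large.join(broadcast(small), Seq("some_identifier"), "left_anti")

  def main(args: Array[String]): Unit = {
    // Local session for the toy example only (an assumption, not the real cluster setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val largeDataFrame = Seq((111, "bob"), (123, "phil"), (222, "mary"), (456, "sue"))
      .toDF("some_identifier", "first_name")
    val smallDataFrame = Seq(123, 456).toDF("some_identifier")

    val desiredOutput = dropMatching(largeDataFrame, smallDataFrame)
    desiredOutput.show()

    spark.stop()
  }
}
```

On the sample data this should leave only the 111,bob and 222,mary rows. Whether Spark actually honors the broadcast hint can be checked with desiredOutput.explain().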