in Big Data Hadoop & Spark by (11.4k points)

... by checking whether a column's value is in a Seq.
Perhaps I'm not explaining it very well; in regular SQL terms, I basically want DF_Column IN seq.

First I tried it using a broadcast variable (holding the seq), a UDF (that did the checking), and registerTempTable.
The problem is that I never got to test it, since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.

I ended up creating a new DataFrame out of the seq and doing an inner join with it (an intersection), but I doubt that's the most performant way to accomplish the task.

For example:

How do I filter based on whether the elements of one DataFrame's column are in another DF's column (like SQL: select * from A where login in (select username from B))?

E.g: First DF:

login      count
login1     192 
login2     146 
login3     72   

Second DF:


The result:

login      count
login2     146 
login3     72 
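The SQL-style subquery above can also be expressed directly with the DataFrame API as a left semi join, which keeps only the rows of the left side that have a match on the right. A minimal sketch, assuming two hypothetical DataFrames dfA (with a login column) and dfB (with a username column):

val result = dfA.join(dfB, dfA("login") === dfB("username"), "leftsemi")

A left semi join never duplicates rows from dfA and drops all of dfB's columns from the result, which matches the semantics of IN (subquery).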

1 Answer

by (32.3k points)

Following the description of your sample, I executed my code, which runs normally in Spark 1.4.0-SNAPSHOT, on these two configurations:


  • An IntelliJ IDEA test run

  • A Spark Standalone cluster with 8 nodes (1 master, 7 workers)

The code below will give you your desired output; please check it:


import org.apache.spark.sql.functions.{col, udf}

// Broadcast the lookup seq so every executor has a local copy
val bc = sc.broadcast(Array[String]("login3", "login4"))

// Sample data for the first DF
val x = Array(("login1", 192), ("login2", 146), ("login3", 72))
val xdf = sqlContext.createDataFrame(x).toDF("name", "cnt")

// UDF that checks membership in the broadcast array
val func: (String => Boolean) = (arg: String) => bc.value.contains(arg)
val sqlfunc = udf(func)

// Keep only the rows whose name appears in the broadcast seq
val filtered = xdf.filter(sqlfunc(col("name")))



name cnt 

login1 192 

login2 146 

login3 72


name cnt

login3 72
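As an aside, in later Spark releases the same filter can be written without a UDF or a broadcast variable by using Column.isin (available from Spark 1.5; earlier versions had a similarly shaped in method), which takes a varargs list of values:

// Sketch of the UDF-free variant, assuming xdf from the answer above
import org.apache.spark.sql.functions.col

val names = Seq("login3", "login4")
val filtered2 = xdf.filter(col("name").isin(names: _*))

For small lookup lists this is simpler and lets Catalyst optimize the predicate, whereas a UDF is a black box to the optimizer.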
