Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

... by checking whether a columns' value is in a seq.
Perhaps I'm not explaining it very well, I basically want this (to express it using regular SQL): DF_Column IN seq?

First I did it using a broadcast var (where I placed the seq), UDF (that did the checking) and registerTempTable.
The problem is that I didn't get to test it since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.

I ended up creating a new DataFrame out of seq and doing inner join with it (intersection), but I doubt that's the most performant way of accomplishing the task.

For example:

How to do filter based on whether elements of one DataFrame's column are in another DF's column (like SQL select * from A where login in (select username from B))?

E.g: First DF:

login      count
login1     192 
login2     146 
login3     72   


Second DF:

username
login2
login3
login4


The result:

login      count
login2     146 
login3     72 

1 Answer

0 votes
by (32.3k points)

Following description of your sample, I tried to execute my code that runs normally in Spark 1.4.0-SNAPSHOT on these two configurations:

 

  • Intellij IDEA's test

  • Spark Standalone cluster with 8 nodes (1 master, 7 worker)

Below code will give you your desired output, please check it:

 

val bc = sc.broadcast(Array[String]("login3", "login4"))

val x = Array(("login1", 192), ("login2", 146), ("login3", 72))

val xdf = sqlContext.createDataFrame(x).toDF("name", "cnt")

 

val func: (String => Boolean) = (arg: String) => bc.value.contains(arg)

val sqlfunc = udf(func)

val filtered = xdf.filter(sqlfunc(col("name")))

 

xdf.show()

Output:

 

name cnt 

login1 192 

login2 146 

login3 72 

 

filtered.show()

Output:

name cnt

login3 72

Browse Categories

...