What's the most efficient way to filter a DataFrame

Question

asked Jul 12, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

... by checking whether a columns' value is in a seq.
Perhaps I'm not explaining it very well, I basically want this (to express it using regular SQL): DF_Column IN seq?

First I did it using a broadcast var (where I placed the seq), UDF (that did the checking) and registerTempTable.
The problem is that I didn't get to test it since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.

I ended up creating a new DataFrame out of seq and doing inner join with it (intersection), but I doubt that's the most performant way of accomplishing the task.

For example:

How to do filter based on whether elements of one DataFrame's column are in another DF's column (like SQL select * from A where login in (select username from B))?

E.g: First DF:

login      count
login1     192
login2     146
login3     72

Second DF:

username
login2
login3
login4

The result:

login      count
login2     146
login3     72

1 Answer

Amit Rawat · Answer 1 · 2019-07-13T06:51:55+0000

Following description of your sample, I tried to execute my code that runs normally in Spark 1.4.0-SNAPSHOT on these two configurations:

Intellij IDEA's test
Spark Standalone cluster with 8 nodes (1 master, 7 worker)

Below code will give you your desired output, please check it:

val bc = sc.broadcast(Array[String]("login3", "login4"))
val x = Array(("login1", 192), ("login2", 146), ("login3", 72))
val xdf = sqlContext.createDataFrame(x).toDF("name", "cnt")

val func: (String => Boolean) = (arg: String) => bc.value.contains(arg)
val sqlfunc = udf(func)
val filtered = xdf.filter(sqlfunc(col("name")))

xdf.show()

Output:

name cnt
login1 192
login2 146
login3 72

filtered.show()

Output:

name cnt
login3 72

What's the most efficient way to filter a DataFrame

What's the most efficient way to filter a DataFrame

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Related questions

Browse Categories

Popular Courses

Top Tutorials

Top Articles

Top Interview Questions