Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')

where a is the tuple (1, 2, 3). I am getting this error:

java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found

which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is I can't manually write the values in a as it's extracted from another job.

How would I filter in this case?

1 Answer

0 votes
by (32.3k points)
edited by

The string you pass to SQLContext, it gets evaluated in the scope of the SQL environment. It is not able to capture the closure.

In order to pass a variable, you'll have to do it explicitly using string formatting:

df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))


sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()

##  2 

It is also obvious that, using something like this won’t be your priority in a "real" SQL environment due to security considerations but it shouldn't matter here.

In practice DataFrame DSL is a must choice when you want to create dynamic queries:

from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()

## 2

It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.

If you want to know more about Spark, then do check out this awesome video tutorial:

Browse Categories