+1 vote
in Big Data Hadoop & Spark by (11.4k points)
What's the difference between selecting with a where clause and filtering in Spark?
Are there any use cases in which one is more appropriate than the other?

When do I use

DataFrame newdf = df.select(df.col("*")).where(df.col("somecol").leq(10))


and when do I use

DataFrame newdf = df.select(df.col("*")).filter("somecol <= 10")

2 Answers

+1 vote
by (32.3k points)

According to the Spark documentation, "where() is an alias for filter()".

filter(condition) filters rows based on the given condition, and where() simply delegates to filter().

Parameters: condition – a Column of types.BooleanType, or a string of SQL expression.

>>> df.filter(df.age > 4).collect()
[Row(age=6, name=u'Amit')]

>>> df.where(df.age == 3).collect()
[Row(age=3, name=u'Prateek')]

>>> df.filter("age > 4").collect()
[Row(age=6, name=u'Amit')]

>>> df.where("age == 3").collect()
[Row(age=3, name=u'Prateek')]

0 votes
by (33.1k points)

Both 'filter' and 'where' in Spark SQL give the same result, because where() is just an alias for filter(); there is no difference between the two, and choosing one over the other is purely a matter of readability.

For example:

employee.filter($"age" > 15)

employee.where($"age" > 15)

// items here is assumed to be a Seq of IDs, e.g. val items = Seq(101, 102)
employees.filter($"emp_id".isin(items:_*)).show

employees.where($"emp_id".isin(items:_*)).show
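
If you want to verify this yourself, compare the physical plans Spark produces for the two calls. A minimal sketch for spark-shell (the employee data and column names below are made up for illustration; $"..." and toDF are already available there via spark.implicits._):

// Toy DataFrame just for this demonstration
val employee = Seq(("Amit", 6), ("Prateek", 3)).toDF("name", "age")

// Both print the same plan with an identical Filter node,
// since where() just calls filter() under the hood
employee.filter($"age" > 4).explain()
employee.where($"age" > 4).explain()

You should see the same Filter step in both outputs, which is why the choice between the two is just stylistic.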

Hope this answer helps you!
