
What's the difference between selecting with a where clause and filtering in Spark?
Are there any use cases in which one is more appropriate than the other one?

When do I use

DataFrame newdf = df.select(df.col("*")).where(df.col("somecol").leq(10));

and when

DataFrame newdf = df.select(df.col("*")).filter("somecol <= 10");

2 Answers

+1 vote
by (31.4k points)

According to the Spark documentation, "where() is an alias for filter()".

filter(condition) keeps only the rows that satisfy the given condition, and where() does exactly the same thing under a different name. The two calls are interchangeable, so use whichever reads better in your code.

Parameters: condition – a Column of types.BooleanType or a string containing a SQL expression.

>>> df.filter(df.age > 4).collect()

[Row(age=6, name=u'Amit')]

>>> df.where(df.age == 3).collect()

[Row(age=3, name=u'Prateek')]

>>> df.filter("age > 4").collect()

[Row(age=6, name=u'Amit')]

>>> df.where("age == 3").collect()

[Row(age=3, name=u'Prateek')]
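To make "alias" concrete, here is a toy sketch in plain Python. It is not Spark's code: TinyFrame and is_older_than_4 are hypothetical names made up for illustration, and the rows simply mirror the example above. It only shows what it means for where to be a second name for the same method as filter.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class TinyFrame:
    """Hypothetical stand-in for a DataFrame; not part of Spark."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = list(rows)

    def filter(self, condition: Callable[[Row], bool]) -> "TinyFrame":
        # Keep only the rows for which the condition holds.
        return TinyFrame([row for row in self.rows if condition(row)])

    # The alias: a second name bound to the very same function object.
    where = filter


def is_older_than_4(row: Row) -> bool:
    return row["age"] > 4


people = TinyFrame([
    {"age": 3, "name": "Prateek"},
    {"age": 6, "name": "Amit"},
])

via_filter = people.filter(is_older_than_4).rows
via_where = people.where(is_older_than_4).rows

print(via_filter == via_where)              # True
print(TinyFrame.where is TinyFrame.filter)  # True
```

Because both names point at one function, there is no case where one of them behaves differently from the other.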

0 votes
by (33.2k points)

Both 'filter' and 'where' in Spark SQL give the same result. There is no difference between the two.

For example:

// The $"age" column syntax needs the implicits import, e.g. import spark.implicits._
employee.filter($"age" > 15)

employee.where($"age" > 15)



Hope this answer helps you!
