Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

Spark 1.4.1

I encounter a situation where grouping by a dataframe, then counting and filtering on the 'count' column raises the exception below

import sqlContext.implicits._
import org.apache.spark.sql._

case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()


Then grouping and filtering:

df.groupBy("x").count()
  .filter("count >= 2")
  .show()


Throws an exception:

java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found count >= 2

1 Answer

0 votes
by (32.3k points)
edited by

When you pass a string to the filter function, the string is interpreted as SQL. Count is a SQL keyword and using count as a variable confuses the parser. This is a small bug. You can easily avoid this. Instead of using a String use a column expression, as shown below:

df.groupBy("x").count()

  .filter($"count" >= 2)// or .filter("`count` >= 2")

  .show()

If you want to know more about Spark, then do check out this awesome video tutorial:

 

...