
I need a window function that partitions by some keys (column names), orders by another column, and returns the rows with the top x ranks.

This works fine for ascending order:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  val top_keys: List[String] = top_key.split(",").map(_.trim).toList
  // partition by the first key, then the remaining keys
  val w = Window.partitionBy(top_keys.head, top_keys.tail: _*)
    .orderBy(top_value)
  val rankCondition = "rn <= " + top_x
  df.withColumn("rn", row_number().over(w))
    .where(rankCondition)
    .drop("rn")
}


But when I try to change the orderBy call to orderBy(desc(top_value)) or orderBy(top_value.desc), I get a syntax error. What's the correct syntax here?

1 Answer


There are two overloads of orderBy: one that takes column-name strings and one that takes Column objects (see the API docs). Your code uses the string version, which does not allow changing the sort direction. You need to switch to the Column version and call the desc method on the column, e.g., myCol.desc.

Now we get into API-design territory. Taking Column parameters buys you flexibility; for example, you can sort by arbitrary expressions, not just plain columns. If you want to keep an API that accepts a string rather than a Column, you need to convert the string to a column. The easiest way is org.apache.spark.sql.functions.col(myColName).
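To make the difference between the two overloads concrete, here is a small self-contained sketch; the DataFrame and column names are made up purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session and toy data, just to demonstrate the two orderBy overloads.
val spark = SparkSession.builder().master("local[1]").appName("orderByDemo").getOrCreate()
import spark.implicits._

val df = Seq(("a", 3), ("b", 1), ("c", 2)).toDF("key", "value")

// String overload: takes column names only, and always sorts ascending.
val ascValues = df.orderBy("value").collect().map(_.getInt(1))

// Column overload: build a Column from the name with col(), then flip direction with .desc.
val descValues = df.orderBy(col("value").desc).collect().map(_.getInt(1))
```

Note that orderBy("value".desc) cannot compile: "value" is a plain String, and desc is a method on Column, which is why the conversion via col() (or an import of spark.implicits._ and the $"value" syntax) is needed.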

Putting it all together, we get

.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
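Applied to the function from the question, the whole thing could look like the sketch below. It keeps the string-based signature from the question; the Int rank parameter and the helper's name are my own choices, not from the original:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch of the question's getTopX with a descending sort.
// topX is taken as an Int here rather than a String for simplicity.
def getTopXDesc(df: DataFrame, topX: Int, topKey: String, topValue: String): DataFrame = {
  val keys = topKey.split(",").map(_.trim)
  val w = Window
    .partitionBy(keys.head, keys.tail: _*)  // first key, then the rest
    .orderBy(col(topValue).desc)            // Column overload, so .desc is available
  df.withColumn("rn", row_number().over(w))
    .where(col("rn") <= topX)               // keep the top x rows per partition
    .drop("rn")
}
```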

...