
I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark aggregate functions. If this is not possible for some reason, a different approach would be fine as well.

I also have access to the percentile_approx Hive UDF but I don't know how to use it as an aggregate function.

For the sake of specificity, suppose I have the following dataframe:

from pyspark import SparkContext
import pyspark.sql.functions as f

sc = SparkContext()

df = sc.parallelize([
    ['A', 1],
    ['A', 2],
    ['A', 3],
    ['B', 4],
    ['B', 5],
    ['B', 6],
]).toDF(('grp', 'val'))

# magic_percentile is a stand-in for the aggregate function I am looking for
df_grp = df.groupBy('grp').agg(f.magic_percentile('val', 0.5).alias('med_val'))
df_grp.show()

1 Answer


Unfortunately, to the best of my knowledge, this is not possible with "pure" PySpark aggregate functions.

However, there is a workaround.

Since you mentioned that you have access to the percentile_approx Hive UDF, I suggest using it in a SQL query:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Register the dataframe as a temporary table so it can be queried with SQL
df.registerTempTable("df")

df2 = sqlContext.sql(
    "SELECT grp, percentile_approx(val, 0.5) AS med_val FROM df GROUP BY grp"
)
df2.show()
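
If you would rather stay inside groupBy / agg so you can combine the median with other aggregates, the same function can be wrapped with f.expr, which compiles a SQL expression string into a Column. A minimal sketch, using the df from your question:

import pyspark.sql.functions as f

# percentile_approx is invoked through expr(), so it behaves like any
# other aggregate and can be mixed with regular PySpark aggregates
df_grp = df.groupBy('grp').agg(
    f.expr('percentile_approx(val, 0.5)').alias('med_val'),
    f.count('val').alias('cnt')  # example of mixing in another aggregate
)
df_grp.show()

For your sample data this should give a med_val of 2 for group A and 5 for group B. Note also that on Spark 3.1 and later, percentile_approx is exposed directly as pyspark.sql.functions.percentile_approx, so the expr wrapper is no longer needed there.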
