How can I find median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median.

1 Answer


Spark 2.0+:

Use the approxQuantile method, which implements the Greenwald-Khanna algorithm:

Python:

df.approxQuantile("x", [0.5], 0.25)

Scala:

df.stat.approxQuantile("x", Array(0.5), 0.25)

where the last parameter is the relative error: the lower the value, the more accurate the result, at the cost of a more expensive computation.
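To make the relative error concrete, here is a minimal sketch of the documented guarantee (the helper name is mine): for n rows, probability p, and error err, the sample returned by approxQuantile has an exact rank between floor((p - err) * n) and ceil((p + err) * n).

```python
import math

def acceptable_ranks(n, p, err):
    """Spark's approxQuantile guarantee: the returned sample x satisfies
    floor((p - err) * n) <= rank(x) <= ceil((p + err) * n)."""
    return math.floor((p - err) * n), math.ceil((p + err) * n)

# For the ~700,000-element RDD in the question, the median (p=0.5) with
# err=0.25 may come from anywhere in the middle half of the sorted data:
print(acceptable_ranks(700000, 0.5, 0.25))  # (175000, 525000)
```

So with 700,000 elements an error of 0.25 is very loose; something like 0.001 narrows the acceptable rank window to about 1,400 rows around the true median.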


 

For Spark 2.2+: since SPARK-14352, it supports estimation on multiple columns:

df.approxQuantile(["x", "y", "z"], [0.5], 0.25)

and in Scala:

df.stat.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25)
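As a plain-Python sketch of the return shape (one list of quantiles per requested column; the `quantiles` helper and sample data are mine, and this computes exact rank-based quantiles rather than approximations):

```python
def quantiles(values, probs):
    """Exact quantile by rank: for probability p over n sorted values,
    take the element at index int(p * (n - 1))."""
    s = sorted(values)
    n = len(s)
    return [s[int(p * (n - 1))] for p in probs]

# One result list per column, mirroring approxQuantile's return value
data = {"x": [1, 2, 3, 4, 5], "y": [10, 20, 30], "z": [7, 8, 9, 9]}
print([quantiles(data[c], [0.5]) for c in ["x", "y", "z"]])  # [[3], [20], [8]]
```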


 

Here is another approach I used, based on window functions (with PySpark 2.2.0).

from pyspark.sql import DataFrame
from pyspark.sql.functions import percent_rank, pow, first

class median():
    """Median helper with an over method that accepts a window partition."""

    def __init__(self, df, col, name):
        assert col
        self.column = col
        self.df = df
        self.name = name

    def over(self, window):
        first_window = window.orderBy(self.column)                                  # first, order by the column we want the median of
        df = self.df.withColumn("percent_rank", percent_rank().over(first_window))  # percent_rank = 0.5 corresponds to the median
        second_window = window.orderBy(pow(df.percent_rank - 0.5, 2))               # order by (percent_rank - 0.5)^2 ascending
        return df.withColumn(self.name, first(self.column).over(second_window))     # the first row of the window is the median

def addMedian(self, col, median_name):
    """Method to be added to Spark's native DataFrame class."""
    return median(self, col, median_name)

# Add the method to the DataFrame class
DataFrame.addMedian = addMedian

Then, to calculate the median of col2, call the addMedian method:

from pyspark.sql import Window

median_window = Window.partitionBy("col1")
df = df.addMedian("col2", "median").over(median_window)
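To see why ordering by (percent_rank - 0.5)^2 works, here is a plain-Python sketch of the same trick (the `window_median` helper is mine, not part of the PySpark code): the percent rank of the i-th value in sorted order is i / (n - 1), and the head of the re-ordered data is the value whose percent rank lies closest to 0.5.

```python
def window_median(values):
    """Mimic the two-window trick: compute each value's percent rank in
    sorted order, then take the value whose rank is closest to 0.5."""
    s = sorted(values)
    n = len(s)
    # key = (percent_rank - 0.5)^2; ascending order puts the median first
    ranked = [((i / (n - 1) - 0.5) ** 2, v) for i, v in enumerate(s)]
    return min(ranked)[1]

print(window_median([40, 10, 50, 20, 30]))  # 30
```

For an even number of rows there is no exact percent_rank of 0.5, so this approach (like the window version above) returns one of the two middle values rather than their average.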

Finally, you can group by if needed:

df.groupby("col1", "median")
