Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

How do I compute the cumulative sum per group specifically using the DataFrame abstraction; and in PySpark?

With an example dataset as follows:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")],
                                 ["time", "value", "class"] )

+----+-----+-----+
|time|value|class|
+----+-----+-----+
|   1|    2|    a|
|   3|    2|    a|
|   1|    3|    b|
|   2|    2|    a|
|   2|    3|    b|
+----+-----+-----+


I would like to add a cumulative sum column of value for each class grouping over the (ordered) time variable.

1 Answer

0 votes
by (32.3k points)

In order to achieve your desired output, I would suggest you use a combination of a window function and the Window.unboundedPreceding value in the window's range as follows:

from pyspark.sql import Window

from pyspark.sql import functions as F

windowval = (Window.partitionBy('class').orderBy('time')

             .rangeBetween(Window.unboundedPreceding, 0))

df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))

df_w_cumsum.show()

+----+-----+-----+-------+

|time|value|class|cum_sum|

+----+-----+-----+-------+

|   1|    3|    b|      3|

|   2|    3|    b|      6|

|   1|    2|    a|      2|

|   2|    2|    a|      4|

|   3|    2|    a|      6|

+----+-----+-----+-------+

Browse Categories

...