0 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

I am having a Spark SQL DataFrame with data and what I'm trying to get is all the rows preceding current row in a given date range. So for example I want to have all the rows from 7 days back preceding given row. I figured out I need to use a Window Function like:

Window \
    .partitionBy('id') \
    .orderBy('start')


and here comes the problem. I want to have a rangeBetween 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such option? For now I'm just getting all the preceding rows with:

.rowsBetween(-sys.maxsize, 0)


but would like to achieve something like:

.rangeBetween("7 days", 0)

1 Answer

0 votes
by (32.5k points)

I don’t think that what you are asking for is directly possible(check this) in Spark or in Hive. Both require ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to timestamp and operating on seconds. Assuming start column contains date type as shown in the code below::

from pyspark.sql import Row

row = Row("id", "start", "some_value")

df = sc.parallelize([

    row(1, "2015-01-01", 20.0),

    row(1, "2015-01-06", 10.0),

    row(1, "2015-01-07", 25.0),

    row(1, "2015-01-12", 30.0),

    row(2, "2015-01-01", 5.0),

    row(2, "2015-01-03", 30.0),

    row(2, "2015-02-01", 20.0)

]).toDF().withColumn("start", col("start").cast("date"))

A small helper and window definition:

from pyspark.sql.window import Window

from pyspark.sql.functions import mean, col


 

# Hive timestamp is interpreted as UNIX timestamp in seconds*

days = lambda i: i * 86400 

Finally, query:

w = (Window()

   .partitionBy(col("id"))

   .orderBy(col("start").cast("timestamp").cast("long"))

   .rangeBetween(-days(7), 0))

df.select(col("*"), mean("some_value").over(w).alias("mean")).show()

## +---+----------+----------+------------------+

## | id|     start|some_value|              mean|

## +---+----------+----------+------------------+

## |  1|2015-01-01|      20.0| 20.0|

## |  1|2015-01-06|      10.0| 15.0|

## |  1|2015-01-07|      25.0|18.333333333333332|

## |  1|2015-01-12|      30.0|21.666666666666668|

## |  2|2015-01-01|       5.0| 5.0|

## |  2|2015-01-03|      30.0| 17.5|

## |  2|2015-02-01|      20.0| 20.0|

## +---+----------+----------+------------------+

Related questions

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...