
I'm wondering how I can achieve the following in Spark (PySpark).

Initial Dataframe:

+--+---+
|id|num|
+--+---+
|4 |9.0|
|3 |7.0|
|2 |3.0|
|1 |5.0|
+--+---+

Resulting Dataframe:

+--+---+-------+
|id|num|new_Col|
+--+---+-------+
|4 |9.0|7.0    |
|3 |7.0|3.0    |
|2 |3.0|5.0    |
+--+---+-------+

I can generally append new columns to a DataFrame using something like: df.withColumn("new_Col", df.num * 10)

However, I have no idea how to achieve this "shift of rows" for the new column, so that the new column holds the value of a field from the previous row (as shown in the example above).

1 Answer


You can solve this with the lag window function, which returns a column's value from an earlier row in an ordered window. Start with the imports:

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window
Then, create your df with the id and num columns from the question.

Finally, use LAG():

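# Order the whole DataFrame by id; no partitioning is needed for this example
w = Window.orderBy(col("id"))

# lag("num") takes num from the row with the next-lower id; the first row has none and is dropped
df.select("*", lag("num").over(w).alias("new_Col")).na.drop().show()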

