Explore Courses Blog Tutorials Interview Questions
0 votes
1 view
in Big Data Hadoop & Spark by (11.4k points)

There's a DataFrame in pyspark with data as below:

user_id object_id score
user_1  object_1  3
user_1  object_1  1
user_1  object_2  2
user_2  object_1  5
user_2  object_2  2
user_2  object_2  6

What I expect is returning 2 records in each group with the same user_id, which need to have the highest score. Consequently, the result should look as the following:

user_id object_id score
user_1  object_1  3
user_1  object_2  2
user_2  object_2  6
user_2  object_1  5

I'm really new to pyspark, could anyone give me a code snippet or portal to the related documentation of this problem?

1 Answer

0 votes
by (32.3k points)

I would suggest you to use window functions here in order to attain the rank of each row based on user_id and score, and subsequently filter your results to only keep the first two values.

from pyspark.sql.window import Window

from pyspark.sql.functions import rank, col

window = Window.partitionBy(df['user_id']).orderBy(df['score'].desc())'*', rank().over(window).alias('rank')) 

  .filter(col('rank') <= 2) 





#| user_1| object_1|    3|   1|

#| user_1| object_2|    2|   2|

#| user_2| object_2|    6|   1|

#| user_2| object_1|    5|   2|


You can also understand this from spark official programming guide . It is a good place to start learning Spark.


rdd = sc.parallelize([("user_1",  "object_1",  3), 

                      ("user_1",  "object_2",  2), 

                      ("user_2",  "object_1",  5), 

                      ("user_2",  "object_2",  2), 

                      ("user_2",  "object_2",  6)])

df = sqlContext.createDataFrame(rdd, ["user_id", "object_id", "score"])

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

28.4k questions

29.7k answers


94.2k users

Browse Categories