in Big Data Hadoop & Spark by (11.4k points)

Let us say I have the following two RDDs of key-value pairs, where each value is a list.

rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]


rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ]

Now, I want to join them by key and concatenate the lists for each key, so for example I want to return the following:

ret = [ (key1, [value1, value2, value5, value6]), (key2, [value3, value4, value7]) ] 

How can I do this in Spark, using Python or Scala?

1 Answer

by (32.3k points)

I would suggest using join and then mapping over the resulting RDD to concatenate the two lists for each key (Scala):

rdd1.join(rdd2).map { case (k, (ls, rs)) => (k, ls ++ rs) }
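
Note that join is an inner join, so a key that appears in only one of the two RDDs is dropped from the result. As a rough sketch of the difference, the example below runs the join approach next to a cogroup-based variant that keeps every key. The object name MergeListsByKey, the helper names joinMerge and cogroupMerge, the local[*] master, and the extra key3 entry are all assumptions made for illustration; they are not part of the original question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MergeListsByKey {
  type Pairs = RDD[(String, List[String])]

  // Inner join: only keys present in BOTH RDDs survive.
  def joinMerge(a: Pairs, b: Pairs): Pairs =
    a.join(b).map { case (k, (ls, rs)) => (k, ls ++ rs) }

  // cogroup: keeps every key from either RDD; a missing side is just empty.
  def cogroupMerge(a: Pairs, b: Pairs): Pairs =
    a.cogroup(b).mapValues { case (ls, rs) => (ls.flatten ++ rs.flatten).toList }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-lists-by-key")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd1 = sc.parallelize(Seq(
      "key1" -> List("value1", "value2"),
      "key2" -> List("value3", "value4")))
    // key3 is a hypothetical extra entry that exists only in rdd2.
    val rdd2 = sc.parallelize(Seq(
      "key1" -> List("value5", "value6"),
      "key2" -> List("value7"),
      "key3" -> List("value8")))

    joinMerge(rdd1, rdd2).collect().foreach(println)    // key3 is absent here
    cogroupMerge(rdd1, rdd2).collect().foreach(println) // key3 is kept here

    spark.stop()
  }
}
```

If both RDDs are guaranteed to contain the same keys, as in the question, the two variants produce the same pairs and the shorter join version is enough. The order in which collect() returns the pairs is not guaranteed.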
