

In my Pig code I do this:

all_combined = UNION relation1, relation2,
    relation3, relation4, relation5, relation6;


I want to do the same in Spark. Unfortunately, it seems I have to keep doing it pairwise:

first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
# ... and so on


Is there a union operator that will let me operate on multiple RDDs at a time?

e.g. union(rdd1, rdd2, rdd3, rdd4, rdd5, rdd6)

It is a matter of convenience.

1 Answer


If these are RDDs, you can use the SparkContext.union method:

rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd3 = sc.parallelize([7, 8, 9])

rdd = sc.union([rdd1, rdd2, rdd3])
rdd.collect()
## [1, 2, 3, 4, 5, 6, 7, 8, 9]
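This is more than a convenience: SparkContext.union takes the whole list in one call, so you avoid folding the RDDs together pairwise as in the question. A minimal sketch comparing the two (assuming rdd1, rdd2, rdd3 from above):

rdds = [rdd1, rdd2, rdd3]

# Pairwise chaining, as in the question
chained = rdds[0]
for r in rdds[1:]:
    chained = chained.union(r)

# Single call over the list
combined = sc.union(rdds)

# Both produce the same elements
assert sorted(chained.collect()) == sorted(combined.collect())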

For DataFrames there is no direct equivalent, but this approach will work:

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))

unionAll(df1, df2, df3).show()

## +---+----+
## |  k|   v|
## +---+----+
## |  1|foo1|
## |  2|bar1|
## |  3|foo2|
## |  4|bar2|
## |  5|foo3|
## |  6|bar3|
## +---+----+
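Note that in Spark 2.0 and later, DataFrame.unionAll is deprecated in favour of DataFrame.union; both behave like SQL UNION ALL, resolving columns by position and keeping duplicates. A sketch of the same helper for newer Spark versions (union_all is just my name for it, not a built-in):

from functools import reduce
from pyspark.sql import DataFrame

def union_all(*dfs):
    # DataFrame.union matches columns by position (Spark 2.0+).
    # Use DataFrame.unionByName (Spark 2.3+) to match by column name instead.
    return reduce(DataFrame.union, dfs)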

