Concatenate two PySpark dataframes

Question

asked Jul 12, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I'm trying to concatenate two PySpark dataframes, each of which has some columns that the other lacks:

from pyspark.sql.functions import randn, rand
df_1 = sqlContext.range(0, 10)

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

df_2 = sqlContext.range(11, 20)

+---+
| id|
+---+
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+

df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))

Now I want to generate a third dataframe, something like what pandas concat would give me:

df_1.show()

+---+--------------------+--------------------+
| id|             uniform|              normal|
+---+--------------------+--------------------+
|  0|  0.8122802274304282|  1.2423430583597714|
|  1|  0.8642043127063618|  0.3900018344856156|
|  2|  0.8292577771850476|  1.8077401259195247|
|  3|   0.198558705368724| -0.4270585782850261|
|  4|0.012661361966674889|   0.702634599720141|
|  5|  0.8535692890157796|-0.42355804115129153|
|  6|  0.3723296190171911|  1.3789648582622995|
|  7|  0.9529794127670571| 0.16238718777444605|
|  8|  0.9746632635918108| 0.02448061333761742|
|  9|   0.513622008243935|  0.7626741803250845|
+---+--------------------+--------------------+

df_2.show()

+---+--------------------+--------------------+
| id|             uniform|            normal_2|
+---+--------------------+--------------------+
| 11|  0.3221262660507942|  1.0269298899109824|
| 12|  0.4030672316912547|   1.285648175568798|
| 13|  0.9690555459609131|-0.22986601831364423|
| 14|0.011913836266515876|  -0.678915153834693|
| 15|  0.9359607054250594|-0.16557488664743034|
| 16| 0.45680471157575453| -0.3885563551710555|
| 17|  0.6411908952297819|  0.9161177183227823|
| 18|  0.5669232696934479|  0.7270125277020573|
| 19|   0.513622008243935|  0.7626741803250845|
+---+--------------------+--------------------+

#do some concatenation here, how?

df_concat.show()

+---+--------------------+--------------------+------------+
| id|             uniform|              normal|    normal_2|
+---+--------------------+--------------------+------------+
|  0|  0.8122802274304282|  1.2423430583597714|        None|
|  1|  0.8642043127063618|  0.3900018344856156|        None|
|  2|  0.8292577771850476|  1.8077401259195247|        None|
|  3|   0.198558705368724| -0.4270585782850261|        None|
|  4|0.012661361966674889|   0.702634599720141|        None|
|  5|  0.8535692890157796|-0.42355804115129153|        None|
|  6|  0.3723296190171911|  1.3789648582622995|        None|
|  7|  0.9529794127670571| 0.16238718777444605|        None|
|  8|  0.9746632635918108| 0.02448061333761742|        None|
|  9|   0.513622008243935|  0.7626741803250845|        None|
| 11|  0.3221262660507942|                None|       0.123|
| 12|  0.4030672316912547|                None|     0.12323|
| 13|  0.9690555459609131|                None|       0.123|
| 14|0.011913836266515876|                None|     0.18923|
| 15|  0.9359607054250594|                None|     0.99123|
| 16| 0.45680471157575453|                None|       0.123|
| 17|  0.6411908952297819|                None|       1.123|
| 18|  0.5669232696934479|                None|     0.10023|
| 19|   0.513622008243935|                None| 0.916332123|
+---+--------------------+--------------------+------------+

Is that possible?
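
For reference, the pandas behaviour being asked for can be sketched as follows. The frames `pdf_1` and `pdf_2` are small made-up stand-ins for df_1 and df_2, not the real data above; the point is only that `pd.concat` stacks the rows and fills columns missing from one side with NaN.

```python
import pandas as pd

# Tiny stand-ins for df_1 and df_2; the values are invented for illustration.
pdf_1 = pd.DataFrame({"id": [0, 1], "uniform": [0.81, 0.86], "normal": [1.24, 0.39]})
pdf_2 = pd.DataFrame({"id": [11, 12], "uniform": [0.32, 0.40], "normal_2": [1.03, 1.29]})

# concat stacks the rows; a column that one frame lacks is filled with NaN there.
pdf_concat = pd.concat([pdf_1, pdf_2], ignore_index=True, sort=False)
print(pdf_concat)
```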

1 Answer

Amit Rawat · Answer 1 · 2019-07-12T13:25:25+0000

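For PySpark 2.x:

After some research, I found a way to do it: do a full outer join on id, and merge the two uniform columns with a small UDF. This works here because the id values of the two dataframes do not overlap.

from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand
import pyspark.sql.functions as F

df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=11).alias("uniform"), randn(seed=28).alias("normal_2"))

# Take the value from whichever side of the join has one.
# Compare against None so that a legitimate 0.0 is not discarded.
def get_uniform(df1_uniform, df2_uniform):
    if df1_uniform is not None:
        return df1_uniform
    return df2_uniform

u_f = F.udf(get_uniform, FloatType())

# A plain union() matches columns by position, not by name,
# so use a full outer join on "id" instead.
df_concat = (
    df_1.join(df_2, on="id", how="outer")
    .select("id", u_f(df_1["uniform"], df_2["uniform"]).alias("uniform"), "normal", "normal_2")
    .orderBy("id")
)

df_concat.show()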

Output:

+---+-----------+--------------------+--------------------+
| id|    uniform|              normal|            normal_2|
+---+-----------+--------------------+--------------------+
|  0| 0.41371265|  0.5888539012978773|                null|
|  1|  0.7311719|  0.8645537008427937|                null|
|  2| 0.19829196| 0.06157382353970104|                null|
|  3| 0.12714182|  0.3623040918178586|                null|
|  4|  0.7604318|-0.49575204523675975|                null|
|  5|0.120307155|  1.0854146699817222|                null|
|  6| 0.12131364| -0.5284523629183004|                null|
|  7| 0.44292918| -0.4798519469521663|                null|
|  8| 0.88987845| -0.8820294772950535|                null|
|  9|0.036507078| -2.1591956435415334|                null|
| 11| 0.19829196|                null| 0.06157382353970104|
| 12| 0.12714182|                null|  0.3623040918178586|
| 13|0.120307155|                null|  1.0854146699817222|
| 14| 0.12131364|                null| -0.5284523629183004|
| 15| 0.44292918|                null| -0.4798519469521663|
| 16| 0.88987845|                null| -0.8820294772950535|
| 17| 0.27310732|                null|-0.15116027592854422|
| 18|  0.7784518|                null| -0.3785563841011868|
| 19| 0.43776396|                null| 0.47700719174464357|
+---+-----------+--------------------+--------------------+
