0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I'm trying to concatenate two PySpark DataFrames, where some columns exist only in one of them:

from pyspark.sql.functions import randn, rand

df_1 = sqlContext.range(0, 10)

+--+
|id|
+--+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+--+

df_2 = sqlContext.range(11, 20)

+--+
|id|
+--+
|11|
|12|
|13|
|14|
|15|
|16|
|17|
|18|
|19|
+--+

df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))


Now I want to generate a third DataFrame. I would like something like pandas' concat:

df_1.show()


+---+--------------------+--------------------+
| id|             uniform|              normal|
+---+--------------------+--------------------+
|  0|  0.8122802274304282|  1.2423430583597714|
|  1|  0.8642043127063618|  0.3900018344856156|
|  2|  0.8292577771850476|  1.8077401259195247|
|  3|   0.198558705368724| -0.4270585782850261|
|  4|0.012661361966674889|   0.702634599720141|
|  5|  0.8535692890157796|-0.42355804115129153|
|  6|  0.3723296190171911|  1.3789648582622995|
|  7|  0.9529794127670571| 0.16238718777444605|
|  8|  0.9746632635918108| 0.02448061333761742|
|  9|   0.513622008243935|  0.7626741803250845|
+---+--------------------+--------------------+

df_2.show()


+---+--------------------+--------------------+
| id|             uniform|            normal_2|
+---+--------------------+--------------------+
| 11|  0.3221262660507942|  1.0269298899109824|
| 12|  0.4030672316912547|   1.285648175568798|
| 13|  0.9690555459609131|-0.22986601831364423|
| 14|0.011913836266515876|  -0.678915153834693|
| 15|  0.9359607054250594|-0.16557488664743034|
| 16| 0.45680471157575453| -0.3885563551710555|
| 17|  0.6411908952297819|  0.9161177183227823|
| 18|  0.5669232696934479|  0.7270125277020573|
| 19|   0.513622008243935|  0.7626741803250845|
+---+--------------------+--------------------+

#do some concatenation here, how?

df_concat.show()

+---+--------------------+--------------------+--------------------+
| id|             uniform|              normal|            normal_2|
+---+--------------------+--------------------+--------------------+
|  0|  0.8122802274304282|  1.2423430583597714|                null|
|  1|  0.8642043127063618|  0.3900018344856156|                null|
|  2|  0.8292577771850476|  1.8077401259195247|                null|
|  3|   0.198558705368724| -0.4270585782850261|                null|
|  4|0.012661361966674889|   0.702634599720141|                null|
|  5|  0.8535692890157796|-0.42355804115129153|                null|
|  6|  0.3723296190171911|  1.3789648582622995|                null|
|  7|  0.9529794127670571| 0.16238718777444605|                null|
|  8|  0.9746632635918108| 0.02448061333761742|                null|
|  9|   0.513622008243935|  0.7626741803250845|                null|
| 11|  0.3221262660507942|                null|  1.0269298899109824|
| 12|  0.4030672316912547|                null|   1.285648175568798|
| 13|  0.9690555459609131|                null|-0.22986601831364423|
| 14|0.011913836266515876|                null|  -0.678915153834693|
| 15|  0.9359607054250594|                null|-0.16557488664743034|
| 16| 0.45680471157575453|                null| -0.3885563551710555|
| 17|  0.6411908952297819|                null|  0.9161177183227823|
| 18|  0.5669232696934479|                null|  0.7270125277020573|
| 19|   0.513622008243935|                null|  0.7626741803250845|
+---+--------------------+--------------------+--------------------+
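
For reference, here is a minimal sketch of the pandas behaviour I have in mind (toy values, just to show how pd.concat aligns columns by name and fills the gaps with NaN):

import pandas as pd

pdf_1 = pd.DataFrame({"id": [0, 1], "uniform": [0.81, 0.86], "normal": [1.24, 0.39]})
pdf_2 = pd.DataFrame({"id": [11, 12], "uniform": [0.32, 0.40], "normal_2": [1.03, 1.29]})

# concat stacks the rows; "normal" and "normal_2" become NaN where a frame lacks them
print(pd.concat([pdf_1, pdf_2], ignore_index=True))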


Is that possible?

1 Answer

0 votes
by (32.3k points)

For PySpark 2.x:

After a lot of research, I finally found a way to do it. Just follow the steps below:

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import randn, rand, lit

df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)

df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=11).alias("uniform"), randn(seed=28).alias("normal_2"))

# union() matches columns by position, so both DataFrames must have the
# same schema: add the column each side is missing as a typed null, and
# put the columns in the same order before the union.
df_1_padded = df_1.withColumn("normal_2", lit(None).cast(DoubleType()))
df_2_padded = df_2.withColumn("normal", lit(None).cast(DoubleType())) \
                  .select("id", "uniform", "normal", "normal_2")

df_concat = df_1_padded.union(df_2_padded)
df_concat.show()

Output:

+---+-----------+--------------------+--------------------+
| id|    uniform|              normal|            normal_2|
+---+-----------+--------------------+--------------------+
|  0| 0.41371265|  0.5888539012978773|                null|
|  1|  0.7311719|  0.8645537008427937|                null|
|  2| 0.19829196| 0.06157382353970104|                null|
|  3| 0.12714182|  0.3623040918178586|                null|
|  4|  0.7604318|-0.49575204523675975|                null|
|  5|0.120307155|  1.0854146699817222|                null|
|  6| 0.12131364| -0.5284523629183004|                null|
|  7| 0.44292918| -0.4798519469521663|                null|
|  8| 0.88987845| -0.8820294772950535|                null|
|  9|0.036507078| -2.1591956435415334|                null|
| 11| 0.19829196|                null| 0.06157382353970104|
| 12| 0.12714182|                null|  0.3623040918178586|
| 13|0.120307155|                null|  1.0854146699817222|
| 14| 0.12131364|                null| -0.5284523629183004|
| 15| 0.44292918|                null| -0.4798519469521663|
| 16| 0.88987845|                null| -0.8820294772950535|
| 17| 0.27310732|                null|-0.15116027592854422|
| 18|  0.7784518|                null| -0.3785563841011868|
| 19| 0.43776396|                null| 0.47700719174464357|
+---+-----------+--------------------+--------------------+
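
Note that union() resolves columns by position, which is why the null padding above is needed. If you are on Spark 3.1 or later, unionByName with allowMissingColumns=True does the padding for you; a minimal sketch, using the original three-column df_1 and df_2:

# Spark 3.1+: match columns by name and null-fill the ones missing on either side
df_concat = df_1.unionByName(df_2, allowMissingColumns=True)
df_concat.show()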
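
Alternatively, a full outer join on id, with coalesce() merging the two uniform columns, produces the same result. A sketch, assuming id values are unique within each DataFrame:

from pyspark.sql.functions import coalesce

# the outer join keeps ids from both sides; coalesce picks whichever
# side's uniform value is non-null on each row
df_a = df_1.withColumnRenamed("uniform", "uniform_1")
df_b = df_2.withColumnRenamed("uniform", "uniform_2")

df_concat = (df_a.join(df_b, on="id", how="outer")
                 .select("id",
                         coalesce("uniform_1", "uniform_2").alias("uniform"),
                         "normal", "normal_2")
                 .orderBy("id"))
df_concat.show()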

If you wish to learn PySpark, visit this PySpark Tutorial.
