+4 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?

5 Answers

+5 votes
by (31.4k points)
edited by

Here is how you can concatenate columns using “concat” function:

import pyspark
from pyspark.sql import functions as sf

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()

gives,

+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+

Create a new column by concatenating:

df = df.withColumn('joined_column',
                   sf.concat(sf.col('colname1'), sf.lit('_'), sf.col('colname2')))
df.show()

+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+

Another method:

If you are using Spark 2.3 or a later version, Spark SQL supports the concatenation operator ||.

For example:

val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")


by (17.2k points)
This worked for me. Thank you.
by (47.2k points)
If you don't know the number or names of the columns, you can use the following:

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
by (40.4k points)
Well explained. Thank you.
+2 votes
by (96.7k points)

You can use the following code in Scala:

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))

For Python:

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))

by (19.8k points)
It worked for me!
+3 votes
by (44.6k points)

If you are running it on raw SQL, then use CONCAT:

In Scala

import sqlContext.implicits._

val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

In Python

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

by (31.8k points)
I used the python code given in your answer. It worked for me!
+1 vote
by (33.2k points)

Try this code using PySpark:

# Import the concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit

# Create the data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])

# Use select, concat, and lit to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))

# Show the new data frame
personDF.show()

Output:


+------------+
|East African|
+------------+
|   Ethiopian|
|      Kenyan|
|     Ugandan|
|     Rwandan|
+------------+

by (29.1k points)
well explained!
by (29.8k points)
i used the same, works for me
0 votes
by (107k points)

To concatenate columns in an Apache Spark DataFrame when you don't know the number or names of the columns, you can use the following code:

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))


