+4 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?

5 Answers

+5 votes
by (31.4k points)
edited by

Here is how you can concatenate columns using “concat” function:

import pyspark
from pyspark.sql import functions as sf

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()

gives,

+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+

Create a new column by concatenating:

df = df.withColumn('joined_column',
                   sf.concat(sf.col('colname1'), sf.lit('_'), sf.col('colname2')))
df.show()

+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+

Another method:

If you are using Spark 2.3 or a later version, Spark SQL supports the concatenation operator ||.

For example:

val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")


by (17.2k points)
This worked for me. Thank you.
by (47.2k points)
If you don't know the number or names of the columns, you can use the following:

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
by (40.4k points)
Well explained. Thank you.
+2 votes
by (96.7k points)

You can use the following code in Scala:

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))

For Python:

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))

by (19.8k points)
It worked for me!
+3 votes
by (44.6k points)

If you are running it on raw SQL, then use CONCAT:

In Scala

import sqlContext.implicits._

val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

In Python

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

by (31.8k points)
I used the python code given in your answer. It worked for me!
+1 vote
by (33.2k points)

Try this code using PySpark:

# Import the concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit

# Create the data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])

# Use select, concat, and lit to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))

# Show the new data frame
personDF.show()

Output:


+------------+
|East African|
+------------+
|   Ethiopian|
|      Kenyan|
|     Ugandan|
|     Rwandan|
+------------+

by (29.1k points)
well explained!
by (29.8k points)
i used the same, works for me
0 votes
by (107k points)

To concatenate columns in an Apache Spark DataFrame when you don't know the number or names of the columns, you can use the following code:

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))


