Here is how you can concatenate columns using the “concat” function:
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

# SparkSession is the modern entry point (it replaces SparkContext/SQLContext)
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('row11', 'row12'), ('row21', 'row22')], ['colname1', 'colname2'])
df.show()
which gives:
+--------+--------+
|colname1|colname2|
+--------+--------+
| row11| row12|
| row21| row22|
+--------+--------+
Create a new column by concatenating the two columns with an underscore separator:
df = df.withColumn('joined_column',
                   sf.concat(sf.col('colname1'), sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
| row11| row12| row11_row12|
| row21| row22| row21_row22|
+--------+--------+-------------+
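Note that concat returns NULL for the whole result if any of its input columns is NULL for that row. If you would rather have nulls skipped, concat_ws (concatenate with separator) is a common alternative; a minimal sketch using the same DataFrame, with the separator passed as the first argument:

# concat_ws takes the separator first, then the columns;
# NULL inputs are skipped instead of nulling the entire result
df = df.withColumn('joined_column',
                   sf.concat_ws('_', sf.col('colname1'), sf.col('colname2')))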
Another method:
If you are using Spark 2.3 or a later version, Spark SQL also supports the concatenation operator ||.
For example:
df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
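Since the DataFrame above was never registered as a table, here is a runnable sketch of the same idea (the view name my_table is just an illustrative placeholder):

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('my_table')  # hypothetical view name for this example
spark.sql("select colname1 || '_' || colname2 as concat_column from my_table").show()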