How to change dataframe column names in pyspark?

Question

asked Jul 5, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with a single command:

df.columns = new_column_name_list

However, the same doesn't work in pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This defines the variable twice: it loads the data once to infer the schema, renames the columns in that schema, and then loads the dataframe again with the updated schema.

Is there a better and more efficient way to do this, like we do in pandas?

1 Answer

Amit Rawat · Answer 1 · 2019-07-05T14:50:14+0000

There are many ways to change dataframe column names. Here are two examples, one using sqlContext.sql and one using alias:

data = sqlContext.createDataFrame([("amit", 2), ("prateek", 2)],
                                  ["Name", "intellipaat"])
data.show()
data.printSchema()
# Output
#+-------+-----------+
#|   Name|intellipaat|
#+-------+-----------+
#|   amit|          2|
#|prateek|          2|
#+-------+-----------+
#root
# |-- Name: string (nullable = true)
# |-- intellipaat: long (nullable = true)

sqlContext.sql lets you run SQL queries on DataFrames that are registered as tables:

sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, intellipaat AS age FROM myTable")
df2.show()
# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|   amit|  2|
#|prateek|  2|
#+-------+---+
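
If there are many columns, writing the SELECT by hand gets tedious. Here is a minimal sketch that builds the same kind of query from a list of (old, new) pairs. The helper name build_rename_query is hypothetical, not part of Spark, and the backtick quoting assumes Spark SQL's identifier syntax.

```python
def build_rename_query(table_name, renames):
    """Build a SELECT statement that renames columns.

    renames is a list of (old_name, new_name) pairs; only the listed
    columns appear in the result, in the order given.
    """
    # Backticks quote identifiers so names with spaces or dots still parse.
    select_list = ", ".join(
        "`{}` AS `{}`".format(old, new) for old, new in renames
    )
    return "SELECT {} FROM `{}`".format(select_list, table_name)


query = build_rename_query("myTable", [("Name", "name"), ("intellipaat", "age")])
print(query)
# SELECT `Name` AS `name`, `intellipaat` AS `age` FROM `myTable`
```

The resulting string can be passed to sqlContext.sql(query) exactly like the handwritten query above.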

Using alias with select in PySpark:

from pyspark.sql.functions import col
data = data.select(col("Name").alias("name"), col("intellipaat").alias("age"))
data.show()
# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|   amit|  2|
#|prateek|  2|
#+-------+---+
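
For the pandas-style case in the question, where you already have the complete list of new names, the DataFrame methods toDF and withColumnRenamed are probably the closest equivalents. The sketch below wraps them in two helpers. The names rename_all and rename_some are hypothetical; only df.columns, df.toDF(*names) and df.withColumnRenamed(old, new) are PySpark calls.

```python
def rename_all(df, new_names):
    """Rename every column by position, similar to pandas' df.columns = [...]."""
    if len(new_names) != len(df.columns):
        raise ValueError(
            "expected {} names, got {}".format(len(df.columns), len(new_names))
        )
    # toDF returns a new DataFrame with the given column names.
    return df.toDF(*new_names)


def rename_some(df, renames):
    """Rename only the columns listed in the {old_name: new_name} dict."""
    for old, new in renames.items():
        # withColumnRenamed leaves the DataFrame unchanged if `old` is absent.
        df = df.withColumnRenamed(old, new)
    return df
```

With these, the code in the question would reduce to loading the file once with inferschema='true' and then calling df = rename_all(df, new_column_name_list), with no second read.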

