in Big Data Hadoop & Spark by (11.4k points)

I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with a single command:

df.columns = new_column_name_list


However, the same doesn't work for PySpark dataframes created using sqlContext. The only solution I could find is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
  k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This effectively loads the data twice: it infers the schema first, renames the fields in that schema, and then loads the dataframe again with the updated schema.

Is there a better and more efficient way to do this, like in pandas?

1 Answer

There are many ways to change DataFrame column names. Here is an example using sqlContext.sql and alias:

data = sqlContext.createDataFrame([("amit", 2), ("prateek", 2)],
                                  ["Name", "intellipaat"])
data.show()
data.printSchema()

# Output
#+-------+-----------+
#|   Name|intellipaat|
#+-------+-----------+
#|   amit|          2|
#|prateek|          2|
#+-------+-----------+
#root
# |-- Name: string (nullable = true)
# |-- intellipaat: long (nullable = true)

Using sqlContext.sql, which lets you run SQL queries against DataFrames registered as tables:

sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, intellipaat AS age FROM myTable")
df2.show()

# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|   amit|  2|
#|prateek|  2|
#+-------+---+
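If the new names come from a list, as in the question, the SELECT statement does not have to be typed by hand. The following is only a sketch: build_rename_query is a hypothetical helper (not part of Spark) that pairs old and new names by position and quotes them with backticks, which Spark SQL accepts as identifier quotes. It assumes the names are trusted and contain no backticks themselves.

```python
def build_rename_query(table, old_names, new_names):
    """Build 'SELECT `old` AS `new`, ... FROM table' from two name lists.

    Hypothetical helper for illustration; names are paired by position.
    """
    old_names = list(old_names)
    new_names = list(new_names)
    if len(old_names) != len(new_names):
        raise ValueError(
            "expected %d new names, got %d" % (len(old_names), len(new_names))
        )
    # Quote each identifier with backticks so names with spaces still parse.
    select_list = ", ".join(
        "`{}` AS `{}`".format(old, new) for old, new in zip(old_names, new_names)
    )
    return "SELECT {} FROM {}".format(select_list, table)


query = build_rename_query("myTable", ["Name", "intellipaat"], ["name", "age"])
print(query)
```

With the table registered as above, the result could then be passed to sqlContext.sql(query); in practice old_names would usually be data.columns.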

Using alias in PySpark:

from pyspark.sql.functions import col

data = data.select(col("Name").alias("name"), col("intellipaat").alias("age"))
data.show()

# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|   amit|  2|
#|prateek|  2|
#+-------+---+
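For the exact case in the question, renaming every column from a list as with df.columns = new_column_name_list in pandas, the closest equivalents I know of are DataFrame.toDF, which takes the full list of new names, and DataFrame.withColumnRenamed, which renames one column at a time. Here is a minimal sketch; rename_all and rename_some are hypothetical wrapper names, and df is assumed to be a PySpark DataFrame.

```python
def rename_all(df, new_names):
    """Rename every column by position, like df.columns = [...] in pandas."""
    new_names = list(new_names)
    if len(new_names) != len(df.columns):
        raise ValueError(
            "expected %d names, got %d" % (len(df.columns), len(new_names))
        )
    # toDF returns a new DataFrame with the given column names.
    return df.toDF(*new_names)


def rename_some(df, mapping):
    """Rename only the columns listed in mapping (old name -> new name)."""
    for old, new in mapping.items():
        # withColumnRenamed is a no-op if the old name is not present.
        df = df.withColumnRenamed(old, new)
    return df
```

Usage would look like data = rename_all(data, ["name", "age"]) or data = rename_some(data, {"intellipaat": "age"}). Neither call reloads the file, so the double read from the question is avoided.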
