
I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with a simple command:

df.columns = new_column_name_list


However, the same approach doesn't work for PySpark DataFrames created using sqlContext. The only easy solution I could figure out is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
  k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This basically defines the DataFrame twice: it infers the schema first, renames the schema fields, and then loads the data again with the updated schema.

Is there a better, more efficient way to do this, as in pandas?

1 Answer


There are many ways to change DataFrame column names. Here are two examples, one using sqlContext.sql and one using alias:

data = sqlContext.createDataFrame([("amit", 2), ("prateek", 2)],
                                  ["Name", "intellipaat"])
data.show()
data.printSchema()

# Output
#+-------+-----------+
#|   Name|intellipaat|
#+-------+-----------+
#|   amit|          2|
#|prateek|          2|
#+-------+-----------+
#root
# |-- Name: string (nullable = true)
# |-- intellipaat: long (nullable = true)

The first approach uses sqlContext.sql, which lets you run SQL queries on DataFrames registered as tables:

sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, intellipaat AS age FROM myTable")
df2.show()

# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|   amit|  2|
#|prateek|  2|
#+-------+---+
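
With many columns, typing every "old AS new" pair by hand gets tedious. As a rough sketch, the SELECT statement can be built from two lists of names instead. The helper names quote_ident and build_rename_query are hypothetical (they are not part of Spark), and the sketch assumes Spark SQL's backtick quoting for identifiers:

```python
def quote_ident(name):
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_rename_query(table, old_names, new_names):
    """Build a 'SELECT old AS new, ... FROM table' string (hypothetical helper)."""
    if len(old_names) != len(new_names):
        raise ValueError("old_names and new_names must have the same length")
    select_list = ", ".join(
        f"{quote_ident(old)} AS {quote_ident(new)}"
        for old, new in zip(old_names, new_names)
    )
    return f"SELECT {select_list} FROM {quote_ident(table)}"


# Same table and column names as in the example above.
query = build_rename_query("myTable", ["Name", "intellipaat"], ["name", "age"])
print(query)
# SELECT `Name` AS `name`, `intellipaat` AS `age` FROM `myTable`
```

The resulting string could then be passed to sqlContext.sql(query), with data.columns supplying old_names and your own list supplying new_names.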

The second approach uses alias in PySpark:

from pyspark.sql.functions import col

data = data.select(col("Name").alias("name"), col("intellipaat").alias("age"))
data.show()

# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|   amit|  2|
#|prateek|  2|
#+-------+---+
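
For the case in the question, where a whole list of new names replaces the existing ones, DataFrame.toDF is probably the closest match to the pandas one-liner, and DataFrame.withColumnRenamed covers renaming only a few columns. A minimal sketch, assuming df is a PySpark DataFrame; rename_all and rename_some are hypothetical wrapper names, not part of Spark:

```python
def rename_all(df, new_names):
    """Rename every column by position, like pandas' df.columns = [...]."""
    new_names = list(new_names)
    if len(new_names) != len(df.columns):
        raise ValueError(
            f"expected {len(df.columns)} names, got {len(new_names)}"
        )
    # toDF returns a new DataFrame; the original is left unchanged.
    return df.toDF(*new_names)


def rename_some(df, mapping):
    """Rename only the columns listed in mapping, given as {old_name: new_name}."""
    for old, new in mapping.items():
        # withColumnRenamed does nothing if `old` is not a column of df.
        df = df.withColumnRenamed(old, new)
    return df
```

With the question's data this would be df = rename_all(df, new_column_name_list), which avoids loading the file a second time. To rename a single column of the example DataFrame, rename_some(data, {"intellipaat": "age"}) would do.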
