
Suppose I'm doing something like:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()

root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)

df.show()
year make  model comment              blank
2012 Tesla S     No comment               
1997 Ford  E350  Go get one now th...  

But I really want the year as an Int (and perhaps to transform some other columns as well).

Any suggestions?

1 Answer


For Spark 1.4+:

Apply the cast method with a DataType on the column:

import org.apache.spark.sql.types.IntegerType

val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
  .drop("year")
  .withColumnRenamed("yearTmp", "year")

If you are using SQL expressions you can also do:

val df2 = df.selectExpr("cast(year as int) year",
                        "make",
                        "model",
                        "comment",
                        "blank")
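Equivalently, you can register the DataFrame as a temporary table and do the cast in plain SQL (a sketch using the Spark 1.x SQLContext API; the table name "cars" is illustrative):

```scala
// Register the DataFrame as a temporary table (Spark 1.x API)
df.registerTempTable("cars")

// Cast year to INT in SQL; the other columns pass through unchanged
val df2 = sqlContext.sql(
  "SELECT CAST(year AS INT) AS year, make, model, comment, blank FROM cars")
```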

If you need a reusable helper method, note that `type` is a reserved word in Scala and cannot be used as a parameter name, so rename it:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType

object DFHelper {
  def castColumnTo(df: DataFrame, cn: String, tpe: DataType): DataFrame = {
    df.withColumn(cn, df(cn).cast(tpe))
  }
}

which is used like:

import DFHelper._

val df2 = castColumnTo(df, "year", IntegerType)
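Since the question also mentions transforming other columns, the same idea extends to several columns at once by folding the casts over the DataFrame (a sketch; the column/type pairs here are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, IntegerType}

// Cast several columns in one pass by folding (name -> type) pairs over df
def castColumns(df: DataFrame, casts: Map[String, DataType]): DataFrame =
  casts.foldLeft(df) { case (acc, (cn, tpe)) =>
    acc.withColumn(cn, acc(cn).cast(tpe))
  }

val df2 = castColumns(df, Map("year" -> IntegerType))
```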

