0 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried

val test = myDF.select("my_column").rdd.map(r => getTimestamp(r)) 


but this returns an RDD and just with the transformed column.

1 Answer

0 votes
by (32.5k points)

In order to achieve your task I would suggest you to follow any of the two options:

1) Using map / toDF:

import org.apache.spark.sql.Row

import sqlContext.implicits._

def getTimestamp: (String => java.sql.Timestamp) = // your function here

val test = myDF.select("my_column").rdd.map {

  case Row(string_val: String) => (string_val, getTimestamp(string_val))

}.toDF("my_column", "new_column")

2) Using UDFs (UserDefinedFunction):

import org.apache.spark.sql.functions._

def getTimestamp: (String => java.sql.Timestamp) = // your function here

val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column

val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF

Alternatively,

If you just want to transform a StringType column into a TimestampType column you can use the unix_timestamp column function available since Spark SQL 1.5. Also, keep in mind that it is necessary to multiply the result of unix_timestamp by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:

val test = myDF

  .withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") *1000L).cast("timestamp"))

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...