Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in SQL by (11.4k points)

I have a DataFrame that look something like that. I want to operate on the day of the date_time field.

root
 |-- host: string (nullable = true)
 |-- user_id: string (nullable = true)

 |-- date_time: timestamp (nullable = true)

I tried to add a column to extract the day. So far my attempts have failed.

df = df.withColumn("day", df.date_time.getField("day"))

org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type TimestampType;


This has also failed

df = df.withColumn("day", df.select("date_time").map(lambda row: row.date_time.day))

AttributeError: 'PipelinedRDD' object has no attribute 'alias'


Any idea how this can be done?

1 Answer

0 votes
by (32.3k points)

In your case, I would suggest you to simply use map function:

df.rdd.map(lambda row:

    Row(row.__fields__ + ["day"])(row + (row.date_time.day, ))

)

Another option is to register a function and run SQL query:

sqlContext.registerFunction("day", lambda x: x.day)

sqlContext.registerDataFrameAsTable(df, "df")

sqlContext.sql("SELECT *, day(date_time) as day FROM df")

Finally you can define udf like this:

from pyspark.sql.functions import udf

from pyspark.sql.types import IntegerType

day = udf(lambda date_time: date_time.day, IntegerType())

df.withColumn("day", day(df.date_time))

Browse Categories

...