
I have a Spark DataFrame whose first five rows, from take(5), are:

[Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]


Its schema is defined as:

elevDF.printSchema()

root
 |-- date: timestamp (nullable = true)
 |-- hour: long (nullable = true)
 |-- value: double (nullable = true)


How do I get the Year, Month, Day values from the 'date' field?

1 Answer


For Spark >= 1.5, there are a number of built-in date-processing functions that cover this case:

pyspark.sql.functions.year

pyspark.sql.functions.month

pyspark.sql.functions.dayofmonth

pyspark.sql.functions.dayofweek (Spark >= 2.3)

pyspark.sql.functions.dayofyear

pyspark.sql.functions.weekofyear

import datetime
from pyspark.sql.functions import year, month, dayofmonth

elevDF = sc.parallelize([
    (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])

elevDF.select(
    year("date").alias('year'),
    month("date").alias('month'),
    dayofmonth("date").alias('day')
).show()

# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# +----+-----+---+

Alternatively, you can use a plain map over the underlying RDD, as with any other RDD:

import datetime
from pyspark.sql import Row

elevDF = sqlContext.createDataFrame(sc.parallelize([
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))

(elevDF.rdd
 .map(lambda row: (row.date.year, row.date.month, row.date.day))
 .collect())

and the result is:

[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]


Note: a datetime.datetime value stores the hour anyway, so keeping it in a separate column seems to be a waste of memory.
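To illustrate that note in plain Python: a single datetime already carries the time-of-day parts, so the separate hour column duplicates information that pyspark.sql.functions.hour could recover from the date column alone.

```python
from datetime import datetime

# One value holds the full date *and* the time-of-day parts.
d = datetime(1984, 1, 1, 5, 0)
parts = (d.year, d.month, d.day, d.hour)
```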



