Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I have a Spark Dataframe in that consists of a series of dates:

from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
import pandas as pd

rdd = sc.parallelizesc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876','sip:4534454450'),
                                    ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321','sip:6413445440'),
                                    ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229','sip:4534437492'),
                                    ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881','sip:6474454453'),
                                    ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323','sip:8874458555')])
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)


What I want to do is find duration by subtracting EndDateTime and StartDateTime. I figured I'd try and do this using a function:

# Function to calculate time delta


def time_delta(y,x):
    end = pd.to_datetime(y)
    start = pd.to_datetime(x)
    delta = (end-start)
    return delta

# create new RDD and add new column 'Duration' by applying time_delta function


df2 = df.withColumn('Duration', time_delta(df.EndDateTime, df.StartDateTime)) 


However this just gives me:

>>> df2.show()
ID  EndDateTime          StartDateTime        ANI            Duration
X01 2014-02-13T12:36:... 2014-02-13T12:31:... sip:4534454450 null   
X02 2014-02-13T12:35:... 2014-02-13T12:32:... sip:6413445440 null   
X03 2014-02-13T12:36:... 2014-02-13T12:32:... sip:4534437492 null   
XO4 2014-02-13T12:37:... 2014-02-13T12:32:... sip:6474454453 null   
XO5 2014-02-13T12:36:... 2014-02-13T12:33:... sip:8874458555 null  


I'm not sure if my approach is correct or not. If not, I'd gladly accept another suggested way to achieve this.

1 Answer

0 votes
by (32.3k points)

Since Spark 1.5, you can use unix_timestamp:

from pyspark.sql import functions as F

timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"

timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)

            - F.unix_timestamp('StartDateTime', format=timeFmt))

df = df.withColumn("Duration", timeDiff)

Note the Java style time format.

>>> df.show()

+---+--------------------+--------------------+--------+

| ID|         EndDateTime|     StartDateTime|Duration|

+---+--------------------+--------------------+--------+

|X01|2014-02-13T12:36:...|2014-02-13T12:31:...|     258|

|X02|2014-02-13T12:35:...|2014-02-13T12:32:...|     204|

|X03|2014-02-13T12:36:...|2014-02-13T12:32:...|     228|

|XO4|2014-02-13T12:37:...|2014-02-13T12:32:...|     269|

|XO5|2014-02-13T12:36:...|2014-02-13T12:33:...|     202|

+---+--------------------+--------------------+--------+

Browse Categories

...