Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)

I have a Spark Dataframe in that consists of a series of dates:

from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
import pandas as pd

rdd = sc.parallelizesc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876','sip:4534454450'),
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

What I want to do is find duration by subtracting EndDateTime and StartDateTime. I figured I'd try and do this using a function:

# Function to calculate time delta

def time_delta(y,x):
    end = pd.to_datetime(y)
    start = pd.to_datetime(x)
    delta = (end-start)
    return delta

# create new RDD and add new column 'Duration' by applying time_delta function

df2 = df.withColumn('Duration', time_delta(df.EndDateTime, df.StartDateTime)) 

However this just gives me:

ID  EndDateTime          StartDateTime        ANI            Duration
X01 2014-02-13T12:36:... 2014-02-13T12:31:... sip:4534454450 null   
X02 2014-02-13T12:35:... 2014-02-13T12:32:... sip:6413445440 null   
X03 2014-02-13T12:36:... 2014-02-13T12:32:... sip:4534437492 null   
XO4 2014-02-13T12:37:... 2014-02-13T12:32:... sip:6474454453 null   
XO5 2014-02-13T12:36:... 2014-02-13T12:33:... sip:8874458555 null  

I'm not sure if my approach is correct or not. If not, I'd gladly accept another suggested way to achieve this.

1 Answer

0 votes
by (32.3k points)

Since Spark 1.5, you can use unix_timestamp:

from pyspark.sql import functions as F

timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"

timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)

            - F.unix_timestamp('StartDateTime', format=timeFmt))

df = df.withColumn("Duration", timeDiff)

Note the Java style time format.



| ID|         EndDateTime|     StartDateTime|Duration|


|X01|2014-02-13T12:36:...|2014-02-13T12:31:...|     258|

|X02|2014-02-13T12:35:...|2014-02-13T12:32:...|     204|

|X03|2014-02-13T12:36:...|2014-02-13T12:32:...|     228|

|XO4|2014-02-13T12:37:...|2014-02-13T12:32:...|     269|

|XO5|2014-02-13T12:36:...|2014-02-13T12:33:...|     202|


Browse Categories