Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (18.4k points)

I am looking for the help as to how I can extract the hour and minutes, separately. 

from a string in PySpark:

df = spark.createDataFrame([['1325'], ['1433'], ['730']], ['time'])

df = df.withColumn("time", to_timestamp("time"))  # cast timestamp

display(df)

# example timestamp results

  time

1 1325-01-01T00:00:00.000-0500

2 1433-01-01T00:00:00.000-0500

3 null

The casting it to the unixtime, date, and timestamp are all not very cooperative with the type of string data.

Ideally, I would like it to return:

  time  hour  minutes

1 1325   13     25

2 1433   14     33

3 730    7      30

1 Answer

0 votes
by (36.8k points)

IIUC, one way you can try is to split this string with the pattern (?=\d\d$), and then take the hour/minutes from the resulting array:

from pyspark.sql import functions as F

df.withColumn('hm', F.split(F.lpad('time',4,'0'), '(?=\d\d$)')) \

    .selectExpr('time', 'int(hm[0]) as hour', 'int(hm[1]) as minutes') \

    .show()

+----+----+-------+

|time|hour|minutes|

+----+----+-------+

|1325| 13| 25|

|1433| 14| 33|

| 730| 7| 30|

| 2| 0| 2|

+----+----+-------+

Master end-to-end Data Science through Data Science Online Courses 

Browse Categories

...