Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

What is the correct way to filter data frame by timestamp field?

I have tried different date formats and forms of filtering, nothing helps: either pyspark returns 0 objects, or throws an error that it doesn't understand datetime format

Here is what i got so far:

from pyspark import SparkContext
from pyspark.sql import SQLContext

from django.utils import timezone
from django.conf import settings

from myapp.models import Collection

sc = SparkContext("local", "DjangoApp")
sqlc = SQLContext(sc)
url = "jdbc:postgresql://%(HOST)s/%(NAME)s?user=%(USER)s&password=%(PASSWORD)s" % settings.DATABASES['default']
sf = sqlc.load(source="jdbc", url=url, dbtable='myapp_collection')

 

range for timestamp field:

system_tz = timezone.pytz.timezone(settings.TIME_ZONE)
date_from = datetime.datetime(2014, 4, 16, 18, 30, 0, 0, tzinfo=system_tz)
date_to = datetime.datetime(2015, 6, 15, 18, 11, 59, 999999, tzinfo=system_tz)

1 Answer

0 votes
by (32.3k points)

Let’s assume you have the following data frame:

sf = sqlContext.createDataFrame([

    [datetime.datetime(2013, 6, 29, 11, 34, 29)],

    [datetime.datetime(2015, 7, 14, 11, 34, 27)],

    [datetime.datetime(2012, 3, 10, 19, 00, 11)],

    [datetime.datetime(2016, 2, 8, 12, 21)],

    [datetime.datetime(2014, 4, 4, 11, 28, 29)]

], ('my_col', ))

with schema:

root

 |-- my_col: timestamp (nullable = true)

and you want to find dates in the following range:

import datetime, time 

dates = ("2013-01-01 00:00:00",  "2015-07-01 00:00:00")

timestamps = (

    time.mktime(datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timetuple())

    for s in dates)

It is possible to query using timestamps either computed on a driver side:

q1 = "CAST(my_col AS INT) BETWEEN {0} AND {1}".format(*timestamps)

sf.where(q1).show()

or using unix_timestamp function:

q2 = """CAST(my_col AS INT)

        BETWEEN unix_timestamp('{0}', 'yyyy-MM-dd HH:mm:ss')

        AND unix_timestamp('{1}', 'yyyy-MM-dd HH:mm:ss')""".format(*dates)

sf.where(q2).show()

Browse Categories

...