
I have a time-series that is not recognized as a DatetimeIndex despite being indexed by standard YYYY-MM-DD strings with valid dates. Coercing them to a valid DatetimeIndex seems to be inelegant enough to make me think I'm doing something wrong.

I read in (someone else's lazily formatted) data that contains invalid datetime values and remove these invalid observations.

In [1]: df = pd.read_csv('data.csv', index_col=0)

In [2]: print(df['2008-02-27':'2008-03-02'])

Out[2]: 
            count
2008-02-27     20
2008-02-28      0
2008-02-29     27
2008-02-30      0
2008-02-31      0
2008-03-01      0
2008-03-02     17

In [3]: import datetime

def clean_timestamps(df):
    # remove invalid dates like '2008-02-30' and '2009-04-31'
    to_drop = []
    for d in df.index:
        try:
            # datetime.date raises ValueError for impossible calendar dates
            datetime.date(int(d[0:4]), int(d[5:7]), int(d[8:10]))
        except ValueError:
            to_drop.append(d)
    df2 = df.drop(to_drop, axis=0)
    return df2

In [4]: df2 = clean_timestamps(df)

In [5]: print(df2['2008-02-27':'2008-03-02'])

Out[5]:
            count
2008-02-27     20
2008-02-28      0
2008-02-29     27
2008-03-01      0
2008-03-02     17

This new index is still only recognized as an 'object' dtype rather than a DatetimeIndex.

In [6]: df2.index

Out[6]: Index([2008-01-01, 2008-01-02, 2008-01-03, ..., 2012-11-27, 2012-11-28,
   2012-11-29], dtype=object)

Reindexing produces NaNs because they're different dtypes:

In [7]: i = pd.date_range(start=min(df2.index), end=max(df2.index))

In [8]: df3 = df2.reindex(index=i, columns=['count'])

In [9]: df3['2008-02-27':'2008-03-02']

Out[9]: 
            count
2008-02-27    NaN
2008-02-28    NaN
2008-02-29    NaN
2008-03-01    NaN
2008-03-02    NaN
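
The mismatch is easy to reproduce in isolation. Below is a minimal sketch, assuming a small made-up frame (the values and the names `mismatched` and `i` here are for illustration only, not the real data): the string labels never compare equal to the Timestamp labels produced by `pd.date_range`, so no row lines up.

```python
import pandas as pd

# Hypothetical three-row frame whose index holds date-like *strings*.
df2 = pd.DataFrame(
    {"count": [20, 0, 27]},
    index=["2008-02-27", "2008-02-28", "2008-02-29"],
)

# date_range builds a DatetimeIndex of Timestamp objects.
i = pd.date_range(start="2008-02-27", end="2008-02-29")

# Reindexing aligns on label equality; a string is never equal to a
# Timestamp, so every row of the result is missing.
mismatched = df2.reindex(index=i, columns=["count"])
print(mismatched)
```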

I create a fresh dataframe with the appropriate index, drop the data to a dictionary, then populate the new dataframe based on the dictionary values (skipping missing values).

In [10]: df3 = pd.DataFrame(columns=['count'], index=i)

In [11]: values = dict(df2['count'])

In [12]: for d in i:
    try:
        df3.set_value(index=d, col='count', value=values[d.isoformat()[0:10]])
    except KeyError:
        pass

In [13]: print(df3['2008-02-27':'2008-03-02'])

Out[13]: 
            count
2008-02-27     20
2008-02-28      0
2008-02-29     27
2008-03-01      0
2008-03-02     17

In [14]: df3.index

Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2008-01-01 00:00:00, ..., 2012-11-29 00:00:00]
Length: 1795, Freq: D, Timezone: None

This last part of setting values based on lookups to a dictionary keyed by strings seems especially hacky and makes me think I've missed something important.

1 Answer


You could use pd.to_datetime:

In [1]: import pandas as pd

In [2]: pd.to_datetime('2008-02-27')

Out[2]: datetime.datetime(2008, 2, 27, 0, 0)

This lets you "clean" the index (or, similarly, a column) by passing the whole thing to pd.to_datetime and assigning the result back:

df.index = pd.to_datetime(df.index)
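
One caveat: in current pandas versions the call above raises an error if impossible dates such as '2008-02-30' are still in the index. A minimal sketch of one way to handle that, assuming a small made-up frame that mirrors the question's data (the names `parsed`, `valid` and `clean` are illustrative): `errors="coerce"` turns unparseable labels into NaT, which can then be filtered out.

```python
import pandas as pd

# Hypothetical sample mirroring the question, including two impossible dates.
df = pd.DataFrame(
    {"count": [20, 0, 27, 0, 0, 0, 17]},
    index=["2008-02-27", "2008-02-28", "2008-02-29",
           "2008-02-30", "2008-02-31", "2008-03-01", "2008-03-02"],
)

# errors="coerce" yields NaT for strings that are not real calendar dates
# instead of raising an exception.
parsed = pd.to_datetime(df.index, format="%Y-%m-%d", errors="coerce")

# Keep only the rows whose label parsed, then install the parsed labels.
valid = parsed.notna()
clean = df[valid].copy()
clean.index = parsed[valid]

print(clean)
print(type(clean.index))
```

This replaces both the hand-written `clean_timestamps` loop and the dictionary lookup. If a gap-free daily index is also wanted, calling `clean.asfreq('D')` afterwards is one option.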
