I have a time series whose index is not recognized as a DatetimeIndex, even though it is indexed by standard YYYY-MM-DD strings with valid dates. Coercing it to a valid DatetimeIndex seems inelegant enough to make me think I'm doing something wrong.

I read in (someone else's lazily formatted) data that contains invalid datetime values and remove these invalid observations.

In [1]: import datetime
   ...: import pandas as pd
   ...: df = pd.read_csv('data.csv', index_col=0)

In [2]: print(df['2008-02-27':'2008-03-02'])
Out[2]:
             count
2008-02-27  20
2008-02-28   0
2008-02-29  27
2008-02-30   0
2008-02-31   0
2008-03-01   0
2008-03-02  17

In [3]: def clean_timestamps(df):
    # remove invalid dates like '2008-02-30' and '2009-04-31'
    to_drop = list()
    for d in df.index:
        try:
            datetime.date(int(d[0:4]), int(d[5:7]), int(d[8:10]))
        except ValueError:
            to_drop.append(d)
    df2 = df.drop(to_drop, axis=0)
    return df2

In [4]: df2 = clean_timestamps(df)

In [5]: print(df2['2008-02-27':'2008-03-02'])
Out[5]:
             count
2008-02-27  20
2008-02-28   0
2008-02-29  27
2008-03-01   0
2008-03-02  17

This new index is still only recognized as an 'object' dtype rather than a DatetimeIndex.

In [6]: df2.index
Out[6]: Index([2008-01-01, 2008-01-02, 2008-01-03, ..., 2012-11-27, 2012-11-28,
   2012-11-29], dtype=object)

Reindexing against a date range produces NaNs because the string labels in the old index never match the Timestamp labels in the new one:

In [7]: i = pd.date_range(start=min(df2.index), end=max(df2.index))

In [8]: df3 = df2.reindex(index=i, columns=['count'])

In [9]: df3['2008-02-27':'2008-03-02']
Out[9]:
            count
2008-02-27 NaN
2008-02-28 NaN
2008-02-29 NaN
2008-03-01 NaN
2008-03-02 NaN
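The mismatch can be reproduced in isolation. The following is a minimal sketch, assuming a made-up three-row frame in place of the real data; the names `all_missing` and `found` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame: the index holds plain strings.
df2 = pd.DataFrame(
    {"count": [20, 0, 27]},
    index=["2008-02-27", "2008-02-28", "2008-02-29"],
)

# date_range parses the string endpoints and returns a DatetimeIndex.
i = pd.date_range(start=min(df2.index), end=max(df2.index))

# reindex looks up each Timestamp label in the string index; none is found,
# so every row of the result is filled with NaN.
df3 = df2.reindex(index=i)
all_missing = bool(df3["count"].isna().all())

# The same lookup, done label by label.
found = [ts in df2.index for ts in i]
```

Under these assumptions `all_missing` is true and `found` contains only `False`, which is the behaviour shown in `Out[9]`.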

I create a fresh dataframe with the appropriate index, dump the data into a dictionary, then populate the new dataframe from the dictionary values (skipping missing values).

In [10]: df3 = pd.DataFrame(columns=['count'], index=i)

In [11]: values = dict(df2['count'])

In [12]: for d in i:
    try:
        # .at replaces DataFrame.set_value, which was removed in pandas 1.0
        df3.at[d, 'count'] = values[d.isoformat()[0:10]]
    except KeyError:
        pass

In [13]: print(df3['2008-02-27':'2008-03-02'])
Out[13]:
             count
2008-02-27  20
2008-02-28   0
2008-02-29  27
2008-03-01   0
2008-03-02  17

In [14]: df3.index
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2008-01-01 00:00:00, ..., 2012-11-29 00:00:00]
Length: 1795, Freq: D, Timezone: None

This last part of setting values based on lookups to a dictionary keyed by strings seems especially hacky and makes me think I've missed something important.

1 Answer

You could use pd.to_datetime:

In [1]: import pandas as pd

In [2]: pd.to_datetime('2008-02-27')
Out[2]: Timestamp('2008-02-27 00:00:00')

This lets you "clean" the index (or, similarly, a column) by applying it to the whole Index or Series at once:

df.index = pd.to_datetime(df.index)
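By default `pd.to_datetime` raises an error on labels that are not real dates, such as '2008-02-30', so the raw index from the question would need either the manual cleaning step or `errors='coerce'`. The following is a minimal sketch of the second option, assuming a small made-up frame in place of `data.csv`; the names `parsed`, `valid`, `clean` and `daily` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the raw CSV: string labels, two impossible dates.
df = pd.DataFrame(
    {"count": [20, 0, 27, 0, 0, 0, 17]},
    index=["2008-02-27", "2008-02-28", "2008-02-29", "2008-02-30",
           "2008-02-31", "2008-03-01", "2008-03-02"],
)

# errors="coerce" turns unparseable labels into NaT instead of raising.
parsed = pd.to_datetime(df.index, format="%Y-%m-%d", errors="coerce")

# Keep only the rows whose label parsed, then install the DatetimeIndex.
valid = parsed.notna()
clean = df[valid].copy()
clean.index = parsed[valid]

# Optional: reindex onto a gap-free daily range; absent days would become NaN.
daily = clean.reindex(pd.date_range(clean.index.min(), clean.index.max()))
```

With a DatetimeIndex in place, a label slice such as `clean.loc['2008-02-27':'2008-03-02']` selects by date, and the reindex no longer wipes out the values because both sides hold Timestamps.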
