0 votes
1 view
in Data Science by (17.6k points)

Given a dataset with Date column and Value column, I need to come up with the best solution of segmenting the data by date based on trends in the Value column. My output should be a CSV filewith the columns: StartDate, EndDate,StartValue,EndValue. Start and End date define the bounds of the segment. A short example is presented: input data:

 **Date**        **Value**

  01/01/2014        10

  01/02/2014        5

  01/03/2014        5

  01/04/2014        0

output:

 **StartDate**   **EndDate**   **StartValue**   **EndValue**

   01/01/2014      01/15/2014        10              5

   01/16/2014      02/03/2014         5              5

   02/04/2014      03/10/2014         5              4

1 Answer

0 votes
by (30.1k points)
edited by

Here is an approach using pandas.DataFrame.shift  which will give you the desired output.Follow the below steps along with code: 

1.Create a dataframe with some data:

import pandas as pd

datelist = pd.date_range('1/1/2019', periods=100).tolist()

values = np.random.randint(1, 5, 100)

df = pd.DataFrame({'Date': datelist, 'Value': values})

df = df.set_index('Date')

df.head(10)

Date        Value

2019-01-01  1

2019-01-02  4

2019-01-03  2

2019-01-04  2

2019-01-05  2

2019-01-06  3

2019-01-07  2

2019-01-08  2

2019-01-09  3

2019-01-10  2

2.Drop the  duplicate rows:

df = df.loc[df.Value.shift() != df.Value]

Date        Value

2019-01-01  2

2019-01-02  1

2019-01-04  2

2019-01-05  3

2019-01-06  1

3.Reset the index:

df = df.reset_index()

4.Rename the existing columns to be the start columns.

df.columns = ['Start_Date', 'Start_Value']

5.Create end columns by shifting the start columns back one row.

df['End_Date'] = df.Start_Date.shift(-1)

df['End_Value'] = df.Start_Value.shift(-1)

6.Drop the NaNs.

df = df.dropna()

Set the End_Value type to int (if preferred).

df['End_Value'] = df['End_Value'].astype(int)

df.head(10)

    Start_Date  Start_Value End_Date    End_Value

0   2019-01-01  1           2019-01-02  4

1   2019-01-02  4           2019-01-03  2

2   2019-01-03  2           2019-01-06  3

3   2019-01-06  3           2019-01-07  2

4   2019-01-07  2           2019-01-09  3

5   2019-01-09  3           2019-01-10  2

6   2019-01-10  2           2019-01-11  1

7   2019-01-11  1           2019-01-12  2

8   2019-01-12  2           2019-01-15  1

9   2019-01-15  1           2019-01-16  4

7.Finally, create a CSV file from the dataframe:

df.to_csv('trends.csv')

If you are interested in learning Data Science then watch this video:

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...