0 votes
1 view
in Data Science by (17.6k points)

Given a dataset with Date column and Value column, I need to come up with the best solution of segmenting the data by date based on trends in the Value column. My output should be a CSV filewith the columns: StartDate, EndDate,StartValue,EndValue. Start and End date define the bounds of the segment. A short example is presented: input data:

 **Date**        **Value**

  01/01/2014        10

  01/02/2014        5

  01/03/2014        5

  01/04/2014        0

output:

 **StartDate**   **EndDate**   **StartValue**   **EndValue**

   01/01/2014      01/15/2014        10              5

   01/16/2014      02/03/2014         5              5

   02/04/2014      03/10/2014         5              4

1 Answer

0 votes
by (38.2k points)
edited by

Here is an approach using pandas.DataFrame.shift  which will give you the desired output.Follow the below steps along with code: 

1.Create a dataframe with some data:

import pandas as pd

datelist = pd.date_range('1/1/2019', periods=100).tolist()

values = np.random.randint(1, 5, 100)

df = pd.DataFrame({'Date': datelist, 'Value': values})

df = df.set_index('Date')

df.head(10)

Date        Value

2019-01-01  1

2019-01-02  4

2019-01-03  2

2019-01-04  2

2019-01-05  2

2019-01-06  3

2019-01-07  2

2019-01-08  2

2019-01-09  3

2019-01-10  2

2.Drop the  duplicate rows:

df = df.loc[df.Value.shift() != df.Value]

Date        Value

2019-01-01  2

2019-01-02  1

2019-01-04  2

2019-01-05  3

2019-01-06  1

3.Reset the index:

df = df.reset_index()

4.Rename the existing columns to be the start columns.

df.columns = ['Start_Date', 'Start_Value']

5.Create end columns by shifting the start columns back one row.

df['End_Date'] = df.Start_Date.shift(-1)

df['End_Value'] = df.Start_Value.shift(-1)

6.Drop the NaNs.

df = df.dropna()

Set the End_Value type to int (if preferred).

df['End_Value'] = df['End_Value'].astype(int)

df.head(10)

    Start_Date  Start_Value End_Date    End_Value

0   2019-01-01  1           2019-01-02  4

1   2019-01-02  4           2019-01-03  2

2   2019-01-03  2           2019-01-06  3

3   2019-01-06  3           2019-01-07  2

4   2019-01-07  2           2019-01-09  3

5   2019-01-09  3           2019-01-10  2

6   2019-01-10  2           2019-01-11  1

7   2019-01-11  1           2019-01-12  2

8   2019-01-12  2           2019-01-15  1

9   2019-01-15  1           2019-01-16  4

7.Finally, create a CSV file from the dataframe:

df.to_csv('trends.csv')

If you are interested in learning Data Science then watch this video:

If you wish to learn more about how to use python for data science, then go through this data science python course by Intellipaat for more insights.

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...