Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (18.4k points)

I am working on jupyter notebook were I am using 2 data frames.

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})

                  Date   A

0  2014-11-21 11:00:00   1

1  2014-11-21 11:00:03   2

2  2014-11-21 11:00:04   5

3  2014-11-21 11:00:05   3

4  2014-11-21 11:00:07   9

5  2014-11-21 11:00:08   6

6  2014-11-21 11:00:10   3

7  2014-11-21 11:00:11   0

8  2014-10-24 10:00:55   8

9  2014-10-24 10:00:59  10

The 2nd data frame contains the Date and time range as shown below:

info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:08:00', '2014-10-24 10:55:00'], "Stop": ['2014-11-21 11:07:00', '2014-11-21 11:11:00', '2014-10-24 10:59:00']})

                 Start                 Stop

0  2014-11-21 11:00:00  2014-11-21 11:00:07

1  2014-11-21 11:00:08  2014-11-21 11:00:11

2  2014-10-24 10:00:55  2014-10-24 10:00:59

I am trying to get the cumulated sum of the data frame using the second window. It should only be done if and only if the row in the data frame is present in the range of that particular row. For example, the cumulated sum of the row with date 2014-11-21 11:00:08 should be 0 because its the beginning of the range. The cumulative for date 2014-11-21 11:00:07 should be 12(9+3).

I have given the code below for getting the result but I am not getting expected result:

import pandas as pd

import numpy as np

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})

info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})

#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})

info['groupnum']=info.index

info.Start=pd.to_datetime(info.Start)

info.Stop=pd.to_datetime(info.Stop)

cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']

df['groupnum']=pd.to_datetime(df.Date).map(cinfo)

df['cum'] = df.groupby('groupnum').A.cumsum()

print(df)

output:

               Date        A    group     num  cum

0  2014-11-21 11:00:00   1         0    1

1  2014-11-21 11:00:03   2         0    3

2  2014-11-21 11:00:04   5         0    8

3  2014-11-21 11:00:05   3         0   11

4  2014-11-21 11:00:07   9         0   20

5  2014-11-21 11:00:08   6         1    6

6  2014-11-21 11:00:10   3         1    9

7  2014-11-21 11:00:11   0         1    9

8  2014-10-24 10:00:55   8         2    8

9  2014-10-24 10:00:59  10         2   18

But i wanted the result in way:

                 Date   A      group   num  cum

0  2014-11-21 11:00:00   1         0    1

1  2014-11-21 11:00:03   2         0    2

2  2014-11-21 11:00:04   5         0    7

3  2014-11-21 11:00:05   3         0   10

4  2014-11-21 11:00:07   9         0   12

5  2014-11-21 11:00:08   6         1    6

6  2014-11-21 11:00:10   3         1    9

7  2014-11-21 11:00:11   0         1    3

8  2014-10-24 10:00:55   8         2    8

9  2014-10-24 10:00:59  10         2   10

 

1 Answer

0 votes
by (36.8k points)

The code may not work well for a huge amount of data, but works well for a small amount of data

To create the group num column we can use the numpy.ufunc.outer with the attributes greater than equal to and less than equal to. Check the info for each start and stop of the data frame and get the value where it is giving TRUE. For that row, you can use the groupby().

# create an boolean array to find in which range each row is

arr_bool = ( np.greater_equal.outer(df.Date.to_numpy(), info.Start.to_numpy())

             & np.less_equal.outer(df.Date.to_numpy(), info.Stop.to_numpy()))

# use argmax to find the position of the first True row-wise

df['groupnum'] = arr_bool.argmax(axis=1)

# select only rows within ranges, use set_index for later rolling and index alignment

df = df.loc[arr_bool.any(axis=1), :].set_index('Date')

# groupby groupnum, do the sum for a closed interval of 2s

df['cum'] = df.groupby('groupnum').rolling('2s', closed = 'both').A.sum()\

              .reset_index(level=0, drop=True) # for index alignment

df = df.reset_index() # get back date as a column

print (df)

                 Date   A  groupnum   cum

0 2014-11-21 11:00:00   1         0   1.0

1 2014-11-21 11:00:03   2         0   2.0

2 2014-11-21 11:00:04   5         0   7.0

3 2014-11-21 11:00:05   3         0  10.0

4 2014-11-21 11:00:07   9         0  12.0

5 2014-11-21 11:00:08   6         1   6.0

6 2014-11-21 11:00:10   3         1   9.0

7 2014-11-21 11:00:11   0         1   3.0

8 2014-10-24 10:00:55   8         2   8.0

9 2014-10-24 10:00:59  10         2  10.0

If you fail to do this because array bool is failing then you try to iterate the info rows individually by using the code below:

# get once an array of all dates (should be faster)

arr_date = df.Date.to_numpy()

# create groups by sum 

df['groupnum'] = np.sum([i* (np.greater_equal(arr_date, start)&np.less_equal(arr_date, stop)) 

                         for i, (start, stop) in enumerate(zip(info.Start.to_numpy(), info.Stop.to_numpy()), 1)], axis=0) - 1

# remove the rows that are not in any range

df = df.loc[df['groupnum'].ge(0), :].set_index('Date')

# then same for the column cum

df['cum] = ...

Learn Python for Data Science Course to improve your technical knowledge. 

Related questions

0 votes
1 answer
0 votes
1 answer
asked Jul 31, 2019 in Data Science by sourav (17.6k points)
0 votes
1 answer
asked Jan 21, 2021 in BI by Chris (11.1k points)

Browse Categories

...