Cumulative Sum 2s back using range dataset

Question

asked Apr 9, 2020 in Data Science by blackindya (18.4k points)

I am working on jupyter notebook were I am using 2 data frames.

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
Date A
0 2014-11-21 11:00:00 1
1 2014-11-21 11:00:03 2
2 2014-11-21 11:00:04 5
3 2014-11-21 11:00:05 3
4 2014-11-21 11:00:07 9
5 2014-11-21 11:00:08 6
6 2014-11-21 11:00:10 3
7 2014-11-21 11:00:11 0
8 2014-10-24 10:00:55 8
9 2014-10-24 10:00:59 10

The 2nd data frame contains the Date and time range as shown below:

info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:08:00', '2014-10-24 10:55:00'], "Stop": ['2014-11-21 11:07:00', '2014-11-21 11:11:00', '2014-10-24 10:59:00']})
Start Stop
0 2014-11-21 11:00:00 2014-11-21 11:00:07
1 2014-11-21 11:00:08 2014-11-21 11:00:11
2 2014-10-24 10:00:55 2014-10-24 10:00:59

I am trying to get the cumulated sum of the data frame using the second window. It should only be done if and only if the row in the data frame is present in the range of that particular row. For example, the cumulated sum of the row with date 2014-11-21 11:00:08 should be 0 because its the beginning of the range. The cumulative for date 2014-11-21 11:00:07 should be 12(9+3).

I have given the code below for getting the result but I am not getting expected result:

import pandas as pd
import numpy as np
df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})
info['groupnum']=info.index
info.Start=pd.to_datetime(info.Start)
info.Stop=pd.to_datetime(info.Stop)
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']
df['groupnum']=pd.to_datetime(df.Date).map(cinfo)
df['cum'] = df.groupby('groupnum').A.cumsum()
print(df)

output:

Date A group num cum
0 2014-11-21 11:00:00 1 0 1
1 2014-11-21 11:00:03 2 0 3
2 2014-11-21 11:00:04 5 0 8
3 2014-11-21 11:00:05 3 0 11
4 2014-11-21 11:00:07 9 0 20
5 2014-11-21 11:00:08 6 1 6
6 2014-11-21 11:00:10 3 1 9
7 2014-11-21 11:00:11 0 1 9
8 2014-10-24 10:00:55 8 2 8
9 2014-10-24 10:00:59 10 2 18

But i wanted the result in way:

Date A group num cum
0 2014-11-21 11:00:00 1 0 1
1 2014-11-21 11:00:03 2 0 2
2 2014-11-21 11:00:04 5 0 7
3 2014-11-21 11:00:05 3 0 10
4 2014-11-21 11:00:07 9 0 12
5 2014-11-21 11:00:08 6 1 6
6 2014-11-21 11:00:10 3 1 9
7 2014-11-21 11:00:11 0 1 3
8 2014-10-24 10:00:55 8 2 8
9 2014-10-24 10:00:59 10 2 10

1 Answer

supriya · Answer 1 · 2020-04-09T08:33:49+0000

The code may not work well for a huge amount of data, but works well for a small amount of data

To create the group num column we can use the numpy.ufunc.outer with the attributes greater than equal to and less than equal to. Check the info for each start and stop of the data frame and get the value where it is giving TRUE. For that row, you can use the groupby().

# create an boolean array to find in which range each row is
arr_bool = ( np.greater_equal.outer(df.Date.to_numpy(), info.Start.to_numpy())
& np.less_equal.outer(df.Date.to_numpy(), info.Stop.to_numpy()))
# use argmax to find the position of the first True row-wise
df['groupnum'] = arr_bool.argmax(axis=1)
# select only rows within ranges, use set_index for later rolling and index alignment
df = df.loc[arr_bool.any(axis=1), :].set_index('Date')
# groupby groupnum, do the sum for a closed interval of 2s
df['cum'] = df.groupby('groupnum').rolling('2s', closed = 'both').A.sum()\
.reset_index(level=0, drop=True) # for index alignment
df = df.reset_index() # get back date as a column
print (df)
Date A groupnum cum
0 2014-11-21 11:00:00 1 0 1.0
1 2014-11-21 11:00:03 2 0 2.0
2 2014-11-21 11:00:04 5 0 7.0
3 2014-11-21 11:00:05 3 0 10.0
4 2014-11-21 11:00:07 9 0 12.0
5 2014-11-21 11:00:08 6 1 6.0
6 2014-11-21 11:00:10 3 1 9.0
7 2014-11-21 11:00:11 0 1 3.0
8 2014-10-24 10:00:55 8 2 8.0
9 2014-10-24 10:00:59 10 2 10.0

If you fail to do this because array bool is failing then you try to iterate the info rows individually by using the code below:

# get once an array of all dates (should be faster)
arr_date = df.Date.to_numpy()
# create groups by sum
df['groupnum'] = np.sum([i* (np.greater_equal(arr_date, start)&np.less_equal(arr_date, stop))
for i, (start, stop) in enumerate(zip(info.Start.to_numpy(), info.Stop.to_numpy()), 1)], axis=0) - 1
# remove the rows that are not in any range
df = df.loc[df['groupnum'].ge(0), :].set_index('Date')
# then same for the column cum
df['cum] = ...

Learn Python for Data Science Course to improve your technical knowledge.

Cumulative Sum 2s back using range dataset

Cumulative Sum 2s back using range dataset

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Related questions

Browse Categories

Popular Courses

Top Tutorials

Top Articles

Top Interview Questions