
Is there a convenient way to split an array so that each section covers the same range of values, regardless of how many elements fall into it?

Say we have data in range (0, 100). Let the size of the array be 1000. The first 500 elements are all in (0, 20), 300 elements in (20, 40) and so on. I'd like to manipulate values in the subsections split by 20, 40, 60 and 80.

The data could look something like this:

1st div:  0,  0,  0, ... 17, 18

2nd div: 22, 22, 24, ... 37, 39

3rd div: 40, 41, 41, ... 55, 59

4th div: 65, 68, 73, 76, 76

5th div: 93, 96

It's very easy to split an array into equal-size sections when you know the section size. But I'm plotting a trend line using some simple averaging, and the amount of data in each section varies. I know the split points.

I could do this with np.where, using a condition like arr > border1, taking only the first matching index, combining those indices and then splitting, but that seems like a long-winded way of doing things.
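For concreteness, here is roughly what I mean (a sketch with made-up data; I use >= so a value equal to a border lands in the upper section):

```python
import numpy as np

np.random.seed(0)
arr = np.sort(np.random.randint(0, 100, 1000))
borders = [20, 40, 60, 80]

# first index at or above each border, then split there
idx = [np.where(arr >= b)[0][0] for b in borders]
sections = np.split(arr, idx)
```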

Any pointers would be greatly appreciated. I can't be the only one with this problem. Also, if another library does this kind of thing, I'd certainly be open to using it.


If the elements are sorted, you can use itertools.groupby, as in the code below:

import itertools

data = [0, 1, 5, 17, 18, 22, 27, 37, 39, 40, 41, 48, 57,
        65, 68, 72, 77, 79, 81, 85, 88, 91, 99]

for i, j in itertools.groupby(data, key=lambda x: x // 20):
    group = list(j)  # j is a lazy grouper; materialize it before reuse
    # i=0, group=[0, 1, 5, 17, 18]
    # i=1, group=[22, 27, 37, 39]
    # i=2, group=[40, 41, 48, 57]
    # ...
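Since the end goal is averaging each section for a trend line, here is a small sketch building on the same grouping (standard library only, using the data list above):

```python
import itertools
import statistics

data = [0, 1, 5, 17, 18, 22, 27, 37, 39, 40, 41, 48, 57,
        65, 68, 72, 77, 79, 81, 85, 88, 91, 99]

# one mean per 20-wide section -- the trend-line points
means = [statistics.mean(g)
         for _, g in itertools.groupby(data, key=lambda x: x // 20)]
```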

Otherwise

You can use np.searchsorted() to get the indices at which to split the array into groups, then split at those indices with np.split:

import numpy as np

np.random.seed(0)
a = np.sort(np.random.randint(0, 100, 10000))
bins = [20, 40, 60, 80]
idx = np.searchsorted(a, bins)
np.split(a, idx)

Output:

[array([ 0,  0,  0, ..., 19, 19, 19]),
 array([20, 20, 20, ..., 39, 39, 39]),
 array([40, 40, 40, ..., 59, 59, 59]),
 array([60, 60, 60, ..., 79, 79, 79]),
 array([80, 80, 80, ..., 99, 99, 99])]
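Since the question mentions averaging each section for a trend line, here is a quick sketch of that follow-on step (assumed usage, reusing the same setup):

```python
import numpy as np

np.random.seed(0)
a = np.sort(np.random.randint(0, 100, 10000))
sections = np.split(a, np.searchsorted(a, [20, 40, 60, 80]))

# one average per value range -- the points for the trend line
trend = [s.mean() for s in sections]
```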
