1 view

I'm successfully using the groupby() function to compute statistics on grouped data, however, I'd now like to do the same for subsets of each group.

I can't seem to understand how to generate a subset for each group (as a groupby object) that can then be applied to a groupby function such as mean(). The following line works as intended:

d.groupby(['X','Y'])['Value'].mean()

How can I subset the values of the individual groups to then supply to the mean function? I suspect transform() or filter() might be useful though I can't figure out how.

random.seed(881)

value = np.random.randn(15)

letter = np.random.choice(['a','b','c'],15)

date = np.repeat(pd.date_range(start = '1/1/2001', periods=3), 5)

data = {'date':date,'letter':letter,'value':value}

df = pd.DataFrame(data)

df.groupby(['date','letter'])['value'].mean()

date        letter

2001-01-01   a        -0.039407

b        -0.350787

c         1.221200

2001-01-02   a        -0.688744

b         0.346961

c        -0.702222

2001-01-03   a         1.320947

b        -0.915636

c        -0.419655

Name: value, dtype: float64

Here's an example of calculating the mean of the multi-level group. Now I'd like to find the mean of a subset of each group. For example, the mean of each groups data that is < the groups 10th percentile. The key take away being that the subsets must be performed on the groups and not the entire df first.

by (41.4k points)

Use quantile() function that you can add to groupby().apply() statement.

Now, if you want to get the 10th percentile use quantile(.1)

df.groupby(['date','letter'])['value'].apply(lambda g: g[g <= g.quantile(.1)].mean())

Gain practical exposure with data science projects in Intellipaat's Data Science course online.