Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (17.6k points)

I'm successfully using the groupby() function to compute statistics on grouped data, however, I'd now like to do the same for subsets of each group.

I can't seem to understand how to generate a subset for each group (as a groupby object) that can then be applied to a groupby function such as mean(). The following line works as intended:

d.groupby(['X','Y'])['Value'].mean()

How can I subset the values of the individual groups to then supply to the mean function? I suspect transform() or filter() might be useful though I can't figure out how.

EDIT to add reproducible example:

random.seed(881)

value = np.random.randn(15)

letter = np.random.choice(['a','b','c'],15)

date = np.repeat(pd.date_range(start = '1/1/2001', periods=3), 5)

data = {'date':date,'letter':letter,'value':value}

df = pd.DataFrame(data)

df.groupby(['date','letter'])['value'].mean()

 date        letter

2001-01-01   a        -0.039407

             b        -0.350787

             c         1.221200

2001-01-02   a        -0.688744

             b         0.346961

             c        -0.702222

2001-01-03   a         1.320947

             b        -0.915636

             c        -0.419655

Name: value, dtype: float64

Here's an example of calculating the mean of the multi-level group. Now I'd like to find the mean of a subset of each group. For example, the mean of each groups data that is < the groups 10th percentile. The key take away being that the subsets must be performed on the groups and not the entire df first.

1 Answer

0 votes
by (41.4k points)

Use quantile() function that you can add to groupby().apply() statement.

 Now, if you want to get the 10th percentile use quantile(.1)

df.groupby(['date','letter'])['value'].apply(lambda g: g[g <= g.quantile(.1)].mean())

Gain practical exposure with data science projects in Intellipaat's Data Science course online.

Browse Categories

...