Count missing instances in subgroups

Question

asked Jul 11, 2019 in Data Science by sourav (17.6k points)

I have a dataframe in Pandas with collected data;

import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})
+-------+----------+-----+
| Group | Subgroup | Obs |
+-------+----------+-----+
| A | Blue | 1 |
| A | Blue | 2 |
| A | Blue | 4 |
| A | Red | 1 |
| A | Red | 2 |
| A | Red | 3 |
| A | Red | 4 |
| B | Blue | 1 |
| B | Blue | 2 |
| B | Blue | 3 |
| B | Blue | 6 |
| B | Red | 1 |
| B | Red | 2 |
| B | Red | 3 |
+-------+----------+-----+

The Observations ('Obs') are supposed to be numbered without gaps, but you can see we have 'missed' Blue 3 in group A and Blue 4 and 5 in group B. The desired outcome is a percentage of all 'missed' Observations ('Obs') per group, so in the example:

+-------+--------------------+--------+--------+
| Group | Total Observations | Missed | % |
+-------+--------------------+--------+--------+
| A | 8 | 1 | 12.5% |
| B | 9 | 2 | 22.22% |
+-------+--------------------+--------+--------+

I tried both with for loops and by using groups (for example:

df.groupby(['Group','Subgroup']).sum()
print(groups.head)

) but I can't seem to get that to work in any way I try. Am I going about this the wrong way?

From another answer (big shoutout to @Lie Ryan) I found a function to look for missing elements, however I don't quite understand how to implement this yet;

def window(seq, n=2):
"Returns a sliding window (of width n) over data from the iterable"
" s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
it = iter(seq)
result = tuple(islice(it, n))
if len(result) == n:
yield result
for elem in it:
result = result[1:] + (elem,)
yield result
def missing_elements(L):
missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
return list(missing)

Can anyone give me a pointer is the right direction?

1 Answer

Shlok Pandey · Answer 1 · 2019-07-20T06:49:38+0000

Here, you need to use groupby():

1.Now, to figure out how many observations are missing per Group and SubGroup use groupby + diff.

2.After that, Group df on Group, and then compute the size and sum of the column that is calculated in the previous step.

Then,following some steps and calculating the %, you will get the desired output.

f = [ # declare an aggfunc list in advance, we'll need it later
('Total Observations', 'size'),
('Missed', 'sum')
]
g = df.groupby(['Group', 'Subgroup'])\
.Obs.diff()\
.sub(1)\
.groupby(df.Group)\
.agg(f)
g['Total Observations'] += g['Missed']
g['%'] = g['Missed'] / g['Total Observations'] * 100
g
Total Observations Missed %
Group
A 8.0 1.0 12.500000
B 9.0 2.0 22.222222

There can be another approach using groupby, apply and assign that will give you the desired output:

(
df.groupby(['Group','Subgroup']).Obs
.apply(lambda x: [x.max()-x.min()+1, x.max()-x.min()+1-len(x)])
.apply(pd.Series)
.groupby(level=0).sum()
.assign(pct=lambda x: x[1]/x[0]*100)
.set_axis(['Total Observations', 'Missed', '%'], axis=1, inplace=False)
)
Out[75]:
Total Observations Missed %
Group
A 8 1 12.500000
B 9 2 22.222222

If you wish to Learn more about Pandas Dataframe visit this Pandas Tutorial.

Count missing instances in subgroups

Please log in to add a comment.

Please log in to answer this question.

1 Answer

Please log in to add a comment.

Related questions