Back

Explore Courses Blog Tutorials Interview Questions
0 votes
1 view
in Data Science by (17.6k points)

I have a dataframe in Pandas with collected data;

import pandas as pd

df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})

+-------+----------+-----+

| Group | Subgroup | Obs |

+-------+----------+-----+

| A     | Blue     |   1 |

| A     | Blue     |   2 |

| A     | Blue     |   4 |

| A     | Red      |   1 |

| A     | Red      |   2 |

| A     | Red      |   3 |

| A     | Red      |   4 |

| B     | Blue     |   1 |

| B     | Blue     |   2 |

| B     | Blue     |   3 |

| B     | Blue     |   6 |

| B     | Red      |   1 |

| B     | Red      |   2 |

| B     | Red      |   3 |

+-------+----------+-----+

The Observations ('Obs') are supposed to be numbered without gaps, but you can see we have 'missed' Blue 3 in group A and Blue 4 and 5 in group B. The desired outcome is a percentage of all 'missed' Observations ('Obs') per group, so in the example:

+-------+--------------------+--------+--------+

| Group | Total Observations | Missed |   %    |

+-------+--------------------+--------+--------+

| A     |                  8 |      1 | 12.5%  |

| B     |                  9 |      2 | 22.22% |

+-------+--------------------+--------+--------+

I tried both with for loops and by using groups (for example:

df.groupby(['Group','Subgroup']).sum()

print(groups.head)

) but I can't seem to get that to work in any way I try. Am I going about this the wrong way?

From another answer (big shoutout to @Lie Ryan) I found a function to look for missing elements, however I don't quite understand how to implement this yet;

def window(seq, n=2):

    "Returns a sliding window (of width n) over data from the iterable"

    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "

    it = iter(seq)

    result = tuple(islice(it, n))

    if len(result) == n:

        yield result

    for elem in it:

        result = result[1:] + (elem,)

        yield result

def missing_elements(L):

    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)

    return list(missing)

Can anyone give me a pointer is the right direction?

1 Answer

0 votes
by (41.4k points)

Here, you need to use groupby():

1.Now, to figure out how many observations are missing per Group and SubGroup use groupby + diff.

2.After that, Group df on Group, and then compute the size and sum of the column that is  calculated in the previous step.

Then,following some steps and calculating the %, you will get the desired output.

f = [   # declare an aggfunc list in advance, we'll need it later

      ('Total Observations', 'size'), 

      ('Missed', 'sum')

]

g = df.groupby(['Group', 'Subgroup'])\

      .Obs.diff()\

      .sub(1)\

      .groupby(df.Group)\

      .agg(f)

g['Total Observations'] += g['Missed']

g['%'] = g['Missed'] / g['Total Observations'] * 100 

g

       Total Observations  Missed          %

Group                                       

A                     8.0     1.0  12.500000

B                     9.0     2.0  22.222222

There can be another approach using groupby, apply and assign that will give you the desired output:

(

    df.groupby(['Group','Subgroup']).Obs

    .apply(lambda x: [x.max()-x.min()+1, x.max()-x.min()+1-len(x)])

    .apply(pd.Series)

    .groupby(level=0).sum()

    .assign(pct=lambda x: x[1]/x[0]*100)

    .set_axis(['Total Observations', 'Missed', '%'], axis=1, inplace=False)

)

Out[75]: 

       Total Observations  Missed          %

Group                                       

A                       8       1  12.500000

B                       9       2  22.222222

If you wish to Learn more about Pandas Dataframe visit this Pandas Tutorial.

Related questions

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

28.4k questions

29.7k answers

500 comments

94.2k users

Browse Categories

...