Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (17.6k points)

I have a dataframe in Pandas with collected data;

import pandas as pd

df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})

+-------+----------+-----+

| Group | Subgroup | Obs |

+-------+----------+-----+

| A     | Blue     |   1 |

| A     | Blue     |   2 |

| A     | Blue     |   4 |

| A     | Red      |   1 |

| A     | Red      |   2 |

| A     | Red      |   3 |

| A     | Red      |   4 |

| B     | Blue     |   1 |

| B     | Blue     |   2 |

| B     | Blue     |   3 |

| B     | Blue     |   6 |

| B     | Red      |   1 |

| B     | Red      |   2 |

| B     | Red      |   3 |

+-------+----------+-----+

The Observations ('Obs') are supposed to be numbered without gaps, but you can see we have 'missed' Blue 3 in group A and Blue 4 and 5 in group B. The desired outcome is a percentage of all 'missed' Observations ('Obs') per group, so in the example:

+-------+--------------------+--------+--------+

| Group | Total Observations | Missed |   %    |

+-------+--------------------+--------+--------+

| A     |                  8 |      1 | 12.5%  |

| B     |                  9 |      2 | 22.22% |

+-------+--------------------+--------+--------+

I tried both with for loops and by using groups (for example:

df.groupby(['Group','Subgroup']).sum()

print(groups.head)

) but I can't seem to get that to work in any way I try. Am I going about this the wrong way?

From another answer (big shoutout to @Lie Ryan) I found a function to look for missing elements, however I don't quite understand how to implement this yet;

def window(seq, n=2):

    "Returns a sliding window (of width n) over data from the iterable"

    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "

    it = iter(seq)

    result = tuple(islice(it, n))

    if len(result) == n:

        yield result

    for elem in it:

        result = result[1:] + (elem,)

        yield result

def missing_elements(L):

    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)

    return list(missing)

Can anyone give me a pointer is the right direction?

1 Answer

0 votes
by (41.4k points)

Here, you need to use groupby():

1.Now, to figure out how many observations are missing per Group and SubGroup use groupby + diff.

2.After that, Group df on Group, and then compute the size and sum of the column that is  calculated in the previous step.

Then,following some steps and calculating the %, you will get the desired output.

f = [   # declare an aggfunc list in advance, we'll need it later

      ('Total Observations', 'size'), 

      ('Missed', 'sum')

]

g = df.groupby(['Group', 'Subgroup'])\

      .Obs.diff()\

      .sub(1)\

      .groupby(df.Group)\

      .agg(f)

g['Total Observations'] += g['Missed']

g['%'] = g['Missed'] / g['Total Observations'] * 100 

g

       Total Observations  Missed          %

Group                                       

A                     8.0     1.0  12.500000

B                     9.0     2.0  22.222222

There can be another approach using groupby, apply and assign that will give you the desired output:

(

    df.groupby(['Group','Subgroup']).Obs

    .apply(lambda x: [x.max()-x.min()+1, x.max()-x.min()+1-len(x)])

    .apply(pd.Series)

    .groupby(level=0).sum()

    .assign(pct=lambda x: x[1]/x[0]*100)

    .set_axis(['Total Observations', 'Missed', '%'], axis=1, inplace=False)

)

Out[75]: 

       Total Observations  Missed          %

Group                                       

A                       8       1  12.500000

B                       9       2  22.222222

If you wish to Learn more about Pandas Dataframe visit this Pandas Tutorial.

Related questions

...