
I want to mark some quantiles in my data, and for each row of the DataFrame, I would like the entry in a new column called e.g. "xtile" to hold this value.

For example, suppose I create a data frame like this:

import pandas, numpy as np

dfrm = pandas.DataFrame({'A': np.random.rand(100),
                         'B': (50 + np.random.randn(100)),
                         'C': np.random.randint(low=0, high=3, size=(100,))})

And let's say I have a function that computes the quintile of each element in an array. I use my own implementation, but as an example, just refer to scipy.stats.mstats.mquantiles.

import scipy.stats as st

def mark_quintiles(x, breakpoints):
    # Assume this is filled in, using st.mstats.mquantiles.
    # This returns an array the same shape as x, with an integer for which
    # breakpoint-bucket that entry of x falls into.
    ...
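For readers who want something concrete to run, here is a minimal sketch of what such a function could look like. It is an assumption, not the asker's actual implementation: it swaps in np.quantile for st.mstats.mquantiles (the two use different interpolation defaults, so the edges can differ slightly) and uses np.searchsorted to turn the edges into bucket indices.

```python
import numpy as np

def mark_quintiles(x, breakpoints):
    """Return, for each entry of x, the index of the quantile bucket it falls into.

    Illustrative sketch only: uses np.quantile rather than
    scipy.stats.mstats.mquantiles, which the question refers to.
    """
    x = np.asarray(x, dtype=float)
    # Quantile values at the requested probabilities, e.g. [0.2, 0.4, ..., 1.0].
    edges = np.quantile(x, breakpoints)
    # side="left" puts a value equal to an edge into that edge's bucket.
    buckets = np.searchsorted(edges, x, side="left")
    # Guard against floating-point rounding pushing the maximum past the last edge.
    return np.clip(buckets, 0, len(edges) - 1)

labels = mark_quintiles(np.arange(1, 11), [0.2, 0.4, 0.6, 0.8, 1.0])
```

With the breakpoints used in the question, the result has the same shape as x and contains integers from 0 to 4.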

Now, the real question is how to use transform to add a new column to the data. Something like this:

def transformXtiles(dataFrame, inputColumnName, newColumnName, breaks):
    dataFrame[newColumnName] = mark_quintiles(dataFrame[inputColumnName].values,
                                              breaks)
    return dataFrame

And then:

dfrm.groupby("C").transform(lambda x: transformXtiles(x, "A", "A_xtile", [0.2, 0.4, 0.6, 0.8, 1.0]))

The problem is that the above code does not add the new column "A_xtile"; it just returns my data frame unchanged. If I first add a column called "A_xtile" filled with dummy values such as NaN, then it does successfully overwrite that column with the correct quintile markings.

But it is extremely inconvenient to have to first write in the column for anything like this that I may want to add on the fly.

Note that a simple "apply" will not work here, since it won't know how to make sense of the possibly differently-sized result arrays for each group.

1 Answer


A plain "apply" does work for this: in the toy example below, the function adds a new column inside each group, and the groups have different lengths:

In [82]: df
Out[82]:
   X         Y
0  0 -0.631214
1  0  0.783142
2  0  0.526045
3  1 -1.750058
4  1  1.163868
5  1  1.625538
6  1  0.076105
7  2  0.183492
8  2  0.541400
9  2 -0.672809

In [83]: def func(x):
   ....:     x['NewCol'] = np.nan
   ....:     return x
   ....:

In [84]: df.groupby('X').apply(func)
Out[84]:
   X         Y  NewCol
0  0 -0.631214     NaN
1  0  0.783142     NaN
2  0  0.526045     NaN
3  1 -1.750058     NaN
4  1  1.163868     NaN
5  1  1.625538     NaN
6  1  0.076105     NaN
7  2  0.183492     NaN
8  2  0.541400     NaN
9  2 -0.672809     NaN
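Applied to the data frame from the question, another option is to select the single input column, transform it per group, and assign the result as a new column. The following is a sketch under stated assumptions: bucket_within_group is a hypothetical stand-in for the asker's mark_quintiles (built on np.quantile, not mquantiles), and the random data is seeded only to make the example repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dfrm = pd.DataFrame({
    "A": rng.random(100),
    "B": 50 + rng.standard_normal(100),
    "C": rng.integers(low=0, high=3, size=100),
})

def bucket_within_group(s, breaks):
    # Hypothetical helper standing in for mark_quintiles from the question.
    values = s.to_numpy()
    edges = np.quantile(values, breaks)
    codes = np.searchsorted(edges, values, side="left")
    codes = np.clip(codes, 0, len(edges) - 1)
    # Keep the group's index so the result aligns with the original rows.
    return pd.Series(codes, index=s.index)

breaks = [0.2, 0.4, 0.6, 0.8, 1.0]

# transform returns a Series aligned with dfrm's index, so plain assignment
# creates the new column without having to pre-fill it with NaN.
dfrm["A_xtile"] = dfrm.groupby("C")["A"].transform(
    lambda s: bucket_within_group(s, breaks)
)
```

Because transform is called on one column and must return one value per row, group sizes do not matter. If equal-probability buckets are all that is needed, pd.qcut(s, 5, labels=False) inside the lambda may be a simpler choice than a hand-written helper.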
