
I want to mark some quantiles in my data, and for each row of the DataFrame, I would like the entry in a new column called e.g. "xtile" to hold this value.

For example, suppose I create a data frame like this:

import pandas, numpy as np

dfrm = pandas.DataFrame({'A':np.random.rand(100),
                         'B':(50+np.random.randn(100)),
                         'C':np.random.randint(low=0, high=3, size=(100,))})

Suppose I write my own function to compute the quintile of each element in an array. I have my own function for this, but as an example, just refer to scipy.stats.mstats.mquantiles.

import scipy.stats as st

def mark_quintiles(x, breakpoints):
    # Assume this is filled in, using st.mstats.mquantiles.
    # This returns an array the same shape as x, with an integer for which
    # breakpoint-bucket that entry of x falls into.
    ...
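For concreteness, the stub above could be filled in along the following lines. This is only a sketch and not the asker's actual function: it assumes the cut-offs come from st.mstats.mquantiles and that np.searchsorted is an acceptable way to turn each value into a bucket index, with bucket 0 holding the smallest values.

```python
import numpy as np
import scipy.stats as st

def mark_quintiles(x, breakpoints):
    """Return, for each entry of x, the index of the quantile bucket it falls into.

    Hypothetical fill-in for the stub in the question, not the asker's own code.
    """
    x = np.asarray(x, dtype=float)
    # Empirical quantiles of x at the requested probabilities.
    cutoffs = np.asarray(st.mstats.mquantiles(x, prob=breakpoints))
    # For each value, the index of the first cutoff that is >= that value.
    return np.searchsorted(cutoffs, x, side="left")
```

With breakpoints [0.2, 0.4, 0.6, 0.8, 1.0], the last cut-off is the maximum of x, so every value lands in one of the buckets 0 through 4.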

Now, the real question is how to use transform to add a new column to the data. Something like this:

def transformXtiles(dataFrame, inputColumnName, newColumnName, breaks):
    dataFrame[newColumnName] = mark_quintiles(dataFrame[inputColumnName].values,
                                              breaks)
    return dataFrame

And then:

dfrm.groupby("C").transform(lambda x: transformXtiles(x, "A", "A_xtile", [0.2, 0.4, 0.6, 0.8, 1.0]))

The problem is that the above code does not add the new column "A_xtile"; it just returns my data frame unchanged. If I first add a column called "A_xtile" full of dummy values, such as NaN, then it does successfully overwrite this column with the correct quintile markings.

But it is extremely inconvenient to have to create the column first for anything like this that I may want to add on the fly.

Note that a simple "apply" will not work here, since it won't know how to make sense of the possibly differently-sized result arrays for each group.

1 Answer


Using apply instead of transform works in the toy example below, even though the groups have different lengths:

In [82]: df
Out[82]: 
   X         Y
0  0 -0.631214
1  0  0.783142
2  0  0.526045
3  1 -1.750058
4  1  1.163868
5  1  1.625538
6  1  0.076105
7  2  0.183492
8  2  0.541400
9  2 -0.672809

In [83]: def func(x):
   ....:     x['NewCol'] = np.nan
   ....:     return x
   ....: 

In [84]: df.groupby('X').apply(func)
Out[84]: 
   X         Y  NewCol
0  0 -0.631214     NaN
1  0  0.783142     NaN
2  0  0.526045     NaN
3  1 -1.750058     NaN
4  1  1.163868     NaN
5  1  1.625538     NaN
6  1  0.076105     NaN
7  2  0.183492     NaN
8  2  0.541400     NaN
9  2 -0.672809     NaN
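Applied to the data frame from the question, the same group-wise idea might look like the sketch below. It rests on a few assumptions: mark_quintiles here is a stand-in for the asker's function (it uses np.quantile rather than st.mstats.mquantiles, only to keep the example to NumPy and pandas), and the column "A" is selected before calling transform so that a single Series comes back and can be assigned directly. Selecting the column first also avoids depending on how groupby(...).apply handles grouping columns and group keys, which has changed across pandas releases.

```python
import numpy as np
import pandas as pd

def mark_quintiles(x, breakpoints):
    # Stand-in for the asker's function (an assumption): bucket index per value.
    cutoffs = np.quantile(x, breakpoints)
    return np.searchsorted(cutoffs, x, side="left")

rng = np.random.default_rng(0)
dfrm = pd.DataFrame({'A': rng.random(100),
                     'B': 50 + rng.standard_normal(100),
                     'C': rng.integers(low=0, high=3, size=100)})

breaks = [0.2, 0.4, 0.6, 0.8, 1.0]

# transform calls the function once per group of C and aligns the
# per-group results back to the original row order.
dfrm['A_xtile'] = dfrm.groupby('C')['A'].transform(
    lambda s: pd.Series(mark_quintiles(s.to_numpy(), breaks), index=s.index))
```

Because the quantiles are computed inside each group, two rows with the same value of "A" can receive different markings if they belong to different groups of "C".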
