
I have the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'var': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 2, 1, 2, 3, 4, 5],
                   'input': [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.3]})

Within each input group, I would like to keep var only on the row where value is highest and set var to NaN on all other rows.

So I would like to end up with:

df = pd.DataFrame({'var': [np.nan, 'A', np.nan, 'B', np.nan, np.nan, 'C'],
                   'value': [1, 2, 1, 2, 3, 4, 5],
                   'input': [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.3]})

Any ideas?

1 Answer


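Use GroupBy.transform with 'max' to get a Series the same size as the original DataFrame, holding each group's maximum value on every row of that group. In this data every var maps to exactly one input, so grouping by 'var' gives the same groups as grouping by 'input'. Then compare that Series with the value column for inequality using Series.ne, and set var to NaN on the rows that differ with loc: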

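import numpy as np

# True on rows whose value is NOT the maximum of their group
mask = df.groupby('var')['value'].transform('max').ne(df['value'])

# Blank out var on those rows
df.loc[mask, 'var'] = np.nan
print(df)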

   var  value  input
0  NaN      1    0.1
1    A      2    0.1
2  NaN      1    0.2
3    B      2    0.2
4  NaN      3    0.3
5  NaN      4    0.3
6    C      5    0.3
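
For completeness, here is a self-contained sketch of the same idea that groups by 'input', as the question is worded, and leaves the original DataFrame untouched. The names group_max, result, first_max_idx and result_first are illustrative only and do not come from the answer above. It swaps loc assignment for Series.where, which keeps values where the condition holds and fills NaN elsewhere. The last two lines are an optional variant, assuming you want only one surviving row per group when several rows tie for the maximum.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'var': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 2, 1, 2, 3, 4, 5],
                   'input': [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.3]})

# Maximum of 'value' within each 'input' group, broadcast back to every row.
group_max = df.groupby('input')['value'].transform('max')

# Keep 'var' only where the row holds its group's maximum; NaN elsewhere.
# assign() returns a new DataFrame, so df itself is not modified.
result = df.assign(var=df['var'].where(df['value'].eq(group_max)))

# Optional variant for ties: idxmax gives the index label of the first
# maximal row per group, so at most one row per group keeps its 'var'.
first_max_idx = df.groupby('input')['value'].idxmax()
result_first = df.assign(var=df['var'].where(df.index.isin(first_max_idx)))
```

With the sample data there are no ties, so result and result_first hold the same var column. They would differ only if two rows in one input group shared the highest value.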

...