2 views

I have a data frame that contains five columns for ID values, and some arbitrary metric. The ID values relate to 5 employees for a specific project, but there is no standard for the order that each ID is entered into the dataset. I want to perform a groupby on the set of 5 ID's to evaluate at a group level.

In[1]: df1 = pd.DataFrame({'ID_1' : [1, 1, 1, 1],

'ID_2' : [2, 2, 4, 4],

'ID_3' : [3, 3, 2, 2],

'ID_4' : [4, 5, 8, 8],

'ID_5' : [5, 4, 7, 7],

'some_metric' : [.7, .8, .2, .9]})

In[2]: print df1

Out[2]:

ID_1  ID_2  ID_3  ID_4  ID_5  some_metric

0     1     2     3     4     5          0.7

1     1     2     3     5     4          0.8

2     1     4     2     8     7          0.2

3     1     4     2     8     7          0.9

However due to the non-unique ordering of the ID's in the original dataset, if I perform a group by on the 5 ID's I will get three groups when there are only two unique groups.

In[3]: df1.groupby(['ID_1', 'ID_2', 'ID_3', 'ID_4', 'ID_5']).mean()

Out[3]:

ID_1    ID_2    ID_3    ID_4    ID_5    some_metric

1       2       3       4       5         0.70

5       4         0.80

4       2       8       7         0.55

Is there an easy way to group by unique values from the column? Or is there a way to pull out the data into a dictionary, and then back into a data frame?

by (36.8k points)

To group on a unique combination of items in multiple columns regardless of order, with no missing values, sort the values and assign the sorted values back to the columns. Then you can group plainly. In this case, we use NumPy because it's one of the faster ways to sort

import numpy as np

cols = [f'ID_{i}' for i in range(1,6)]

df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)

df1

#   ID_1  ID_2  ID_3  ID_4  ID_5  some_metric

#0     1     2     3     4     5          0.7

#1     1     2     3     4     5          0.8

#2     1     2     4     7     8          0.2

#3     1     2     4     7     8          0.9

# Grouping on these columns works, regardless of original ordering:

df1.groupby(cols).size()

#ID_1  ID_2  ID_3  ID_4  ID_5

#1     2     3     4     5       2

#            4     7     8       2

Do check out Python Data Science Course which helps you understand from scratch