Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (17.6k points)

I have a data frame with three string columns. I know that the only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to group by data frame by first two columns and select most common value of the third column for each combination.

My code:

import pandas as pd

from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 

                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],

                  'Short name' : ['NY','New','Spb','NY']})

print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

Last line of code doesn't work, it says "Key error 'Short name'" and if I try to group only by City, then I got an AssertionError. What can I do fix it?

1 Answer

0 votes
by (41.4k points)

To get a count series use value_counts() and then get the first row:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 

                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],

                  'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

If you wish to know what is python visit this python tutorial and python interview questions.

Browse Categories

...