in Data Science by (17.6k points)

I'm working with an Airbnb dataset on Kaggle:

https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

and want to collapse the values in the language column into two groups: english and non-english.

For instance:

users.language.value_counts()

en    15011
zh      101
fr       99
de       53
es       53
ko       43
ru       21
it       20
ja       19
pt       14
sv       11
no        6
da        5
nl        4
el        2
pl        2
tr        2
cs        1
fi        1
is        1
hu        1
Name: language, dtype: int64

And the result I want is:

users.language.value_counts()

english        15011
non-english      459
Name: language, dtype: int64

This is sort of the solution I want:

def language_groupings():
    for i in users:
        if users.language !='en':
            replace(users.language.str, 'non-english')
        else:
            replace(users.language.str, 'english')
    return users

users['language'] = users.apply(lambda row: language_groupings)

Except there's obviously something wrong with this as it returns an empty series when I run value_counts on the column.

1 Answer

0 votes
by (41.4k points)

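Build the labels with np.where, which returns 'english' wherever the comparison is true and 'non-english' everywhere else (this needs import numpy as np):

users.assign(lang=np.where(users.language == 'en', 'english', 'non-english'))['lang'].value_counts()

This adds a temporary lang column and counts it; users itself is left unchanged.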
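To store the grouping back in the language column instead of only counting it, the same idea can be sketched as below. The small users frame is made up for illustration and stands in for the Kaggle table; the pandas-only variant using eq and map is an alternative, not part of the original answer. Note that with either approach a missing language (NaN) does not equal 'en', so it ends up labelled 'non-english'.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle users table (assumption, not real data).
users = pd.DataFrame({"language": ["en", "en", "zh", "fr", "en", "de", "en"]})

# Boolean mask: True where the language code is exactly 'en'.
is_english = users["language"] == "en"

# pandas-only alternative: map the boolean mask to the two labels.
alt = is_english.map({True: "english", False: "non-english"})

# Vectorised replacement over the whole column, no Python-level loop.
users["language"] = np.where(is_english, "english", "non-english")

counts = users["language"].value_counts()
print(counts)
```

As for the attempt in the question, it most likely ends up empty because the lambda returns the function object without calling it, so apply produces a Series indexed by column names; assigning that to users['language'] aligns on the row index, fills the column with NaN, and value_counts drops NaN by default.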
