0 votes
in Data Science by (18.4k points)

I have a dataset whose no_employees column is a str object. What is the best way to create a new column (company_size) in the data frame and fill it with values based on the no_employees column, as in the example below?

import pandas as pd

mental_health_df = pd.read_csv("Mental Health.csv")

pd.set_option('display.max_columns', None)


no_employees     | company_size
6-25             | Small
More than 1000   | Extremely Large
500-1000         | Very Large
26-100           | Medium
100-500          | Large
1-5              | Very Small

1 Answer

0 votes
by (36.8k points)

Bin the values using pd.cut:

import numpy as np

df['company_size'] = pd.cut(df['no_employees'].astype('category').cat.codes * 10,
                            [-np.inf, 9, 19, 29, 39, 49, np.inf],
                            labels=['Very Small', 'Large', 'Medium', 'Very Large', 'Small', 'Extremely Large'])


     no_employees     company_size
0            6-25            Small
1  More than 1000  Extremely Large
2        500-1000       Very Large
3          26-100           Medium
4         100-500            Large
5             1-5       Very Small

It works like this:

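# Convert no_employees to integer category codes, multiplied by ten to make the bin edges easier to define.
# The codes follow the sorted (string) order of the values: 1-5, 100-500, 26-100, 500-1000, 6-25, More than 1000.

df['no_employees'].astype('category').cat.codes * 10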
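To see what this step produces, here is a minimal sketch. It assumes a small sample frame built from the six values shown in the question rather than the real CSV, and the names `as_category`, `categories`, and `codes` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question; the real frame would come from "Mental Health.csv".
df = pd.DataFrame({
    "no_employees": ["6-25", "More than 1000", "500-1000", "26-100", "100-500", "1-5"]
})

# The categories are sorted as strings, not by company size.
as_category = df["no_employees"].astype("category")
categories = list(as_category.cat.categories)

# Each row gets the position of its value in that sorted list, times ten.
codes = (as_category.cat.codes * 10).tolist()

print(categories)  # ['1-5', '100-500', '26-100', '500-1000', '6-25', 'More than 1000']
print(codes)       # [40, 50, 30, 20, 10, 0]
```

Because the codes depend on string sort order, the label list passed to pd.cut has to follow that same order rather than the order of company size.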

# Bin the codes using pd.cut; the labels are listed in the same order as the codes.

pd.cut(df['no_employees'].astype('category').cat.codes * 10,
       [-np.inf, 9, 19, 29, 39, 49, np.inf],
       labels=['Very Small', 'Large', 'Medium', 'Very Large', 'Small', 'Extremely Large'])
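The binning above only lines up if the column contains exactly these six values, since a missing or extra value shifts the codes. As an alternative that is not part of the answer above, an explicit lookup with Series.map avoids that dependency. This is a sketch under the assumption that the six ranges from the question are the only ones present; `size_by_range` is a hypothetical name.

```python
import pandas as pd

# Sample data copied from the question; the real frame would come from "Mental Health.csv".
df = pd.DataFrame({
    "no_employees": ["6-25", "More than 1000", "500-1000", "26-100", "100-500", "1-5"]
})

# Explicit lookup table using the labels given in the question.
size_by_range = {
    "1-5": "Very Small",
    "6-25": "Small",
    "26-100": "Medium",
    "100-500": "Large",
    "500-1000": "Very Large",
    "More than 1000": "Extremely Large",
}

# Values not found in the lookup table become NaN instead of falling into the wrong bin.
df["company_size"] = df["no_employees"].map(size_by_range)

print(df)
```

With this approach the mapping is readable at a glance, and unexpected values in no_employees show up as missing company_size entries that can be checked with isna().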
