I have a dataset of the following shape:

tconst  GreaterEuropean British WestEuropean  Italian French  Jewish   Germanic    Nordic  

0   tt0000001   3   1    2   0   1  0  0  1    0   0   0   0   0   0    0  0   0    8

1   tt0000002   2   0    2   0   2  0  0  0    0   0   0   0   0   0    0  0   0    6

2   tt0000003   4   0    3   0   3  1  0  0    0   0   0   0   0   0    0  0   0    11

3   tt0000004   2   0    2   0   2  0  0  0    0   0   0   0   0   0    0  0   0    6

4   tt0000005   3   2    1   0   0  0  1  0    0   0   0   0   0   0    0  0   0    7

It is IMDb data. After processing, I created these columns; each one gives the number of actors of that ethnicity in a movie (tconst).

I want to create another column, df["diversity"], holding a diversity score (the "Gini index").

For example, say a movie has 10 actors: 3 Asian, 3 British, 3 African American and 1 French. We divide each count by the total (3/10, 3/10, 3/10, 1/10), square each fraction, and subtract the sum of the squares from 1: 1 - [(3/10)^2 + (3/10)^2 + (3/10)^2 + (1/10)^2]. The resulting score for each movie goes into the diversity column.

I have tried simple pandas manipulation, but I am not getting there.

EDIT:

For the first row, total_ethnicities is 8:

3 GreaterEuropean

1 British

2 WestEuropean

1 French

1 Nordic

So the score will be:

1- [(3/8)^2 + (1/8)^2 + (2/8)^2 + (1/8)^2 + (1/8)^2]
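The arithmetic above can be checked by hand with a few lines of plain Python. This is only a sketch: the helper name gini_diversity is made up for illustration, and exact fractions are used so that no rounding hides a mistake.

```python
from fractions import Fraction


def gini_diversity(counts):
    """Return 1 - sum((c / total)**2) for a list of non-negative counts.

    Hypothetical helper, not part of pandas or NumPy.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    return 1 - sum(Fraction(c, total) ** 2 for c in counts)


# 10-actor example from the question: 3, 3, 3 and 1
example = gini_diversity([3, 3, 3, 1])

# First row of the dataset: 3, 1, 2, 1, 1 (zero counts contribute nothing)
first_row = gini_diversity([3, 1, 2, 1, 1])

print(float(example), float(first_row))  # 0.72 0.75
```

So the expected diversity for the first row is 0.75.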

1 Answer

Use NumPy vectorization; this gives the desired result. The code assumes tconst is the index of df, so that every remaining column is numeric.

import numpy as np

# Select every count column, i.e. everything except total_ethnicities
one = df.drop(columns=['total_ethnicities']).values

# Select total_ethnicities as a column vector so it broadcasts across each row
two = df['total_ethnicities'].values[:, None]

# Divide each count by its row total, square, sum along each row, then subtract from 1
df['diversity'] = 1 - np.sum((one / two) ** 2, axis=1)

df['diversity']

tconst

tt0000001    0.750000

tt0000002    0.666667

tt0000003    0.710744

tt0000004    0.666667

tt0000005    0.693878

Name: diversity, dtype: float64
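For anyone who wants to reproduce the output above, here is a self-contained sketch. It makes two assumptions that are not in the original post: the frame is rebuilt from only the eight columns named in the question (the unnamed columns are all zero in the five sample rows, so they do not change the result), and movies with a total of zero are given NaN instead of triggering a division by zero. It also uses the pandas methods div and sum rather than raw NumPy arrays, which is a variation on the answer, not the answer's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical reduced frame: only the eight columns named in the question,
# plus total_ethnicities. The omitted columns are all zero in these rows.
df = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002", "tt0000003", "tt0000004", "tt0000005"],
        "GreaterEuropean": [3, 2, 4, 2, 3],
        "British": [1, 0, 0, 0, 2],
        "WestEuropean": [2, 2, 3, 2, 1],
        "Italian": [0, 0, 0, 0, 0],
        "French": [1, 2, 3, 2, 0],
        "Jewish": [0, 0, 1, 0, 0],
        "Germanic": [0, 0, 0, 0, 1],
        "Nordic": [1, 0, 0, 0, 0],
        "total_ethnicities": [8, 6, 11, 6, 7],
    }
).set_index("tconst")

# Count columns only (everything except the row total)
counts = df.drop(columns="total_ethnicities")

# Assumption: a total of 0 should yield NaN rather than a division by zero
totals = df["total_ethnicities"].replace(0, np.nan)

# Share of each ethnicity per movie, aligned row by row on the index
shares = counts.div(totals, axis=0)

# min_count=1 keeps all-NaN rows as NaN instead of summing them to 0
df["diversity"] = 1 - (shares ** 2).sum(axis=1, min_count=1)

print(df["diversity"])
```

Rounded to six decimals, the five values should match the output listed above.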
