
I have a dataset with the following shape:

tconst  GreaterEuropean British WestEuropean  Italian French  Jewish   Germanic    Nordic

0   tt0000001   3   1    2   0   1  0  0  1    0   0   0   0   0   0    0  0   0    8

1   tt0000002   2   0    2   0   2  0  0  0    0   0   0   0   0   0    0  0   0    6

2   tt0000003   4   0    3   0   3  1  0  0    0   0   0   0   0   0    0  0   0    11

3   tt0000004   2   0    2   0   2  0  0  0    0   0   0   0   0   0    0  0   0    6

4   tt0000005   3   2    1   0   0  0  1  0    0   0   0   0   0   0    0  0   0    7

This is IMDb data. After processing, I created these columns, each of which holds the number of actors of that ethnicity in a movie (tconst).

I want to create another column, df["diversity"], which holds a diversity score (the "Gini index").

For example, say a movie has 10 actors: 3 Asian, 3 British, 3 African American and 1 French. Dividing each count by the total gives 3/10, 3/10, 3/10 and 1/10. The score is 1 minus the sum of the squares of these shares: 1 - [(3/10)^2 + (3/10)^2 + (3/10)^2 + (1/10)^2] = 0.72. The score for each movie should then be stored in the diversity column.

I have tried simple pandas manipulation, but I am not getting there.

EDIT:

For the first row, the total across ethnicities is 8:

3 GreaterEuropean

1 British

2 WestEuropean

1 French

1 Nordic

So the score will be:

1 - [(3/8)^2 + (1/8)^2 + (2/8)^2 + (1/8)^2 + (1/8)^2] = 0.75
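The arithmetic for the first row can be checked with a few lines of plain Python. This is only a sketch of the formula described above; the variable names `counts`, `total` and `score` are illustrative and not part of the original dataset.

```python
# Non-zero counts from the first row (tt0000001):
# GreaterEuropean, British, WestEuropean, French, Nordic
counts = [3, 1, 2, 1, 1]
total = sum(counts)  # 8 actors in total

# One minus the sum of the squared shares
score = 1 - sum((c / total) ** 2 for c in counts)
print(score)
```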


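You can use NumPy vectorization to get the desired result. This assumes the last column is named total_ethnicities and that tconst is the index; if tconst is still a regular column, call df = df.set_index('tconst') first.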

import numpy as np

# Select every count column, i.e. all columns except total_ethnicities
one = df.drop(columns=['total_ethnicities']).values

# Select total_ethnicities as a column vector so it broadcasts across each row
two = df['total_ethnicities'].values[:, None]

# Divide each count by its row total, square, sum along each row, then subtract from 1
df['diversity'] = 1 - np.sum((one / two) ** 2, axis=1)

df['diversity']

tconst

tt0000001    0.750000

tt0000002    0.666667

tt0000003    0.710744

tt0000004    0.666667

tt0000005    0.693878

Name: diversity, dtype: float64
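For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It makes several assumptions: only the eight ethnicity columns named in the question are included (the remaining columns are all zero in the sample rows), tconst is used as the index, and total_ethnicities is recomputed as the row sum. It also swaps the raw NumPy arrays for pandas `div` and `sum`, and turns a zero total into NaN, which is one possible way to avoid dividing by zero for movies with no counts.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; only the eight named columns are kept
# (assumption: the omitted columns are all zero for these rows).
cols = ["GreaterEuropean", "British", "WestEuropean", "Italian",
        "French", "Jewish", "Germanic", "Nordic"]
data = {
    "tt0000001": [3, 1, 2, 0, 1, 0, 0, 1],
    "tt0000002": [2, 0, 2, 0, 2, 0, 0, 0],
    "tt0000003": [4, 0, 3, 0, 3, 1, 0, 0],
    "tt0000004": [2, 0, 2, 0, 2, 0, 0, 0],
    "tt0000005": [3, 2, 1, 0, 0, 0, 1, 0],
}
df = pd.DataFrame.from_dict(data, orient="index", columns=cols)
df.index.name = "tconst"
df["total_ethnicities"] = df[cols].sum(axis=1)

# Share of each ethnicity per movie; a zero total becomes NaN instead of
# triggering a division by zero.
totals = df["total_ethnicities"].replace(0, np.nan)
shares = df[cols].div(totals, axis=0)

# One minus the sum of squared shares; rows with no valid shares stay NaN.
df["diversity"] = 1 - (shares ** 2).sum(axis=1, min_count=1)
print(df["diversity"])
```

On these five rows the result should agree with the values shown above.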
