
I have the below dataset

Labels   Usernames
1        Londonderry
1        Londoncalling
1        Steveonder43
0        Maryclare_re
1        Patent107391
0        Anonymous
1        _24londonqr
...

I am trying to show that usernames containing the word London are correlated with label 1. To do this, I created a second binary column that flags whether the word London appears in each username:

for idx, username in df['Usernames']:
    if 'London' in username:
        df['London'].iloc[idx] = 1
    else:
        df['London'].iloc[idx] = 0

Then I compared these binary variables, using the Pearson correlation coefficient:

import scipy.stats.pearsonr as rho
corr = rho(df['labels'], df['London'])

However, it is not working. Am I missing something?

1 Answer


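The error comes from the column name: the column is called Labels, but your code looks up df['labels'], and pandas column names are case-sensitive. Two other lines would also fail. The loop "for idx, username in df['Usernames']" tries to unpack each username string into two variables (enumerate(df['Usernames']) would give index/value pairs), and "import scipy.stats.pearsonr as rho" is not a valid import because pearsonr is a function, not a module. Here is a simpler, vectorized version: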

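from scipy import stats

# 1 if the username contains 'London' (case-sensitive match), else 0
df['London'] = df['Usernames'].str.contains('London').astype(int)

# Returns (correlation coefficient, p-value)
stats.pearsonr(df['Labels'], df['London'])

Out[12]: (0.4, 0.37393392381774704)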
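One thing to watch: str.contains('London') is case-sensitive, so a username such as _24londonqr is not flagged. Below is a minimal sketch of a case-insensitive variant. It assumes the seven rows shown in the question are the whole dataset (the elided rows are left out) and that lowercase "london" should count as a match. The column name London_any_case is made up for illustration; case=False is an existing parameter of Series.str.contains.

```python
import pandas as pd
from scipy import stats

# Only the seven rows visible in the question; the elided rows ("...") are not included.
df = pd.DataFrame({
    "Labels": [1, 1, 1, 0, 1, 0, 1],
    "Usernames": ["Londonderry", "Londoncalling", "Steveonder43",
                  "Maryclare_re", "Patent107391", "Anonymous", "_24londonqr"],
})

# Case-sensitive flag, as in the answer: does not match "_24londonqr".
df["London"] = df["Usernames"].str.contains("London").astype(int)

# Case-insensitive flag (hypothetical column name): also matches "london".
df["London_any_case"] = (
    df["Usernames"].str.contains("london", case=False).astype(int)
)

# pearsonr returns the correlation coefficient and the two-sided p-value.
r_sensitive, p_sensitive = stats.pearsonr(df["Labels"], df["London"])
r_any_case, p_any_case = stats.pearsonr(df["Labels"], df["London_any_case"])

print("case-sensitive:  ", r_sensitive, p_sensitive)
print("case-insensitive:", r_any_case, p_any_case)
```

With so few rows, the p-value of about 0.37 reported above does not indicate a statistically significant correlation. The result on the full dataset may differ.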
