# Python: tf-idf-cosine: to find document similarity


I was following a tutorial that was available in two parts (Part 1 and Part 2). Unfortunately, the author didn't have time for the final section, which involved using cosine similarity to actually measure how similar two documents are. I followed the examples in the article with the help of a link from StackOverflow, and have included the code mentioned in that link (just to make life easier).

by (33.2k points)

The sklearn library offers several functions you can use to find cosine similarities between documents. You can use TfidfVectorizer from sklearn's feature_extraction.text module directly to vectorize the text. It computes the TF-IDF weights and then applies row-wise Euclidean (L2) normalization.

>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> from sklearn.datasets import fetch_20newsgroups

>>> twenty = fetch_20newsgroups()

>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)

>>> tfidf

<11314x130088 sparse matrix of type '<type 'numpy.float64'>'

with 1787553 stored elements in Compressed Sparse Row format>

After TFIDF-Vectorization, you need to slice the matrix row-wise to get a submatrix with a single row:

>>> tfidf[0:1]

<1x130088 sparse matrix of type '<type 'numpy.float64'>'

with 89 stored elements in Compressed Sparse Row format>

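Because the rows are already L2-normalized, the cosine similarity between two documents is simply the dot product of their vectors. In sklearn, that dot product is available as linear_kernel: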

>>> from sklearn.metrics.pairwise import linear_kernel

>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()

>>> cosine_similarities

array([ 1.        , 0.04405952, 0.11016969, ...,  0.04433602,

0.04457106,  0.03293218])
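As a quick sanity check of why the plain dot product is enough here, the following minimal sketch uses a three-sentence toy corpus (the `docs` list is made up for illustration and is not part of the 20 newsgroups data). It assumes the default `norm="l2"` setting of TfidfVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus, illustrative only (not from the 20 newsgroups data).
docs = ["the sky is blue", "the sun is bright", "the sun in the sky is bright"]

tfidf = TfidfVectorizer().fit_transform(docs)

# With the default norm="l2", every row has unit Euclidean length.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))

# So the plain dot product matches the cosine similarity.
dot = linear_kernel(tfidf[0:1], tfidf)
cos = cosine_similarity(tfidf[0:1], tfidf)
print(np.allclose(dot, cos))
```

If you change the vectorizer's `norm` parameter, the rows are no longer unit length, and cosine_similarity would be the safer choice.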

To find the most related documents, use cosine_similarities.argsort() to sort the document indices by similarity, then slice off the last few entries in reverse order:

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]

>>> related_docs_indices

array([    0, 958, 10576,  3277])

>>> cosine_similarities[related_docs_indices]

array([ 1.        , 0.54967926, 0.32902194,  0.2825788 ])

This output shows the cosine similarities of the four most related documents. The first result, with a similarity of 1, is the query document itself.
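Note that the slice `[:-5:-1]` returns four indices, not five, and the query document is always among them. If you would rather get exactly k neighbours without the query itself, one possible sketch looks like this. The scores, `query_index`, and `k` below are hypothetical values chosen for illustration, not results from the dataset.

```python
import numpy as np

# Hypothetical similarity scores; index 0 is the query document itself.
cosine_similarities = np.array([1.0, 0.044, 0.110, 0.550, 0.329, 0.283])
query_index = 0
k = 3

# Sort indices by descending similarity, then drop the query itself.
order = np.argsort(cosine_similarities)[::-1]
top_k = [int(i) for i in order if i != query_index][:k]

print(top_k)  # [3, 4, 5]
print(cosine_similarities[top_k])
```

Filtering by index rather than skipping the first result also works when another document happens to be an exact duplicate of the query.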

>>> print(twenty.data[0])

Output:

From: [email protected] (where's my thing)

Subject: WHAT car is this!?

Nntp-Posting-Host: rac3.wam.umd.edu

Organization: University of Maryland, College Park

Lines: 15

I was wondering if anyone out there could enlighten me on this car I saw

the other day. It was a 2-door sports car, looked to be from the late 60s/

early 70s. It was called a Bricklin. The doors were really small. In addition,

the front bumper was separate from the rest of the body. This is

all I know. If anyone can tellme a model name, engine specs, years

of production, where this car is made, history, or whatever info you

have on this funky looking car, please e-mail.

Thanks,

- IL

---- brought to you by your neighborhood Lerxst ----


by (28.1k points)
Thanks for the detailed explanation.
by (19.8k points)
Very well explained!
by (16.3k points)
This worked for me. Thanks.
by (92.8k points)

What we need is a simple for loop that iterates over the two arrays representing the train data and the test data.

First, define a lambda function that holds the formula for the cosine calculation:

cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

Then write a for loop over the vectors. The logic is simple: for each vector in trainVectorizerArray, find its cosine similarity with each vector in testVectorizerArray. In the full listing, the same lambda is named cx.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
transformer = TfidfTransformer()

# Learn the vocabulary from the documents, then reuse it for the query
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

# Cosine similarity: dot product divided by the product of the norms
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cx(vector, testV)
        print(cosine)

transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

Here is the output:

Fit Vectorizer to train set [[1 0 1 0]

[0 1 0 1]]

Transform Vectorizer to test set [[0 1 1 1]]

[1 0 1 0]

[0 1 1 1]

0.408

[0 1 0 1]

[0 1 1 1]

0.816

[[ 0.70710678  0.          0.70710678  0.        ]

[ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]
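For larger arrays, the nested loop can be replaced by a single matrix product. The following is only a sketch of that idea: it reuses the count vectors printed above, the names `train`, `test`, and `similarities` are chosen here for illustration, and it assumes no row is all zeros (otherwise the division would fail).

```python
import numpy as np

# Count vectors copied from the output above.
train = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1]], dtype=float)
test = np.array([[0, 1, 1, 1]], dtype=float)

# Scale every row to unit length; one matrix product then gives all cosines.
train_unit = train / np.linalg.norm(train, axis=1, keepdims=True)
test_unit = test / np.linalg.norm(test, axis=1, keepdims=True)
similarities = train_unit @ test_unit.T  # shape: (n_train, n_test)

print(np.round(similarities, 3))
```

The two entries round to 0.408 and 0.816, the same values the loop prints.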

by (29.8k points)
I agree with @kodee; that method seems simpler! Anyway, this is also a decent workaround.
by (47.2k points)
I'm learning from the beginning, and your answer is the easiest to follow. I think you can use np.corrcoef() instead of your roll-your-own method.
by (107k points)

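Use the code below to find document similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: train_set holds both documents plus the query as its last
# entry, which is what the three-value output implies.
train_set = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
length = len(train_set)

# Assumption: English stop words are removed, which matches the output.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
print(tfidf_matrix)

# Compare the last row (the query) with every row, itself included
cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
print(cosine)

Output: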

[[ 0.34949812 0.81649658 1. ]]
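An alternative worth considering is to fit the vectorizer on the documents only and then transform the query separately, so the query does not influence the vocabulary or the IDF weights. The sketch below is one way to do that, not the method used above. The names `documents`, `query`, and `scores` are illustrative, and it assumes sklearn's built-in English stop-word list in place of NLTK's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["The sky is blue.", "The sun is bright."]
query = ["The sun in the sky is bright."]

vectorizer = TfidfVectorizer(stop_words="english")
# Learn the vocabulary and IDF weights from the documents only.
doc_matrix = vectorizer.fit_transform(documents)
# Reuse the fitted vocabulary and weights for the query.
query_matrix = vectorizer.transform(query)

# One similarity score per document.
scores = cosine_similarity(query_matrix, doc_matrix).ravel()
best = int(scores.argmax())
print(best, documents[best])
```

Because the query is not part of the fitted corpus, the scores can differ slightly from the output above, but the ranking of the two documents should be the same.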