Python: tf-idf-cosine: to find document similarity

Question

3 Answers

Anurag · Answer 1 · 2019-06-18T11:18:20+0000

sklearn offers several ways to compute cosine similarity between documents. You can use TfidfVectorizer from the sklearn.feature_extraction.text module to vectorize the documents directly. It computes the TF-IDF weights and then applies row-wise Euclidean (L2) normalization, so every document vector has unit length.

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
with 1787553 stored elements in Compressed Sparse Row format>

After TF-IDF vectorization, slice the matrix row-wise to get a submatrix with a single row:

>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
with 89 stored elements in Compressed Sparse Row format>

Because every row is already L2-normalized, the cosine similarity between two documents is simply the dot product of their vectors, which sklearn computes with linear_kernel:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
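The claim that the dot product equals the cosine similarity can be checked on a small corpus. This is a minimal sketch; the three-sentence corpus and the variable names are made up for illustration and are not part of the 20 newsgroups example above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical three-document corpus, used only for illustration.
docs = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
tfidf = TfidfVectorizer().fit_transform(docs)

# With the default norm='l2', every row has Euclidean length 1.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

# Dot products of unit vectors are cosine similarities.
dot_products = linear_kernel(tfidf[0:1], tfidf).flatten()
cosines = cosine_similarity(tfidf[0:1], tfidf).flatten()

print(row_norms)
print(np.allclose(dot_products, cosines))
```

If the vectorizer is created with norm=None, the rows are no longer unit length and linear_kernel returns plain dot products, so cosine_similarity is the safer call in that case.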

To find the most related documents, use cosine_similarities.argsort() to get the document indices ordered by similarity, then take the last few in reverse order:

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])

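This output shows the cosine similarities of the four most related documents: the slice [:-5:-1] returns four indices, not five, and the first one (index 0, similarity 1.0) is the query document itself.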
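How the argsort() slice picks its results is easy to misread, so here is a small sketch of the slicing on its own. The scores array is invented for illustration; index 0 plays the role of the query document.

```python
import numpy as np

# Hypothetical similarity scores for six documents; index 0 is the query itself.
scores = np.array([1.0, 0.05, 0.55, 0.11, 0.33, 0.28])

order = scores.argsort()                 # indices sorted by ascending score
top4 = order[:-5:-1]                     # walk back from the end, stop before position -5: 4 items
top5 = order[::-1][:5]                   # reverse first, then take 5 items
top5_without_query = order[::-1][1:6]    # skip the query document at rank 0

print(top4.tolist())
print(top5.tolist())
print(top5_without_query.tolist())
```

Here top4 is [0, 2, 4, 5]. Reversing first and then slicing, as in the last two lines, makes the number of returned items explicit.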

>>> print(twenty.data[0])

Output:

From: [email protected] (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----

Hope this answer helps.


vinita · Answer 2 · 2019-08-09T06:44:54+0000

The approach is a simple for loop over the two arrays that hold the vectorized train data and test data.

First, define a lambda that holds the formula for the cosine calculation:

cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

Then write a for loop over the vectors. The logic is simple: for each vector in trainVectorizerArray, find the cosine similarity with each vector in testVectorizerArray.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords  # needs the NLTK stopwords corpus: nltk.download('stopwords')
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
transformer = TfidfTransformer()

# Term counts for the documents and the query, using the vocabulary of the documents
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

# Cosine similarity of two 1-D vectors, rounded to 3 decimals
cx = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

# Compare every document vector with every query vector
for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cx(vector, testV)
        print(cosine)

# TF-IDF weights for the documents
transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

# TF-IDF weights for the query (this refits the transformer on the query)
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

Here is the output:

Fit Vectorizer to train set [[1 0 1 0]
[0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[1 0 1 0]
[0 1 1 1]
0.408
[0 1 0 1]
[0 1 1 1]
0.816
[[ 0.70710678 0. 0.70710678 0. ]
[ 0. 0.70710678 0. 0.70710678]]
[[ 0. 0.57735027 0.57735027 0.57735027]]
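The nested loop can also be written as one matrix operation. The sketch below uses the count vectors from the output above as literal arrays; cosine_matrix is a hypothetical helper name, and it assumes no row is all zeros (a zero row would cause a division by zero).

```python
import numpy as np

# Count vectors copied from the output above.
train = np.array([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
test = np.array([[0, 1, 1, 1]], dtype=float)

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

sims = np.round(cosine_matrix(train, test), 3)
print(sims)
```

The result has one row per document and one column per query, and its two entries match the 0.408 and 0.816 printed by the loop.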

I think using np.corrcoef() instead of your roll-your-own method is better. — kodee, Aug 9, 2019
I agree with @kodee , that method seems simpler ! anyway this is also a decent work around — Prakhar_04, Aug 17, 2019
I'm learning from the beginning and your answer is the easiest to follow. I think that you can use np.corrcoef() instead your roll-your-own method — Ashok, Aug 17, 2019
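On the np.corrcoef() suggestion in the comments above: np.corrcoef() returns the Pearson correlation, which is the cosine similarity of the mean-centered vectors, so it generally gives a different number than the lambda in this answer. A minimal sketch with the first document vector and the query vector from the output (the variable names are illustrative):

```python
import numpy as np

a = np.array([1, 0, 1, 0], dtype=float)  # first document vector
b = np.array([0, 1, 1, 1], dtype=float)  # query vector

# Cosine similarity, as in the answer's lambda (without rounding).
cosine = np.inner(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pearson correlation: entry [0, 1] of the 2x2 correlation matrix.
pearson = np.corrcoef(a, b)[0, 1]

# The same value, computed as the cosine of the mean-centered vectors.
a_c, b_c = a - a.mean(), b - b.mean()
centered_cosine = np.inner(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

print(round(cosine, 3), round(pearson, 3), round(centered_cosine, 3))
```

For this pair the cosine similarity is about 0.408 while the correlation is about -0.577, so the two measures are not interchangeable for ranking documents.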

Vishal · Answer 3 · 2019-08-20T06:16:38+0000

The following code computes document similarity:

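from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Assumed inputs (not defined in the original snippet): the documents and query from Answer 2
train_set = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
length = len(train_set)
# stop_words='english' is an assumption; with it, these inputs give the output shown
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
print(tfidf_matrix)
# Compare the last document with every document in the set
cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
print(cosine)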

Output:

[[ 0.34949812 0.81649658 1. ]]
