The sklearn library offers several other ways to find cosine similarities between documents. One option is to use TfidfVectorizer from the sklearn.feature_extraction.text module to vectorize the text. By default it applies TF-IDF weighting and then normalizes each row to unit Euclidean (L2) length.

>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> from sklearn.datasets import fetch_20newsgroups

>>> twenty = fetch_20newsgroups()

>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)

>>> tfidf

<11314x130088 sparse matrix of type '<type 'numpy.float64'>'

with 1787553 stored elements in Compressed Sparse Row format>

After TF-IDF vectorization, slice the matrix row-wise to get a submatrix with a single row, here the first document:

>>> tfidf[0:1]

<1x130088 sparse matrix of type '<type 'numpy.float64'>'

with 89 stored elements in Compressed Sparse Row format>
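As a quick sanity check of the two claims so far (rows have unit Euclidean length, and the slice stays a single-row matrix), here is a minimal sketch. It assumes a tiny made-up corpus named `docs` instead of the 20 newsgroups data, so it runs without downloading anything.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, used only for illustration.
docs = [
    "the car has a small door",
    "the engine of the car is loud",
    "bananas are yellow",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Euclidean length of every row: square each entry, sum per row, take the root.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(row_norms)

# Slicing with 0:1 keeps a 2-D sparse matrix with exactly one row.
first_row = tfidf[0:1]
print(first_row.shape)
```

Every value in `row_norms` should be 1 up to floating-point error, which is what later allows a plain dot product to stand in for cosine similarity.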

Because the rows are already normalized to unit length, the dot product of two rows equals their cosine similarity. In sklearn, we can compute that dot product with linear_kernel:

>>> from sklearn.metrics.pairwise import linear_kernel

>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()

>>> cosine_similarities

array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,

0.04457106, 0.03293218])
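To see that linear_kernel really gives cosine similarities on TF-IDF rows, the sketch below compares it with sklearn's cosine_similarity. The corpus `docs` is a made-up stand-in, not the newsgroups data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical three-document corpus, used only for illustration.
docs = [
    "the car has a small door",
    "the engine of the car is loud",
    "bananas are yellow",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Plain dot product of the first row against every row.
via_dot = linear_kernel(tfidf[0:1], tfidf).flatten()

# Explicit cosine similarity, which normalizes the rows again internally.
via_cosine = cosine_similarity(tfidf[0:1], tfidf).flatten()

print(np.allclose(via_dot, via_cosine))  # True
```

On rows that are already normalized, linear_kernel skips the extra normalization step, so it does less work for the same result.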

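To find the most related documents, we can use cosine_similarities.argsort(), which returns the document indices sorted by ascending similarity. The reversed slice [:-5:-1] then picks the four indices with the highest similarity values: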

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]

>>> related_docs_indices

array([ 0, 958, 10576, 3277])

>>> cosine_similarities[related_docs_indices]

array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])

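This output shows the cosine similarity values of the four most related documents. The first entry (index 0, similarity 1.0) is the query document itself, so the closest other document is index 958 with a similarity of about 0.55.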
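Since the slice [:-5:-1] returns four indices and the first of them is always the query document, a small helper can make the ranking step more explicit. This is only a sketch: `top_k_related` and the `scores` array are hypothetical names and values, not part of sklearn or of the output above.

```python
import numpy as np


def top_k_related(similarities, query_index, k):
    """Return the indices of the k highest scores, skipping the query itself."""
    order = np.argsort(similarities)[::-1]   # highest score first
    order = order[order != query_index]      # drop the query document
    return order[:k]


# Made-up similarity scores for five documents; document 0 is the query.
scores = np.array([1.0, 0.1, 0.55, 0.3, 0.05])

print(top_k_related(scores, query_index=0, k=3))  # [2 3 1]

# For comparison, the slice used earlier yields four indices, query included.
print(np.argsort(scores)[:-5:-1])                 # [0 2 3 1]
```

With the real data you would pass `cosine_similarities` and `query_index=0` in place of the made-up `scores`.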

To inspect the query document itself (index 0), print its text:

>>> print(twenty.data[0])

**Output:**

From: [email protected] (where's my thing)

Subject: WHAT car is this!?

Nntp-Posting-Host: rac3.wam.umd.edu

Organization: University of Maryland, College Park

Lines: 15

I was wondering if anyone out there could enlighten me on this car I saw

the other day. It was a 2-door sports car, looked to be from the late 60s/

early 70s. It was called a Bricklin. The doors were really small. In addition,

the front bumper was separate from the rest of the body. This is

all I know. If anyone can tellme a model name, engine specs, years

of production, where this car is made, history, or whatever info you

have on this funky looking car, please e-mail.

Thanks,

- IL

---- brought to you by your neighborhood Lerxst ----

Hope this answer helps.
