
I was reading about the TfidfVectorizer implementation in scikit-learn, and I don't understand the output of its methods. For example:

new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']

new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)

print tfidf_vectorizer.vocabulary_

print new_term_freq_matrix.todense()

Output:

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}

[[ 0.57735027  0.57735027  0.57735027  0.  0.  0.  0.  0.  0.  0.  0. ]
 [ 0.  0.68091856  0.  0.  0.51785612  0.51785612  0.  0.  0.  0.  0. ]
 [ 0.62276601  0.  0.  0.62276601  0.  0.  0.  0.4736296  0.  0.  0. ]]

What is this dictionary (e.g. u'me': 8)?

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}

Is this a matrix or just a vector? I can't understand what this output is telling me:

[[ 0.57735027  0.57735027  0.57735027  0.  0.  0.  0.  0.  0.  0.  0. ]
 [ 0.  0.68091856  0.  0.  0.51785612  0.51785612  0.  0.  0.  0.  0. ]
 [ 0.62276601  0.  0.  0.62276601  0.  0.  0.  0.4736296  0.  0.  0. ]]

Could anybody explain these outputs to me in more detail?

Thanks!

by (33.1k points)

TfidfVectorizer is a scikit-learn class used in natural language processing. It transforms text into tf-idf feature vectors that can be used as input to an estimator.

The fitted vectorizer has a vocabulary_ attribute (not a method): a dictionary that maps each token (word) to its feature index, i.e. its column in the matrix. Each unique token gets its own feature index.

Here, the token 'me' is mapped to feature index 8, i.e. the 9th column of the output matrix (indices start at 0).
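To make the token-to-column mapping concrete, here is a minimal sketch. It assumes the vectorizer is fitted directly on the three example sentences with default settings, which is not what the question did (its vectorizer was fitted on other documents that are not shown), so the indices below differ from the ones in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: we fit on the example sentences themselves; the question's
# vectorizer was fitted on different (unshown) training documents.
docs = [
    "He watches basketball and baseball",
    "Julie likes to play basketball",
    "Jane loves to play baseball",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ is a plain dict: token -> column index in the matrix.
vocab = vectorizer.vocabulary_

# Print the tokens ordered by their column index.
for token, index in sorted(vocab.items(), key=lambda kv: kv[1]):
    print(index, token)

# One row per document, one column per vocabulary token.
print(matrix.shape)
```

The number of columns in the matrix always equals the number of entries in vocabulary_.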

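Each sentence is transformed into one vector, so the three sentences you entered produce a matrix with 3 rows (one per sentence) and 11 columns (one per vocabulary token). Within each vector, the numbers are the tf-idf weights of the corresponding features. Words that are not in the vocabulary (such as 'watches' or 'play' here) are ignored by transform.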

For example:

'julie': 4 means that in every sentence containing 'julie', the element at index 4 holds a non-zero tf-idf weight, and it is 0 in sentences without that word. As you can see in the 2nd vector:

[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]

Here, the 5th element (index 4, counting from 0) is 0.51785612, the tf-idf score for 'julie'.
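The lookup described here can be sketched in code. As an assumption, the vectorizer below is fitted on the example sentences themselves with default settings, so the column index of 'julie' and the exact weights will not match the question's output. The sketch also checks that each row has unit length, since TfidfVectorizer applies L2 normalization by default; that is why the three equal weights in the question's first row are 0.57735027, which is 1/sqrt(3).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: fitted on the example sentences, not on the question's
# original (unshown) training documents.
docs = [
    "He watches basketball and baseball",
    "Julie likes to play basketball",
    "Jane loves to play baseball",
]

vectorizer = TfidfVectorizer()
dense = vectorizer.fit_transform(docs).toarray()

# Find the column that belongs to the token 'julie'.
col = vectorizer.vocabulary_["julie"]

# Weight of 'julie' in each document: non-zero only where the word occurs.
print(dense[:, col])

# With the default norm='l2', every row vector has Euclidean length 1.
row_norms = np.linalg.norm(dense, axis=1)
print(row_norms)
```

Only the second sentence contains 'julie', so only the second entry of that column is non-zero.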
