in Machine Learning by (15.7k points)

I was reading about the TfidfVectorizer implementation in scikit-learn, and I don't understand the output of the method. For example:

new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']

new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)

print tfidf_vectorizer.vocabulary_

print new_term_freq_matrix.todense()

Output:

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}

[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.
   0.4736296   0.          0.          0.        ]]

What is this dictionary (e.g. what does u'me': 8 mean)?

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}

Is this a matrix or just a vector? I also can't understand what the following output is telling me:

[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.
   0.4736296   0.          0.          0.        ]]

Could anybody explain to me in more detail these outputs?

Thanks!

1 Answer

by (33.2k points)

TfidfVectorizer is a scikit-learn class used in natural language processing: it transforms text into tf-idf feature vectors that can be used as input to an estimator.
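To make "transforms text into tf-idf feature vectors" concrete, here is a minimal pure-Python sketch of the computation. It assumes the vectorizer's default settings (smoothed idf and L2 row normalization) and uses a small hypothetical corpus, not the one from the question. The helper names (tokenize, tfidf_row) are made up for illustration, and this is not the library's own code.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (NOT the corpus the question's vectorizer was fitted on).
docs = ["Julie likes basketball", "Jane loves baseball", "He likes baseball"]

def tokenize(text):
    # Approximation of the default tokenizer: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]

# Each unique token gets a column index, assigned in alphabetical order.
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in tokenized for t in doc}))}

n_docs = len(docs)
# Document frequency: in how many documents each token appears.
doc_freq = Counter(t for doc in tokenized for t in set(doc))
# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    # Raw term count times idf, then scale the row to unit L2 length.
    row = [0.0] * len(vocab)
    for tok, tf in Counter(tokens).items():
        if tok in vocab:  # tokens outside the vocabulary get no column
            row[vocab[tok]] = tf * idf[tok]
    norm = math.sqrt(sum(w * w for w in row))
    return [w / norm for w in row] if norm else row

matrix = [tfidf_row(doc) for doc in tokenized]
for row in matrix:
    print([round(w, 4) for w in row])
```

In this toy corpus 'julie' appears in one document and 'likes' in two, so 'julie' ends up with the larger weight in the first row: rarer terms score higher.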

The fitted vectorizer has a vocabulary_ attribute (it is an attribute, not a method): a dictionary that maps each token (word) to its feature index, i.e. its column in the output matrix. Each unique token gets its own feature index.

Regarding your question about u'me': 8:

Here the token 'me' is represented as feature number 8, i.e. column index 8 (counting from 0) in the output matrix.
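If it helps, the dictionary can be inverted so that you can look up which token a given column stands for. This is a small sketch that only uses the vocabulary printed in the question; the variable names are mine.

```python
# Vocabulary copied from the question's output (token -> column index).
vocabulary = {'me': 8, 'basketball': 1, 'julie': 4, 'baseball': 0, 'likes': 5,
              'loves': 7, 'jane': 3, 'linda': 6, 'more': 9, 'than': 10, 'he': 2}

# Invert it: column index -> token.
index_to_token = {index: token for token, index in vocabulary.items()}

# List the tokens in column order.
columns = [index_to_token[i] for i in range(len(vocabulary))]
print(columns)
print(index_to_token[8])
```

The column order turns out to be alphabetical, which is why 'baseball' is column 0 and 'than' is column 10.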

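So the output is a matrix, not a single vector. Each sentence is transformed into one row vector, and the three sentences you passed in give a matrix with 3 rows (one per sentence) and 11 columns (one per token in the vocabulary). Within each row, the numbers (weights) are the tf-idf scores of the corresponding features for that sentence.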
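A quick way to check this reading is to treat the printed output as a plain list of rows. The sketch below copies the numbers from the question and also checks that every row has length 1, which is what you would expect if the vectorizer was left at its default L2 normalization (an assumption, since the question does not show how it was created).

```python
import math

# Matrix copied from the question's output: one row per sentence.
matrix = [
    [0.57735027, 0.57735027, 0.57735027, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.68091856, 0.0, 0.0, 0.51785612, 0.51785612, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.62276601, 0.0, 0.0, 0.62276601, 0.0, 0.0, 0.0, 0.4736296, 0.0, 0.0, 0.0],
]

n_rows, n_cols = len(matrix), len(matrix[0])
print(n_rows, n_cols)  # 3 sentences, 11 vocabulary terms

# Euclidean (L2) length of each row vector.
norms = [math.sqrt(sum(w * w for w in row)) for row in matrix]
print([round(n, 6) for n in norms])
```

Because each row is scaled to unit length, the weights are only comparable within a sentence and through cosine similarity between sentences, not as raw counts.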

For example:

'julie': 4 --> means that 'julie' is mapped to column 4, so every sentence that contains 'Julie' has a non-zero tf-idf weight at index 4, and every sentence without it has 0 there. As you can see in the 2nd vector:

[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]

Here, the 5th element (index 4) is 0.51785612, which is the tf-idf score for 'Julie'.
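The same lookup can be done for the whole row at once. This sketch pairs the second row from the question with the printed vocabulary and keeps only the non-zero entries; the variable names are mine.

```python
# Vocabulary and second row copied from the question's output.
vocabulary = {'me': 8, 'basketball': 1, 'julie': 4, 'baseball': 0, 'likes': 5,
              'loves': 7, 'jane': 3, 'linda': 6, 'more': 9, 'than': 10, 'he': 2}
second_row = [0.0, 0.68091856, 0.0, 0.0, 0.51785612, 0.51785612,
              0.0, 0.0, 0.0, 0.0, 0.0]

index_to_token = {index: token for token, index in vocabulary.items()}

# Keep only the columns with a non-zero weight and label them with their token.
nonzero = {index_to_token[i]: w for i, w in enumerate(second_row) if w != 0.0}
print(nonzero)
```

For 'Julie likes to play basketball' only 'basketball', 'julie' and 'likes' show up. 'to' and 'play' are not in the fitted vocabulary, so they have no column and are ignored by transform.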

Study Scikit Learn Cheat Sheet to gain more insights on Tfidf. 

Hope this answer helps.
