in Machine Learning by (19k points)
I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop words. For example:

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?

1 Answer

by (33.1k points)

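If cv is your fitted CountVectorizer and X is the vectorized corpus (the sparse matrix returned by cv.fit_transform), then

import numpy as np

list(zip(cv.get_feature_names_out(),
         np.asarray(X.sum(axis=0)).ravel()))

returns a list of (term, frequency) pairs for each distinct term in the corpus that the CountVectorizer extracted. X.sum(axis=0) adds up each column, giving the total count of each term across all documents. In Python 3, zip alone returns an iterator, hence the list() call. On scikit-learn versions older than 1.0, use cv.get_feature_names() instead; that method was removed in 1.2.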

