I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?

1 Answer


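If cv is your fitted CountVectorizer and X is the document-term matrix it produced (the result of cv.fit_transform or cv.transform), then

import numpy as np
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
list(zip(cv.get_feature_names_out(),
         np.asarray(X.sum(axis=0)).ravel()))

returns a list of (term, frequency) pairs, one for each distinct term that the CountVectorizer extracted from the corpus. Summing X over axis 0 adds up each term's column across all documents, which gives that term's total count.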
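To pick stop-word candidates, it usually helps to sort those pairs by frequency. Here is a minimal sketch in plain Python; the helper name top_terms and its parameter n are made up for illustration, and the sample terms and counts are just the ones from the question rather than real output.

```python
def top_terms(terms, counts, n=10):
    """Return the n most frequent (term, count) pairs, highest count first."""
    # Cast to int so NumPy integer counts print and compare like plain ints.
    pairs = zip(terms, (int(c) for c in counts))
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(pairs, key=lambda p: (-p[1], p[0]))[:n]


# Illustrative input only (counts taken from the question). With a real corpus,
# pass cv.get_feature_names_out() and np.asarray(X.sum(axis=0)).ravel() instead.
terms = ["for", "and", "to"]
counts = [90, 123, 100]

for term, count in top_terms(terms, counts, n=2):
    print(f"{term!r} {count} times")
```

The terms at the top of this list are the ones to consider as stop-words; whether a given term should actually be dropped still depends on your task.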
