Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?

Question

asked Jul 8, 2019 in Machine Learning by ParasSharma1 (19k points)

I have been working with the CountVectorizer class in scikit-learn.

I understand that if used in the manner shown below, the final output will consist of an array containing counts of features, or tokens.

These tokens are extracted from a set of keywords, i.e.

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

The next step is:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print data

Where we get

[[0 0 0 1 1 0]
[0 1 0 0 1 1]
[1 1 1 0 1 0]]

This is fine, but my situation is just a little bit different.

I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.

In other words, how can I get counts of another set of documents, say,

list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"]
]

And get:

[[0 0 0 1 0 0]
[0 1 0 0 0 1]
[0 0 0 0 0 0]]

I have read the documentation for the CountVectorizer class and came across the vocabulary argument, which is a mapping of terms to feature indices. I can't seem to get this argument to help me, however.

Any advice is appreciated.

1 Answer

Anurag · Answer 1 · 2019-07-08T14:52:29+0000

You can try this following code implementation for vocabulary in NLP using scikit learn:

>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

You can pass it in a dict with your desired features as the keys.

If you want to use CountVectorizer on one set of docs, then you should also use the set of features from those documents for a new set, try using the vocabulary_ attribute of your original CountVectorizer and pass it to the new one.

For example:

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

The above code will create a new tokenizer using the vocabulary from your first one.

Hope this answer is helpful.

If you want to know more about scikit learn then visit this Scikit Learn Tutorial.

Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?

1 Answer

Related questions

Browse Categories