I have been working with the CountVectorizer class in scikit-learn.
I understand that, used in the manner shown below, the final output is an array of counts of features (tokens).
The tokens are extracted from a set of keywords, e.g.
tags = [
"python, tools",
"linux, tools, ubuntu",
"distributed systems, linux, networking, tools",
]
The next step is:
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    # split the comma-separated keywords into individual tokens
    return [t.strip() for t in text.split(",")]

vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)
Where we get
[[0 0 0 1 1 0]
[0 1 0 0 1 1]
[1 1 1 0 1 0]]
This is fine, but my situation is just a little bit different.
I want to extract the features in the same way as above, but I want the rows of data to come from a different set of documents than the ones the features were extracted from.
In other words, how can I get counts for another set of documents, say,
list_of_new_documents = [
"python, chicken",
"linux, cow, ubuntu",
"machine learning, bird, fish, pig",
]
And get:
[[0 0 0 1 0 0]
[0 1 0 0 0 1]
[0 0 0 0 0 0]]
I have read the documentation for the CountVectorizer class and came across the vocabulary argument, which is a mapping of terms to feature indices, but I can't seem to get it to do what I want.
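This is roughly what I was attempting with the vocabulary argument (a sketch, assuming my tokenizer just splits on commas; the idea is to fit on tags, then pass the learned vocabulary_ to a second vectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    # split the comma-separated keywords into individual tokens
    return [t.strip() for t in text.split(",")]

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# learn the vocabulary from the original tags
vec = CountVectorizer(tokenizer=tokenize)
vec.fit(tags)

# reuse that fixed vocabulary on the new documents
new_vec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)
new_docs = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]
print(new_vec.fit_transform(new_docs).toarray())
# [[0 0 0 1 0 0]
#  [0 1 0 0 0 1]
#  [0 0 0 0 0 0]]
```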
Any advice is appreciated.