0 votes
1 view
in Machine Learning by (4.8k points)

I am using sklearn in Python to do some clustering. I've trained on 200,000 documents, and the code below works well.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("token_from_xml.txt")
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)
kmresult = km.fit(tfidf).predict(tfidf)

But when I have new testing content, I'd like to assign it to the existing clusters I've already trained. So I'm wondering how to save the IDF result, so that I can compute TF-IDF for the new testing content and make sure the result for the new content has the same array length.

Thanks in advance.

UPDATE

I may need to save the "transformer" or "tfidf" variable to a file (txt or other), if one of them contains the trained IDF result.
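A toy check (with a made-up two-document corpus standing in for my file) suggests the trained IDF lives on the fitted transformer as its idf_ attribute and survives pickling:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up corpus standing in for token_from_xml.txt
corpus = ["aaa bbb ccc", "aaa bbb ddd"]

counts = CountVectorizer().fit_transform(corpus)
transformer = TfidfTransformer().fit(counts)

print(transformer.idf_)  # the trained IDF weights

# Round-trip through pickle: the learned weights survive
restored = pickle.loads(pickle.dumps(transformer))
```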

UPDATE

For example, I have the training data:

["a", "b", "c"]
["a", "b", "d"]

And after TF-IDF, the result will contain 4 features (a, b, c, d).

When I TEST:

["a", "c", "d"]

to see which cluster (already built by k-means) it belongs to. TF-IDF will only give a result with 3 features (a, c, d), so the k-means clustering will fail. (If I test ["a", "b", "e"], there may be other problems.)

So how do I store the feature list for testing data (and, even better, store it in a file)?
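Here is the mismatch reproduced on a toy corpus (multi-letter tokens, since CountVectorizer's default tokenizer drops single characters):

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["aaa bbb ccc", "aaa bbb ddd"]  # vocabulary: aaa, bbb, ccc, ddd
test = ["aaa ccc ddd"]

# Fitting a fresh vectorizer on each set gives different feature counts
print(CountVectorizer().fit_transform(train).shape[1])  # 4 features
print(CountVectorizer().fit_transform(test).shape[1])   # 3 features
```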

UPDATE

Solved, see answers below.

1 Answer

+1 vote
by (7.9k points)

I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])

vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)

# Fit the IDF on the training counts
transformer = TfidfTransformer().fit(vec_train)

# Save vectorizer.vocabulary_
pickle.dump(vectorizer.vocabulary_, open("feature.pkl", "wb"))

# Load it later
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(open("feature.pkl", "rb")))

# Use transform (not fit_transform on the test set, which would recompute the IDF)
tfidf = transformer.transform(loaded_vec.transform(np.array(["aaa ccc eee"])))

That works: tfidf has the same feature length as the training data. (If the transformer must survive a new session as well, pickle it alongside the vocabulary.)

Alternatively, you can do the vectorization and the TF-IDF transformation in one stage with TfidfVectorizer:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()

# Fit and transform on the training data
tfidf = vec.fit_transform(training_data)

# Use the fitted model to transform unseen data
unseen_tfidf = vec.transform(unseen_data)

km = KMeans(30)
kmresult = km.fit(tfidf).predict(unseen_tfidf)
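If the model also needs to survive a restart, one option (a sketch, assuming pickle is acceptable; the corpus and cluster count here are made up) is to pickle the fitted TfidfVectorizer and KMeans together:

```python
import pickle

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

training_data = ["aaa bbb ccc", "aaa bbb ddd", "ccc ddd eee"]  # hypothetical corpus

vec = TfidfVectorizer()
tfidf = vec.fit_transform(training_data)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

# Persist both fitted objects together
with open("model.pkl", "wb") as f:
    pickle.dump((vec, km), f)

# Later: reload and assign an unseen document to an existing cluster
with open("model.pkl", "rb") as f:
    vec2, km2 = pickle.load(f)
label = km2.predict(vec2.transform(["aaa ccc eee"]))
```

Unknown tokens in the unseen text simply get zero weight, so the feature length always matches the training data.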
