I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weights for each document vector, and then use the resulting matrix to train a model for document classification.
I would like to classify new documents in the future. But to classify a new document, I first need to turn it into a document-term vector, and that vector should be composed of TF*IDF values too.
My question is, how could I calculate the TF*IDF with just a single document?
As far as I understand, TF can be calculated from a single document by itself, but IDF can only be calculated from a collection of documents. In my current experiment, I actually calculate the TF*IDF values over the whole collection of documents, and then use some documents as the training set and the rest as the test set.
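To make my current setup concrete, here is a minimal sketch of what I do now, using a toy corpus (the documents and tokenizer are placeholders, not my real data):

```python
import math

# Toy corpus standing in for my real collection (hypothetical data).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Document frequencies are computed over the WHOLE collection --
# this is the part that bothers me: IDF is fixed before I split
# the rows into training and test sets.
vocab = sorted({t for d in docs for t in d.split()})
N = len(docs)
df = {t: sum(t in d.split() for d in docs) for t in vocab}
idf = {t: math.log(N / df[t]) for t in vocab}

def tfidf_vector(doc):
    """TF*IDF vector for one document, over the fixed vocabulary."""
    tokens = doc.split()
    return [tokens.count(t) / len(tokens) * idf[t] for t in vocab]

matrix = [tfidf_vector(d) for d in docs]
# I then split the rows of `matrix` into training and test sets,
# so the test documents have already influenced the IDF values.
```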
I just realized that this doesn't seem very applicable to real life.
So there are actually two subtly different classification scenarios:

1. classifying documents whose content is known but whose labels are not;
2. classifying totally unseen documents.
For scenario 1, we can combine all the documents, both labeled and unlabeled, and compute TF*IDF over all of them. This way, even though we only use the labeled documents for training, the training result will still reflect the influence of the unlabeled documents.
But my scenario is 2.
Suppose I have the following information for term T from the summary of the training set corpus:
- document count for T in the training set is n
- total number of training documents is N
Should I calculate the IDF of T for an unseen document D as below?

IDF(T, D) = log((N+1)/(n+1))
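In code, with hypothetical counts for N and n, the formula I am proposing would just be:

```python
import math

# Hypothetical training-set statistics for term T (made-up numbers).
N = 1000   # total number of training documents
n = 42     # training documents containing T

# The smoothed IDF I am considering for an unseen document D:
idf_T = math.log((N + 1) / (n + 1))
```

The +1 in both numerator and denominator is my attempt to keep the value finite even when n = 0; I am not sure this is the standard way to do it.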
And what if I encounter a term in the new document that didn't appear in the training corpus at all? How should I calculate its weight in the doc-term vector?