
I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.

I've been exploring NLTK and its Naive Bayes Classifier. It seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).

My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8GB). Furthermore, the accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).

Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into one by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?

by (33.1k points)

For your case, a few standard NLP techniques should be enough to classify your documents into categories.

You can use the term frequency-inverse document frequency (tf-idf) method to pick out the words that matter most in each document relative to the rest of the corpus; those words are what point to the document's class.

You can begin by converting your documents into TF-log(1 + IDF) vectors. Term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, and then divide by the total count to get the frequencies.
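As a rough illustration of the dict-based approach, here is a minimal sketch. It assumes that "TF-log(1 + IDF)" means term frequency multiplied by log(1 + N / document frequency), and that documents are already tokenized; the function names and the toy `docs` list are made up for the example.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Map each term to its count divided by the document's total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def idf_weights(tokenized_docs):
    """Compute log(1 + IDF) per term, assuming IDF = N / document frequency."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(1 + n_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF * log(1 + IDF) vector as a dict; terms missing from idf are skipped."""
    tf = term_frequencies(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}

# Toy corpus (hypothetical); real tokenization would be more careful than split().
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell on the news".split(),
]
idf = idf_weights(docs)
vectors = [tfidf_vector(tokens, idf) for tokens in docs]
```

With 300k documents you would probably compute the document frequencies in one streaming pass over the database and build the vectors in a second pass, rather than holding all tokens in memory.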

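Another approach is to use abs(hash(term)) to turn terms into positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than a Python dict. Note that in Python 3 the built-in hash() of a string changes between processes unless PYTHONHASHSEED is fixed, so the keys are only stable within a single run.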
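A possible sketch of the hashing idea follows. It swaps `abs(hash(term))` for `zlib.crc32`, which gives the same index in every process, and takes the result modulo a fixed feature-space size; both the swap and the size `N_FEATURES` are assumptions made for this example, not part of the suggestion above.

```python
import zlib
from scipy.sparse import csr_matrix

N_FEATURES = 2 ** 18  # assumed size of the hashed feature space

def term_index(term):
    """Stable non-negative column index for a term (stand-in for abs(hash(term)))."""
    return zlib.crc32(term.encode("utf-8")) % N_FEATURES

def to_sparse_rows(dict_vectors):
    """Stack {term: weight} dicts into one CSR matrix, one row per document."""
    data, rows, cols = [], [], []
    for row, vec in enumerate(dict_vectors):
        for term, weight in vec.items():
            rows.append(row)
            cols.append(term_index(term))
            data.append(weight)
    # If two terms in a document collide on the same column, csr_matrix sums them.
    return csr_matrix((data, (rows, cols)), shape=(len(dict_vectors), N_FEATURES))

# Hypothetical weights, e.g. the output of a tf-idf step.
X = to_sparse_rows([{"cat": 0.5, "mat": 0.2}, {"dog": 0.7}])
```

The trade-off is that you no longer need a vocabulary in memory, at the cost of occasional collisions between unrelated terms.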

You should also build about 150 frequency vectors, one per category, by averaging the vectors of all the labeled documents that belong to that category. Then, for a new document to label, you can compute the cosine similarity between the document vector and each category vector, and select the most similar category as the label for your document.
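The centroid approach described above could look roughly like this. It is a sketch on plain dict vectors with made-up category names and weights; `centroid`, `cosine`, and `classify` are hypothetical helpers, not library functions.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Average a list of sparse {term: weight} vectors into one category vector."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse dict vectors (0.0 if either is empty)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(doc_vec, centroids):
    """Return (best_category, similarity) for the most similar centroid."""
    scored = ((cat, cosine(doc_vec, vec)) for cat, vec in centroids.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical labeled vectors grouped by category.
labeled = {
    "pets": [{"cat": 0.6, "dog": 0.4}, {"dog": 0.9, "leash": 0.1}],
    "finance": [{"stocks": 0.7, "bond": 0.3}, {"bond": 0.5, "market": 0.5}],
}
centroids = {cat: centroid(vecs) for cat, vecs in labeled.items()}
label, score = classify({"dog": 0.8, "cat": 0.2}, centroids)
```

Because `classify` also returns the similarity, one way to get a "none of the above" outcome would be to reject the label when the score falls below a threshold tuned on held-out labeled documents; that threshold is an assumption you would have to validate yourself.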

If this doesn’t seem to work for you, then try training a logistic regression model with an L1 penalty. The L1 penalty is proportional to the sum of the absolute values of the coefficients, so it limits their size and tends to push many of them to exactly zero. L1 can therefore yield sparse models (i.e. models with few non-zero coefficients).

The vectors used to train your logistic regression model should be the previously introduced TF-log(1 + IDF) vectors to get good accuracy (precision and recall). The scikit-learn library offers a sklearn.metrics module with functions that compute these scores for a given model's predictions on a given dataset.
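For illustration, here is a small sketch of L1-penalized logistic regression with scikit-learn, scored with sklearn.metrics. Several things are assumptions: it uses `TfidfVectorizer`, whose default weighting is scikit-learn's own tf-idf variant rather than exactly TF-log(1 + IDF); the `saga` solver and `C=10.0` are arbitrary choices; and the texts and labels are toy data. Newer scikit-learn releases may also emit a deprecation warning for the `penalty` argument.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Toy labeled data (hypothetical); in practice this comes from the 300k tagged documents.
train_texts = [
    "the cat chased the dog", "my dog loves the park",
    "cats and dogs need a vet", "the puppy chewed the leash",
    "stocks fell on weak earnings", "the bond market rallied",
    "investors sold shares today", "the bank raised interest rates",
]
train_labels = ["pets"] * 4 + ["finance"] * 4
test_texts = ["the dog saw a cat", "shares and bond prices fell"]
test_labels = ["pets", "finance"]

# Fit the tf-idf weighting on the training texts only, then reuse it for new texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# saga is one of the solvers that supports the L1 penalty.
clf = LogisticRegression(penalty="l1", solver="saga", C=10.0,
                         max_iter=1000, random_state=0)
clf.fit(X_train, train_labels)

predictions = clf.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)
# Per-class precision and recall in one text report.
report = classification_report(test_labels, predictions, zero_division=0)
```

On the real data you would hold out part of the labeled documents for evaluation and tune `C`, since a smaller `C` means a stronger penalty and a sparser model.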