For your case, you can use standard **NLP techniques** to classify your documents into categories.

A simple starting point is the **term frequency-inverse document frequency (tf-idf)** method, which picks out the words that are most characteristic of a document relative to the rest of the corpus. Those words help you give the document a name or a class.

You can begin by converting your documents into **TF-log(1 + IDF) vectors**, i.e. each term frequency multiplied by log(1 + IDF). Term-frequency vectors are sparse, so you can store each one as a Python dict with terms as keys and counts as values, then divide each count by the document's total count to get relative frequencies.
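As a rough illustration of the dict-based approach, here is a minimal sketch using only the standard library. The function names, the whitespace tokenization, and the toy documents are my own assumptions, and IDF is taken here as (number of documents) / (number of documents containing the term), which is one common definition among several.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequencies(tokenized_docs):
    """IDF per term: number of docs / number of docs containing the term."""
    n_docs = len(tokenized_docs)
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))  # count each term once per document
    return {term: n_docs / df for term, df in doc_counts.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF * log(1 + IDF) vector stored as a dict."""
    tf = term_frequencies(tokens)
    return {
        term: freq * math.log(1.0 + idf[term])
        for term, freq in tf.items()
        if term in idf  # terms unseen in the corpus are skipped
    }

# Toy documents, invented for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell on the news".split(),
]
idf = inverse_document_frequencies(docs)
vectors = [tfidf_vector(tokens, idf) for tokens in docs]
```

In the first toy document, "mat" (which appears in only one document) ends up with a higher weight than "cat" (which appears in two), even though both occur once.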

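Another approach is to use **abs(hash(term))** as a positive integer key for each term. You can then use scipy.sparse vectors, which are handier and more efficient than Python dicts for linear algebra operations. Keep in mind that Python's built-in hash() for strings is randomized per process by default, so the keys are only stable across runs if you fix PYTHONHASHSEED or use a deterministic hash function.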
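Here is one way the hashing idea could look with scipy.sparse. This is a sketch under assumptions: I swapped abs(hash(term)) for zlib.crc32 so that indices stay the same between runs, and the feature-space size `N_FEATURES` and the helper names are hypothetical choices of mine.

```python
import zlib
from collections import Counter
from scipy.sparse import csr_matrix

N_FEATURES = 2 ** 18  # assumed size of the hashed feature space

def term_index(term, n_features=N_FEATURES):
    # zlib.crc32 stands in for abs(hash(term)): it is deterministic,
    # whereas the built-in str hash changes between processes by default.
    return zlib.crc32(term.encode("utf-8")) % n_features

def hashed_vector(tokens, n_features=N_FEATURES):
    """1 x n_features sparse row of relative term frequencies."""
    # Terms that collide on the same index simply share one count.
    counts = Counter(term_index(t, n_features) for t in tokens)
    total = sum(counts.values())
    cols = list(counts.keys())
    data = [c / total for c in counts.values()]
    rows = [0] * len(cols)
    return csr_matrix((data, (rows, cols)), shape=(1, n_features))

a = hashed_vector("the cat sat on the mat".split())
b = hashed_vector("the dog chased the cat".split())

# Element-wise product summed up gives the dot product of the two rows.
dot = a.multiply(b).sum()
```

The modulo keeps every index inside a fixed-width vector, at the cost of occasional collisions between unrelated terms.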

You should also build about 150 frequency vectors, one per category, by averaging the frequency vectors of all the labeled documents belonging to that category. Then, to label a new document, compute the **cosine similarity** between the document vector and each category vector and select the most similar category as the label.
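The averaging and cosine-similarity steps can be sketched as follows, with sparse vectors stored as dicts. The function names, labels, and weights are made up for illustration and are not taken from your data.

```python
import math
from collections import defaultdict

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def category_centroids(labeled_vectors):
    """Average the vectors of all documents sharing a label."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for label, vector in labeled_vectors:
        counts[label] += 1
        for term, weight in vector.items():
            sums[label][term] += weight
    return {
        label: {term: total / counts[label] for term, total in terms.items()}
        for label, terms in sums.items()
    }

def classify(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(
        centroids,
        key=lambda label: cosine_similarity(vector, centroids[label]),
    )

# Toy vectors with made-up weights, for illustration only.
training = [
    ("pets", {"cat": 0.5, "mat": 0.5}),
    ("pets", {"dog": 0.6, "cat": 0.4}),
    ("finance", {"stocks": 0.7, "news": 0.3}),
]
centroids = category_centroids(training)
predicted = classify({"dog": 0.5, "news": 0.1}, centroids)  # "pets"
```

With 150 categories, classifying one document costs 150 similarity computations, which stays cheap because only the nonzero terms of the document are visited.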

If this doesn’t work well for you, try training a logistic regression model with an L1 penalty. The L1 penalty adds the sum of the absolute values of the coefficients to the loss, which limits their size. L1 can also yield sparse models (i.e. models with few nonzero coefficients).
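A minimal scikit-learn sketch of an L1-penalized logistic regression might look like this. The toy corpus, the labels, and the value of `C` are assumptions of mine. Note also that `TfidfVectorizer` uses scikit-learn's own smoothed tf-idf weighting, which is close in spirit to, but not identical to, TF-log(1 + IDF). Parameter spellings can change between releases, so check the documentation for your installed version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, invented for illustration.
texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "my dog sleeps on the mat",
    "stocks fell on the news",
    "the market rallied as stocks rose",
    "investors sold stocks after the news",
]
labels = ["pets", "pets", "pets", "finance", "finance", "finance"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse document-term matrix

# liblinear is one of the solvers that supports the L1 penalty.
# C is the inverse regularization strength: smaller C, stronger penalty.
model = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
model.fit(X, labels)

# With L1, many coefficients are typically driven to exactly zero.
n_nonzero = int((model.coef_ != 0).sum())
n_total = model.coef_.size

prediction = model.predict(vectorizer.transform(["the dog sat on the mat"]))[0]
```

Comparing `n_nonzero` with `n_total` shows how sparse the fitted model is. Lowering `C` strengthens the penalty and usually zeroes out more coefficients.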

The vectors used to train your logistic regression model should be the previously introduced TF-log(1 + IDF) vectors to get good precision and recall. The scikit-learn library offers the **sklearn.metrics** module, whose functions compute these scores from a model's predictions on a given dataset.
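For the scoring step, here is a small sketch with sklearn.metrics. The `y_true` and `y_pred` lists are invented. In practice `y_pred` would come from calling `predict` on documents the model was not trained on.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and model predictions for six held-out documents.
y_true = ["pets", "pets", "finance", "finance", "pets", "finance"]
y_pred = ["pets", "finance", "finance", "finance", "pets", "pets"]

accuracy = accuracy_score(y_true, y_pred)

# average="macro" computes the metric per category and takes the
# unweighted mean, so small categories count as much as large ones.
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With many categories of uneven size, macro-averaged precision and recall tend to be more informative than plain accuracy.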

Hope this answer helps.