I have a ton of short stories, each about 500 words long, and I want to categorize each of them into one of, let's say, 20 categories:

  • Entertainment
  • Food
  • Music
  • etc

I can hand-classify a bunch of them, but eventually I want to use machine learning to guess the categories. What's the best way to approach this? Is there a standard machine learning approach I should be using? I don't think a decision tree would work well since it's text data... I'm completely new to this field.

Any help would be appreciated, thanks!

1 Answer

A naive Bayes classifier will most likely work well for you. The method goes like this:

Fix the set of categories, then train on a data set of (document, category) pairs.

The data vector for a document is something like a bag of words: e.g. take the 100 most common words, excluding words like "the", "and" and such. Each word gets a fixed component of the data vector. The feature vector is then an array of booleans, each indicating whether the corresponding word came up in the document.
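
As a rough illustration of the feature vector described above, here is a minimal Python sketch. The stop-word list, the tokenizer, and the function names are all assumptions made up for this example, and the toy documents are invented; a real stop-word list would be much longer.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, size=100):
    """Return the `size` most common non-stop-words across all documents."""
    counts = Counter(
        word
        for doc in documents
        for word in tokenize(doc)
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(size)]

def to_features(text, vocabulary):
    """Boolean vector: True where the vocabulary word occurs in the text."""
    words = set(tokenize(text))
    return [word in words for word in vocabulary]

# Invented toy documents, only to show the shapes involved.
docs = ["The pizza and the pasta", "The band played music", "Pizza with music"]
vocab = build_vocabulary(docs, size=3)
features = to_features("I love pizza", vocab)
```

Here `vocab` holds three words and `features` is a list of three booleans in the same order, so position i of every feature vector always refers to the same word.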


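For your training set, calculate the probability of every class, and of every feature within each class:

P(C) = number of documents of class C / total number of documents

Calculate the probability of a feature within a class: P(F|C) = number of documents of class C that have the feature (e.g. the word "food" is in the text) / number of documents of class C. If a word never shows up in some class, this estimate is 0 and wipes out the whole product later, so it is common to add 1 to the numerator and 2 to the denominator (Laplace smoothing).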
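
The two counts above could be computed roughly like this. It is a sketch, not a reference implementation: the function name `train`, the data layout (a list of (boolean vector, category) pairs), and the `smoothing` parameter are assumptions of this example. Smoothing is add-one (Laplace) by default; passing 0 gives the plain ratios.

```python
from collections import Counter, defaultdict

def train(examples, smoothing=1.0):
    """examples: list of (boolean feature vector, category) pairs.

    Returns (prior, cond) with prior[c] = P(C) and cond[c][i] = P(F_i | C).
    `smoothing` is an assumption of this sketch; 0 gives the raw ratios.
    """
    n_features = len(examples[0][0])
    class_counts = Counter(category for _, category in examples)

    # feature_counts[c][i] = number of documents of class c containing word i
    feature_counts = defaultdict(lambda: [0] * n_features)
    for features, category in examples:
        for i, present in enumerate(features):
            if present:
                feature_counts[category][i] += 1

    total = len(examples)
    prior = {c: n / total for c, n in class_counts.items()}
    cond = {
        c: [
            (feature_counts[c][i] + smoothing) / (class_counts[c] + 2 * smoothing)
            for i in range(n_features)
        ]
        for c in class_counts
    }
    return prior, cond

# Invented toy training set with two features.
examples = [
    ([True, False], "Food"),
    ([True, True], "Food"),
    ([False, True], "Music"),
]
prior, cond = train(examples, smoothing=0)
smoothed_prior, smoothed_cond = train(examples)
```

With `smoothing=0` the Music class gets probability 0 for the first word, which is exactly the case that breaks the later log step; the smoothed version keeps every probability strictly between 0 and 1.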


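Given an unclassified document with feature values F1, ..., F500 (assuming 500 features here), the probability of it belonging to class C is proportional to the product of the class probability and all the feature probabilities:

P(C|F1, ..., F500) ∝ P(C) * P(F1|C) * P(F2|C) * ... * P(F500|C)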


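Since multiplying many small probabilities easily underflows in floating point, you can use the sum of the logs instead, which is maximized by the same C:

log P(C) + log P(F1|C) + log P(F2|C) + ... + log P(F500|C)

This equals log P(C|F1, ..., F500) up to a constant that does not depend on C, so you simply pick the class with the largest sum.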
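
The log-sum scoring could look roughly like the sketch below. The function name `classify` and the probability tables are invented for illustration. One assumption goes beyond the formula as written: for a word that is absent from the document, the sketch uses log(1 - P(F|C)), which is the usual reading for boolean features. All probabilities must be strictly between 0 and 1, otherwise `math.log` raises an error.

```python
import math

def classify(features, prior, cond):
    """Return the class with the highest log score for a boolean feature vector.

    Present words contribute log P(F_i|C); absent words contribute
    log(1 - P(F_i|C)) (an assumption of this sketch, see the lead-in).
    """
    best_class, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for present, p in zip(features, cond[c]):
            score += math.log(p if present else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical probabilities for two classes and two features.
prior = {"Food": 0.5, "Music": 0.5}
cond = {"Food": [0.8, 0.2], "Music": [0.1, 0.7]}
label = classify([True, False], prior, cond)
```

Because only the ranking of the scores matters, there is no need to convert the sums back into probabilities before picking the class.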

Hope this answer helps.

