2 views

A pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.

Let's assume very little metadata about each item. That is, no tags, categories. Treat as one big blob of text, including the title and author name.

How do you go about finding the possibly related documents?

I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python or relying on MySQL or pgsql.

the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.

by (33.1k points)

Your question needs a vast answer, but I’d try to keep it short and crisp.

The simplest approach to this problem is called a bag of words. Each document should be reduced to a sparse vector of {word: wordcount} pairs, and you can simply use a NaiveBayes classifier at the set of vectors that represents your set of documents or compute similarity scores between each and every other bag, using k-nearest-neighbor classification. KNN is quite fast for this task but requires O(n^2) storage for the score matrix.

KNN rapidly becomes ineffective, when used on large datasets. In that case, you might consider a ranking support vector machine (SVM). SVMs are quite useful here because they don't constrain you to linear similarity measures, that's why they are still quite fast.

Stemming is another common preprocessing step for bag-of-words techniques. This works reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their sources before computing the bag of words. There are a bunch of different stemming algorithms to be used.

If the bag-of-words similarity isn't performing good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. This has the problem of producing much larger vectors, and classification will, therefore, take more work, but the similar matches you get will be much closer. You probably don't need this for semantic similarity, it's better for a case like plagiarism detection, Chunking, or reducing a document down to lightweight parse trees, can also be used, but this is more useful for things like the authorship problem.

Mapping words to concepts using a thesaurus such as WordNet, then classifying documents based on the similarity between concepts used. This mostly ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step might be rather time-consuming.

Discourse parsing, which involves parsing documents for their semantic structure, you can also run similarity classifiers on discourse trees, the same way you can be used on chunked documents.

Hope this answer helps.