in Machine Learning by (19k points)
I've been reading Algorithms of the Intelligent Web, which describes (page 55) an interesting algorithm called DocRank for creating a PageRank-like score for business documents (i.e. documents without links, such as PDFs, MS Word documents, etc.). In short, it analyzes the term frequency intersection between each pair of documents in a collection.

Can anyone point to interesting algorithms described elsewhere, or share something novel here, that could be applied to these types of documents to improve search results?

Please forgo answers involving things like click tracking or other actions NOT about analyzing the actual documents.

1 Answer

by (33.1k points)

For your case, here is a technique you could try:

First Technique: step-wise similarity

If you gathered a number of techniques and ranked them along two axes, ease of implementation (low inherent complexity) and performance, this technique would rank high on the first axis but might underperform against state-of-the-art techniques on the second.

We determined that low-frequency keyword intersection, combined with document similarity, is a fairly strong predictor of a document's content. Put another way: if two documents share a similar set of very low-frequency terms (e.g., domain-specific terms like 'decision manifold') and they have similar inbound traffic profiles, that combination is strongly probative of the documents' similarity.
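To make the low-frequency term part concrete, here is a minimal Python sketch. It is only an illustration under some assumptions of mine, not the exact method described above: "low-frequency" is taken to mean "appears in at most `max_doc_freq` documents of the collection", the overlap is scored with Jaccard similarity, and the function names (`tokenize`, `rare_terms`, `rare_term_similarity`) are made up for this example. The inbound traffic profile part is not modeled here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rare_terms(docs, max_doc_freq=2):
    """For each document, keep only terms found in at most max_doc_freq documents.

    The threshold is an assumed stand-in for "very low-frequency terms".
    """
    token_sets = [set(tokenize(d)) for d in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for terms in token_sets for term in terms)
    return [{t for t in terms if doc_freq[t] <= max_doc_freq} for terms in token_sets]


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rare_term_similarity(docs, max_doc_freq=2):
    """Score every pair of documents by the overlap of their rare terms."""
    rare = rare_terms(docs, max_doc_freq)
    n = len(docs)
    return {
        (i, j): jaccard(rare[i], rare[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


docs = [
    "the decision manifold of the model",
    "a decision manifold for the classifier",
    "the weather for the weekend",
]
scores = rare_term_similarity(docs)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy collection the two documents that share 'decision manifold' get the highest score, while the very common word 'the' is ignored because it appears in every document. On a real corpus you would probably want a threshold relative to the collection size, and you would then combine this score with whatever second similarity signal you have.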


Hope this answer helps you! 
