I have millions of short documents (up to 30 words each) that I need to sort into several known categories. A document may match several categories (seldom, but possible), and it may also match none of them (also seldom). I also have millions of documents that have already been categorized. Speed is not a concern; I need the algorithm to categorize as correctly as possible.

What algorithm should I use? Is there an implementation of it in C#?

1 Answer

Text classification with very high accuracy (north of 95%) is possible, but to achieve it you need to do a lot of text preprocessing before you run your machine learning models.

The key challenge with text processing is to process the text sufficiently that the algorithm of choice can detect enough structure within it to decipher a clean signal.
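
To make the preprocessing step concrete, here is a minimal sketch in Python (not C#, and using only the standard library). It is one possible approach, not a recommended pipeline: the stop-word list, the token pattern, and the function names `preprocess` and `bag_of_words` are all assumptions made up for illustration. It lowercases the text, strips accents, drops stop words, and counts unigrams plus adjacent-word bigrams, which tend to help when documents are as short as 30 words.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical stop-word list for illustration only; a real one would be
# larger and tuned to the corpus.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "for"}

# Assumed token pattern: runs of lowercase ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def preprocess(text):
    """Lowercase, strip accents, tokenize, and drop stop words."""
    # Decompose accented characters, then discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = TOKEN_RE.findall(plain.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def bag_of_words(tokens):
    """Count unigrams and adjacent-word bigrams as sparse features."""
    features = Counter(tokens)
    features.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return features


tokens = preprocess("The Café is OPEN, for 24 hours!")
features = bag_of_words(tokens)
```

The resulting counts can be fed to whatever classifier you choose. Since a document may belong to several categories or to none, one common option is to train one binary classifier per category and accept every category whose score clears a threshold.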

You can refer to the following link for Text Classifier Algorithms in Machine Learning:

You can find an example here.