
I'm looking for a method of classifying scanned pages that consist largely of text.

Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.

More details:

  • I have numerous examples of "A" and "B" images (pages), so I can do supervised learning.
  • It's unclear to me how best to extract features from these images for training. For example, what should those features be?
  • The pages are occasionally rotated slightly, so it would be great if the classification were somewhat insensitive to rotation and (to a lesser extent) scaling.
  • I'd like a cross-platform solution, ideally in pure Python or using common libraries.
  • I've thought about using OpenCV, but this seems like a "heavyweight" solution.

1 Answer


There are three steps to solving this problem.

  • Feature Extraction - With a large dataset to draw on, I would recommend the SIFT/SURF class of features used in object detection. Both are designed to tolerate rotation and scaling, which suits your slightly rotated pages. Harris corners and similar detectors should also be suitable.
  • Classifier Selection - A Random Forest classifier is a good choice here. The concept is simple to understand, and the model is highly flexible and non-parametric. It has very few parameters to tune, and you can also run it in a parameter selection mode during supervised training.
  • Implementation - Pure Python implementations of image processing are unlikely to be very fast. I recommend combining OpenCV for feature detection with R for the statistical work and classifiers.
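
Since the question asks for Python, here is a minimal sketch of how those three steps might fit together. It makes several assumptions that are not part of the answer above: scikit-learn stands in for R, SIFT is used rather than SURF (SURF is not included in default OpenCV builds), and a bag-of-visual-words histogram is added to turn each page's variable number of descriptors into a fixed-length feature vector. The function names, vocab_size, and the other parameter values are illustrative only.

```python
import numpy as np


def extract_descriptors(image_path, max_features=500):
    """Return an (n, 128) array of SIFT descriptors for one page image."""
    import cv2  # imported lazily so the NumPy-only helper works without OpenCV

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    sift = cv2.SIFT_create(nfeatures=max_features)  # requires OpenCV >= 4.4
    _, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:  # e.g. a blank page with no keypoints
        return np.empty((0, 128), dtype=np.float32)
    return descriptors


def bow_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word; return a normalised histogram."""
    hist = np.zeros(len(vocabulary), dtype=np.float64)
    if len(descriptors) == 0:
        return hist
    # Squared Euclidean distance from every descriptor to every visual word.
    dists = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    hist += np.bincount(words, minlength=len(vocabulary))
    return hist / hist.sum()


def train_page_classifier(image_paths, labels, vocab_size=64, seed=0):
    """Build a visual vocabulary with k-means, then fit a random forest on page histograms."""
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier

    per_page = [extract_descriptors(p) for p in image_paths]
    kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=seed)
    kmeans.fit(np.vstack(per_page))
    vocabulary = kmeans.cluster_centers_
    X = np.array([bow_histogram(d, vocabulary) for d in per_page])
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, labels)
    return vocabulary, forest


def classify_page(image_path, vocabulary, forest):
    """Predict the label ("A" or "B") for a single page image."""
    features = bow_histogram(extract_descriptors(image_path), vocabulary)
    return forest.predict(features.reshape(1, -1))[0]
```

With lists of labelled page images, you would call train_page_classifier(paths, labels) once, then classify_page(path, vocabulary, forest) for each burst page. Holding out some labelled pages to check accuracy before trusting the results is advisable, and the vocabulary size is worth tuning for your documents.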

Hope this answer helps you!
