0 votes
2 views
in Machine Learning by (19k points)

I'm looking for a method of classifying scanned pages that consist largely of text.

Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.

More details:

  • I have numerous examples of "A" and "B" images (pages), so I can do supervised learning.
  • It's unclear to me how best to extract features from these images for training; e.g., what should those features be?
  • The pages are occasionally rotated slightly, so it would be great if the classification was somewhat insensitive to rotation and (to a lesser extent) scaling.
  • I'd like a cross-platform solution, ideally in pure python or using common libraries.
  • I've thought about using OpenCV, but this seems like a "heavyweight" solution.

1 Answer

0 votes
by (33.1k points)

There are three steps to solving this kind of problem:

  • Feature Extraction - For an object-detection-style task with a large dataset like this, the SIFT/SURF class of features is a good starting point; Harris corners and similar keypoint detectors can also work well.
  • Classifier Selection - A Random Forest classifier is a good fit here. The concept is simple to understand, it is highly flexible and non-parametric, it needs very few tuning parameters, and you can run it in a parameter-selection mode during supervised training.
  • Implementation - A pure-Python implementation of the image processing will never be very fast. I recommend combining OpenCV for the feature detection with R for the statistical work and classifiers (a rough Python sketch of the whole pipeline follows below).
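
To make the three steps concrete, here is a rough, minimal sketch of the pipeline in Python. It swaps in scikit-learn for the Random Forest instead of R (since the question asked for a pure-Python/common-library solution) and uses ORB as a patent-free stand-in for SIFT/SURF. The folder layout ("A/" and "B/" directories of page images), the vocabulary size, and the helper names are illustrative assumptions, not part of the original setup:

```python
# Minimal sketch: keypoint descriptors -> bag-of-visual-words -> Random Forest.
# Assumes training pages live in "A/" and "B/" folders of PNG images (illustrative).
import glob
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

N_WORDS = 200  # size of the visual vocabulary (tunable)

# ORB is a patent-free stand-in for SIFT/SURF; its keypoint descriptors are
# likewise fairly insensitive to small rotations and scale changes.
detector = cv2.ORB_create(nfeatures=500)

def descriptors_for(path):
    """Return ORB descriptors for one page image, or None if unreadable/featureless."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return None
    _, desc = detector.detectAndCompute(img, None)
    return desc

# 1. Collect descriptors from all training pages.
labels, all_desc = [], []
for label in ("A", "B"):
    for path in glob.glob(f"{label}/*.png"):
        desc = descriptors_for(path)
        if desc is None:
            continue
        labels.append(label)
        all_desc.append(desc)

# 2. Build a visual vocabulary by clustering the pooled descriptors.
#    (k-means with Euclidean distance on ORB's binary descriptors is a
#    simplification; it is good enough for a first pass.)
kmeans = MiniBatchKMeans(n_clusters=N_WORDS, random_state=0)
kmeans.fit(np.vstack(all_desc).astype(np.float32))

# 3. Represent each page as a normalized histogram of visual words.
def bovw_histogram(desc):
    words = kmeans.predict(desc.astype(np.float32))
    hist = np.bincount(words, minlength=N_WORDS).astype(float)
    return hist / (hist.sum() or 1.0)

X = np.array([bovw_histogram(d) for d in all_desc])
y = np.array(labels)

# 4. Train and evaluate the Random Forest classifier.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
clf.fit(X, y)
```

At prediction time you would burst each document into page images, run the same descriptors_for and bovw_histogram steps on every page, and call clf.predict on the resulting histograms to label each page "A" or "B".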

Hope this answer helps you!
