I'm looking for a method of classifying scanned pages that consist largely of text.
Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.
More details:
- I have numerous examples of "A" and "B" images (pages), so I can do supervised learning.
- It's unclear to me how best to extract features from these images for training. What should those features be?
- The pages are occasionally rotated slightly, so it would be great if the classification were somewhat insensitive to rotation and (to a lesser extent) scaling.
- I'd like a cross-platform solution, ideally in pure Python or using common libraries.
- I've thought about using OpenCV, but it seems like a heavyweight solution for this.
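
To make the setup concrete, here is a minimal pure-Python sketch of the kind of pipeline I have in mind. The feature (fraction of dark pixels in each cell of a coarse grid) and the nearest-centroid classifier are placeholders I'm assuming purely for illustration, and all function names are hypothetical. I don't know whether they are a good choice, which is really what I'm asking. Each page is assumed to be already decoded into a 2-D list of grayscale values (0 = black, 255 = white).

```python
# Sketch only: grid ink-density features plus a nearest-centroid classifier.
# Pages are assumed to be 2-D lists of grayscale values (0 = black, 255 = white).
# All names here are hypothetical and chosen for illustration.


def grid_density(page, rows=4, cols=4, threshold=128):
    """Return the fraction of dark pixels in each cell of a rows x cols grid.

    The grid is defined relative to the page size, so the feature vector has
    the same length for any resolution, and a coarse grid changes only a
    little under a slight rotation.
    """
    height, width = len(page), len(page[0])
    features = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            dark = sum(
                1
                for y in range(y0, y1)
                for x in range(x0, x1)
                if page[y][x] < threshold
            )
            area = max(1, (y1 - y0) * (x1 - x0))  # guard against empty cells
            features.append(dark / area)
    return features


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(examples):
    """Build one centroid per label from a dict of label -> list of pages."""
    return {
        label: centroid([grid_density(page) for page in pages])
        for label, pages in examples.items()
    }


def classify(model, page):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    features = grid_density(page)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, model[label]))

    return min(model, key=distance)
```

Usage would be something like `model = train({"A": a_pages, "B": b_pages})` followed by `classify(model, page)` for each page of a burst document, where `a_pages` and `b_pages` are my labeled examples.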