As a self-development exercise, I want to develop a simple classification algorithm that, given a particular cell of a Dilbert cartoon, is able to identify which characters are present in the cartoon (Dilbert, PHB, Ratbert, etc.).

I assume the best way to do this is to (1) apply some algorithm to the image, which converts it into a set of features, and (2) use a training set and one of many possible machine learning algorithms to correlate the presence/absence of certain features with a particular character being present in the cell.

So my questions are: (a) is this the correct approach, (b) since there are a number of classification algorithms and ML algorithms to test, what is a good methodology for finding the right one, and (c) which algorithms would you start with, given that we're essentially conducting a classification exercise on a cartoon?


This project is more challenging than most ML problems because here you will actually have to create your training data set from the raw data. For instance, take a frame and identify the two characters in it. Extract those two characters from the frame and attach to each the appropriate class label (e.g., "1" for Dilbert).

Step 1

To extract the individual characters from each of the frames comprising the Dilbert cartoon, I would suggest a spectral decomposition of each frame. If you are not familiar with this technique, at its core, it's just an eigenvector decomposition.
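To make the eigenvector idea concrete, here is a minimal NumPy sketch on a made-up six-node affinity matrix (the weights are purely illustrative and not derived from any cartoon). It builds the graph Laplacian and splits the nodes by the sign of the second eigenvector, which is the basic move that spectral clustering generalizes to pixels.

```python
import numpy as np

# Toy affinity matrix for six "pixels": two tightly linked groups of three,
# joined by a single weak link (0.1). All values are invented for illustration.
W = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
])

# Unnormalized graph Laplacian: degree matrix minus affinity matrix.
D = np.diag(W.sum(axis=1))
L = D - W

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# The eigenvector of the second-smallest eigenvalue (the Fiedler vector)
# changes sign across the weakest cut in the graph.
fiedler = eigenvectors[:, 1]
groups = (fiedler > 0).astype(int)
print(groups)
```

Which group gets 0 and which gets 1 depends on the arbitrary sign of the eigenvector, so only the partition itself is meaningful.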

If you can use Python (or R, given that you can use Python-to-R bindings like RPy), then I would strongly encourage you to look at sklearn (scikit-learn). In particular, this excellent library has several algorithms for image segmentation, one of which is based on spectral clustering. For this step in your project, you would most likely want to look at these two scikit-learn modules:

sklearn.feature_extraction

sklearn.cluster.spectral_clustering

Included with these two modules are two good example scripts, one segmenting a digital photograph and the other segmenting an image comprised of three partially superimposed circles with minimal contrast w/r/t each other and w/r/t the background. Both, I suspect, are more difficult problems than the decompositions you will need to perform. In other words, sklearn has two complete, well-documented example scripts included in the source distribution, both of which process data similar to yours. Either or both would be an excellent template for this step.
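Here is a condensed sketch in the spirit of the circles example, assuming a synthetic image of two slightly overlapping circles rather than a real Dilbert frame. The image size, circle positions, and noise level are placeholders I picked; a real frame would need its own mask and probably some tuning of the affinity function.

```python
import numpy as np
from sklearn.cluster import spectral_clustering
from sklearn.feature_extraction import image

# Synthetic stand-in for a cartoon frame: two slightly overlapping circles.
size = 100
x, y = np.indices((size, size))
circle1 = (x - 28) ** 2 + (y - 24) ** 2 < 16 ** 2
circle2 = (x - 40) ** 2 + (y - 50) ** 2 < 14 ** 2
mask = circle1 | circle2  # only foreground pixels are clustered

# Add noise so the segmentation is not trivially clean.
rng = np.random.RandomState(0)
img = mask.astype(float) + 1 + 0.2 * rng.randn(size, size)

# Build a pixel-adjacency graph whose edge values are intensity gradients,
# then turn gradients into affinities (small gradient -> strong link).
graph = image.img_to_graph(img, mask=mask)
graph.data = np.exp(-graph.data / graph.data.std())

# Partition the foreground pixels into two segments.
labels = spectral_clustering(
    graph, n_clusters=2, eigen_solver="arpack", random_state=0
)

# Map the per-pixel labels back onto the image grid (-1 marks background).
label_im = np.full(mask.shape, -1)
label_im[mask] = labels
```

For a real frame you would set n_clusters to the number of objects you expect and then inspect label_im (for example with matplotlib) to see which segments correspond to characters.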

Step 2

So that's the first step; here's the second: sort all of the components of the decomposed images into groups, one group for each Dilbert character. Next, assign a class label to each group, e.g., if there are four characters from your decomposition step, then a decent choice for class labels is "0", "1", "2", and "3". Append those class labels to the component matrices (the decomposition products from step 1) so each character matrix is mapped to its corresponding class (Dilbert character).
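One way this bookkeeping might look in NumPy, assuming every segmented patch has already been resized to the same shape (32x32 here, an arbitrary choice). The helper name build_dataset and the random patches are hypothetical stand-ins for your own sorted segments.

```python
import numpy as np

def build_dataset(groups):
    """Flatten same-shaped patches into rows and pair each with its class label.

    groups: dict mapping an integer class label to a list of 2D arrays.
    Returns (X, y) with one row of X per patch and the matching label in y.
    """
    rows, labels = [], []
    for label, patches in sorted(groups.items()):
        for patch in patches:
            rows.append(np.asarray(patch, dtype=float).ravel())
            labels.append(label)
    return np.vstack(rows), np.array(labels)

# Placeholder patches: random arrays standing in for segmented characters.
rng = np.random.default_rng(0)
groups = {
    0: [rng.random((32, 32)) for _ in range(3)],  # e.g. Dilbert
    1: [rng.random((32, 32)) for _ in range(2)],  # e.g. PHB
}

X, y = build_dataset(groups)
print(X.shape, y)
```

Keeping the labels in a separate vector y, rather than literally appending them to the matrices, is the layout scikit-learn classifiers expect.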

Step 3

Select a suitable ML technique. You have many choices for this step; the only criteria are that the technique is in the supervised category (because you have assigned class labels to your data) and that it functions as a classifier (i.e., it returns a class label, versus a regressor which outputs a numerical value). Given this is a personal project, I would choose the one that seems most interesting to you. A few that satisfy the criteria I just mentioned are multi-layer perceptron (neural network), support vector machine (SVM), and k-nearest neighbors (kNN).
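If you would rather compare candidates than pick by interest, cross-validation is a reasonable methodology. The sketch below uses scikit-learn's bundled digits data purely as a stand-in for labelled character patches, and the hyperparameters shown are untuned guesses rather than recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images flattened to 64 features per sample.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16; scale to [0, 1]

# The three classifier families mentioned above, with illustrative settings.
candidates = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", gamma="scale"),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

The ranking on your own data may differ, so treat this as a harness for comparison and not as evidence for any one technique.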

Step 4

Train, validate, and test your classifier.
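A minimal sketch of that three-way workflow, again assuming scikit-learn's digits data as a placeholder for your labelled patches. The 60/20/20 split and the kNN candidate values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# 60% training, 20% validation, 20% test, stratified by class.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

# Train: fit one model per candidate hyperparameter value.
# Validate: score each on data the model has not seen.
val_scores = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)

# Test: report accuracy once, on the held-out test set, for the chosen model.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.3f}")
```

The test set is touched only once at the end, so the reported accuracy is not biased by the hyperparameter search.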

Alternative Technique: Template Matching

Once Step 1 is completed (each image is decomposed into a set of objects, some of which will no doubt represent the characters), you can manually sift through these decomposition products and collect exemplars for each character in the cartoon. These are the templates.

Next, you compare objects segmented from an image with this set of unique templates. In scikit-image, another scipy scikit, you can use the function match_template, to which you pass a candidate image and a template image; for 2D inputs it returns a 2D array of correlation coefficients (between -1 and 1), one for each position of the template within the image.
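A small sketch of the call, assuming a random array as the "frame" and a crop of it as the "template" so that the correct location is known in advance. With real data the template would be one of your hand-picked exemplars and the peak value would be lower than 1.

```python
import numpy as np
from skimage.feature import match_template

# Placeholder frame: random noise standing in for a cartoon cell.
rng = np.random.default_rng(0)
scene = rng.random((60, 80))

# Placeholder template: a 15x20 crop whose top-left corner is at (20, 30).
template = scene[20:35, 30:50].copy()

# Normalized cross-correlation at every valid position of the template.
result = match_template(scene, template)

# The location of the peak is the top-left corner of the best match.
row, col = np.unravel_index(np.argmax(result), result.shape)
print(row, col, result.max())
```

To classify a segmented object, you could run it against every character template and pick the character whose template gives the highest peak, possibly with a minimum threshold to reject non-characters.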