I find myself having to learn new things all the time. I've been trying to think of ways I could expedite the process of learning new subjects. I thought it might be neat if I could write a program to parse a Wikipedia article and remove everything but the most valuable information.
I started by taking the Wikipedia article on PDFs and extracting the first 100 sentences. I gave each sentence a score based on how valuable I thought it was. I ended up creating a file following this format:
I then parsed this file and attempted to find various functions that would correlate each sentence with the value I had given it. I've just begun learning about machine learning and statistics and whatnot, so I'm doing a lot of fumbling around here. This is my latest attempt: https://github.com/JesseAldridge/Wikipedia-Summarizer/blob/master/plot_sentences.py.
I tried a bunch of features that showed little or no correlation with my scores: average word length, position in the article, etc. Pretty much the only thing that produced any sort of useful relationship was the length of the string (more specifically, counting the occurrences of the lowercase letter 'e' seemed to work best). But that seems kind of lame, because it seems obvious that longer sentences would be more likely to contain useful information.
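To make the kind of check described above concrete, here is a minimal sketch of ranking candidate features by how strongly they correlate with the hand-assigned scores. It is not the code from the linked repository; the names `pearson`, `FEATURES`, and `rank_features` are hypothetical, and it assumes the sentences and scores are already loaded into two parallel lists.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either list is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def avg_word_length(sentence):
    """Mean number of characters per whitespace-separated word."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


# Hypothetical feature functions: each maps (sentence, index in article) to a number.
FEATURES = {
    "length": lambda s, i: len(s),
    "count_e": lambda s, i: s.count("e"),
    "avg_word_length": lambda s, i: avg_word_length(s),
    "position": lambda s, i: i,
}


def rank_features(sentences, scores):
    """Return (feature name, correlation) pairs, strongest correlation first."""
    results = []
    for name, fn in FEATURES.items():
        values = [fn(s, i) for i, s in enumerate(sentences)]
        results.append((name, pearson(values, scores)))
    return sorted(results, key=lambda pair: abs(pair[1]), reverse=True)
```

Calling `rank_features(sentences, scores)` returns every feature with its correlation, so a feature like `count_e` can be compared directly against plain `length` to see whether it adds anything beyond sentence length.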
At one point I thought I had found some interesting functions, but then when I tried removing outliers (by only counting the inner quartiles), they turned out to produce worse results than simply returning 0 for every sentence. This got me wondering about how many other things I might be doing wrong... I'm also wondering whether this is even a good way to be approaching this problem.
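For reference, the baseline comparison could be sketched roughly like this. Several things here are assumptions rather than what the linked code does: the outlier filter is applied to the feature value (not the score), error is measured as mean squared error, and a second baseline that always predicts the mean score is included alongside the "always 0" one. The function names are hypothetical.

```python
import statistics


def mse(predicted, actual):
    """Mean squared error between two equal-length lists."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def inner_quartiles(pairs):
    """Keep (feature, score) pairs whose feature lies between Q1 and Q3."""
    q1, _, q3 = statistics.quantiles([f for f, _ in pairs], n=4)
    return [(f, s) for f, s in pairs if q1 <= f <= q3]


def compare_to_baselines(predict, pairs):
    """Compare a scoring function against two constant predictors."""
    features = [f for f, _ in pairs]
    scores = [s for _, s in pairs]
    mean_score = statistics.mean(scores)
    return {
        "model": mse([predict(f) for f in features], scores),
        "always_zero": mse([0] * len(scores), scores),
        "always_mean": mse([mean_score] * len(scores), scores),
    }
```

Running `compare_to_baselines(predict, inner_quartiles(pairs))` puts the three errors side by side. The "always mean" number is never larger than the "always 0" number, so it is the harder of the two baselines for a candidate function to beat.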
Do you think I'm on the right track? Or is this just a fool's errand? Are there any glaring deficiencies in the linked code? Does anyone know of a better way to approach the problem of summarizing a Wikipedia article? I'd rather have a quick and dirty solution than something perfect that takes a long time to put together. Any general advice would also be welcome.