# Distinguishing overfitting vs good prediction


These are questions on how to detect and reduce overfitting in machine learning. I think many people new to machine learning will have the same questions, so I tried to be clear with my examples and questions in the hope that answers here can help others.

I have a very small sample of texts and I'm trying to predict values associated with them. I've used sklearn to calculate tf-idf, and fed those features into a regression model for prediction. This gives me 26 samples with 6323 features. Not a lot of samples, I know:

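>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

>> count_vectorizer = CountVectorizer(ngram_range=(1, 1))  # unigrams only

>> term_freq = count_vectorizer.fit_transform(texts)  # raw term counts

>> transformer = TfidfTransformer()

>> X = transformer.fit_transform(term_freq)  # tf-idf weighted features

>> print(X.shape)

(26, 6323)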

Feeding those 26 samples of 6323 features (X) and their associated scores (y) into a linear regression model gives good predictions. These are obtained using leave-one-out cross-validation, from cross_validation.LeaveOneOut(X.shape[0], indices=True):

using ngrams (n=1):

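|      | human | machine | points-off | %error |
|------|-------|---------|------------|--------|
|      | 8.67  | 8.27    | 0.40       | 1.98   |
|      | 8.00  | 7.33    | 0.67       | 3.34   |
|      | ...   | ...     | ...        | ...    |
|      | 5.00  | 6.61    | 1.61       | 8.06   |
|      | 9.00  | 7.50    | 1.50       | 7.50   |
| mean | 7.59  | 7.64    | 1.29       | 6.47   |
| std  | 1.94  | 0.56    | 1.38       | 6.91   |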

Pretty good! But when I use ngrams (n=300) instead of unigrams (n=1), I get similar results, which is obviously not right. No 300-word n-grams occur in any of the texts, so the prediction should fail, but it doesn't:

using ngrams (n=300):

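|      | human | machine | points-off | %error |
|------|-------|---------|------------|--------|
|      | 8.67  | 7.55    | 1.12       | 5.60   |
|      | 8.00  | 7.57    | 0.43       | 2.13   |
|      | ...   | ...     | ...        | ...    |
| mean | 7.59  | 7.59    | 1.52       | 7.59   |
| std  | 1.94  | 0.08    | 1.32       | 6.61   |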
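For anyone who wants to reproduce this kind of setup, here is a minimal sketch of a leave-one-out evaluation written against the current scikit-learn API (`sklearn.model_selection`). The toy `texts` and scores are hypothetical stand-ins for the 26 real samples, and `TfidfVectorizer` is used as a shorthand for `CountVectorizer` followed by `TfidfTransformer`. The comparison against a `DummyRegressor` that always predicts the mean is an addition, not part of the original experiment: it gives a reference point for how small the error can look without using the text at all.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Hypothetical toy corpus and scores, standing in for the 26 real texts.
texts = [
    "the essay is clear and well argued",
    "the essay is short and unclear",
    "a clear argument with good examples",
    "poor structure and weak examples",
    "well argued with strong structure",
    "short weak and unclear argument",
]
y = np.array([9.0, 5.0, 8.5, 4.0, 9.5, 3.5])

# Unigram tf-idf features, fitted on all texts as in the question.
X = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(texts)

loo = LeaveOneOut()
# Each prediction comes from a model trained on all the other samples.
model_pred = cross_val_predict(LinearRegression(), X, y, cv=loo)
# Baseline that ignores the text and predicts the mean of the training fold.
baseline_pred = cross_val_predict(DummyRegressor(strategy="mean"), X, y, cv=loo)

model_mae = np.mean(np.abs(y - model_pred))
baseline_mae = np.mean(np.abs(y - baseline_pred))
print(f"model points-off (mean):    {model_mae:.2f}")
print(f"baseline points-off (mean): {baseline_mae:.2f}")
```

If the model's mean points-off is not clearly smaller than the baseline's, the features are probably not contributing much to the prediction.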

Question 1: This might mean that the prediction model is overfitting the data. I only know this because I chose an extreme value for the ngrams (n=300) which I KNOW can't produce good results. But if I didn't have this knowledge, how would you normally tell that the model is over-fitting? In other words, if a reasonable measure (n=1) were used, how would you know that the good prediction was a result of being overfit vs. the model just working well?

Question 2: What is the best way of preventing over-fitting (in this situation) to be sure that the prediction results are good or not?

Question 3: If LeaveOneOut cross validation is used, how can the model possibly over-fit with good results? Over-fitting means the prediction accuracy will suffer - so why doesn't it suffer on the prediction for the text being left out? The only reason I can think of: in a tf-idf sparse matrix of mainly 0s, there is strong overlap between texts because so many terms are 0s - the regression then thinks the texts correlate highly.

Please answer any of the questions even if you don't know them all. Thanks!

by (33.2k points)

Model Overfitting:

A useful rule of thumb is that you may be overfitting when your model's performance on its training set is much better than its performance on a held-out validation set or in a cross-validation setting.
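A minimal sketch of that comparison follows. It assumes synthetic data in which the target is pure noise, with 26 samples and 500 features chosen only to mimic the "few samples, many features" shape of the question; the 5-fold split and the mean absolute error metric are likewise illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(26, 500))  # far more features than samples
y = rng.normal(size=26)         # target unrelated to X by construction

# Error on the same data the model was trained on.
model = LinearRegression().fit(X, y)
train_mae = np.mean(np.abs(y - model.predict(X)))

# Error on data the model has not seen, averaged over 5 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                            scoring="neg_mean_absolute_error")
cv_mae = -cv_scores.mean()

print(f"training MAE:         {train_mae:.4f}")
print(f"cross-validation MAE: {cv_mae:.4f}")
```

With more features than samples, ordinary least squares can fit the training data essentially exactly, so a near-zero training error next to a much larger cross-validation error is the gap the rule of thumb describes.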

A common way to test for overfitting is to plot training set error and validation set error as a function of training set size. If a stable gap between the two curves remains at the right end of the plot, you're probably overfitting.
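The numbers behind such a plot can be computed with scikit-learn's `learning_curve`. The sketch below assumes synthetic regression data and a `Ridge` model, both illustrative choices; it prints the two error curves instead of drawing them, and the resulting arrays can be passed to any plotting library.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
# Synthetic data: only the first 5 of 50 features influence the target.
X = rng.normal(size=(200, 50))
true_weights = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
y = X[:, :5] @ true_weights + rng.normal(scale=0.5, size=200)

# With 5-fold CV on 200 samples, at most 160 samples are available for training.
sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=[32, 64, 96, 128, 160],
    cv=5,
    scoring="neg_mean_absolute_error",
)

# Scores are negated errors, one column per fold; average over the folds.
train_err = -train_scores.mean(axis=1)
val_err = -val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_err, val_err):
    print(f"n={n:3d}  train MAE={tr:.3f}  validation MAE={va:.3f}  gap={va - tr:.3f}")
```

The `gap` column is what to watch: if it stops shrinking as `n` grows, more data alone is unlikely to close it.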

To guard against overfitting, use a held-out test set. Evaluate on that set only when you're completely done with model selection, and never use it for model training. The score you get on the test set is the model's final evaluation.
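One way to organize this is sketched below, assuming synthetic data, a `Ridge` model, and a small grid of `alpha` values (all hypothetical choices for illustration). Model selection runs with cross-validation on the development split only, and the test split is touched exactly once at the end.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
# Synthetic data: only the first 5 of 50 features influence the target.
X = rng.normal(size=(200, 50))
true_weights = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
y = X[:, :5] @ true_weights + rng.normal(scale=0.5, size=200)

# Set the test split aside before any model selection happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model selection: cross-validated grid search on the development split only.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X_dev, y_dev)
cv_mae = -search.best_score_

# Final evaluation: a single pass over the untouched test split.
test_mae = mean_absolute_error(y_test, search.predict(X_test))

print(f"selected alpha:       {search.best_params_['alpha']}")
print(f"cross-validation MAE: {cv_mae:.3f}")
print(f"held-out test MAE:    {test_mae:.3f}")
```

With only 26 samples a separate test split is expensive, so in that situation the split sizes are a judgment call.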

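As for how a model can overfit under leave-one-out cross-validation: you can tune the model as much as you want in this cross-validation setting until it performs nearly perfectly in cross-validation. That’s why the cross-validation score alone is not a reliable final estimate, and why the held-out test set is needed.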
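The effect of tuning against cross-validation scores can be illustrated with a small experiment. The sketch assumes purely random data, so no model can genuinely predict the target, and it uses "try 40 random 3-feature subsets and keep the best" as a stand-in for any tuning loop; the subset size and the number of candidates are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n_features = 300
# Synthetic, purely random data: no feature carries information about y.
X_dev, y_dev = rng.normal(size=(26, n_features)), rng.normal(size=26)
X_test, y_test = rng.normal(size=(1000, n_features)), rng.normal(size=1000)


def loo_mae(columns):
    """Leave-one-out mean absolute error using only the given feature columns."""
    scores = cross_val_score(LinearRegression(), X_dev[:, columns], y_dev,
                             cv=LeaveOneOut(),
                             scoring="neg_mean_absolute_error")
    return -scores.mean()


# "Tuning": try many candidate feature subsets and keep the best LOO score.
candidates = [rng.choice(n_features, size=3, replace=False) for _ in range(40)]
cv_maes = np.array([loo_mae(c) for c in candidates])
best = candidates[int(np.argmin(cv_maes))]

# Evaluate the selected configuration on data that played no part in tuning.
final = LinearRegression().fit(X_dev[:, best], y_dev)
test_mae = np.mean(np.abs(y_test - final.predict(X_test[:, best])))

print(f"median LOO MAE over candidates:  {np.median(cv_maes):.3f}")
print(f"best LOO MAE (after 'tuning'):   {cv_maes.min():.3f}")
print(f"test MAE of the selected subset: {test_mae:.3f}")
```

Because the winner is chosen for its cross-validation score, that score is an optimistic estimate by construction. The test error is the number to trust.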