I have a classification problem that I'd like to address with a machine learning algorithm (probably Bayesian or Markovian, though the question is independent of the classifier used). Given a number of training instances, I'm looking for a way to measure the performance of a trained classifier that takes overfitting into account.

That is: given training samples N[1..100], if I run the training algorithm on every one of the samples and then use those very same samples to measure fitness, the measurement is likely to suffer from overfitting: the classifier will know the exact answers for the training instances without necessarily having much predictive power, which renders the fitness results useless.
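To make the concern concrete, here is a minimal sketch. The toy data (`make_samples`) and the deliberately naive `MemorizingClassifier` are made up for illustration and are not my actual setup. The labels carry 20% noise, so no classifier should legitimately score 100%.

```python
import random


def make_samples(n, seed=0):
    """Hypothetical toy data: label is 1 when x > 0.5, with about 20% of labels flipped."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:  # label noise
            label = 1 - label
        samples.append((x, label))
    return samples


class MemorizingClassifier:
    """Stores all training samples and predicts the label of the closest stored x."""

    def fit(self, samples):
        self.samples = list(samples)
        return self

    def predict(self, x):
        nearest = min(self.samples, key=lambda s: abs(s[0] - x))
        return nearest[1]


def accuracy(clf, samples):
    """Fraction of samples whose predicted label matches the given label."""
    correct = sum(1 for x, y in samples if clf.predict(x) == y)
    return correct / len(samples)


train = make_samples(100, seed=1)
fresh = make_samples(100, seed=2)  # samples the classifier has never seen

clf = MemorizingClassifier().fit(train)
train_acc = accuracy(clf, train)  # measured on the training samples themselves
fresh_acc = accuracy(clf, fresh)  # measured on unseen samples

print("accuracy on training samples:", train_acc)
print("accuracy on unseen samples:  ", fresh_acc)
```

The classifier finds each training sample as its own nearest neighbour, so the score on the training samples is perfect even though part of what it memorized is noise. The score on unseen samples is the number I actually care about.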

An obvious solution would be to separate the hand-tagged samples into training and test samples. I'd like to learn about methods for selecting a statistically representative set of samples for training.
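As a starting point, this is roughly the kind of split I have in mind: a stratified k-fold partition, where each fold serves once as the test set and keeps about the same class proportions as the full data. This is only a sketch under my own assumptions; the function names (`stratified_folds`, `train_test_rounds`) and the spam/ham labels are hypothetical, and I don't know whether this is the best selection method.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, keeping class proportions roughly equal per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # continue round-robin across classes so fold sizes stay balanced
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_rounds(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs; each fold is the test set exactly once."""
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Hypothetical hand-tagged labels: 30 "spam" and 70 "ham" samples.
labels = ["spam"] * 30 + ["ham"] * 70
rounds = list(train_test_rounds(labels, k=5))

for train_idx, test_idx in rounds:
    spam_in_test = sum(1 for i in test_idx if labels[i] == "spam")
    print(len(train_idx), len(test_idx), spam_in_test)
```

Training on `train_idx` and scoring on `test_idx` in each round, then averaging the scores, would give a fitness estimate in which no sample is ever used to evaluate a model that was trained on it.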

White papers, book pointers, and PDFs much appreciated!