+1 vote
2 views
in Machine Learning by (330 points)

How should I divide my data into training and validation sets? Should I split it 50/50, is there some other criterion for choosing the split, or does it depend on the application? I am currently using 80% of the data for training and 20% for validation. Can anyone experienced in machine learning advise me on this?

1 Answer

+2 votes
by (10.9k points)

@kavita, there are two main concerns when choosing the split:

1. With less training data, your parameter estimates have greater variance.

2. With less testing data, your performance statistic has greater variance.

The data should be divided so that neither variance is too high. An 80/20 split, often associated with the Pareto principle, is a commonly used ratio.
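As a rough illustration of an 80/20 split, here is a minimal sketch that uses only the Python standard library. The function name `split_indices` and its parameters are made up for this example. If you use scikit-learn, `train_test_split(X, y, test_size=0.2)` does the same job.

```python
import random

def split_indices(n_samples, val_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train/validation."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_val = max(1, round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

# Hypothetical dataset of 1000 rows, split 80/20.
train_idx, val_idx = split_indices(1000, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```

Shuffling before splitting matters when the rows are stored in some meaningful order, for example sorted by class or by date.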

Assuming you have enough data for a proper split, the following steps are an instructive way to get a handle on both variances:

  1. Split the data into training and testing sets.
  2. Then split the training data into training and validation sets.
  3. Subsample random selections of the training data, train the classifier on each, and record its performance on the validation set.
  4. Try different sizes of training subsample; you should see better performance with more data.
  5. To get a handle on the variance due to the amount of test data, follow the same procedure in reverse: train on all the training data, then evaluate on random subsamples of the validation set.
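The steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the synthetic one-dimensional data, the toy threshold classifier, the subsample fractions, and the number of repeats. Substitute your own data and model.

```python
import random
import statistics

def make_data(n, rng):
    """Hypothetical 1-D data: class 0 ~ N(0, 1), class 1 ~ N(2, 1)."""
    labels = [rng.randint(0, 1) for _ in range(n)]
    return [(rng.gauss(2.0 * y, 1.0), y) for y in labels]

def fit_threshold(train):
    """Toy classifier: cut halfway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if not xs0 or not xs1:  # a subsample may miss one class entirely
        return statistics.mean(x for x, _ in train)
    return (statistics.mean(xs0) + statistics.mean(xs1)) / 2

def accuracy(threshold, data):
    """Fraction of points whose side of the threshold matches the label."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

def summarize(scores):
    """Mean and sample standard deviation of a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)

rng = random.Random(0)
data = make_data(1000, rng)

# Steps 1-2: hold out a test set, then split the rest into validation/training.
# The test set stays untouched until the final evaluation.
test, rest = data[:200], data[200:]
val, train = rest[:160], rest[160:]

fractions = (0.1, 0.25, 0.5, 1.0)
repeats = 20

# Steps 3-4: train on random subsamples of the training data of varying size
# and score each fitted classifier on the full validation set.
train_size_stats = {}
for frac in fractions:
    k = int(frac * len(train))
    scores = [accuracy(fit_threshold(rng.sample(train, k)), val)
              for _ in range(repeats)]
    train_size_stats[frac] = summarize(scores)

# Step 5 (the reverse): train once on all training data, then score on
# random subsamples of the validation set of varying size.
full_threshold = fit_threshold(train)
val_size_stats = {}
for frac in fractions:
    k = int(frac * len(val))
    scores = [accuracy(full_threshold, rng.sample(val, k))
              for _ in range(repeats)]
    val_size_stats[frac] = summarize(scores)

for frac in fractions:
    print(frac, train_size_stats[frac], val_size_stats[frac])
```

Each printed pair is (mean accuracy, standard deviation). Comparing the standard deviations across fractions gives a feel for how much each variance grows as the training or validation set shrinks, which is what should guide the choice of split.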