Splitting the dataset correctly is essential for building a reliable machine-learning model. If you do not divide the data properly, your model may overfit, underperform, or fail to generalize to new data. The standard approach is to divide the data into train, validation, and test sets, which can be done with train_test_split from Scikit-learn.
- Training Set is used for training the model.
- Validation Set is used for hyperparameter tuning and to make sure that there is no overfitting.
- Test Set is used to evaluate the performance of the model on unseen data.
In this blog, we are going to take you through different methods for splitting data. So let’s get started!
Method 1: Standard Train-Validation-Test Split
A common approach is to split the data in an 80:10:10 ratio, where:
- 80% of the data is used as the training data.
- 10% is used as the validation data.
- 10% is used as the testing data.
Example:
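Below is a minimal sketch of this two-step split using train_test_split (the dummy data and variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy dataset: 1,000 samples with 5 features each
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)

# First split: hold out 20% of the data for validation + test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: divide the held-out 20% equally into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(f"Training set: {len(X_train)} samples")
print(f"Validation set: {len(X_val)} samples")
print(f"Test set: {len(X_test)} samples")
```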
Output:
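```
Training set: 800 samples
Validation set: 100 samples
Test set: 100 samples
```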
Explanation:
The above code splits the dataset into 80% training, 10% validation, and 10% test sets. It calls train_test_split from Scikit-learn twice: once to separate the training set, and once to divide the remainder equally between validation and test sets.
Method 2: Stratified Splitting for Imbalanced Datasets
A normal split might result in too few minority-class examples in the training set if your dataset is imbalanced (e.g., 90% class A, 10% class B). To maintain class proportions across all sets, you can use stratified sampling.
Example:
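A minimal sketch, assuming a dummy dataset with a 90:10 class imbalance; passing stratify=y keeps that ratio in every subset:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Dummy imbalanced dataset: 90% class 0, 10% class 1
X = np.random.rand(1000, 5)
y = np.array([0] * 900 + [1] * 100)

# stratify=y preserves the 90:10 class ratio in every subset
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print("Train:", Counter(y_train.tolist()))
print("Validation:", Counter(y_val.tolist()))
print("Test:", Counter(y_test.tolist()))
```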
Output:
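```
Train: Counter({0: 720, 1: 80})
Validation: Counter({0: 90, 1: 10})
Test: Counter({0: 90, 1: 10})
```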
Explanation:
The above code splits an imbalanced dataset into 80% training, 10% validation, and 10% test sets. It uses stratified sampling, which maintains the class distribution across all subsets.
Method 3: Time-Series Data Splitting
Random splitting is incorrect for time-series data because it can leak future information into the training set. Instead of splitting randomly, you can split the data chronologically.
Example:
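One possible sketch, using a dummy daily series and plain positional slicing so no shuffling occurs:

```python
import numpy as np
import pandas as pd

# Dummy time series: 1,000 daily observations
dates = pd.date_range(start="2023-01-01", periods=1000, freq="D")
df = pd.DataFrame({"date": dates, "value": np.random.rand(1000)})

# Chronological split: no shuffling, so future rows never enter the training set
train_end = int(len(df) * 0.8)
val_end = int(len(df) * 0.9)

train = df.iloc[:train_end]
val = df.iloc[train_end:val_end]
test = df.iloc[val_end:]

print(f"Train: {len(train)} rows")
print(f"Validation: {len(val)} rows")
print(f"Test: {len(test)} rows")
```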
Output:
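```
Train: 800 rows
Validation: 100 rows
Test: 100 rows
```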
Explanation:
The above code creates a dummy time-series dataset and splits it chronologically into 80% training, 10% validation, and 10% test sets, preserving the temporal order.
Method 4: Custom Splitting Based on Dataset Size
The split ratios should change according to the size of the dataset:
- Small dataset (<10,000 samples) -> 90% train, 10% validation (a separate test set may not be needed; use k-fold cross-validation instead)
- Medium dataset (10,000 – 100,000 samples) -> 80% train, 10% validation, 10% test
- Large dataset (>100,000 samples) -> 70% train, 15% validation, 15% test
Example:
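Below is a sketch of one way such an adaptive_split helper could look; the size thresholds follow the list above, while the helper name and dummy data are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def adaptive_split(X, y, random_state=42):
    """Choose train/validation/test ratios based on dataset size, then split."""
    n = len(X)
    if n < 10_000:            # small: 90/10, no separate test set
        train_r, val_r, test_r = 0.9, 0.1, 0.0
    elif n <= 100_000:        # medium: 80/10/10
        train_r, val_r, test_r = 0.8, 0.1, 0.1
    else:                     # large: 70/15/15
        train_r, val_r, test_r = 0.7, 0.15, 0.15

    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=val_r + test_r, random_state=random_state
    )
    if test_r == 0.0:
        return X_train, y_train, X_temp, y_temp, None, None
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=test_r / (val_r + test_r), random_state=random_state
    )
    return X_train, y_train, X_val, y_val, X_test, y_test

# Sample dataset of 50,000 elements -> medium, so an 80/10/10 split
X = np.random.rand(50_000, 5)
y = np.random.randint(0, 2, 50_000)
X_train, y_train, X_val, y_val, X_test, y_test = adaptive_split(X, y)
print(len(X_train), len(X_val), len(X_test))
```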
Output:
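```
40000 5000 5000
```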
Explanation:
The above code defines an adaptive_split function that dynamically adjusts the train/validation/test split ratios based on the size of the dataset, and then applies it to a sample dataset of 50,000 elements.
Why is Data Splitting Important?
Splitting data is an important step in the development of a machine learning model. It helps ensure that the model is trained effectively, generalizes well to unseen data, and does not suffer from issues like overfitting or data leakage. Proper data splitting gives you an unbiased evaluation of the model's performance before deploying it in real-world applications.
The main reasons why splitting data is important are given below:
- Ensuring Generalization to Unseen Data
Memorizing training data is not enough for machine learning models; they should also generalize well to new, unseen data. If you train and test on the same dataset, the model will achieve high accuracy but fail to perform on real-world data. Splitting data into train, validation, and test sets helps you check whether the model is learning general patterns instead of memorizing specific examples.
Example: Overfitting vs. Generalization:
- Overfitting: Model performance is good on the training data but poor on new data.
- Generalization: The model learns patterns that are applicable to unseen data.
- To Prevent Overfitting
When the model learns too many specific details from the training data, overfitting occurs. This leads to poor performance on new data. To prevent this, you can use validation and test sets.
- Training Set: It is used to train the model.
- Validation Set: Used for hyperparameter tuning and to select the best model.
- Test Set: It is used for final performance evaluation on unseen data.
Example: if the training accuracy is 99% but the test accuracy is only 65%, the model is overfitting.
- Providing an Unbiased Estimate of Model Performance
If the model is evaluated on the same data that was used for training, you will get a biased estimate of its performance. The model may appear highly accurate, but in reality it might perform poorly on new data.
Using a separate test set gives you a realistic idea of how the model will behave when deployed.
Example: Biased vs. Unbiased Performance Estimates
| Evaluation Method | Accuracy Score |
|---|---|
| Training Data (Biased Estimate) | 98% |
| Test Data (Unbiased Estimate) | 85% |
- Reducing Data Leakage
Data leakage occurs when information from outside the training dataset is used to build the model. This leads to unrealistically high performance during training but poor real-world performance.
Example of Data Leakage:
- Using future stock prices as input features to predict stock trends.
- Applying data preprocessing (like scaling) to the entire dataset before splitting the data.
How to prevent data leakage?
1. Always split the data before applying transformations like scaling.
2. Ensure that the test data remains unseen until the final evaluation.
Example: Incorrect Approach (Data Leakage)
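An illustrative sketch of this leaky pattern (dummy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)

# WRONG: the scaler is fitted on the full dataset, so the test set's
# mean and standard deviation leak into the training data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
```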
Explanation:
The above code incorrectly applies feature scaling to the entire dataset before splitting, so the test set's statistics leak into the training data.
Example: Correct Approach (No Data Leakage)
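The same pipeline with the split performed first (again a sketch with dummy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)

# Split first, so the test set stays completely unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...then apply the same (already fitted) transformation to the test set
X_test_scaled = scaler.transform(X_test)

print("Train shape:", X_train_scaled.shape)
print("Test shape:", X_test_scaled.shape)
```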
Output:
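```
Train shape: (800, 5)
Test shape: (200, 5)
```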
Explanation:
The above code correctly splits the data first, which prevents data leakage. It then fits the scaler on the training set and applies the same transformation to the test set, ensuring proper preprocessing.
- Different Splitting Strategies
Different types of data require different splitting strategies.
| Splitting Strategy | Best Used For |
|---|---|
| Random Splitting | General datasets |
| Stratified Splitting | Imbalanced classification problems |
| Time-Based Splitting | Time-series data |
| K-Fold Cross-Validation | Small datasets |
Example: Stratified Splitting for Imbalanced Data
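A sketch along the same lines as Method 2, this time assuming 2,000 dummy samples with a 95:5 class imbalance:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Dummy dataset with a 95:5 class imbalance
X = np.random.rand(2000, 4)
y = np.array([0] * 1900 + [1] * 100)

# Stratify both splits so every subset keeps the 95:5 ratio
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print("Train:", Counter(y_train.tolist()))
print("Validation:", Counter(y_val.tolist()))
print("Test:", Counter(y_test.tolist()))
```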
Output:
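```
Train: Counter({0: 1520, 1: 80})
Validation: Counter({0: 190, 1: 10})
Test: Counter({0: 190, 1: 10})
```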
Explanation:
The above code splits an imbalanced dataset into training (80%), validation (10%), and test (10%) sets using stratified sampling, which maintains the original class distribution across all sets.
Conclusion
Splitting data correctly is important for model performance. Techniques like stratified sampling, time-based splits, adaptive splitting, and cross-validation help ensure reliable evaluations and prevent data leakage. Experimenting with different splitting techniques will help improve the generalization of machine learning models. Moreover, you should always ensure that preprocessing steps like normalization or feature scaling are fitted on the training set only before being applied to the validation or test sets. Proper data splitting also makes it easier to diagnose model issues such as overfitting and underfitting, which leads to better hyperparameter tuning and a more robust model overall. If you want to learn more about this technology, then check out our Comprehensive Data Science Course.
FAQs
1. Why is it important to split data into training, validation, and test sets?
Splitting data ensures proper model evaluation: the model trains on one set, hyperparameters are tuned on another, and generalization is tested on unseen data. This helps prevent overfitting and ensures reliable performance estimates.
2. What is the ideal ratio for splitting data into train, validation, and test sets?
The common ratio for splitting data into training, validation, and test sets is 80:10:10, where 80% goes to training, 10% to validation, and 10% to testing. It may vary depending on the size of the dataset, with smaller datasets using more training data (e.g., 90:5:5) and larger ones allowing more balanced splits (e.g., 70:15:15).
3. How do you ensure class balance when splitting data?
To ensure class balance while splitting, you can use stratified sampling (e.g., stratify=y in train_test_split), which maintains the same class distribution across train, validation, and test sets. This is especially important for imbalanced datasets.
4. Should data preprocessing be done before or after splitting the dataset?
You should always split the dataset first, then fit preprocessing steps (like scaling or encoding) on the training set only and apply them to the validation and test sets. This prevents data leakage.
5. How should time-series data be split into train, validation, and test sets?
Time series data should be split chronologically (e.g., the first 80% for training, the next 10% for validation, and the last 10% for testing). This helps to avoid data leakage and ensure realistic model evaluation.