XGBoost (Extreme Gradient Boosting) is a powerful and efficient machine learning algorithm. It works by combining the predictions of numerous simple models to create a strong and accurate prediction. Imagine you are trying to decide whether a fruit is an apple or an orange. One model looks at the color, another model looks at the size of the fruit, and another at the shape. Each model gives its own opinion on the fruit. The XGBoost algorithm combines all three opinions to make a better prediction.
In this blog, we will explain how the XGBoost algorithm enhances the concept of Gradient Boosting in Machine Learning. We will also walk through an example use case that covers loading a dataset, training an XGBoost model, evaluating its performance, and interpreting the results. So let’s get started!
What is Gradient Boosting?
Before diving into XGBoost, it is important to understand the concept of Gradient Boosting. Gradient Boosting is an ensemble learning technique that builds a strong predictive model by successively adding weak models (typically decision trees). Each new tree is trained to correct the errors made by the trees built before it, focusing on where the model went wrong. In Gradient Boosting, the “gradient” refers to using gradient descent to minimize the loss function.
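To make this concrete, here is a minimal sketch using scikit-learn’s GradientBoostingClassifier on a synthetic dataset; the dataset and hyperparameter values are illustrative choices, not part of any original example.

```python
# A minimal gradient boosting sketch with scikit-learn (illustrative values).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset: 1,000 samples, 10 features (assumed for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 shallow trees, each fit to the errors of the ensemble built so far.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_train, y_train)
print("Test accuracy:", gb.score(X_test, y_test))
```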
What Is the XGBoost Algorithm in Machine Learning?
XGBoost, which stands for Extreme Gradient Boosting, is a powerful machine learning tool that builds decision trees in a sequential way to improve predictions of the model. It is designed in such a way that it can work fast, handle large datasets, and can also run across multiple computers. XGBoost is widely used for tasks like predicting values (regression), sorting items into specific categories (classification), and ranking things in order (ranking).
Origin and Motivation Behind XGBoost
The XGBoost algorithm was developed by Tianqi Chen and collaborators. It made its first appearance around 2014 as an optimized version of gradient boosting. Some of the motivations behind the development of XGBoost are described below:
- Scalability and Performance: Traditional gradient boosting libraries can struggle with large datasets and take a long time to train. XGBoost introduced various optimizations, such as efficient handling of sparse data, which made model training significantly faster.
- Regularization: While traditional Gradient Boosting allowed some form of shrinkage via the learning rate, XGBoost added explicit L1 and L2 regularization terms to the objective function. This helps reduce overfitting.
- Handling Missing Data: XGBoost is also capable of handling missing data on its own. When there are some missing values, it finds the best way to split the data by sending the missing parts to the most suitable branch of the tree during the training process.
- Flexibility: XGBoost also supports many objective functions, such as logistic loss, squared error, ranking losses, and even user-defined functions.
- Community and Competitions: The performance of XGBoost on Kaggle and other data science competitions is very impressive. Hence, it is a favourite choice for many data scientists.
Core Concepts and Terminologies in XGBoost
Some of the core concepts and terminologies involved in XGBoost are explained below:
Decision Trees
XGBoost uses decision trees (specifically, CART – Classification and Regression Trees) as weak learners. Each tree splits the data into different sections and assigns a value to each section. In classification tasks, these values are scores that are translated into probabilities through a logistic transformation, whereas in regression, they are direct predictions.
Additive Training and the Objective Function
In XGBoost, the prediction of the model at iteration t is:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

Where $f_t$ stands for the new tree added at iteration t. The objective function to minimize is:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
Here,
- ℓ is the loss function (e.g., logistic loss for classification, squared error for regression).
- Ω(f) denotes the regularization term for a tree f, which is defined below:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Where T denotes the number of leaves in the tree, $w_j$ denotes the weight of leaf j, γ is the parameter that adds a penalty for each leaf in the tree, and λ controls the L2 regularization of the leaf weights.
XGBoost uses a mathematical trick called a second-order Taylor approximation to simplify the loss function. This makes it possible to quickly compute the best value (weight) to assign to each leaf and to decide where to split the tree. Although the math can look complex, the main idea is that it makes training faster and more efficient.
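For reference, the standard XGBoost derivation gives a closed-form optimal weight for each leaf and a gain score for each candidate split. Here $g_i$ and $h_i$ are the first and second derivatives of the loss for sample $i$, and $I_j$ is the set of samples that fall into leaf $j$:

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

where $G_L, H_L$ and $G_R, H_R$ are the sums of $g_i$ and $h_i$ over the left and right children of a proposed split. A split is kept only if its gain is positive, which is where the γ penalty enters.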
Regularization
Unlike some other gradient boosting frameworks, XGBoost explicitly includes L1 and L2 penalties on leaf weights (α and λ, respectively), as well as a penalty (γ) for growing new leaves. This additional regularization helps reduce overfitting, especially when you have many deep trees.
Handling Missing Values
XGBoost automatically learns how to handle missing values. During the construction of trees, for each split, it tries to send the missing values to both directions (the left child and the right child) and picks the one that has a higher gain. This built-in awareness of sparsity makes XGBoost convenient for real-world data having missing values.
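A minimal sketch of this behaviour on a tiny, made-up array containing NaN values (the data and settings are purely illustrative):

```python
# XGBoost accepts NaN directly: it learns a default direction for missing values at each split.
import numpy as np
import xgboost as xgb

# Tiny illustrative dataset with missing values (np.nan).
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 1.5],
              [4.0, 2.0]])
y = np.array([0, 1, 0, 1])

model = xgb.XGBClassifier(n_estimators=10, max_depth=2, eval_metric="logloss")
model.fit(X, y)          # no manual imputation needed
print(model.predict(X))  # predictions for the same rows
```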
Tree Pruning
Traditional decision tree algorithms grow trees until they meet a certain criterion and then optionally prune them back. XGBoost uses a max_depth parameter to limit tree growth, and it also requires a minimum loss reduction (controlled by γ) when evaluating splits. If adding a split does not reduce the objective (loss + regularization) by at least γ, the split is not made. This effectively prunes unwanted branches.
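These pruning and regularization controls map to constructor arguments in the Python API; the values below are illustrative assumptions, not recommended settings:

```python
import xgboost as xgb

# gamma: minimum loss reduction required to make a split (higher = more pruning).
# max_depth: hard limit on tree depth.
# reg_lambda / reg_alpha: L2 / L1 penalties on the leaf weights.
model = xgb.XGBClassifier(
    max_depth=4,
    gamma=1.0,
    reg_lambda=1.0,
    reg_alpha=0.1,
    n_estimators=200,
    learning_rate=0.1,
)
```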
How does XGBoost work?
XGBoost builds a series of decision trees, where each new tree tries to fix the errors made by the previous trees. The step-by-step working process of XGBoost is explained below, followed by a minimal sketch of the idea in code:
- Start with a simple model: The first step is to make a simple initial prediction. For a regression problem (number prediction), this starting prediction is simply the average value of the target variable.
- Check the errors: After the predictions are made by the first tree, you need to calculate how far these predictions are from the actual ones. The difference between the predicted values and the original values is called error.
- Train the next tree on the errors: The second tree is trained in such a way that it learns from the errors made by the first tree. It focuses on the parts where the first tree was wrong.
- Repeat the process: The process continues, with each new tree trying to correct the errors made by the previous trees. You stop adding trees when the model is good enough or when you reach a preset limit on the number of trees.
- Add everything together: At last, all the predictions made by the trees are combined. For regression tasks, this means summing up the predictions. For classification tasks, these predictions are turned into probabilities.
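The loop below is a deliberately simplified, from-scratch sketch of these steps for regression (squared error, no regularization); it illustrates the idea rather than reproducing XGBoost's actual implementation:

```python
# Simplified gradient boosting for regression: each tree is fit to the residuals of the current model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
n_trees = 50
trees = []

# Step 1: start with a simple model -- the average of the target.
prediction = np.full(len(y), y.mean())

for _ in range(n_trees):
    # Step 2: check the errors (residuals) of the current ensemble.
    residuals = y - prediction
    # Step 3: train the next tree on those errors.
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    # Step 5 (incrementally): add the new tree's contribution, scaled by the learning rate.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```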
Training an XGBoost Model
Now, let us understand the end-to-end usage of XGBoost on a binary classification task, predicting whether the tumour is malignant or benign using the Breast Cancer dataset from scikit-learn. The steps for the training process of the model are given below:
- Loading and Preparing Data
Example:
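The original code listing is not reproduced here; a minimal sketch consistent with the explanation below (load the dataset, split it 80/20, print the sample counts) might look like this:

```python
# Load the Breast Cancer dataset and split it into training and test sets.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# 80% training, 20% testing (random_state fixed for reproducibility).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])
```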
Output:
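Assuming the 80/20 split in the sketch above (the dataset has 569 samples in total), this prints:

Training samples: 455
Testing samples: 114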
Explanation:
The above code loads the breast cancer dataset from Scikit-learn. It splits the data into 80% training and 20% testing data. After that, it prints the number of samples in each set.
- Initializing the XGBoost Classifier
Example:
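Again, the original listing is not shown; a minimal sketch of the initialization step, with commonly used (assumed) settings:

```python
import xgboost as xgb

# Initialize the classifier; these hyperparameter values are illustrative, not prescriptive.
model = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    eval_metric="logloss",
    random_state=42,
)
```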
Explanation:
The above code creates an XGBoost classifier. This step only initializes the model; it does not fit it to the data or make predictions, so there is no output.
- Training and Prediction
Example:
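A minimal sketch of this step, continuing from the objects defined in the earlier sketches:

```python
# Train the model on the training set and generate predictions for the test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```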
Explanation:
The code trains the model and generates predictions, but it does not print anything to the screen, so there is no output displayed.
- Evaluating Model Performance
Example:
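A sketch of the evaluation step, using scikit-learn's accuracy_score and the predictions from the previous step:

```python
from sklearn.metrics import accuracy_score

# Compare the predictions with the true test labels.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
```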
Output:
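Accuracy: 0.9561

(The exact value may vary slightly depending on the train/test split and the XGBoost version.)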
Explanation:
On the Breast Cancer dataset, the XGBoost classifier achieves an accuracy of 0.9561 (95.61%) on the test set. This shows that XGBoost has strong predictive power.
- Interpreting Feature Importance
Instead of just checking the accuracy of the model, it is also a good idea to check which features had the biggest impact on the predictions. XGBoost lets you do this through the feature_importances_ attribute, which shows the importance of each feature.
Example:
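A minimal sketch consistent with the explanation below, pairing each importance score with its feature name (it assumes the `data` and `model` objects from the earlier sketches):

```python
import pandas as pd

# Pair each feature name with its importance score and sort in descending order.
importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))

# XGBoost also ships a built-in plotting helper:
# import matplotlib.pyplot as plt
# xgb.plot_importance(model, max_num_features=10)
# plt.show()
```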
Output:
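The exact scores depend on the run, but the output lists the top features with their importance values; here, “worst concave points” and “mean concave points” come out on top, as discussed below.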
Explanation:
In this output, the “mean concave points” and “worst concave points” are the two most important features of the classification task. Each of these columns in the original dataset is used to capture a measure related to the concavity of the border of the tumour. You can have a good understanding of the model by visualizing the most important features. XGBoost provides you with built-in plotting tools to help you do that.
Hyperparameter Tuning
While the default settings of XGBoost are effective in many scenarios, tuning its hyperparameters can boost the performance of the model. Given below are the most commonly used hyperparameters:
- Learning Rate (eta): It controls how quickly the model adapts to the problem. Lower values lead to slower but more reliable learning and require more trees to converge. Typical values range from 0.01 to 0.3.
- Number of Trees (n_estimators): It specifies the number of boosting rounds (trees). A higher number of boosting rounds can improve accuracy, but it also increases the risk of overfitting. Typical values range from 100 to 1000.
- Maximum Tree Depth (max_depth): It determines the complexity of each tree. Deeper trees can learn more complex patterns in the data, but they may also fit the training data too closely, which leads to poor performance on unseen data.
- Subsample: It is used to define the fraction of the training data that is used to grow each tree. Values that are less than 1.0 introduce randomness, which helps to reduce overfitting. The value ranges from 0.5 to 1.0.
- Column Subsampling: These settings (like colsample_bytree) are used to control the number of features (columns) used by the model while building trees. Values range between 0.5 and 1.0.
- Regularization Parameters (lambda and alpha): The lambda parameter applies L2 regularization (default = 1), while the alpha parameter applies L1 regularization (default = 0). By increasing these values, you can prevent overfitting, especially in high-dimensional tasks.
- Gamma: It is used to set the minimum loss reduction that is required to make a further partition on a leaf node. The algorithm becomes more conservative with higher values.
- Scale Pos Weight (scale_pos_weight): It is useful for imbalanced classification problems. This parameter adjusts the balance between positive and negative classes by weighing the positive class more heavily.
Hyperparameter tuning can be done by using methods like GridSearchCV or RandomizedSearchCV from the scikit-learn library for optimizing the performance of the model. Given below is an example code illustrating Hyperparameter tuning.
Example:
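The original listing is not shown; a sketch of what a grid search over a few common XGBoost hyperparameters could look like (the grid values are illustrative assumptions):

```python
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Illustrative grid over a few common hyperparameters.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
}

grid_search = GridSearchCV(
    estimator=xgb.XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)
```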
Output:
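The exact numbers depend on the grid and the data split; the script prints the best hyperparameter combination found, along with its cross-validated accuracy.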
XGBoost vs Gradient Boosting
The difference between XGBoost and Gradient Boosting is given below in a tabular format:
| Aspect | Gradient Boosting | XGBoost |
| --- | --- | --- |
| Speed | It is slower as it processes trees sequentially. | You can use it for faster training due to parallel processing. |
| Efficiency | It works well, but often takes more memory and time. | It’s more memory-efficient and is optimized for performance. |
| Regularization | There are no built-in regularization options. | You can control overfitting with L1 and L2 regularization. |
| Handling Missing Data | You have to handle missing values manually. | It automatically learns how to handle missing values. |
| Tree Pruning | It grows trees fully and prunes them back. | It uses a smarter approach based on max depth and split gain. |
| Scalability | It’s fine for small to medium datasets. | You can use it for large datasets and distributed computing. |
| Built-in Features | You have to rely on external tools for tuning and evaluation. | It provides built-in tools for cross-validation, plotting, and more. |
| Community and Support | It has community support through scikit-learn, but it is less active than XGBoost’s. | You can benefit from a strong community and wide usage in competitions. |
XGBoost vs Random Forest
The difference between XGBoost and Random Forest is given below in a tabular format:
| Aspect | Random Forest | XGBoost |
| --- | --- | --- |
| Learning Style | All the trees are built independently, in parallel. | The trees are built one after another, with each tree improving on the last one. |
| Speed | You can train the model faster as the trees are built in parallel. | The training process is slower, but you usually get more accurate results. |
| Accuracy | It performs well but sometimes misses subtle patterns. | It often achieves better accuracy because it corrects errors step by step. |
| Overfitting | It is less prone to overfitting. | The model can overfit if it is not tuned properly. |
| Interpretability | It is easier to understand and explain. | It is harder to interpret, especially with many boosting rounds. |
| Handling Missing Data | It doesn’t handle missing data well by default. | It learns how to deal with missing values automatically. |
| Hyperparameter Tuning | It works well with minimal tuning. | You have to tune it carefully for the best results. |
| Use Case | It works well when you need a quick and robust model. | It is preferred when high accuracy is crucial and tuning effort is acceptable. |
Advantages of XGBoost
Some advantages of XGBoost are given below:
- You can handle very large datasets with millions of rows efficiently.
- It uses multiple CPU cores or even GPUs to train models faster.
- It offers tuning controls and regularization options that help improve the model’s performance and prevent overfitting.
- It shows the features that are most important for making predictions, thus helping with model understanding.
- It is widely used and supported in many programming languages like Python, R, and Java.
Disadvantages of XGBoost
Some disadvantages of XGBoost are given below:
- XGBoost can require a lot of computing power, so training may be slow or fail on systems with limited resources.
- It is sensitive to noisy data and outliers, so the data needs to be cleaned before you train the model.
- The model can overfit (memorize the data too well) if you have a small dataset or use too many trees.
- Although it shows which features are most important, the logic behind its predictions is not always easy to grasp, which complicates its use in fields like healthcare or finance.
Conclusion
XGBoost is considered to be one of the most powerful and versatile machine learning algorithms available today. It delivers high performance across a wide range of tasks because of its ability to handle large datasets, its support for regularization, and advanced features like missing-value handling and parallel processing. Although it requires careful tuning and can be more complex than traditional models, you can use it for classification, regression, and ranking problems. Hence, it is important to have a good understanding of XGBoost, as it lets you improve the quality and reliability of your models efficiently.
What is XGBoost in Machine Learning – FAQs
Q1. Can I use XGBoost for time series forecasting?
Yes, you can use XGBoost for time series forecasting by framing the problem as a supervised learning task using lag features.
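A minimal sketch of that framing, assuming a univariate series held in a pandas Series (the series, lag counts, and settings are illustrative):

```python
import pandas as pd
import xgboost as xgb

# Illustrative univariate series; in practice, this would be your own time series.
series = pd.Series(range(100), dtype=float)

# Build lag features: predict the current value from the previous three values.
df = pd.DataFrame({"y": series})
for lag in (1, 2, 3):
    df[f"lag_{lag}"] = df["y"].shift(lag)
df = df.dropna()

X, y = df[["lag_1", "lag_2", "lag_3"]], df["y"]
model = xgb.XGBRegressor(n_estimators=100, max_depth=3).fit(X, y)
print(model.predict(X.tail(1)))  # one-step-ahead prediction for the last available row
```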
Q2. Is multi-class classification supported by XGBoost?
Yes, it supports multi-class classification by using the multi:softprob or multi:softmax objective.
Q3. Is it suitable to use XGBoost for small datasets?
Although XGBoost can be used for small datasets, simpler models are often recommended because XGBoost can easily overfit when the data is limited.
Q4. Can XGBoost be used with categorical variables?
XGBoost cannot be used with categorical variables directly. You need to encode them (for example, with one-hot or label encoding) before training.
Q5. Does XGBoost support early stopping?
Yes, it supports early stopping based on validation performance. This helps avoid overfitting and saves training time.