XGBoost, which stands for Extreme Gradient Boosting, is a modern open-source implementation of the Gradient Boosting Machine (GBM) that works largely the same as standard GBM. Like regular GBM, it fits each new tree to the residuals of the previous trees and then predicts on a new observation using a linear combination of the trees, weighted by the learning rate. XGBoost provides parallel tree boosting and has become a popular machine learning library for its scalability, accuracy, and speed.
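The boosting idea described above can be sketched without XGBoost itself: each tree fits the residuals of the ensemble so far, and the final prediction is the initial guess plus a learning-rate-weighted sum of the trees. This is a minimal illustration using scikit-learn decision trees, not XGBoost's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
trees = []
pred = np.full_like(y, y.mean())   # start from a constant (mean) prediction
for _ in range(50):
    residuals = y - pred           # fit the next tree to the current residuals
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)

def predict(X_new):
    # linear combination of the trees, weighted by the learning rate
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
```

Each added tree corrects a fraction (the learning rate) of the remaining error, which is why a smaller learning rate typically needs more trees.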
Developed by Tianqi Chen, the XGBoost library implements the GBM algorithm. XGBoost improves upon the standard GBM framework with several additional features and optimizations, including:
- Regularization techniques such as L1 (Lasso) and L2 (Ridge) regularization to prevent overfitting and improve model generalization. This is done through a regularization term in the objective function that penalizes large leaf weights.
- Tree pruning to remove splits that do not contribute to the overall performance of the tree. This reduces the complexity of the tree and can lead to faster and more accurate predictions.
- Parallel processing: XGBoost can use parallel processing to speed up training on large datasets by utilizing multiple CPU cores.
- Handling missing data: XGBoost has built-in capabilities for handling missing data, allowing it to make predictions even when some features are missing.
- Built-in cross-validation capabilities allowing for more accurate model evaluation and parameter tuning.
The following picture compares XGBoost with other GBM algorithms:
In recent years, the XGBoost library has become increasingly popular, having helped several teams win Kaggle competitions on structured data. The XGBoost algorithm has been implemented in multiple programming languages, including R, Python, Scala, Julia, and Perl.
The following are some interesting resources on XGBoost:
- Official Github Repository
- Official XGBoost documentation
- XGBoost research publication by Tianqi Chen
- Slide-deck explaining XGBoost