A Gradient Boosting Machine (GBM) is an ensemble-based supervised machine learning algorithm suitable for both regression and classification problems. The hyperparameters most commonly tuned in GBM models are described below.
Key GBM hyperparameters:
- Learning rate (shrinkage): This hyperparameter controls how strongly each new tree corrects the mistakes of the current ensemble. In every iteration, the GBM model is updated by adding the newly trained decision tree, scaled by the learning rate η, to the existing model. The learning rate η is usually set to a small value (commonly somewhere between 0.001 and 0.1) so that the algorithm compensates for its mistakes gradually. Lower learning rates typically give a more conservative model that needs more trees and takes longer to train but often generalizes better to new data.
- Number of trees / iterations: This hyperparameter determines the number of decision trees in the ensemble. More trees can improve performance but also increase the risk of overfitting. If the training and validation error (or accuracy) are plotted at every iteration, there is often a point after which the validation error begins to rise; the number of iterations at that point is a good value for this parameter. Another approach is to set the number of trees to a large value and add an early stopping criterion that terminates training when the out-of-sample deviance fails to improve for a certain number of iterations.
- Maximum depth of trees: This hyperparameter sets the maximum depth of each decision tree and therefore controls the complexity of the individual trees in the ensemble. Larger values allow more complex trees but also increase the risk of overfitting. Smaller values are usually preferred, since the idea of boosting is to combine many simple trees (weak learners) into a strong learner. Even depth = 1 often performs well; each tree is then a stump with a single split.
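To make the interaction between these three hyperparameters concrete, here is a minimal from-scratch sketch in pure Python. It assumes a single input feature, squared-error loss, and depth-1 trees (stumps); the function names (`fit_stump`, `boost`), the `patience` parameter, and the synthetic sine data are illustrative choices, not part of any library. Each iteration fits a stump to the current residuals, adds it to the model scaled by the learning rate, and stops once the validation error has not improved for `patience` iterations.

```python
import math
import random

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one split on one feature) by least squares."""
    mean_all = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], mean_all, mean_all)  # fallback if no split exists
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost(x_train, y_train, x_val, y_val, learning_rate=0.1, max_trees=200, patience=10):
    """Boost stumps on residuals; stop when validation MSE stalls for `patience` rounds."""
    base = sum(y_train) / len(y_train)  # initial model: the training mean
    train_pred = [base] * len(x_train)
    val_pred = [base] * len(x_val)
    stumps = []
    best_val, best_n = mse(y_val, val_pred), 0
    for m in range(1, max_trees + 1):
        residuals = [y - p for y, p in zip(y_train, train_pred)]
        stump = fit_stump(x_train, residuals)
        stumps.append(stump)
        # Shrinkage: add only a fraction (learning_rate) of the new tree's prediction.
        train_pred = [p + learning_rate * stump_predict(stump, x)
                      for p, x in zip(train_pred, x_train)]
        val_pred = [p + learning_rate * stump_predict(stump, x)
                    for p, x in zip(val_pred, x_val)]
        val_error = mse(y_val, val_pred)
        if val_error < best_val:
            best_val, best_n = val_error, m
        elif m - best_n >= patience:
            break  # early stopping: no improvement for `patience` iterations
    return base, stumps[:best_n], best_val

def make_data(n):
    """Synthetic data (an assumption for this sketch): noisy sine curve."""
    xs = [random.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys

random.seed(0)
x_train, y_train = make_data(60)
x_val, y_val = make_data(40)
baseline_mse = mse(y_val, [sum(y_train) / len(y_train)] * len(y_val))
base, stumps, best_val = boost(x_train, y_train, x_val, y_val)
print(f"trees kept: {len(stumps)}, validation MSE: {best_val:.3f} (baseline {baseline_mse:.3f})")
```

Rerunning `boost` with a smaller `learning_rate` will generally keep more trees before stopping, which is the trade-off between learning rate and number of trees described above.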
More GBM hyperparameters:
- Minimum number of samples required to split a node: This hyperparameter sets the minimum number of samples an internal node must contain before it can be split. Increasing this value can help prevent overfitting, but values that are too high can lead to underfitting, so the optimum value has to be found through hyperparameter tuning.
- Number of features at each split: This hyperparameter determines the number of features considered at each split. Considering fewer features than the full set helps de-correlate the trees, because the strongest predictors are not even candidates at many of the splits. De-correlated trees can improve the ensemble's prediction accuracy. In common practice, m = √f or m = f/2 is chosen, where 'm' is the number of features considered at each split and 'f' is the total number of features.
- Minimum number of samples in the leaf node: Requiring a minimum number of samples in each leaf ensures that the model only splits a node when there is sufficient data to support the split. If a leaf contains too few samples, the model may be fitting noise in the data rather than the underlying patterns. This parameter helps prevent the model from overfitting to noise and improves its ability to generalize to new data.
- Subsample ratio: This hyperparameter sets the fraction of the training data used to train each tree. A lower value can help reduce overfitting. The subsample ratio lies in the interval (0, 1]; a value of 1.0 means all training data is used for every tree.
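As a rough illustration of how the hyperparameters above map onto a concrete implementation, the sketch below sets them on scikit-learn's `GradientBoostingClassifier`. The synthetic dataset and the specific values are arbitrary assumptions chosen for demonstration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=16, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(
    learning_rate=0.05,     # learning rate (shrinkage)
    n_estimators=300,       # number of trees / boosting iterations
    max_depth=2,            # maximum depth of each tree
    min_samples_split=20,   # minimum samples required to split a node
    min_samples_leaf=10,    # minimum samples in a leaf node
    max_features="sqrt",    # features considered at each split: m = sqrt(f)
    subsample=0.8,          # fraction of training rows used to fit each tree
    random_state=0,
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Other GBM libraries expose similar settings under different names, so the parameter names here apply to scikit-learn only.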
The hyperparameters listed above are the most important ones for a GBM model. For the complete list, refer to the scikit-learn documentation.
It is important to tune these hyperparameters carefully to achieve the best performance for a given problem.
Hyperparameter tuning for GBM
To tune these hyperparameters, you can use techniques such as grid search, random search, or Bayesian optimization together with cross-validation. These techniques search over a range of hyperparameter values and select the best-performing combination based on validation performance, for example the mean cross-validation score.
Hyperparameter tuning can be time-consuming and computationally expensive, so strike a balance: tune enough to get good performance, but avoid spending too much time searching for the optimal hyperparameters.
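A random search with cross-validation might look like the following sketch, which uses scikit-learn's `RandomizedSearchCV`. The candidate values, the number of sampled combinations (`n_iter=8`), the 3-fold cross-validation, and the synthetic dataset are all assumptions made to keep the example small and fast; a real search would usually cover a wider space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=12, n_informative=5, random_state=0)

# Candidate values for each hyperparameter; combinations are sampled at random.
param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [50, 100, 200],
    "max_depth": [1, 2, 3],
    "min_samples_leaf": [1, 5, 20],
    "subsample": [0.6, 0.8, 1.0],
    "max_features": ["sqrt", None],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,            # number of sampled combinations; caps the compute budget
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

Swapping in `GridSearchCV` with a `param_grid` would evaluate every combination instead of a random sample, which is more thorough but grows quickly in cost as more values are added.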