Related Questions:
– What is Gradient Boosting (GBM)?
– What are the advantages and disadvantages of a GBM model?
– What are the key hyper-parameters for a GBM model?
In Gradient Boosted Trees, the term “Gradient” refers to the gradient (or slope) of the loss function being optimized. The negative gradient points in the direction of steepest descent, i.e. the direction that produces the greatest decrease in the loss. By repeatedly moving in the direction of the negative gradient, the algorithm iteratively minimizes the loss function and improves the accuracy of the model’s predictions.
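As a toy illustration of this idea (plain gradient descent, not GBM itself), the short Python sketch below repeatedly moves a single parameter along the negative gradient of a quadratic loss; the function and variable names are purely illustrative:

```python
# Minimal gradient-descent sketch on L(w) = (w - 3)^2; names are illustrative.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

w = 0.0                     # initial guess
learning_rate = 0.1
for _ in range(50):
    w -= learning_rate * grad(w)  # step in the direction of steepest descent

print(round(w, 4))  # approaches 3.0, the minimizer of the loss
```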
A loss function measures how well the model predicts the output variable from the input features. The most commonly used loss functions are Mean Squared Error for regression and Cross-Entropy for classification. The goal of the algorithm is to minimize this loss function and thereby increase model accuracy.
Taking the regression example, the loss function L in the GBM algorithm is given by:

$$L(y, \hat{y}) = \frac{1}{2}\left(y - \hat{y}\right)^2$$

where $L$ is the loss function, $y$ is the actual value, and $\hat{y}$ is the predicted value of $y$.
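As a quick sketch of how these losses are computed in practice (the function names and sample values are illustrative, not taken from any particular library):

```python
import numpy as np

def squared_error(y, y_hat):
    # L = 1/2 * (y - y_hat)^2, averaged over the samples
    return np.mean(0.5 * (y - y_hat) ** 2)

def cross_entropy(y, p_hat, eps=1e-12):
    # Binary cross-entropy; p_hat holds predicted probabilities
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))

y = np.array([3.0, -0.5, 2.0])
y_hat = np.array([2.5, 0.0, 2.0])
print(squared_error(y, y_hat))  # ~0.0833
```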
The algorithm starts with a single decision tree, typically a simple one such as a decision stump, i.e. a tree with only one split. The residuals, or errors, of this initial model are then calculated as:
$$r_1 = y - \hat{y}_1$$

where $\hat{y}_1$ is the prediction of the initial tree.
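A small sketch of this first step, assuming scikit-learn is available; the synthetic data and parameter choices are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

# Initial model: a decision stump, i.e. a tree with a single split
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
y_hat_1 = stump.predict(X)

# Residuals of the initial model: r1 = y - y_hat_1
r1 = y - y_hat_1
print(r1[:5])
```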
The goal of the next tree is to predict the residual $r_1$. The closer the prediction of the second tree is to the residual $r_1$, the more effective the final prediction. To predict the residual $r_1$, the above loss function is minimized by differentiating it with respect to $\hat{y}$:

$$-\frac{\partial L}{\partial \hat{y}} = -\frac{\partial}{\partial \hat{y}}\left[\frac{1}{2}\left(y - \hat{y}\right)^2\right] = y - \hat{y} = r_1$$

That is, the residual is exactly the negative gradient of the loss function, which is why fitting each new tree to the residuals amounts to a gradient-descent step.
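A quick numeric check of this identity (the values are illustrative):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0])
y_hat = np.array([2.5, 0.0, 2.0])

gradient = -(y - y_hat)   # dL/dy_hat for L = 0.5 * (y - y_hat)^2
residual = y - y_hat

print(np.allclose(-gradient, residual))  # True: residual == negative gradient
```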
The prediction from the new tree, $h_1(x)$, which is fit to this negative gradient of the loss function, is then added to the existing model. This process is repeated until a specified number of trees have been added to the model or until the loss function has been sufficiently minimized. By combining multiple decision trees, each fit to the negative gradient of the loss function, the model is able to learn complex non-linear relationships between the input features and the output variable. The Gradient Boosting model is explained in further detail in this post.
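Putting the steps together, the following is a minimal from-scratch sketch of the boosting loop for squared-error loss, assuming scikit-learn; the learning rate, tree depth, and number of rounds are arbitrary illustrative choices, not recommended defaults:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 100, 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []

for _ in range(n_trees):
    residual = y - prediction           # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)

print(np.mean(0.5 * (y - prediction) ** 2))  # training loss shrinks as trees are added
```

Production libraries such as scikit-learn’s GradientBoostingRegressor, XGBoost, and LightGBM implement this same loop with many refinements (regularization, subsampling, second-order gradients, and so on).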
Also, this Stack Exchange post provides an interesting insight into this topic.