In Random Forest, decision trees are constructed independently and the results are aggregated through either averaging or majority vote after all trees are created. In comparison, GBM introduces a dependency between iterations that is controlled by a learning rate parameter. GBM usually does not predict very well on the first few iterations, as the weak learners tend to be simple decision trees that do not fully capture all relationships within the feature space. However, on subsequent iterations, GBM attempts to compensate for the observations on which it predicts the poorest by giving them increased weight on subsequent trees in the ensemble. Thus, the algorithm is able to learn from its errors over time by fitting to the residuals of previous iterations, and it continues in this sequential manner until a stopping criterion is reached.
GBM also differs from Random Forest in that it usually considers the entire set of features on each iteration rather than only a subset. Since the model structure is formed from the residuals of past iterations, it is not required to introduce additional variation to the feature space in the construction of the ensemble. It also is usually not necessary to subsample the observations. Choosing a subsample size less than the full number of records can further reduce prediction variance, though with a large enough training set that has full coverage of the feature space, variance is less of a concern. Overfitting is better controlled in GBM by carefully tuning hyperparameters such as learning rate and maximum depth.
GBM is less computationally efficient compared to Random Forest because it cannot be easily parallelized across iterations due to the dependency inherent in the model’s structure. However, alternative implementations of GBM such as LightGBM and XGBoost have the capability of being run on GPUs at a fraction of the time. XGBoost also introduces additional regularization to further prevent overfitting to the training data. A properly tuned GBM algorithm is often difficult to beat in terms of predictive performance on structured data sets. It also has the advantage of requiring little in terms of data preprocessing and can handle both numeric and categorical features, though depending on the software package used, the latter might have to be coded in some numeric fashion.