Bagging, or “Bootstrap Aggregation”, refers to an ensemble design in which each member of the ensemble, such as an individual decision tree in a Random Forest, is trained on a different subset of the original dataset. A bootstrap sample, or subsample, is created by drawing a random sample comprising some fraction of the original sample size, where each observation can be chosen more than once within a subsample, or not chosen at all; this is known as sampling with replacement. The resulting variation among the training sets is intended to reduce prediction variance on unseen data. An overall prediction for a given observation is determined by aggregating the predictions from each member of the ensemble: in a regression problem this is usually done by averaging the outputs, while in classification a majority vote is most often used.
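The two ingredients above, sampling with replacement and aggregating by averaging, can be sketched in a few lines. This is a minimal illustration, not a production implementation: the base learner here is a hypothetical depth-1 regression "stump" rather than a full decision tree, and the toy data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (hypothetical, for illustration only).
X = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=X.shape)

def fit_stump(X, y):
    """Fit a depth-1 regression tree (a 'stump'): one split threshold,
    predicting the mean of y on each side of the split."""
    best = None
    for t in X:
        left, right = y[X <= t], y[X > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda x: np.where(x <= t, lo, hi)

# Bagging: each member sees a bootstrap sample drawn WITH replacement,
# so some observations repeat within a subsample and others are left out.
n = len(X)
ensemble = []
for _ in range(25):
    idx = rng.integers(0, n, size=n)          # n indices, with replacement
    ensemble.append(fit_stump(X[idx], y[idx]))

# Regression: aggregate by averaging the members' predictions.
x_new = np.array([0.1, 0.6])
pred = np.mean([m(x_new) for m in ensemble], axis=0)
print(pred.shape)  # one averaged prediction per query point
```

For classification the final line would instead take a majority vote over the members' predicted labels.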
The most commonly used Bagging algorithm in supervised machine learning is Random Forest. It takes the idea of subsampling one step further by also considering only a random subset of the available features in each decision tree. This can especially improve model performance when there are multiple highly influential features among the candidate set of predictors and the decision boundary is non-linear. Building an ensemble model with an approach such as bagging is one remedy for the overfitting that often occurs when a single decision tree is used to make predictions.
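The feature-subsampling idea can be sketched by combining bootstrap rows with a random feature subset per ensemble member. This is a hedged sketch, not Random Forest itself: the base learner below is a stand-in nearest-centroid classifier rather than a decision tree, and the dataset, member count, and feature-subset size of sqrt(d) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-class data with 6 features (hypothetical): the class means differ
# on several features, so more than one predictor is informative.
n, d = 200, 6
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + y[:, None] * np.array([1.5, 1.0, 0.8, 0.0, 0.0, 0.0])

def fit_centroid_member(X, y, feats):
    """A simple stand-in for a tree: nearest class centroid, restricted
    to the randomly chosen feature subset `feats`."""
    c0 = X[y == 0][:, feats].mean(axis=0)
    c1 = X[y == 1][:, feats].mean(axis=0)
    def predict(Xq):
        Z = Xq[:, feats]
        d0 = ((Z - c0) ** 2).sum(axis=1)
        d1 = ((Z - c1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)
    return predict

k = int(np.sqrt(d))                  # a common heuristic: sqrt(d) features per member
members = []
for _ in range(51):                  # odd member count avoids vote ties
    rows = rng.integers(0, n, size=n)              # bootstrap rows, with replacement
    feats = rng.choice(d, size=k, replace=False)   # random feature subset
    members.append(fit_centroid_member(X[rows], y[rows], feats))

# Classification: aggregate by majority vote over the members.
votes = np.stack([m(X) for m in members])          # shape (n_members, n_samples)
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
acc = (y_hat == y).mean()
print(round(acc, 2))
```

Because each member sees only some of the features, no single dominant predictor can control every tree, which is what lets the ensemble exploit several influential features at once.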