Random forest is a type of supervised machine learning algorithm that is used for both classification and regression problems, including both binary and multi-class classification. It belongs to the family of ensemble learning techniques, which combine the predictions of multiple individual models to achieve better overall performance.
The algorithm works by creating a forest (collection) of decision trees, where each tree is trained on a randomly sampled subset of the training data and a random subset of the features. The randomness in the sampling ensures that each tree is slightly different from the others, which helps to reduce overfitting and increase the overall accuracy. When making a prediction for a new instance, each tree in the forest independently predicts the target variable, and the final prediction is determined by aggregating the individual predictions through voting (for classification problems) or averaging (for regression problems).
The key steps in the random forest algorithm are as follows:
1. Generate a bootstrap sample from the training data.
2. Randomly select a subset of the features.
3. Build a decision tree on the bootstrapped data using the selected features.
4. Repeat steps 1-3 to create a forest of decision trees.
5. To make a prediction for a new instance, pass the instance through each tree in the forest and aggregate the individual predictions.
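The steps above can be sketched in a few dozen lines. The sketch below is illustrative only: for brevity it uses one-level trees (decision stumps that split a single feature at its mean) as the base learners, whereas a real implementation would grow full decision trees and redraw the candidate-feature subset at every split. All function names and the toy dataset are made up for this example.

```python
import random
from collections import Counter


def train_stump(X, y, features):
    """Fit a one-level tree: try each candidate feature, split at its mean,
    and predict the majority class on each side of the split."""
    best = None
    for f in features:
        thresh = sum(row[f] for row in X) / len(X)
        left = [label for row, label in zip(X, y) if row[f] <= thresh]
        right = [label for row, label in zip(X, y) if row[f] > thresh]
        if not left or not right:
            continue  # degenerate split, skip this feature
        l_pred = Counter(left).most_common(1)[0][0]
        r_pred = Counter(right).most_common(1)[0][0]
        errors = sum(
            (l_pred if row[f] <= thresh else r_pred) != label
            for row, label in zip(X, y)
        )
        if best is None or errors < best[0]:
            best = (errors, f, thresh, l_pred, r_pred)
    if best is None:  # no usable split: fall back to the majority class
        majority = Counter(y).most_common(1)[0][0]
        best = (len(y), None, None, majority, majority)
    return best[1:]  # (feature, threshold, left_pred, right_pred)


def stump_predict(stump, x):
    f, thresh, l_pred, r_pred = stump
    if f is None:
        return l_pred
    return l_pred if x[f] <= thresh else r_pred


def random_forest(X, y, n_trees=15, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(1, int(p ** 0.5))  # common heuristic: m close to sqrt(p)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # step 1: bootstrap rows
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(p), m)               # step 2: feature subset
        forest.append(train_stump(Xb, yb, feats))     # step 3: fit one tree
    return forest                                     # step 4: the forest


def forest_predict(forest, x):
    # Step 5: every tree votes; the majority class wins.
    votes = [stump_predict(s, x) for s in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy usage: two well-separated classes.
X = [[0, 1], [1, 0], [2, 1], [8, 9], [9, 8], [10, 9]]
y = [0, 0, 0, 1, 1, 1]
forest = random_forest(X, y)
print(forest_predict(forest, [1, 1]), forest_predict(forest, [9, 9]))
```

Note that because the base learners here make only one split, the per-tree feature subset and the per-split feature subset coincide; with full trees these are drawn independently at every split.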
Infographic depicting Random Forest
Intuition behind Random Forest
Random Forest is built on top of Decision Trees. Decision Trees, while highly interpretable, suffer from low prediction accuracy because they tend to overfit the training data (a problem known as high variance). To overcome this, the concept of Bagging was introduced, where several decision trees are generated on different subsets of the training data and their outputs are combined to make the final prediction. In general, averaging the outputs of different trees tends to reduce the model's variance.
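A small numeric sketch makes the variance-reduction claim concrete. Here each "tree" is simulated as an unbiased but noisy estimate of a true value (an assumption made purely for illustration), and the spread of a single tree's prediction is compared against the spread of the bagged average of 25 trees:

```python
import random
import statistics

rng = random.Random(0)
TRUE_VALUE = 10.0


def tree_prediction():
    """Stand-in for one overfit tree: unbiased but high-variance."""
    return TRUE_VALUE + rng.gauss(0, 2.0)


# Estimate both variances over 2000 repetitions.
single = [tree_prediction() for _ in range(2000)]
bagged = [statistics.mean(tree_prediction() for _ in range(25))
          for _ in range(2000)]

print(statistics.variance(single))   # close to 2.0**2 = 4
print(statistics.variance(bagged))   # close to 4 / 25 for independent trees
```

The 25-fold drop only holds because these simulated trees are independent; real bagged trees are correlated, which is exactly the limitation discussed next.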
However, there can be instances where Bagging does not help in reducing the variance much. This happens when a couple of strong predictive features exist in the training data, along with several other weak predictors. In this scenario, the top nodes of most of the trees will be similar, featuring those strong predictors as their top splitting nodes. This results in the generation of highly correlated trees. When the results of such correlated trees are combined, it doesn’t lead to much reduction in variance.
In order to overcome this problem, Random Forest was introduced, which adds random sampling of features on top of traditional Bagging. Say we have a total of p features. When an individual tree is built, each time a split in the tree is considered, a random subset of only m features is chosen as split candidates from the full set of p features. Doing so generates trees that don't look alike and are therefore de-correlated from one another, since the strong predictors might not even be candidates at many of the splits. These de-correlated trees, in addition to bagging, result in superior model performance and higher prediction accuracy. In common practice, m = √p is chosen for classification and m = p/3 for regression. Shown below is an illustrative example of the performance of the Random Forest algorithm for different values of m.
Finally, in order to generate regression outcomes, the average value from all the trees is used. For classification, the majority vote from all the trees is used.
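The two aggregation rules amount to a couple of lines each; the per-tree outputs below are made up for illustration:

```python
from collections import Counter
from statistics import mean

# Classification: each tree casts one vote; the majority class wins.
tree_votes = ["cat", "dog", "cat", "cat", "dog"]
majority = Counter(tree_votes).most_common(1)[0][0]
print(majority)              # cat

# Regression: the per-tree predictions are averaged.
tree_values = [2.0, 2.4, 1.8, 2.2]
print(mean(tree_values))     # 2.1
```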
Advantages and Disadvantages of Random Forest
The main advantages of random forest include its ability to handle high-dimensional data, its robustness to outliers and noise, and its ability to capture complex nonlinear relationships between the features and the target variable. Additionally, random forest can provide estimates of feature importance, which can be useful for feature selection and understanding the underlying data structure.
However, random forest can be computationally expensive, especially for large datasets with many features. The algorithm can also overfit if the individual trees are allowed to grow too deep or complex; adding more trees, by contrast, mainly increases computation rather than overfitting. It is therefore important to carefully tune the hyper-parameters of the algorithm, such as the number of trees, the maximum depth of the trees, and the size of the random feature subsets.
In conclusion, Random Forest is a powerful machine learning algorithm that is widely used for classification and regression tasks, particularly when dealing with high-dimensional, noisy data with non-linear decision boundaries. Its ability to reduce overfitting, handle missing data, and provide estimates of feature importance makes it a popular choice among practitioners. However, careful hyper-parameter tuning and model evaluation are important to ensure optimal performance on the specific problem at hand.