What is Random Forest?

Random Forest is a collection of Decision Trees, where each tree is trained on a randomly selected subset of the training data and of the model features. Random Forest is a supervised machine learning algorithm that can be used to solve both regression and classification problems, including binary and multi-class classification. It uses an ensemble of decision trees to generate non-linear decision boundaries.
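
As a quick illustration, here is a minimal sketch of training a Random Forest with scikit-learn; the synthetic dataset and the hyperparameter values are placeholders, not recommendations:

```python
# Minimal sketch: training a Random Forest classifier with scikit-learn.
# The dataset and hyperparameter values are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An ensemble of 100 decision trees; each tree sees a bootstrap sample of the
# rows and a random subset of the features at every split.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```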

Random Forest is built on top of Decision Trees. Decision Trees, while highly interpretable, suffer from low prediction accuracy because they tend to overfit the training data (a problem known as high variance). To overcome this, Random Forest introduces Bagging (bootstrap aggregating): several decision trees are trained on different bootstrap samples of the training data, and the outputs of these trees are combined to make the final prediction. In general, averaging the outputs of different trees reduces the model's variance.
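
To make the idea concrete, here is a minimal sketch of Bagging with regression trees; the function name, the hyperparameters, and the assumption that the inputs are NumPy arrays are illustrative choices, not a reference implementation:

```python
# Minimal sketch of Bagging (regression case): train each tree on a
# bootstrap sample of the rows and average the trees' predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, y_train, X_test, n_trees=50, seed=0):
    """X_train, y_train, X_test are assumed to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n rows with replacement.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor()
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_test))
    # Averaging the per-tree outputs is what reduces the variance.
    return np.mean(predictions, axis=0)
```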

However, there are cases where Bagging alone does not reduce the variance much. This happens when the training data contains a few strongly predictive features alongside many weak ones. In this scenario, most of the trees will look similar near the top, featuring those strong predictors as their top splitting nodes. This results in highly correlated trees, and averaging the outputs of highly correlated trees does not lead to much reduction in variance.

To overcome this problem, Random Forest also randomly samples the training features, in addition to traditional Bagging. Say we have a total of ‘n’ features. As each individual tree is built, every time a split is considered, a random subset of only ‘m’ features is chosen as split candidates from the full set of ‘n’ features. This generates trees that do not look like each other, and are therefore decorrelated, since the strong predictors are not even candidates at many of the splits. These decorrelated trees, on top of bagging, result in superior model performance and thereby higher prediction accuracy. In common practice, m = √n (for classification) or m = n/3 (for regression) is chosen.
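
In scikit-learn, this per-split feature subsampling is controlled by the max_features parameter. A short sketch, where the specific values reflect the common rules of thumb above rather than tuned choices:

```python
# Sketch: per-split feature subsampling via max_features in scikit-learn.
# 'sqrt' selects m = √n candidate features at each split; a float in (0, 1]
# selects that fraction of the n features.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")  # m = √n
reg = RandomForestRegressor(n_estimators=100, max_features=1 / 3)    # m ≈ n/3
```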

Finally, to generate a regression outcome, the average value from all the trees is used; for classification, the majority vote across all the trees is used.
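
A tiny sketch of these two aggregation rules, using made-up per-tree outputs purely for illustration:

```python
# Sketch: combining per-tree outputs into a final prediction.
import numpy as np

# One prediction per tree for a single example (illustrative values).
regression_outputs = np.array([3.1, 2.9, 3.4, 3.0])
classification_votes = np.array([1, 0, 1, 1])

# Regression: average the trees' predicted values.
regression_prediction = regression_outputs.mean()  # 3.1

# Classification: take the majority vote across the trees.
values, counts = np.unique(classification_votes, return_counts=True)
majority_vote = values[np.argmax(counts)]  # 1
```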