Explain the difference between Gini, Entropy, and Information Gain

Gini Score and Entropy are both measures that quantify the impurity of a node in a decision tree. They are the two most commonly used criteria for choosing the best split at each step in the construction of a classification tree. The term entropy originates in physics, where it refers to the chaos, or disorder, present within a system. The analog for a decision tree is the impurity of a node, that is, how heterogeneous its observations are with respect to class. The only practical difference between Entropy and Gini lies in the formula. As a result, in the binary setting Gini ranges between 0 and 0.5, and Entropy (with log base 2) between 0 and 1. In both cases, values closer to 0 indicate greater purity of the node, and values near the upper bound imply more impurity.

In the binary classification setting, where p is the probability of belonging to one class, Entropy is calculated by using the following formula:

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

In the multi-class setting, the formula generalizes to a sum across all K classes, where p_i is the proportion of observations belonging to class i:

$$H = -\sum_{i=1}^{K} p_i \log_2 p_i$$

The intuition becomes clear by considering how entropy changes as the proportion of observations in a node that belong to one class moves across its possible range. In the extreme cases, where almost all observations belong to one of the classes, the entropy is very low, indicating greater homogeneity of that node. When the class distribution approaches 50/50, the entropy reaches its peak of 1, a value that results from using log base 2 in the formula.
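The behavior described above can be checked numerically. The following is a minimal sketch, not a reference implementation; the function names `binary_entropy` and `entropy` are illustrative choices, and the convention that a class with zero proportion contributes nothing to the sum is an assumption (the standard one).

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-class node, where p is the share of one class."""
    # By convention 0 * log2(0) is treated as 0, so a pure node has entropy 0.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def entropy(proportions) -> float:
    """Multi-class entropy in bits for class proportions that sum to 1."""
    # Classes with zero proportion are skipped (same 0 * log2(0) = 0 convention).
    return sum(-q * math.log2(q) for q in proportions if q > 0)


# Entropy rises from 0 at a pure node to 1 at a 50/50 split.
for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p = {p:.1f}  entropy = {binary_entropy(p):.3f}")
```

With more than two classes the upper bound is no longer 1: four equally likely classes, for example, give `entropy([0.25] * 4)` equal to 2 bits.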

A similar approach can be taken to understand the Gini calculation, which, in the binary classification case, is given by:

$$G(p) = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$$

The curve has the same shape as the entropy curve but does not exceed 0.5, a value it reaches when 50% of the observations in a node belong to each class. Both formulas can be extended to multiclass classification, where the proportions of observations belonging to each class are denoted p1, p2, p3, etc., because 1-p can no longer serve as the complement of p when there are more than two classes.
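As a rough illustration, Gini impurity can be computed directly from the class proportions. This is a sketch under the usual definition (one minus the sum of squared proportions); the function name `gini` is a hypothetical choice for this example.

```python
def gini(proportions) -> float:
    """Gini impurity for class proportions that sum to 1."""
    # 1 minus the sum of squared class proportions; 0 for a pure node.
    return 1.0 - sum(q * q for q in proportions)


# Binary case: 0 for a pure node, rising to 0.5 at a 50/50 split.
for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p = {p:.1f}  gini = {gini([p, 1 - p]):.3f}")
```

Note that the 0.5 ceiling applies to two classes only. With K equally likely classes the value is 1 - 1/K, so `gini([1/3, 1/3, 1/3])` is about 0.667.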

In the regression case, appropriate error metrics that can be used as splitting criteria include standard MSE (Mean Squared Error) or MAE (Mean Absolute Error). Finally, Information Gain is simply the reduction in the metric being considered that a split achieves: the impurity (or error) of the parent node minus the average impurity of the resulting child nodes, weighted by the share of observations that falls into each child. Intuitively, the optimal split is the one that results in the largest information gain.
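To make the idea concrete, here is a small sketch that scores candidate splits. It assumes the common definition of the gain as parent impurity minus the size-weighted average impurity of the children. The helper names (`entropy_of_labels`, `variance`, `information_gain`) and the toy data are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def entropy_of_labels(labels) -> float:
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(values) -> float:
    """MSE of a node's targets around their mean (a regression criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def information_gain(parent, children, impurity=entropy_of_labels) -> float:
    """Parent impurity minus the size-weighted average impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["a"] * 5 + ["b"] * 5
perfect_split = [["a"] * 5, ["b"] * 5]
poor_split = [["a", "a", "a", "b", "b"], ["a", "a", "b", "b", "b"]]

print(information_gain(parent, perfect_split))  # 1.0: both children are pure
print(round(information_gain(parent, poor_split), 3))  # 0.029: children barely purer
```

The same function covers the regression case when an error metric is passed instead: `information_gain([1, 1, 5, 5], [[1, 1], [5, 5]], impurity=variance)` returns 4.0, since the split removes all of the within-node variance.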