Machine Learning Resources

Explain the difference between Entropy, Gini, and Information Gain

Gini Score and Entropy are both measures that quantify the impurity of a node in a decision tree. They are the two most commonly used criteria in determining the best split to perform at each step in the construction of a classification tree.


The term entropy originates in physics, where it refers to the chaos, or disorder, present within a system. The analog for a decision tree is the impurity of a node, that is, the degree to which its observations are spread across multiple classes. The only practical difference between Entropy and Gini lies in the formula: in binary classification, Gini ranges between 0 and 0.5, and Entropy (with log base 2) between 0 and 1. In both cases, values closer to 0 indicate greater purity of the node, and values near the upper bound indicate greater impurity.

In the binary classification setting, where p is the probability of belonging to one class, Entropy is calculated by using the following formula:

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

In the multi-class setting, the formula generalizes to a sum across all k classes, where p_i is the proportion of observations belonging to class i:

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i$$

The intuition comes from considering how entropy changes as the proportion of observations in a node that belong to one class varies from 0 to 1. In the extreme cases, where almost all observations belong to one of the classes, the entropy is very low, indicating greater homogeneity of that node. When the class distribution approaches 50/50, the entropy reaches its peak of 1, a value that results from using log base 2 in the formula.
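To make this intuition concrete, here is a minimal Python sketch of binary entropy evaluated across a range of class proportions. The function name `binary_entropy` and the sample proportions are illustrative choices, not part of any library.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a node where a fraction p of observations is in one class."""
    if p in (0.0, 1.0):
        # A perfectly pure node: by convention 0 * log2(0) is treated as 0.
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Entropy is low near the extremes and peaks at a 50/50 class mix.
for p in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    print(f"p = {p:.2f}  entropy = {binary_entropy(p):.3f}")
```

The printed values rise toward p = 0.5 and fall symmetrically on either side, which is the curve described above.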


A similar approach can be taken to understand the Gini calculation, which in the binary classification case is given by:

$$G(p) = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$$

The curve behaves the same way as with entropy but does not exceed 0.5, a maximum reached when 50% of the observations in a node belong to each class. Both formulas can be extended to multiclass classification, where the proportions of observations belonging to each class are denoted p1, p2, p3, etc., because 1-p can no longer serve as the complement of p. With k classes the upper bounds rise accordingly, to 1 - 1/k for Gini and log2(k) for Entropy.
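As a rough illustration, the following Python sketch computes Gini impurity as one minus the sum of squared class proportions, which covers both the binary and the multiclass case. The function name `gini` and the toy label lists are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity, 1 - sum(p_i ** 2), of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]
balanced_binary = ["a", "a", "b", "b"]
balanced_three_class = ["a", "b", "c"]

print(gini(pure_node))             # a pure node has no impurity
print(gini(balanced_binary))       # the binary maximum
print(gini(balanced_three_class))  # exceeds the binary maximum
```

With three equally represented classes the impurity is above 0.5, so the 0.5 ceiling applies only to two-class problems.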

In the regression case, appropriate error metrics that can be used as splitting criteria include standard MSE (Mean Squared Error) or MAE (Mean Absolute Error).
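A hedged sketch of how MSE can act as a splitting criterion: each node predicts the mean of its targets, and a candidate split is scored by the size-weighted MSE of the two children. The helper names `mse` and `weighted_child_mse` and the toy targets are hypothetical.

```python
def mse(values):
    """Mean squared error of predicting the node mean for every observation."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def weighted_child_mse(left, right):
    """Average of the child MSEs, weighted by the number of observations in each child."""
    n = len(left) + len(right)
    return (len(left) * mse(left) + len(right) * mse(right)) / n


# Toy targets with two obvious clusters (made up for illustration).
targets = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]

print("parent MSE:     ", mse(targets))
print("good split MSE: ", weighted_child_mse(targets[:3], targets[3:]))
print("poor split MSE: ", weighted_child_mse(targets[:1], targets[1:]))
```

The split that separates the two clusters leaves a much smaller weighted MSE than the parent, so it would be preferred under this criterion.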

Information Gain

Information gain is a key metric used in decision tree algorithms to choose the feature that best splits the dataset at each node. It measures the reduction in impurity achieved by a split: the entropy of the parent node minus the weighted average entropy of the child nodes, with each child weighted by its share of the parent's observations. Intuitively, the optimal split is the one that results in the largest information gain, so the feature with the highest information gain is chosen at each decision point in the tree. The same principle carries over to the other criteria: with Gini, or with MSE or MAE in regression, the chosen split is the one that reduces the criterion the most.
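The calculation can be sketched in a few lines of Python: information gain is the parent's entropy minus the size-weighted average entropy of the children. The function names and the two candidate splits below are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Two hypothetical candidate splits of the same ten observations.
split_a = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print("gain for split A:", round(information_gain(parent, split_a), 3))
print("gain for split B:", round(information_gain(parent, split_b), 3))
```

Split A produces purer children and therefore a larger gain, so a tree using this criterion would choose it over split B.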
