If the overall distribution of the outcome is heavily tilted towards one class, the classification problem is considered imbalanced. In such situations, accuracy is usually not the best metric for evaluating the algorithm's performance. For example, if 99% of the observations belong to one class, a completely uninformative approach that simply assigns every observation to the majority class would achieve an accuracy of 99%. However, very little insight is gained from such an approach, especially if the practical interest lies in predicting the rare class. Even a somewhat more sophisticated algorithm with a few input features is likely to perform better on the majority class simply because it can learn from a much larger sample size. Thus, there are a few general guidelines that are important to keep in mind when dealing with imbalanced classification:
- Evaluate the algorithm using precision and recall: these metrics specifically measure the model's ability to correctly classify the positive (rare) class. When detecting the rare class is the main goal, such as screening for a disease, precision (the share of predicted positives that are truly positive) and recall (the share of actual positives that the model detects) give a much better indication of whether the model serves a useful purpose. For example, if 99% of the observations belong to the negative class, it is far more interesting to know how the model performs on the remaining 1%, which is exactly what precision and recall capture (a short sketch follows this list).
- Sampling Techniques:
Oversampling: One option is to generate additional observations from the rare class. The simplest way to do so is to duplicate existing examples from the minority class, but this does not add any new information. A better approach is to generate synthetic samples using a technique such as SMOTE (Synthetic Minority Oversampling Technique). SMOTE creates additional data points by interpolating between existing observations of the rare class and their nearest minority-class neighbours in feature space. The new data points therefore have feature values similar to the actual observations from the rare class without simply duplicating existing records (see the SMOTE sketch after this list). While SMOTE addresses the imbalance in the class distribution, it can only really improve a classifier's performance if the algorithm is able to identify features that are associated with belonging to the rare class.
Undersampling: Conversely, undersampling the majority class can also be used to address an imbalanced class distribution. This involves randomly removing observations from the majority class until the classes are more balanced. A potential downside of this approach is a loss of information, so it should only be used when there is enough data that discarding observations does not adversely affect the classifier's performance. In practice, undersampling and SMOTE can be used together to address the issue (see the sampling sketches after this list).
- Weighted Cost Function: Another approach for dealing with imbalanced data is to weight the cost of misclassifying an observation according to the class it belongs to. If the principal motivation is to correctly classify the rare class, a higher penalty can be placed on misclassifications of the rare class. If this penalty is not kept within a reasonable range, it can drag down the classifier's performance on the majority class, so depending on the context, one must decide how much emphasis to place on performance on the rare class relative to overall performance (a sketch using class weights follows this list).
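
To make the first point concrete, here is a minimal sketch of evaluating an imbalanced classifier with precision and recall instead of accuracy. It assumes scikit-learn is available and uses a simulated dataset from `make_classification` purely for illustration.

```python
# Sketch: precision and recall on an imbalanced problem (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Simulated data where roughly 1% of observations belong to the positive class.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

# Accuracy looks impressive almost regardless of the model; precision and
# recall reveal how well the rare (positive) class is actually detected.
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, zero_division=0))
print("recall   :", recall_score(y_test, pred))
```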
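The oversampling idea can be sketched with SMOTE as implemented in the imbalanced-learn package (imported as `imblearn`); the package and the simulated data are assumptions for illustration, not part of the original text.

```python
# Sketch: synthetic oversampling of the minority class with SMOTE
# (assumes the imbalanced-learn package is installed).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
print("before:", Counter(y))  # heavily skewed towards class 0

# SMOTE interpolates between minority observations and their nearest
# minority-class neighbours, creating new points with similar feature values.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now balanced
```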
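Finally, a sketch of the other two ideas: random undersampling of the majority class, and a weighted cost function expressed through scikit-learn's `class_weight` parameter. The `RandomUnderSampler` comes from imbalanced-learn, and the explicit weight values shown are illustrative choices, not recommendations.

```python
# Sketch: undersampling the majority class and weighting the cost function
# (assumes scikit-learn and imbalanced-learn are installed).
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# Undersampling: randomly drop majority-class rows until the classes balance.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))

# Weighted cost function: penalise misclassifying the rare class more heavily.
# class_weight="balanced" weights each class inversely to its frequency;
# an explicit dict such as {0: 1, 1: 50} sets the penalty by hand.
model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
```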