How to determine threshold/decision rule for a classification model?

One of the decisions that often has to be made in a classification problem is what threshold to use to assign observations to classes. The choice often depends on the specific algorithm being used, since some machine learning classifiers are designed to produce only class labels, while others can predict a probability of belonging to each class. The following are a few options, though not an exhaustive list.

  • 0.5 Cutoff: Perhaps the simplest decision rule is to classify any observation with a predicted probability above 0.5 as belonging to the positive class and any below 0.5 to the negative class. This can be an acceptable threshold for a logistic regression on a well-balanced dataset, but if the class distribution is skewed, a 0.5 cutoff is unlikely to optimize the tradeoff between the different types of errors.

  • Optimize a predetermined metric: Based on the context of the classification problem, it is often known ahead of time what, if anything, is most important to optimize. If there is no strong reason to weight the positive and negative classes differently, or if Type I and Type II errors carry similar costs, it may be reasonable to choose the threshold that simply maximizes the overall accuracy of the model. On the other hand, if each type of error carries a different cost, it makes sense to base the threshold on the appropriate error metric, reducing either false positives or false negatives accordingly. Note that the area under the ROC curve is a property of the whole curve and does not itself yield a threshold; rather, a threshold corresponds to a specific point on the ROC curve (e.g., the point maximizing the difference between the true positive rate and the false positive rate) or on the precision-recall curve (e.g., the point maximizing the F1 score).

  • Aggregate approaches: If the goal is to predict the total number of observations belonging to each class rather than the probabilities of specific observations, one approach is to sum the predicted probabilities over all observations to arrive at an aggregate estimate of the number falling into the positive class. As a simple example, consider two observations, where a classifier assigns the first a predicted probability of 0.9 and the second 0.1. If the task does not involve any intervention at the individual level that would treat the observation with the higher predicted probability differently from the one with the lower probability, summing the two values predicts that 1 of those observations will be positive, which is a valid interpretation at the aggregate level.
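To make the second option concrete, here is a minimal sketch of choosing a threshold by scanning candidate cutoffs and keeping the one that maximizes a chosen metric. The data, the `best_threshold` helper, and the use of plain accuracy as the metric are all hypothetical illustrations; any cost-weighted metric could be passed in its place.

```python
import numpy as np

def best_threshold(probs, y_true, metric):
    """Scan each observed probability as a candidate cutoff and
    return the cutoff (and score) that maximizes `metric`."""
    best_t, best_score = 0.5, -np.inf
    for t in np.unique(probs):
        preds = (probs >= t).astype(int)  # hard labels at this cutoff
        score = metric(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

def accuracy(y_true, y_pred):
    """Fraction of correct predictions."""
    return np.mean(y_true == y_pred)

# Toy example: the 0.5 cutoff misclassifies one negative (0.6),
# while a cutoff of 0.7 classifies all five observations correctly.
probs = np.array([0.1, 0.3, 0.6, 0.7, 0.9])
y_true = np.array([0, 0, 0, 1, 1])
t, score = best_threshold(probs, y_true, accuracy)
```

In practice the candidate cutoffs and the per-cutoff error rates are exactly what an ROC or precision-recall curve enumerates, so the same scan can be done directly from those curves.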

To illustrate why the sum-of-probabilities approach might be preferred for an aggregate class estimate, consider a small dataset of predicted probabilities. If each observation is converted to a hard label using 0.5 as the threshold, only 2 observations are assigned the positive class. However, many of the observations assigned to 0 have probabilities just below the 0.5 cutoff. If the raw probabilities are summed instead, the total prediction is 4.68 (in practice this would be rounded to a whole number), which better reflects the overall distribution of raw probabilities than treating each observation as a separate Bernoulli outcome.
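This contrast can be sketched in a few lines. The probabilities below are hypothetical, chosen to match the description above: only two exceed 0.5, several sit just below it, and the total comes to 4.68.

```python
import numpy as np

# Hypothetical predicted probabilities: only two exceed the 0.5 cutoff,
# but several sit just below it
probs = np.array([0.9, 0.8, 0.49, 0.47, 0.45, 0.43, 0.41, 0.39, 0.34])

# Hard-thresholding at 0.5 predicts only 2 positives
hard_count = int((probs > 0.5).sum())

# Summing the raw probabilities predicts ~4.68 positives in aggregate,
# which better reflects the mass just under the cutoff
aggregate = probs.sum()
```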