One of the most useful tools for evaluating the performance of any classification algorithm is the confusion matrix. It is most easily demonstrated in a binary classification setup, but it extends naturally to multi-class classification. In a basic confusion matrix, one dimension corresponds to the true classes of the observations, and the other to the classes predicted by the algorithm.
The cells on the diagonal of the matrix correspond to correct classifications: observations that actually belong to the positive class and are predicted to be positive, as well as observations that belong to the negative class and are predicted to be negative. The off-diagonal cells represent misclassifications, both the case where the model predicts an observation to be positive when it is actually negative (a false positive) and the reverse, where it predicts negative for an observation that is actually positive (a false negative).
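As a concrete sketch of how such a matrix might be produced in practice, the snippet below uses scikit-learn's confusion_matrix on a small set of hypothetical labels and predictions (the data, and the choice of scikit-learn, are illustrative assumptions rather than part of the original example):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows correspond to the true classes and columns to the predicted classes;
# for labels [0, 1] the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```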

In this example, there are a total of 360 observations in the dataset. A confusion matrix can be produced for both the training data and the validation data, but evaluating on the latter gives a better picture of the model's true performance. Once the confusion matrix is created, several metrics can be extracted from it that give insight into how well the algorithm is performing.
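The counts implied by the calculations below are 150 true positives, 50 false negatives, 60 false positives, and 100 true negatives (150 + 50 + 60 + 100 = 360). For reference in the short sketches that follow, these counts can be laid out explicitly (the row/column ordering shown here is an assumption matching scikit-learn's convention):

```python
import numpy as np

# Example counts: rows are true classes, columns are predicted classes.
#                  predicted negative   predicted positive
# actual negative       100 (TN)             60 (FP)
# actual positive        50 (FN)            150 (TP)
cm = np.array([[100, 60],
               [ 50, 150]])

tn, fp, fn, tp = cm.ravel()
print(tn + fp + fn + tp)  # 360 observations in total
```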
- Accuracy: Accuracy is the most straightforward evaluation metric for a classification problem; it simply measures the overall proportion of observations that were correctly classified. It can be calculated from a confusion matrix by summing the diagonal cells (true positives and true negatives) and dividing by the total number of observations. In general, accuracy is calculated by
Accuracy = (True Positives + True Negatives) / Total Observations,
and in the example above, the accuracy is (100 + 150) / 360 = 0.694.
The complement of accuracy is the misclassification rate, which is simply 1 – accuracy.
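As a quick check of the arithmetic, a minimal sketch using the example counts:

```python
tp, tn, fp, fn = 150, 100, 60, 50  # counts from the example

accuracy = (tp + tn) / (tp + tn + fp + fn)
misclassification_rate = 1 - accuracy

print(round(accuracy, 3))                # 0.694
print(round(misclassification_rate, 3))  # 0.306
```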
- Recall (or Sensitivity): Recall measures the proportion of observations that actually belong to the positive class that were correctly classified as positive by the algorithm. In other words, it measures how sensitive the algorithm is in detecting true positives. In general,
Recall = True Positives / (True Positives + False Negatives),
and in the example above, the recall is 150 / (150 + 50) = 0.75.
- Precision: Precision can be thought of as the counterpart of recall: rather than conditioning on the observations that are actually positive, it measures the proportion of observations that the algorithm predicts to be positive that actually are positive. In general, precision is calculated by
Precision = True Positives / (True Positives + False Positives),
and in this example, the precision is 150 / (150 + 60) = 0.714.
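Both quantities can be verified the same way (again a sketch, assuming the example counts):

```python
tp, fp, fn = 150, 60, 50  # counts from the example

recall = tp / (tp + fn)     # of all actual positives, the share that were found
precision = tp / (tp + fp)  # of all predicted positives, the share that were correct

print(round(recall, 3))     # 0.75
print(round(precision, 3))  # 0.714
```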
- F1 Score: One of the difficulties in binary classification is that it is generally not possible to improve either precision or recall without adversely affecting the other. This is the familiar tradeoff between false positives and false negatives, or Type I and Type II errors. For example, maximizing recall would require minimizing the number of false negatives, yet setting a decision threshold that does so typically leads to more false positives and thus a lower precision.
The F1 score is a weighted average of precision and recall, and if each is given equal weight, the formula reduces to:
F1 = 2 * (precision * recall) / (precision + recall), which is the harmonic mean of the two measures.
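Plugging in the precision and recall computed above gives an F1 of roughly 0.73 (a value not quoted in the original example; the sketch below assumes the same counts):

```python
tp, fp, fn = 150, 60, 50  # counts from the example

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(round(f1, 3))  # 0.732
```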
If either precision or recall is deemed to be more important for a specific classification task, they can be weighted accordingly.
- Specificity: Specificity is the analog of recall for the negative class. It measures the proportion of observations that belong to the negative class that were correctly predicted to be negative. In general, specificity is calculated by
Specificity = True Negatives / (True Negatives + False Positives),
and in this example, the specificity is 100 / (100 + 60) = 0.625.
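The specificity can be checked the same way (a sketch assuming the example counts):

```python
tn, fp = 100, 60  # counts from the example

specificity = tn / (tn + fp)
print(specificity)  # 0.625
```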
- False Positive Rate: The false positive rate measures the proportion of actual negative observations that were predicted to be positive. In other words, it is 1 – specificity, or
False Positive Rate = False Positives / (False Positives + True Negatives),
and in this example, the false positive rate is 60 / (60 + 100) = 0.375.
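A final sketch, again assuming the example counts, confirms that the false positive rate is the complement of specificity:

```python
tn, fp = 100, 60  # counts from the example

specificity = tn / (tn + fp)
false_positive_rate = fp / (fp + tn)

print(false_positive_rate)  # 0.375
print(1 - specificity)      # 0.375
```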