AUC ROC vs AUC PR: Offline Model Evaluation
Below, we compare the tradeoffs of AUC-ROC and AUC-PR, two popular metrics for evaluating binary classification models. We also walk through a concrete example illustrating when each metric is most appropriate.
Introduction
Machine learning is all about making predictions. But how do we know if our predictions are any good? What makes one method of making predictions better than another? This is a crucial question: we have to convince ourselves, and more importantly, others that our model is actually useful. For instance, suppose we want to introduce a new model to recommend YouTube videos. There may be existing models already in place, such as recommending popular videos, videos similar to the user's watch history, or videos that are trending. How do we know if our new model is better than the existing ones? This is where evaluation comes in.
The simplest type of prediction is a binary variable, i.e. yes or no. For instance, we could predict:
- Will a user click on a video?
- Is this email spam?
- Is this transaction fraudulent?
- Is this image a cat or a dog?
- Will a user churn from a service?
In reality, a model cannot make perfect predictions. In practice, a model outputs a probability of the positive class, i.e. the probability that the answer is yes.
How do we measure how good these probabilities are? We usually wait for the labels to be revealed, and then we can compare the predicted probabilities, denoted $\hat{p}_i$, to the actual outcomes, denoted $y_i \in \{0, 1\}$.
One simple way to evaluate the model is to choose a threshold, say 0.5, and classify all predictions above that threshold as positive (yes) and all predictions below it as negative (no). Letting $\hat{y}_i = \mathbf{1}[\hat{p}_i > 0.5]$, we can then compute the accuracy:

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\hat{y}_i = y_i]$$
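As a minimal sketch in plain Python (the probabilities and labels below are made up for illustration), thresholding and computing accuracy looks like:

```python
# Hypothetical predicted probabilities and true labels (1 = positive)
p_hat = [0.9, 0.4, 0.7, 0.2, 0.6]
y_true = [1, 0, 1, 1, 0]

# Threshold at 0.5: predictions above it become positive (1), the rest negative (0)
y_pred = [1 if p > 0.5 else 0 for p in p_hat]

# Accuracy: fraction of predictions that match the labels
accuracy = sum(int(yp == yt) for yp, yt in zip(y_pred, y_true)) / len(y_true)
print(accuracy)  # 3 of 5 correct -> 0.6
```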
This is a simple metric, and a pretty good one. However, there are some drawbacks:
1. The threshold of 0.5 is arbitrary.
2. It does not take into account the confidence of the predictions. For instance, a prediction of 0.51 and a prediction of 0.99 are both classified as positive, but the latter is much more confident.
3. It does not take into account class imbalance. For instance, if only 1% of the examples are positive, then a model that always predicts negative will have an accuracy of 99%, which is not very useful.
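To see the class-imbalance drawback concretely, here is a toy sketch with 1% positives, where a model that never predicts positive still scores 99% accuracy:

```python
# 1,000 examples, of which only 1% are positive
y_true = [1] * 10 + [0] * 990

# A useless model that always predicts the negative class
y_pred = [0] * 1000

accuracy = sum(int(yp == yt) for yp, yt in zip(y_pred, y_true)) / len(y_true)
print(accuracy)  # 0.99, despite never identifying a single positive
```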
A useful exercise is to list which of (1), (2), and (3) are resolved by each of the following metrics:
- Log Loss
- F1 Score
- Precision
- Recall
- AUC-ROC
- AUC-PR
For the rest of the post, we will focus on the last two metrics.
AUC-ROC and AUC-PR
Both AUC-ROC and AUC-PR are threshold-independent metrics, meaning they resolve (1) above. At a high level, they both measure the ability of the model as a ranker, i.e. how well the model ranks the positive examples above the negative examples. However, they do so in different ways and have different interpretations.
AUC-ROC computes the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. AUC-PR computes the average precision across all recall levels. To see the difference between the two, consider the following example.
Here we have 8 samples, with 6 negative and 2 positive examples, ranked from highest to lowest score as N N N P P N N N: 3 of the negative samples are ranked higher than the positives, and the other 3 are ranked lower. Note we do not care what the actual predicted probabilities are; only the ranking matters.
To achieve a recall of 0.5, we need to include the top 4 points (precision $\frac{1}{4}$), and to achieve a recall of 1, we need to include the top 5 points (precision $\frac{2}{5}$). Thus the AUC-PR is the average of the precision at recall levels 0.5 and 1, which gives

$$\text{AUC-PR} = \frac{1}{2}\left(\frac{1}{4} + \frac{2}{5}\right) = 0.325$$
On the other hand, each positive example is ranked higher than 3 of the 6 negative examples, so a randomly sampled positive example is ranked higher than a randomly sampled negative example with probability $\frac{3}{6} = \frac{1}{2}$, which gives an AUC-ROC of $0.5$.
Now, and here is the key point, suppose we add a bunch of negative examples that are ranked lower than all of the samples above. This will not change the AUC-PR, as the computation above still holds. However, the AUC-ROC will climb toward 1 as the number of such negative examples increases, because the probability that a randomly chosen positive example outranks a randomly chosen negative example grows as we add more negatives ranked below all the positives.
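The whole example can be checked numerically. The sketch below (plain Python, no libraries) computes AUC-ROC by counting positive-negative pairs and AUC-PR by averaging precision at each recall level, first on the 8 samples above and then after appending 100 bottom-ranked negatives:

```python
def auc_roc(labels):
    """labels are in rank order (highest score first).
    AUC-ROC = fraction of (positive, negative) pairs in which
    the positive example is ranked above the negative one."""
    wins, pairs = 0, 0
    for i, yi in enumerate(labels):
        if yi != 1:
            continue
        for j, yj in enumerate(labels):
            if yj == 0:
                pairs += 1
                if i < j:  # positive ranked above this negative
                    wins += 1
    return wins / pairs

def auc_pr(labels):
    """Average of precision at each recall level, i.e. at each
    rank where one more positive example is recovered."""
    precisions, tp = [], 0
    for k, y in enumerate(labels, start=1):
        if y == 1:
            tp += 1
            precisions.append(tp / k)  # precision within the top k
    return sum(precisions) / len(precisions)

# The 8 samples from the example, in rank order: N N N P P N N N
base = [0, 0, 0, 1, 1, 0, 0, 0]
print(auc_roc(base), auc_pr(base))  # 0.5 0.325

# Append 100 "easy" negatives ranked below everything:
# AUC-ROC jumps toward 1, AUC-PR is unchanged
padded = base + [0] * 100
print(auc_roc(padded), auc_pr(padded))
```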
Tradeoffs and Conclusion
In the real world, we could have a lot of negative examples and very few positive examples. In this case, AUC-ROC can be misleadingly high, as it can be dominated by the large number of negative examples that are ranked lower than the positive examples. In contrast, AUC-PR is more sensitive to the class imbalance and can provide a more informative evaluation of the model’s performance in such cases.