Confusion matrix & f1-score

Last modified on 01 Oct 2021.

Confusion matrix
Precision & Recall
- When to use?
F1-Score
- When to use F1-Score?
- How to choose f1-score value?
Accuracy / Specificity
- When to use?
Confusion Matrix & F1-Score with Scikit-learn
References

Confusion matrix

	actual (yes)	actual (no)
predict (yes)	TP	FP
predict (no)	FN	TN

True Positive (TP): what we predict Positive is really Positive.
True Negative (TN): what we predict Negative is really Negative.
False Negative (FN): what we predict Negative is actually Positive.
False Positive (FP): what we predict Positive is actually Negative.
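The four definitions above can be turned into a small counting routine. This is only a minimal sketch for the binary case: the function name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels, treating `positive` as "yes"."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted yes, actually yes
            else:
                fp += 1  # predicted yes, actually no
        else:
            if actual == positive:
                fn += 1  # predicted no, actually yes
            else:
                tn += 1  # predicted no, actually no
    return tp, fp, fn, tn


# Made-up labels for illustration: 1 = "yes", 0 = "no".
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```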

This guy is pregnant?

How to remember?

True/False indicates what we predicted is right/wrong.
Positive/Negative is what we predicted (yes or no).

Type I / Type II errors

FP = Type I error = rejection of a true null hypothesis = actual negatives are predicted wrongly = what we predict positive is actually negative.
FN = Type II error = non-rejection of a false null hypothesis = actual positives are predicted wrongly = what we predict negative is actually positive.

Why is the CM important?

It gives a general view of our model (“is it really good?”) through precision and recall!

Precision & Recall

	actual (yes)	actual (no)
predict (yes)	TP	FP	Precision
predict (no)	FN	TN
	Recall

Precision: How many of our positive predictions are really true? (Check the accuracy of our positive predictions).

$\mathrm {precision} = \dfrac{\mathrm{true\, positive}}{\mathrm{positively\, predicted\, results}} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.$
Recall: How many of the actually positive results do our predictions capture? (Do we miss some positive cases?)

$\mathrm {recall} = \dfrac{\mathrm{true\, positive}}{\mathrm{positively\, actual\, results}} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$
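As a quick worked example of the two formulas, here is a sketch that computes both from raw counts. The counts are invented, and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the formulas themselves define.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 if nothing was predicted positive (sketch convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 if there are no actual positives (sketch convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Invented counts: 10 positive predictions, 8 of them right, 8 positives missed.
tp, fp, fn = 8, 2, 8
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 0.5
```

Here most positive predictions are right (high precision), yet half of the actual positives are missed (low recall).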

When to use?

Precision matters most when a “wrongly predicted yes” (FP) is costly (e.g. Is this email spam? The model says yes but the answer is actually no, and we lose important emails!).
Recall matters most when a “wrongly predicted no” (FN) is costly (e.g. in the banking industry, Is this transaction fraudulent? The model says no but the answer is actually yes, and we lose money!).

F1-Score

High precision and low recall or vice versa? F1-Score gives us a balance between precision and recall.

$f_1 = \left({\frac {\mathrm {recall} ^{-1}+\mathrm {precision} ^{-1}}{2}}\right)^{-1}=2\times {\frac {\mathrm {precision} \cdot \mathrm {recall} }{\mathrm {precision} +\mathrm {recall} }}.$

F1-score depends on which class we label “positive”. “Is this email spam?” is very different from “Is this email not spam?”
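To see how much the choice of positive class matters, the sketch below scores the same predictions twice, once with "spam" as positive and once with "ham" (not spam). The helper `f1_for` and the ten toy emails are hypothetical. It uses the form $2\,\mathrm{TP} / (2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$, which is what the formula above reduces to when precision and recall are written in terms of counts.

```python
def f1_for(y_true, y_pred, positive):
    """F1 of the class `positive`, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if p == positive and a == positive)
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    fn = sum(1 for a, p in pairs if p != positive and a == positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: 2 spam emails and 8 ham emails, with 2 mistakes in total.
y_true = ["spam", "spam"] + ["ham"] * 8
y_pred = ["spam", "ham"] + ["ham"] * 7 + ["spam"]

print(f1_for(y_true, y_pred, positive="spam"))  # 0.5
print(f1_for(y_true, y_pred, positive="ham"))   # 0.875
```

The predictions are identical in both calls; only the question being asked has changed.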

When to use F1-Score?

When you need a balance between precision and recall.
When we have a “skewed class” problem (uneven class distribution, too many “yes” and very few “no”, for example).
When we want to be warned if one of precision and recall improves while the other drops sharply: in that case the f1-score becomes very small!
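The last point is easy to check numerically. The sketch below compares the plain average with the F1 formula from above for a model with perfect precision but very poor recall; the numbers are invented for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 1.0, 0.1  # invented values: perfect precision, poor recall
arithmetic_mean = (p + r) / 2

print(round(arithmetic_mean, 3))  # 0.55
print(round(f1_score(p, r), 3))   # 0.182
```

The arithmetic mean still looks acceptable, while the harmonic mean is pulled toward the weaker of the two values.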

How to choose f1-score value?

Normally, $f_1\in [0,1]$, and the higher the value, the better our model is.

In the best case ( $f_1=1$ ), both precision and recall are $100\%$ .
If one of precision and recall is very small (close to 0), $f_1$ is very small too and our model is not good!

What if we care more about one of precision and recall than the other? Then we consider $f_{\beta}$:

$f_{\beta} = ( 1 + \beta^2)\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision} + \text{recall}}$

$f_1$ is the special case of $f_{\beta}$ with $\beta=1$. For other values of $\beta$:

When precision is more important than recall, we choose $\beta < 1$ (usually choose $\beta=0.5$ ).
When recall is more important than precision, we choose $\beta > 1$ (usually choose $\beta=2$ ).
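The effect of $\beta$ can be checked with a few lines of code. This is a sketch of the $f_{\beta}$ formula above with invented precision and recall values; the function name `f_beta` is hypothetical.

```python
def f_beta(precision, recall, beta):
    """(1 + beta^2) * P * R / (beta^2 * P + R), with 0.0 when the denominator is zero."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom


p, r = 0.8, 0.5  # invented values: precision is better than recall

print(round(f_beta(p, r, beta=0.5), 3))  # 0.714, leans toward precision
print(round(f_beta(p, r, beta=1.0), 3))  # 0.615, the usual f1
print(round(f_beta(p, r, beta=2.0), 3))  # 0.541, leans toward recall
```

Because precision is the stronger of the two here, the score rises as $\beta$ shrinks and falls as $\beta$ grows.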

Accuracy / Specificity

Accuracy: What fraction of all our predictions are correct?

$\mathrm{accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
Specificity: How many of the actually negative results do we predict as negative?

$\mathrm{specificity} = \dfrac{TN}{FP + TN}$

When to use?

Accuracy is used when we have symmetric datasets (balanced classes, with false positives and false negatives costing about the same).
Specificity is used when we care about the TN values and don’t want to raise false alarms (FP), e.g. in a drug test.
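A small worked example shows why accuracy alone can mislead on a skewed dataset. The counts below are invented: a model that always answers "no" on 100 cases, of which only 5 are actually "yes".

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(tn, fp):
    """TN / (FP + TN); 0.0 if there are no actual negatives (sketch convention)."""
    return tn / (fp + tn) if (fp + tn) else 0.0


# Invented counts for an "always no" model on 5 positives and 95 negatives.
tp, fp, fn, tn = 0, 0, 5, 95

print(accuracy(tp, tn, fp, fn))  # 0.95
print(specificity(tn, fp))       # 1.0
print(tp / (tp + fn))            # 0.0  (recall: every positive is missed)
```

Accuracy and specificity both look excellent, yet the model never finds a single positive case.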

Confusion Matrix & F1-Score with Scikit-learn

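import numpy as np
from sklearn.metrics import confusion_matrix

# y_true: actual labels, y_pred: predicted labels (same length)
labels = np.unique(y_true)  # one entry per class, in sorted order
confusion_matrix(y_true, y_pred, labels=labels)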
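One detail worth checking when reading the result: scikit-learn puts the actual classes on the rows and the predicted classes on the columns, which is the transpose of the table at the top of this note. The sketch below uses invented binary labels. It unpacks the four cells, then rebuilds the note's layout by listing the positive class first and transposing.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 1 = "yes", 0 = "no".
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# scikit-learn layout with labels=[0, 1]: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 2 1 3

# Layout used in this note: rows = predicted, columns = actual, "yes" first.
cm_note = confusion_matrix(y_true, y_pred, labels=[1, 0]).T
print(cm_note.tolist())  # [[3, 2], [1, 4]]  i.e. [[TP, FP], [FN, TN]]
```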

Precision / Recall / f1-score / support

from sklearn.metrics import classification_report
# the report is a formatted string, so print it to keep the table layout
print(classification_report(y_test, y_pred))

ROC curve

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
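# Jupyter notebook magic; remove the next line when running as a plain script
%matplotlib inline

# y_pred_prob holds predicted scores/probabilities for the positive class, not hard 0/1 labels
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# create plot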
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
_ = plt.xlabel('False Positive Rate')
_ = plt.ylabel('True Positive Rate')
_ = plt.title('ROC Curve')
_ = plt.xlim([-0.02, 1])
_ = plt.ylim([0, 1.02])
_ = plt.legend(loc="lower right")

References

Classification: Precision and Recall - Google Developers, Machine Learning Crash Course.
Classification: Check Your Understanding (Accuracy, Precision, Recall) - Google Developers, Machine Learning Crash Course.
F-measure versus Accuracy - NLP blog.
Accuracy, Precision, Recall or F1? - Koo Ping Shung, Towards Data Science.
Dealing with Imbalanced data: undersampling, oversampling and proper cross-validation - Marco Altini.
Accuracy, Recall, Precision, F-Score & Specificity, which to optimize on? - Salma Ghoneim, Towards Data Science.