Evaluation

Evaluation of models is one of the most crucial parts of music classification. No matter how many state-of-the-art models are available, the practical performance of an application can differ depending on which model we choose. Hence, evaluation metrics that fit the purpose are essential for model selection. In this section, we explore evaluation metrics widely used in music classification. Along with the concepts and definitions of each metric, we provide implementations using the scikit-learn library.

Let’s explore different evaluation metrics with an example of a binary classification task. We want to assess a classifier that detects vocals in music. Our dataset has ten songs with vocals (blue) and ten songs without vocals (orange). The green circle is the decision boundary of the model: the model predicts that items inside the green circle are vocal music, and items outside the circle are instrumental music.

An example of single-label binary classification

import numpy as np
# Ground truth: the first ten songs are instrumental (False), the last ten are vocal (True)
y_true = np.array([False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True])
# Model predictions: items inside the green circle are predicted as vocal music (True)
y_pred = np.array([False, False, False, False, False, False, False, True, True, True, False, False, True, True, True, True, True, True, True, True])

As shown in the figure below, we can separate the predictions into four categories.

Four categories of predictions

  • True positives (TP): Correctly predicted vocal music (upper left).

  • False positives (FP): Predicted as vocal music but actually non-vocal (upper right).

  • False negatives (FN): Predicted as non-vocal music but actually vocal (lower left).

  • True negatives (TN): Correctly predicted non-vocal music (lower right).

# Count each category with element-wise boolean operations
TP = (y_true & y_pred).sum()
FP = (~y_true & y_pred).sum()
FN = (y_true & ~y_pred).sum()
TN = (~y_true & ~y_pred).sum()
print('True Positive: %d' % TP)
print('False Positive: %d' % FP)
print('False Negative: %d' % FN)
print('True Negative: %d' % TN)
True Positive: 8
False Positive: 3
False Negative: 2
True Negative: 7
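
These four counts can also be obtained at once with scikit-learn. A quick cross-check using sklearn.metrics.confusion_matrix and the y_true and y_pred arrays defined above:

from sklearn.metrics import confusion_matrix
# For binary labels, the confusion matrix is laid out as
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('True Positive: %d' % tp)
print('False Positive: %d' % fp)
print('False Negative: %d' % fn)
print('True Negative: %d' % tn)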

Accuracy

Accuracy is an intuitive evaluation metric for assessing classification models. It measures the fraction of items that are correctly predicted. The formula of accuracy is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

accuracy = (TP + TN) / (TP + TN + FP + FN)
print('Accuracy: %.4f' % accuracy)

from sklearn.metrics import accuracy_score
sklearn_accuracy = accuracy_score(y_true, y_pred)
print('Accuracy (sklearn): %.4f' % sklearn_accuracy)
Accuracy: 0.7500
Accuracy (sklearn): 0.7500

Precision

Precision measures how many retrieved items are truly relevant. Among the 11 retrieved items in the green circle, 8 are vocal music and 3 are not. The formula of precision is:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Precision is also known as positive predictive value.
precision = TP / (TP + FP)
print('Precision: %.4f' % precision)

from sklearn.metrics import precision_score
sklearn_precision = precision_score(y_true, y_pred)
print('Precision (sklearn): %.4f' % sklearn_precision)
Precision: 0.7273
Precision (sklearn): 0.7273

Recall

Recall measures how many relevant items are correctly retrieved. Among the 10 songs with vocals, 8 are correctly predicted as vocal music. The formula of recall is:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Recall is also known as sensitivity or true positive rate. Its counterpart is specificity, or true negative rate, which measures how many rejected items are truly negative, i.e., TN / (FP + TN).
recall = TP / (TP + FN)
print('Recall: %.4f' % recall)

from sklearn.metrics import recall_score
sklearn_recall = recall_score(y_true, y_pred)
print('Recall (sklearn): %.4f' % sklearn_recall)

sensitivity = TP / (TP + FN)
specificity = TN / (FP + TN)
print('Sensitivity: %.4f' % sensitivity)
print('Specificity: %.4f' % specificity)
Recall: 0.8000
Recall (sklearn): 0.8000
Sensitivity: 0.8000
Specificity: 0.7000

Tip

  • High precision is directly related to user experience. When retrieved items are truly relevant, users can trust the system.

  • However, a high-precision / low-recall system retrieves only a few positive items, which results in low diversity. Many relevant items (false negatives) will be discarded.

F-measure

F-measure or F-score is an evaluation metric for binary classification. The traditional F-measure (F1-score) is defined as the harmonic mean of precision and recall. The maximum value is 1.0, and the lowest is 0 (when either precision or recall is zero). The formula of the F1-score is:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

F1 = 2 * precision * recall / (precision + recall)
print('F1-score: %.4f' % F1)

from sklearn.metrics import f1_score
sklearn_F1 = f1_score(y_true, y_pred)
print('F1-score (sklearn): %.4f' % sklearn_F1)
F1-score: 0.7619
F1-score (sklearn): 0.7619

Tip

Depending on system requirements, either precision or recall may be more critical. The Fbeta-measure controls the balance between precision and recall using a coefficient beta: recall is weighted beta times as much as precision.
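
For example, scikit-learn provides sklearn.metrics.fbeta_score. A small sketch reusing the y_true and y_pred arrays from above (the beta values are arbitrary, chosen only for illustration):

from sklearn.metrics import fbeta_score
# beta < 1 emphasizes precision, beta > 1 emphasizes recall
f_half = fbeta_score(y_true, y_pred, beta=0.5)
f_two = fbeta_score(y_true, y_pred, beta=2.0)
print('F0.5-score: %.4f' % f_half)
print('F2-score: %.4f' % f_two)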

High precision vs high recall?

The model outputs the likelihood that the input contains vocals as a value between 0 and 1. Hence, to make a final decision, we need to set a threshold. With a high threshold, the model becomes stricter, which means the green circle becomes smaller. The results retrieved by the model for a given query “vocal music” will be reliable. However, the model retrieves only a few songs among all the vocal tracks (i.e., high precision and low recall). This can be observed in the precision-recall curve below: as the threshold gets closer to 1.0, precision goes up while recall goes down.

Threshold-varying precision-recall curve

On the other hand, if the threshold gets lower, the result is high recall and low precision, which means the system labels almost every item as positive. Hence, choosing an appropriate decision threshold is crucial.
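
This trade-off can be inspected numerically with sklearn.metrics.precision_recall_curve, which returns precision and recall at each candidate threshold. A minimal sketch, using hypothetical likelihood outputs for the 20-song example (the same score values reappear in the ROC-AUC example below):

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical vocal likelihoods: the first ten songs are instrumental, the last ten are vocal
y_true = np.array([0]*10 + [1]*10)
y_score = np.array([0.1, 0.3, 0.8, 0.6, 0.1, 0.4, 0.5, 0.1, 0.2, 0.2,
                    0.4, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.6, 0.8, 0.7])

# precision[i] and recall[i] are obtained by thresholding the scores at thresholds[i]
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print('threshold=%.2f  precision=%.4f  recall=%.4f' % (t, p, r))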

Area under receiver operating characteristic curve (ROC-AUC)

As we saw from the precision-recall curve, the model’s performance varies with the decision boundary (threshold). The receiver operating characteristic curve (ROC curve) reflects the model’s threshold-varying characteristics. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various thresholds, where TPR is also known as sensitivity or recall, and FPR is calculated as (1 - specificity).

ROC curve

In the figure above, the dotted black line indicates the ROC curve of a random classifier, the blue line indicates a better classifier, and the orange line shows a perfect classifier. As a classifier gets better, the area under the curve (AUC) gets larger. This area under the ROC curve is called the ROC-AUC score.

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
y_pred_random = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
y_pred_blue = np.array([0.1, 0.3, 0.8, 0.6, 0.1, 0.4, 0.5, 0.1, 0.2, 0.2, 0.4, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.6, 0.8, 0.7])
y_pred_orange = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

from sklearn.metrics import roc_auc_score
roc_auc_random = roc_auc_score(y_true, y_pred_random)
roc_auc_blue = roc_auc_score(y_true, y_pred_blue)
roc_auc_orange = roc_auc_score(y_true, y_pred_orange)
print('ROC-AUC (random): %.4f' % roc_auc_random)
print('ROC-AUC (blue): %.4f' % roc_auc_blue)
print('ROC-AUC (orange): %.4f' % roc_auc_orange)
ROC-AUC (random): 0.5000
ROC-AUC (blue): 0.8450
ROC-AUC (orange): 1.0000
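
The TPR and FPR pairs that trace the curve can be computed explicitly with sklearn.metrics.roc_curve. A short sketch for the blue classifier defined above:

from sklearn.metrics import roc_curve
# Each (FPR, TPR) pair corresponds to one decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_pred_blue)
for f, t, th in zip(fpr, tpr, thresholds):
    print('threshold=%.2f  FPR=%.4f  TPR=%.4f' % (th, f, t))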

Area under precision-recall curve (PR-AUC)

It is known that ROC-AUC may report overly optimistic results with imbalanced data. Therefore, the area under the precision-recall curve (PR-AUC) is often reported together with ROC-AUC. The precision-recall curve is created by plotting precision against recall at different thresholds. Unlike the ROC-AUC score, which is 0.5 for a random classifier, the baseline of PR-AUC depends on the data. When a model predicts every item to be positive regardless of the threshold, recall will always be 1.0, and precision will be the ratio of positive items to all items. Hence, the lowest value of PR-AUC is the ratio of positive items.

Precision-recall curve

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred_random = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
y_pred = np.array([0.1, 0.3, 0.8, 0.6, 0.1, 0.4, 0.5, 0.1, 0.2, 0.2, 0.4, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.6, 0.8, 0.7])


from sklearn.metrics import roc_auc_score, average_precision_score
roc_auc = roc_auc_score(y_true, y_pred)
roc_auc_random = roc_auc_score(y_true, y_pred_random)
pr_auc = average_precision_score(y_true, y_pred)
pr_auc_random = average_precision_score(y_true, y_pred_random)
print('ROC-AUC (random): %.4f' % roc_auc_random)
print('PR-AUC (random): %.4f' % pr_auc_random)
print('ROC-AUC: %.4f' % roc_auc)
print('PR-AUC: %.4f' % pr_auc)
ROC-AUC (random): 0.5000
PR-AUC (random): 0.1000
ROC-AUC: 0.8472
PR-AUC: 0.2917

Warning

The average precision (sklearn.metrics.average_precision_score) is one method for estimating PR-AUC. There are other methods, such as the trapezoidal estimate and the interpolated estimate.
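
For instance, a trapezoidal estimate can be obtained by combining sklearn.metrics.precision_recall_curve with sklearn.metrics.auc. A minimal sketch reusing the y_true and y_pred arrays from the PR-AUC example above (the value will generally differ slightly from the average precision):

from sklearn.metrics import precision_recall_curve, auc
# Trapezoidal estimate of the area under the precision-recall curve
precision, recall, _ = precision_recall_curve(y_true, y_pred)
pr_auc_trapezoid = auc(recall, precision)
print('PR-AUC (trapezoid): %.4f' % pr_auc_trapezoid)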

Tip

When the classification task has multiple labels, we need to aggregate the tag-wise ROC-AUC and PR-AUC scores. In the scikit-learn library, this is controlled by the average option. Most automatic music tagging research uses average='macro', which averages tag-wise metrics. For more details, check the documentation of roc_auc_score and average_precision_score.
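
A minimal sketch of macro-averaged scores for a hypothetical three-tag problem (the label matrix and likelihoods below are made up purely for illustration):

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical multi-label ground truth (songs x tags) and predicted likelihoods
y_true_multi = np.array([[1, 0, 0],
                         [1, 1, 0],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 0, 1]])
y_score_multi = np.array([[0.9, 0.2, 0.1],
                          [0.8, 0.6, 0.3],
                          [0.2, 0.7, 0.8],
                          [0.1, 0.4, 0.9],
                          [0.7, 0.3, 0.6]])

# average='macro' computes the metric for each tag and then takes the unweighted mean
print('ROC-AUC (macro): %.4f' % roc_auc_score(y_true_multi, y_score_multi, average='macro'))
print('PR-AUC (macro): %.4f' % average_precision_score(y_true_multi, y_score_multi, average='macro'))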