# Avoid a Mistake: Correctly Calculate Multiclass Accuracy

Today I held a short laboratory which tackled different metrics used in evaluating classifiers. One of the tasks gave the performance of 2 classifiers as **confusion matrices** and asked the students to calculate the **accuracy** of each model. One model was a **binary classifier** and the other was a **multiclass classifier**.

Many students resorted to googling for an **accuracy formula**, which returned the following:

\[\color{Red}{ACC = \frac{TP + TN}{TP + TN + FP + FN}}\]

Then, they calculated a **‘per-class’ accuracy** (for class \(i\), they had \(ACC_i\)) and **macro-averaged** the results over the \(N\) classes like below:

\[ACC = \frac{1}{N} \sum_{i=1}^{N} ACC_i\]

To their surprise, the resulting accuracy for the **multiclass classifier** was **erroneous** and very different from the value returned by `accuracy_score()` from **sklearn**. However, the accuracy of the **binary classifier** was correct.

As there wasn’t much time available, I told them to use the following **accuracy formula** to match the results of **sklearn**, and promised to send an explanation later:

\[ACC = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FP_i)}\]

Some of you might recognize this as **micro-averaged precision**.

The purpose of this article is to serve as a list of DOs and DON’Ts so we can avoid such mistakes in the future.

## What was wrong?

Basically, you’re prone to get invalid results if you **average** ‘per-class’ accuracy values in an attempt to obtain the **global accuracy**. But… even if you directly calculate the **global accuracy** using the first formula (the one that includes \(TN\)), you’d get skewed values.

Take a look at the following classifier, described using a **confusion matrix**:

| \ | Class #0 | Class #1 | Class #2 |
|---|---|---|---|
| Class #0 | 0 | 100 | 100 |
| Class #1 | 100 | 0 | 100 |
| Class #2 | 100 | 100 | 0 |

You’ll notice that \(TP = 0\) for every class, thus the classifier is doing a really bad job.

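If we follow the students’ approach and calculate the **‘per-class’ accuracy** (let’s say for **Class #0**, where \(TP = 0\), \(TN = 200\), \(FP = 200\) and \(FN = 200\)), we have:

\[\color{Red}{ACC_0 = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 200}{0 + 200 + 200 + 200} = 0.333(3)}\]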

This already looks suspicious. You’ll get the same results for the other 2 classes, so… on average, \(\color{Red}{ACC = 0.333(3)}\). This is definitely wrong.

If you directly compute **global accuracy** using the same formula (summing all \(TP\)s, \(TN\)s, …), you get the same result because of the symmetry. This happens mainly because of the \(TN\) in the numerator: for a given class, every sample that involves only the other classes counts as a true negative, even when it is misclassified, so \(TN\) grows faster than any other term. In other words, as the number of classes grows, this error grows as well; a similar model, but with **4 classes**, gets a **0.5** accuracy.
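To make the effect concrete, here is a minimal pure-Python sketch of the students’ approach. The helper names (`one_vs_rest_counts`, `macro_accuracy`, `all_wrong_matrix`) are made up for this illustration, and the code assumes that rows hold the actual classes and columns the predicted ones (the formula treats \(FP\) and \(FN\) symmetrically, so the orientation does not change the result).

```python
def one_vs_rest_counts(cm, i):
    """Return (TP, TN, FP, FN) for class i of a square confusion matrix.

    Assumption: rows are actual classes, columns are predicted classes.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[i][i]
    row_sum = sum(cm[i])                       # samples that really are class i
    col_sum = sum(cm[r][i] for r in range(n))  # samples predicted as class i
    fn = row_sum - tp
    fp = col_sum - tp
    tn = total - row_sum - col_sum + tp        # everything outside row i and column i
    return tp, tn, fp, fn


def per_class_accuracy(cm, i):
    """Accuracy of class i using the formula that includes TN."""
    tp, tn, fp, fn = one_vs_rest_counts(cm, i)
    return (tp + tn) / (tp + tn + fp + fn)


def macro_accuracy(cm):
    """Average of the per-class accuracies (the approach to avoid)."""
    n = len(cm)
    return sum(per_class_accuracy(cm, i) for i in range(n)) / n


def all_wrong_matrix(n, count=100):
    """n-class matrix with an empty diagonal: no prediction is ever correct."""
    return [[0 if r == c else count for c in range(n)] for r in range(n)]


for n in (3, 4, 10):
    # The true accuracy is 0 in every case, yet the reported value keeps growing.
    print(n, round(macro_accuracy(all_wrong_matrix(n)), 3))
# prints:
# 3 0.333
# 4 0.5
# 10 0.8

# In the binary case the TN of one class is the TP of the other,
# which is why the students' binary result came out right.
print(macro_accuracy([[40, 10], [5, 45]]))  # 0.85
```

For this particular ‘always wrong’ pattern with \(N\) classes, each class has \(TN = 100(N-1)(N-2)\) and \(FP = FN = 100(N-1)\), so the formula returns \((N-2)/N\), which approaches 1 as \(N\) grows.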

Using the second formula, the **global accuracy** becomes:

\[ACC = \frac{\sum_{i} TP_i}{\sum_{i} (TP_i + FP_i)} = \frac{0 + 0 + 0}{200 + 200 + 200} = 0\]

This is, indeed, a better result. Moreover, the formula generates the same results as `accuracy_score()` from **sklearn** when given more diverse confusion matrices.
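The agreement is not a coincidence: in a single-label problem, \(TP_i + FP_i\) is the number of samples predicted as class \(i\), so the denominator adds up to the total number of samples and the formula reduces to ‘correct predictions divided by all predictions’. The sketch below checks this on a small made-up label set. It is pure Python; `accuracy_from_labels` is a hypothetical stand-in that computes the fraction of correctly classified samples, which is what `accuracy_score()` returns by default, and the matrix again assumes rows are actual classes and columns are predicted ones.

```python
def confusion_matrix_from_labels(y_true, y_pred, n_classes):
    """Build a confusion matrix (rows: actual, columns: predicted)."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def micro_precision(cm):
    """Second formula: sum of TP_i over sum of (TP_i + FP_i)."""
    n = len(cm)
    tp_sum = sum(cm[i][i] for i in range(n))
    # TP_i + FP_i is the total of column i (everything predicted as class i).
    predicted_sum = sum(cm[r][i] for i in range(n) for r in range(n))
    return tp_sum / predicted_sum


def accuracy_from_labels(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 2]

cm = confusion_matrix_from_labels(y_true, y_pred, n_classes=3)
print(micro_precision(cm))                   # 0.7
print(accuracy_from_labels(y_true, y_pred))  # 0.7
```

If sklearn is available, replacing `accuracy_from_labels` with `sklearn.metrics.accuracy_score(y_true, y_pred)` should give the same value.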

##### If you compute **‘per class’ accuracies** using the second formula and average the values, you’re basically getting a **macro-averaged precision**. Point is, that’s not **accuracy** - so don’t do that.
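A quick way to see that the two quantities differ is to try an imbalanced confusion matrix. The following sketch uses a made-up matrix and hypothetical helper names; rows are assumed to be actual classes and columns predicted ones, and a class that is never predicted is given a precision of 0 purely as a simplifying assumption.

```python
def per_class_precision(cm, i):
    """TP_i / (TP_i + FP_i) for class i (rows: actual, columns: predicted)."""
    predicted_as_i = sum(row[i] for row in cm)
    if predicted_as_i == 0:
        return 0.0  # assumption: treat an undefined precision as 0
    return cm[i][i] / predicted_as_i


def macro_precision(cm):
    """Unweighted mean of the per-class precisions."""
    n = len(cm)
    return sum(per_class_precision(cm, i) for i in range(n)) / n


def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Made-up, imbalanced example: class 0 dominates the data set.
cm = [
    [90, 5, 5],
    [4, 1, 0],
    [4, 0, 1],
]

print(round(macro_precision(cm), 3))  # 0.417
print(round(accuracy(cm), 3))         # 0.836
```

The macro average gives every class the same weight regardless of how many samples it has, so the two poorly predicted minority classes pull the value far below the accuracy.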

## Conclusion

I’d recommend avoiding:

- the idea of calculating a **global accuracy** by averaging **‘per-class’ accuracies**
- the red formula, which includes \(TN\), since the other one returns correct values for any number of classes

Overall, you can compute **precision**, **recall**, **F1** in a ‘per-class’ manner. But I’m not so sure it also works with the **accuracy**.