Multiclass Classification in Machine Learning


If you’re reading this article, you have probably realised that classification problems in real life are rarely limited to a binary choice of ‘yes’ and ‘no’, or ‘this’ and ‘that’.

If the number of classes that the tuples can be classified into exceeds two, the classification is labelled as Multiclass Classification – so, essentially, it’s a matter of ‘this’ or ‘that’ or ‘that’…

Suppose we are analysing sentences in the English language and want to classify each word by its part of speech: that is a multiclass classification. For example, in the sentence “I am swimming in the water”, I can classify “I” as a ‘pronoun’, “am” and “swimming” as ‘verbs’, “in” as a ‘preposition’, and so on. Here, the possible outcomes of the classification are not limited to two, which makes the problem bigger and more complex than a binary classification problem.

Where is Multiclass Classification used?

Sentiment analysis, which can be used to flag offensive texts online or to gauge public opinion of a product or person, is one of the more novel uses of Multiclass Classification. Other widespread uses include classifying medical diagnoses by the seriousness of the ailment, classifying images of animals by species, and classifying images of human faces by the emotion they depict.

Further, classifying products on e-marts such as Zepto, Dmart, and BigBasket into categories like ‘Cleaning Essentials’, ‘Bath and Body’, ‘Packaged Food’, ‘Hygiene and Grooming’, etc., is another common application, as is the classification of malware in cybersecurity.

How is Multiclass Classification different from a Multilabel Classification?

In Multiclass Classification, we classify each tuple into one category only. For example, in a library database, let’s say we have to classify each book on the basis of who the publisher is – there can be only one for every book, right? But the total number of publishers can be in the hundreds. Here, the classes that the tuples of data can be classified into are multiple, but only one for each book.

Alternatively, if I am classifying the books based on what genres they contain, it would be an example of a Multilabel Classification – every book could have multiple genres like romance, comedy, mystery, thriller, drama, etc… So every book can be given multiple labels here.
The below diagram elucidates the above-discussed concept:

Multiclass Classification vs Multilabel Classification
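
The difference also shows up in the shape of the target data. Here is a minimal sketch of the library example, assuming made-up publisher names and genre sets (they are illustrative only) and using scikit-learn's MultiLabelBinarizer for the multilabel encoding:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Multiclass: exactly one publisher per book, so the target is a 1-D array.
# The publisher names are hypothetical examples.
publishers = np.array(["Penguin", "HarperCollins", "Penguin", "Hachette"])

# Multilabel: each book may carry several genres, so the target is usually
# encoded as a binary indicator matrix with one column per genre.
genres = [
    {"romance", "comedy"},
    {"mystery", "thriller"},
    {"drama"},
    {"romance", "drama", "mystery"},
]
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(genres)

print(publishers.shape)  # one label per book
print(mlb.classes_)      # genre names, in sorted order
print(genre_matrix)      # rows = books, columns = genres
```

Each row of genre_matrix can contain several 1s, whereas each entry of publishers holds exactly one class.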

Multiclass Classification Models and Algorithms

The various Multiclass Classification techniques we have at our disposal can be categorised as follows:

  1. Transformation to Binary
  2. Extension from Binary
  3. Hierarchical Classification

Transformation to Binary

Two methods can be used to divide a Multiclass Classification problem into multiple Binary Classification datasets, and then train an individual Binary Classification model on each:

One-vs-Rest (OvR)

The One-vs-Rest method (also called One-vs-All, OvA, or One-against-All, OaA) trains one classifier per class, treating the samples belonging to that particular class as positives and all the other samples as negatives.

Let’s say we have a Multiclass Classification problem with all the tuples being classified as either ‘black’, ‘purple’, ‘pink’, or ‘red’. We would then split this problem into 4 Binary Classification problems as follows:

  • Binary Classification Problem 1: black vs [purple, pink, red] – (black, not black)
  • Binary Classification Problem 2: purple vs [black, pink, red] – (purple, not purple)
  • Binary Classification Problem 3: pink vs [black, purple, red] – (pink, not pink)
  • Binary Classification Problem 4: red vs [black, purple, pink] – (red, not red)

The disadvantage of this method is that, because only one class is considered positive at a time and the rest are considered negative, each Binary problem becomes an imbalanced classification. When such a class imbalance exists, the default behaviour of the Machine Learning model is to over-predict the majority class. This might lead to higher accuracy, but the resulting model is biased, with a higher probability of misclassifying the minority classes. Handling imbalanced datasets is beyond the scope of this article.
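
The OvR procedure described above can be sketched by hand as follows. This is only an illustration under my own assumptions: a synthetic dataset with integer labels 0 to 3, LogisticRegression as the binary model, and picking the class whose binary model gives the highest decision score. It also prints the share of positives in each binary problem, which makes the imbalance visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class data (assumption: seeded so the sketch is repeatable).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=1,
    n_classes=4, random_state=0)

classes = np.unique(y_demo)
binary_models = []
positive_share = []
for c in classes:
    # Relabel: samples of class c are positives (1), everything else is 0.
    y_binary = (y_demo == c).astype(int)
    positive_share.append(y_binary.mean())
    binary_models.append(LogisticRegression(max_iter=1000).fit(X_demo, y_binary))

# One column of decision scores per binary model; predict the class whose
# model is most confident that the sample is a positive.
scores = np.column_stack([m.decision_function(X_demo) for m in binary_models])
ovr_pred = classes[np.argmax(scores, axis=1)]

print("share of positives per binary problem:", np.round(positive_share, 2))
print("training accuracy:", (ovr_pred == y_demo).mean())
```

With four roughly balanced classes, each binary problem has only about a quarter of its samples as positives.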

One-vs-One (OvO)

Like OvR, One-vs-One (OvO) also splits a Multiclass Classification problem into Binary Classification problems. But unlike OvR, the OvO approach creates one binary dataset for every pair of classes, that is, each class versus every other class.
For n classes, the total number of Binary Classifications is given by the formula:

Number of Binary Classifications = n × (n − 1) / 2

In the above-mentioned example of having to classify the data into the classes of ‘black’, ‘purple’, ‘pink’, or ‘red’, n = 4, which gives a total of 4 × 3 / 2 = 6 Binary Classifications. They will be as follows:

  • Binary Classification Problem 1: black vs purple
  • Binary Classification Problem 2: black vs pink
  • Binary Classification Problem 3: black vs red
  • Binary Classification Problem 4: purple vs pink
  • Binary Classification Problem 5: purple vs red
  • Binary Classification Problem 6: pink vs red
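
The pairings listed above can be enumerated with itertools.combinations from the Python standard library. A small sketch, which also checks the count against n × (n − 1) / 2, the standard count of unordered pairs among n items:

```python
from itertools import combinations

classes = ["black", "purple", "pink", "red"]

# Every unordered pair of classes becomes one binary problem.
pairs = list(combinations(classes, 2))
for i, (a, b) in enumerate(pairs, start=1):
    print(f"Binary Classification Problem {i}: {a} vs {b}")

n = len(classes)
print(len(pairs), n * (n - 1) // 2)
```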

At the time of prediction, a voting scheme is applied: all the individual Binary Classifiers are applied to the unseen sample, and the class that receives the highest number of votes is predicted by the combined classifier. This method, too, suffers from certain ambiguities, since in some regions of the input space two or more classes may receive the same number of votes. Another major downside of this strategy is its computational workload: as each pair of classes requires its own Binary Classifier, datasets with a large number of classes may take too long to train.
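
The voting scheme can be sketched by hand as below. The assumptions are mine: a synthetic dataset with integer labels 0 to 3, LogisticRegression as the pairwise model, and ties broken in favour of the lowest class index (which is simply what argmax does here, not a recommended rule).

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=1,
    n_classes=4, random_state=0)
classes = np.unique(y_demo)  # array([0, 1, 2, 3])

# Train one binary model per pair, using only the samples of those two classes.
pair_models = {}
for a, b in combinations(classes, 2):
    mask = (y_demo == a) | (y_demo == b)
    pair_models[(a, b)] = LogisticRegression(max_iter=1000).fit(
        X_demo[mask], y_demo[mask])

# Each pairwise model casts one vote per sample for the class it predicts.
# Labels are 0..3 here, so a label can be used directly as a column index.
votes = np.zeros((len(X_demo), len(classes)), dtype=int)
rows = np.arange(len(X_demo))
for model in pair_models.values():
    winners = model.predict(X_demo)
    votes[rows, winners] += 1

ovo_pred = classes[votes.argmax(axis=1)]  # ties go to the lowest class index
tied = (votes == votes.max(axis=1, keepdims=True)).sum(axis=1) > 1

print("samples with tied votes:", int(tied.sum()))
print("training accuracy:", (ovo_pred == y_demo).mean())
```

The count of tied samples shows how often the ambiguity mentioned above actually arises on this particular dataset.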

Extension from Binary

As the name indicates, these classifiers extend existing Binary Classifiers so that they handle more than two classes directly. Multiclass versions of k-nearest neighbours, naïve Bayes, neural networks, decision trees, SVMs (support vector machines), etc., have been developed to address Multiclass Classification problems; these are also called algorithm adaptation techniques.
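
In scikit-learn, several of these algorithms accept multiclass targets without any wrapper. A short sketch, assuming a synthetic 4-class dataset and default hyperparameters (chosen for brevity, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=1,
    n_classes=4, random_state=0)

native_models = {
    "k-nearest neighbours": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

for name, model in native_models.items():
    # A single fit call handles all four classes; no binary splitting needed.
    model.fit(X_demo, y_demo)
    print(name, model.classes_, model.predict(X_demo[:3]))
```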

Hierarchical Classification

Hierarchical classification tackles the Multiclass Classification problem by dividing and subdividing the output space into a tree. Each parent node is divided repeatedly until each child node represents a single output class. Several approaches based on hierarchical classification have been proposed:

  1. Local classifiers, one of the most popular and widely used approaches for hierarchical classification, which can be built in two ways: one classifier for every parent node, or one classifier for every node.
  2. Global classifier, where a single classifier is trained to predict all the classes in the hierarchy; this can be framed as a multi-label task and handled accordingly.
  3. Flat classification, where we ignore the parent nodes and directly predict the classes present in the leaf nodes.
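
As an illustration of the "classifier for every parent node" variant, here is a rough sketch. The two-level hierarchy is hypothetical (leaf classes 0 and 1 under one parent, 2 and 3 under another), as are the helper name predict_hierarchical and the choice of shallow decision trees at each node.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=1,
    n_classes=4, random_state=0)

# Hypothetical hierarchy: parent_of[leaf_class] gives the parent node.
parent_of = np.array([0, 0, 1, 1])
y_parent = parent_of[y_demo]

# Root classifier: decides which parent node a sample belongs to.
root_model = DecisionTreeClassifier(max_depth=4, random_state=0)
root_model.fit(X_demo, y_parent)

# One classifier per parent node: separates only that parent's leaf classes.
child_models = {}
for p in np.unique(parent_of):
    mask = y_parent == p
    child_models[p] = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
        X_demo[mask], y_demo[mask])

def predict_hierarchical(X_new):
    """Route each sample through the root, then through its parent's model."""
    parents = root_model.predict(X_new)
    leaves = np.empty(len(X_new), dtype=int)
    for p, model in child_models.items():
        mask = parents == p
        if mask.any():
            leaves[mask] = model.predict(X_new[mask])
    return leaves

hier_pred = predict_hierarchical(X_demo)
print("training accuracy:", (hier_pred == y_demo).mean())
```

One consequence of this design is that a mistake at the root cannot be corrected further down the tree, since each child model only knows its own leaf classes.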

Multiclass Classification implementation in Python

We will be using the scikit-learn (sklearn) library to implement a few of the above-discussed models.

The IPYNB with the code I have shown below can be found here.

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split

make_classification is essentially a tool to generate data with different distributions and profiles to experiment on. Since we are testing various algorithms and want to find out which one works in which cases, such data generators help us produce case-specific data to test the algorithms on.

X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=4, n_redundant=1, n_classes=4)

Here, X and y are as shown below:

[Image: the X and y arrays generated by make_classification]

X is essentially just the generated samples, and y represents the integer labels for the class membership of each sample.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1121218)

train_test_split splits the input arrays into random train and test subsets. Here, test_size=0.3 is the proportion of the dataset to include in the test split, and random_state controls the shuffling applied to the data before the split: it sets the seed of the random generator, so that the split is the same on every run. Without it, we would get a different split each time.
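
The effect of random_state is easy to check on a toy array. A small sketch, where the ten-row array and the variable names are mine and are not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same seed give exactly the same split.
a_train, a_test, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1121218)
b_train, b_test, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1121218)

print(np.array_equal(a_test, b_test))  # same rows end up in the test set
print(len(a_train), len(a_test))       # sizes of the two subsets
```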

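e_t_c = ExtraTreesClassifier()
_ = e_t_c.fit(X_train, y_train)  # train on the training split
y_pred = e_t_c.predict(X_test)  # predicted class labels for the test split
accuracy = e_t_c.score(X_test, y_test)  # mean accuracy on the test split
cm = confusion_matrix(y_test, y_pred)  # rows: true classes, columns: predicted classes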

ExtraTreesClassifier is an ensemble learning technique that aggregates the results of multiple de-correlated decision trees collected in a “forest”. It is similar to the Random Forest Classifier, and differs from it only in how the decision trees in the forest are constructed.
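
Part of that difference can be seen by inspecting the two estimators in scikit-learn. A brief sketch, assuming a small synthetic dataset and 10 trees per forest to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=1,
    n_classes=4, random_state=0)

et = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# By default, Random Forest trains each tree on a bootstrap sample,
# while Extra Trees uses the whole training set for every tree.
print(et.bootstrap, rf.bootstrap)

# The individual trees also differ: Extra Trees draws split thresholds at
# random, while Random Forest searches for the best threshold.
print(type(et.estimators_[0]).__name__, et.estimators_[0].splitter)
print(type(rf.estimators_[0]).__name__, rf.estimators_[0].splitter)
```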

The accuracy I get for the classifier is 0.6933. The confusion matrix is stored in cm, and can be displayed as follows:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
cmp = ConfusionMatrixDisplay(
    confusion_matrix(y_test, y_pred),
    display_labels=["Class 1", "Class 2", "Class 3", "Class 4"],
)
cmp.plot(ax=ax)  # draw the matrix on the axes created above
plt.show()

This gives us the Confusion Matrix in a slightly more formatted and easy-to-understand way –

confusion matrix

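The confusion matrix shows, for each true class (row), how many samples were assigned to each predicted class (column). The entries on the leading diagonal are the correct predictions, so the lower the numbers outside the leading diagonal, the better the classifier.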
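
A tiny hand-made example may make the layout easier to read. The eight labels below are invented for illustration and have nothing to do with the dataset above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_hat = [0, 1, 1, 1, 2, 0, 3, 3]

# Rows are the true classes, columns are the predicted classes.
cm_demo = confusion_matrix(y_true, y_hat)
print(cm_demo)

# Correct predictions sit on the leading diagonal.
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()
recall_per_class = np.diag(cm_demo) / cm_demo.sum(axis=1)
print(accuracy_demo)     # 6 of 8 correct -> 0.75
print(recall_per_class)  # share of each true class that was recovered
```

Here the two off-diagonal entries are one sample of class 0 predicted as class 1 and one sample of class 2 predicted as class 0.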

from sklearn.linear_model import Perceptron
from sklearn.multiclass import OneVsRestClassifier
OvR = OneVsRestClassifier(estimator=Perceptron())
_ = OvR.fit(X_train, y_train)  # trains one Perceptron per class
accuracy = OvR.score(X_test, y_test)

This is the code for running the OvR algorithm on the dataset. I get an accuracy of 0.473. The value returned from len(OvR.estimators_) is 4, which is the exact number of classes we have, since we have set n_classes to be 4 in the make_classification function.

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
OvO = OneVsOneClassifier(estimator=GaussianProcessClassifier())
_ = OvO.fit(X_train, y_train)  # trains one classifier per pair of classes
OvO_predictions = OvO.predict(X_test)
accuracy = OvO.score(X_test, y_test)
cm = confusion_matrix(y_test, OvO_predictions)
fig, ax = plt.subplots(figsize=(8, 5))
cmp = ConfusionMatrixDisplay(
    confusion_matrix(y_test, OvO_predictions),
    display_labels=["Class 1", "Class 2", "Class 3", "Class 4"],
)
cmp.plot(ax=ax)  # draw the matrix on the axes created above

This is how we run the OvO algorithm on the dataset. The len(OvO.estimators_) returns 6, which is the number we get on substituting the value 4 in the above-discussed formula.

Further, we get the accuracy of the classifier as 0.73, which is a substantial improvement from the previous one. 

The Confusion Matrix obtained is as follows –

confusion matrix output

It is to be noted that these numbers will vary from run to run, as the dataset is generated randomly (make_classification is called without a random_state). Also, the two runs above use different base estimators (Perceptron for OvR and GaussianProcessClassifier for OvO), so the gap in accuracy cannot be attributed to the OvO strategy alone. For datasets with a small number of classes, the difference between OvO and OvR may be negligible.
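
To compare the two strategies on a more equal footing, one option is to give both the same base estimator and fix the random seeds. A minimal sketch, assuming LogisticRegression as the shared base estimator and a seed of 0 (both are my choices for illustration and are not used in the code above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Seeded data and split, so the comparison is repeatable.
X_cmp, y_cmp = make_classification(
    n_samples=1000, n_features=5, n_informative=4, n_redundant=1,
    n_classes=4, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cmp, y_cmp, test_size=0.3, random_state=0)

# Both wrappers clone the base estimator, so sharing one instance is safe.
base = LogisticRegression(max_iter=1000)
ovr_cmp = OneVsRestClassifier(base).fit(Xc_train, yc_train)
ovo_cmp = OneVsOneClassifier(base).fit(Xc_train, yc_train)

print("OvR:", len(ovr_cmp.estimators_), "models, accuracy",
      ovr_cmp.score(Xc_test, yc_test))
print("OvO:", len(ovo_cmp.estimators_), "models, accuracy",
      ovo_cmp.score(Xc_test, yc_test))
```

Any gap that remains in this setup comes from the decomposition strategy rather than from the choice of binary model.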

Thank you for reading this article.

Author: Ayush Purawr