ML class imbalance problem

Gunjan Agicha
6 min read · Mar 26, 2019

Class imbalance: Suppose you have a rare-disease dataset in which only about 8% of the samples are positive. Even without training anything, a model that simply predicts that nobody is sick achieves 92% accuracy. So in cases of class imbalance, accuracy is a misleading measure of performance.

In this guide, we’ll cover 5 tactics for handling imbalanced classes in machine learning:

  1. Up-sample the minority class
  2. Down-sample the majority class
  3. Change your performance metric
  4. Penalize algorithms (cost-sensitive training)
  5. Use tree-based algorithms

1. Up-sample the Minority Class: Up-sampling means randomly duplicating observations from the minority class until the number of samples matches that of the majority class.

First, we’ll separate observations from each class into different DataFrames.

Next, we’ll resample the minority class with replacement, setting the number of samples to match that of the majority class.

Finally, we’ll combine the up-sampled minority class DataFrame with the original majority class DataFrame.
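A minimal sketch of these three steps, assuming a toy pandas DataFrame with a binary target column named balance (both the data and the column name are illustrative, not the original article’s):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced DataFrame: ~8% of rows belong to the minority class (1)
rng = np.random.default_rng(0)
df = pd.DataFrame({'feature': rng.normal(size=100),
                   'balance': [1 if i < 8 else 0 for i in range(100)]})

# Separate observations from each class into different DataFrames
df_majority = df[df.balance == 0]
df_minority = df[df.balance == 1]

# Resample the minority class WITH replacement to match the majority class size
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=123)

# Combine the up-sampled minority class with the original majority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
print(df_upsampled.balance.value_counts())
```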

Create Synthetic Samples (Data Augmentation)

Creating synthetic samples is a close cousin of up-sampling, and some people might categorize them together. For example, the SMOTE algorithm is a method of resampling from the minority class while slightly perturbing feature values, thereby creating “new” samples.

You can find an implementation of SMOTE in the imblearn library.
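A minimal sketch of SMOTE on a synthetic imbalanced dataset (the data here is only illustrative; imblearn is installed as the imbalanced-learn package):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy dataset with roughly 8% positives, mirroring the example above
X, y = make_classification(n_samples=1000, weights=[0.92, 0.08], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE interpolates new minority samples between existing minority neighbours
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))
```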

2. Down-sample Majority Class: Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm.

First, we’ll separate observations from each class into different DataFrames.

Next, we’ll resample the majority class without replacement, setting the number of samples to match that of the minority class.

Finally, we’ll combine the down-sampled majority class DataFrame with the original minority class DataFrame.
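A minimal sketch of down-sampling, using the same illustrative DataFrame layout as in the up-sampling example above:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced DataFrame: ~8% of rows belong to the minority class (1)
rng = np.random.default_rng(0)
df = pd.DataFrame({'feature': rng.normal(size=100),
                   'balance': [1 if i < 8 else 0 for i in range(100)]})

# Separate observations from each class into different DataFrames
df_majority = df[df.balance == 0]
df_minority = df[df.balance == 1]

# Resample the majority class WITHOUT replacement to match the minority class size
df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=123)

# Combine the down-sampled majority class with the original minority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
print(df_downsampled.balance.value_counts())
```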

3. Change Your Performance Metric: AUROC (Area under ROC curve)

ROC: Receiver Operating Characteristic

Assume we have a probabilistic, binary classifier such as logistic regression.

Before presenting the ROC (Receiver Operating Characteristic) curve, the concept of the confusion matrix must be understood. When we make a binary prediction, there can be 4 types of outcomes:

  • We predict 0 while the true class is actually 0: this is called a True Negative, i.e. we correctly predict that the class is negative (0). For example, an antivirus did not flag a harmless file as a virus.
  • We predict 0 while the true class is actually 1: this is called a False Negative, i.e. we incorrectly predict that the class is negative (0). For example, an antivirus failed to detect a virus.
  • We predict 1 while the true class is actually 0: this is called a False Positive, i.e. we incorrectly predict that the class is positive (1). For example, an antivirus considered a harmless file to be a virus.
  • We predict 1 while the true class is actually 1: this is called a True Positive, i.e. we correctly predict that the class is positive (1). For example, an antivirus rightfully detected a virus.

To get the confusion matrix, we go over all the predictions made by the model, and count how many times each of those 4 types of outcomes occur:

In this example of a confusion matrix, among the 50 data points that are classified, 45 are correctly classified and 5 are misclassified.
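A minimal sketch of counting these four outcomes with scikit-learn (the labels below are made up purely for illustration):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = positive, 0 = negative
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")   # TN=4 FP=1 FN=1 TP=4
```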

Since it is often more convenient to compare two different models with a single metric rather than several, we compute two metrics from the confusion matrix, which we will later combine into one:

  • True positive rate (TPR), aka. sensitivity, hit rate, and recall, which is defined as TP / (TP + FN). Intuitively this metric corresponds to the proportion of positive data points that are correctly considered as positive, with respect to all positive data points. In other words, the higher the TPR, the fewer positive data points we will miss.
  • False positive rate (FPR), aka. fall-out, which is defined as FP / (FP + TN). Intuitively this metric corresponds to the proportion of negative data points that are mistakenly considered as positive, with respect to all negative data points. In other words, the higher the FPR, the more negative data points will be misclassified.
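Continuing the made-up counts from the confusion matrix sketch above (TP=4, FN=1, FP=1, TN=4), the two rates work out as:

```python
tp, fn, fp, tn = 4, 1, 1, 4        # counts from the illustrative example above
tpr = tp / (tp + fn)               # true positive rate (recall)    = 4 / 5 = 0.8
fpr = fp / (fp + tn)               # false positive rate (fall-out) = 1 / 5 = 0.2
print(tpr, fpr)
```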

To combine the FPR and the TPR into one single metric, we first compute these two metrics at many different thresholds (for example 0.00, 0.01, 0.02, …, 1.00) for the logistic regression, then plot them on a single graph, with the FPR values on the abscissa and the TPR values on the ordinate. The resulting curve is called the ROC curve, and the metric we consider is the AUC of this curve, which we call the AUROC.
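A minimal sketch of this procedure with scikit-learn, on a synthetic imbalanced dataset (the data and model settings are illustrative): roc_curve sweeps the thresholds and returns the FPR/TPR pairs, and roc_auc_score gives the area under the resulting curve.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Toy dataset with roughly 8% positives, plus a probabilistic classifier
X, y = make_classification(n_samples=1000, weights=[0.92, 0.08], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]          # predicted probability of class 1

# FPR and TPR at many thresholds, then the area under the resulting curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUROC:", roc_auc_score(y_test, probs))
```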

The following figure shows the AUROC graphically:

In this figure, the blue area corresponds to the Area Under the Curve of the Receiver Operating Characteristic (AUROC). The dashed diagonal line represents the ROC curve of a random predictor: it has an AUROC of 0.5.

The AUROC is between 0 and 1, and AUROC = 1 means the prediction model is perfect. In fact, the further the AUROC is from 0.5, the better: if AUROC < 0.5, then you just need to invert the decisions your model is making. As a result, if AUROC = 0, that’s good news, because you just need to invert your model’s output to obtain a perfect model.

4. Penalize Algorithms (Cost-Sensitive Training):

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class.

A popular algorithm for this technique is the penalized SVM (Support Vector Machine).

During training, we can use the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

We also want to include the argument probability=True if we want to enable probability estimates for SVM algorithms.

Let’s train a model using Penalized-SVM on the original imbalanced dataset:
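A minimal sketch with scikit-learn’s SVC, assuming a synthetic stand-in dataset (the data and settings here are illustrative, not the original article’s):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the imbalanced dataset (~8% positives)
X, y = make_classification(n_samples=1000, weights=[0.92, 0.08], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' weights errors inversely to class frequency;
# probability=True enables predict_proba, which AUROC needs
svm = SVC(kernel='linear', class_weight='balanced', probability=True, random_state=42)
svm.fit(X_train, y_train)

probs = svm.predict_proba(X_test)[:, 1]
print("AUROC:", roc_auc_score(y_test, probs))
```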

5. Use Tree-Based Algorithms

The final tactic we’ll consider is using tree-based algorithms. Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.

In modern applied machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we’ll jump right into those:
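A minimal sketch with a Random Forest on the same kind of synthetic stand-in dataset (illustrative, not the original article’s data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in for the imbalanced dataset (~8% positives)
X, y = make_classification(n_samples=1000, weights=[0.92, 0.08], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
probs = forest.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUROC:  ", roc_auc_score(y_test, probs))
```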

Wow! 97% accuracy and nearly 100% AUROC? Is this magic? A sleight of hand? Cheating? Too good to be true?

Well, tree ensembles have become very popular because they perform extremely well on many real-world problems. We certainly recommend them wholeheartedly.

However:

While these results are encouraging, the model could be overfit, so you should still evaluate your model on an unseen test set before making the final decision.

Reference: https://elitedatascience.com/imbalanced-classes

