{"id":167191,"date":"2023-03-22T10:59:05","date_gmt":"2023-03-22T09:59:05","guid":{"rendered":"https:\/\/liora.io\/en\/?p=167191"},"modified":"2026-02-06T09:05:23","modified_gmt":"2026-02-06T08:05:23","slug":"management-of-unbalanced-classification-problems-i","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/management-of-unbalanced-classification-problems-i","title":{"rendered":"Managing Unbalanced Classification Problems &#8211; Part 1"},"content":{"rendered":"Classification of unbalanced data is a <b>classification problem<\/b> where the training sample contains a strong disparity between the classes to be predicted. This problem is frequently encountered in <b>binary classification problems<\/b>, especially in anomaly detection.\n\nThis paper will be divided into <b>two parts<\/b>: The first one focuses on the choice of metrics specific to this type of data, and the second one details the range of useful methods to obtain a successful model.&nbsp;\n<h3>Part I: Choosing the right metrics<\/h3>\n<h5>What is an evaluation metric?<\/h5>\nAn evaluation metric quantifies the performance of a predictive model. Choosing the right metric is essential when evaluating <a href=\"https:\/\/liora.io\/en\/machine-learning-what-is-it-and-why-does-it-change-the-world\"><b>Machine Learning models<\/b><\/a>, and the quality of a classification model depends directly on the metric used to evaluate it.\n\nFor classification problems, metrics generally consist in <b>comparing the real classes<\/b> to the classes predicted by the model. They can also be used to interpret the predicted probabilities for these classes.\n\nOne of the <b>key performance concepts<\/b> for classification is the confusion matrix, which is a tabular visualization of the model predictions against the true labels. 
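As a minimal sketch, assuming scikit-learn is available (the labels below are made up purely for illustration), such a matrix can be computed directly:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a binary problem
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0]

# Rows correspond to real classes, columns to predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 2]]
```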
Each row of the confusion matrix represents the instances of a real class and each column represents the instances of a predicted class.\n\nLet&#8217;s take the example of a <b>binary classification<\/b>, where we have 100 positive instances and 70 negative instances.\n\nThe confusion matrix below corresponds to the results obtained by our model:\n\n<img decoding=\"async\" width=\"800\" height=\"450\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/Predicted-classes-1-1024x576.jpg\" alt=\"Predicted classes\" loading=\"lazy\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/Predicted-classes-1-1024x576.jpg 1024w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/Predicted-classes-1-300x169.jpg 300w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/Predicted-classes-1-768x432.jpg 768w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/Predicted-classes-1-1536x864.jpg 1536w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/Predicted-classes-1.jpg 1920w\" sizes=\"(max-width: 800px) 100vw, 800px\">\n\nIt gives an overview of the correct and incorrect predictions.\n\nTo summarize this matrix in a single metric, it is possible to use the rate of good predictions, or accuracy. Here it is equal to (90+57)\/170 \u2248 0.86.\n\nThe choice of an appropriate metric is <b>not obvious<\/b> for any Machine Learning model, but it is particularly difficult for unbalanced classification problems.\n\nIn the case of data with a <b>strong majority class<\/b>, classical algorithms are often biased because their loss functions try to optimize metrics such as the rate of good predictions, without taking into account the class distribution.\n\nIn the worst case, <b>minority classes<\/b> are treated as outliers of the majority class and the learning algorithm simply generates a trivial classifier that classifies every example into the majority class. The model will appear to perform well, but this only reflects the overrepresentation of the majority class. 
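This failure mode is easy to reproduce. Below is a minimal sketch, assuming scikit-learn and a hypothetical 90/10 class split:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 90% class 0, 10% class 1
y_true = np.array([0] * 90 + [1] * 10)

# A trivial classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks good...
print(recall_score(y_true, y_pred))    # 0.0 -- ...but not one minority example is found
```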
This is known as the <b>accuracy paradox<\/b>.\n\nIn most cases, the minority class is the one of <b>greatest interest<\/b>, the one we would like to be able to identify, as in the example of fraud detection.\n\nThe <b>imbalance level varies<\/b> but the use cases are recurrent: disease screening tests, failure detection, search engines, spam filtering, marketing targeting&#8230;\n<h5>Practical application: Churn Rate<\/h5>\nLet&#8217;s assume that a service company wants to predict its<b> churn rate<\/b>.&nbsp;\n\nA quick reminder: the <b>churn rate<\/b> is the following ratio: lost customers \/ total number of customers, measured over a given period, usually one year.\n\nThe company wants to predict, for each customer, whether they will end their contract at the end of the year.\n\nWe have a dataset containing personal information and contract characteristics for each customer of the company for the year X, as well as a variable indicating whether they renewed their contract at the end of the year.&nbsp;\n\nIn our data, the number of &#8216;churners&#8217; corresponds to 11% of the total number of customers.\n\nWe decide to train a <b>first logistic regression model <\/b>on our prepared and normalized data.\n\nSurprise! Our code displays a good prediction rate of 0.90!\n\nThis is a very good score, but our goal is to<b> predict<\/b> the possible departure of customers. Does this result mean that out of 10 churners, 9 will be identified as such by the model? 
No!\n\nThe only interpretation that can be made is that <b>9 out of 10 customers<\/b> have been correctly classified by the model.\n\nTo detect such naive behavior in a model, the most effective tool remains the confusion matrix.\n\nA first look at the confusion matrix shows us that the good prediction rate obtained is largely driven by the <b>good behavior of the model <\/b>on the dominant class (0).\n\nTo evaluate the model against the desired behavior on a given class, it is possible to use a series of metrics derived from the confusion matrix, such as precision, recall, and F1-score, defined below.\n\n<img decoding=\"async\" width=\"512\" height=\"89\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/image-3.png\" alt=\"F1-score\" loading=\"lazy\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/image-3.png 512w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/image-3-300x52.png 300w\" sizes=\"(max-width: 512px) 100vw, 512px\">\n\nThus, for a given class:\n<ul>\n \t<li aria-level=\"1\">High precision and high recall -&gt; The class has been&nbsp;<b>well managed<\/b>&nbsp;by the model<\/li>\n \t<li aria-level=\"1\">High precision and low recall -&gt; The&nbsp;<b>class is not well detected<\/b>&nbsp;but when it is, the model is very reliable.&nbsp;<\/li>\n \t<li aria-level=\"1\">Low precision and high recall -&gt; The&nbsp;<b>class is well detected<\/b>, but the predictions also include observations from other classes.<\/li>\n \t<li aria-level=\"1\">Low precision and low recall -&gt; The<b>&nbsp;class has not been handled well&nbsp;<\/b>at all<\/li>\n<\/ul>\nThe F1-score combines precision and recall into a single value: it is their harmonic mean.\n\nIn the case of binary classification, the sensitivity and specificity correspond respectively to the recall of the positive and negative classes.\n\nAnother metric, the<b>&nbsp;geometric mean (G-mean)<\/b>, is useful for unbalanced classification problems: it is the square root of the product of sensitivity and specificity.\n\nThese 
different metrics are easily accessible thanks to the&nbsp;<b>imblearn package<\/b>.&nbsp;\n\nThe&nbsp;<b>classification_report_imbalanced()<\/b>&nbsp;function displays a report containing the results for all the package&#8217;s metrics.\n\nWe obtain the following table:&nbsp;\n\n<img decoding=\"async\" width=\"512\" height=\"288\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/image-1.png\" alt=\"Confusion matrix\" loading=\"lazy\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/image-1.png 512w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/03\/image-1-300x169.png 300w\" sizes=\"(max-width: 512px) 100vw, 512px\">\n\nThe table shows that the recall and F1-score for class 1 are poor, while for class 0 they are high. In addition, the geometric mean is also low.\n\nThus, the trained model is <b>not suited<\/b> to our data.\n\nIn&nbsp;<a href=\"https:\/\/liora.io\/en\/management-of-unbalanced-classification-problems-ii\" target=\"_blank\" rel=\"noopener\"><b>Part II<\/b><\/a>, we will discover methods that allow us to obtain much better results.\n\nDo you want to improve your skills and build efficient, reliable models from unbalanced datasets?<strong><a href=\"\/en\/courses\/data-ai\/\"> Discover all our learning modules<\/a><\/strong>!&nbsp;\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"\/en\/courses\/data-ai\/machine-learning-engineer\">Start a Machine Learning Engineer training<\/a><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Classification of unbalanced data is a classification problem where the training sample contains a strong disparity between the classes to be predicted. This problem is frequently encountered in binary classification problems, especially in anomaly detection. 
This paper will be divided into two parts: The first one focuses on the choice of metrics specific to this [&hellip;]<\/p>\n","protected":false},"author":79,"featured_media":167193,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-167191","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/167191","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/79"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=167191"}],"version-history":[{"count":2,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/167191\/revisions"}],"predecessor-version":[{"id":206435,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/167191\/revisions\/206435"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/167193"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=167191"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=167191"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}