{"id":167218,"date":"2023-03-22T11:09:30","date_gmt":"2023-03-22T10:09:30","guid":{"rendered":"https:\/\/liora.io\/en\/?p=167218"},"modified":"2026-02-06T09:05:17","modified_gmt":"2026-02-06T08:05:17","slug":"management-of-unbalanced-classification-problems-ii","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/management-of-unbalanced-classification-problems-ii","title":{"rendered":"Managing Unbalanced Classification Problems \u2013 Part 2"},"content":{"rendered":"<strong>This article will be divided into two parts: <a href=\"\/\" data-wplink-url-error=\"true\">The first focuses on the choice of metrics specific to this type of data<\/a>, the second details the range of useful methods to obtain a successful model.<\/strong>\n\nAfter detailing the<b> different problems related<\/b> to data imbalance and demonstrating that the choice of the right performance metric is essential for the evaluation of our models, we will present a non-exhaustive list of useful techniques to fight against this type of problem.\n\n<style><br \/>\n.elementor-heading-title{padding:0;margin:0;line-height:1}.elementor-widget-heading .elementor-heading-title[class*=elementor-size-]>a{color:inherit;font-size:inherit;line-height:inherit}.elementor-widget-heading .elementor-heading-title.elementor-size-small{font-size:15px}.elementor-widget-heading .elementor-heading-title.elementor-size-medium{font-size:19px}.elementor-widget-heading .elementor-heading-title.elementor-size-large{font-size:29px}.elementor-widget-heading .elementor-heading-title.elementor-size-xl{font-size:39px}.elementor-widget-heading .elementor-heading-title.elementor-size-xxl{font-size:59px}<\/style>\n<h5>Collect more data<\/h5>\nThis may sound simplistic, but collecting more data is almost always <b>overlooked<\/b> and can sometimes be effective.\n\nCan you collect more data? 
Take a few minutes to think about collecting more data for your problem; it could rebalance your classes to some degree.\n<h5>Use resampling methods<\/h5>\nYou can modify the dataset before training your <b>predictive model<\/b> in order to work with more balanced data.\n\nThis strategy is called resampling, and there are two main methods you can use to rebalance the classes: Oversampling and Undersampling.\n\n<b>Oversampling methods<\/b> work by increasing the <b>number of observations<\/b> of the minority class(es) until a satisfactory ratio of minority class to majority class is reached.\n\n<b>Undersampling methods<\/b> work by decreasing the number of observations of the majority class(es) until a satisfactory ratio of minority class to majority class is reached.\n\nThese approaches are very easy to implement and quick to execute, which makes them a great starting point.\n\nOur advice: always <b>try both approaches<\/b> on your unbalanced datasets and check whether they improve your chosen performance metrics.\n\n<b>Favor undersampling when you have large datasets<\/b>: tens or hundreds of thousands of cases or more.\n\nConsider oversampling when you don&#8217;t have a lot of data: tens of thousands of cases or fewer.\n\nConsider testing different class ratios: you don&#8217;t have to aim for a 1:1 ratio in a binary classification problem, so try other ratios as well.\n<h5>Synthetic sample generation<\/h5>\nThere are algorithms that generate synthetic samples automatically. The most popular is <b>SMOTE<\/b> (Synthetic Minority Over-sampling Technique). As the name suggests, SMOTE is an oversampling method. 
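As a toy illustration of the idea (this is not the actual SMOTE implementation, and the helper name `smote_like` is invented here for the sketch), synthetic minority samples can be generated by interpolating between a minority point and one of its nearest minority neighbors, using only numpy:

```python
# Toy SMOTE-like oversampling sketch (illustrative only, not the
# real SMOTE algorithm from imbalanced-learn).
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min, n_new):
    """Generate n_new synthetic points from minority samples X_min."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Find the nearest minority neighbor of X_min[i], excluding itself.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))
        # Pick a random point on the segment between the two samples.
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = rng.normal(size=(20, 2))      # 20 minority samples in 2-D
X_new = smote_like(X_min, n_new=40)   # 40 synthetic minority samples
print(X_new.shape)                    # (40, 2)
```

Because each synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class, instead of being exact duplicates.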
It works by <b>creating synthetic samples<\/b> from the minority class instead of making simple copies.\n\nTo learn more about SMOTE, see <strong><a href=\"https:\/\/arxiv.org\/pdf\/1106.1813.pdf\">the original article<\/a><\/strong>.\n\nThe <b>ClusterCentroids algorithm<\/b> is an Undersampling algorithm that uses Clustering methods to replace the majority class with a set of generated centroids, in order to lose as little information as possible when this class is reduced.\n<h5>Rethink the problem, find another way to solve it<\/h5>\nSometimes resampling methods are <b>not effective enough<\/b>; in that case, you need to rethink the problem. It may be that the algorithm used is not suitable for your data.\n\nDo not hesitate to test other algorithms, possibly combined with the resampling methods seen above.\n\n<b>Tree-based ensemble models<\/b> such as RandomForest are generally better suited to unbalanced data.\n\nIt is also possible to <b>play with the probabilities<\/b>. For example, if we want to predict the vast majority of potential churners, even if it means misclassifying some <b>non-churners<\/b>, we can lower the probability threshold above which customers are considered churners.\n\nThe lower the threshold, the higher the recall on the churner class, but the precision will decrease.\n<h5>Use a penalized model<\/h5>\nPenalized classification <b>imposes an additional cost<\/b> on the model for classification errors made on the minority class during training. These penalties can bias the model to pay more attention to the minority class.\n\nIn most <b>scikit-learn<\/b> estimator classes, you can simply use the `class_weight` parameter, which penalizes errors made on a class according to the weight it is given.\n\nThe higher the weight of a class, the more heavily errors on that class are penalized, and the more importance the model gives to it.\n\nThe weights should be given in dictionary form, e.g. 
`{0:1, 1:5}`, to give 5 times the weight to errors made on class 1.\n\nThe value `&quot;balanced&quot;` associates with each class a weight inversely proportional to its frequency.\n<h5>Use ensemble methods trained on rebalanced subsets<\/h5>\nAnother solution, provided by the <b>imblearn.ensemble<\/b> module, is to use ensemble models such as <b>Boosting or Bagging<\/b> variants that are trained, at each step of the algorithm, on a sample automatically rebalanced between the different classes.\n\nThese implementations make it possible to skip the <b>resampling step<\/b> before training: rebalancing is applied automatically to each subset of data selected by the algorithm.\n<h5>Think outside the box, be creative<\/h5>\nYou can test all of these techniques, combine them, or even think about <b>relabeling the majority class data<\/b> into subclasses to obtain a more balanced multi-class classification problem.\n\nIn some cases, you can also consider other <a href=\"https:\/\/liora.io\/en\/machine-learning-what-is-it-and-why-does-it-change-the-world\"><b>Machine Learning methods<\/b><\/a> such as Anomaly Detection or Active Learning.\n\nWe have presented many techniques, which you can choose from when working with this kind of data. 
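Two of the ideas above, the `class_weight` penalty and probability-threshold tuning, can be combined in a minimal sketch (assuming scikit-learn is installed; the synthetic dataset and all variable names are illustrative, not taken from a real churn problem):

```python
# Hedged sketch: penalized logistic regression plus threshold tuning
# on a synthetic imbalanced dataset (about 5% positives).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Penalized model: errors on class 1 cost 5 times more, cf. {0: 1, 1: 5}.
clf = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000)
clf.fit(X_train, y_train)

# Threshold tuning: lowering the threshold below 0.5 raises recall on
# the minority class at the cost of precision.
proba = clf.predict_proba(X_test)[:, 1]
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    print(
        f"threshold={threshold}: "
        f"recall={recall_score(y_test, pred):.2f}, "
        f"precision={precision_score(y_test, pred):.2f}"
    )
```

Running the loop shows the trade-off directly: as the threshold drops, more customers are flagged as positives, so recall rises while precision falls.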
Feel free to test these methods individually, and start with the simplest ones!\n\nWant to improve your skills in building powerful and reliable models from unbalanced data sets?\n\nDon&#8217;t hesitate to <a href=\"\/en\/appointment\"><b>contact us<\/b><\/a> for more information!\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"\/en\/courses\/data-ai\/\">Start a Training in Data Science<\/a><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>This article will be divided into two parts: The first focuses on the choice of metrics specific to this type of data, the second details the range of useful methods to obtain a successful model. After detailing the different problems related to data imbalance and demonstrating that the choice of the right performance metric is [&hellip;]<\/p>\n","protected":false},"author":79,"featured_media":167193,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-167218","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/167218","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/79"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=167218"}],"version-history":[{"count":2,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/167218\/revisions"}],"predecessor-version":[{"id":206434,"href":"https:\/\/liora.io\/en\/wp-json\/w
p\/v2\/posts\/167218\/revisions\/206434"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/167193"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=167218"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=167218"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}