{"id":167184,"date":"2023-03-10T10:13:24","date_gmt":"2023-03-10T09:13:24","guid":{"rendered":"https:\/\/liora.io\/en\/?p=167184"},"modified":"2026-07-25T19:55:24","modified_gmt":"2026-07-25T18:55:24","slug":"resampling-a-method-for-balancing-data","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/resampling-a-method-for-balancing-data","title":{"rendered":"Resampling: A method for balancing data"},"content":{"rendered":"Unbalanced data is very common in <b>Machine Learning<\/b>. Unfortunately, it complicates predictive analysis. To balance such data sets, several methods have been developed.&nbsp;\n\n<style><br \/>\n.elementor-heading-title{padding:0;margin:0;line-height:1}.elementor-widget-heading .elementor-heading-title[class*=elementor-size-]>a{color:inherit;font-size:inherit;line-height:inherit}.elementor-widget-heading .elementor-heading-title.elementor-size-small{font-size:15px}.elementor-widget-heading .elementor-heading-title.elementor-size-medium{font-size:19px}.elementor-widget-heading .elementor-heading-title.elementor-size-large{font-size:29px}.elementor-widget-heading .elementor-heading-title.elementor-size-xl{font-size:39px}.elementor-widget-heading .elementor-heading-title.elementor-size-xxl{font-size:59px}<\/style>\n<h3>How to manage unbalanced data with resampling?<\/h3>\nUnbalanced data are characterized by a strong disparity between classes. For example, the ratio between classes is not 50\/50, but rather 90\/10. It is at this level of imbalance that the data start to pose problems in <a href=\"https:\/\/liora.io\/en\/machine-learning-what-is-it-and-why-does-it-change-the-world\"><b>Machine Learning<\/b><\/a> and <a href=\"https:\/\/liora.io\/en\/all-about-deep-learning\"><b>Deep Learning<\/b><\/a>.\n\nImbalance typically arises with relatively rare events, such as insurance fraud or disease detection. For <b>example<\/b>, in a population, a large majority may be <b>healthy<\/b>, and only 0.1% may have multiple sclerosis. 
The healthy people are then the majority class and the sick people the minority class. While such situations are quite common, the events themselves represent only a <b>tiny fraction of the sample studied<\/b>. That is why they can be difficult to predict.\n\nIn Machine Learning, <b>unbalanced data<\/b> are very common, especially in binary classification. This is why several methods have been developed to better handle unbalanced classification problems, in particular data resampling.\n<h3>What are the two resampling methods?<\/h3>\nResampling consists of modifying the data set before training the predictive model. This balances the data to <strong>make the prediction easier<\/strong>. For this purpose, there are two resampling methods.\n<h5>Oversampling<\/h5>\nThis involves <b>increasing the data<\/b> belonging to the minority class until a certain balance is reached, or at least a ratio satisfactory enough to make reliable predictions.\n\nData Scientists can use two oversampling techniques:\n<ul>\n \t<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Random oversampling<\/b>: here, minority observations are duplicated at random. This technique is especially useful for linear models, such as <b>logistic regression<\/b>.<\/li>\n \t<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Synthetic oversampling<\/b>: the idea is, again, to <b>add data to the minority class<\/b>. But instead of copying existing observations identically, the algorithm creates new, similar observations.<\/li>\n<\/ul>\n<h5>Undersampling<\/h5>\nConversely, undersampling <b>reduces the data<\/b> from the majority class to balance the ratio.\n\n<b>Random undersampling is mainly used<\/b>. This means that majority-class observations are removed at random. 
This resampling technique should be preferred when you have large data sets (at least several tens of thousands of cases), since removing observations discards information.\n\nAlthough this method is the most common, you can also use <b>undersampling<\/b> of border observations or clustering-based undersampling, that is, the targeted removal of specific majority observations.\n\nWith either oversampling or undersampling, it is not necessary to obtain a perfect 50\/50 balance. It is possible to remove majority data or add minority data to obtain a ratio of 45\/55 or even 40\/60.\n\nIn addition, do not hesitate to <b>test several class ratios<\/b> to select the one that gives you the best prediction performance.\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"https:\/\/liora.io\/en\/courses\/data-ai\/data-scientist\">Discover Data Science courses<\/a><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Unbalanced data is very common in Machine Learning. Unfortunately, it complicates predictive analysis. To balance such data sets, several methods have been developed.&nbsp; How to manage unbalanced data with resampling? Unbalanced data are characterized by a strong disparity between classes. For example, the ratio between classes is not 50\/50, but rather 90\/10. 
[&hellip;]<\/p>\n","protected":false},"author":79,"featured_media":167186,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-167184","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/167184","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/79"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=167184"}],"version-history":[{"count":3,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/167184\/revisions"}],"predecessor-version":[{"id":209425,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/167184\/revisions\/209425"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/167186"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=167184"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=167184"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}