{"id":168529,"date":"2026-01-28T12:49:37","date_gmt":"2026-01-28T11:49:37","guid":{"rendered":"https:\/\/liora.io\/en\/?p=168529"},"modified":"2026-02-06T07:27:41","modified_gmt":"2026-02-06T06:27:41","slug":"data-leakage-definition-and-prevention","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/data-leakage-definition-and-prevention","title":{"rendered":"Data leakage: Understanding it and preventing it"},"content":{"rendered":"<h5>Data leakage is a subtle problem that can make a machine learning model look far more accurate than it really is. Find out how to recognize it and how to keep test data out of model training.<\/h5>\n<h2 class=\"wp-block-heading\" id=\"h-what-is-data-leakage\">What is Data Leakage?<\/h2>\nData leakage is one of the <b>most important points of vigilance<\/b> when designing a predictive model. The creation of a <b>predictive model<\/b> stems from an operational need, and the aim is to create a predictive tool to meet business expectations. <b>Performance<\/b> and <b>transparency<\/b> are the watchwords of a good predictive model.\n\nPerformance measurement is an <b>essential step<\/b> in model development, as it lies at the heart of the predictive modeling problem. It tells us whether a model is robust enough to be used in practice. 
Indeed, the better a model performs, the more reliable and therefore usable it is. To assess its performance, we use metrics that <b>measure the quality of prediction<\/b> by comparing predicted values with actual values.\n\nDuring the design phase, we have a <b>certain amount of data<\/b> at our disposal. This data must serve both to train the model and to test its performance. To obtain an accurate measure of performance, it is essential to have a <b>sufficient quantity of data<\/b> on which to test the model. This test data must remain unknown to the model: under no circumstances may the model be trained on it.\n\nFor this stage to go as smoothly as possible, you need to be very rigorous in the <b>data preparation stage<\/b>. Right from the start of the project, you must ensure that part of the data is set aside. If this is not done properly, data not intended for training can leak into the training process and influence the model. This would then <b>bias the results<\/b> of the evaluation, making the model look better than it really is. This is what is known as data leakage in <a href=\"https:\/\/liora.io\/en\/machine-learning-what-is-it-and-why-does-it-change-the-world\">Machine Learning<\/a>.\n\n<img decoding=\"async\" width=\"741\" height=\"427\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/06\/machine-learning_1.jpg\" alt=\"machine-learning\" loading=\"lazy\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/06\/machine-learning_1.jpg 741w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/06\/machine-learning_1-300x173.jpg 300w\" sizes=\"(max-width: 741px) 100vw, 741px\">\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a 
class=\"wp-block-button__link wp-element-button \" href=\"\/en\/courses\/data-ai\/data-scientist\">Learn how to effectively manage data leakage<\/a><\/div><\/div>\n\n<h2 class=\"wp-block-heading\" id=\"h-how-can-i-determine-if-there-has-been-a-data-leak\">How can I determine if there has been a data leak?<\/h2>\nA very good indicator is abnormally high model performance. A very high score for a model that predicts, for example, customer subscriptions or sports results should raise a red flag. It&#8217;s virtually <b>impossible to obtain very high scores<\/b> on problems such as these, because the outcome depends heavily on chance. So we need to take a step back from the results we&#8217;ve obtained and take care to <b>check<\/b> how we arrived at that score.\n<h2 class=\"wp-block-heading\" id=\"h-what-precautions-should-be-taken\">What precautions should be taken?<\/h2>\nThe <b>train-test split technique<\/b> (also known as hold-out) involves dividing the available data into two parts: one dedicated to training and the other to evaluation. The test data may be consulted only after the model has been trained; until then, it must be kept carefully set aside.\n\nIt&#8217;s only after this crucial data separation stage has been completed that we can proceed with <b>data preparation<\/b> (the preprocessing phase). 
During this stage, we decide which treatments are to be applied to our variables before training the chosen algorithm.\n\n<img decoding=\"async\" width=\"800\" height=\"489\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/06\/data-leakage_1.jpg\" alt=\"data-leakage\" loading=\"lazy\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/06\/data-leakage_1.jpg 830w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/06\/data-leakage_1-300x183.jpg 300w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/06\/data-leakage_1-768x469.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\">\n<h2 class=\"wp-block-heading\" id=\"h-why-is-it-not-possible-to-use-all-the-available-data\">Why is it not possible to use all the available data?<\/h2>\nTo better understand how this works, let&#8217;s take a look at the missing-value imputation step. Let&#8217;s imagine that we want to replace all the missing values of a variable with its median. If we calculate the median on all the data (training and test sets combined), its value will generally differ from the median calculated on the training set alone. This results in a data leak, because the imputed value now carries information from the test set. This example extends, of course, to all the pre-processing steps that precede model training: imputation of missing values, treatment of extreme values, normalization, etc.\n\nThis precaution also applies to cross-validation. In each fold, the validation set must be set aside so that it remains unknown to the model, and preprocessing must be fitted on the training portion only.\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\">Conclusion<\/h2>\nPerformance is largely determined by the quality of the data, so it is important to prepare the data properly before training the model. Nevertheless, this is a delicate stage, as it is prone to data leakage. Great care must be taken to ensure that no information contained in the test set is used to train the model. 
Only then can we trust the performance we measure for a model.\n\nIf you&#8217;d like to find out more about working with data, don&#8217;t hesitate to read <a href=\"https:\/\/liora.io\/en\/data-warehouse-2\">our article on data warehouses<\/a> or <a href=\"https:\/\/liora.io\/en\/database-what-is-it\">our article on databases<\/a>.\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"\/en\/courses\/data-ai\/data-scientist\">Discover the Data Scientist training<\/a><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Data leakage is one of the most important points of vigilance when designing a predictive model. The creation of a predictive model stems from an operational need, and the aim is to create a predictive tool to meet business expectations. Performance and transparency are the watchwords of a good predictive 
model.<\/p>\n","protected":false},"author":85,"featured_media":168530,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-168529","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/168529","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/85"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=168529"}],"version-history":[{"count":4,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/168529\/revisions"}],"predecessor-version":[{"id":205356,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/168529\/revisions\/205356"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/168530"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=168529"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=168529"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}