{"id":177192,"date":"2026-02-18T22:32:04","date_gmt":"2026-02-18T21:32:04","guid":{"rendered":"https:\/\/liora.io\/en\/?p=177192"},"modified":"2026-02-18T22:32:05","modified_gmt":"2026-02-18T21:32:05","slug":"train_test_split-tutorial-on-how-to-use-this-function","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/train_test_split-tutorial-on-how-to-use-this-function","title":{"rendered":"&#8220;train_test_split: Tutorial on how to use this function"},"content":{"rendered":"<p><strong>A Machine Learning model is capable of learning autonomously from one dataset, with the aim of predicting behavior on another dataset. To do this, it finds underlying relationships between independent explanatory variables and a target variable in the initial dataset. It then uses these patterns to predict or classify new data. <\/strong><\/p>\n<!-- \/wp:post-content -->\n\n<!-- wp:heading -->\n<h2 id=\"h-how-do-i-define-the-train-test-split-function\" class=\"wp-block-heading\">How do I define the train_test_split function?<\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>To verify the effectiveness of a <a href=\"https:\/\/liora.io\/en\/automl-and-machine-learning-automation-a-threat-to-data-scientists\">Machine Learning model,<\/a> the initial dataset is divided into two sets: a training set and a test set. The training set is used to fit, i.e. train, the model on part of the data. The test set is used to evaluate the model&#8217;s performance on the other part of the data. The train_test_split function in <a href=\"https:\/\/liora.io\/en\/scikit-learn-discover-the-python-library-dedicated-to-machine-learning\">Python&#8217;s ScikitLearn (sklearn) library allows this separation into two sets.<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>First of all, you need to import the train_test_split function from sklearn&#8217;s model_selection package using the following code:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:code {\"style\":{\"spacing\":{\"margin\":{\"top\":\"var:preset|spacing|columns\",\"bottom\":\"var:preset|spacing|columns\"}}},\"fontSize\":\"xsmall\"} -->\n<pre class=\"wp-block-code has-xsmall-font-size\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><code>from sklearn.model_selection import train_test_split<\/code><\/pre>\n<!-- \/wp:code -->\n\n<!-- wp:paragraph -->\n<p>Once imported, the <strong>function takes several arguments:<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"width\":\"auto\",\"height\":\"750px\",\"align\":\"center\"} -->\n<figure class=\"wp-block-image aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/03\/train-test-split1.png\" alt=\"\" style=\"width:auto;height:750px\" title=\"\"><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 id=\"h-1-arrays-extracted-from-the-dataset-to-be-split\" class=\"wp-block-heading\">1) Arrays extracted from the dataset to be split.<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>In <strong>supervised learning<\/strong>, these arrays are the input array X, consisting of the explanatory variables in columns, and the output array y, consisting of the target variable (i.e. the labels).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In unsupervised learning, the only array in argument is the input array X, made up of the explanatory variables in columns.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Note:<\/strong> Be careful with dimensions! X must be a two-dimensional array. y must be a one-dimensional array equal to the number of rows in X. To do this, use the .reshape function.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 id=\"h-2-test-set-size-test-size-and-training-set-size-train-size\" class=\"wp-block-heading\">2) Test set size (test_size) and training set size (train_size).<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The<strong> size of each set<\/strong> is either a decimal number between 0 and 1 representing a <a href=\"https:\/\/liora.io\/en\/datasets-top-5-places-to-find-quality-datasets\">proportion of the dataset<\/a>, or an integer number representing a number of examples in the dataset.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Note:<\/strong> It is sufficient to define just one of these arguments, the second being complementary.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 id=\"h-3-the-random-state-random-state\" class=\"wp-block-heading\">3) The random state (random_state).<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The random state is a number that controls how the<strong> pseudo-random generator divides the data.<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Note<\/strong>: Choosing an integer as the random state ensures that the data is split in the same way each time the function is called. This makes the code reproducible.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 id=\"h-4-le-shuffle-shuffle\" class=\"wp-block-heading\">4) Le shuffle (shuffle).<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Shuffle is a Boolean that selects whether or not data should be shuffled before being separated. If the data is not shuffled, it is separated according to the order in which it was initially displayed.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Note:<\/strong> The default value is True.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 id=\"h-5-stratify-stratify\" class=\"wp-block-heading\">5) Stratify (stratify).<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The Stratify parameter selects whether the <strong>data are separated<\/strong> so as to keep the same <a href=\"https:\/\/liora.io\/en\/excel-to-power-bi-how-to-transform-a-pivot-table-in-excel-into-a-dataset-that-can-be-used-by-power-bi\">proportions of observations in each class in the train and test sets as in the initial dataset.<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Remarks:<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li>This parameter is particularly useful when dealing with &#8220;unbalanced&#8221; data with very unbalanced proportions between the different classes.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li>The default value is None.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p>The<strong> train_test_split function<\/strong> returns a number of outputs equal to twice its number of inputs, in array form. In supervised learning, it returns four outputs: X_train, X_test, y_train and y_test. In unsupervised learning, it returns two outputs: <strong>X_train and X_test.<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:buttons {\"className\":\"is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"} -->\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><!-- wp:button -->\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/liora.io\/en\/courses\/data-ai\/machine-learning-engineer\">Learn to use Machine Laarning models<\/a><\/div>\n<!-- \/wp:button --><\/div>\n<!-- \/wp:buttons -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\">How to evaluate model performance with the train_test_split function?<\/h2><!-- wp:image {\"id\":207310,\"sizeSlug\":\"large\"} --><figure class=\"wp-block-image size-large\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"572\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-1024x572.jpg\" alt=\"Computer screen displaying running programming code.\" class=\"wp-image-207310\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-1024x572.jpg 1024w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-300x167.jpg 300w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-768x429.jpg 768w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-1536x857.jpg 1536w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-2048x1143.jpg 2048w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-440x246.jpg 440w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-785x438.jpg 785w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-210x117.jpg 210w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/code-screen-computer-programming-115x64.jpg 115w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure><!-- \/wp:image -->\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Once the<strong> train_test_split function<\/strong> has been defined, it returns a train set and a test set. This splitting of the data makes it possible to<a href=\"https:\/\/liora.io\/en\/cracking-the-code-on-underfitting-in-machine-learning\"> evaluate a Machine Learning model from two different angles.<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The model is trained on the train set returned by the function. Then its predictive capabilities are evaluated on the test set returned by the function. Several metrics can be used for this evaluation. In the case of linear regression, the coefficient of determination,<strong> RMSE and MAE are preferred.<\/strong> In the <a href=\"https:\/\/liora.io\/en\/classification-algorithms-definition-and-main-models\">case of classification,<\/a> accuracy, precision, recall and F1-score are preferred. These scores on the test set are therefore used to determine how well the model performs and how much it needs to be improved before it can predict on a new dataset.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The train and test sets returned by the train_test_split function also play an essential role in <a href=\"https:\/\/liora.io\/en\/overfitting-what-is-it-how-can-i-avoid-it\">detecting overfitting<\/a> or underfitting. As a reminder, overfitting (or overlearning) describes a situation where the model built is too complex (with too many explanatory variables, for example), such that it learns the training data perfectly but fails to generalize to other data.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Conversely,<strong> underfitting (or underlearning)<\/strong> describes a situation where the model is too simple or poorly chosen (choosing a linear regression on data that does not respect its assumptions, for example), such that it learns poorly. These two problems can be corrected by various techniques, but they must first be identified, which is made possible by the train_test_split function.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In fact, we can compare the model&#8217;s performance on the train set and the test set created by the function. If performance is good on the train set but poor on the test set, we&#8217;re probably dealing with overfitting. If performance is as bad on the train set as on the test set, we&#8217;re probably <a href=\"https:\/\/liora.io\/en\/cracking-the-code-on-underfitting-in-machine-learning\">dealing with underfitting<\/a>. The two sets returned by the function are therefore essential in detecting these recurring problems in Machine Learning.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\">How can I solve a complete Machine Learning problem using the train_test_split function?<\/h2><!-- wp:image {\"id\":207311,\"sizeSlug\":\"large\"} --><figure class=\"wp-block-image size-large\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"572\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-1024x572.jpg\" alt=\"Table of evaluation metrics for machine learning models, including RMSE, MAE, precision and recall.\" class=\"wp-image-207311\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-1024x572.jpg 1024w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-300x167.jpg 300w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-768x429.jpg 768w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-1536x857.jpg 1536w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-2048x1143.jpg 2048w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-440x246.jpg 440w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-785x438.jpg 785w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-210x117.jpg 210w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/table-of-evaluation-metrics-for-machine-learning-models-115x64.jpg 115w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure><!-- \/wp:image -->\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Now that we&#8217;ve understood the use and functionality of the train_test_split function, let&#8217;s put it into practice with a real Machine Learning problem.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">Step 1: Understanding the problem<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>We choose to solve a supervised learning problem where the expected labels are known. More precisely, we focus on binary classification. The objective is to predict whether or not an individual has breast cancer, based on body characteristics.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">Step 2: Data recovery<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>We use the &#8220;breast_cancer&#8221; dataset included in the Sklearn library.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:code {\"style\":{\"spacing\":{\"margin\":{\"top\":\"var:preset|spacing|columns\",\"bottom\":\"var:preset|spacing|columns\"}}},\"fontSize\":\"xsmall\"} -->\n<pre class=\"wp-block-code has-xsmall-font-size\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><code>import numpy as np\nfrom sklearn.datasets import load_breast_cancer\ndf=load_breast_cancer()<\/code><\/pre>\n<!-- \/wp:code -->\n\n<!-- wp:paragraph -->\n<p>With the following lines of code, we retrieve the features and the target variable:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:code {\"style\":{\"spacing\":{\"margin\":{\"top\":\"var:preset|spacing|columns\",\"bottom\":\"var:preset|spacing|columns\"}}},\"fontSize\":\"xsmall\"} -->\n<pre class=\"wp-block-code has-xsmall-font-size\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><code>print(\"features :\", df.feature_names)\nprint (\"target :\", df.target_names)<\/code><\/pre>\n<!-- \/wp:code -->\n\n<!-- wp:paragraph -->\n<p>We obtain that the target variable to be predicted takes two values (&#8220;malignant&#8221; and &#8220;benign&#8221;) and that the problem is indeed a binary classification.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4} -->\n<h3 class=\"wp-block-heading\">Step 3: Creating X and y<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>We create a<strong> two-dimensional input array X<\/strong> and a one-dimensional output array y. For this dataset, the binary encoding of the target variable is performed by sklearn and can be retrieved directly.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:code {\"style\":{\"spacing\":{\"margin\":{\"top\":\"var:preset|spacing|columns\",\"bottom\":\"var:preset|spacing|columns\"}}},\"fontSize\":\"xsmall\"} -->\n<pre class=\"wp-block-code has-xsmall-font-size\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><code>X, y = load_breast_cancer(return_X_y=True)\nprint(\"X :\", X)\nprint(\"y :\", y)\nprint(\"Dimensions de X :\", X.shape)\nprint(\"Dimensions de y :\", y.shape)<\/code><\/pre>\n<!-- \/wp:code -->\n\n<!-- wp:paragraph -->\n<p>We check that the dimensions of X and y match: y has the same number of rows as X.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4} -->\n<h3 class=\"wp-block-heading\">Step 4: Creating train and test assemblies<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>We divide the data into a train set and a test set.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Since we supply the train_test_split function with two arrays X and y, it returns four elements. We choose a test set consisting of 10% of the data. We choose a number of type &#8220;int&#8221; as the random state to ensure code reproducibility. We don&#8217;t use the function&#8217;s final parameters, which are unnecessary for such a simple problem.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:code {\"style\":{\"spacing\":{\"margin\":{\"top\":\"var:preset|spacing|columns\",\"bottom\":\"var:preset|spacing|columns\"}}},\"fontSize\":\"xsmall\"} -->\n<pre class=\"wp-block-code has-xsmall-font-size\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><code>from sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=123)<\/code><\/pre>\n<!-- \/wp:code -->\n\n<!-- wp:heading {\"level\":4} -->\n<h3 class=\"wp-block-heading\">Step 5: Classification model<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>To solve the classification task, we build a k-nearest neighbor model. We train the model on the train set using the .fit() method. Then we test the model&#8217;s performance on the test set with the .predict() method. This gives us the predicted classes for the observations in the test set.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:code {\"style\":{\"spacing\":{\"margin\":{\"top\":\"var:preset|spacing|columns\",\"bottom\":\"var:preset|spacing|columns\"}}},\"fontSize\":\"xsmall\"} -->\n<pre class=\"wp-block-code has-xsmall-font-size\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><code>from sklearn.neighbors import KNeighborsClassifier\nclf = KNeighborsClassifier()\nclf.fit(X_train, y_train)\nprediction = clf.predict(X_test)<\/code><\/pre>\n<!-- \/wp:code -->\n\n<!-- wp:heading {\"level\":4} -->\n<h4 class=\"wp-block-heading\">Step 6: Model evaluation<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>We choose accuracy as our metric. Accuracy represents the number of correct predictions out of the total number of predictions. We calculate it on the train set and on the test set using the .score() method, which compares the real classes in the dataset with the classes predicted by the clf classifier.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:code {\"style\":{\"spacing\":{\"margin\":{\"top\":\"var:preset|spacing|columns\",\"bottom\":\"var:preset|spacing|columns\"}}},\"fontSize\":\"xsmall\"} -->\n<pre class=\"wp-block-code has-xsmall-font-size\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><code>print(clf.score(X_train, y_train))\nprint(clf.score(X_test, y_test))<\/code><\/pre>\n<!-- \/wp:code -->\n\n<!-- wp:paragraph -->\n<p>We obtain an accuracy of 0.95 on the train set and 0.93 on the test set. So the model has good classification performance.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>What&#8217;s more, the accuracy on the test set is very slightly lower than that on the train set. This means that the model generalizes well to new data. So we&#8217;re not facing an overfitting problem.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>So the train_test_split function is easy to use and highly effective in solving a complete Machine Learning problem.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h2 class=\"wp-block-heading\">Are there any limits to the train_test_split function?<\/h2><!-- wp:image {\"id\":207312,\"sizeSlug\":\"large\"} --><figure class=\"wp-block-image size-large\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"572\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-1024x572.jpg\" alt=\"Graph showing the accuracy of models as a function of random states, illustrating the variation in performance.\" class=\"wp-image-207312\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-1024x572.jpg 1024w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-300x167.jpg 300w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-768x429.jpg 768w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-1536x857.jpg 1536w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-2048x1143.jpg 2048w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-440x246.jpg 440w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-785x438.jpg 785w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-210x117.jpg 210w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2026\/02\/chart-performance-precision-models-115x64.jpg 115w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure><!-- \/wp:image -->\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Despite this, the train_test_split function has a major limitation linked to its random_state parameter. In fact, when the value given to random state is an integer, the data are separated by a pseudo-random generator initialized with this integer, called seed. The separation performed is reproducible by keeping the same seed. However, it has been shown that the choice of seed has an influence on the performance of the <a href=\"https:\/\/liora.io\/en\/boosting-business-with-3-essential-machine-learning-algorithms\">associated Machine Learning model<\/a>: different seeds can create different sets and variable scores.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>One solution to this problem is to use the train_test_split function several times with different values for the random_state. We can then calculate the average of the scores obtained.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Thus the train_test_split function in the sklearn Python library is essential for conducting a Data Science project and evaluating a Machine Learning model when used properly !<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:buttons {\"className\":\"is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"} -->\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><!-- wp:button -->\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/liora.io\/en\/courses\/data-ai\/data-scientist\">Discover our Data Scientist training<\/a><\/div>\n<!-- \/wp:button --><\/div>\n<!-- \/wp:buttons -->\n\n<!-- wp:html -->\n<script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"FAQPage\",\n  \"mainEntity\": [\n    {\n      \"@type\": \"Question\",\n      \"name\": \"What is train_test_split?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"train_test_split is a function from the scikit-learn library used to split datasets into training and testing subsets for machine learning models.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"Why use train_test_split?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"train_test_split is used to evaluate model performance by training the model on one portion of the data and testing it on unseen data.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"How does train_test_split work?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"train_test_split randomly divides data into two subsets based on a specified test_size parameter, ensuring that the model is validated on separate data.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"What parameters can be used with train_test_split?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"Common parameters include test_size, train_size, random_state, and shuffle, which control the proportion and randomness of the split.\"\n      }\n    }\n  ]\n}\n<\/script>\n\n<!-- \/wp:html -->","protected":false},"excerpt":{"rendered":"<p>A Machine Learning model is capable of learning autonomously from one dataset, with the aim of predicting behavior on another dataset. To do this, it finds underlying relationships between independent explanatory variables and a target variable in the initial dataset. It then uses these patterns to predict or classify new data. How do I define [&hellip;]<\/p>\n","protected":false},"author":82,"featured_media":207313,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-177192","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/177192","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/82"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=177192"}],"version-history":[{"count":5,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/177192\/revisions"}],"predecessor-version":[{"id":207822,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/177192\/revisions\/207822"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/207313"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=177192"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=177192"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}