{"id":174026,"date":"2023-11-27T09:24:08","date_gmt":"2023-11-27T08:24:08","guid":{"rendered":"https:\/\/liora.io\/en\/?p=174026"},"modified":"2026-02-06T08:45:33","modified_gmt":"2026-02-06T07:45:33","slug":"chi-squared-test-find-out-more-about-this-essential-statistical-test","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/chi-squared-test-find-out-more-about-this-essential-statistical-test","title":{"rendered":"Chi squared test: Find out more about this essential statistical test"},"content":{"rendered":"<p><strong>The chi squared test (or chi-2 test) is a statistical test for variables that take a finite number of possible values, making them categorical variables. As a reminder, a statistical test is a method used to determine whether a hypothesis, known as the null hypothesis, is consistent with the data or not.<\/strong><\/p>\t\t\n\t\t\t<h3>What is the purpose of the Chi squared test?<\/h3>\t\t\n\t\t<p>The advantage of the <strong>Chi squared test<\/strong> is its wide range of applications:<\/p><ul><li><strong>Test of goodness of fit<\/strong> to a predefined law or family of laws, for example: Does the height of individuals in a population follow a normal distribution?<\/li><li><strong>Test of independence,<\/strong> for example: Is hair color independent of gender?<\/li><li><strong>Homogeneity test:<\/strong> <a href=\"https:\/\/liora.io\/en\/datasets-top-5-places-to-find-quality-datasets\">Are two sets of data identically distributed?<\/a><\/li><\/ul>\t\t\n\t\t\t<h3>How does the Chi squared test work?<\/h3>\t\t\n\t\t<p>Its principle is to measure the divergence between the distribution of the sample and a theoretical distribution using the Pearson statistic <strong>[latex] \chi_{Pearson} [\/latex],<\/strong> which is based on the <strong>chi-squared distance.<\/strong><\/p><p><strong>First problem:<\/strong> Since we have only a limited amount of data, we cannot perfectly know the distribution of the sample, but only an approximation of it, the 
empirical measure.<\/p><p>The empirical measure [latex] \widehat{\mathbb{P}}_{n,X} [\/latex] represents the frequency of the different observed values:<\/p>[latex] \forall x \in \mathbb{X} \quad \widehat{\mathbb{P}}_{n,X}(x) = \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}_{X_{k}=x} [\/latex]<p style=\"text-align: center;\">Empirical measure formula<\/p><p style=\"text-align: center;\">with<\/p>[latex] X_{1}, \dots, X_{n} [\/latex] = the sample[latex] \mathbb{X} [\/latex] = the set of possible values<p>The Pearson statistic is defined as:<\/p>[latex] \chi_{Pearson} = n \times \chi^{2}(\widehat{\mathbb{P}}_{n,X}, P_{theoretical}) = n \times \sum_{x \in \mathbb{X}} \frac{(\widehat{\mathbb{P}}_{n,X}(x) - P_{theoretical}(x))^{2}}{P_{theoretical}(x)} [\/latex]<p style=\"text-align: center;\">Pearson statistic formula<\/p><p style=\"text-align: left;\">Under the null hypothesis, i.e. when the distribution of the sample equals the theoretical distribution, the Pearson statistic converges in distribution to the <strong>chi-squared<\/strong> distribution with d degrees of freedom.<\/p><p style=\"text-align: left;\">The number of degrees of freedom, d, depends on the dimensions of the problem and is generally equal to the number of possible values minus 1.<\/p><p>As a reminder, the chi-squared distribution with d degrees of freedom, [latex] \chi^{2}_{law}(d) [\/latex], is the distribution of a sum of squares of d independent standard Gaussian variables:<\/p>[latex] \chi^{2}_{law}(d) := \sum_{k=1}^{d} X_{k}^{2} \quad \text{with} \quad X_{k} \sim \mathcal{N}(0,1) \ \text{i.i.d.} [\/latex]<p style=\"text-align: left;\">Otherwise, the statistic diverges to infinity, reflecting the gap between the empirical and theoretical distributions.<\/p>[latex] \text{Under } H_{0}: \quad \lim_{n \rightarrow \infty} \chi_{Pearson} = \chi^{2}_{law}(d) \\\n\text{Under } H_{1}: \quad \lim_{n \rightarrow \infty} \chi_{Pearson} = \infty\n[\/latex]<p style=\"text-align: center;\">Limit formula<\/p>\t\t\n\t\t\t<h3>What are the benefits of the Chi squared test?<\/h3>\t\t\n\t\t<p><strong>So, we have a simple decision rule:<\/strong> if the Pearson statistic exceeds a certain threshold, we reject the null hypothesis (the theoretical distribution does not fit the data); otherwise, we fail to reject it.<\/p><p>The advantage of the <strong>chi-squared test<\/strong> is that this threshold depends only on the chi-squared distribution and the significance level alpha, so it is independent of the distribution of the sample.<\/p>\t\t\n\t\t\t<h3>The test of independence:<\/h3>\t\t\n\t\t<p>Let&#8217;s take an example to illustrate this test: we want to determine whether the sexes of the first two children in a family, X and Y, are independent.<\/p><p>We have gathered the <a href=\"https:\/\/liora.io\/en\/what-is-a-dataset-how-do-i-work-with-it\">data in a contingency table:<\/a><\/p>[latex] \begin{array}{|c|c|c|c|}\n\hline X \/ Y &amp; \text{Child 2: son} &amp; \text{Child 2: daughter} &amp; \text{Total} \\\n\hline \text{Child 1: son} &amp; 857 &amp; 801 &amp; 1658 \\\n\hline \text{Child 1: daughter} &amp; 813 &amp; 828 &amp; 1641 \\\n\hline \text{Total} &amp; 1670 &amp; 1629 &amp; 3299 \\\n\hline \end{array}\n[\/latex]<p>The <strong>Pearson statistic<\/strong> will determine whether the empirical measure of the joint distribution (X, Y) equals the product of the empirical marginal measures, which <strong>characterizes independence:<\/strong><\/p>[latex] \chi_{Pearson} = n \times \chi^{2}(\widehat{\mathbb{P}}_{X \times Y}, \widehat{\mathbb{P}}_{X} \times \widehat{\mathbb{P}}_{Y}) = n \times \sum_{x \in \{daughter, son\},\, y \in \{daughter, son\}} \frac{(Observation_{x,y} - Theory_{x,y})^{2}}{Theory_{x,y}}\n[\/latex]<p>Here, Observation(x, y) represents the observed frequency of the value (x, y):<\/p>[latex] \forall x, y \in \{daughter, son\} \quad Observation_{x,y} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}_{(X_{k},Y_{k}) = (x, y)}\n[\/latex]<p>For example:<\/p>[latex] Observation(daughter, daughter) = \frac{828}{3299} = 0.251\n[\/latex]<p>For <strong>Theory(x, y)<\/strong>, X and Y are assumed to be independent, so the theoretical distribution is the product of the <strong>marginal distributions:<\/strong><\/p>[latex] \forall x, y \in \{daughter, son\} \quad\nTheory_{x,y} = Observation^{X}_{x} \times Observation^{Y}_{y} = \sum_{y' \in \{daughter, son\}} Observation_{x,y'} \times \sum_{x' \in \{daughter, son\}} Observation_{x',y}\n[\/latex]<p>Thus, the theoretical probability for (son, son) is:<\/p>[latex] Theory(son, son) = \frac{857+801}{3299} \times \frac{857+813}{3299} = \frac{1658 \times 1670}{3299^{2}} = 0.254 [\/latex]<p>Let&#8217;s calculate the test statistic in Python with scipy.<\/p><p>In our case, the variables X and Y each have only 2 possible values, daughter or son, so the number of degrees of freedom is (2-1)(2-1) = 1.<\/p><p>Therefore, we compare the test statistic to the <strong>chi-squared quantile<\/strong> with 1 degree of freedom using the chi2.ppf function from <a href=\"https:\/\/liora.io\/en\/scipy-all-about-the-python-machine-learning-library\">scipy.stats.<\/a><\/p><p>Since the test statistic is lower than the quantile and the p-value is greater than the significance level of 0.05, we cannot reject the null hypothesis at the 95% confidence level.<\/p><p>Thus, the data are consistent with the sexes of the first two children being independent.<\/p>\t\t\n\t\t\t<h3>What are its limits?<\/h3>\t\t\n\t\t<p>While the <strong>chi squared test<\/strong> is very practical, it does have limitations. It can only detect the existence of a dependence; it does not measure its strength, nor does it establish causality.<\/p><p>It relies on the approximation of the Pearson statistic by the chi-squared distribution, which is only valid if you have a sufficient amount of data. 
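As a sketch of the computation described above (this code is not from the original article; it assumes scipy is available), the independence test on the contingency table can be run as follows. The expected counts it computes also let one check the sample-size validity condition discussed below:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the contingency table: sex of child 1 (rows)
# versus sex of child 2 (columns):
# [[son-son, son-daughter], [daughter-son, daughter-daughter]]
observed = np.array([[857, 801],
                     [813, 828]])
n = observed.sum()  # 3299 families

# Expected counts under independence: n times the product of the
# marginal frequencies (Theory(x, y) in the text)
expected = n * np.outer(observed.sum(axis=1) / n, observed.sum(axis=0) / n)

# Pearson statistic: n times the chi-squared distance between the
# observed and expected frequencies
stat = ((observed - expected) ** 2 / expected).sum()

# Decision at the 5% significance level, (2-1)(2-1) = 1 degree of freedom
threshold = chi2.ppf(0.95, df=1)
p_value = chi2.sf(stat, df=1)
print(f"chi2 = {stat:.3f}, threshold = {threshold:.3f}, p = {p_value:.3f}")

# Cross-check against scipy's built-in test; correction=False disables
# Yates' continuity correction so it matches the plain Pearson statistic
stat2, p2, dof, _ = chi2_contingency(observed, correction=False)
assert np.isclose(stat, stat2) and dof == 1
```

Here the statistic (about 1.5) stays below the 5% threshold of about 3.84, and the p-value exceeds 0.05, so the null hypothesis of independence is not rejected, in line with the conclusion above.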
In practice, the validity condition is as follows:<\/p>[latex] \forall x \in \mathbb{X} \quad n \times P_{theoretical}(x)(1 - P_{theoretical}(x)) \geq 5 [\/latex]<p>The Fisher exact test can address this limitation but requires significant computational power. In practice, it is often limited to 2&#215;2 contingency tables.<\/p><p>Statistical tests are crucial in <a href=\"https:\/\/liora.io\/en\/data-science-in-education-how-data-is-transforming-schools\">Data Science to assess the relevance of explanatory variables<\/a> and validate modeling assumptions. You can find more information about the chi-squared test and other statistical tests in our module 104 &#8211; Exploratory Statistics.<\/p>\t\t\n\t\t\t\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"\/formation\/data-ia\/data-scientist\">Discover our Data Scientist training<\/a><\/div><\/div>\n\n\t\t\t<h3>References:<\/h3>\t\t\n\t\t<p><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.chi2.html\">https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.chi2.html<\/a><\/p><p><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.chi2_contingency.html\">https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.chi2_contingency.html<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>The chi squared test (or chi 2) is a statistical test for variables that take a finite number of possible values, making them categorical variables. As a reminder, a statistical test is a method used to determine whether a hypothesis, known as the null hypothesis, is consistent with the data or not. 
What is the [&hellip;]<\/p>\n","protected":false},"author":76,"featured_media":174027,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-174026","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/174026","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/76"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=174026"}],"version-history":[{"count":1,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/174026\/revisions"}],"predecessor-version":[{"id":206216,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/174026\/revisions\/206216"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/174027"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=174026"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=174026"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}