{"id":178140,"date":"2024-09-18T12:29:13","date_gmt":"2024-09-18T11:29:13","guid":{"rendered":"https:\/\/liora.io\/en\/?p=178140"},"modified":"2026-02-12T10:22:22","modified_gmt":"2026-02-12T09:22:22","slug":"calculate-correlation-between-two-variables-how-do-you-measure-dependence","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/calculate-correlation-between-two-variables-how-do-you-measure-dependence","title":{"rendered":"Calculate correlation between two variables: How do you measure dependence?"},"content":{"rendered":"\n<p><strong>In data science, it is vital to discover and quantify the extent to which two variables are linked. These relationships can be complex and are not always visible at first glance. Some of these dependencies, such as strong linear correlations between features, can weaken the performance of Machine Learning algorithms. It is therefore essential to identify them when preparing your data.<\/strong><\/p>\n\n\n\n<p>Here we will look at how to measure the dependence between two categorical variables, between two continuous variables, and between <strong>categorical and continuous variables.<\/strong> First of all, remember that a categorical variable is a variable that takes a finite number of distinct categories or groups: for example, the gender of individuals, the type of equipment or the method of payment. 
In contrast, <strong>continuous variables<\/strong> can theoretically take on an infinite number of values.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center wp-container-core-buttons-is-layout-a89b3969\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/liora.io\/en\/courses\/data-ai\/data-analyst\">Build Your Future in Data Analytics<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-correlation-between-two-categorical-variables\">Correlation between two categorical variables:<\/h2>\n\n\n\n<p>To find out whether two <strong>categorical variables<\/strong> are related, we use the well-known chi-square test. If you&#8217;re not familiar with statistical tests, don&#8217;t panic! A statistical test is a procedure for deciding between two hypotheses. It consists of rejecting or not rejecting a statistical hypothesis, called the null hypothesis H0, based on a set of data.<\/p>\n\n\n\n<p>In the test we are interested in, the null hypothesis is simply &#8220;the two variables being tested are independent&#8221;. The test comes with a test statistic, which is used to decide whether or not to reject the null hypothesis. Because of the way the test is constructed, under the null hypothesis this statistic conveniently <a href=\"https:\/\/liora.io\/en\/chi-squared-test-find-out-more-about-this-essential-statistical-test\">follows a chi-square distribution with a known number of degrees of freedom.<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-but-how-do-you-decide-whether-or-not-to-reject-the-null-hypothesis\">But how do you decide whether or not to reject the null hypothesis?<\/h3>\n\n\n\n<p>Without <a href=\"https:\/\/liora.io\/en\/sequences-and-series-understanding-these-two-mathematical-concepts\">going into the mathematical details,<\/a> each statistical test has a so-called p-value. 
This is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true, and it serves as the reference value for deciding whether or not to reject the null hypothesis.<\/p>\n\n\n\n<p>If the p-value is below 5%, then the null hypothesis is rejected. The 5% threshold is a common convention among practitioners and may vary depending on the sector of activity.<\/p>\n\n\n\n<p>The <strong>test is easily implemented in <a href=\"https:\/\/liora.io\/en\/python-the-most-popular-language\">Python<\/a><\/strong> using the scipy library and its chi2_contingency function. It allows you to quickly obtain the p-value of the test as well as the associated statistic and degrees of freedom. In practice, the chi-square test requires a little work on the data beforehand. To perform the test, you first need to build the contingency table. This is a cross-tabulation of the two variables, counting the observations for each pair of categories. It is easily obtained using the <a href=\"https:\/\/liora.io\/en\/pandas-the-python-library\">Pandas crosstab<\/a> function. The test is then performed using the contingency table:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2020\/05\/Capture-d\u2019e\u0301cran-2020-05-24-a\u0300-19.42.25.png\" alt=\"python\" \/><\/figure>\n\n\n\n<p>In our example above, the p-value is well below 5%, so we can reject the hypothesis that the two variables being tested are independent.<\/p>\n\n\n\n<p>Finally, we can also measure the level of correlation between the two variables using <strong>Cramer&#8217;s V.<\/strong> This is calculated using the test statistic, the sample size and the dimensions of the contingency table. It returns a value between 0 and 1. If the value returned is greater than 0.9, the relationship can be described as very strong. 
If the value is less than 0.10, the relationship can be described as weak.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center wp-container-core-buttons-is-layout-a89b3969\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/liora.io\/en\/courses\/data-ai\/data-analyst\">Unlock Your Potential as a Data Analyst<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-correlation-between-two-continuous-variables\">Correlation between two continuous variables:<\/h2>\n\n\n\n<p>As with <strong>categorical variables<\/strong>, there is a test to determine whether two continuous variables are linearly related: Pearson&#8217;s correlation test. The null hypothesis to be tested is similar: &#8220;the two variables tested are not linearly correlated&#8221;, that is, their correlation coefficient is zero. As with the chi-square test, it is accompanied by a test statistic and a p-value that determines whether or not the null hypothesis is rejected.<\/p>\n\n\n\n<p>This test can be <strong>implemented very easily<\/strong> using the scipy library and its pearsonr function. There is no need to work on the <strong>data beforehand,<\/strong> provided it contains no missing values. Here is an example of implementation using Python:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2020\/05\/Capture-d\u2019e\u0301cran-2020-05-24-a\u0300-19.42.19.png\" alt=\"python\" \/><\/figure>\n\n\n\n<p>In our example, the p-value is less than 5%, so we reject the null hypothesis: the variables are linearly correlated. The Pearson coefficient measures the strength and direction of the linear relationship between the two variables. It returns a value between -1 and 1. 
If it is close to 1, this means that the variables are positively correlated, close to 0 that they are linearly uncorrelated and close to -1 that they are negatively correlated. In our example, the coefficient has a value of 0.80319, which means that the variables are strongly positively correlated.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center wp-container-core-buttons-is-layout-a89b3969\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/liora.io\/en\/courses\/data-ai\/data-analyst\">Build Your Future in Data Analytics<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-correlation-between-a-continuous-variable-and-a-categorical-variable\">Correlation between a continuous variable and a<br>categorical variable:<\/h2>\n\n\n\n<p>To study this type of correlation, a one-factor <a href=\"https:\/\/liora.io\/en\/analysis-of-variance-anova-a-basic-tool-for-data-analysis\">analysis of variance (ANOVA)<\/a> is used to compare sample means. The aim of this test is to determine whether a categorical variable influences the distribution of the continuous variable to be explained.<\/p>\n\n\n\n<p>Imagine that you have 3 variables. The first gives a customer number, the second a category (1, 2 or 3) and the last the amount spent. The question is: does the category variable have an influence on the amounts spent? Let us denote by \u00b51, \u00b52 and \u00b53 the average amounts spent for each of the 3 categories. The reasoning is simple: if the category variable has no influence on the amounts spent, then the averages should be identical.<\/p>\n\n\n\n<p>In other words, <strong>\u00b51 = \u00b52 = \u00b53<\/strong>. This is exactly the null hypothesis we test when we use analysis of variance. 
As with the chi-square and <a href=\"https:\/\/liora.io\/en\/pearson-and-spearman-correlations-a-guide-to-understanding-and-applying-correlation-methods\">Pearson tests,<\/a> this test is accompanied by a test statistic and a p-value which determines whether or not the null hypothesis is rejected. This test is <strong>easily implemented in Python<\/strong> using the statsmodels library. Here is an example of implementation:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\" style=\"margin-top:var(--wp--preset--spacing--columns);margin-bottom:var(--wp--preset--spacing--columns)\"><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2020\/05\/Capture-d\u2019e\u0301cran-2020-05-24-a\u0300-19.42.38.png\" alt=\"python\" \/><\/figure>\n\n\n\n<p>In our example, df indicates the degrees of freedom of the test statistic F, which follows a Fisher distribution under the null hypothesis. PR(&gt;F) indicates the p-value of the test. This is less than 5%, so we can conclude that the categorical variable main_category has an influence on the continuous variable pledged.<\/p>\n\n\n\n<p>You now have all the tools you need to <a href=\"https:\/\/liora.io\/en\/classification-algorithms-definition-and-main-models\">study correlations within a dataset.<\/a> Liora will give you the opportunity to go further by learning how to manage a data project from A to Z. 
Find out more about our training courses!<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center wp-container-core-buttons-is-layout-a89b3969\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/liora.io\/en\/courses\/data-ai\/data-analyst\">Start Your Data Analyst Career Today<\/a><\/div>\n<\/div>\n\n\n\n<script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"FAQPage\",\n  \"mainEntity\": [\n    {\n      \"@type\": \"Question\",\n      \"name\": \"Correlation between two categorical variables\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"To find out whether two categorical variables are related, we use the famous chi\u2011square test, which determines whether the variables are independent based on a p\u2011value and statistics derived from a contingency table.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"Correlation between two continuous variables\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"To determine correlation between continuous variables, Pearson\u2019s correlation test can be used, returning a coefficient between \u22121 and 1 that indicates the strength and direction of the correlation.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"Correlation between a continuous variable and a categorical variable\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"To study the relationship between a continuous and a categorical variable, a one\u2011factor analysis of variance (ANOVA) is used to test whether 
the categorical variable influences the distribution of the continuous one.\"\n      }\n    }\n  ]\n}\n<\/script>\n\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In data science, it is vital to discover and quantify the extent to which two variables are linked. These relationships can be complex and are not always visible at first glance. Some of these dependencies, such as strong linear correlations between features, can weaken the performance of Machine Learning algorithms. It is therefore essential to identify them when preparing your data. Here we will [&hellip;]<\/p>\n","protected":false},"author":47,"featured_media":192249,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-178140","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/178140","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/47"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=178140"}],"version-history":[{"count":4,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/178140\/revisions"}],"predecessor-version":[{"id":206548,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/178140\/revisions\/206548"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/192249"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=178140"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora
.io\/en\/wp-json\/wp\/v2\/categories?post=178140"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}