Pearson Correlation Made Easy: A Practical Guide with Agri Analyze

Summary

This article explains Pearson correlation, covering its theory, properties, formulas, and assumptions. It describes how correlation measures the strength and direction of a linear relationship between two continuous variables and how scatter diagrams help visualize these relationships. The article presents the Pearson correlation coefficient, its interpretation, and the method for testing its significance using a t-test. A solved numerical example is included for better understanding. The blog also provides a step-by-step guide to performing correlation analysis using the Agri Analyze tool. Finally, it showcases automated outputs such as heatmaps, correlation matrices, and smart interpretations.

1. Introduction

Correlation is a statistical measure that describes the extent to which two variables change together. More specifically, it quantifies the degree of linear relationship between two continuous variables in a bivariate distribution.

Correlation can be positive (both variables increase or decrease together), negative (one variable increases while the other decreases), or zero (no linear relationship between the variables). The correlation coefficient, typically denoted by r, ranges from -1 to +1, where:

  • r = 1 indicates perfect positive correlation
  • r = -1 indicates perfect negative correlation
  • r = 0 indicates no linear correlation

Properties of Correlation Coefficient:

  • The value of correlation always lies between -1 and +1.
  • Correlation is independent of change in origin and scale.
  • Correlation is a unit-free measure.
  • In a two-variable framework, the correlation coefficient is the geometric mean of the two regression coefficients (Y on X and X on Y) and takes the sign that the two coefficients share.
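The properties listed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`sums_of_deviations`, `pearson_r`) are illustrative, the data are the X and Y values from the solved example later in this article, and the shift and scale constants are arbitrary choices. Both scale factors are positive here; a negative scale factor on one variable would flip the sign of r while leaving its magnitude unchanged.

```python
import math

def sums_of_deviations(x, y):
    """Return (SP(XY), SS(X), SS(Y)) computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp_xy, ss_x, ss_y

def pearson_r(x, y):
    """Pearson correlation coefficient r = SP(XY) / sqrt(SS(X) * SS(Y))."""
    sp_xy, ss_x, ss_y = sums_of_deviations(x, y)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Data from the solved example in this article
x = [10, 20, 30, 40, 50]
y = [20, 25, 15, 35, 30]
r = pearson_r(x, y)

# Change of origin and scale (arbitrary constants, both scale factors positive)
u = [(a - 2.0) / 10.0 for a in x]
v = [(b + 5.0) * 3.0 for b in y]
r_transformed = pearson_r(u, v)

# Geometric mean of the two regression coefficients, with their common sign
sp_xy, ss_x, ss_y = sums_of_deviations(x, y)
b_yx = sp_xy / ss_x  # regression coefficient of Y on X
b_xy = sp_xy / ss_y  # regression coefficient of X on Y
r_from_slopes = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(f"r                 = {r:.3f}")
print(f"r after transform = {r_transformed:.3f}")
print(f"r from slopes     = {r_from_slopes:.3f}")
```

All three printed values agree, and because r is a ratio of quantities measured in the same units, it carries no unit of its own.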

Visualizing Relationship Using Scatter Diagram:

In correlation problems, the first step is to investigate whether there is any relationship between the variables, say X and Y. For this purpose, a scatter diagram is used.

From the scatter diagram, it is possible to determine the presence of correlation between X and Y and also its nature, whether the correlation is positive, negative, or zero, and whether the relationship is linear or curvilinear. In the scatter diagrams, patterns illustrating positive correlation, negative correlation, and no correlation can be observed. When the trend is linear, the relationship between X and Y is referred to as linear correlation. When the trend is curvilinear, the relationship is termed curvilinear or non-linear correlation. Such non-linear relationships may take various forms, such as quadratic, cubic, or other higher-order relationships.
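The distinction between linear and curvilinear trends matters because r captures only the linear part of a relationship. The short sketch below uses synthetic data (not from the article) chosen so that one set lies exactly on a straight line and the other exactly on a parabola that is symmetric about x = 0. The helper name `pearson_r` is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Synthetic x values, symmetric about zero
x = [-3, -2, -1, 0, 1, 2, 3]

y_linear = [2 * a + 1 for a in x]   # points on a straight line
y_quadratic = [a ** 2 for a in x]   # points on a parabola

r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)

print(f"linear trend:      r = {r_linear:.3f}")
print(f"curvilinear trend: r = {r_quadratic:.3f}")
```

In the second case Y is completely determined by X, yet r is zero. A scatter diagram would reveal the curve immediately, which is why plotting the data is recommended before computing r.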

Correlation Diagram 1
Correlation Diagram 2

2. Pearson Correlation:

A scatter diagram gives only a rough idea of whether correlation is present and of its nature (positive or negative). It does not indicate the strength or degree of the relationship between two variables. The index of the degree of relationship between two continuous variables is known as the correlation coefficient. It is symbolized as r for a sample and as \( \rho \) for a population. The correlation coefficient r is known as Pearson's correlation coefficient, since it was developed by Karl Pearson. It is often referred to as the product-moment correlation to distinguish it from other measures of inter-relationship.

The correlation coefficient, \( r \), is defined as the ratio of the covariance of the variables \( X \) and \( Y \) to the product of their standard deviations. Symbolically,

\[ r = \frac{ \frac{1}{n-1}\sum (x - \bar{x})(y - \bar{y}) }{ \sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2} \sqrt{\frac{1}{n-1}\sum (y - \bar{y})^2} } \]

It can be simplified as:

\[ r = \frac{ \sum xy - \frac{\sum x \sum y}{n} }{ \sqrt{\sum x^2 - \frac{(\sum x)^2}{n}} \sqrt{\sum y^2 - \frac{(\sum y)^2}{n}} } \]

The numerator of the correlation coefficient formula is termed the sum of products of X and Y and is abbreviated as SP(XY). In the denominator, the quantity under the first square root is called the sum of squares of X, or SS(X), and the quantity under the second square root is called the sum of squares of Y, or SS(Y), so that \( r = SP(XY) / \sqrt{SS(X)\,SS(Y)} \). The simplified formula is commonly used for computational purposes.

The denominator of the formula is always positive (provided neither variable is constant), whereas the numerator may be either positive or negative. As a result, the sign of the correlation coefficient r is the sign of the numerator.
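The two formulas above give the same value because the \( \frac{1}{n-1} \) factors cancel between numerator and denominator. A minimal sketch of both forms follows, assuming plain Python lists as input; the function names are illustrative, and the data are the X and Y values from the solved example in this article.

```python
import math

def pearson_r_definition(x, y):
    """r as covariance divided by the product of standard deviations (n - 1 divisors)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov_xy / (sd_x * sd_y)

def pearson_r_computational(x, y):
    """r from raw sums: SP(XY) / sqrt(SS(X) * SS(Y))."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    if ss_x == 0 or ss_y == 0:
        # A constant variable makes the denominator zero, so r is undefined
        raise ValueError("r is undefined when a variable has zero variance")
    return sp_xy / math.sqrt(ss_x * ss_y)

# Data from the solved example in this article
x = [10, 20, 30, 40, 50]
y = [20, 25, 15, 35, 30]

r_def = pearson_r_definition(x, y)
r_comp = pearson_r_computational(x, y)
print(f"definition form:    r = {r_def:.3f}")
print(f"computational form: r = {r_comp:.3f}")
```

For this data the raw sums give SP(XY) = 4050 - 3750 = 300, SS(X) = 5500 - 4500 = 1000 and SS(Y) = 3375 - 3125 = 250, matching the deviation-based sums in the solved example.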

The correlation coefficient r is used under certain assumptions:

  • The variables under study are continuous random variables and are normally distributed.
  • The relationship between the variables is linear.
  • Each pair of observations is independent and unconnected with other pairs.

Testing the Significance of the Correlation Coefficient: A Step-by-Step Guide

To determine whether an observed correlation is statistically significant, a hypothesis test is performed on the sample correlation coefficient r.

Steps for Testing the Significance of the Correlation Coefficient:

  1. Formulate the Hypotheses:
    • Null Hypothesis (H0): \( \rho = 0 \) (There is no linear relationship between the variables)
    • Alternative Hypothesis (H1): \( \rho \neq 0 \) (There is a linear relationship between the variables)
  2. Calculate the Test Statistic: The test statistic for the correlation coefficient is given by:

    \[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]

    Where,

    • r is the sample correlation coefficient
    • n is the number of data points (sample size)
  3. Determine the Degrees of Freedom: The degrees of freedom (df) for this test is \( n - 2 \).
  4. Find the Critical Value: Use a t-distribution table or statistical software to find the critical value of t for a given significance level (α) and n − 2 degrees of freedom. Common significance levels are 0.05, 0.01 and 0.10.
  5. Compare the Test Statistic to the Critical Value:
    • If \( |t| \) is greater than the critical value, reject the null hypothesis (H0). This means the correlation is statistically significant.
    • If \( |t| \) is less than or equal to the critical value, do not reject the null hypothesis. This means there is not enough evidence to conclude that the correlation is significant.
  6. Interpret the Results: Based on the comparison, draw conclusions about the significance of the correlation coefficient.

Solved example of Pearson Correlation

Problem statement: There are two variables, X and Y, each having 5 observations. Compute the Pearson correlation coefficient and test its significance using a t-test. The data are X: 10, 20, 30, 40, 50 and Y: 20, 25, 15, 35, 30.

Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear relationship between two variables. It is calculated using the formula:

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})} {\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]

1. Calculate the mean:

\[ \bar{X} = \frac{10 + 20 + 30 + 40 + 50}{5} = 30 \]

\[ \bar{Y} = \frac{20 + 25 + 15 + 35 + 30}{5} = 25 \]

2. Calculate the differences from the means and their products:

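| X | Y | \( X_i - \bar{X} \) | \( Y_i - \bar{Y} \) | \( (X_i - \bar{X})(Y_i - \bar{Y}) \) | \( (X_i - \bar{X})^2 \) | \( (Y_i - \bar{Y})^2 \) |
|---|---|---|---|---|---|---|
| 10 | 20 | -20 | -5 | 100 | 400 | 25 |
| 20 | 25 | -10 | 0 | 0 | 100 | 0 |
| 30 | 15 | 0 | -10 | 0 | 0 | 100 |
| 40 | 35 | 10 | 10 | 100 | 100 | 100 |
| 50 | 30 | 20 | 5 | 100 | 400 | 25 |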

3. Sum the columns:

\[ \sum (X_i - \bar{X})(Y_i - \bar{Y}) = 100 + 0 + 0 + 100 + 100 = 300 \]

\[ \sum (X_i - \bar{X})^2 = 400 + 100 + 0 + 100 + 400 = 1000 \]

\[ \sum (Y_i - \bar{Y})^2 = 25 + 0 + 100 + 100 + 25 = 250 \]

4. Calculate the Pearson correlation coefficient:

\[ r = \frac{300}{\sqrt{1000 \times 250}} = \frac{300}{500} = 0.6 \]

5. Calculate the Test Statistic:

The test statistic for the correlation coefficient is calculated using the formula:

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]

where r is the Pearson correlation coefficient and n is the number of pairs.

For our data:

\[ t = \frac{0.6\sqrt{5-2}}{\sqrt{1-0.6^2}} = \frac{0.6 \times 1.732}{0.8} \approx 1.299 \]

6. Degrees of Freedom: \[ df = n - 2 = 5 - 2 = 3 \]

7. Determine the Critical Value:

Using a t-distribution table, find the critical t-value for a two-tailed test at a chosen significance level (typically \( \alpha = 0.05 \)) with 3 degrees of freedom.

The critical value is approximately \( \pm 3.182 \).

8. Conclusion: Since \( |t| = 1.299 \) is less than the critical value of 3.182, we do not reject the null hypothesis. There is insufficient evidence to conclude that there is a significant linear relationship between X and Y at the 0.05 significance level.

9. Summary: Based on the test of significance, the Pearson correlation coefficient of 0.6 is not statistically significant at the 0.05 significance level. Therefore, we cannot conclude that there is a significant linear relationship between the variables X and Y for this dataset.
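The worked example can be reproduced step by step in a few lines of Python. The sketch below follows the same order as the hand calculation (means, deviation table, column sums, r, t, decision). The critical value 3.182 is copied from the t-table value quoted above rather than computed, and the variable names are illustrative.

```python
import math

# Data from the problem statement
x = [10, 20, 30, 40, 50]
y = [20, 25, 15, 35, 30]
n = len(x)

# Step 1: means
mean_x = sum(x) / n
mean_y = sum(y) / n

# Step 2: deviations from the means, their products and squares
rows = []
for xi, yi in zip(x, y):
    dx = xi - mean_x
    dy = yi - mean_y
    rows.append((xi, yi, dx, dy, dx * dy, dx ** 2, dy ** 2))

# Step 3: column sums
sp_xy = sum(row[4] for row in rows)
ss_x = sum(row[5] for row in rows)
ss_y = sum(row[6] for row in rows)

# Step 4: Pearson correlation coefficient
r = sp_xy / math.sqrt(ss_x * ss_y)

# Steps 5 and 6: test statistic and degrees of freedom
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Steps 7 and 8: compare with the two-tailed table value (alpha = 0.05, df = 3)
T_CRITICAL = 3.182
reject_h0 = abs(t) > T_CRITICAL

for row in rows:
    print(*row)
print(f"SP(XY) = {sp_xy}, SS(X) = {ss_x}, SS(Y) = {ss_y}")
print(f"r = {r:.3f}, t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```

With only five pairs, even a moderately large r of 0.6 does not reach significance, because the critical value for 3 degrees of freedom is large.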

Steps to perform Pearson Correlation Analysis using Agri Analyze

A larger dataset with 4 variables and 150 observations is considered for demonstration. A snapshot is given below:

Step 1: Create a CSV file with columns for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.

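| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 |
| 4.9 | 3.0 | 1.4 | 0.2 |
| 4.7 | 3.2 | 1.3 | 0.2 |
| 4.6 | 3.1 | 1.5 | 0.2 |
| 5.0 | 3.6 | 1.4 | 0.2 |
| 5.4 | 3.9 | 1.7 | 0.4 |
| 4.6 | 3.4 | 1.4 | 0.3 |
| 5.0 | 3.4 | 1.5 | 0.2 |
| 4.4 | 2.9 | 1.4 | 0.2 |
| 4.9 | 3.1 | 1.5 | 0.1 |
| 5.4 | 3.7 | 1.5 | 0.2 |
| 4.8 | 3.4 | 1.6 | 0.2 |
| 4.8 | 3.0 | 1.4 | 0.1 |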

Step 2: Click on ANALYTICAL TOOL -> CORRELATION AND REGRESSION ANALYSIS -> PEARSON CORRELATION

Step 3: Open link https://www.agrianalyze.com/PearsonCorrelation.aspx (For first time users free registration is mandatory)

Step 4: Download the sample file using the Sample File Download link.

Step 5: Click submit, pay a nominal fee, and download the output report with detailed interpretation.
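As an optional local cross-check of a correlation matrix, the pairwise coefficients can be computed with a short script. This is a hedged sketch and not how Agri Analyze itself computes its report: it assumes a comma-delimited CSV with a header row, the function names are illustrative, and it uses only the 13 rows shown in Step 1, so its values will differ from a report built on the full 150 observations.

```python
import csv
import io
import math

# The rows shown in Step 1, written as comma-delimited CSV (delimiter is an assumption)
CSV_TEXT = """Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4
4.6,3.4,1.4,0.3
5.0,3.4,1.5,0.2
4.4,2.9,1.4,0.2
4.9,3.1,1.5,0.1
5.4,3.7,1.5,0.2
4.8,3.4,1.6,0.2
4.8,3.0,1.4,0.1
"""

def pearson_r(x, y):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp_xy / math.sqrt(ss_x * ss_y)

def correlation_matrix(csv_text):
    """Parse CSV text and return (column names, nested dict of pairwise r values)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = list(reader.fieldnames)
    columns = {name: [] for name in names}
    for row in reader:
        for name in names:
            columns[name].append(float(row[name]))
    matrix = {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }
    return names, matrix

names, matrix = correlation_matrix(CSV_TEXT)
for a in names:
    print(a.ljust(14), "  ".join(f"{matrix[a][b]:6.3f}" for b in names))
```

To run this on a real file, the `io.StringIO(csv_text)` source could be replaced with an opened file handle. The matrix is symmetric with ones on the diagonal, which is a quick sanity check on any correlation output.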

Output Report: Link of the output report

Video Tutorial: Link of the YouTube Tutorial

Quiz: Pearson Correlation Quiz

The blog is written by:
Darshan Kothiya, PhD Scholar, Department of Agricultural Statistics, BACA, AAU, Anand