There are many situations in daily life where we know from experience that two variables are directly associated, but we cannot put a precise measure on that association. For example, the chance of you going out to watch a newly released movie is directly associated with the number of friends who go with you, because the more the merrier!

But there are many other factors too, such as your interest in the movie and your budget. To analyse the situation in detail, you need to note down your similar past experiences and form a distribution from that data. It is at this point that you require a correlation coefficient, which provides a value from which you can estimate the chance of you not going for the movie this time if your friends don't turn up! Karl Pearson's Coefficient of Correlation is one such parameter, and it is the one we'll be studying in this section.

*Learn more about Rank Correlation here in detail.*

**Introduction to Coefficient of Correlation**

Karl Pearson's product-moment correlation coefficient (or simply, the Pearson correlation coefficient) is a measure of the strength of the linear association between two variables and is denoted by *r* or \(r_{xy}\) (x and y being the two variables involved).

This method of correlation attempts to draw a line of best fit through the data of the two variables, and the value of the Pearson correlation coefficient, *r*, indicates how far the data points lie from this line of best fit.


**Karl Pearson Correlation Coefficient Formula**

The coefficient of correlation \(r_{xy}\) between two variables x and y, for the bivariate dataset \((x_i, y_i)\) where i = 1, 2, 3, …, N, is given by –

$$r_{(x,y)} = \frac{\text{cov}(x,y)}{\sigma_x \sigma_y}$$

where,

*⇒ cov(x,y):* the covariance between x and y

– \(\text{cov}(x,y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}) = \frac{\sum x_iy_i}{N} - \bar{x}\bar{y}\)

Here, \(\bar{x}\) and \(\bar{y}\) are simply the respective means of the distributions of x and y.

*⇒* \(\sigma_x\) and \(\sigma_y\) are the standard deviations of the distributions x and y.

– \(\sigma_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}} = \sqrt{\frac{\sum x_i^2}{N} - \bar{x}^2}\)

– \(\sigma_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{N}} = \sqrt{\frac{\sum y_i^2}{N} - \bar{y}^2}\)
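As a quick sanity check, this covariance-and-standard-deviation form of the formula can be sketched in a few lines of Python (the data here is illustrative, not from the article):

```python
def pearson_r(x, y):
    """Pearson's r from its definition: r = cov(x, y) / (sigma_x * sigma_y),
    using the population (divide-by-N) forms given above."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # cov(x, y) = (sum of x_i*y_i)/N - mean_x*mean_y
    cov = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    # variances via the shortcut form (sum of squares)/N - mean^2
    var_x = sum(xi ** 2 for xi in x) / n - mean_x ** 2
    var_y = sum(yi ** 2 for yi in y) / n - mean_y ** 2
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# A perfectly linear relationship gives r = +1
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
```

Note that the shortcut forms (second equalities above) and the definitional forms give identical results; the shortcut just avoids computing the deviations explicitly.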

**Alternate Formula**

*If the data is given in the form of a class-distributed frequency distribution, you may use the following formulae –*

*⇒ cov(x,y):* the covariance between x and y

– \(\text{cov}(x,y) = \frac{\sum_{i,j} x_iy_jf_{ij}}{N} - \bar{x}\bar{y}\)

Here, \(\bar{x}\) and \(\bar{y}\) are simply the respective means of the distributions of x and y.

*⇒* \(\sigma_x\) and \(\sigma_y\) are the standard deviations of the distributions x and y.

– \(\sigma_x = \sqrt{\frac{\sum_i f_{io} x_i^2}{N} - \bar{x}^2}\)

– \(\sigma_y = \sqrt{\frac{\sum_j f_{oj} y_j^2}{N} - \bar{y}^2}\)

where,

\(x_i\): the central value of the i-th class of x

\(y_j\): the central value of the j-th class of y

\(f_{io}, f_{oj}\): the marginal frequencies of x and y respectively

\(f_{ij}\): the frequency of the (i, j)-th cell

In any case, the following equality must always hold:

Total frequency = N = \(\sum_{i,j} f_{ij} = \sum_i f_{io} = \sum_j f_{oj}\)
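The grouped-data formulae above can be sketched directly in Python. The 2×2 joint frequency table below is hypothetical, chosen only to keep the arithmetic small:

```python
# Hypothetical class marks (central values) and joint frequency table f[i][j]
x_marks = [10, 20]        # central values of the x-classes
y_marks = [5, 15]         # central values of the y-classes
f = [[3, 1],              # f[i][j]: frequency of the (i, j)-th cell
     [2, 4]]

N = sum(sum(row) for row in f)                       # total frequency
f_io = [sum(row) for row in f]                       # marginal frequencies of x
f_oj = [sum(f[i][j] for i in range(len(x_marks)))    # marginal frequencies of y
        for j in range(len(y_marks))]

mean_x = sum(fi * xi for fi, xi in zip(f_io, x_marks)) / N
mean_y = sum(fj * yj for fj, yj in zip(f_oj, y_marks)) / N

# cov(x, y) = (sum over i,j of x_i * y_j * f_ij) / N - mean_x * mean_y
cov = sum(f[i][j] * x_marks[i] * y_marks[j]
          for i in range(len(x_marks))
          for j in range(len(y_marks))) / N - mean_x * mean_y
sigma_x = (sum(fi * xi ** 2 for fi, xi in zip(f_io, x_marks)) / N - mean_x ** 2) ** 0.5
sigma_y = (sum(fj * yj ** 2 for fj, yj in zip(f_oj, y_marks)) / N - mean_y ** 2) ** 0.5

r = cov / (sigma_x * sigma_y)
print(round(r, 3))  # 0.408
```

Note how the marginal frequencies \(f_{io}\) and \(f_{oj}\) fall out of the joint table by summing rows and columns, so only the cell frequencies need to be supplied.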

**A Single Formula for Discrete Datasets –**

$$r_{xy} = \frac{N\sum x_iy_i - \sum x_i \sum y_i}{\sqrt{N\sum x_i^2 - (\sum x_i)^2} \sqrt{N\sum y_i^2 - (\sum y_i)^2}}$$

*Let us understand more about Scatter Diagram here.*

**Properties of the Pearson’s Correlation Coefficient**

⇒ **The coefficient *r* is unit-less.** Thus, we may use it to compare the association between totally different bivariate distributions as well. For example, you may compare how much your skipping a movie is related to your friends not joining you, and how much it is related to your own lack of interest in the movie, by means of the Pearson correlation coefficients obtained in the two cases. In economics, where cost prices or market shares depend on many different factors, this parameter is therefore of utmost importance in ascertaining the connection between various quantities.

⇒ **The value of *r* always lies between +1 and -1.** Depending on its exact value, we see the following degrees of association between the variables –

**r value variation:**

| Strength of association | Negative r | Positive r |
| --- | --- | --- |
| Weak | -0.1 to -0.3 | 0.1 to 0.3 |
| Average | -0.3 to -0.5 | 0.3 to 0.5 |
| Strong | -0.5 to -1.0 | 0.5 to 1.0 |

A value greater than 0 indicates a positive association i.e. as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association i.e. as the value of one variable increases, the value of the other variable decreases.

⇒ The Pearson product-moment correlation does not take into consideration whether a variable has been classified as a dependent or independent variable. It treats all variables equally.

⇒ **A change of origin of the system, or any scaling of the variables, doesn't affect the value of *r*. Only the sign might change, depending on the sign of the scaling.**

Basically, if the bivariate system (x, y) is converted to another bivariate system (u, v) by a change of origin or scaling or both, in the following way –

$$u = \frac{x - a}{b}, \quad v = \frac{y - c}{d}$$

Then the correlation coefficient takes on the following value – $$r_{(u,v)} = \frac{bd}{|b||d|} \ r_{(x,y)}$$
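This invariance is easy to verify numerically. The sketch below applies a change of origin and scale with b > 0 and d < 0 to illustrative data, so by the relation above the sign of *r* flips while its magnitude is preserved:

```python
def pearson(x, y):
    """Pearson's r via the single formula for discrete datasets."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = ((n * sxx - sx * sx) ** 0.5) * ((n * syy - sy * sy) ** 0.5)
    return num / den

x = [25, 35, 10, 40, 85]
y = [63, 68, 72, 62, 65]
r_xy = pearson(x, y)

# Change of origin and scale: u = (x - a)/b with b = 5 > 0,
# v = (y - c)/d with d = -2 < 0, so r(u, v) = (bd / |b||d|) * r(x, y) = -r(x, y)
u = [(xi - 40) / 5 for xi in x]
v = [(yi - 60) / -2 for yi in y]
print(round(pearson(u, v), 6) == round(-r_xy, 6))  # True
```

This is why, in hand computation, one often shifts the data by convenient constants (a "step-deviation" style simplification) without changing the magnitude of the answer.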

**Assumptions**

While calculating the Pearson’s Correlation Coefficient, we make the following assumptions –

- There is a linear relationship (or any linear component of the relationship) between the two variables
- We keep Outliers either to a minimum or remove them entirely

An outlier is a data point that does not fit the general trend of your data; it appears as an extreme value compared to the rest of your data points. You can detect outliers by plotting the two variables against each other on a graph and visually inspecting the graph for extreme points.

You can then either remove or adjust that particular point, as long as you can justify why you did so. Outliers can have a very large effect on the line of best fit and on the Pearson correlation coefficient, and can lead to very different conclusions about your data. Both of the above points can be checked easily for a given pair of variables by studying their scatter plot.
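To see how strongly a single outlier can distort *r*, consider this small sketch (the data is illustrative): a perfectly linear dataset, then the same dataset with one extreme point appended.

```python
def pearson(x, y):
    """Pearson's r via the single formula for discrete datasets."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        (n * sxx - sx ** 2) ** 0.5 * (n * syy - sy ** 2) ** 0.5)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                       # perfectly linear: r = 1
print(round(pearson(x, y), 2))             # 1.0

# Appending one extreme point drags r from +1 all the way to negative
print(round(pearson(x + [6], y + [-20]), 2))  # -0.44
```

A single point out of six is enough here to turn a perfect positive correlation into a moderately negative one, which is exactly why outliers must be inspected before trusting *r*.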

**Solved Example on Coefficient of Correlation**

**Question:** An experiment conducted on 9 different cigarette-smoking subjects resulted in the following data –

| Subject number | Cigarettes smoked per week (averaged over the last 5 years of life) | Number of years lived |
| --- | --- | --- |
| 1 | 25 | 63 |
| 2 | 35 | 68 |
| 3 | 10 | 72 |
| 4 | 40 | 62 |
| 5 | 85 | 65 |
| 6 | 75 | 46 |
| 7 | 60 | 51 |
| 8 | 45 | 60 |
| 9 | 50 | 55 |

Calculate the coefficient of correlation between the number of cigarettes smoked and the longevity of a test subject.

**Solution**

Let us first assign random variables to our data in the following way –

x – the number of cigarettes smoked

y – years lived

We’ll be using the single formula for discrete data points here –

\(r_{xy} = \frac{N\sum x_iy_i - \sum x_i \sum y_i}{\sqrt{N\sum x_i^2 - (\sum x_i)^2} \sqrt{N\sum y_i^2 - (\sum y_i)^2}}\)

Let us now construct a table to compute all the values we are going to use in our correlation formula. Note that here N = 9.

| x | x² | y | y² | xy |
| --- | --- | --- | --- | --- |
| 25 | 625 | 63 | 3969 | 1575 |
| 35 | 1225 | 68 | 4624 | 2380 |
| 10 | 100 | 72 | 5184 | 720 |
| 40 | 1600 | 62 | 3844 | 2480 |
| 85 | 7225 | 65 | 4225 | 5525 |
| 75 | 5625 | 46 | 2116 | 3450 |
| 60 | 3600 | 51 | 2601 | 3060 |
| 45 | 2025 | 60 | 3600 | 2700 |
| 50 | 2500 | 55 | 3025 | 2750 |
| Σxi = 425 | Σxi² = 24525 | Σyi = 542 | Σyi² = 33188 | Σxiyi = 24640 |

Also, (Σxi)² = 425² = 180625 and (Σyi)² = 542² = 293764.

Using the values in the formula, we get – $$ r_{xy} = \frac{N\sum x_iy_i - \sum x_i \sum y_i}{\sqrt{N\sum x_i^2 - (\sum x_i)^2} \sqrt{N\sum y_i^2 - (\sum y_i)^2}}$$ $$ = \frac{9 \times 24640 - 425 \times 542}{\sqrt{9 \times 24525 - 180625} \sqrt{9 \times 33188 - 293764}}$$ $$ = \frac{-8590}{\sqrt{40100}\sqrt{4928}} $$ $$ \approx -0.61 $$
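The computation above can be re-checked in a few lines of Python, using the same data and the same single formula for discrete datasets:

```python
x = [25, 35, 10, 40, 85, 75, 60, 45, 50]   # cigarettes smoked per week
y = [63, 68, 72, 62, 65, 46, 51, 60, 55]   # years lived

N = len(x)
Sx, Sy = sum(x), sum(y)                    # 425, 542
Sxy = sum(a * b for a, b in zip(x, y))     # 24640
Sxx = sum(a * a for a in x)                # 24525
Syy = sum(b * b for b in y)                # 33188

r = (N * Sxy - Sx * Sy) / (
    (N * Sxx - Sx ** 2) ** 0.5 * (N * Syy - Sy ** 2) ** 0.5)
print(round(r, 2))  # -0.61
```

The intermediate sums printed in the comments match the totals in the table above, and the final value agrees with the hand computation.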

This implies a negative correlation between the considered variables, i.e. the higher the number of cigarettes smoked per week over the last 5 years, the lower the number of years lived. Note that this DOES NOT mean that smoking cigarettes decreases the life span, since many other factors might be responsible for one's death. Still, it is an important conclusion nevertheless.

In this way, you can solve problems on other datasets similarly.