Karl Pearson's Correlation Coefficient: Formula, Properties, Example

There are many situations in daily life where we know from experience that certain variables are directly associated, but we cannot put a precise measure on that association. For example, the chance that you go out to watch a newly released movie is directly associated with the number of friends who go with you, because the more the merrier!

Introduction to Coefficient of Correlation

Karl Pearson’s product-moment correlation coefficient (or simply, Pearson’s correlation coefficient) is a measure of the strength of the linear association between two variables. It is denoted by r or rxy (x and y being the two variables involved).

This method of correlation attempts to draw a line of best fit through the data of the two variables, and the value of the Pearson correlation coefficient, r, indicates how closely the data points cluster around this line of best fit.


Karl Pearson Correlation Coefficient Formula

The coefficient of correlation rxy between two variables x and y, for the bivariate dataset (xi, yi) where i = 1, 2, 3, …, N, is given by –

$$r_{(x,y)} = \frac{\text{cov}(x,y)}{\sigma_x \sigma_y}$$

where,

⇒ cov(x,y): the covariance between x and y

– $\Sigma_{i = 1}^{N}\frac{(x_i - \bar{x})(y_i - \bar{y})}{N} = \frac{\Sigma x_iy_i}{N} - \bar{x}\bar{y}$

Here, $\bar{x}$ and $\bar{y}$ are simply the respective means of the distributions of x and y.

⇒ σx and σy are the standard deviations of the distributions x and y.

– $\sigma_x = \sqrt{\Sigma \frac{(x_i - \bar{x})^2}{N}} = \sqrt{\frac{\Sigma x_i^2}{N} - \bar{x}^2}$

– $\sigma_y = \sqrt{\Sigma \frac{(y_i - \bar{y})^2}{N}} = \sqrt{\frac{\Sigma y_i^2}{N} - \bar{y}^2}$
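The definition above can be sketched directly in code. This is a minimal illustration rather than a reference implementation: the function name `pearson_r` and the sample data are assumptions made for the example, and the moments are the population (divide-by-N) versions used in the formulas above.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as cov(x, y) / (sigma_x * sigma_y), with divide-by-N moments."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: mean of the products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Standard deviations of x and y
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # Note: raises ZeroDivisionError if either variable is constant
    return cov / (sigma_x * sigma_y)

# Hypothetical data lying exactly on the line y = 2x + 1
r_perfect = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
print(r_perfect)  # very close to 1 (up to floating-point rounding)
```

Because every point lies on a straight line with positive slope, the result is 1 up to rounding; reversing the order of one list would give a value close to -1.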

Alternate Formula

If the data are given in the form of a grouped (class-wise) bivariate frequency distribution, you may use the following formulae –

⇒ cov(x,y): the covariance between x and y

– $\frac{\Sigma_{i,j} x_i y_j f_{ij}}{N} - \bar{x}\bar{y}$

Here, $\bar{x}$ and $\bar{y}$ are simply the respective means of the distributions of x and y.

⇒ σx and σy are the standard deviations of the distributions x and y.

– $\sigma_x =\sqrt{\frac{\Sigma_i f_{io} x_i^2}{N} - \bar{x}^2}$

– $\sigma_y =\sqrt{\frac{\Sigma_j f_{oj} y_j^2}{N} - \bar{y}^2}$

where,

xi: The central value of the i’th class of x

yj: The central value of the j’th class of y

fio, foj: Marginal frequencies of x and y, i.e. $f_{io} = \Sigma_j f_{ij}$ and $f_{oj} = \Sigma_i f_{ij}$

fij: Frequency of the (i,j)th cell

In any case, the following equality must always hold:

Total frequency = N = $\Sigma_{i,j} f_{ij}$ = $\Sigma_i f_{io}$ = $\Sigma_j f_{oj}$
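For grouped data, the same calculation can be sketched from the class midpoints and the table of cell frequencies. The function name `grouped_pearson_r` and the small frequency table below are hypothetical, chosen only to illustrate the formulae; the row and column totals play the role of the marginal frequencies of x and y.

```python
import math

def grouped_pearson_r(x_mid, y_mid, freq):
    """r from class midpoints x_mid[i], y_mid[j] and cell frequencies freq[i][j]."""
    n = sum(sum(row) for row in freq)            # total frequency N
    f_x = [sum(row) for row in freq]             # marginal frequencies of x (row totals)
    f_y = [sum(col) for col in zip(*freq)]       # marginal frequencies of y (column totals)
    assert n == sum(f_x) == sum(f_y)             # the equality stated above

    mean_x = sum(f * x for f, x in zip(f_x, x_mid)) / n
    mean_y = sum(f * y for f, y in zip(f_y, y_mid)) / n

    # cov(x, y) = sum_ij x_i * y_j * f_ij / N - mean_x * mean_y
    cov = sum(
        x_mid[i] * y_mid[j] * freq[i][j]
        for i in range(len(x_mid))
        for j in range(len(y_mid))
    ) / n - mean_x * mean_y

    sigma_x = math.sqrt(sum(f * x * x for f, x in zip(f_x, x_mid)) / n - mean_x ** 2)
    sigma_y = math.sqrt(sum(f * y * y for f, y in zip(f_y, y_mid)) / n - mean_y ** 2)
    return cov / (sigma_x * sigma_y)

# Hypothetical table: 3 classes of x (rows) by 2 classes of y (columns)
x_mid = [5, 15, 25]
y_mid = [10, 20]
freq = [
    [4, 1],
    [2, 2],
    [1, 4],
]
r_grouped = grouped_pearson_r(x_mid, y_mid, freq)
print(round(r_grouped, 3))
```

Larger x classes in this table tend to fall in the larger y class, so the coefficient comes out positive but well below 1.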

A Single Formula for Discrete Datasets –

$$r_{xy} = \frac{N\Sigma x_iy_i - \Sigma x_i \Sigma y_i}{\sqrt{N\Sigma x_i^2 - (\Sigma x_i)^2} \sqrt{N\Sigma y_i^2 - (\Sigma y_i)^2}}$$

Properties of the Pearson’s Correlation Coefficient

⇒ r is unit-less. Thus, we may use it to compare the association in entirely different bivariate distributions as well. For example, you may compare how strongly your not going to a movie is related to your friends not joining you with how strongly it is related to your own lack of interest in the movie, using the Pearson’s correlation coefficients obtained for the two cases. In economics, where the cost price or the market share depends on many different factors, this parameter is therefore very useful in ascertaining the connection between various quantities.

⇒ The value of r always lies between -1 and +1, both inclusive. Depending on its exact value, we see the following degrees of association between the variables –

r value variation:

Strength of association	Negative r	Positive r
weak	-0.1 to -0.3	0.1 to 0.3
average	-0.3 to -0.5	0.3 to 0.5
strong	-0.5 to -1.0	0.5 to 1.0

A value greater than 0 indicates a positive association i.e. as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association i.e. as the value of one variable increases, the value of the other variable decreases.

⇒ The Pearson product-moment correlation does not take into consideration whether a variable has been classified as a dependent or independent variable. It treats all variables equally.

⇒ A change of origin of the system, or any scaling of the variables, does not affect the magnitude of r. Only the sign may change, depending on the signs of the scale factors used.

Basically, if the bivariate system (x, y) is converted to another bivariate system (u, v) by a change of origin or scaling or both, in the following way –

$$u = \frac{x - a}{b}, \quad v = \frac{y - c}{d}$$

Then the correlation coefficient takes on the following value – $$r_{(u,v)} = \frac{bd}{|b||d|} \ r_{(x,y)}$$
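This property is easy to check numerically. The sketch below uses made-up data and made-up constants a, b, c, d (with d negative so that the sign flips); the helper `pearson_r` is an illustrative name, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data and transformation constants
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 2.0, 6.0, 8.0]
a, b = 3.0, 2.0      # u = (x - a) / b
c, d = 1.0, -4.0     # v = (y - c) / d, negative scale factor

u = [(xi - a) / b for xi in x]
v = [(yi - c) / d for yi in y]

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)
sign = (b * d) / (abs(b) * abs(d))   # +1 if b and d share a sign, else -1

print(r_xy, r_uv, sign * r_xy)       # r_uv agrees with sign * r_xy up to rounding
```

Since b and d have opposite signs here, r(u, v) has the same magnitude as r(x, y) but the opposite sign.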

Assumptions

While calculating the Pearson’s Correlation Coefficient, we make the following assumptions –

There is a linear relationship (or any linear component of the relationship) between the two variables
Outliers are either kept to a minimum or removed entirely

An outlier is a data point that does not fit the general trend of your data: it appears as an extreme value that you would not expect given the rest of your data points. You can detect outliers by plotting the two variables against each other on a graph and visually inspecting the graph for extreme points.

You can then either remove or adjust that particular point, as long as you can justify why you did so. Outliers can have a very large effect on the line of best fit and on the Pearson correlation coefficient, which can lead to very different conclusions regarding your data. Both of the above assumptions can be checked easily for a given pair of variables by studying their scatter plot.
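The effect of a single outlier can be seen in a small experiment. The data below are invented for illustration: eight points lying close to the line y = 2x, followed by one extreme point. The helper `pearson_r` is again just an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data close to y = 2x
x_clean = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]

# Same data with one extreme point (9, 0) appended
x_out = x_clean + [9]
y_out = y_clean + [0.0]

r_clean = pearson_r(x_clean, y_clean)
r_with_outlier = pearson_r(x_out, y_out)

print(round(r_clean, 3), round(r_with_outlier, 3))
```

With these numbers the coefficient falls from nearly 1 to roughly 0.4 once the single extreme point is included, which is why a scatter plot should be inspected before r is interpreted.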

Solved Example on Coefficient of Correlation

Question: An experiment conducted on 9 different cigarette smoking subjects resulted in the following data –

Subject Number	Cigarettes smoked per week (averaged over the last 5 years of their life)	Number of years lived
1	25	63
2	35	68
3	10	72
4	40	62
5	85	65
6	75	46
7	60	51
8	45	60
9	50	55

Calculate the coefficient of correlation between the number of cigarettes smoked and the longevity of a test subject.

Solution

Let us first assign random variables to our data in the following way –

x – the number of cigarettes smoked

y – years lived

We’ll be using the single formula for discrete data points here –

$r_{xy} = \frac{N\Sigma x_iy_i - \Sigma x_i \Sigma y_i}{\sqrt{N\Sigma x_i^2 - (\Sigma x_i)^2} \sqrt{N\Sigma y_i^2 - (\Sigma y_i)^2}}$

Let us now construct a table to compute all the values we are going to use in our correlation formula. Note that N here = 9

x	x²	y	y²	xy
25	625	63	3969	1575
35	1225	68	4624	2380
10	100	72	5184	720
40	1600	62	3844	2480
85	7225	65	4225	5525
75	5625	46	2116	3450
60	3600	51	2601	3060
45	2025	60	3600	2700
50	2500	55	3025	2750
Σxi = 425	Σxi² = 24525	Σyi = 542	Σyi² = 33188	Σxiyi = 24640
(Σxi)² = 425² = 180625		(Σyi)² = 542² = 293764

Using the values in the formula, we get –

$$ r_{xy} = \frac{N\Sigma x_iy_i - \Sigma x_i \Sigma y_i}{\sqrt{N\Sigma x_i^2 - (\Sigma x_i)^2} \sqrt{N\Sigma y_i^2 - (\Sigma y_i)^2}}$$

$$ = \frac{9 \times 24640 - 425 \times 542}{\sqrt{9 \times 24525 - 180625} \sqrt{9 \times 33188 - 293764}}$$

$$ = \frac{-8590}{\sqrt{40100}\sqrt{4928}} $$

$$ \approx -0.61 $$

This implies a negative correlation between the considered variables, i.e. the higher the number of cigarettes smoked per week in the last 5 years, the fewer the number of years lived. Note that this DOES NOT by itself mean that smoking cigarettes decreases the life span: correlation does not establish causation, and many other factors might be responsible for a subject’s death. Still, it is an important observation.
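The arithmetic of this example can be double-checked with a short script. It is only a sketch that applies the single formula for discrete data to the nine (x, y) pairs from the question; the variable names are chosen for this illustration.

```python
import math

# Data from the question: cigarettes smoked per week (x) and years lived (y)
cigarettes = [25, 35, 10, 40, 85, 75, 60, 45, 50]
years = [63, 68, 72, 62, 65, 46, 51, 60, 55]

n = len(cigarettes)                                     # 9
sum_x = sum(cigarettes)                                 # 425
sum_y = sum(years)                                      # 542
sum_x2 = sum(x * x for x in cigarettes)                 # 24525
sum_y2 = sum(y * y for y in years)                      # 33188
sum_xy = sum(x * y for x, y in zip(cigarettes, years))  # 24640

numerator = n * sum_xy - sum_x * sum_y                  # -8590
denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)

r = numerator / denominator
print(round(r, 2))  # -0.61
```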

Other datasets can be solved in the same way.
