Correlation and Regression

Karl Pearson’s Coefficient of Correlation

There are many situations in daily life where we know from experience that two variables are associated, but we cannot put a number on that association. For example, the chance that you go out to watch a newly released movie is directly associated with the number of friends who go with you, because the more the merrier!

But there are many other factors too, such as your interest in the movie and your budget. To analyze the situation in detail, you need to note down similar past experiences and form a dataset from them. At this point you need a correlation coefficient: a single value that measures how strongly your going to the movie is associated with your friends turning up. Karl Pearson’s Coefficient of Correlation is one such measure, and we study it in this section.

Introduction to Coefficient of Correlation

Karl Pearson’s product-moment correlation coefficient (or simply, Pearson’s correlation coefficient) is a measure of the strength of the linear association between two variables. It is denoted by r or \(r_{xy}\), where x and y are the two variables involved.

Conceptually, this method draws a line of best fit through the data of the two variables, and the value of the Pearson correlation coefficient, r, indicates how closely the data points cluster around that line: the nearer |r| is to 1, the closer the points lie to the line.

Karl Pearson Correlation Coefficient Formula

The coefficient of correlation \(r_{xy}\) between two variables x and y, for the bivariate dataset \((x_i, y_i)\) where i = 1, 2, 3, …, N, is given by –

$$r_{(x,y)} = \frac{\text{cov}(x,y)}{\sigma_x \sigma_y}$$

where,

⇒ cov(x,y): the covariance between x and y

\(\text{cov}(x,y) = \frac{\Sigma_{i = 1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{N} = \frac{\Sigma x_iy_i}{N} - \bar{x}\bar{y}\)

Here, \(\bar{x}\) and \(\bar{y}\) are simply the respective means of the distributions of x and y.

⇒ σx and σy are the standard deviations of the distributions of x and y.

\(\sigma_x = \sqrt{\frac{\Sigma (x_i - \bar{x})^2}{N}} = \sqrt{\frac{\Sigma x_i^2}{N} - \bar{x}^2}\)

\(\sigma_y = \sqrt{\frac{\Sigma (y_i - \bar{y})^2}{N}} = \sqrt{\frac{\Sigma y_i^2}{N} - \bar{y}^2}\)
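The definition above can be turned into a few lines of Python. This is only a minimal sketch: the function name `pearson_r` and the sample values are illustrative assumptions, not part of the original text, and every sum is divided by N exactly as in the formulas above.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as cov(x, y) / (sigma_x * sigma_y), dividing by N throughout."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: mean of the products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Standard deviations (population form, divisor N)
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / (sigma_x * sigma_y)

# Made-up values for illustration: y = 2x + 1 is perfectly linear,
# so r should come out as 1 (up to floating-point rounding).
r_demo = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
print(r_demo)
```

Because the same divisor N appears in the covariance and in both standard deviations, it cancels in the ratio, so using N - 1 everywhere instead would give the same r.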

Alternate Formula

If the data is given as a bivariate frequency distribution grouped into classes, you may use the following formulae –

⇒ cov(x,y): the covariance between x and y

\(\text{cov}(x,y) = \frac{\Sigma_{i,j} x_iy_jf_{ij}}{N} - \bar{x}\bar{y}\)

Here, \(\bar{x} = \frac{\Sigma_i f_{io} x_i}{N}\) and \(\bar{y} = \frac{\Sigma_j f_{oj} y_j}{N}\) are simply the respective means of the distributions of x and y.

⇒ σx and σy are the standard deviations of the distributions of x and y.

\(\sigma_x =\sqrt{\frac{\Sigma_i f_{io} x_i^2}{N} - \bar{x}^2}\)

\(\sigma_y =\sqrt{\frac{\Sigma_j f_{oj} y_j^2}{N} - \bar{y}^2}\)

where,

\(x_i\): The central value of the i’th class of x

\(y_j\): The central value of the j’th class of y

\(f_{io}, f_{oj}\): Marginal frequencies of x and y, i.e. \(f_{io} = \Sigma_j f_{ij}\) and \(f_{oj} = \Sigma_i f_{ij}\)

\(f_{ij}\): Frequency of the (i,j)th cell

In any case, the following equality must always hold:

Total frequency = N = \(\Sigma_{i,j} f_{ij}\) = \(\Sigma_i f_{io}\) = \(\Sigma_j f_{oj}\)
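As a rough illustration of the grouped-data formulae, the sketch below computes r from a two-way frequency table. The function name `pearson_r_grouped`, the class mid-values, and the 2×2 table are all hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
import math

def pearson_r_grouped(x_mid, y_mid, freq):
    """r from class mid-values x_mid[i], y_mid[j] and cell frequencies freq[i][j]."""
    n = sum(sum(row) for row in freq)            # total frequency N
    f_x = [sum(row) for row in freq]             # marginal frequencies of x (row totals)
    f_y = [sum(col) for col in zip(*freq)]       # marginal frequencies of y (column totals)
    mean_x = sum(f * x for f, x in zip(f_x, x_mid)) / n
    mean_y = sum(f * y for f, y in zip(f_y, y_mid)) / n
    # Covariance: sum of x_i * y_j * f_ij over all cells, divided by N, minus product of means
    sum_xyf = sum(freq[i][j] * x_mid[i] * y_mid[j]
                  for i in range(len(x_mid))
                  for j in range(len(y_mid)))
    cov = sum_xyf / n - mean_x * mean_y
    var_x = sum(f * x * x for f, x in zip(f_x, x_mid)) / n - mean_x ** 2
    var_y = sum(f * y * y for f, y in zip(f_y, y_mid)) / n - mean_y ** 2
    return cov / math.sqrt(var_x * var_y)

# Hypothetical table: rows are x classes, columns are y classes
x_mid = [10, 20]
y_mid = [5, 15]
freq = [[3, 1],
        [1, 3]]
r_grouped = pearson_r_grouped(x_mid, y_mid, freq)
print(r_grouped)  # 0.5 for this table
```

For this table N = 8, both means are easy to check (15 and 10), the covariance is 12.5 and both variances are 25, which gives r = 12.5 / 25 = 0.5.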

A Single Formula for Discrete Datasets –

$$r_{xy} = \frac{N\Sigma x_iy_i - \Sigma x_i \Sigma y_i}{\sqrt{N\Sigma x_i^2 - (\Sigma x_i)^2} \sqrt{N\Sigma y_i^2 - (\Sigma y_i)^2}}$$
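To see where the single formula comes from, it may help to write the means in the definition as sums, \(\bar{x} = \frac{\Sigma x_i}{N}\) and \(\bar{y} = \frac{\Sigma y_i}{N}\), and simplify. A short sketch of that step:

$$\text{cov}(x,y) = \frac{\Sigma x_iy_i}{N} - \frac{\Sigma x_i}{N} \cdot \frac{\Sigma y_i}{N} = \frac{N\Sigma x_iy_i - \Sigma x_i \Sigma y_i}{N^2}$$

$$\sigma_x = \sqrt{\frac{\Sigma x_i^2}{N} - \frac{(\Sigma x_i)^2}{N^2}} = \frac{\sqrt{N\Sigma x_i^2 - (\Sigma x_i)^2}}{N}, \qquad \sigma_y = \frac{\sqrt{N\Sigma y_i^2 - (\Sigma y_i)^2}}{N}$$

Dividing the covariance by \(\sigma_x \sigma_y\), the factor \(N^2\) cancels between numerator and denominator, which leaves exactly the single formula.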

Properties of the Pearson’s Correlation Coefficient

⇒ r is unit-less. Thus, we may use it to compare the association in entirely different bivariate distributions. For example, you may compare how strongly your not going to a movie is related to your friends not joining you with how strongly it is related to your own lack of interest in the movie, using the Pearson’s correlation coefficients obtained for the two cases. In economics, where cost prices or market shares depend on many different factors, this makes r very useful for gauging the connection between various quantities.

⇒ The value of r always lies between -1 and +1. Depending on its exact value, we see the following degrees of association between the variables –

r value variation:

| Strength of association | Negative r | Positive r |
| --- | --- | --- |
| Weak | -0.1 to -0.3 | 0.1 to 0.3 |
| Average | -0.3 to -0.5 | 0.3 to 0.5 |
| Strong | -0.5 to -1.0 | 0.5 to 1.0 |

A value greater than 0 indicates a positive association i.e. as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association i.e. as the value of one variable increases, the value of the other variable decreases.
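The ranges and sign rule above can be expressed as a small lookup. This is a sketch under two assumptions that the text does not settle: a boundary value such as 0.3 is placed in the stronger band, and magnitudes below 0.1 are labelled "negligible". The function name `describe_r` is likewise hypothetical.

```python
def describe_r(r):
    """Return (strength, direction) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    magnitude = abs(r)
    if magnitude >= 0.5:
        strength = "strong"
    elif magnitude >= 0.3:       # assumption: 0.3 counts as average, not weak
        strength = "average"
    elif magnitude >= 0.1:
        strength = "weak"
    else:
        strength = "negligible"  # assumption: no label is given below 0.1
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction

print(describe_r(-0.61))  # ('strong', 'negative')
print(describe_r(0.2))    # ('weak', 'positive')
```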

⇒ The Pearson product-moment correlation does not take into consideration whether a variable has been classified as a dependent or independent variable. It treats all variables equally.

⇒ A change of origin of the system, or any scaling of the variables, doesn’t affect the magnitude of r. Only the sign may change, depending on the signs of the scale factors used.

Basically, if the bivariate system (x, y) is converted to another bivariate system (u, v) by a change of origin or scaling or both, in the following way (with b and d non-zero) –

$$u = \frac{x - a}{b}, \quad v = \frac{y - c}{d}$$

Then the correlation coefficient takes on the following value –

$$r_{(u,v)} = \frac{bd}{|b||d|} \ r_{(x,y)}$$

That is, \(r_{(u,v)} = r_{(x,y)}\) when b and d have the same sign, and \(r_{(u,v)} = -r_{(x,y)}\) when their signs differ.
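This property is easy to check numerically. The sketch below uses the cigarette dataset from the solved example in this article; the helper names `pearson_r` and `shift_and_scale`, and the particular constants a, b, c, d, are arbitrary choices made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def shift_and_scale(values, origin, scale):
    """Apply u = (x - origin) / scale to every value."""
    return [(v - origin) / scale for v in values]

x = [25, 35, 10, 40, 85, 75, 60, 45, 50]   # cigarettes smoked per week
y = [63, 68, 72, 62, 65, 46, 51, 60, 55]   # years lived

r_xy = pearson_r(x, y)
# b = 5 and d = 2 have the same sign, so r should be unchanged
r_same = pearson_r(shift_and_scale(x, 10, 5), shift_and_scale(y, 50, 2))
# b = 5 and d = -2 have opposite signs, so only the sign of r should flip
r_flipped = pearson_r(shift_and_scale(x, 10, 5), shift_and_scale(y, 50, -2))

print(r_xy, r_same, r_flipped)
```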

Assumptions

While calculating the Pearson’s Correlation Coefficient, we make the following assumptions –

  • There is a linear relationship (or any linear component of the relationship) between the two variables
  • We keep Outliers either to a minimum or remove them entirely

An outlier is a data point that does not fit the general trend of your data: it appears as an extreme value, unlike what you would expect from the rest of your data points. You can detect outliers by plotting the two variables against each other on a graph and visually inspecting the graph for extreme points.

You can then either remove or adjust that particular point, as long as you can justify why you did so. Outliers can have a very large effect on the line of best fit and on the Pearson correlation coefficient, which can lead to very different conclusions regarding your data. Both of the above assumptions can be checked easily for a given pair of variables by studying their scatter plot.
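How much a single outlier can matter is easiest to see on a small example. The numbers below are made up purely for illustration (they do not come from any study), and `pearson_r` is a hypothetical helper implementing the single formula for discrete datasets.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r via the single formula for discrete datasets."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / (
        math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    )

# Made-up data with a clear upward trend
x_clean = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [2, 4, 5, 4, 5, 7, 8, 9]

# The same data plus one extreme point that goes against the trend
x_outlier = x_clean + [9]
y_outlier = y_clean + [0]

r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_outlier, y_outlier)
print(r_clean, r_outlier)
```

With these values, one added point is enough to pull r from the strong range down into the weak range, which is why inspecting the scatter plot before trusting r is worthwhile.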

Solved Example on Coefficient of Correlation

Question: An experiment conducted on 9 different cigarette smoking subjects resulted in the following data –

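| Subject number | Cigarettes smoked per week (averaged over the last 5 years of their life) | Number of years lived |
| --- | --- | --- |
| 1 | 25 | 63 |
| 2 | 35 | 68 |
| 3 | 10 | 72 |
| 4 | 40 | 62 |
| 5 | 85 | 65 |
| 6 | 75 | 46 |
| 7 | 60 | 51 |
| 8 | 45 | 60 |
| 9 | 50 | 55 |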

Calculate the coefficient of correlation between the number of cigarettes smoked and the longevity of a test subject.

Solution

Let us first assign random variables to our data in the following way –

x – the number of cigarettes smoked

y – years lived

We’ll be using the single formula for discrete data points here –

\(r_{xy} = \frac{N\Sigma x_iy_i - \Sigma x_i \Sigma y_i}{\sqrt{N\Sigma x_i^2 - (\Sigma x_i)^2} \sqrt{N\Sigma y_i^2 - (\Sigma y_i)^2}}\)

Let us now construct a table to compute all the values we are going to use in our correlation formula. Note that N here = 9

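| \(x\) | \(x^2\) | \(y\) | \(y^2\) | \(xy\) |
| --- | --- | --- | --- | --- |
| 25 | 625 | 63 | 3969 | 1575 |
| 35 | 1225 | 68 | 4624 | 2380 |
| 10 | 100 | 72 | 5184 | 720 |
| 40 | 1600 | 62 | 3844 | 2480 |
| 85 | 7225 | 65 | 4225 | 5525 |
| 75 | 5625 | 46 | 2116 | 3450 |
| 60 | 3600 | 51 | 2601 | 3060 |
| 45 | 2025 | 60 | 3600 | 2700 |
| 50 | 2500 | 55 | 3025 | 2750 |
| \(\Sigma x_i = 425\) | \(\Sigma x_i^2 = 24525\) | \(\Sigma y_i = 542\) | \(\Sigma y_i^2 = 33188\) | \(\Sigma x_iy_i = 24640\) |
| \((\Sigma x_i)^2 = 425^2 = 180625\) | | \((\Sigma y_i)^2 = 542^2 = 293764\) | | |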

Using the values in the formula, we get –

$$ r_{xy} = \frac{N\Sigma x_iy_i - \Sigma x_i \Sigma y_i}{\sqrt{N\Sigma x_i^2 - (\Sigma x_i)^2} \sqrt{N\Sigma y_i^2 - (\Sigma y_i)^2}}$$

$$ = \frac{9 \times 24640 - 425 \times 542}{\sqrt{9 \times 24525 - 180625} \sqrt{9 \times 33188 - 293764}}$$

$$ = \frac{-8590}{\sqrt{40100}\sqrt{4928}} $$

$$ = -0.61 $$

This implies a negative correlation between the considered variables, i.e. the higher the number of cigarettes smoked per week over the last 5 years, the fewer the number of years lived. Note that this DOES NOT by itself prove that smoking cigarettes shortens the life span, because correlation does not establish causation and many other factors might be responsible for one’s death. Still, it is an important observation.

This way you can solve for other datasets similarly.
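The arithmetic of the solved example can be double-checked with a short script. The variable names are arbitrary; the data are the nine (x, y) pairs from the question, and the steps follow the single formula for discrete datasets.

```python
import math

x = [25, 35, 10, 40, 85, 75, 60, 45, 50]   # cigarettes smoked per week
y = [63, 68, 72, 62, 65, 46, 51, 60, 55]   # years lived
n = len(x)                                  # N = 9

# Column totals of the working table
sum_x = sum(x)                              # 425
sum_y = sum(y)                              # 542
sum_x2 = sum(v * v for v in x)              # 24525
sum_y2 = sum(v * v for v in y)              # 33188
sum_xy = sum(a * b for a, b in zip(x, y))   # 24640

numerator = n * sum_xy - sum_x * sum_y      # -8590
den_x = n * sum_x2 - sum_x ** 2             # 40100
den_y = n * sum_y2 - sum_y ** 2             # 4928

r = numerator / (math.sqrt(den_x) * math.sqrt(den_y))
print(round(r, 2))  # -0.61
```

To solve another dataset, only the two lists `x` and `y` need to change.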

 
