
Probability and Statistics Unit 8

Unit 8 Correlation and Regression


Structure:
8.1 Introduction
Objectives
8.2 Correlation
Types of correlation
Methods of measurement of correlation
8.3 Partial Correlation
8.4 Multiple Correlation
8.5 Regression
Regression Analysis
Regression Lines
Regression coefficient
Angle between two regression lines
8.6 Multiple Regression
8.7 Summary
8.8 Terminal Questions
8.9 Answers

8.1 Introduction
In the previous units we studied problems involving a single variable. Many
problems, however, involve two or more variables. If two quantities vary in
such a way that movements in one variable affect the movements of the other,
then we say that the two variables are correlated. Examples are height and
weight, rainfall and yield, price and demand, income and expenditure, and
production and employment. Regression measures the average relationship
between two or more closely related variables.
In this unit, we discuss correlation and regression, the techniques used for
investigating the relationship between two or more variables.
Objectives:
At the end of this unit the student should be able to:
calculate the coefficient for partial and multiple correlation

Sikkim Manipal University Page No.: 225


distinguish between parametric and non-parametric measures of correlation
apply the method of estimating unknown values from known values
through regression equations.

8.2 Correlation
Correlation is a statistical tool used to study the relationship between two or
more variables. Two variables are said to be correlated if a change in one
variable is accompanied by a change in the other. On the other hand, if a change
in one variable does not bring about any change in the other, then we say that
the two variables are not correlated.
According to Simpson and Kafka, "Correlation analysis deals with the
association between two or more variables."
In the words of A.M. Tuttle, "Correlation is an analysis of the covariation
between two or more variables."
8.2.1 Types of correlation
Correlation can be classified in four ways:
1. Simple, Partial and Multiple correlation
2. Positive and negative correlation
3. Perfect and Imperfect correlation
4. Linear and non-linear correlation.
1. Simple, Partial and Multiple correlation:
Simple correlation is the relationship between two variables. Partial
correlation is the study of the relationship between two out of three or more
variables, with the effect of the other variables held constant. For example,
suppose that we have three variables X1 = marks in Maths, X2 = marks in Science
and X3 = marks in English. If we study the relationship between X1 and X2 while
eliminating the effect of the other variable, X3, then it is partial correlation.
Multiple correlation is the study of the simultaneous relationship between one
variable and a group of other variables. For example, if we study X1, X2, X3
simultaneously, then the correlation between X1 and (X2, X3) is multiple
correlation. Multiple correlation is not commonly used.
2. Positive and Negative correlation: Two variables are said to be
positively correlated when both variables move in the same direction, that is,
when an increase in one variable is accompanied by an increase in the other and
a decrease in one is accompanied by a decrease in the other. Variables are said
to be negatively correlated if an increase in one variable is accompanied by a
decrease in the other and vice versa, that is, the variables move in opposite
directions. For positive correlation the graph slopes upward, whereas for
negative correlation the graph slopes downward.
3. Perfect and Imperfect correlation
When both variables change at a constant ratio, whatever the direction of
change, the correlation is called perfect. When the variables change at
differing ratios, the correlation is called imperfect. The coefficient of a
perfect correlation is +1 or -1, and the coefficient of an imperfect correlation
lies strictly between -1 and +1.
4. Linear and Non-linear correlation
Correlation is linear when the graph of the correlated data follows a straight
line, that is, when the ratio of change between the two variables is constant.
Linear correlation is positive or negative according to whether the line slopes
upward or downward. Non-linear or curvilinear correlation, on the other hand,
occurs when the graph of the variables follows a curve. Like linear
correlation, non-linear correlation can be positive or negative depending on
whether the curve rises or falls.

[Figure: four panels showing positive linear correlation, negative linear correlation, positive non-linear correlation and negative non-linear correlation]


8.2.2 Methods of measurement of correlation


The following are the three important methods of measuring the correlation
between variables:
1. Scatter Diagram method
2. Karl Pearson's Coefficient method
3. Spearman's Rank Coefficient method
Let us now discuss these methods one by one in detail.
1. Scatter Diagram method: It is the simplest method of studying the
correlation between two variables. In this method the values of one
variable are represented on the X axis and the values of the other on the
Y axis. A dot is then plotted for each pair of values, and the pattern of
dots indicates the direction and strength of the relationship.
The scatter of points on the graph gives an idea of whether the variables
are related or not. The more scattered the dots, the weaker the relation
between the two variables. The closer the dots lie to a straight line, the
stronger the association between the variables.
If all the points lie on a straight line rising from the lower left-hand
corner to the upper right-hand corner, the correlation is said to be
perfectly positive. If all the points lie on a straight line falling from the
upper left-hand corner to the lower right-hand corner of the diagram, the
correlation is said to be perfectly negative. If the points fall in a narrow
band, there is a high degree of correlation between the variables.
Correlation is positive if the points show a rising tendency from the lower
left-hand corner to the upper right-hand corner, and negative if the points
show a declining tendency from the upper left-hand corner to the lower
right-hand corner.
[Figure: scatter diagrams showing (I) perfectly positive correlation, (II) perfectly negative correlation, high degree of positive correlation, high degree of negative correlation, low degree of positive correlation, low degree of negative correlation, and no correlation]

2. Karl Pearson's Coefficient of Correlation:

Karl Pearson's coefficient of correlation is also known as the Pearsonian
coefficient of correlation or the product-moment correlation coefficient.
Let X and Y be two random variables. The coefficient of correlation between
X and Y is denoted by r(X,Y), or simply by $r_{XY}$, and is defined as

\[ r(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \qquad (8.1) \]

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, and

\[ \mathrm{Cov}(X,Y) = E[\{X-E(X)\}\{Y-E(Y)\}] = E(XY) - E(X)E(Y) \qquad (8.2) \]

Other equivalent forms of the coefficient of correlation follow from this
definition. For n pairs of observations we have

\[ \mathrm{Cov}(X,Y) = \frac{1}{n}\sum (x-\bar{x})(y-\bar{y}) = \frac{1}{n}\sum xy - \bar{x}\,\bar{y} \]

so that

\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} \qquad (8.3) \]

Note:
1. The value of the correlation coefficient cannot exceed unity numerically. It
always lies between -1 and +1, that is, $-1 \le r \le 1$. If r = +1, the
correlation is perfect and positive, and if r = -1, the correlation is perfect
and negative.
2. It is not affected by a change of origin or a change of scale.
3. It is a relative measure. It does not have any unit attached to it.
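
The sum form of the coefficient can be checked numerically. The following is a minimal sketch in Python, not part of the original unit; the function name `pearson_r` and the small data set are illustrative assumptions. It also illustrates Note 2: rescaling and shifting x leaves r unchanged.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from raw sums (illustrative helper)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# Hypothetical data lying exactly on a falling straight line.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 6, 4, 2]
r_line = pearson_r(xs, ys)

# Change of origin and scale applied to x only; r should not change.
r_scaled = pearson_r([15 * x + 7 for x in xs], ys)
```

Both calls describe a perfect negative linear relation, so both values of r should equal -1 up to rounding.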
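Example: Calculate the coefficient of correlation by Karl Pearson's method
based on the following values
T1    75   60   45   30   15
T2   150  175  200  225  250

Solution: Since r is not affected by a change of scale, we work with
X = T1/15 and Y = T2/25.

T1     X = T1/15   X^2    T2     Y = T2/25   Y^2    XY
75     5           25     150    6           36     30
60     4           16     175    7           49     28
45     3           9      200    8           64     24
30     2           4      225    9           81     18
15     1           1      250    10          100    10
Total  15          55            40          330    110

\[ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{5(110) - (15)(40)}{\sqrt{\left[5(55) - 15^2\right]\left[5(330) - 40^2\right]}} = \frac{-50}{\sqrt{50 \times 50}} = -1 \]

Hence T1 and T2 are perfectly negatively correlated.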

Example: A computer, while calculating the correlation coefficient between two
variables X and Y from 25 pairs of observations, obtained the following
results: n = 25, .
It was later noticed that two pairs had been copied down wrongly as
X    6    8
Y   14    6
while the correct values were
X    8    6
Y   12    8

Obtain the correct value of the correlation coefficient.

Solution: Correct
Corrected


Corrected
Corrected
Corrected

Cov(X,Y) =

So, corrected r(X,Y) = Cov(X,Y) / =
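
The correction procedure used in this kind of problem is to subtract the contribution of each wrongly copied pair from every sum and add the contribution of the correct pair, then recompute r from the corrected sums. The sketch below illustrates this in Python. The helper names and the five-pair data set are hypothetical and are not the summary values of the example above; only the two wrong and two correct pairs are taken from it.

```python
from math import sqrt

def sums_of(pairs):
    """Return (sum x, sum y, sum x^2, sum y^2, sum xy) for a list of pairs."""
    return (sum(x for x, _ in pairs), sum(y for _, y in pairs),
            sum(x * x for x, _ in pairs), sum(y * y for _, y in pairs),
            sum(x * y for x, y in pairs))

def correct_sums(totals, wrong_pairs, right_pairs):
    """Remove the wrong pairs from each sum and add the right ones."""
    sx, sy, sxx, syy, sxy = totals
    for sign, pairs in ((-1, wrong_pairs), (1, right_pairs)):
        for x, y in pairs:
            sx += sign * x
            sy += sign * y
            sxx += sign * x * x
            syy += sign * y * y
            sxy += sign * x * y
    return sx, sy, sxx, syy, sxy

def r_from_sums(n, sx, sy, sxx, syy, sxy):
    cov = sxy / n - (sx / n) * (sy / n)
    var_x = sxx / n - (sx / n) ** 2
    var_y = syy / n - (sy / n) ** 2
    return cov / sqrt(var_x * var_y)

# Hypothetical data: the first two pairs were miscopied.
true_pairs = [(8, 12), (6, 8), (5, 4), (3, 2), (9, 10)]
copied_pairs = [(6, 14), (8, 6)] + true_pairs[2:]

corrected = correct_sums(sums_of(copied_pairs), copied_pairs[:2], true_pairs[:2])
r_corrected = r_from_sums(len(true_pairs), *corrected)
```

The corrected sums agree with the sums computed directly from the true data, which is the check that the adjustment has been applied to every sum.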

Example: Calculate the correlation coefficient for the following

Firm      1   2   3   4   5   6   7   8   9   10
Sales     50  50  55  60  65  65  65  60  60  50
Expense   11  13  14  16  16  15  15  14  13  13

Solution: Here $\bar{X} = 580/10 = 58$ and $\bar{Y} = 140/10 = 14$. Let
$x = X - \bar{X}$ and $y = Y - \bar{Y}$.

Firm   Sales X   x     x^2   Expenses Y   y     y^2   xy
1      50        -8    64    11           -3    9     24
2      50        -8    64    13           -1    1     8
3      55        -3    9     14           0     0     0
4      60        2     4     16           2     4     4
5      65        7     49    16           2     4     14
6      65        7     49    15           1     1     7
7      65        7     49    15           1     1     7
8      60        2     4     14           0     0     0
9      60        2     4     13           -1    1     -2
10     50        -8    64    13           -1    1     8
Total  580       0     360   140          0     22    70
N = 10

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{70}{\sqrt{360 \times 22}} = \frac{70}{88.99} = 0.787 \]


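Example: Find Karl Pearson's correlation coefficient for the following data

X   20  16  12   8   4
Y   22  14   4  12   8
Solution:
X     Y     X^2    Y^2    XY
20    22    400    484    440
16    14    256    196    224
12    4     144    16     48
8     12    64     144    96
4     8     16     64     32
$\sum X = 60$, $\sum Y = 60$, $\sum X^2 = 880$, $\sum Y^2 = 904$, $\sum XY = 840$

Applying the formula for r and substituting the respective values from the
table, we get:

\[ r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} = \frac{5(840) - (60)(60)}{\sqrt{\left[5(880) - 60^2\right]\left[5(904) - 60^2\right]}} = \frac{600}{\sqrt{800 \times 920}} = 0.70 \]

Hence, Karl Pearson's correlation coefficient is 0.70.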


Example: In bivariate data on x and y, variance of x = 49, variance of
y = 9 and covariance Cov(x, y) = -17.5. Find the coefficient of correlation
between x and y.

Solution: We know that:

\[ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{N \sigma_x \sigma_y} \]

Given

\[ \mathrm{Cov}(x,y) = \frac{\sum (x-\bar{x})(y-\bar{y})}{N} = -17.5, \qquad \sigma_x = \sqrt{49} = 7, \qquad \sigma_y = \sqrt{9} = 3 \]

\[ r = \frac{-17.5}{7 \times 3} = -0.833 \]

Hence, there is a high degree of negative correlation.

SAQ 1: Ten observations on weight (x) and height (y) of a particular age
group gave the following data:
$\sum x = 56$, $\sum y = 138$, $\sum x^2 = 1357$, $\sum y^2 = 2136$, $\sum xy = 836$. Find r.

SAQ 2: Calculate Karl Pearson's coefficient of correlation from the data
given below
X       Y
39      47
65      53
62      58
90      86
82      62
75      68
25      60
98      91
36      51
78      84
Total   650   660

SAQ 3: Calculate Karl Pearson's coefficient of correlation from the data
given below
Year                   1985  1986  1987  1988  1989  1990  1991  1992
Index of Production    100   102   104   107   105   112   103   99
Number of unemployed   15    12    13    11    12    12    19    26

3. Spearman's Rank Coefficient method

Karl Pearson's correlation coefficient assumes that:
i. Samples are drawn from a normal population
ii. The variables under study are affected by a large number of
independent causes so as to form a normal distribution.

When we do not know the shape of the population distribution, or when the
data are of a qualitative type, Spearman's rank correlation coefficient is used
to measure the relationship.
Spearman's rank correlation coefficient is defined as:

\[ \rho = 1 - \frac{6\sum D^2}{N^3 - N} \]

where D is the difference between the ranks assigned to the two variables and
N is the number of observations.
Note: The value of $\rho$ lies between -1 and +1 and its interpretation is the
same as that of Karl Pearson's correlation coefficient. When $\rho$ is +1, there
is complete agreement in the order of the ranks and the ranks are in the
same direction. When $\rho$ is -1, there is complete disagreement in the order
of the ranks and the ranks are in opposite directions.
There are three types of problems in rank correlation.
1. Ranks are assigned
2. Ranks are not assigned
3. Ranks are to be assigned and there is a tie between ranks

Type i: Ranks are assigned: When ranks are already assigned, take the
difference between the ranks of the variables and denote it by D. Then the
rank correlation is computed using the formula

\[ \rho = 1 - \frac{6\sum D^2}{N^3 - N} \]

Example: In a singing competition, two judges assigned ranks to seven
candidates. Find Spearman's rank correlation coefficient.
Competitor   1  2  3  4  5  6  7
Judge I      5  6  4  3  2  7  1
Judge II     6  4  5  1  2  7  3

Solution:
Competitor   R1 (Judge I)   R2 (Judge II)   D = R1 - R2   D^2
1            5              6               -1            1
2            6              4               2             4
3            4              5               -1            1
4            3              1               2             4
5            2              2               0             0
6            7              7               0             0
7            1              3               -2            4
Total                                                     14

\[ \rho = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 14}{7(7^2 - 1)} = 1 - \frac{6 \times 14}{7 \times 48} = 0.75 \]

Hence, Spearman's rank correlation coefficient is 0.75.
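
The calculation above can be reproduced with a short Python sketch. The function name `spearman_rho` is an illustrative choice, and the sketch assumes the ranks are already assigned and contain no ties.

```python
def spearman_rho(ranks_1, ranks_2):
    """Spearman's rank correlation for untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

judge_1 = [5, 6, 4, 3, 2, 7, 1]
judge_2 = [6, 4, 5, 1, 2, 7, 3]
rho = spearman_rho(judge_1, judge_2)      # sum of D^2 is 14

# Ranks in exactly reversed order give complete disagreement.
rho_reversed = spearman_rho([1, 2, 3], [3, 2, 1])
```

For the two judges this gives 0.75, and for the reversed ranks it gives -1.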


Example: Find the rank difference coefficient of correlation for the data
displayed in the table
Student   Score on Test I (X)   Score on Test II (Y)   Rank on Test I (R1)   Rank on Test II (R2)
A         16                    8                      2                     5
B         14                    14                     3                     3
C         18                    12                     1                     4
D         10                    16                     4                     2
E         2                     20                     5                     1

Solution:
Student   X     Y     R1   R2   D = R1 - R2   D^2
A         16    8     2    5    -3            9
B         14    14    3    3    0             0
C         18    12    1    4    -3            9
D         10    16    4    2    2             4
E         2     20    5    1    4             16
N = 5                           $\sum D^2 = 38$

Applying the formula, we get:

\[ \rho = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6(38)}{5^3 - 5} = 1 - 1.9 = -0.9 \]

Example: The following table represents the sales ranks of six sales
representatives in two different localities. Find whether there is a
relationship between the buying habits of the people in the two localities.
Representative   1  2  3  4  5  6
Locality I       2  5  3  1  4  6
Locality II      4  5  3  1  2  6

Solution:

Representative   Sales in Locality I, R1   Sales in Locality II, R2   D = R1 - R2   D^2
1                2                         4                          -2            4
2                5                         5                          0             0
3                3                         3                          0             0
4                1                         1                          0             0
5                4                         2                          2             4
6                6                         6                          0             0
Total                                                                               8

\[ \rho = 1 - \frac{6 \times 8}{6(6^2 - 1)} = 1 - \frac{8}{35} = 0.7714 \]

Therefore, there is a high positive correlation between the buying habits of
the people in the two localities.

SAQ4: The ranks of the same 16 students in Mathematics and Physics are as
follows. The two numbers within brackets denote the ranks of a student in
Mathematics and Physics respectively: (1,1), (2,10), (3,3), (4,4), (5,5), (6,7), (7,2), (8,6),
(9,8), (10,11), (11,15), (12,9), (13,14), (14,12), (15,16), (16,13). Calculate the
rank correlation coefficient for the proficiencies of this group in Mathematics and
Physics.

Type ii: Ranks are not assigned: When ranks are not given, we have to
assign ranks to the values of each variable in either ascending or descending
order. Rank 1 can be given either to the highest value or to the lowest value,
provided the same convention is used for both variables. Then use the same
formula to compute the rank correlation.

Example: Calculate the rank correlation for the following data of marks in 2
tests given to candidates for a clerical job.
Preliminary test:  92  89  87  86  83  77  71  63  53  50
Final test:        86  83  91  77  68  85  52  82  37  57

Solution:
Preliminary test (X)   R1   Final test (Y)   R2   (R1 - R2)^2 = D^2
92                     10   86               9    1
89                     9    83               7    4
87                     8    91               10   4
86                     7    77               5    4
83                     6    68               4    4
77                     5    85               8    9
71                     4    52               2    4
63                     3    82               6    9
53                     2    37               1    1
50                     1    57               3    4
N = 10                                            $\sum D^2 = 44$

\[ \rho = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 44}{10(10^2 - 1)} = 1 - \frac{6 \times 44}{990} = 0.733 \]
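
When ranks are not given, the ranking step can be automated. The following Python sketch is illustrative only: the helper names are assumptions, the lowest value is given rank 1 as in the table above, and the ranking helper assumes that all values within a series are distinct.

```python
def assign_ranks(values):
    """Rank distinct values, giving rank 1 to the lowest value."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(ranks_1, ranks_2):
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

preliminary = [92, 89, 87, 86, 83, 77, 71, 63, 53, 50]
final = [86, 83, 91, 77, 68, 85, 52, 82, 37, 57]

r1 = assign_ranks(preliminary)
r2 = assign_ranks(final)
rho = spearman_rho(r1, r2)   # sum of D^2 is 44
```

The result is approximately 0.733. Ranking both series from the highest value instead would give the same coefficient, since every difference D keeps its size and only changes sign.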

Type iii: Ranks are to be assigned and there is a tie between ranks

This is the case when equal ranks have to be assigned to two or more entries
in a series. In such cases, each tied entry is given the average of the ranks
the tied entries would otherwise occupy. For example, if two individuals are
ranked equal at sixth place, they are each given the rank $(6+7)/2$, that is
6.5, and if three are ranked equal at sixth place, they are each given the
rank $(6+7+8)/3 = 7$. Thus if two or more individuals are to be ranked equal,
the rank assigned to each of them is the average of the ranks.

When equal ranks are assigned to some entries, an adjustment of
$\frac{1}{12}(m^3 - m)$ is added to the value of $\sum D^2$, where m stands
for the number of items whose ranks are common. If there is more than one
such group of items with a common rank, this value is added once for each
group. The formula can thus be written as:

\[ \rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right]}{N^3 - N} \]

Example: Find the rank correlation coefficient for the data given in the table.
Student            A   B   C   D   E   F   G   H   I   J
Score on Test I    20  30  22  28  32  40  20  16  14  18
Score on Test II   32  32  48  36  44  48  28  20  24  28

Solution: Ranks on Test I and Test II

Student   Score on Test I (X)   Score on Test II (Y)   R1     R2     D = R1 - R2   D^2
A         20                    32                     6.5    5.5    1.0           1.00
B         30                    32                     3      5.5    -2.5          6.25
C         22                    48                     5      1.5    3.5           12.25
D         28                    36                     4      4      0             0
E         32                    44                     2      3      -1.0          1.00
F         40                    48                     1      1.5    -0.5          0.25
G         20                    28                     6.5    7.5    -1.0          1.00
H         16                    20                     9      10     -1.0          1.00
I         14                    24                     10     9      1.0           1.00
J         18                    28                     8      7.5    0.5           0.25
N = 10                                                                             $\sum D^2 = 24$

Item 20 is repeated 2 times in series X, hence m1 = 2. In series Y, item 48
is repeated 2 times so m2 = 2, item 32 is repeated 2 times so m3 = 2, and
item 28 is repeated 2 times so m4 = 2. Substituting these values in the
formula, we get

\[ \rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \frac{1}{12}(m_4^3 - m_4)\right]}{N^3 - N} \]

where $m_i$ represents the number of times a rank is repeated.

\[ \rho = 1 - \frac{6\left[24 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)\right]}{10(100 - 1)} = 1 - \frac{6\left[24 + 0.5 + 0.5 + 0.5 + 0.5\right]}{10 \times 99} = 1 - \frac{156}{990} = 0.8424 \]
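
A minimal Python sketch of the tie-adjusted calculation follows. The helper names are illustrative assumptions, the highest score is given rank 1 as in the table above, and the factor 6 is applied to the whole adjusted sum (the sum of squared differences plus one correction term per tie group).

```python
from collections import Counter

def average_ranks(values):
    """Rank with the highest value as 1; tied values share the mean rank."""
    order = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = order.index(v) + 1        # first position occupied by v
        m = order.count(v)                # size of the tie group
        ranks[v] = first + (m - 1) / 2    # mean of first .. first + m - 1
    return [ranks[v] for v in values]

def tie_correction(values):
    """Sum of (m^3 - m)/12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_rho_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n ** 3 - n)

test_1 = [20, 30, 22, 28, 32, 40, 20, 16, 14, 18]
test_2 = [32, 32, 48, 36, 44, 48, 28, 20, 24, 28]
rho = spearman_rho_ties(test_1, test_2)
```

For these scores the sum of squared rank differences is 24 and the four tie groups contribute 0.5 each, so the adjusted sum is 26 and the coefficient is about 0.842.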

SAQ5. An examination of eight applicants was taken by a company. From
the marks obtained by the applicants in the Accountancy and Statistics
papers, compute the rank coefficient of correlation
Applicants:          A   B   C   D   E   F   G   H
Marks in Accounts:   15  20  28  12  40  60  20  80
Marks in Stat.:      40  30  50  30  20  10  30  60

8.3 Partial Correlation

Partial correlation is used in situations where three or more variables are
involved. Suppose the three variables are age, height and weight. Age may be
an important factor influencing the strength of the relationship between
height and weight, so the correlation between height and weight can be
computed with age held constant. The effect of one variable is partialled out
from the correlation between the other two variables. This statistical
technique is known as partial correlation.
The correlation between variables x and y is denoted by $r_{xy}$. The partial
correlation between x and y keeping the variable z constant is denoted
by $r_{xy.z}$.
Calculation of partial correlation
The partial correlation between variables 1 and 2 keeping the 3rd variable
constant is denoted by the symbol $r_{12.3}$ and is given by

\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} \]

where,
$r_{12.3}$ = partial correlation between variables 1 and 2 keeping the 3rd constant
$r_{12}$ = correlation between variables 1 and 2
$r_{13}$ = correlation between variables 1 and 3
$r_{23}$ = correlation between variables 2 and 3

Similarly,

\[ r_{13.2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{32}^2}} \quad \text{and} \quad r_{23.1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{31}^2}} \]

Note: A partial correlation coefficient lies between -1 and +1.


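Example: Given $r_{12} = 0.8$, $r_{13} = 0.5$ and $r_{23} = 0.4$, calculate all partial
correlations.
Solution: (i) The correlation between variables 1 and 2 keeping the 3rd
constant is given by

\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.8 - 0.5 \times 0.4}{\sqrt{1 - 0.5^2}\,\sqrt{1 - 0.4^2}} = \frac{0.6}{0.794} = 0.756 \]

(ii) The correlation between variables 1 and 3 keeping the 2nd constant is
given by

\[ r_{13.2} = \frac{r_{13} - r_{12}\,r_{32}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{32}^2}} = \frac{0.5 - 0.8 \times 0.4}{\sqrt{1 - 0.8^2}\,\sqrt{1 - 0.4^2}} = \frac{0.18}{0.55} = 0.33 \]

(iii) The correlation between variables 2 and 3 keeping the 1st constant is
given by

\[ r_{23.1} = \frac{r_{23} - r_{21}\,r_{31}}{\sqrt{1 - r_{21}^2}\,\sqrt{1 - r_{31}^2}} = \frac{0.4 - 0.8 \times 0.5}{\sqrt{1 - 0.8^2}\,\sqrt{1 - 0.5^2}} = 0 \]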
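
The three partial correlations follow from one formula with the roles of the variables permuted. The Python sketch below is illustrative; the function name `partial_r` and its argument names are assumptions. Its first argument is the correlation between the two variables of interest, and the other two are the correlations of each of them with the variable held constant.

```python
from math import sqrt

def partial_r(r_ab, r_ac, r_bc):
    """Correlation between a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / (sqrt(1 - r_ac ** 2) * sqrt(1 - r_bc ** 2))

r12, r13, r23 = 0.8, 0.5, 0.4
r12_3 = partial_r(r12, r13, r23)   # variables 1 and 2, holding 3 constant
r13_2 = partial_r(r13, r12, r23)   # variables 1 and 3, holding 2 constant
r23_1 = partial_r(r23, r12, r13)   # variables 2 and 3, holding 1 constant

# A result outside [-1, 1] signals an inconsistent set of correlations.
inconsistent = partial_r(0.6, -0.5, 0.8)
```

The first three values are approximately 0.756, 0.327 and 0. The last value is about 1.92, which is impossible for a correlation coefficient.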

Example: Is it possible to have the following set of experimental data:
$r_{12} = 0.6$, $r_{13} = -0.5$ and $r_{23} = 0.8$?

Solution: In order to see whether there is any inconsistency in the given
data, we calculate

\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.6 - (-0.5 \times 0.8)}{\sqrt{1 - (-0.5)^2}\,\sqrt{1 - 0.8^2}} = \frac{1}{0.52} = 1.92 \]

Since a partial correlation coefficient cannot exceed 1 numerically, the given
data are inconsistent.


SAQ6. Given r12 = 0.7, r13 = 0.61 and r23 = 0.4, calculate all partial
correlations. Calculate r12.3 , r13.2

8.4 Multiple Correlation

Three or more related variables are involved in multiple correlation. The
dependent variable is denoted by X1 and the other variables are denoted by X2,
X3 and so on. The coefficient of multiple linear correlation is represented by
R1, and it is common to add subscripts designating the variables involved. Thus
$R_{1.234}$ represents the coefficient of multiple linear correlation between
X1 on the one hand and X2, X3 and X4 on the other. The subscript of
the dependent variable is always to the left of the point.

The coefficients of multiple correlation $R_{1.23}$, $R_{2.13}$ and $R_{3.12}$ can be
expressed as:

\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}} \]

\[ R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{13}^2}} \]

\[ R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{12}^2}} \]

The coefficient of multiple correlation $R_{1.23}$ is the same as $R_{1.32}$.

A coefficient of multiple correlation is never negative and lies between 0
and +1. If it is 1, the correlation is perfect. If it is 0, there is no linear
relationship between the variables. The coefficient of multiple determination
is obtained by squaring $R_{1.23}$.

Multiple correlation analysis measures the relationship between the given
variables. In this analysis, the degree of association is measured between
one variable (which is considered as the dependent variable) and a group of
other variables (which are considered as independent variables).


Example: The following are the zero order correlation coefficients.
$r_{12} = 0.98$; $r_{13} = 0.44$; $r_{23} = 0.54$

Calculate the multiple correlation coefficient treating the first variable as
dependent and the second and third variables as independent.

Solution: The first variable is dependent. The second and third variables are
independent. Using the formula for the multiple correlation coefficient
$R_{1.23}$ we get:

\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}} = \sqrt{\frac{0.98^2 + 0.44^2 - 2(0.98)(0.44)(0.54)}{1 - 0.54^2}} = 0.986 \]

Hence the multiple correlation coefficient is 0.986.
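
The same substitution can be written as a short Python sketch. The function name `multiple_r` is an illustrative assumption; it evaluates the formula for the case where variable 1 is dependent on variables 2 and 3.

```python
from math import sqrt

def multiple_r(r12, r13, r23):
    """R_1.23: variable 1 against variables 2 and 3 taken together."""
    numerator = r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23
    return sqrt(numerator / (1 - r23 ** 2))

R1_23 = multiple_r(0.98, 0.44, 0.54)
coefficient_of_determination = R1_23 ** 2
```

This gives a value of about 0.986, which is at least as large as either of the simple correlations 0.98 and 0.44 involving the dependent variable.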

8.5 Regression
Regression is defined as "the measure of the average relationship between
two or more variables in terms of the original units of the data."

Correlation analysis attempts to study the relationship between the two
variables x and y. Regression analysis attempts to predict the average x
for a given y. In regression, we attempt to quantify the dependence of
one variable on the other. For example, if there are two variables x and y
and y depends on x, then the dependence is expressed in the form of an
equation.

8.5.1 Regression analysis

Regression analysis is used to estimate the values of the dependent
variable from the values of the independent variables. Independent
variables are also known as regressors, predictors or explanatory variables,
while the dependent variable is also known as the regressed or explained
variable. Regression analysis is also used to get a measure of the error
involved in using the regression line as a basis for estimation. The regression
coefficient of y on x is the coefficient of the variable x in the line of
regression of y on x. Regression coefficients can be used to calculate the
correlation coefficient: the square of the correlation coefficient is the
product of the two regression coefficients.


8.5.2 Regression lines (Linear Regression)

When the variables in a bivariate distribution are plotted in a scatter
diagram and the points follow a straight line, the regression between the
variables is said to be linear.

The line of regression is the line which gives the best estimate of the value
of one variable for any specific value of the other variable. The line of
regression is the line of best fit and is obtained by the principle of least
squares.

For a set of paired observations there exist two straight lines. The line
drawn in such a way that the sum of the vertical deviations is zero and the sum
of their squares is a minimum is called the regression line of y on x. It is
used to estimate y values for given x values. The line drawn in such a way that
the sum of the horizontal deviations is zero and the sum of their squares is a
minimum is called the regression line of x on y. It is used to estimate x values
for given y values. The smaller the angle between these lines, the higher the
correlation between the variables. The regression lines always intersect at
$(\bar{x}, \bar{y})$.

The regression lines have the following equations.

i) The regression equation of y on x is given by:

\[ y - \bar{y} = b_{yx}\,(x - \bar{x}) \]

ii) The regression equation of x on y is given by:

\[ x - \bar{x} = b_{xy}\,(y - \bar{y}) \]

where,

\[ b_{xy} = \frac{N\sum xy - (\sum x)(\sum y)}{N\sum y^2 - (\sum y)^2} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (Y-\bar{Y})^2} = r\,\frac{\sigma_x}{\sigma_y} = \frac{\mathrm{cov}(x,y)}{\sigma_y^2} \]

and

\[ b_{yx} = \frac{N\sum xy - (\sum x)(\sum y)}{N\sum x^2 - (\sum x)^2} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (X-\bar{X})^2} = r\,\frac{\sigma_y}{\sigma_x} = \frac{\mathrm{cov}(x,y)}{\sigma_x^2} \]

where $b_{yx}$ and $b_{xy}$ are called regression coefficients and r is the correlation
coefficient.

8.5.3 Regression Coefficient

When a regression is linear, the regression coefficient is given by the
slope of the regression line.
The geometric mean of the regression coefficients gives the correlation
coefficient:

\[ b_{yx} \cdot b_{xy} = r^2, \qquad r = \pm\sqrt{b_{yx} \cdot b_{xy}} \]

where r takes the common sign of the two regression coefficients.
Note that $b_{yx} = r\,\sigma_y/\sigma_x$ and $b_{xy} = r\,\sigma_x/\sigma_y$.

Properties:
1. The product of the regression coefficients never exceeds 1, that is,
$b_{yx} \cdot b_{xy} \le 1$
2. A regression coefficient is independent of a change of origin but not of
scale.
3. It is an absolute measure.

8.5.4 Angle between two lines of Regression

If $\theta$ is the acute angle between the two lines of regression, then

\[ \tan\theta = \frac{1 - r^2}{|r|} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \]

Case (i): r = 0 implies $\tan\theta = \infty$, so $\theta = \pi/2$. That is,
if the two variables are uncorrelated, the lines of regression are
perpendicular to each other.

Case (ii): $r = \pm 1$ implies $\tan\theta = 0$, so $\theta = 0$ or $\pi$. In
this case the two lines of regression either coincide or are parallel to each
other. But since both lines of regression pass through the point
$(\bar{x}, \bar{y})$, they cannot be parallel.
Hence in the case of perfect correlation, positive or negative, the two lines
of regression coincide.
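
The two limiting cases can be checked numerically. The Python sketch below is illustrative: the function name is an assumption, and it uses the standard formula for the acute angle, $\tan\theta = \frac{1 - r^2}{|r|} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$, with r = 0 handled separately because the tangent is then infinite.

```python
from math import atan, degrees

def angle_between_regression_lines(r, sd_x, sd_y):
    """Acute angle, in degrees, between the two regression lines."""
    if r == 0:
        return 90.0                      # uncorrelated: perpendicular lines
    tan_theta = (1 - r ** 2) / abs(r) * (sd_x * sd_y) / (sd_x ** 2 + sd_y ** 2)
    return degrees(atan(tan_theta))

angle_uncorrelated = angle_between_regression_lines(0, 2.5, 3.5)
angle_perfect = angle_between_regression_lines(-1, 2.5, 3.5)
angle_typical = angle_between_regression_lines(0.8, 2.5, 3.5)
```

With r = 0 the angle is 90 degrees, with r = -1 it is 0 degrees, and for r = 0.8 with standard deviations 2.5 and 3.5 it is roughly 12 degrees, which shows how the lines close up as the correlation strengthens.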


Differences between the correlation coefficient and the regression coefficients

Correlation Coefficient                 Regression Coefficient
$r_{xy} = r_{yx}$                       $b_{yx} \neq b_{xy}$ in general
r lies between -1 and 1.                $b_{yx}$ can be greater than one, in which case $b_{xy}$ must be less than one, such that $b_{yx} \cdot b_{xy} \le 1$
It has no units attached to it.         It has units attached to it.
It is not based on a cause and          It is based on a cause and effect
effect relationship.                    relationship.
It indirectly helps in estimation.      It is meant for estimation.

Example: Find the regression equations from the data represented in the table.
Then calculate the correlation coefficient.
Age of Husband   18  19  20  21  22  23  24  25  26  27
Age of Wife      17  17  18  18  19  19  19  20  21  22

Solution: Data required for the calculation of the correlation and regression
coefficients

Age of husband (x)   dx = x - 22   dx^2   Age of wife (y)   dy = y - 19   dy^2   dx dy
18                   -4            16     17                -2            4      8
19                   -3            9      17                -2            4      6
20                   -2            4      18                -1            1      2
21                   -1            1      18                -1            1      1
22                   0             0      19                0             0      0
23                   1             1      19                0             0      0
24                   2             4      19                0             0      0
25                   3             9      20                1             1      3
26                   4             16     21                2             4      8
27                   5             25     22                3             9      15
Total 225            5             85     190               0             24     43

\[ \bar{X} = \frac{225}{10} = 22.5, \qquad \bar{Y} = \frac{190}{10} = 19 \]

The regression equation of Y on X is given by:

\[ Y - \bar{Y} = b_{yx}\,(X - \bar{X}) \]

\[ b_{yx} = \frac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dx^2 - (\sum dx)^2} = \frac{10 \times 43 - (5)(0)}{10 \times 85 - (5)^2} = \frac{430}{825} = 0.521 \]

\[ Y - 19 = 0.521\,(X - 22.5) \]

\[ Y = 0.521X + 7.2775 \]

The regression equation of X on Y is:

\[ b_{xy} = \frac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dy^2 - (\sum dy)^2} = \frac{10 \times 43 - (5)(0)}{10 \times 24 - (0)^2} = \frac{430}{240} = 1.792 \]

\[ X - 22.5 = 1.792\,(Y - 19) \]

\[ X = 1.792Y - 11.548 \]

\[ r = \sqrt{0.521 \times 1.792} = 0.966 \]

Hence, the correlation coefficient r is 0.966.
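
The coefficients in this example can be recomputed directly from the raw ages. The Python sketch below is illustrative; the function name is an assumption, and it uses deviations from the actual means rather than from the assumed means 22 and 19, which gives the same coefficients because regression coefficients do not depend on the choice of origin.

```python
from math import sqrt

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) from sums of deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / syy

husband = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
wife = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]

b_yx, b_xy = regression_coefficients(husband, wife)
# Both coefficients are positive here, so r is the positive square root.
r = sqrt(b_yx * b_xy)
```

The values obtained are approximately 0.521 for the coefficient of y on x, 1.792 for the coefficient of x on y, and 0.966 for r.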

Example: In a correlation study, we have the data represented in the table. Find
the two regression equations.
                          Series X   Series Y
Mean                      65         67
Standard Deviation        2.5        3.5
Correlation coefficient   0.8

Solution: The regression equation of Y on X is given by:

\[ Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X}) \]

\[ Y - 67 = (0.8)\,\frac{3.5}{2.5}\,(X - 65) \]

\[ Y = 67 + 1.12X - 1.12 \times 65 \]

\[ Y = 1.12X - 5.8 \]

The regression equation of X on Y is given by:

\[ X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y}) \]

\[ X - 65 = (0.8)\,\frac{2.5}{3.5}\,(Y - 67) \]

\[ X = 65 + 0.57Y - 0.57 \times 67 \]

\[ X = 0.57Y + 26.72 \]

Hence, the two regression equations are:
Y = 1.12X - 5.8
X = 0.57Y + 26.72
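
When only the means, standard deviations and r are known, each regression line follows from the same two steps: the slope is r times the ratio of standard deviations, and the intercept makes the line pass through the point of means. A minimal Python sketch follows; the function and argument names are illustrative assumptions.

```python
def regression_line(mean_dep, mean_ind, sd_dep, sd_ind, r):
    """Slope and intercept of the regression of the dependent variable
    on the independent variable, from summary statistics."""
    slope = r * sd_dep / sd_ind
    intercept = mean_dep - slope * mean_ind
    return slope, intercept

# Series X: mean 65, s.d. 2.5.  Series Y: mean 67, s.d. 3.5.  r = 0.8.
slope_yx, intercept_yx = regression_line(67, 65, 3.5, 2.5, 0.8)   # Y on X
slope_xy, intercept_xy = regression_line(65, 67, 2.5, 3.5, 0.8)   # X on Y
```

The line of Y on X has slope about 1.12 and intercept about -5.8. The line of X on Y has slope about 0.571 and intercept about 26.71 when the slope is not rounded before multiplying. The product of the two slopes is r squared, 0.64.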

Example: The table shows the results that were worked out from scores in
Statistics and Mathematics in a certain examination.
                     Scores in Statistics (X)   Scores in Mathematics (Y)
Mean                 40                         48
Standard Deviation   10                         15

Karl Pearson's correlation coefficient between x and y is +0.42. Find the
regression lines of x on y and of y on x. Use the regression lines to find the
value of y when x = 50 and the value of x when y = 30.

Solution: Given the following data:
$\bar{X} = 40$; $\bar{Y} = 48$; $\sigma_x = 10$; $\sigma_y = 15$; r = 0.42

The regression line of X on Y is:

\[ (X - \bar{X}) = \left(r\,\sigma_x / \sigma_y\right)(Y - \bar{Y}) \qquad (1) \]

The regression line of Y on X is given as:

\[ (Y - \bar{Y}) = \left(r\,\sigma_y / \sigma_x\right)(X - \bar{X}) \qquad (2) \]

Substituting the values, we get the respective equations as:

\[ X = 0.28Y + 26.56 \qquad (3) \]

\[ Y = 0.63X + 22.80 \qquad (4) \]

Therefore,
when y = 30, x = 34.96 using equation (3), and
when x = 50, y = 54.3 using equation (4).

Example: For the data shown in the table, obtain the two regression equations.
Estimate Y for X = 15 and estimate X for Y = 20
X   12   4  20   8  16
Y   18  22  10  16  14

Solution: The table displays the values required for obtaining the
regression equations.
$\bar{X}$ = (12 + 4 + 20 + 8 + 16) / 5 = 12 = mean of X
$\bar{Y}$ = (18 + 22 + 10 + 16 + 14) / 5 = 16 = mean of Y

X     Y     X - 12   Y - 16   (X - 12)^2   (Y - 16)^2   (X - 12)(Y - 16)
12    18    0        2        0            4            0
4     22    -8       6        64           36           -48
20    10    8        -6       64           36           -48
8     16    -4       0        16           0            0
16    14    4        -2       16           4            -8
Total                         160          80           -104

\[ b_{yx} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (X-\bar{X})^2} = \frac{-104}{160} = -0.65 \]

and

\[ b_{xy} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (Y-\bar{Y})^2} = \frac{-104}{80} = -1.3 \]

The regression equation of X on Y is given by:

\[ (X - \bar{X}) = b_{xy}\,(Y - \bar{Y}) \]

\[ X - 12 = -1.3\,(Y - 16) \]

\[ X = 32.8 - 1.3Y \]

When Y = 20, X = 6.8.
The regression equation of Y on X is given by:

\[ (Y - \bar{Y}) = b_{yx}\,(X - \bar{X}) \]

\[ Y - 16 = -0.65\,(X - 12) \]

\[ Y = 23.8 - 0.65X \]

When X = 15, Y = 14.05.

SAQ 7: From the data given below, compute the two regression
coefficients and formulate the two regression equations: $\sum X$ = 510,
$\sum Y$ = 7140, $\sum X^2$ = 4150, $\sum XY$ = 54900, $\sum Y^2$ = 740200, N = 102. Also
determine the value of Y when X = 7.

SAQ 8: Obtain the regression equations of X on Y and of Y on X from the
following data:
X   1  5  3  2  1  1  7  3
Y   6  1  0  0  1  2  1  5

Also determine r and predict Y when X = 10, and X when Y = 2.5

8.6 Multiple Regression


Multiple regression analysis is an extension of two variable regression
analysis. In this analysis, two or more independent variables are used to
estimate the values of a dependent variable, instead of one independent
variable.

Objectives of multiple regression analysis are:


To derive an equation, which provides estimates of the dependent
variable from values of the two or more independent variables
To obtain the measure of the error involved in using the regression
equation as a basis of estimation

Sikkim Manipal University Page No.: 251


Probability and Statistics Unit 8

To obtain a measure of the proportion of variance in the dependent


variable accounted for or explained by the independent variables

Multiple regression equation explains the average relationship between the


given variables and the relationship is used to estimate the dependent
variable. Regression equation refers the equation for estimating a
dependent variable. Estimating dependent variable X1 from the independent
variables X2, X3, is known as regression equation of X1 on X2, X3.

The regression equation, when three variables are involved, is given below:

\[ X_{1.23} = a_{1.23} + b_{12.3} X_2 + b_{13.2} X_3 \]

where $X_{1.23}$ is the estimated value of the dependent variable; $X_2$ and
$X_3$ are the independent variables; $a_{1.23}$ is the constant, that is, the
intercept made by the regression plane, which gives the value of the
dependent variable when all the independent variables are equal to zero; and
$b_{12.3}$ and $b_{13.2}$ are the partial regression coefficients (or net
regression coefficients). $b_{12.3}$ measures the amount by which a unit
change in $X_2$ is expected to affect $X_1$ when $X_3$ is held constant, and
$b_{13.2}$ likewise measures the effect on $X_1$ of a unit change in $X_3$
when $X_2$ is held constant.
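
As an illustration of how the constant and the two partial regression coefficients can be estimated, the sketch below solves the least-squares normal equations for two independent variables by Cramer's rule. The function name and the data are hypothetical (they do not come from this unit); the data are constructed so that X1 = 2 + 3 X2 - X3 holds exactly.

```python
def fit_two_predictor_regression(x1, x2, x3):
    """Least-squares estimates (a, b12_3, b13_2) for x1 = a + b12_3*x2 + b13_2*x3."""
    n = len(x1)
    m1, m2, m3 = sum(x1) / n, sum(x2) / n, sum(x3) / n
    # Deviations of each variable from its mean.
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    d3 = [v - m3 for v in x3]
    # Sums of squares and cross-products of the deviations.
    s22 = sum(u * u for u in d2)
    s33 = sum(u * u for u in d3)
    s23 = sum(u * v for u, v in zip(d2, d3))
    s12 = sum(u * v for u, v in zip(d1, d2))
    s13 = sum(u * v for u, v in zip(d1, d3))
    # Normal equations: s12 = b12_3*s22 + b13_2*s23 and s13 = b12_3*s23 + b13_2*s33.
    det = s22 * s33 - s23 ** 2
    if det == 0:
        raise ValueError("X2 and X3 are perfectly collinear; no unique solution")
    b12_3 = (s12 * s33 - s13 * s23) / det
    b13_2 = (s13 * s22 - s12 * s23) / det
    # The fitted plane passes through the point of means.
    a = m1 - b12_3 * m2 - b13_2 * m3
    return a, b12_3, b13_2


# Hypothetical data built so that X1 = 2 + 3*X2 - X3 holds exactly.
x2 = [1, 2, 3, 4, 5, 6]
x3 = [2, 1, 4, 3, 6, 5]
x1 = [2 + 3 * u - v for u, v in zip(x2, x3)]
a, b12_3, b13_2 = fit_two_predictor_regression(x1, x2, x3)
print(round(a, 6), round(b12_3, 6), round(b13_2, 6))
```

Because the illustrative data lie exactly on a plane, the fit should recover the constant 2 and the coefficients 3 and -1 up to floating-point rounding. With real data the points scatter around the plane and the same formulas give the best-fitting coefficients.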

8.7 Summary
In this unit we studied the concept of correlation using Karl Pearson's
correlation coefficient and Spearman's rank correlation coefficient, with a
suitable number of examples. Partial and multiple correlation were also
studied. The concepts of regression analysis and multiple regression were
introduced with the help of examples.

8.8 Terminal Questions


1. Ten competitors in a musical test were ranked by three judges A, B, C in
the following order:
A 1 6 5 10 3 2 4 9 7 8
B 3 5 8 4 7 10 2 1 6 9
C 6 4 9 8 1 2 3 10 5 7

Using rank correlation method find which judge has the nearest
approach to common likings in music.

2. Two managers are asked to rank a group of employees in order of
potential for eventually becoming top managers. The rankings are as
follows:
Employees (S) Ranking by manager (I) Ranking by manager (II)
A 10 9
B 2 4
C 1 2
D 4 3
E 3 1
F 6 5
G 5 6
H 8 8
I 7 7
J 9 10

Compute the coefficient of rank correlation and comment on the value.

3. Calculate Karl Pearson's coefficient of correlation from the following
data and interpret its value:
Roll No. : 1 2 3 4 5
Marks in Accountancy : 48 35 17 23 47
Marks in Statistics : 45 20 40 25 45

4. Calculate Karl Pearson's coefficient of correlation between age and
playing habits of the following students:
Age : 15 16 17 18 19 20
No. of students : 250 200 150 120 100 80
Regular players : 200 150 90 48 30 12
5. Given r12 = 0.6, r13 = 0.5 and r23 = 0.2, calculate r12.3
6. For a large group of students x1 = score in economics, x2 = score in
maths, x3 = score in Stats., r12 = 0.69, r13 = 0.45, r23 = 0.58. Determine
the multiple correlation R3.12.

7. Obtain the equations of the two lines of regression from the following
data:

X Y X Y
43 29 45 27
44 31 42 29
46 19 38 41
40 18 40 30
44 19 42 26
42 27 57 10

Also determine the value of the correlation coefficient between X and Y.

8.9 Answers

Self Assessment Questions


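1. We know that

\[ r = \frac{N\sum xy - \sum x \sum y}{\left[N\sum x^2 - (\sum x)^2\right]^{1/2}\left[N\sum y^2 - (\sum y)^2\right]^{1/2}} \]

Given N = 10, $\sum x = 56$, $\sum y = 138$, $\sum x^2 = 1357$, $\sum y^2 = 2136$ and $\sum xy = 836$.
Therefore,

\[ r = \frac{10 \times 836 - (56)(138)}{\left[10 \times 1357 - (56)^2\right]^{1/2}\left[10 \times 2136 - (138)^2\right]^{1/2}} = 0.1286 \]

Hence, Karl Pearson's correlation coefficient is 0.1286.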
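
The arithmetic in this answer can be checked with a few lines of Python. This is only a sketch: the function name `pearson_from_sums` is an illustrative choice, and the function simply evaluates the sums formula used above.

```python
from math import sqrt


def pearson_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Karl Pearson's r computed from N and the five summary sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Sums given in the answer to SAQ 1.
r = pearson_from_sums(10, 56, 138, 1357, 2136, 836)
print(round(r, 4))
```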

2. We have $\bar{X} = 650/10 = 65$ and $\bar{Y} = 660/10 = 66$. Let $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

          X      Y      x      y     x²     y²     xy
         39     47    -26    -19    676    361    494
         65     53      0    -13      0    169      0
         62     58     -3     -8      9     64     24
         90     86     25     20    625    400    500
         82     62     17     -4    289     16    -68
         75     68     10      2    100      4     20
         25     60    -40     -6   1600     36    240
         98     91     33     25   1089    625    825
         36     51    -29    -15    841    225    435
         78     84     13     18    169    324    234
Total   650    660      0      0   5398   2224   2704

Karl Pearson's correlation coefficient is given by

\[ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{2704}{\sqrt{5398 \times 2224}} = 0.7804 \]
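
The deviation method used in this answer can be sketched in Python as follows. The function name is hypothetical; the data are the X and Y columns of the table for SAQ 2.

```python
from math import sqrt


def pearson_from_deviations(xs, ys):
    """Karl Pearson's r using deviations taken from the actual means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# X and Y values from the table in the answer to SAQ 2.
X = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
Y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]
r = pearson_from_deviations(X, Y)
print(round(r, 4))
```

The same function applies to any paired data, for instance the production and unemployment figures of SAQ 3.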

3. The table below shows the sums required for the calculation of Karl
Pearson's correlation coefficient, where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

Year    Index of          x     x²    No. of            y      y²     xy
        production (X)                unemployed (Y)
1985    100              -4     16    15                0       0      0
1986    102              -2      4    12               -3       9     +6
1987    104               0      0    13               -2       4      0
1988    107              +3      9    11               -4      16    -12
1989    105              +1      1    12               -3       9     -3
1990    112              +8     64    12               -3       9    -24
1991    103              -1      1    19               +4      16     -4
1992     99              -5     25    26              +11     121    -55
Total   832               0    120   120                0     184    -92

$\bar{X} = 832/8 = 104$ and $\bar{Y} = 120/8 = 15$

\[ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{-92}{\sqrt{120 \times 184}} = -0.619 \]

Therefore, the correlation between production and unemployment is
negative.
4.
Rank in Maths (X)    Rank in Physics (Y)    D = X - Y    D²
        1                    1                  0          0
        2                   10                 -8         64
        3                    3                  0          0
        4                    4                  0          0
        5                    5                  0          0
        6                    7                 -1          1
        7                    2                  5         25
        8                    6                  2          4
        9                    8                  1          1
       10                   11                 -1          1
       11                   15                 -4         16
       12                    9                  3          9
       13                   14                 -1          1
       14                   12                  2          4
       15                   16                 -1          1
       16                   13                  3          9
    Total                                       0        136

Rank correlation coefficient is given by:

\[ \rho = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 136}{16 \times 255} = 0.8 \]
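
A short Python sketch of the rank correlation formula, assuming both series are already ranked and contain no ties. The function name is illustrative; the ranks are those of SAQ 4.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for two untied rankings of equal length."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Ranks in Maths (1 to 16) and the corresponding ranks in Physics from SAQ 4.
maths = list(range(1, 17))
physics = [1, 10, 3, 4, 5, 7, 2, 6, 8, 11, 15, 9, 14, 12, 16, 13]
rho = spearman_rho(maths, physics)
print(round(rho, 4))
```

The same function can be applied to each pair of judges in Terminal Question 1 to compare the three coefficients.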

5.
Applicant   Marks in          Rank assigned   Marks in      Rank assigned   (R1 - R2)² = D²
            Accountancy (X)   R1              Stats. (Y)    R2
A           15                2               40            6               16
B           20                3.5             30            4               0.25
C           28                5               50            7               4
D           12                1               30            4               9
E           40                6               20            2               16
F           60                7               10            1               36
G           20                3.5             30            4               0.25
H           80                8               60            8               0
N = 8                                                                       $\sum D^2 = 81.5$

Item 20 is repeated 2 times in series X, and hence m1 = 2. In series Y,
item 30 is repeated 3 times, so m2 = 3. Substituting these values in the
formula

\[ \rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2)\right]}{N^3 - N} \]

where $m_i$ represents the number of times a rank is repeated, we get

\[ \rho = 1 - \frac{6\left[81.5 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3)\right]}{8(64 - 1)} = 1 - \frac{6(81.5 + 0.5 + 2)}{8 \times 63} = 1 - \frac{6 \times 84}{504} = 0 \]
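
The tie-corrected calculation can be sketched in Python as below. The helper names are hypothetical. The sketch assumes, as in the table for SAQ 5, that the lowest mark receives rank 1 and that tied marks share the average of the ranks they occupy.

```python
from collections import Counter


def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def tie_correction(values):
    """Sum of (m^3 - m)/12 over every value that is repeated m > 1 times."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_rho_with_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n ** 3 - n)


# Marks of applicants A to H from SAQ 5.
accountancy = [15, 20, 28, 12, 40, 60, 20, 80]
stats = [40, 30, 50, 30, 20, 10, 30, 60]
rho = spearman_rho_with_ties(accountancy, stats)
print(average_ranks(accountancy), average_ranks(stats), rho)
```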
6. Solution: (i) The correlation between variables 1 and 2, keeping the 3rd
constant, is given by

\[ r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.7 - 0.61 \times 0.4}{\sqrt{1 - 0.61^2}\,\sqrt{1 - 0.4^2}} = \frac{0.456}{0.792 \times 0.917} \approx 0.628 \]

(ii) The correlation between variables 1 and 3, keeping the 2nd constant, is
given by

\[ r_{13.2} = \frac{r_{13} - r_{12}\, r_{32}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{32}^2}} = \frac{0.61 - 0.7 \times 0.4}{\sqrt{1 - 0.7^2}\,\sqrt{1 - 0.4^2}} = \frac{0.33}{0.714 \times 0.917} \approx 0.504 \]
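
Both parts of this answer use the same formula with the roles of the variables exchanged, which a small Python sketch makes explicit. The function name and argument order are illustrative assumptions. The printed values should agree with the worked answer up to rounding of the intermediate square roots.

```python
from math import sqrt


def partial_correlation(r_ab, r_ac, r_bc):
    """First-order partial correlation between a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / (sqrt(1 - r_ac ** 2) * sqrt(1 - r_bc ** 2))


# Values used in the answer to SAQ 6.
r12, r13, r23 = 0.7, 0.61, 0.4
r12_3 = partial_correlation(r12, r13, r23)  # variables 1 and 2, holding 3 constant
r13_2 = partial_correlation(r13, r12, r23)  # variables 1 and 3, holding 2 constant
print(round(r12_3, 3), round(r13_2, 3))
```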

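7. The two regression coefficients are

\[ b_{xy} = \frac{N\sum xy - \sum x \sum y}{N\sum y^2 - (\sum y)^2} = \frac{102(54900) - (510 \times 7140)}{102(740200) - (7140)^2} = \frac{1958400}{24520800} \approx 0.08 \]

\[ b_{yx} = \frac{N\sum xy - \sum x \sum y}{N\sum x^2 - (\sum x)^2} = \frac{102(54900) - (510 \times 7140)}{102(4150) - (510)^2} = \frac{1958400}{163200} = 12 \]

The two regression equations are
a) X on Y: $X - \bar{X} = b_{xy}(Y - \bar{Y})$, that is, X - 5 = 0.08(Y - 70), or X = 0.08Y - 0.6,

where $\bar{X} = 510/102 = 5$ and $\bar{Y} = 7140/102 = 70$.

b) Y on X: $Y - \bar{Y} = b_{yx}(X - \bar{X})$, that is, Y - 70 = 12(X - 5), or Y = 12X + 10.
When X = 7, Y = 12(7) + 10 = 94.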
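
The two regression coefficients and the intercepts of both lines follow directly from the summary sums, as the sketch below shows. The function name and the dictionary keys are hypothetical. The sums are those of SAQ 7, with the means obtained as $\sum X / N$ and $\sum Y / N$.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Slopes and intercepts of both regression lines from N and the summary sums."""
    cross = n * sum_xy - sum_x * sum_y
    b_yx = cross / (n * sum_x2 - sum_x ** 2)  # slope of Y on X
    b_xy = cross / (n * sum_y2 - sum_y ** 2)  # slope of X on Y
    mean_x, mean_y = sum_x / n, sum_y / n
    return {
        "b_yx": b_yx,
        "a_yx": mean_y - b_yx * mean_x,  # Y = a_yx + b_yx * X
        "b_xy": b_xy,
        "a_xy": mean_x - b_xy * mean_y,  # X = a_xy + b_xy * Y
    }


# Sums from SAQ 7: N, sum X, sum Y, sum X^2, sum Y^2, sum XY.
fit = regression_from_sums(102, 510, 7140, 4150, 740200, 54900)
y_at_7 = fit["a_yx"] + fit["b_yx"] * 7
print(fit, y_at_7)
```

Both lines pass through the point of means (5, 70), which gives a quick check on the intercepts.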

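8.
          X      Y    dx = X - 3    dy = Y - 2    dx²    dy²    dxdy
          1      6        -2             4          4     16     -8
          5      1         2            -1          4      1     -2
          3      0         0            -2          0      4      0
          2      0        -1            -2          1      4      2
          1      1        -2            -1          4      1      2
          1      2        -2             0          4      0      0
          7      1         4            -1         16      1     -4
          3      5         0             3          0      9      0
Total    23     16        -1             0         33     36    -10

N = 8

\[ b_{xy} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dy^2 - (\sum dy)^2} = \frac{8(-10) - (-1 \times 0)}{8(36) - (0)^2} = \frac{-80}{288} = -\frac{5}{18} \approx -0.28 \]

\[ b_{yx} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dx^2 - (\sum dx)^2} = \frac{8(-10) - (-1 \times 0)}{8(33) - (-1)^2} = \frac{-80}{263} \approx -0.3 \]

The two regression equations are

a) X on Y: $X - \bar{X} = b_{xy}(Y - \bar{Y})$, that is, X - 2.875 = -0.28(Y - 2), or X = 3.435 - 0.28Y,

where $\bar{X} = 23/8 = 2.875$ and $\bar{Y} = 16/8 = 2$.

b) Y on X: $Y - \bar{Y} = b_{yx}(X - \bar{X})$, that is, Y - 2 = -0.3(X - 2.875), or Y = 2.8625 - 0.3X.

Coefficient of correlation: $r = \pm\sqrt{b_{xy} \times b_{yx}} = -\sqrt{0.28 \times 0.3} \approx -0.29$

Since the regression coefficients are negative, the correlation coefficient
will be negative.

Value of Y when X = 10:
Using the regression equation of Y on X, we get
Y = 2.8625 - 0.3(10) = 2.8625 - 3 = -0.1375

Value of X when Y = 2.5:
Using the regression equation of X on Y, we get
X = 3.435 - 0.28(2.5) = 3.435 - 0.700 = 2.735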
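
The assumed-mean working above can be sketched in Python. The function name is hypothetical, and the assumed means 3 and 2 are the ones used in the table. The correlation coefficient is taken as the geometric mean of the two regression coefficients, carrying their common sign.

```python
from math import copysign, sqrt


def regression_with_assumed_means(xs, ys, assumed_x, assumed_y):
    """Return (b_xy, b_yx, r) using deviations taken from assumed means."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    s_dx, s_dy = sum(dx), sum(dy)
    s_dx2 = sum(d * d for d in dx)
    s_dy2 = sum(d * d for d in dy)
    s_dxdy = sum(a * b for a, b in zip(dx, dy))
    cross = n * s_dxdy - s_dx * s_dy
    b_xy = cross / (n * s_dy2 - s_dy ** 2)
    b_yx = cross / (n * s_dx2 - s_dx ** 2)
    # Both coefficients share the sign of `cross`, so their product is non-negative.
    r = copysign(sqrt(b_xy * b_yx), b_yx)
    return b_xy, b_yx, r


# Data from SAQ 8, with deviations taken from X = 3 and Y = 2.
X = [1, 5, 3, 2, 1, 1, 7, 3]
Y = [6, 1, 0, 0, 1, 2, 1, 5]
b_xy, b_yx, r = regression_with_assumed_means(X, Y, 3, 2)
print(round(b_xy, 4), round(b_yx, 4), round(r, 4))
```

Because the code keeps the unrounded coefficients, predictions made from them differ slightly from the hand computation, which rounds the coefficients to -0.28 and -0.3 before substituting.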

Terminal Questions
1. The pair of judges A and C has the nearest approach to common likings in
music.
2. 0.915; there is a high degree of positive correlation between the ranks
assigned by the two managers.
3. r = 0.429
4. r = -0.991
5. 0.589
6. 0.584
7. bxy = -0.44, byx = -1.22. Regression equation of X on Y: X = 54.80 - 0.44Y.
Regression equation of Y on X: Y = 78.67 - 1.22X.
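
The answer to question 6 can be checked with the usual three-variable formula for the multiple correlation coefficient, $R_{3.12}^2 = (r_{31}^2 + r_{32}^2 - 2 r_{12} r_{13} r_{23}) / (1 - r_{12}^2)$. The sketch below assumes this is the formula intended in section 8.4; the function name is illustrative.

```python
from math import sqrt


def multiple_correlation(r_ab, r_ac, r_bc):
    """R_{a.bc}: multiple correlation of variable a with variables b and c."""
    r_squared = (r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc ** 2)
    return sqrt(r_squared)


# Values from Terminal Question 6.
r12, r13, r23 = 0.69, 0.45, 0.58
R3_12 = multiple_correlation(r13, r23, r12)  # a = 3, b = 1, c = 2
print(round(R3_12, 3))
```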
