5.1 Introduction
The statistical methods discussed so far have been concerned with a single
variable, such as the mean of the distribution of height or the standard
deviation of weight. There are, however, many situations in which we are
interested in the relationship between two or more variables occurring
together. For example, we may be interested in studying i) the effect of
various process parameters on the production process, ii) the influence of
rainfall on the yield of a certain crop, or iii) the impact of height and
weight on health. Two variables are said to be correlated if a relationship
exists between them. In this chapter, we introduce some statistical concepts
and techniques that are useful in analysing the relationship between such
variables.
There are two main problems involved in such studies:
First, the data may reveal some association between x and y, and we may
be interested in measuring numerically the strength of this association
between the variables. Such a measure will determine how well a linear or
other equation describes the relationship between the variables. This is the
problem of correlation.
Secondly, there may be one variable of particular interest, and the other
variable, regarded as an auxiliary variable, may be studied for the light it
can throw on the first one. In that case, one is interested in using a
mathematical equation for making estimates or predictions about the main
variable. This equation is known as the regression equation, and the problem
of making predictions on the basis of the equation is called the problem of
regression.
In short, correlation is concerned with measuring the “strength of
association between variables”, while regression is concerned with the
“prediction” of the most likely value of one variable when the value of the
other variable is known.
Objectives:
After studying this unit, you should be able to:
develop the concept of correlation and regression
explain the association between two variables in the scatter diagram
describe the properties of coefficient of correlation
discuss the coefficient of correlation.
Fig. 5.1: Measures of variation (scatter of Y against X with the fitted regression line Yc = a + bX)
In the table of heights, we see that when the mother's height is 63 inches,
the daughter's height is 66 inches, and when the mother's height is 65
inches, the daughter's height is 68 inches. This means that the variation in
the daughter's height from 66 inches to 68 inches is attributable to the
variation in the mother's height and hence is known as “explained variation”.
However, when the mother's height is 66 inches, the daughter's height
becomes 65 inches. This variation in the daughter's height cannot be
explained by the variation of the mother's height from the previous
observation, and hence is known as “unexplained variation”.
If all the points in the scatter diagram fell on the regression line, then all
variations in the value of Y will be attributable to the variations in the
corresponding values of X and there will be no unexplained variation.
Therefore, the total variation has two components, so that:
Total variation = explained variation + unexplained variation
The formulae for these variations are as follows:
a) Total variation = Σ(Y − Ȳ)² = ΣY² − (ΣY)²/n
b) Explained variation = Σ(Yc − Ȳ)² = aΣY + bΣXY − (ΣY)²/n
c) Unexplained variation = Σ(Y − Yc)² = ΣY² − aΣY − bΣXY
Where,
ΣX = summation of the X variable
ΣY = summation of the Y variable
ΣX² = summation of the squared X values
ΣY² = summation of the squared Y values
The coefficient of determination is then the ratio of the explained variation
to the total variation:
r² = Explained variation / Total variation = (aΣY + bΣXY − (ΣY)²/n) / (ΣY² − (ΣY)²/n)
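The decomposition above can be checked numerically. The following is a minimal Python sketch (ours, not part of the original text) using the mother/daughter heights from this unit's worked example; it verifies that the explained and unexplained variations sum to the total variation.

```python
X = [63, 65, 66, 67, 67, 68]   # mothers' heights (from this unit's example)
Y = [66, 68, 65, 67, 69, 70]   # daughters' heights
n = len(X)
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

# Least-squares coefficients for Yc = a + bX (formulae from this unit)
b = (n * sum_xy - sum(X) * sum_y) / (n * sum_x2 - sum(X) ** 2)
a = sum_y / n - b * sum(X) / n

total = sum_y2 - sum_y ** 2 / n                      # Σ(Y − Ȳ)²
explained = a * sum_y + b * sum_xy - sum_y ** 2 / n  # Σ(Yc − Ȳ)²
unexplained = sum_y2 - a * sum_y - b * sum_xy        # Σ(Y − Yc)²

print(total, explained, unexplained)  # 17.5 6.25 11.25
```

Note that 6.25 + 11.25 = 17.5, i.e. the identity "total = explained + unexplained" holds for this data set.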
Activity 1:
State in each case whether you would expect to find a positive
correlation, a negative correlation or no correlation:
i) The ages of husband and child
ii) Socks size and honesty
iii) Amount of rainfall and yield of rice
Activity 2:
An analyst wants to determine if there is any relationship between the
heights of the daughters and the heights of the mothers. The following table
shows the statistical data. Calculate the coefficient of correlation and
coefficient of determination.
Mother (X) Daughter (Y)
63 66
65 68
66 65
67 67
67 69
68 70
[Hints: Refer to 5.2.1 & 5.2.2; Ans: 0.597 & 0.357]
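As a cross-check on this activity, here is a small Python sketch (an illustration of ours, not from the text) that computes the coefficient of correlation with the summation formula r = (nΣXY − ΣXΣY) / √[(nΣX² − (ΣX)²)(nΣY² − (ΣY)²)] and squares it for the coefficient of determination; the helper name pearson_r is our own.

```python
import math

def pearson_r(X, Y):
    """Coefficient of correlation via the summation formula."""
    n = len(X)
    sx, sy = sum(X), sum(Y)
    sxy = sum(x * y for x, y in zip(X, Y))
    sxx = sum(x * x for x in X)
    syy = sum(y * y for y in Y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

r = pearson_r([63, 65, 66, 67, 67, 68], [66, 68, 65, 67, 69, 70])
print(r, r * r)  # r ≈ 0.5976, r² ≈ 0.357, matching the answers in the hint
```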
Yc = a + bX
Where, a and b are the two pieces of information which determine the
position of the line completely. Here, the parameter:
“a” determines the level of the fitted line at the Y-axis and is known as the
Y-intercept, and
“b” determines the slope of the regression line, that is, the change in Yc
per unit change in X.
The least-squares regression line satisfies two conditions:
1) Σ(Y − Yc) = 0
2) Σ(Y − Yc)² = minimum or least value
Where, Y is the observed value of the dependent variable for a given value
of X, and Yc is the computed value of the dependent variable for the same
value of X. The relation between Y and Yc is shown below (Fig. 5.6).
Fig. 5.6: Observed values Y and computed values Yc about the regression line; along the fitted line, Σ(Y − Yc) = 0 and Σ(Y − Yc)² is a minimum (least) value.
Since Yc = a + bX is the algebraic equation of the line, we must find the
values of a and b. These values of “a” and “b” are based upon the “least
squares” principle and are calculated according to the following formulae:
a = (ΣY · ΣX² − ΣX · ΣXY) / (n ΣX² − (ΣX)²)
And,
b = (n ΣXY − ΣX ΣY) / (n ΣX² − (ΣX)²)
The value of a can also be calculated easily, once the value of b has been
calculated as follows:
a = Ȳ − b X̄
Where Ȳ and X̄ are the simple arithmetic means of the Y data and the X data
respectively, and n represents the number of paired observations.
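These formulae translate directly into code. Below is a minimal Python sketch (our illustration, not part of the text; the function name fit_line is our own) of the least-squares fit:

```python
def fit_line(X, Y):
    """Return (a, b) for the least-squares line Yc = a + bX."""
    n = len(X)
    # b = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)
    b = (n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)) / \
        (n * sum(x * x for x in X) - sum(X) ** 2)
    # a = Ȳ − b·X̄
    a = sum(Y) / n - b * sum(X) / n
    return a, b

a, b = fit_line([63, 65, 66, 67, 67, 68], [66, 68, 65, 67, 69, 70])
print(a, b)  # 26.25 0.625, the values obtained in this unit's worked example
```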
5.3.3 Standard error of the estimate
We have found a line through the scatter points which best fits the data.
The closer the computed values Yc are to the observed values Y, the better
the fit. If the points in the scatter diagram are closely spaced around the
regression line, the estimated value will be close to the observed value of
Y, and hence the estimate can be considered highly reliable. Accordingly, a
measure of the variability of the scatter around the regression line
determines the reliability of the estimate Yc. The smaller this measure, the
more dependable the estimate will be.
This measure is known as the “standard error of the estimate” and is used to
determine the dispersion of the observed values of Y about the regression
line. Its formula is:
S(y,x) = √[ (ΣY² − aΣY − bΣXY) / (n − 2) ]
Where, Y = observed value of the dependent variable, and n = number of
paired observations.
Fig.: Scatter diagram of the heights of daughters (65 to 70 inches) against the heights of mothers (63 to 68 inches)
Yc = a + bX
Where b = (n ΣXY − ΣX ΣY) / (n ΣX² − (ΣX)²) and a = Ȳ − b X̄
X     Y     X²      XY      Y²
63    66    3969    4158    4356
65    68    4225    4420    4624
66    65    4356    4290    4225
67    67    4489    4489    4489
67    69    4489    4623    4761
68    70    4624    4760    4900
ΣX = 396   ΣY = 405   ΣX² = 26152   ΣXY = 26740   ΣY² = 27355
b = (6 × 26740 − 396 × 405) / (6 × 26152 − 396²) = 60/96 = 0.625
And, a = 405/6 − 0.625 × (396/6) = 67.5 − 41.25 = 26.25
Hence, the line of regression equation would be:
Yc = a + bX = 26.25 + 0.625X
For a mother's height of X = 70 inches, the estimated daughter's height is:
Yc = 26.25 + 0.625 × 70 = 70 inches
c) Now, S(y,x) = √[ (ΣY² − aΣY − bΣXY) / (n − 2) ] = √(11.25/4) = √2.8125 ≈ 1.677
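The standard error figure can be reproduced with a short Python sketch (ours, not the text's), reusing the totals from the table above; note that √2.8125 ≈ 1.677, which the unit's answer rounds slightly differently.

```python
import math

n = 6
sum_y, sum_xy, sum_y2 = 405, 26740, 27355   # totals from the table above
a, b = 26.25, 0.625                          # regression coefficients found above

# Unexplained variation = ΣY² − aΣY − bΣXY
unexplained = sum_y2 - a * sum_y - b * sum_xy
s_yx = math.sqrt(unexplained / (n - 2))
print(unexplained, round(s_yx, 3))  # 11.25 1.677
```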
5.4 Summary
In this unit, we first learnt the meaning of correlation and regression
and how one variable may depend on another. Correlation is termed the
“strength of association between variables”, while regression is described
as the “prediction” of the most likely value of one variable when the value
of the other variable is known.
In the second stage, we studied the types of correlation and the limiting
values of the correlation coefficient. In this stage, we also learnt the
measures of variation and the various formulae used to determine the
correlation coefficient and the coefficient of determination.
In the third stage, we discussed the linear regression equation and the
various forms of correlation analysis like positive, negative, curvilinear
and no relationship.
Finally, we learnt the process of calculating the standard error of the
estimate; a simple application has also been discussed in this unit.
5.5 Glossary
Correlation coefficient: A value that indicates whether a relationship
exists between two variables and how strong that relationship is.
Correlation analysis: A technique to determine the degree to which
variables are linearly related.
Curvilinear relationship: An association between two variables that is
described by a curved line.
Scatter diagram: A graph of points on a rectangular grid; the X and Y
coordinates of each point correspond to the two measurements made on
some particular sample element, and the pattern of points illustrates the
relationship between two variables.
Regression: It is a technique to determine the relationship between
variables.
Dependent variable: The variable that is being predicted or determined by
another variable.
Independent variable: The variable(s) used to predict the value of the
dependent variable.
Simple linear regression: When we have only one dependent variable and
one independent variable. The relationship is approximated via a straight
line.
Multiple regression: When we have two or more independent variables
used to estimate one dependent variable.
5.7 Answers
Answers to Self Assessment Questions
1. False
2. False
3. True
4. False
5. False
6. i) linear ii) independent iii) −1, +1
7. True
8. False
9. True
10. (d)
11. Curvilinear
Answers to Terminal Questions
1. Refer to 5.1 – The data may reveal some association between x and y,
and we may be interested to measure numerically the strength of this
association between the variables etc.
2. Refer to 5.2.1 – If all the points in the scatter diagram fell on the
regression line then all variations in the value of Y will be attributable to
the variations in the corresponding values of X and there will be no
unexplained variation.
3. Refer to 5.2.2 – The coefficient of determination r² is the square of
the coefficient of correlation, which measures the strength of the
relationship between two variables, etc.
4. Refer to 5.3 – A scatter diagram of the data helps in having a visual
idea about the nature of association between two variables, etc.
5. Refer to 5.3.3 – If the points in the scatter diagram are closely spaced
around the regression line, then the estimated value will be close to the
observed value of Y and hence this estimate can be considered as
highly reliable etc.