
Quantitative Techniques in Business Unit 5

Unit 5 Simple Correlation and Regression


Structure:
5.1 Introduction
Objectives
5.2 Correlation Analysis
Measures of variations
Coefficient of determination
5.3 Regression Analysis
The scatter diagram
The linear regression equation
Standard error of the estimate
5.4 Summary
5.5 Glossary
5.6 Terminal Questions
5.7 Answers

5.1 Introduction
The statistical methods discussed so far have been concerned with a single
variable, such as the mean of the distribution of height or the standard
deviation of weight. There are, however, many situations where we are
interested in the relationship between two or more variables occurring
together. For example, we may be interested in studying i) the effect of
various process parameters on the production process, ii) the influence of
rainfall on the yield of a certain crop, iii) the impact of height and weight
on health, etc. Two variables are said to be correlated if a relationship
exists between them. In this unit we will introduce some statistical concepts
and techniques which are useful in analyzing the relationship between such
variables.
There are two main problems involved in such studies:
First, the data may reveal some association between x and y, and we may
be interested to measure numerically the strength of this association
between the variables. Such a measure will determine how well a linear or
other equation describes the relationship between the variables. This is the
problem of correlation.
Secondly, there may be one variable of particular interest, and the other
variable, regarded as an auxiliary variable, may be studied for its possible
aid in throwing some light on the first one. In this case, one is
interested in using a mathematical equation for making estimates or
predictions regarding the main variable. This equation is known as the
regression equation, and the problem of making predictions on the basis of
the equation is called the problem of regression.

Sikkim Manipal University Page No. 79

In short, correlation is concerned with the measurement of the “strength of
association between variables”, while regression is concerned with the
“prediction” of the most likely value of one variable when the value of the
other variable is known.
Objectives:
After studying this unit, you should be able to:
• develop the concept of correlation and regression
• explain the association between two variables in the scatter diagram
• describe the properties of coefficient of correlation
• discuss the coefficient of correlation.

5.2 Correlation Analysis


Correlation analysis helps us to decide the strength of the linear relationship
between two variables. The word 'correlation' is used to denote the degree
of association between variables, and it is represented in terms of a
coefficient known as the correlation coefficient. If two variables x and y are
so related that variations in the magnitude of one variable tend to be
accompanied by variations in the magnitude of the other variable, they are
said to be correlated. If y tends to increase as x increases, the variables are
said to be positively correlated. If y tends to decrease as x increases, the
variables are negatively correlated. If the values of y are not affected by
changes in the values of x, the variables are said to be uncorrelated. In
1896, Karl Pearson developed an index or coefficient of this association for
cases where the relationship is a linear one, i.e. where the trend of the
relationship can be described by a straight line.
There are other indices for the degree of relationship between two variables
where the relationship is nonlinear. This section focuses only on the linear
relationship and Pearson's coefficient of correlation. The coefficient has
two characteristics:


1) The correlation coefficient ranges from -1 to +1. If there is no
relationship at all between two variables, for example, between the price of
petrol and the radioactive elements, then its value will be zero. On the
other hand, if the relationship is perfect, which means that all the points
on the scatter diagram fall on a straight line, then the value of the
correlation coefficient is +1 or -1, depending upon the direction of the
line. Other values of the correlation coefficient show an intermediate
degree of relationship between the two variables.
2) The sign of the coefficient can be positive or negative. The coefficient is
positive when the slope of the line is positive and it is negative when the
slope of the line is negative.
5.2.1 Measures of variations
Let us consider the following data on the heights (in inches) of mothers and
their daughters, which will be used to explain the concept of variation.

Mother (X)    Daughter (Y)
63            66
65            68
66            65
67            67
67            69
68            70
The computed value of Y is designated as Yc and is calculated from
the linear regression equation. The closeness between Y and Yc determines
the degree of correlation between X and Y, the correlation being perfect when
Y = Yc for all values of Y, so that there is no unexplained variation at all.
The other method that could be applied here is to use the arithmetic average
of the Y distribution, $\bar{Y}$, to estimate the value of the dependent
variable. However, we would expect a lot of deviation between these estimates
and the values of Y, and the total variation between the individual values of
Y and $\bar{Y}$ would be considerably high. Figure 5.1 shows the difference
between the representation of the data by $\bar{Y}$ and by Yc. The figure
shows that the regression line Yc is closer to most points than the line
represented by $\bar{Y}$ and hence gives a better estimate of Y.


Fig. 5.1: Measures of variation (regression line Yc = a + bX, with X on the
horizontal axis and Y on the vertical axis)

In the table of heights, we see that when the mother's height is 63 inches,
the daughter's height is 66 inches, and when the mother's height is 65
inches the daughter's height is 68 inches. This means that the variation in
the daughter's height from 66 inches to 68 inches is attributable to the
variation in the mother's height and hence is known as “explained variation”.
However, when the mother's height is 66 inches, the daughter's height
becomes 65 inches. This variation in the daughter's height cannot be
explained by the variation of the mother's height from the previous
observation, and hence is known as “unexplained variation”.
If all the points in the scatter diagram fell on the regression line, then all
variations in the value of Y would be attributable to the variations in the
corresponding values of X and there would be no unexplained variation.
Therefore, the total variation has two components, so that:
Total variation = explained variation + unexplained variation
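The decomposition can be checked numerically on the height data. The sketch below is only an illustration: it assumes the fitted values a = 26.25 and b = 0.625 obtained in Example 5.1 later in this unit, and the variable names are chosen for this example. The identity holds exactly only when Yc comes from the least-squares line.

```python
# Heights in inches from the mother/daughter table.
X = [63, 65, 66, 67, 67, 68]   # mothers
Y = [66, 68, 65, 67, 69, 70]   # daughters

# Assumption: intercept and slope taken from Example 5.1 of this unit.
a, b = 26.25, 0.625

n = len(Y)
y_bar = sum(Y) / n                      # arithmetic mean of Y
y_c = [a + b * x for x in X]            # computed values Yc = a + bX

total = sum((y - y_bar) ** 2 for y in Y)
explained = sum((yc - y_bar) ** 2 for yc in y_c)
unexplained = sum((y - yc) ** 2 for y, yc in zip(Y, y_c))

print(total, explained, unexplained)    # 17.5 6.25 11.25
```

Here 6.25 + 11.25 = 17.5, so the explained and unexplained parts add up to the total variation.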
The formulae for these variations are as follows:

a) Total variation:
$$\sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$

b) Explained variation:
$$\sum (Y_c - \bar{Y})^2 = a \sum Y + b \sum XY - \frac{(\sum Y)^2}{n}$$

c) Unexplained variation:
$$\sum (Y - Y_c)^2 = \sum Y^2 - a \sum Y - b \sum XY$$

The correlation coefficient (r) can be determined by the following
computational formula:

$$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n \sum X^2 - (\sum X)^2\right]\left[n \sum Y^2 - (\sum Y)^2\right]}}$$

where
n = number of paired observations
$\sum XY$ = sum of the individual products of the paired values of X and Y
$\sum X$ = sum of the X variable
$\sum Y$ = sum of the Y variable
$\sum X^2$ = the X values are squared and then summed
$(\sum X)^2$ = the X values are summed and then squared
$\sum Y^2$ = the Y values are squared and then summed
$(\sum Y)^2$ = the Y values are summed and then squared
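As an illustration, the computational formula can be applied to the mother and daughter heights from Section 5.2.1. This is a minimal sketch using only the Python standard library; the function name `pearson_r` is chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (the computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)   # squared, then summed
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

mothers = [63, 65, 66, 67, 67, 68]
daughters = [66, 68, 65, 67, 69, 70]

r = pearson_r(mothers, daughters)
print(round(r, 3))   # 0.598
```

The numerator works out to 60 and the denominator to the square root of 96 × 105, which gives r of about 0.598.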

Properties of correlation coefficient


i) The correlation coefficient is independent of the choice of both origin
and scale of observations.
ii) The correlation coefficient is a pure number. It is independent of the
units of measurement.
iii) The correlation coefficient lies in between -1 and +1.
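Properties i) and ii) can be illustrated on the height data. The sketch below re-expresses the heights in centimetres (change of scale) and measures them from 60 inches (change of origin); the helper `pearson_r` and the chosen constants are assumptions made for this example. Note that the scale factors are positive here: reversing the sign of only one variable reverses the sign of r, while reversing both leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (the computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )

X = [63, 65, 66, 67, 67, 68]   # mothers, inches
Y = [66, 68, 65, 67, 69, 70]   # daughters, inches

r_inches = pearson_r(X, Y)
# Change of scale: inches to centimetres.
r_cm = pearson_r([2.54 * x for x in X], [2.54 * y for y in Y])
# Change of origin: heights measured from 60 inches.
r_shifted = pearson_r([x - 60 for x in X], [y - 60 for y in Y])
# Reversing the sign of both variables, then of X only.
r_both_negated = pearson_r([-x for x in X], [-y for y in Y])
r_x_negated = pearson_r([-x for x in X], Y)

print(round(r_inches, 4), round(r_cm, 4), round(r_shifted, 4))
print(round(r_both_negated, 4), round(r_x_negated, 4))
```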
Uses
i) The coefficient of correlation is a measure of the degree of association
between two variables. For comparing two series of observations, it is
sometimes necessary to determine whether they are associated or not,
as a first step towards investigating relations of cause and effect. The
coefficient of correlation provides a numerical measure of such a
connection between them, e.g., whether rainfall is connected to the
yield of food grains. When the pattern of dots in the
scatter diagram is linear, the correlation coefficient can be considered
a useful measure of such a relationship. A positive value of the
correlation coefficient indicates that high values of one variable are
associated with high values of the other, and low values with
low values. When the correlation coefficient is negative, high values
of one variable are in general associated with low values of the
other.

ii) Again, the proportion of variation explained by regression is equal to
the square of the correlation coefficient, i.e.,

$r^2$ = proportion of variation explained by regression

The value of $r^2$, therefore, enables us to state the relative amount of
variation in the dependent variable which can be explained by the
regression equation.
iii) It helps in estimating the value of the dependent variable, when the
value of the independent variable is known. Thus, for a given x, the
estimated value of y is obtained from the regression equation of y on x.
iv) In educational and psychological measurements, coefficient of
correlation is used in problems of reliability and validity of tests.
Limitations
i) In linear correlation, it is assumed that there is a straight line
relationship between the variables. A small value of r, therefore,
indicates only a poor linear type of relationship between the
variables. Before using r as a measure of the degree of association
between two variables, it is advisable to draw a scatter diagram and
see the type of pattern.
ii) Again, a high value of r does not indicate that there is a direct
cause and effect relationship between the variables. The high value of
r may be generated solely because of the influence of a third
variable affecting both. In this case, the effect of the third variable
should be eliminated from the first two and then the partial
correlation coefficient between them can be found.
iii) Sometimes, two series of observations show
a high correlation coefficient even though there is no logical basis for
any relationship between them. For example, a statistician observed
a high correlation between the Indian population and the Pakistan
population, but it is hard to develop a theory as to why this should
be. Such correlation is said to be spurious correlation or nonsense
correlation. One should apply common sense in deciding
whether the association indicated by the value of r is real or
spurious.


iv) If the data are not reasonably homogeneous, the coefficient of
correlation may give misleading information about the strength of the
association. For example, if the scatter diagram shows the points in
separate clusters or groups, the correlation coefficient based on all
the groups taken together may be very high; yet if separate values of
r are calculated for each group, they may be close to zero. If some
reasonable basis can be found for separating the data into groups, it
is desirable to calculate values of r for each group.
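Limitation iv) can be seen in a small constructed example. The data below are invented purely for illustration (they are not from this unit): two clusters, each with no correlation inside it, produce a very high r when pooled. The helper `pearson_r` is again an assumed name.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (the computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )

# Hypothetical data: two well-separated clusters of four points each.
group_a_x, group_a_y = [1, 1, 2, 2], [1, 2, 1, 2]
group_b_x, group_b_y = [11, 11, 12, 12], [11, 12, 11, 12]

r_a = pearson_r(group_a_x, group_a_y)      # within cluster A
r_b = pearson_r(group_b_x, group_b_y)      # within cluster B
r_all = pearson_r(group_a_x + group_b_x,   # both clusters pooled
                  group_a_y + group_b_y)

print(r_a, r_b, round(r_all, 3))   # 0.0 0.0 0.99
```

The pooled coefficient is high only because the two clusters sit far apart, which is why a scatter diagram should be inspected before r is interpreted.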
5.2.2 Co-efficient of determination
The co-efficient of determination ($r^2$) is the square of the co-efficient of
correlation, which is the measure of strength of the relationship between two
variables. It is subject to more precise interpretation because it can be
presented as a proportion or as a percentage.
The coefficient of determination can be defined as the proportion of the
variation in the dependent variable Y that is explained by the independent
variable X, in the regression model.
Mathematically,

$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum (Y_c - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = \frac{a \sum Y + b \sum XY - \dfrac{(\sum Y)^2}{n}}{\sum Y^2 - \dfrac{(\sum Y)^2}{n}}$$
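As a worked illustration, the ratio can be evaluated for the mother and daughter heights. The sums ($\sum Y = 405$, $\sum XY = 26740$, $\sum Y^2 = 27355$, $n = 6$) and the coefficients ($a = 26.25$, $b = 0.625$) are taken from Example 5.1 later in this unit:

$$r^2 = \frac{26.25 \times 405 + 0.625 \times 26740 - \dfrac{405^2}{6}}{27355 - \dfrac{405^2}{6}} = \frac{27343.75 - 27337.5}{27355 - 27337.5} = \frac{6.25}{17.5} \approx 0.357$$

That is, about 35.7 % of the variation in the daughters' heights is explained by the regression on the mothers' heights, which agrees with the R-Sq value reported in Example 5.1.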

Activity 1:
State in each case whether you would expect to find a positive
correlation, a negative correlation or no correlation: i) The ages of
husband and child ii) Socks size and honesty iii) Amount of rainfall and
yield of rice


Activity 2:
An analyst wants to determine if there is any relationship between the
heights of the daughters and the heights of the mothers. The following table
shows the statistical data. Calculate the coefficient of correlation and
coefficient of determination.
Mother (X)    Daughter (Y)
63            66
65            68
66            65
67            67
67            69
68            70
[Hint: Refer to 5.2.1 and 5.2.2. Ans: r ≈ 0.598 and r² ≈ 0.357]

Self Assessment Questions


1. Coefficient of correlation between two variables X and Y depends upon
the units of measurement. (True/False)
2. If two variables are uncorrelated, they are independent. (True/False)
3. If two variables are independent, they are uncorrelated. (True/False)
4. If the correlation between X and Y is negative, then the correlation
between –X and –Y is positive. (True/False)
5. Coefficient of correlation between price and demand of a commodity is
-1.2. (True/False)
6. Fill in the blanks:
i) Coefficient of correlation is a measure of the strength of the
________________ relationship between two variables.
ii) Coefficient of correlation is ______ of the change of origin and scale.
iii) Coefficient of correlation lies between _____ and ________.

5.3 Regression Analysis


The term “regression” is used to denote estimation or prediction of the
average value of one variable for a specified value of the other variable. The
estimation is done by means of suitable equations, derived on the basis of
available bivariate data. Such an equation is known as a regression equation
and its geometrical representation is called a regression curve. In simple
linear regression, a mathematical regression equation is developed to
describe the functional relationship that exists between two variables X and
Y. This association is represented by plotting the values of paired
coordinates (X, Y) on a graph, with the dependent variable (Y) along the
vertical Y-axis and the independent variable (X) along the horizontal X-axis.
5.3.1 The Scatter diagram
When statistical data relating to the simultaneous measurements of two
variables are available, each pair of observations can be geometrically
represented by a point on the graph paper. The values of one variable
are shown along the X-axis and those of the other variable along the Y-axis.
If there are n pairs of observations, the graph paper will finally contain n
points. This diagrammatic representation of bivariate data is known as a
scatter diagram. A scatter diagram of the data gives a visual idea of the
nature of the association between two variables. If the points cluster along
a straight line, the association between the variables is linear. For
example, if the pattern of points on the scatter diagram shows a linear path
diagonally across the graph paper from the bottom left hand corner to the top
right hand corner, the correlation is positive (Fig. 5.2). If the linear path
runs from the top left hand corner to the bottom right hand corner, the
correlation is negative (Fig. 5.3). Further, if the points cluster along a
curve, the corresponding association is non-linear or curvilinear (Fig. 5.4).
Finally, if the points cluster neither along a straight line nor along a
curve (Fig. 5.5), there is an absence of any association between the
variables.

Fig. 5.2: Positive correlation


Fig. 5.3: Negative correlation

Fig. 5.4: Curvilinear

Fig. 5.5: No correlation


5.3.2 The linear regression equation


When the pattern of the scatter diagram indicates a linear relationship
between X and Y, this relationship can be described by a straight line
through the points. This line is known as the “line of regression”, and it
should be the most representative of the data. There are an infinite number
of lines that pass approximately through the pattern, and we are looking for
the one line that is most suitable as a representative of all the data. This
line is known as the “line of best fit”. The ideal line would be one that
passes through all the points. Since this is usually not possible, we must
find the line which is closest to all the points. A natural first idea is to
make the total of the differences between the points and the line a minimum.
However, points above the line give positive differences and points below the
line give negative differences, so these differences cancel each other and
their plain sum is not a valid measure of best fit. Squaring each difference
before summing eliminates the problem of positive and negative differences,
since the square of a negative difference is also positive; hence the total
sum of squares is always positive.
We therefore look for the line for which the sum of the squared differences
between the points and the line is a minimum. Hence, this method of finding
the line of best fit is known as the method of “least squares”. The algebraic
equation of this line is given below:

$$Y_c = a + bX$$

where a and b are the two parameters which determine the
position of the line completely. Here:
“a” determines the level of the fitted line at the Y-axis and is known as the
Y-intercept.

“b” determines the slope of the regression line, which is the change in Yc
per unit change in X.

X represents a given value of the independent variable.


Yc represents the computed value of the dependent variable.

The properties of the regression line are as follows:

1) $\sum (Y - Y_c) = 0$

2) $\sum (Y - Y_c)^2$ = minimum or least value
Where, Y is the observed value of the dependent variable for a given value
of X and Yc is the computed value of the dependent variable for the same
value of X. The relation between Y and Yc is shown below (Fig.5.6).

Fig. 5.6: Relationship between two variables

The above line AB is the line of best fit when:

$$\sum (Y - Y_c) = 0$$

and $\sum (Y - Y_c)^2$ = minimum or least value,

where Y is the actual observation and Yc is the corresponding computed
value based upon the method of least squares.
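Both conditions can be checked numerically for the height data. The sketch below assumes the fitted values a = 26.25 and b = 0.625 from Example 5.1; the function names and the small nudges applied to the line are chosen only for illustration.

```python
X = [63, 65, 66, 67, 67, 68]   # mothers, inches
Y = [66, 68, 65, 67, 69, 70]   # daughters, inches

def residuals(a, b):
    """Differences Y - Yc for the line Yc = a + bX."""
    return [y - (a + b * x) for x, y in zip(X, Y)]

def sum_of_squares(a, b):
    """Sum of squared differences between the points and the line."""
    return sum(e * e for e in residuals(a, b))

# Assumption: least-squares values taken from Example 5.1.
a_fit, b_fit = 26.25, 0.625

print(sum(residuals(a_fit, b_fit)))    # 0.0  (property 1)
print(sum_of_squares(a_fit, b_fit))    # 11.25

# Nudging the intercept or the slope in either direction
# always gives a larger sum of squares (property 2).
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.01), (0.0, -0.01)]
nudged = [sum_of_squares(a_fit + da, b_fit + db) for da, db in nudges]
print(all(s > sum_of_squares(a_fit, b_fit) for s in nudged))   # True
```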

Since Yc = a + bX is the algebraic equation of the line, we must find the
values of a and b. These values of “a” and “b”, based upon the “least
squares” principle, are calculated according to the following formulae:

$$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{n \sum X^2 - (\sum X)^2}$$

and

$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}$$

The value of a can also be calculated easily once the value of b has been
calculated, as follows:

$$a = \bar{Y} - b\bar{X}$$

where $\bar{Y}$ and $\bar{X}$ are the simple arithmetic means of the Y data and
the X data respectively, and n represents the number of paired observations.
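The formulae for a and b translate directly into code. The following is a minimal sketch in plain Python, applied to the mother and daughter heights used throughout this unit; the function name `fit_line` is an assumption made for this example.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line Yc = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b

X = [63, 65, 66, 67, 67, 68]   # mothers, inches
Y = [66, 68, 65, 67, 69, 70]   # daughters, inches

a, b = fit_line(X, Y)
print(a, b)                    # 26.25 0.625

# The shortcut a = mean(Y) - b * mean(X) gives the same intercept.
a_from_means = sum(Y) / len(Y) - b * sum(X) / len(X)
print(a_from_means)            # 26.25

# Computed height of a daughter whose mother is 70 inches tall.
print(a + b * 70)              # 70.0
```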
5.3.3 Standard error of the estimate
We have found a line through the scatter points which best fits the data.
The closer the observed values Y and the computed values Yc are to each
other, the better the fit. It means that if
the points in the scatter diagram are closely spaced around the regression
line, then the estimated value will be close to the observed value of Y and
hence this estimate can be considered highly reliable. Accordingly, a
measure of the variability of scatter around the regression line would
determine the reliability of the estimate Yc. The smaller this measure, the
more dependable the estimate will be.
This measure is known as the “standard error of the estimate” and is used to
determine the dispersion of observed values of Y about the regression line.

This measure is denoted by $S_{y,x}$ and is given by:

$$S_{y,x} = \sqrt{\frac{\sum (Y - Y_c)^2}{n - 2}}$$

where
Y = observed value of the dependent variable
Yc = corresponding computed value of the dependent variable
n = sample size
(n - 2) = degrees of freedom

Based upon the above relationship, a simple formula for calculating $S_{y,x}$
would be:

$$S_{y,x} = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{n - 2}}$$
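The two forms of the standard error can be compared on the height data. This sketch assumes the fitted values a = 26.25 and b = 0.625 from Example 5.1; the variable names are chosen for illustration.

```python
import math

X = [63, 65, 66, 67, 67, 68]   # mothers, inches
Y = [66, 68, 65, 67, 69, 70]   # daughters, inches
n = len(X)

# Assumption: least-squares values taken from Example 5.1.
a, b = 26.25, 0.625

# Definition: squared differences between observed and computed Y.
sum_sq_diff = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s_definition = math.sqrt(sum_sq_diff / (n - 2))

# Shortcut: uses only the raw sums of Y squared, Y and XY.
sum_y = sum(Y)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
s_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(round(s_definition, 3), round(s_shortcut, 3))   # 1.677 1.677
```

The shortcut form agrees with the definition only when a and b are the least-squares values.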


Example 5.1: An analyst wants to determine if there is any relationship
between the heights of the daughters and the heights of the mothers. The
following table shows the statistical data. a) Compute the regression line.
b) What would be the height of a daughter, if the mother's height is 70
inches? c) Determine the standard error.

Mother (X)    Daughter (Y)
63            66
65            68
66            65
67            67
67            69
68            70
Solution: The scatter diagram (Minitab result) shows an increasing trend
through which the line of the best fit can be established.
Regression Plot (Minitab output): C2 = 26.25 + 0.625 C1, with S = 1.67705,
R-Sq = 35.7 % and R-Sq(adj) = 19.6 %. The horizontal axis shows the height of
mothers and the vertical axis the height of daughters.

The line is identified by:

$$Y_c = a + bX$$

where

$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} \quad \text{and} \quad a = \bar{Y} - b\bar{X}$$


The following table shows the calculation:

        X      Y      X²      XY      Y²
        63     66     3969    4158    4356
        65     68     4225    4420    4624
        66     65     4356    4290    4225
        67     67     4489    4489    4489
        67     69     4489    4623    4761
        68     70     4624    4760    4900
Total   396    405    26152   26740   27355

That is, $\sum X = 396$, $\sum Y = 405$, $\sum X^2 = 26152$, $\sum XY = 26740$
and $\sum Y^2 = 27355$.

Then,

$$b = \frac{6 \times 26740 - 396 \times 405}{6 \times 26152 - 396 \times 396} = \frac{60}{96} = 0.625$$

and

$$a = \frac{405}{6} - 0.625 \times \frac{396}{6} = 67.5 - 41.25 = 26.25$$

Hence, the regression equation would be:

$$Y_c = a + bX = 26.25 + 0.625X$$

b) If the mother's height is 70 inches, then the calculated height of the
daughter would be:

$$Y_c = a + bX = 26.25 + 0.625 \times 70 = 70 \text{ inches}$$

c) Now,

$$S_{y,x} = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{n - 2}} = \sqrt{\frac{27355 - 26.25 \times 405 - 0.625 \times 26740}{4}} = \sqrt{\frac{11.25}{4}} \approx 1.677$$


Self Assessment Questions


7. If all the points on a scatter diagram appear to form a straight line going
downward from left to right, the correlation between the variables is
perfectly negative. (True/False)
8. Regression and correlation analysis are used to determine cause-effect
relationships. (True/False)
9. The regression line is derived from a sample, not the entire population.
(True/False)
10. For the estimating equation to be perfect estimator of the dependent
variable, which of these would have to be true?
a) The standard error of the estimate is zero.
b) All the data points are on the regression line.
c) The coefficient of determination is -1.
d) (a) and (b) but not (c)
e) All of these.
11. An association between two variables that is described by a curved line
is a __________ one.

5.4 Summary
• In this unit, at first, we learnt the meaning of correlation and regression
and how one variable depends on another. Correlation is
termed as the “strength of association between variables”, while
regression is described as the “prediction” of the most likely value of one
variable when the value of the other variable is known.
• In the second stage, we studied the types of correlation and the limiting
values of the correlation coefficient. In this stage, we learnt the measures
of variation and the various formulae used to determine the correlation
coefficient and the coefficient of determination.
• In the third stage, we discussed the linear regression equation and the
various patterns seen in a scatter diagram: positive, negative, curvilinear
and no relationship.
• Finally, we learnt how to calculate the standard error of the
estimate, and a simple application has also been discussed in this unit.


5.5 Glossary
Correlation coefficient: It is a value that will determine if there is a
relationship between two variables and the strength of it.
Correlation analysis: A technique to determine the degree to which
variables are linearly related.
Curvilinear relationship: An association between two variables that is
described by a curved line.
Scatter diagram: A graph of points on a rectangular grid; the X and Y
coordinates of each point correspond to the two measurements made on
some particular sample element, and the pattern of points illustrates the
relationship between two variables.
Regression: It is a technique to determine the relationship between
variables.
Dependent variable: The variable that is being predicted or determined by
another variable.
Independent variable: The variable(s) used to predict the value of the
dependent variable.
Simple linear regression: When we have only one dependent variable and
one independent variable. The relationship is approximated via a straight
line.
Multiple regression: When we have two or more independent variables
used to estimate one dependent variable.

5.6 Terminal Questions


1. Discuss the significance of the concept of correlation and regression.
2. Discuss the various types of variations.
3. Determine the coefficient of correlation.
4. What is scatter diagram? How does it help in ascertaining the nature
and degree of linear correlation between two variables?
5. What is standard error of the estimate?


5.7 Answers
Answers to Self Assessment Questions
1. False
2. False
3. True
4. False
5. False
6. i) linear ii) independent iii) -1, +1
7. True
8. False
9. True
10. (d)
11. Curvilinear
Answers to Terminal Questions
1. Refer to 5.1 – The data may reveal some association between x and y,
and we may be interested to measure numerically the strength of this
association between the variables etc.
2. Refer to 5.2.1 – If all the points in the scatter diagram fell on the
regression line then all variations in the value of Y will be attributable to
the variations in the corresponding values of X and there will be no
unexplained variation.
3. Refer to 5.2.2 – The co-efficient of determination ($r^2$) is the square of
the co-efficient of correlation which is the measure of strength of the
relationship between two variables etc.
4. Refer 5.3 – A scatter diagram of the data helps in having a visual idea
about the nature of association between two variables etc.
5. Refer to 5.3.3 – If the points in the scatter diagram are closely spaced
around the regression line, then the estimated value will be close to the
observed value of Y and hence this estimate can be considered as
highly reliable etc.
