
Lecture 8

Simple Linear Regression


In the analysis of experimental data it is frequently desirable
to know something about the relationship between two variables.
For example, we may be interested in studying the relationship
between height and weight of human males.
The nature and strength of the relationship between variables
such as height and weight may be examined by regression and
correlation analysis. These techniques, though related, serve
different purposes.
Regression analysis is useful in determining the possible form
of the relationship between variables so that ultimately we are
able to predict or estimate the value of one variable
corresponding to a given value of another variable.
Correlation analysis is concerned with measuring the strength of
the relationship between variables.
Regression
From a number of pairs of corresponding observations made on two
variables, X and Y say, we attempt to find the relationship
between them.
Very rarely the relationship between them is exact, with all the
observed values lying exactly on a given curve and exactly
satisfying its equation. Usually the relationship is not exact:
the points do not all lie exactly on a given curve, and repeated
observations made for the same value of X would not always give
the same value of Y but a range (or distribution) of Y values
corresponding to that value of X.
Regression analysis examines the equation relating the mean
value of this distribution of Y values to the given X value,
for all such X values in the bivariate population. X is often
called the regressor variable.
Simple Linear Regression
Assumptions
1. For any value of X the distribution of Y is Normal.
2. The variance, $\sigma^2$, of the distribution of Y is the same
for any chosen value of X.
3. The mean Y values corresponding to different X values lie on
a straight line.
Hence the population equation relating X and Y is:
$$\mu_{Y|x} = \alpha + \beta x$$
Here $\alpha$ and $\beta$ are population parameters and are constant.
Linear Model
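The individual Y values corresponding to any X value will in
general differ from the value predicted by the equation because
of the distribution of Y values about their mean. Hence the
individual Y values (i.e. $y_i$) corresponding to each $X = x_i$
are given by:
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
where the $\varepsilon_i$ are the error terms, which are normally
distributed with zero mean and variance equal to that of the
distribution of Y, i.e. $\sigma^2$.
The constants $\alpha$ and $\beta$ have to be estimated from the set
of sample observations, i.e. $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Hence the sample estimate of the linear relationship becomes
$$\hat{y} = a + bx$$
or
$$y_i = a + b x_i + e_i$$
where a and b are the sample estimates of $\alpha$ and $\beta$, and
$e_i$ is the deviation of the i-th observation from the fitted line.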
Least Squares Estimates Of $\alpha$ and $\beta$
Consider the following diagram.
Let $e_1, e_2, e_3, \ldots, e_n$ denote the deviations of the points
(in the y-direction) from the line; then
$$e_1 = y_1 - (a + b x_1), \quad e_2 = y_2 - (a + b x_2), \quad \ldots, \quad e_n = y_n - (a + b x_n)$$
Now a and b are chosen so that the sum of the squares of the
deviations of the plotted points from the predicted value has a
minimum value. If S represents this sum then
$$S = \sum e_i^2 = \sum (y_i - a - b x_i)^2$$
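For S to be a minimum we must have
$$\frac{\partial S}{\partial a} = -2 \sum (y_i - a - b x_i) = 0, \qquad \frac{\partial S}{\partial b} = -2 \sum x_i (y_i - a - b x_i) = 0$$
For the sake of clarity we can omit the subscript i and write
these equations as
$$\sum y = na + b \sum x$$
$$\sum xy = a \sum x + b \sum x^2$$
These equations are called the Normal Equations.
Solving these equations gives
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n}, \qquad a = \bar{y} - b\bar{x}$$
The equation $\hat{y} = a + bx$ is called the regression line of y on x.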
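The solution of the normal equations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_line` and the four sample points (which lie exactly on y = 1 + 2x) are hypothetical and are not taken from these notes.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    sum_x = sum(xs)
    sum_y = sum(ys)
    # Corrected sums of squares and products: S_xx and S_xy.
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b = s_xy / s_xx                  # slope: b = S_xy / S_xx
    a = sum_y / n - b * sum_x / n    # intercept: a = ybar - b * xbar
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x, so the fit recovers a = 1, b = 2.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Working from the corrected sums $S_{xx}$ and $S_{xy}$ mirrors the hand calculation, so intermediate values can be checked against a worked example line by line.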
Example
The following data are thought to be related by the equation y =
a + bx:
x: 0 1 2 3 4 5
y: 2.5 3.0 4.4 5.2 5.6 7.0
(a) Draw a scatter diagram and (b) obtain the regression
equation and plot it on the scatter diagram.
(a) [Scatter diagram of y against x]
(b) Now n = 6, $\sum x = 15$, $\sum x^2 = 55$, $\sum y = 27.7$ and $\sum xy = 84.8$.
$S_{xy} = 84.8 - (15)(27.7)/6 = 15.55$ and $S_{xx} = 55 - 15^2/6 = 17.5$,
so b = 15.55/17.5 = 0.889
and a = (27.7/6) - 0.889 × (15/6) = 2.394
Hence, the regression equation is:
$$\hat{y} = 2.394 + 0.889x$$
Evaluating The Regression Equation
When the regression equation has been obtained it must be
evaluated to determine whether it adequately describes the
relationship between the two variables and whether it can be
used effectively for prediction and estimation purposes.
The total variation can be split into two parts, one part due
to residuals and the second due to regression.
The deviation of $y_i$ from the mean of y can be divided into two
parts:
$$(y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
Now, the total sum of squared deviations is:
$$\sum (y_i - \bar{y})^2 = \sum \left[ (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \right]^2$$
By expansion of the right-hand side, it can be shown that:
$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$
i.e. TSS = SS(Residual) + SS(Regression)
Hence the total variation can be split into two parts, one due
to residual (or error) variation and the second due to
regression.

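Furthermore, it can be shown that
$$\text{SS(Regression)} = \frac{(S_{xy})^2}{S_{xx}}$$
so that
$$\text{SS(Residual)} = S_{yy} - \frac{(S_{xy})^2}{S_{xx}}.$$
Hence, SS(Residual) = TSS - SS(Regression),
where
$$\text{TSS} = S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}.$$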
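The decomposition TSS = SS(Residual) + SS(Regression) can be checked numerically. The sketch below is illustrative only: the function name `sums_of_squares` is hypothetical, and the data are the six (x, y) pairs of the first worked example in these notes. The residual sum of squares is computed directly from the fitted values so that the identity is verified, not assumed.

```python
def sums_of_squares(xs, ys):
    """Return (TSS, SS_regression, SS_residual) for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Residual sum of squares computed directly from the fitted values.
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Regression sum of squares from the shortcut formula (S_xy)^2 / S_xx.
    ss_regression = s_xy ** 2 / s_xx
    return s_yy, ss_regression, ss_residual


# Data from the first worked example.
x = [0, 1, 2, 3, 4, 5]
y = [2.5, 3.0, 4.4, 5.2, 5.6, 7.0]
tss, ssr, sse = sums_of_squares(x, y)
print(round(tss, 4), round(ssr, 4), round(sse, 4))
print(abs(tss - (ssr + sse)) < 1e-9)  # the two parts add up to TSS
```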
Properties Of b
After much mathematical manipulation it can be shown that
$E[b] = \beta$, which is perhaps to be expected, and also that the
variance is
$$\mathrm{Var}[b] = \frac{\sigma^2}{S_{xx}}$$
Since Y is Normally distributed (with variance $\sigma^2$) it follows
that b is also Normally distributed, i.e. $b \sim N(\beta, \sigma^2 / S_{xx})$.
Hence, the quantity
$$\frac{b - \beta}{\sigma / \sqrt{S_{xx}}}$$
is a standard normal variable, where the unknown variance, $\sigma^2$,
can be replaced by its estimate,
$$s^2 = \frac{\text{SS(Residual)}}{n - 2}$$
T statistic = $(b - \beta)/s_b$, where $s_b = s / \sqrt{S_{xx}}$.
When the sample variance is used, the variable follows a
t-distribution with n - 2 degrees of freedom (n - 2 since a and b
have been estimated from the data).
Similarly,
$$\frac{a - \alpha}{\sigma \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}}$$
is a standard normal variable; when the sample variance is used
in place of $\sigma^2$, it follows a t-distribution with (n - 2)
degrees of freedom.
These distributions can be used to
1. form confidence intervals for the value of the slope and
intercept.
2. test hypotheses about the slope and intercept
(usually $\beta = 0$).
3. test the slopes (or intercepts) of two regression lines to
see if they are equal.
Example
An experiment was conducted to study the effect of a certain
drug in lowering heart rate in adults. The independent variable
is dosage (mg) and the dependent variable is the difference
between lowest rate following administration of the drug and a
predrug control. The following data were collected.
Dose (mg)    Reduction in heart rate (beats/min)
x            y
0.50         10
0.75         8
1.00         12
1.25         12
1.50         14
1.75         12
2.00         16
2.25         18
2.50         17
2.75         20
3.00         18
3.25         20
3.50         21
(a) Obtain the regression equation, (b) Test the null
hypothesis $H_0: \alpha = 0$ and $\beta = 0$, (c) Form 95% Confidence
Intervals for $\alpha$ and $\beta$.
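Solution:
Now n = 13, $\sum y = 198$, $\sum y^2 = 3226$, $\sum xy = 442.5$, $\sum x = 26.00$ and
$\sum x^2 = 63.375$
$S_{xx} = 63.375 - 26^2/13 = 11.375$
$S_{yy} = 3226 - 198^2/13 = 210.308$
$S_{xy} = 442.5 - (26)(198)/13 = 46.5$
b = 46.5/11.375 = 4.088
a = (198/13) - 4.088 × (26/13) = 7.055
Hence, the regression equation becomes:
$$\hat{y} = 7.055 + 4.088x$$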
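(b) Test of Hypotheses:
$H_0: \beta = 0$ vs. $H_1: \beta \neq 0$
SS(Residual) = SSE = 20.22
Estimated standard error of b = $s_b = \sqrt{0.1616} = 0.402$
t = b / $s_b$ = 4.088/0.402 = 10.17
$\nu = (n - 2) = 11$ and $t_{11}(0.0005) = 4.437$.
Since obs t > 4.437, $H_0$ is rejected in favor of $H_1$, i.e. there is
very strong evidence that $\beta \neq 0$.
$s_a^2 = 0.7878$; $s_a = \sqrt{0.7878} = 0.8876$
(c) A 95% Confidence Interval for $\alpha$ is:
$$a \pm t_{11}(0.025)\, s_a = 7.055 \pm 2.201 \times 0.8876 = (5.10,\ 9.01)$$
A 95% Confidence Interval for $\beta$ is:
$$b \pm t_{11}(0.025)\, s_b = 4.088 \pm 2.201 \times 0.402 = (3.20,\ 4.97)$$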
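The slope calculations in this example can be reproduced with a short Python sketch. It is an illustration only: the function name `slope_inference` is hypothetical, and the critical value 2.201 (the tabulated t value for 11 degrees of freedom with 0.025 in the upper tail) is passed in by hand because the sketch uses only the standard library.

```python
import math


def slope_inference(xs, ys, t_crit):
    """Return b, s_b, the t statistic for H0: beta = 0, and a CI for beta.

    t_crit is a tabulated t value for n - 2 degrees of freedom, supplied
    by the caller; this function does not look it up.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    sse = s_yy - s_xy ** 2 / s_xx            # SS(Residual)
    s_b = math.sqrt(sse / (n - 2) / s_xx)    # estimated standard error of b
    t_stat = b / s_b                         # test statistic under H0: beta = 0
    ci = (b - t_crit * s_b, b + t_crit * s_b)
    return b, s_b, t_stat, ci


# Dose (mg) and reduction in heart rate (beats/min) from the example.
dose = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00,
        2.25, 2.50, 2.75, 3.00, 3.25, 3.50]
reduction = [10, 8, 12, 12, 14, 12, 16, 18, 17, 20, 18, 20, 21]

b, s_b, t_stat, ci = slope_inference(dose, reduction, t_crit=2.201)
print(round(b, 3), round(s_b, 3), round(t_stat, 2))
print(tuple(round(v, 2) for v in ci))
```

Because the code carries full precision throughout, its results can differ in the last digit from a hand calculation that rounds at each step.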

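ANOVA Applied to Regression
This provides an alternative method for testing the null
hypothesis $H_0: \beta = 0$.
It can be shown that, when $H_0$ is true,
$$\frac{\text{SS(Regression)}}{\sigma^2} \sim \chi^2 \text{ with 1 d.f.}$$
and that
$$\frac{\text{SS(Residual)}}{\sigma^2} \sim \chi^2 \text{ with } (n - 2) \text{ d.f.}$$
Hence
$$F = \frac{\text{SS(Regression)}/1}{\text{SS(Residual)}/(n - 2)} = \frac{\text{MSR}}{\text{MSE}}$$
When the null hypothesis is true the
quotient follows the F-distribution with 1 degree of freedom
in the numerator and (n - 2) degrees of freedom in the
denominator.
It is interesting to note that the F-value obtained using this
method is equal to $t^2$ obtained using the t-test.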
The Anova table is:
Source           Sum of Squares              d.f.    Mean Square   F Value
Regression       SS(Regression) or SSR       1       MSR           MSR / MSE
Residual/Error   SS(Residual/Error) or SSE   n - 2   MSE
Total            TSS                         n - 1
Example
Consider the previous example concerning the effect of a certain
drug in lowering heart rate. Test the null hypothesis
$H_0: \beta = 0$ using Anova.
n = 13; $S_{xx} = 11.375$, $S_{yy} = 210.308$, $S_{xy} = 46.5$,
TSS = $S_{yy}$ = 210.308
SSR = $(S_{xy})^2 / S_{xx}$ = 190.088
SSE = TSS - SSR = 20.22
The Anova table is:
Source           Sum of Squares    d.f.   Mean Square     F Value
Regression       SSR = 190.088     1      MSR = 190.088   MSR/MSE = 103.41
Residual/Error   SSE = 20.22       11     MSE = 1.8382
Total            TSS = 210.308     12
From table $F_{1,11}(0.001) = 19.69$. Since obs F > 19.69, $H_0$ is rejected;
there is very strong evidence that the slope is not zero.
Note that from the t-test we had obs t = 10.17 and $t^2 = 103.43 \approx F$
(the difference is due to rounding error).
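The ANOVA quantities, and the relation between F and $t^2$, can be checked with a sketch like the following. The function name `regression_anova` and the dictionary keys are hypothetical choices for this illustration; the data are the dose and heart-rate values from the example.

```python
def regression_anova(xs, ys):
    """Return the ANOVA quantities for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ssr = s_xy ** 2 / s_xx          # SS(Regression), 1 d.f.
    sse = s_yy - ssr                # SS(Residual), n - 2 d.f.
    msr = ssr / 1
    mse = sse / (n - 2)
    b = s_xy / s_xx
    t_squared = b ** 2 / (mse / s_xx)   # (b / s_b)^2 with s_b^2 = MSE / S_xx
    return {"TSS": s_yy, "SSR": ssr, "SSE": sse,
            "MSR": msr, "MSE": mse, "F": msr / mse,
            "t_squared": t_squared, "df": (1, n - 2)}


dose = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00,
        2.25, 2.50, 2.75, 3.00, 3.25, 3.50]
reduction = [10, 8, 12, 12, 14, 12, 16, 18, 17, 20, 18, 20, 21]

table = regression_anova(dose, reduction)
for name in ("SSR", "SSE", "TSS", "MSE", "F", "t_squared"):
    print(name, round(table[name], 4))
```

Computed without intermediate rounding, F and $t^2$ agree to floating-point precision, which supports the remark that the small discrepancy in the hand calculation is a rounding effect.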
Excel can be used to carry out all of the above calculations.
From the main menu select Tools, then Data Analysis, and then
Regression to open the Regression dialog box.

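THE COEFFICIENT OF DETERMINATION
The coefficient of determination, denoted by $r^2$, represents
the proportion of TSS that is explained by the use of the
regression model. The computational formula for $r^2$ is:
$$r^2 = \frac{b\, SS_{xy}}{SS_{yy}}, \qquad 0 \le r^2 \le 1$$
Here $SS_{xy}$ and $SS_{yy}$ are the quantities written $S_{xy}$ and $S_{yy}$
earlier, so that $b\, SS_{xy}$ = SS(Regression) and $SS_{yy}$ = TSS.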
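As a brief illustration, $r^2$ can be computed from the same corrected sums used for the slope. The function name `r_squared` is hypothetical, and the data are the dose and heart-rate values from the drug example in these notes.

```python
def r_squared(xs, ys):
    """Coefficient of determination r^2 = b * SS_xy / SS_yy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx
    return b * ss_xy / ss_yy


dose = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00,
        2.25, 2.50, 2.75, 3.00, 3.25, 3.50]
reduction = [10, 8, 12, 12, 14, 12, 16, 18, 17, 20, 18, 20, 21]

r2 = r_squared(dose, reduction)
print(round(r2, 4))  # proportion of TSS explained by the regression
```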
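MEAN, STANDARD DEVIATION, AND SAMPLING DISTRIBUTION OF b:
$$\mu_b = \beta \quad \text{and} \quad \sigma_b = \frac{\sigma_\varepsilon}{\sqrt{SS_{xx}}} \quad \text{or} \quad s_b = \frac{s_e}{\sqrt{SS_{xx}}}$$
where $s_e = \sqrt{\text{SSE}/(n - 2)}$ is the estimated standard deviation
of the errors.
CONFIDENCE INTERVAL FOR $\beta$
The $(1 - \alpha)100\%$ confidence interval for $\beta$ is given by
$$b \pm t\, s_b$$
where $s_b = s_e / \sqrt{SS_{xx}}$ and the value of t is obtained
from the t distribution table for $\alpha/2$ area in the right tail
of the t distribution and n - 2 degrees of freedom. (Here $\alpha$
denotes the significance level, not the intercept.)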
TEST STATISTIC FOR b:
$$t = \frac{b - \beta}{s_b}$$
The value of $\beta$ is substituted from the null hypothesis.
USING THE REGRESSION MODEL FOR ESTIMATING THE MEAN VALUE OF Y
Confidence Interval for $\mu_{y|x}$
The $(1 - \alpha)100\%$ confidence interval for $\mu_{y|x}$ for $x = x_0$ is
$$\hat{y} \pm t\, s_{\hat{y}_m}$$
where the value of t is obtained from the t distribution
table for $\alpha/2$ area in the right tail of the t distribution
curve and df = n - 2. The value of $s_{\hat{y}_m}$ is calculated as
follows:
$$s_{\hat{y}_m} = s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}}$$
Prediction Interval for $y_p$:
$$\hat{y} \pm t\, s_{\hat{y}_p}$$
where:
$$s_{\hat{y}_p} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}}$$
and the value of t is obtained from the t distribution table
for $\alpha/2$ area in the right tail of the t distribution curve and
df = n - 2.
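The two intervals can be compared with a short sketch. Everything named here is an assumption made for illustration: the function name `intervals_at`, the choice $x_0 = 2.0$ mg, and the tabulated value t = 2.201 (11 degrees of freedom, 0.025 in the right tail) supplied by hand. The data are the dose and heart-rate values from the drug example.

```python
import math


def intervals_at(xs, ys, x0, t_crit):
    """Return (y_hat, CI for the mean of y at x0, prediction interval at x0)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))   # std. deviation of errors
    y_hat = a + b * x0
    spread = 1 / n + (x0 - x_bar) ** 2 / ss_xx
    s_ym = s_e * math.sqrt(spread)       # for the mean value of y at x0
    s_yp = s_e * math.sqrt(1 + spread)   # for a single predicted y at x0
    mean_ci = (y_hat - t_crit * s_ym, y_hat + t_crit * s_ym)
    pred_int = (y_hat - t_crit * s_yp, y_hat + t_crit * s_yp)
    return y_hat, mean_ci, pred_int


dose = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00,
        2.25, 2.50, 2.75, 3.00, 3.25, 3.50]
reduction = [10, 8, 12, 12, 14, 12, 16, 18, 17, 20, 18, 20, 21]

# x0 = 2.0 mg is a hypothetical dose chosen only for this illustration.
y_hat, mean_ci, pred_int = intervals_at(dose, reduction, x0=2.0, t_crit=2.201)
print(round(y_hat, 3))
print(tuple(round(v, 2) for v in mean_ci))
print(tuple(round(v, 2) for v in pred_int))
```

The prediction interval is always wider than the confidence interval for the mean, because it carries the extra 1 under the square root for the variability of a single observation.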
