
Statistics for Managers

using Microsoft Excel


3rd Edition
Chapter 11
Simple Linear Regression

© 2002 Prentice-Hall, Inc.

Chap 11-1

Chapter Topics

Types of regression models
Determining the simple linear regression equation
Measures of variation
Assumptions of regression and correlation
Residual analysis
Measuring autocorrelation
Inferences about the slope
Chap 11-2

Chapter Topics

(continued)

Correlation - measuring the strength of the association
Estimation of mean values and prediction of individual values
Pitfalls in regression and ethical issues

Chap 11-3

Purpose of Regression
Analysis

Regression analysis is used primarily to model causality and provide
prediction

Predicts the value of a dependent (response) variable based on the
value of at least one independent (explanatory) variable

Explains the effect of the independent variables on the dependent
variable
Chap 11-4

Types of Regression Models


Positive Linear Relationship

Negative Linear Relationship

Relationship NOT Linear

No Relationship

Chap 11-5

Simple Linear Regression Model

Relationship between variables is described by a linear function

The change of one variable causes the change in the other variable

A dependency of one variable on the other
Chap 11-6

Population Linear Regression


The population regression line is a straight line that describes the
dependence of the average value (conditional mean) of one variable on
the other.

Yi = β0 + β1 Xi + εi

  Yi   dependent (response) variable
  Xi   independent (explanatory) variable
  β0   population Y intercept
  β1   population slope coefficient
  εi   random error

μY|X = β0 + β1 Xi      (population regression line: the conditional mean)
Chap 11-7

Population Linear Regression

(continued)

Yi = β0 + β1 Xi + εi      (observed value of Y)

εi = random error

μY|X = β0 + β1 Xi         (conditional mean)
Chap 11-8

Sample Linear Regression


Sample regression line provides an
estimate of the population regression
line as well as a predicted value of Y
Yi = b0 + b1 Xi + ei

  b0   sample Y intercept
  b1   sample slope coefficient
  ei   residual

Sample regression line (fitted regression line, predicted value):

Ŷi = b0 + b1 Xi
Chap 11-9

Sample Linear Regression

(continued)

b0 and b1 are obtained by finding the values of b0 and b1 that
minimize the sum of the squared residuals:

  Σ (Yi − Ŷi)²  =  Σ ei²      (sums over i = 1, …, n)

b0 provides an estimate of β0
b1 provides an estimate of β1
Chap 11-10

Sample Linear Regression

(continued)

[Plot comparing the sample regression line Ŷi = b0 + b1 Xi (intercept b0,
slope b1, residual ei for an observed value Yi = b0 + b1 Xi + ei) with the
population regression line μY|X = β0 + β1 Xi, where Yi = β0 + β1 Xi + εi]
Chap 11-11

Interpretation of the
Slope and the Intercept

β0 = E(Y | X = 0) is the average value of Y when the value of X is zero.

β1 = ΔE(Y | X) / ΔX measures the change in the average value of Y as a
result of a one-unit change in X.

Chap 11-12

Interpretation of the
Slope and the Intercept

(continued)

b0, the estimate of E(Y | X = 0), is the estimated average value of Y
when the value of X is zero.

b1, the estimate of ΔE(Y | X) / ΔX, is the estimated change in the
average value of Y as a result of a one-unit change in X.
Chap 11-13

Simple Linear Regression:


Example
You want to examine the linear dependency of the annual sales of
produce stores on their size in square footage. Sample data for seven
stores were obtained. Find the equation of the sample regression line.

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Chap 11-14

Scatter Diagram: Example


[Excel scatter plot of Annual Sales ($000) vs. Square Feet for the
seven stores]
Chap 11-15

Equation for the Sample


Regression Line: Example
Ŷi = b0 + b1 Xi = 1636.415 + 1.487 Xi

From the Excel printout:

               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
Chap 11-16
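The same coefficients can be checked outside Excel. Below is a minimal Python/numpy sketch (not part of the original slides) that applies the least-squares formulas b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1 X̄ to the seven-store data; the variable names are illustrative.

```python
# Hedged sketch: reproduce the Excel coefficients with plain numpy.
import numpy as np

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)  # square feet
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)  # annual sales ($000)

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # sample slope
b0 = y_bar - b1 * x_bar                                            # sample intercept

print(b0, b1)  # approximately 1636.41 and 1.4866, matching the Excel printout
```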

Graph of the Sample Regression Line: Example

[Scatter plot of Annual Sales ($000) vs. Square Feet with the fitted line
Ŷi = 1636.415 + 1.487 Xi drawn through the seven data points]
Chap 11-17

Interpretation of Results:
Example
Ŷi = 1636.415 + 1.487 Xi
The slope of 1.487 means that for each increase of
one unit in X, we predict the average of Y to
increase by an estimated 1.487 units.
The model estimates that for each increase of one
square foot in the size of the store, the expected
annual sales are predicted to increase by $1487.
Chap 11-18

Simple Linear Regression in


PHStat

In Excel, use PHStat | regression | simple linear regression

Excel spreadsheet of the regression of sales on footage

Chap 11-19

Measure of Variation:
The Sum of Squares

SST = SSR + SSE

(Total sample variability = Explained variability + Unexplained variability)
Chap 11-20

Measure of Variation:
The Sum of Squares

(continued)

SST = total sum of squares
  Measures the variation of the Yi values around their mean Ȳ

SSR = regression sum of squares
  Explained variation attributable to the relationship between X and Y

SSE = error sum of squares
  Variation attributable to factors other than the relationship between
  X and Y
Chap 11-21

Measure of Variation:
The Sum of Squares

(continued)

SST = Σ (Yi − Ȳ)²
SSR = Σ (Ŷi − Ȳ)²
SSE = Σ (Yi − Ŷi)²
Chap 11-22
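As an illustration (a sketch, not the slides' Excel output), the three sums of squares for the produce-store data can be computed directly from these definitions; np.polyfit is used here only as a convenient way to get the fitted line.

```python
# Hedged sketch: compute SST, SSR, and SSE from their definitions.
import numpy as np

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)

b1, b0 = np.polyfit(x, y, 1)           # least-squares slope and intercept
y_hat = b0 + b1 * x                    # fitted values

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

print(sst, ssr, sse)  # SST = SSR + SSE (about 32.25e6, 30.38e6, 1.87e6 here)
```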

Explanatory Power of
Regression
[Venn diagram of Sizes and Sales]

Variations in store sizes not used in explaining variation in sales

Variations in sales explained by the error term (SSE)

Variations in sales explained by sizes, or variations in sizes used in
explaining variation in sales (SSR)
Chap 11-23

The ANOVA Table in Excel


ANOVA
              df       SS     MS                  F          Significance F
Regression    p        SSR    MSR = SSR/p         MSR/MSE    P-value of the F test
Residuals     n−p−1    SSE    MSE = SSE/(n−p−1)
Total         n−1      SST
Chap 11-24

Measures of Variation
The Sum of Squares: Example
Excel output for produce stores:

ANOVA
              df    SS             MS             F          Significance F
Regression     1    30380456.12    30380456.12    81.17909   0.000281201
Residual       5    1871199.595    374239.92
Total          6    32251655.71

Regression (explained) df = 1, error (residual) df = 5, total df = 6
SSR = 30380456.12, SSE = 1871199.595, SST = 32251655.71
Chap 11-25

The Coefficient of
Determination

r² = SSR / SST = Regression Sum of Squares / Total Sum of Squares

Measures the proportion of variation in Y that is explained by the
independent variable X in the regression model
Chap 11-26

Explanatory Power of
Regression

[Venn diagram of Sales and Sizes]

r² = SSR / (SSR + SSE)
Chap 11-27

Coefficients of Determination (r²) and Correlation (r)

[Four scatter plots of Y vs. X with the fitted line Ŷi = b0 + b1 Xi:
 r² = 1, r = +1;  r² = 1, r = −1;  r² = .8, r = +0.9;  r² = 0, r = 0]
Chap 11-28

Standard Error of Estimate

SYX = √[ SSE / (n − 2) ] = √[ Σ (Yi − Ŷi)² / (n − 2) ]

The standard deviation of the variation of observations around the
regression line
Chap 11-29

Measures of Variation:
Produce Store Example
Excel output for produce stores:

Regression Statistics
Multiple R            0.9705572
R Square              0.94198129      (r² = .94)
Adjusted R Square     0.93037754
Standard Error        611.751517      (SYX)
Observations          7

94% of the variation in annual sales can be explained by the
variability in the size of the store as measured by square footage
Chap 11-30
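For reference, r² and the standard error of the estimate can be recomputed from the sums of squares. The numpy sketch below is illustrative, not the textbook's output.

```python
# Hedged sketch: verify R Square and Standard Error from the ANOVA quantities.
import numpy as np

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)

r_sq = 1 - sse / sst             # equivalently SSR / SST
s_yx = np.sqrt(sse / (n - 2))    # standard error of the estimate

print(r_sq, s_yx)  # about 0.942 and 611.75, matching the Excel output
```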

Linear Regression
Assumptions

1. Normality
   Y values are normally distributed for each X
   Probability distribution of error is normal

2. Homoscedasticity (Constant Variance)

3. Independence of Errors
Chap 11-31

Variation of Errors around


the Regression Line
[Normal distributions of Y values, with equal spread, centered on the
sample regression line at X1 and X2]

Y values are normally distributed around the regression line.
For each X value, the spread or variance around the regression line is
the same.
Chap 11-32

Residual Analysis

Purposes

Examine linearity
Evaluate violations of assumptions

Graphical Analysis of Residuals

Plot residuals vs. Xi , Yi and time

Chap 11-33

Residual Analysis for


Linearity
[Scatter plots of Y vs. X and of residuals (e) vs. X for two cases:
Not Linear and Linear]
Chap 11-34

Studentized Residual

SRi = ei / ( SYX √(1 − hi) )

where  hi = 1/n + (Xi − X̄)² / Σ (Xi − X̄)²

Residual divided by its standard error
Standardized residual adjusted for the distance from the average X value
Allows us to normalize the magnitude of the residuals in units reflecting
the variation around the regression line
Chap 11-35
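A small sketch (illustrative, not from the slides) applying the hi and SRi formulas above to the produce-store residuals:

```python
# Hedged sketch: leverage values and studentized residuals per the formulas above.
import numpy as np

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)                       # residuals
s_yx = np.sqrt(np.sum(e ** 2) / (n - 2))    # standard error of the estimate

h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage h_i
sr = e / (s_yx * np.sqrt(1 - h))            # studentized residuals SR_i

print(np.round(sr, 2))
```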

Residual Analysis for


Homoscedasticity
[Plots of studentized residuals (SR) vs. X for two cases:
Heteroscedasticity and Homoscedasticity]
Chap 11-36

Residual Analysis:Excel
Output for Produce Stores
Example
Excel output:

Observation   Predicted Y     Residuals
1             4202.344417     -521.3444173
2             3928.803824     -533.8038245
3             5822.775103      830.2248971
4             9894.664688     -351.6646882
5             3557.14541      -239.1454103
6             4918.90184       644.0981603
7             3588.364717      171.6352829

[Residual plot: residuals vs. Square Feet]
Chap 11-37

Residual Analysis
for Independence

The Durbin-Watson Statistic

Used when data is collected over time to detect autocorrelation
(residuals in one time period are related to residuals in another
period)

Measures violation of the independence assumption:

D = Σ (ei − ei−1)² / Σ ei²     (numerator sum over i = 2, …, n;
                                denominator sum over i = 1, …, n)

Should be close to 2. If not, examine the model for autocorrelation.
Chap 11-38
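The store data are cross-sectional, so the Durbin-Watson test does not apply to them; the sketch below (illustrative only, with made-up residuals) simply shows how D would be computed from a time-ordered residual series.

```python
# Hedged sketch: Durbin-Watson statistic D for a time-ordered residual series.
import numpy as np

# hypothetical residuals, in time order (not from the produce-store example)
e = np.array([1.2, 0.8, -0.5, -1.1, 0.3, 0.9, -0.2, -0.7])

d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)  # D = sum(e_i - e_{i-1})^2 / sum(e_i^2)
print(d)  # values near 2 suggest no autocorrelation
```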

Durbin-Watson Statistic
in PHStat

PHStat | regression | simple linear


regression

Check the box for Durbin-Watson Statistic

Chap 11-39

Obtaining the Critical Values


of Durbin-Watson Statistic
Table 13.4  Finding critical values of the Durbin-Watson statistic (α = .05)

            p = 1              p = 2
  n       dL      dU        dL      dU
 15      1.08    1.36       .95    1.54
 16      1.10    1.37       .98    1.54
Chap 11-40

Using the
Durbin-Watson Statistic
H0: No autocorrelation (error terms are independent)
H1: There is autocorrelation (error terms are not independent)

  0 to dL:          Reject H0 (positive autocorrelation)
  dL to dU:         Inconclusive
  dU to 4−dU:       Accept H0 (no autocorrelation)
  4−dU to 4−dL:     Inconclusive
  4−dL to 4:        Reject H0 (negative autocorrelation)
Chap 11-41

Residual Analysis
for Independence
Graphical Approach
[Plots of residuals (e) vs. time: Not Independent (cyclical pattern)
and Independent (no particular pattern)]

Residual is plotted against time to detect any autocorrelation

Chap 11-42

Inference about the Slope:


t Test

t test for a population slope
Is there a linear dependency of Y on X?

Null and alternative hypotheses
  H0: β1 = 0 (no linear dependency)
  H1: β1 ≠ 0 (linear dependency)

Test statistic
  t = (b1 − β1) / Sb1      where  Sb1 = SYX / √[ Σ (Xi − X̄)² ]

  d.f. = n − 2
Chap 11-43
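A sketch (not from the slides) of the same t statistic for the produce-store data, using Sb1 = SYX / √Σ(Xi − X̄)²; the variable names are illustrative.

```python
# Hedged sketch: t test statistic for H0: beta1 = 0.
import numpy as np

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_yx = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
s_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of the slope

t = (b1 - 0) / s_b1                                  # test H0: beta1 = 0
print(s_b1, t)  # about 0.165 and 9.01, matching the Excel printout (df = n - 2 = 5)
```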

Example: Produce Store


Data for Seven Stores:
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Estimated regression equation:
Ŷi = 1636.415 + 1.487 Xi

The slope of this model is 1.487.
Is square footage of the store affecting its annual sales?
Chap 11-44

Inferences about the Slope:


t Test Example
H0: β1 = 0       Test statistic (from the Excel printout): t = b1 / Sb1 = 9.0099
H1: β1 ≠ 0
α = .05
df = 7 − 2 = 5

            Coefficients   Standard Error   t Stat   P-value
Intercept   1636.4147      451.4953         3.6244   0.01515
Footage     1.4866         0.1650           9.0099   0.00028

Critical values: ±2.5706 (rejection regions of .025 in each tail)

Decision: Reject H0
Conclusion: There is evidence that square footage affects
annual sales.
Chap 11-45

Inferences about the Slope:


Confidence Interval Example
Confidence interval estimate of the slope:

  b1 ± tn−2 · Sb1

Excel printout for produce stores:

               Lower 95%     Upper 95%
Intercept      475.810926    2797.01853
X Variable 1   1.06249037    1.91077694

At the 95% level of confidence, the confidence interval for the slope
is (1.062, 1.911). Does not include 0.
Conclusion: There is a significant linear dependency of annual sales
on the size of the store.
Chap 11-46
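The same interval can be reproduced with a short sketch (illustrative only); scipy.stats.t supplies the critical value t5 = 2.5706.

```python
# Hedged sketch: 95% confidence interval for the slope, b1 +/- t_{n-2} * S_b1.
import numpy as np
from scipy.stats import t as t_dist

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_yx = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
s_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))

t_crit = t_dist.ppf(0.975, df=n - 2)           # about 2.5706 for df = 5
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # roughly (1.062, 1.911)
```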

Inferences about the Slope:


F Test

F test for a population slope
Is there a linear dependency of Y on X?

Null and alternative hypotheses
  H0: β1 = 0 (no linear dependency)
  H1: β1 ≠ 0 (linear dependency)

Test statistic
  F = (SSR / 1) / (SSE / (n − 2))

Numerator d.f. = 1, denominator d.f. = n − 2
Chap 11-47
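For this simple regression the F statistic can be computed directly from SSR and SSE, and it should equal the square of the slope's t statistic. The sketch below is illustrative, not the slides' Excel output.

```python
# Hedged sketch: F = (SSR / 1) / (SSE / (n - 2)) for the produce-store data.
import numpy as np

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

f = (ssr / 1) / (sse / (n - 2))
print(f)  # about 81.18, i.e. the slope t statistic (about 9.01) squared
```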

Relationship between
a t Test and an F Test

Null and alternative hypotheses
  H0: β1 = 0 (no linear dependency)
  H1: β1 ≠ 0 (linear dependency)

  ( tn−2 )² = F1,n−2
Chap 11-48

Inferences about the Slope:


F Test Example
H0: β1 = 0       Test statistic (from the Excel printout): F = 81.179
H1: β1 ≠ 0
α = .05
Numerator df = 1, denominator df = 7 − 2 = 5

ANOVA
              df   SS             MS             F        Significance F
Regression     1   30380456.12    30380456.12    81.179   0.000281
Residual       5   1871199.595    374239.919
Total          6   32251655.71

Critical value: F1,n−2 = F1,5 = 6.61 at α = .05; reject H0 if F > 6.61

Decision: Reject H0
Conclusion: There is evidence that square footage affects annual sales.
Chap 11-49

Purpose of Correlation
Analysis

Correlation analysis is used to measure


strength of association (linear
relationship) between two numerical
variables

Only concerned with strength of the


relationship
No causal effect is implied

Chap 11-50

Purpose of Correlation
Analysis

(continued)

The population correlation coefficient ρ (rho) is used to measure the
strength of the association between the variables.

The sample correlation coefficient r is an estimate of ρ and is used
to measure the strength of the linear relationship in the sample
observations.

Chap 11-51

Sample of Observations from


Various r Values
[Scatter plots of Y vs. X illustrating r = −1, r = −.6, r = 0, r = .6,
and r = 1]
Chap 11-52

Features of ρ and r

Unit free
Range between -1 and 1
The closer to -1, the stronger the
negative linear relationship
The closer to 1, the stronger the positive
linear relationship
The closer to 0, the weaker the linear
relationship
Chap 11-53

Test for a Linear Relationship

Hypotheses
  H0: ρ = 0 (no correlation)
  H1: ρ ≠ 0 (correlation)

Test statistic
  t = r / √[ (1 − r²) / (n − 2) ]

where
  r = Σ (Xi − X̄)(Yi − Ȳ) / √[ Σ (Xi − X̄)² · Σ (Yi − Ȳ)² ]
Chap 11-54
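A sketch (illustrative, not the slides' Excel output) of this test for the produce-store data; the resulting t should match the slope t statistic from the earlier slide.

```python
# Hedged sketch: sample correlation r and the t statistic for H0: rho = 0.
import numpy as np

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(x)

sxy = np.sum((x - x.mean()) * (y - y.mean()))
r = sxy / np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

t = r / np.sqrt((1 - r ** 2) / (n - 2))
print(r, t)  # about 0.9706 and 9.01, with df = n - 2 = 5
```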

Example: Produce Stores


Is there any evidence of a linear relationship between the annual
sales of a store and its square footage at the .05 level of
significance?

From the Excel printout:

Regression Statistics
Multiple R            0.9705572
R Square              0.94198129
Adjusted R Square     0.93037754
Standard Error        611.751517
Observations          7

H0: ρ = 0 (no association)
H1: ρ ≠ 0 (association)
α = .05
df = 7 − 2 = 5
Chap 11-55

Example:
Produce Stores Solution
t = r / √[ (1 − r²) / (n − 2) ] = .9706 / √[ (1 − .9420) / 5 ] = 9.0099

Critical values: ±2.5706 (rejection regions of .025 in each tail)

Decision: Reject H0
Conclusion: There is evidence of a linear relationship at the 5%
level of significance.

The value of the t statistic is exactly the same as the t statistic
value for the test on the slope coefficient.
Chap 11-56

Estimation of Mean Values


Confidence interval estimate for μY|X=Xi, the mean of Y given a
particular Xi:

  Ŷi ± tn−2 · SYX · √[ 1/n + (Xi − X̄)² / Σ (Xi − X̄)² ]

(tn−2 is the t value from the table with df = n − 2; SYX is the
standard error of the estimate. The size of the interval varies
according to the distance of Xi from the mean X̄.)
Chap 11-57

Prediction of Individual
Values
Prediction interval for an individual response Yi at a particular Xi:

  Ŷi ± tn−2 · SYX · √[ 1 + 1/n + (Xi − X̄)² / Σ (Xi − X̄)² ]

The addition of one under the square root increases the width of the
interval relative to that for the mean of Y.
Chap 11-58

Interval Estimates
for Different Values of X
[Plot of the fitted line Ŷi = b0 + b1 Xi showing the confidence
interval for the mean of Y and the wider prediction interval for an
individual Yi at a given X]
Chap 11-59

Example: Produce Stores


Data for seven stores:
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Regression model obtained:
Ŷi = 1636.415 + 1.487 Xi

Predict the annual sales for a store with 2,000 square feet.
Chap 11-60

Estimation of Mean Values:


Example
Confidence interval estimate for μY|X=Xi

Find the 95% confidence interval for the average annual sales for a
2,000 square-foot store.

Predicted sales:  Ŷi = 1636.415 + 1.487 Xi = 4610.45 ($000)

X̄ = 2350.29      SYX = 611.75      tn−2 = t5 = 2.5706

Ŷi ± tn−2 · SYX · √[ 1/n + (Xi − X̄)² / Σ (Xi − X̄)² ]  =  4610.45 ± 612.66
Chap 11-61

Prediction Interval for Y :


Example
Prediction interval for an individual Y

Find the 95% prediction interval for the annual sales of a 2,000
square-foot store.

Predicted sales:  Ŷi = 1636.415 + 1.487 Xi = 4610.45 ($000)

X̄ = 2350.29      SYX = 611.75      tn−2 = t5 = 2.5706

Ŷi ± tn−2 · SYX · √[ 1 + 1/n + (Xi − X̄)² / Σ (Xi − X̄)² ]  =  4610.45 ± 1687.68
Chap 11-62
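Both intervals can be recomputed with one short sketch (illustrative only); scipy.stats.t supplies t5 = 2.5706, and the half-widths should come out near the slides' 612.66 and 1687.68 up to rounding.

```python
# Hedged sketch: 95% confidence interval for the mean of Y and 95% prediction
# interval for an individual Y at X = 2000 square feet.
import numpy as np
from scipy.stats import t as t_dist

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_yx = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
t_crit = t_dist.ppf(0.975, df=n - 2)

x_new = 2000.0
y_hat = b0 + b1 * x_new                          # point prediction, about 4610 ($000)
dist = 1 / n + (x_new - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

ci_half = t_crit * s_yx * np.sqrt(dist)          # half-width for the mean of Y, about 612.7
pi_half = t_crit * s_yx * np.sqrt(1 + dist)      # half-width for an individual Y, about 1687.7

print(y_hat, ci_half, pi_half)
```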

Estimation of Mean Values and


Prediction of Individual Values in
PHStat

In Excel, use PHStat | regression | simple linear regression

Check the confidence and prediction interval for X = box

Excel spreadsheet of the regression of sales on footage

Chap 11-63

Pitfalls of Regression
Analysis

Lacking an awareness of the assumptions


underlying least-squares regression
Not knowing how to evaluate the
assumptions
Not knowing the alternatives to least-squares regression if a
particular assumption is violated
Using a regression model without
knowledge of the subject matter
Chap 11-64

Strategies for Avoiding


the Pitfalls of Regression

Start with a scatter plot of X on Y to


observe possible relationship
Perform residual analysis to check the
assumptions
Use a histogram, stem-and-leaf
display, box-and-whisker plot, or
normal probability plot of the
residuals to uncover possible nonnormality
Chap 11-65

Strategies for Avoiding


the Pitfalls of Regression

(continued)

If there is violation of any assumption, use


alternative methods (e.g.: least absolute
deviation regression or least median of
squares regression) to least-squares
regression or alternative least-squares
models (e.g.: Curvilinear or multiple
regression)
If there is no evidence of assumption
violation, then test for the significance of
the regression coefficients and construct
confidence intervals and prediction intervals
Chap 11-66

Chapter Summary

Introduced types of regression models


Discussed determining the simple linear
regression equation
Described measures of variation
Addressed assumptions of regression
and correlation
Discussed residual analysis
Addressed measuring autocorrelation
Chap 11-67

Chapter Summary

(continued)

Described inference about the slope


Discussed correlation -- measuring the
strength of the association
Addressed estimation of mean values
and prediction of individual values
Discussed possible pitfalls in regression
and recommended a strategy to avoid
them
Chap 11-68
