
Statistics for Business and Economics

Chapter 11
Multiple Regression and Model Building
Learning Objectives
1. Explain the Linear Multiple Regression Model
2. Describe Inference About Individual Parameters
3. Test Overall Significance
4. Explain Estimation and Prediction
5. Describe Various Types of Models
6. Describe Model Building
7. Explain Residual Analysis
8. Describe Regression Pitfalls
Types of
Regression Models
Regression models
  1 explanatory variable: Simple (linear or non-linear)
  2+ explanatory variables: Multiple (linear or non-linear)
Models With Two or More
Quantitative Variables
Multiple Regression Model
General form:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

k independent variables
$x_1, x_2, \ldots, x_k$ may be functions of variables,
e.g. $x_2 = (x_1)^2$
Regression Modeling
Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random
error term
Estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation
Probability Distribution
of Random Error
Assumptions for Probability
Distribution of $\varepsilon$
1. Mean is 0
2. Constant variance, $\sigma^2$
3. Normally Distributed
4. Errors are independent
Linear Multiple Regression
Model
Types of
Regression Models
Explanatory variable:
  1 quantitative variable: 1st-order, 2nd-order, or 3rd-order model
  2 or more quantitative variables: 1st-order, interaction, or 2nd-order model
  1 qualitative variable: dummy-variable model
First-Order Multiple
Regression Model
Relationship between 1 dependent and 2 or
more independent variables is a linear function

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

$y$: dependent (response) variable
$x_1, \ldots, x_k$: independent (explanatory) variables
$\beta_0$: population y-intercept
$\beta_1, \ldots, \beta_k$: population slopes
$\varepsilon$: random error
First-Order Model With
2 Independent Variables
Relationship between 1 dependent and 2
independent variables is a linear function
Model
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Assumes no interaction between x1 and x2
Effect of x1 on E(y) is the same regardless of x2
values
Population Multiple
Regression Model
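Bivariate model:
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$
[Figure: response plane $E(y) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$ over the $(x_1, x_2)$ plane, with intercept $\beta_0$; the observed $y$ at $(x_{1i}, x_{2i})$ differs from the plane by $\varepsilon_i$.]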
Sample Multiple
Regression Model
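Bivariate model:
$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \hat{\varepsilon}_i$
[Figure: fitted response plane $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}$ over the $(x_1, x_2)$ plane, with intercept $\hat{\beta}_0$; the observed $y$ at $(x_{1i}, x_{2i})$ differs from the plane by the residual $\hat{\varepsilon}_i$.]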
No Interaction
$E(y) = 1 + 2x_1 + 3x_2$
  $x_2 = 3$: $E(y) = 1 + 2x_1 + 3(3) = 10 + 2x_1$
  $x_2 = 2$: $E(y) = 1 + 2x_1 + 3(2) = 7 + 2x_1$
  $x_2 = 1$: $E(y) = 1 + 2x_1 + 3(1) = 4 + 2x_1$
  $x_2 = 0$: $E(y) = 1 + 2x_1 + 3(0) = 1 + 2x_1$
[Figure: the four lines plotted as E(y) against $x_1$; all are parallel.]
Effect (slope) of $x_1$ on E(y) does not depend on $x_2$ value
Parameter Estimation
First-Order Model
Worksheet
Case, i yi x1i x2i
1 1 1 3
2 4 8 5
3 1 3 2
4 3 5 6
: : : :

Run regression with y, x1, x2


Multiple Linear
Regression Equations
Too complicated by hand! Ouch!
Interpretation of Estimated
Coefficients
1. Slope ($\hat{\beta}_k$)
   Estimated y changes by $\hat{\beta}_k$ for each 1-unit
   increase in $x_k$, holding all other variables
   constant
   Example: if $\hat{\beta}_1 = 2$, then sales (y) is expected to
   increase by 2 for each 1-unit increase in advertising
   ($x_1$), given the number of sales reps ($x_2$)
2. Y-Intercept ($\hat{\beta}_0$)
   Average value of y when every $x_k = 0$
1st Order Model Example

You work in advertising for the New York Times. You
want to find the effect of ad size (sq. in.) and newspaper
circulation (000) on the number of ad responses (00).
Estimate the unknown parameters.

You've collected the following data:

Resp (y)   Size (x1)   Circ (x2)
1          1           2
4          8           8
1          3           1
3          5           7
2          6           4
4          10          6
Parameter Estimation
Computer Output
Parameter Estimates
                Parameter   Standard   T for H0:
Variable   DF   Estimate    Error      Param=0     Prob>|T|
INTERCEP   1    0.0640      0.2599     0.246       0.8214
ADSIZE     1    0.2049      0.0588     3.484       0.0399
CIRC       1    0.2805      0.0686     4.089       0.0264

The Parameter Estimate column gives $\hat{\beta}_0$ (INTERCEP), $\hat{\beta}_1$ (ADSIZE), and $\hat{\beta}_2$ (CIRC):

$\hat{y} = .0640 + .2049 x_1 + .2805 x_2$
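The estimates in the output can be reproduced without a statistics package. The sketch below is one possible way to do it, assuming the six cases from the example and solving the normal equations directly; the helper names `solve` and `ols` are illustrative and not taken from any particular software.

```python
# Ad-response data from the example: y in hundreds, x1 in sq. in., x2 in thousands.
y = [1, 4, 1, 3, 2, 4]
x1 = [1, 8, 3, 5, 6, 10]
x2 = [2, 8, 1, 7, 4, 6]


def solve(a, b):
    """Solve the square system a * beta = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, response):
    """Least-squares coefficients for an intercept plus the given x columns."""
    design = [[1.0] + [float(c[i]) for c in columns] for i in range(len(response))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, response)) for j in range(p)]
    return solve(xtx, xty)


b0, b1, b2 = ols([x1, x2], y)
print(round(b0, 4), round(b1, 4), round(b2, 4))
```

In practice a regression routine would also report standard errors and p-values; this sketch only recovers the coefficient column.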


Interpretation of Coefficients
Solution
1. Slope ($\hat{\beta}_1$)
   Number of responses to ad is expected to
   increase by .2049 hundred (20.49 responses) for each 1 sq. in.
   increase in ad size, holding circulation constant

2. Slope ($\hat{\beta}_2$)
   Number of responses to ad is expected to increase
   by .2805 hundred (28.05 responses) for each 1-unit (1,000) increase in
   circulation, holding ad size constant
Estimation of $\sigma^2$
Estimation of $\sigma^2$

For a model with k independent variables

$s^2 = \frac{SSE}{n - (k + 1)}$, where $SSE = \sum (y_i - \hat{y}_i)^2$

$s = \sqrt{s^2} = \sqrt{\frac{SSE}{n - (k + 1)}}$
Calculating $s^2$ and $s$
Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq.
in.), $x_1$, and newspaper
circulation (000), $x_2$, on the
number of ad responses (00), y.
Find SSE, $s^2$, and $s$.
Analysis of Variance
Computer Output
Analysis of Variance

Source           DF   SS         MS         F       P
Regression       2    9.249736   4.624868   55.44   .0043
Residual Error   3    .250264    .083421
Total            5    9.5

SSE = .250264 (Residual Error SS); $s^2$ = .083421 (Residual Error MS)

$s^2 = \frac{.250264}{6 - 3} = .083421$
$s = \sqrt{.083421} = .2888$
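The arithmetic above can be checked with a few lines of code. This is a minimal sketch that takes SSE, n, and k from the ANOVA table; the function name `error_variance` is made up for illustration.

```python
import math


def error_variance(sse, n, k):
    """Return (s^2, s) for a model with k independent variables and n cases."""
    s2 = sse / (n - (k + 1))  # divide SSE by the error degrees of freedom
    return s2, math.sqrt(s2)


s2, s = error_variance(sse=0.250264, n=6, k=2)
print(f"s^2 = {s2:.6f}, s = {s:.4f}")
```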
Evaluating the Model
Evaluating Multiple
Regression Model Steps
1. Examine variation measures
2. Test parameter significance
Individual coefficients
Overall model
3. Do residual analysis
Variation Measures
Multiple Coefficient of
Determination
Proportion of variation in y explained by all x
variables taken together

$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = 1 - \frac{SSE}{SS_{yy}}$
Never decreases when new x variable is added to
model
Only y values determine SSyy
Disadvantage when comparing models
Adjusted Multiple Coefficient
of Determination
Takes into account n and number of
parameters
Similar interpretation to $R^2$

$R_a^2 = 1 - \left[\frac{n - 1}{n - (k + 1)}\right](1 - R^2)$


Estimation of $R^2$ and $R_a^2$
Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq. in.),
$x_1$, and newspaper circulation
(000), $x_2$, on the number of ad
responses (00), y. Find $R^2$ and
$R_a^2$.
[Excel computer output: $R^2$ and $R_a^2$ highlighted]
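Both measures follow from quantities already shown in the ANOVA output for this example (SSE = .250264 and total sum of squares 9.5, with n = 6 and k = 2). The sketch below simply applies the two definitions; the function names are hypothetical.

```python
def r_squared(sse, ss_yy):
    """Multiple coefficient of determination: 1 - SSE / SSyy."""
    return 1 - sse / ss_yy


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2, which penalizes for the number of parameters."""
    return 1 - (n - 1) / (n - (k + 1)) * (1 - r2)


r2 = r_squared(sse=0.250264, ss_yy=9.5)
ra2 = adjusted_r_squared(r2, n=6, k=2)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {ra2:.4f}")
```

The adjusted value is smaller because only 3 error degrees of freedom remain after fitting three parameters to six cases.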
Testing Parameters
Inference for an Individual
Parameter
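Confidence Interval
$\hat{\beta}_i \pm t_{\alpha/2}\, s_{\hat{\beta}_i}$, with df = n - (k + 1)

Hypothesis Test
$H_0: \beta_i = 0$
$H_a: \beta_i \neq 0$ (or < or >)
Test Statistic
$t = \frac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$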
Confidence Interval
Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq. in.),
x1, and newspaper circulation
(000), x2, on the number of ad
responses (00), y. Find a 95%
confidence interval for $\beta_1$.
[Excel computer output: $\hat{\beta}_1$ and $s_{\hat{\beta}_1}$ highlighted]
Confidence Interval
Solution
$.204921 \pm 3.182(.058822)$
$.0177 \le \beta_1 \le .3921$
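The interval is the estimate plus or minus a t multiple of its standard error. A minimal sketch, assuming the critical value 3.182 (t with area .025 in the upper tail, 3 df) is read from a t table rather than computed; `coefficient_interval` is an illustrative name.

```python
def coefficient_interval(estimate, std_error, t_crit):
    """Confidence interval: estimate +/- t_crit * standard error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


low, high = coefficient_interval(0.204921, 0.058822, t_crit=3.182)
print(f"{low:.4f} <= beta_1 <= {high:.4f}")
```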
Hypothesis Test Example
You work in advertising for the
New York Times. You want to find
the effect of ad size (sq. in.), x1, and
newspaper circulation (000), x2,
on the number of ad responses
(00), y. Test the hypothesis that the
mean ad response increases as
circulation increases (ad size
constant). Use $\alpha$ = .05.
Hypothesis Test
Solution
$H_0: \beta_2 = 0$
$H_a: \beta_2 > 0$
$\alpha$ = .05
df = 6 - 3 = 3
Critical Value(s): reject $H_0$ if t > 2.353 (upper-tail area .05)
[Excel computer output: $\hat{\beta}_2$, $s_{\hat{\beta}_2}$, the t statistic, and the p-value highlighted]
Test Statistic:
$t = \frac{\hat{\beta}_2}{s_{\hat{\beta}_2}} = \frac{.280492}{.068602} = 4.089$
Decision:
Reject at $\alpha$ = .05
Conclusion:
There is evidence the mean ad response increases as
circulation increases
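The decision rule for this one-sided test can be written out in a couple of lines. This sketch assumes the upper-tail critical value 2.353 (t with 3 df, area .05) comes from a t table; the names are hypothetical.

```python
def t_statistic(estimate, std_error):
    """Test statistic for H0: beta_i = 0."""
    return estimate / std_error


t = t_statistic(0.280492, 0.068602)
reject_h0 = t > 2.353  # upper-tail test at alpha = .05 with 3 df
print(round(t, 3), reject_h0)
```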
Testing Overall Significance
Shows if there is a linear relationship
between all x variables together and y
Hypotheses
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
No linear relationship
$H_a$: At least one coefficient is not 0
At least one x variable affects y
Testing Overall Significance
Test Statistic
$F = \frac{MS(\text{Model})}{MS(\text{Error})}$
Degrees of Freedom
$\nu_1 = k$, $\nu_2 = n - (k + 1)$
k = Number of independent variables
n = Sample size
Testing Overall
Significance Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq. in.),
x1, and newspaper circulation
(000), x2, on the number of ad
responses (00), y. Conduct the
global F-test of model
usefulness. Use $\alpha$ = .05.
Testing Overall Significance
Solution
$H_0: \beta_1 = \beta_2 = 0$
$H_a$: At least 1 not zero
$\alpha$ = .05
$\nu_1 = 2$, $\nu_2 = 3$
Critical Value(s): reject $H_0$ if F > 9.55 (upper-tail area .05)

Computer Output
Analysis of Variance

               Sum of    Mean
Source    DF   Squares   Square   F Value   Prob>F
Model     2    9.2497    4.6249   55.440    0.0043
Error     3    0.2503    0.0834
C Total   5    9.5000

Model DF = k; Error DF = n - (k + 1); the Mean Square column gives MS(Model) and MS(Error); Prob>F is the p-value.

Test Statistic:
$F = \frac{4.6249}{.0834} = 55.44$
Decision:
Reject at $\alpha$ = .05
Conclusion:
There is evidence at least 1
of the coefficients is not zero
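The F statistic can be rebuilt from the sums of squares in the ANOVA table. The sketch below uses the unrounded sums of squares from the earlier output (9.249736 and .250264) and takes the critical value 9.55 from an F table; `global_f` is an illustrative name.

```python
def global_f(ss_model, ss_error, n, k):
    """Global F statistic: MS(Model) / MS(Error)."""
    ms_model = ss_model / k              # numerator df = k
    ms_error = ss_error / (n - (k + 1))  # denominator df = n - (k + 1)
    return ms_model / ms_error


f = global_f(ss_model=9.249736, ss_error=0.250264, n=6, k=2)
reject_h0 = f > 9.55  # F critical value for alpha = .05 with (2, 3) df
print(round(f, 2), reject_h0)
```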
Interaction Models
Interaction Model With
2 Independent Variables
Hypothesizes interaction between pairs of x
variables
Response to one x variable varies at different
levels of another x variable
Contains two-way cross product terms
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
Can be combined with other models
Example: dummy-variable model
Effect of Interaction
Given:
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
Without interaction term, effect of $x_1$ on y is
measured by $\beta_1$
With interaction term, effect of $x_1$ on y is
measured by $\beta_1 + \beta_3 x_2$
Effect increases as $x_2$ increases (when $\beta_3 > 0$)
Interaction Model Relationships
$E(y) = 1 + 2x_1 + 3x_2 + 4x_1x_2$
  $x_2 = 1$: $E(y) = 1 + 2x_1 + 3(1) + 4x_1(1) = 4 + 6x_1$
  $x_2 = 0$: $E(y) = 1 + 2x_1 + 3(0) + 4x_1(0) = 1 + 2x_1$
[Figure: the two lines plotted as E(y) against $x_1$; they are not parallel.]
Effect (slope) of $x_1$ on E(y) depends on $x_2$ value
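A short sketch makes the changing slope concrete. It hard-codes the illustrative model E(y) = 1 + 2x1 + 3x2 + 4x1x2 from this slide; the function names are hypothetical.

```python
def expected_y(x1, x2, b=(1, 2, 3, 4)):
    """E(y) for the interaction model b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    b0, b1, b2, b3 = b
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def slope_in_x1(x2, b=(1, 2, 3, 4)):
    """Change in E(y) per 1-unit increase in x1 at a fixed x2: b1 + b3*x2."""
    return b[1] + b[3] * x2


for x2 in (0, 1):
    print(f"x2 = {x2}: intercept {expected_y(0, x2)}, slope {slope_in_x1(x2)}")
```

Setting the last coefficient to 0 reduces this to the no-interaction case, where the slope is the same at every value of x2.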
Interaction Model Worksheet

Case, i   yi   x1i   x2i   x1i x2i
1         1    1     3     3
2         4    8     5     40
3         1    3     2     6
4         3    5     6     30
:         :    :     :     :
Multiply x1 by x2 to get x1x2.
Run regression with y, x1, x2, x1x2
Interaction Example

You work in advertising for the


New York Times. You want to
find the effect of ad size (sq. in.),
x1, and newspaper circulation
(000), x2, on the number of ad
responses (00), y. Conduct a test
for interaction. Use $\alpha$ = .05.
Interaction Model Worksheet
yi x1i x2i x1i x2i
1 1 2 2
4 8 8 64
1 3 1 3
3 5 7 35
2 6 4 24
4 10 6 60
Multiply x1 by x2 to get x1x2.
Run regression with y, x1, x2 , x1x2
Excel Computer Output
Solution
Global F-test indicates at least one parameter is not zero
[Excel computer output: F statistic and p-value highlighted]
Interaction Test
Solution
$H_0: \beta_3 = 0$
$H_a: \beta_3 \neq 0$
$\alpha$ = .05
df = n - (k + 1) = 6 - 4 = 2 (the interaction model has k = 3 terms)
Critical Value(s): reject $H_0$ if t < -4.303 or t > 4.303 (area .025 in each tail)
[Excel computer output: $t = \hat{\beta}_3 / s_{\hat{\beta}_3}$ highlighted]
Test Statistic:
t = 1.8528
Decision:
Do not reject at $\alpha$ = .05
Conclusion:
There is no evidence of
interaction
Second-Order Models
Second-Order Model With
1 Independent Variable
Relationship between 1 dependent and 1
independent variable is a quadratic function
Useful 1st model if non-linear relationship
suspected
Model
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$
$\beta_1 x$: linear effect; $\beta_2 x^2$: curvilinear effect
Second-Order Model
Relationships
[Figure: four panels of y against $x_1$; two panels are labeled $\beta_2 > 0$ and two are labeled $\beta_2 < 0$.]
Second-Order Model
Worksheet
Case, i   yi   xi   xi^2
1         1    1    1
2         4    8    64
3         1    3    9
4         3    5    25
:         :    :    :
Create $x^2$ column.
Run regression with y, x, $x^2$.
2nd Order Model Example

The data show the number of weeks employed and the number
of errors made per day for a sample of assembly line
workers. Find a 2nd order model, conduct the global F-test, and
test if $\beta_2 \neq 0$. Use $\alpha$ = .05 for all tests.

Errors (y)   Weeks (x)
20           1
18           1
16           2
10           4
8            4
4            5
3            6
1            8
2            10
1            11
0            12
1            12
Second-Order Model
Worksheet
yi   xi   xi^2
20   1    1
18   1    1
16   2    4
10   4    16
:    :    :
Create $x^2$ column.
Run regression with y, x, $x^2$.
Excel Computer Output
Solution

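$\hat{y} = 23.728 - 4.784x + .242x^2$

Overall Model Test Solution
Global F-test indicates at least one parameter is not zero
[Excel computer output: F statistic and p-value highlighted]
$\beta_2$ Parameter Test Solution
$\beta_2$ test indicates curvilinear relationship exists
[Excel computer output: t statistic and p-value highlighted]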
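The fitted quadratic can be reproduced from the twelve cases by treating x and x squared as two regressors. The sketch below is one way to do that with centered sums of squares and cross-products; the helper names (`mean`, `cross`, `fit_quadratic`) are illustrative, and it recovers the coefficients only, not the F or t statistics.

```python
errors = [20, 18, 16, 10, 8, 4, 3, 1, 2, 1, 0, 1]  # errors per day (y)
weeks = [1, 1, 2, 4, 4, 5, 6, 8, 10, 11, 12, 12]   # weeks employed (x)


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x^2."""
    x_sq = [xi ** 2 for xi in x]  # the created x^2 column
    s11, s12, s22 = cross(x, x), cross(x, x_sq), cross(x_sq, x_sq)
    s1y, s2y = cross(x, y), cross(x_sq, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x) - b2 * mean(x_sq)
    return b0, b1, b2


b0, b1, b2 = fit_quadratic(weeks, errors)
print(f"y-hat = {b0:.3f} + ({b1:.3f}) x + {b2:.3f} x^2")
```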
Second-Order Model With
2 Independent Variables
Relationship between 1 dependent and 2
independent variables is a quadratic
function
Useful 1st model if non-linear relationship
suspected
Model
$E(y) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \beta_4 x_{1i}^2 + \beta_5 x_{2i}^2$
Second-Order Model
Relationships
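$E(y) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \beta_4 x_{1i}^2 + \beta_5 x_{2i}^2$
[Figure: three response surfaces of y over the $(x_1, x_2)$ plane, labeled $\beta_4 + \beta_5 > 0$, $\beta_4 + \beta_5 < 0$, and $\beta_3^2 > 4\beta_4\beta_5$.]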
Second-Order Model
Worksheet
Case, i   yi   x1i   x2i   x1i x2i   x1i^2   x2i^2
1         1    1     3     3         1       9
2         4    8     5     40        64      25
3         1    3     2     6         9       4
4         3    5     6     30        25      36
:         :    :     :     :         :       :

Multiply x1 by x2 to get x1x2; then create $x_1^2$, $x_2^2$.
Run regression with y, x1, x2, x1x2, $x_1^2$, $x_2^2$.
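The worksheet columns are simple transformations of x1 and x2. As a small sketch, the hypothetical helper below builds the five regressors for each case shown in the worksheet.

```python
def second_order_terms(x1, x2):
    """Regressors of the complete second-order model for one case."""
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]


cases = [(1, 3), (8, 5), (3, 2), (5, 6)]  # (x1, x2) pairs from the worksheet
rows = [second_order_terms(a, b) for a, b in cases]
for row in rows:
    print(row)
```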
Models With One Qualitative
Independent Variable
Dummy-Variable Model

Involves categorical x variable with 2 levels


e.g., male-female; college-no college
Variable levels coded 0 and 1
Number of dummy variables is 1 less than
number of levels of variable
May be combined with quantitative variable
(1st order or 2nd order model)
Dummy-Variable Model
Worksheet
Case, i yi x1i x2i
1 1 1 1
2 4 8 0
3 1 3 1
4 3 5 1
: : : :
x2 levels: 0 = Group 1; 1 = Group 2.
Run regression with y, x1, x2
Interpreting Dummy-
Variable Model Equation
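Given: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}$
y = Starting salary of college graduates
$x_1$ = GPA
$x_2$ = 0 if Male, 1 if Female
Male ($x_2 = 0$):
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2(0) = \hat{\beta}_0 + \hat{\beta}_1 x_{1i}$
Female ($x_2 = 1$):
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2(1) = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 x_{1i}$
Same slopes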
Dummy-Variable Model
Example
Computer Output: $\hat{y}_i = 3 + 5x_{1i} + 7x_{2i}$
$x_2$ = 0 if Male, 1 if Female
Male ($x_2 = 0$):
$\hat{y}_i = 3 + 5x_{1i} + 7(0) = 3 + 5x_{1i}$
Female ($x_2 = 1$):
$\hat{y}_i = 3 + 5x_{1i} + 7(1) = (3 + 7) + 5x_{1i}$
Same slopes
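The dummy variable only shifts the intercept, which a short sketch can confirm. It uses the fitted equation from this slide; the function name and the GPA values are made up for illustration.

```python
def predicted_salary(gpa, female, b0=3, b1=5, b2=7):
    """Fitted value from y-hat = 3 + 5*x1 + 7*x2, where x2 = 1 if female."""
    x2 = 1 if female else 0
    return b0 + b1 * gpa + b2 * x2


for gpa in (2, 3):
    gap = predicted_salary(gpa, True) - predicted_salary(gpa, False)
    print(f"GPA {gpa}: female minus male = {gap}")
```

The gap equals the dummy coefficient at every GPA, which is what "same slopes" means.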
Dummy-Variable Model
Relationships
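[Figure: y against $x_1$; two parallel lines with common slope $\hat{\beta}_1$: Male with intercept $\hat{\beta}_0$, Female with intercept $\hat{\beta}_0 + \hat{\beta}_2$.]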
Nested Models
Comparing Nested Models
A nested (reduced) model contains a subset of the terms in the complete (full) model
Comparing the two tests the contribution of a set of x variables to the
relationship with y
Null hypothesis $H_0: \beta_{g+1} = \cdots = \beta_k = 0$
Variables in set do not significantly improve the
model when all other variables are included
Used in selecting x variables or models
Part of most computer programs
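The slide does not show the test statistic. The standard nested-model F statistic compares the SSE of the reduced model (g terms) with that of the complete model (k terms), on k - g and n - (k + 1) degrees of freedom. The sketch below applies it to made-up numbers: the SSE values, n, k, and g are all hypothetical and do not come from the examples in this chapter.

```python
def nested_f(sse_reduced, sse_complete, n, k, g):
    """F statistic for H0: beta_{g+1} = ... = beta_k = 0."""
    numerator = (sse_reduced - sse_complete) / (k - g)  # drop in SSE per tested term
    denominator = sse_complete / (n - (k + 1))          # MSE of the complete model
    return numerator / denominator


# Hypothetical values chosen only to illustrate the arithmetic.
f = nested_f(sse_reduced=20.0, sse_complete=10.0, n=26, k=5, g=2)
print(round(f, 3))
```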
Selecting Variables
in Model Building
Selecting Variables in Model
Building
A butterfly flaps its wings in Japan, which causes it
to rain in Nebraska. -- Anonymous

Use Theory Only! Use Computer Search!


Model Building with
Computer Searches
Rule: Use as few x variables as possible
Stepwise Regression
Computer selects x variable most highly correlated
with y
Continues to add or remove variables depending on SSE
Best subset approach
Computer examines all possible sets
Residual Analysis
Residual Analysis

Graphical analysis of residuals


Plot estimated errors versus $x_i$ values
Estimated error: difference between actual $y_i$ and predicted $\hat{y}_i$
Estimated errors are called residuals
Plot histogram or stem-and-leaf display of residuals
Purposes
Examine functional form (linear v. non-linear
model)
Evaluate violations of assumptions
Residual Plot
for Functional Form

[Figure: residual plots of $\hat{\varepsilon}$ against x, in two panels: "Add $x^2$ Term" and "Correct Specification".]
Residual Plot
for Equal Variance

[Figure: residual plots of $\hat{\varepsilon}$ against x, in two panels: "Unequal Variance" and "Correct Specification".]
Fan-shaped.
Standardized residuals used typically.
Residual Plot
for Independence

[Figure: residual plots of $\hat{\varepsilon}$ against x, in two panels: "Not Independent" and "Correct Specification".]
Plots reflect sequence data were collected.


Residual Analysis
Computer Output
      Dep Var   Predict              Student
Obs   SALES     Value     Residual   Residual   -2-1-0 1 2
1     1.0000    0.6000    0.4000     1.044      | |** |
2     1.0000    1.3000    -0.3000    -0.592     | *| |
3     2.0000    2.0000    0          0.000      | | |
4     2.0000    2.7000    -0.7000    -1.382     | **| |
5     4.0000    3.4000    0.6000     1.567      | |*** |

Plot of standardized
(student) residuals
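The Residual column is just observed minus predicted. The Student Residual column divides each residual by its estimated standard deviation, which depends on the x values, and the output does not show them. The sketch below therefore ASSUMES a single predictor with x = 1, 2, 3, 4, 5; that assumption is not stated in the output and is used only because it is consistent with the printed values.

```python
import math

sales = [1.0, 1.0, 2.0, 2.0, 4.0]      # Dep Var column
predicted = [0.6, 1.3, 2.0, 2.7, 3.4]  # Predict Value column
x = [1, 2, 3, 4, 5]                    # ASSUMED predictor values (not in the output)

residuals = [obs - fit for obs, fit in zip(sales, predicted)]

n, k = len(sales), 1                   # one predictor under the assumption above
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - (k + 1)))     # estimated standard deviation of the error

x_bar = sum(x) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
leverage = [1 / n + (xi - x_bar) ** 2 / ss_xx for xi in x]

# Studentized residual: e_i / (s * sqrt(1 - h_i))
student = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]
print([round(r, 3) for r in student])
```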
Regression Pitfalls
Regression Pitfalls
Parameter Estimability
Number of different x-values must be at least one
more than order of model
Multicollinearity
Two or more x-variables in the model are correlated
Extrapolation
Predicting y-values outside sampled range
Correlated Errors
Multicollinearity

High correlation between x variables


Coefficients measure combined effect
Leads to unstable coefficients depending on
x variables in model
Always exists; matter of degree
Example: using both age and height as
explanatory variables in same model
Detecting Multicollinearity

Correlations between pairs of x variables are significant and
stronger than their correlations with the y variable
Nonsignificant t-tests for most of the
individual parameters, but overall model test
is significant
Estimated parameters have wrong sign
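A first check is the pairwise correlation between x variables. The sketch below computes it for ad size and circulation from the advertising example; `pearson_r` is an illustrative helper, and no cutoff for "too high" is implied.

```python
import math


def pearson_r(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


ad_size = [1, 8, 3, 5, 6, 10]     # x1 from the advertising example
circulation = [2, 8, 1, 7, 4, 6]  # x2 from the advertising example
r = pearson_r(ad_size, circulation)
print(round(r, 4))
```

Pairwise correlations can miss dependence that involves three or more x variables, so they are a starting point rather than a complete diagnostic.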
Solutions to Multicollinearity
Eliminate one or more of the correlated x
variables
Avoid inference on individual parameters
Do not extrapolate
Extrapolation

[Figure: y against x; interpolation lies within the sampled range of x, extrapolation lies on either side of it.]
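A simple guard against extrapolation is to compare a new x value with the range that was sampled. The sketch below uses the ad sizes from the advertising example; the function name and the new values 12 and 5 are made up for illustration.

```python
def is_extrapolation(x_new, sampled):
    """True when x_new falls outside the sampled range of x."""
    return not (min(sampled) <= x_new <= max(sampled))


ad_size = [1, 8, 3, 5, 6, 10]  # sampled ad sizes (sq. in.)
print(is_extrapolation(12, ad_size), is_extrapolation(5, ad_size))
```

With several x variables, checking each range separately is necessary but not sufficient, since a combination of values can be unlike anything sampled even when each value is in range.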
Conclusion
1. Explained the Linear Multiple Regression Model
2. Described Inference About Individual Parameters
3. Tested Overall Significance
4. Explained Estimation and Prediction
5. Described Various Types of Models
6. Described Model Building
7. Explained Residual Analysis
8. Described Regression Pitfalls
