
Chapter 4

Linear Regression with


One Regressor
('Simple Regression')
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation

2
Does Having Too Many Students
Per Teacher Lower Test Marks?

4
Scatterplots are Pictures of 1→1
Association

5
Is there a Number for This
Relationship?

6
What about Mean? Variance?

7
Treat this as a Dataset on Student-teacher ratio (STR), called 'X'

8
Treat this as a Dataset on Student-teacher ratio (STR), called 'X'
Imagine Falling Rain

9
Collapse onto 'X' (horizontal) axis

10
Ignore 'Y' (vertical)

11
Sample Mean

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

12
Sample Variance

S_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

13
Standard error/deviation is the
square root of the variance

S_x = \sqrt{S_x^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}

14
(It is very close to a typical departure of x from its mean:
'standard' = 'typical'; 'deviation/error' = departure from the mean)

S_x \approx \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}

15
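To make these formulas concrete, here is a minimal sketch in Python (the numbers are invented for illustration, not the California data) computing the three quantities just defined.

```python
# A minimal sketch: sample mean, variance, and standard deviation of a
# small, made-up STR sample (not the course's California dataset).
import numpy as np

x = np.array([19.0, 21.5, 20.3, 18.7, 22.1])   # hypothetical student-teacher ratios

x_bar = x.mean()          # sample mean: sum(x_i)/n
s2_x = x.var(ddof=1)      # sample variance: sum((x_i - x_bar)^2)/(n-1)
s_x = np.sqrt(s2_x)       # standard deviation: square root of the variance

print(x_bar, s2_x, s_x)
```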
Treat this as a Dataset on Test score, called 'Y'

16
Collapse onto Y axis

17
Calculate Mean and Variance

\bar{y}, \; S_y^2

18
Is there a Number for This
Relationship? Not Yet

19
Break up All Observations
into 4 Quadrants
[Figure: scatterplot split at (\bar{x}, \bar{y}) into quadrants II | I (top row) and III | IV (bottom row)]

20
Fill In the Signs of Deviations from
Means for Different Quadrants
Quadrant I: x_i - \bar{x} > 0 and y_i - \bar{y} > 0
Quadrant II: x_i - \bar{x} < 0 and y_i - \bar{y} > 0
Quadrant III: x_i - \bar{x} < 0 and y_i - \bar{y} < 0
Quadrant IV: x_i - \bar{x} > 0 and y_i - \bar{y} < 0

21
The Products are Positive in I and III

(x_i - \bar{x})(y_i - \bar{y}) > 0 in quadrants I and III

23
The Products are Negative in II and IV

(x_i - \bar{x})(y_i - \bar{y}) < 0 in quadrants II and IV

24
Sample Covariance, Sxy, describes
the Relationship between X and Y
S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
If Sxy > 0 most data lies in I and III:
This concurs with our visual common sense because
it looks like a positive relationship
If Sxy < 0 most data lies in II and IV
This concurs with our visual common sense because
it looks like a negative relationship
If S_xy = 0, the data is 'evenly spread' across I-IV
25
What About Our Data?
[Quadrant scatterplot of the STR-Test score data]

26
Large Negative Sxy

27
Large Positive Sxy

28
Zero Sxy

29
Our Data has a Mild Negative
Covariance Sxy<0
[Quadrant scatterplot of the STR-Test score data]

30
Correlation, rXY, is a Measure of
Relationship that is Unit-less
r_{XY} = \frac{S_{XY}}{S_X S_Y}

It can be proved that it lies between -1 and 1:
-1 \le r_{XY} \le 1
It has the same sign as SXY so ….

31
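A minimal sketch of these two formulas in Python, again with made-up numbers rather than the California data:

```python
# Sketch: sample covariance S_xy and the unit-free correlation r_xy.
import numpy as np

x = np.array([19.0, 21.5, 20.3, 18.7, 22.1])        # e.g. STR (illustrative)
y = np.array([660.0, 648.5, 655.2, 662.3, 645.1])   # e.g. test scores (illustrative)

s_xy = np.cov(x, y, ddof=1)[0, 1]   # (1/(n-1)) * sum((x_i - x_bar)(y_i - y_bar))
r_xy = np.corrcoef(x, y)[0, 1]      # S_xy / (S_x * S_y), always lies in [-1, 1]

print(s_xy, r_xy)
```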
Mild Negative Correlation
r_{XY} = -0.2264
[Quadrant scatterplot of the STR-Test score data]

32
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation

33
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation

But…
How much does Y change when X changes?
What is a good guess of Y if X =25?
What does correlation = -.2264 mean anyway?
34
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation

35
What is Simple Regression?

Simple regression allows us to answer all three questions:
“How much does Y change when X changes?”
“What is a good guess of Y if X =25?”
“What does correlation = -.2264 mean anyway?”
…by fitting a straight line to data
on two variables, Y and X.

Y = b_0 + b_1 X
36
Y = b_0 + b_1 X

37
We Get our Guessed Line Using '(Ordinary) Least Squares' [OLS]
OLS minimises the squared difference between a
regression line and the observations.
We can view these squared differences as squares.
This task then becomes the minimisation of the
area of the squares.

Applet: http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html

38
Measures of Fit
(Section 4.3)

The regression R2 can be seen from the applet


http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html
It is the proportional reduction in the sum of squares as one moves from modeling Y by a constant (with LS estimator being the sample mean, and sum of squares equal to the 'total sum of squares', TSS) to a line: R^2 = [TSS - 'sum of squares'] / TSS
If the model fits perfectly, 'sum of squares' = 0 and R^2 = 1
If the model does no better than a constant, it = TSS and R^2 = 0
The standard error of the regression (SER) measures the
magnitude of a typical regression residual in the units of Y.

39
The Standard Error of the
Regression (SER)
The SER measures the spread of the distribution of u. The SER
is (almost) the sample standard deviation of the OLS residuals:
SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (\hat{u}_i - \bar{\hat{u}})^2}

    = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat{u}_i^2}

(the second equality holds because \bar{\hat{u}} = \frac{1}{n}\sum_{i=1}^{n} \hat{u}_i = 0)

40
SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat{u}_i^2}

The SER:
has the units of u, which are the units of Y
measures the average “size” of the OLS residual (the average
“mistake” made by the OLS regression line)
Don't worry about the n-2 (instead of n-1 or n) – the reason is too technical, and doesn't matter if n is large.

41
How the Computer Did it
(SW Key Concept 4.2)

42
The OLS Line has a Small Negative
Slope

Estimated slope: \hat{\beta}_1 = -2.28
Estimated intercept: \hat{\beta}_0 = 698.9
Estimated regression line: \widehat{TestScore} = 698.9 - 2.28\,STR


43
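How the computer gets these two numbers can be reproduced by hand from the covariance and variance: b_1 = S_xy / S_x^2 and b_0 = \bar{y} - b_1\bar{x}. A sketch with illustrative numbers; run on the actual California data these formulas return 698.9 and -2.28.

```python
# Sketch: OLS slope and intercept for a simple regression, computed by hand.
import numpy as np

str_ = np.array([19.0, 21.5, 20.3, 18.7, 22.1])       # hypothetical STR values
score = np.array([660.0, 648.5, 655.2, 662.3, 645.1]) # hypothetical test scores

b1 = np.cov(str_, score, ddof=1)[0, 1] / np.var(str_, ddof=1)  # slope = S_xy / S_x^2
b0 = score.mean() - b1 * str_.mean()                           # intercept = y_bar - b1*x_bar

print(b0, b1)
```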
Interpretation of the estimated slope
and intercept
Test score= 698.9 – 2.28 STR
Districts with one more student per teacher on average have
test scores that are 2.28 points lower.
That is, \frac{\Delta TestScore}{\Delta STR} = -2.28
The intercept (taken literally) means that, according to this
estimated line, districts with zero students per teacher would
have a (predicted) test score of 698.9.
This interpretation of the intercept makes no sense – it
extrapolates the line outside the range of the data – here, the
intercept is not economically meaningful.
44
Remember Calculus?
Test score = 698.9 – 2.28 STR
Differentiation gives \frac{d\,TestScore}{d\,STR} = -2.28
'd' means 'infinitely small change', but for a 'very small change', called '\Delta', it will still be pretty close to the truth. So, an approximation is:
\frac{\Delta TestScore}{\Delta STR} = -2.28
How to interpret this? Take the denominator over to the other side:
\Delta TestScore = -2.28 \cdot \Delta STR
So, if STR goes up by one, Test score falls by 2.28.
If STR goes up by, say, 20, Test score falls by 2.28 × 20 = 45.6
45
Predicted values & residuals:

One of the districts in the data set is Antelope, CA, for which
STR = 19.33 and Test Score = 657.8
predicted value: \hat{Y}_{Antelope} = 698.9 - 2.28 \times 19.33 = 654.8
residual: \hat{u}_{Antelope} = 657.8 - 654.8 = 3.0
46
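A quick numeric check of the Antelope calculation above:

```python
# Check of the predicted value and residual for Antelope, CA.
b0, b1 = 698.9, -2.28
str_antelope, score_antelope = 19.33, 657.8

predicted = b0 + b1 * str_antelope        # 698.9 - 2.28*19.33 ≈ 654.8
residual = score_antelope - predicted     # 657.8 - 654.8 ≈ 3.0

print(round(predicted, 1), round(residual, 1))
```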
R 2 and SER evaluate the Model

TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6


By using STR, you only reduce the sum of squares by 5%
compared with just ‘modeling’ Test score by its average. That is,
STR only explains a small fraction of the variation in test scores.
The typical residual size is about 19 points (SER = 18.6), which looks large.
47
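A sketch of how R^2 and the SER would be computed from a fitted line's residuals (any y and \hat{y} arrays; the n-2 matches the simple-regression SER defined above):

```python
# Sketch: R^2 and SER from actual and fitted values of a simple regression.
import numpy as np

def r2_and_ser(y, y_hat):
    u_hat = y - y_hat                      # OLS residuals
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ssr = np.sum(u_hat ** 2)               # sum of squared residuals
    r2 = 1 - ssr / tss                     # proportional reduction in sum of squares
    ser = np.sqrt(ssr / (len(y) - 2))      # SER uses n - 2 in simple regression
    return r2, ser
```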
Seeing R2 and the SER in EVIEWs
Dependent Variable: TESTSCR
Method: Least Squares
Sample: 1 420
Included observations: 420
TESTSCR=C(1)+C(2)*STR

Coefficient Std. Error t-Statistic Prob.


C(1) 698.9330 9.467491 73.82451 0.0000
C(2) -2.279808 0.479826 -4.751327 0.0000

R-squared 0.051240 Mean dependent var 654.1565


Adjusted R-squared 0.048970 S.D. dependent var 19.05335
S.E. of regression 18.58097 Akaike info criterion 8.686903
Sum squared resid 144315.5 Schwarz criterion 8.706143
Log likelihood -1822.250 Durbin-Watson stat 0.129062

= 698.9 – 2.28.STR

48
Recap
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes?
What is a good guess of Y if X =25?
What does correlation = -.2264 mean anyway?

49
Recap
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes? \Delta Y = b_1 \Delta x
What is a good guess of Y if X = 25? b_0 + b_1(25)
What does correlation = -.2264 mean anyway?

50
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes? \Delta Y = b_1 \Delta x
What is a good guess of Y if X = 25? b_0 + b_1(25)
What does correlation = -.2264 mean anyway?
Surprise: R^2 = r_{XY}^2
51
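The 'surprise' can be checked numerically: in a simple regression with an intercept, R^2 equals the squared correlation between Y and X. A sketch with illustrative data:

```python
# Sketch verifying R^2 = r_xy^2 for a simple regression (made-up data).
import numpy as np

x = np.array([19.0, 21.5, 20.3, 18.7, 22.1])
y = np.array([660.0, 648.5, 655.2, 662.3, 645.1])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r_xy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r2, r_xy ** 2))   # True
```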
Chapter 5

Regression with a Single


Regressor: Hypothesis Tests
What is Simple Regression?
We‟ve used Simple regression as a means of
describing an apparent relationship between two
variables. This is called descriptive statistics.
Simple regression also allows us to estimate, and
make inferences, under the OLS assumptions,
about the slope coefficients of an underlying
model. We do this, as before, by fitting a straight
line to data on two variables, Y and X. This is
called inferential statistics.

54
The Underlying Model
(or „Population Regression Function‟)

Y_i = \beta_0 + \beta_1 X_i + u_i,   i = 1, …, n

X is the independent variable or regressor
Y is the dependent variable
\beta_0 = intercept
\beta_1 = slope
u_i = the regression error
The regression error consists of omitted factors, or possibly
measurement error in the measurement of Y. In general, these
omitted factors are other factors that influence Y, other than
the variable X
55
What Does it Look Like in This Case?

Y_i = \beta_0 + \beta_1 X_i + u_i,   i = 1, …, n

X is the STR
Y is the Test score
\beta_0 = intercept
\beta_1 = \frac{\Delta TestScore}{\Delta STR} = change in test score for a unit change in STR
If we also guess \beta_0 we can also predict Test score when STR has a particular value.
Clearly, we want good guesses (estimates) of \beta_0 and \beta_1.
56
A Picture is Worth 1000 Words

57
From Now on we Use 'b' or '\hat{\beta}' to Signify our Guesses, or 'Estimates', of the Slope or Intercept. We Never See the True Line.

[Figure: scatterplot with the fitted line b_0 + b_1 x and two residuals \hat{u}_1, \hat{u}_2 marked]
58
From Now on we Use 'b' or '\hat{\beta}' to Signify our Estimates of the Slope or Intercept, and \hat{u} for Guesses of u. We Never See the True Line or the u's.

[Figure: the same scatterplot, fitted line b_0 + b_1 x with residuals \hat{u}_1, \hat{u}_2]

59
Our Estimators are Really Random
Least squares estimators have a distribution; they are
different every time you take a different sample. (like an
average of 5 heights, or 7 exam marks)
The estimators are Random Variables. A random variable generates numbers with a central measure called a mean and a volatility called the standard error.
Least squares estimators b_0 & b_1 have means \beta_0 & \beta_1
Hypothesis testing:
  e.g. How do we test whether the slope \beta_1 is zero, or -37?
Confidence intervals:
  e.g. What is a reasonable range of guesses for the slope \beta_1?

60
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

61
62
Outline
1. OLS Assumptions (Very Technical)
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

63
Outline
1. OLS Assumptions (When will OLS be ‘good’?)
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

64
Estimator Distributions Depend on
Least Squares Assumptions
A key part of the model is the assumptions made
about the residuals ut for t=1,2….n.

1. E(ut)=0
2. E(ut2)=σ2 =SER2 (note: not σ2t – invariant)
3. E(utus)=0 t≠s
4. Cov(Xt, ut)=0
5. ut~Normal
65
SW has different assumptions;
Use mine for any Discussions
The conditional distribution of u given X has
mean zero, that is, E(u|X = x) = 0. (a combination
of 1. and 4.)
(Xi,Yi), i =1,…,n, are i.i.d. (unnecessary in many
applications)
Large outliers in X and/or Y are rare. (technical
assumption)

66
How reasonable
are these assumptions?
To answer, we need to understand them.

1. E(ut)=0
2. E(ut2)=σ2 =SER2 (note: not σ2t – invariant)
3. E(utus)=0 t≠s
4. Cov(Xt, ut)=0
5. ut~Normal

67
It‟s All About u

u is everything left out of the model

E(ut)=0
1. E(ut2)=σ2 =SER2 (note: not σ2t – invariant)
2. E(utus)=0 t≠s
3. Cov(Xt, ut)=0
4. ut~Normal

68
1. E(ut)=0 is not a big deal
Providing the model has a constant, this is not a
restrictive assumption.
If 'all the other influences' don't have a zero mean, the estimated constant will just adjust to the point where u does have a zero mean.
Really, \beta_0 + u could be thought of as everything else that affects y apart from x.

69
2. E(ut2)=σ2 =SER2 is Controversial

If this assumption holds, the errors are said to be


homoskedastic
If it is violated, the errors are said to be
heteroskedastic (hetero for short)
There are many conceivable forms of hetero, but
perhaps the most common is when the variance
depends upon the value of x

70
Hetero Related to X is very Common
Homoskedastic Heteroskedastic

71
Our Data Looks OK, but Don‟t be Complacent

Homoskedastic Heteroskedastic

72
3. E(utus)=0 t≠s

A violation of this is called autocorrelation


If the underlying model generates data for a time
series, it is highly likely that „left out‟ variables
will be autocorrelated (i.e. z depends on lagged z;
most time series are like this) and so u will be too.
But if the model describes a cross-section
assumption 3 is likely to hold.

73
Aside: Hetero and Auto are not a
Disaster
Hetero plagues cross-sectional data, Auto plagues
time series.
Remarkably, Heteroskedasticity and
Autocorrelation do not bias the Least Squares
Estimators.
This is a very strange result!

74
Hetero Doesn‟t Bias
y

Draw a least squares


line through these points

75
If You Could See the True Line
You‟d Realize hetero is bad for OLS

. y (a) homoskedasticity (b) heteroskedasticity

76
But OLS is Still Unbiased!
In case (b), OLS is still unbiased because the next draw
is just as likely to find the third error above the true
line, pulling up the (negative) slope of the least squares
line. On average, the true line would be revealed
with many samples.
y (a) (b)

x
But we will make an adjustment to our analysis later: OLS is no longer 'best', which means minimum variance.
77
Conquer Hetero and Auto with Just
One Click
SW recommend you correct standard errors for
hetero and auto. In EVIEWs you do this by:
estimate/options/heteroskedasticity consistent
coefficient covariance/ input: leave white ticked if
only worried about hetero. Tick Newey West to
correct for both.
Because OLS is unbiased, the correction only
occurs for the standard errors.
Sometimes, we will use standard errors corrected
in this way
78
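For readers working outside EViews, a comparable one-click correction exists in Python's statsmodels. This is a sketch with simulated heteroskedastic data, not part of the course's EViews workflow; the variable names and lag choice are illustrative.

```python
# Sketch: heteroskedasticity-robust (White) and HAC (Newey-West) standard errors
# with statsmodels, as an alternative to the EViews options described above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(20, 2, size=200)           # hypothetical regressor
u = rng.normal(0, 1, size=200) * x        # heteroskedastic error: spread grows with x
y = 700 - 2.3 * x + u

X = sm.add_constant(x)
ols = sm.OLS(y, X)

white = ols.fit(cov_type="HC1")                                  # White correction (hetero only)
newey_west = ols.fit(cov_type="HAC", cov_kwds={"maxlags": 4})    # corrects for hetero and auto

print(white.bse)        # robust standard errors
print(newey_west.bse)   # HAC standard errors
```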
4. Cov(Xt, ut)=0

This will be discussed extensively next lecture


When there is only one variable in a regression, it
is highly likely that that variable will be correlated
with a variable that is left out of the model, which
is implicitly in the error term.
Before proceeding with assumption 5, it is worth
stating that 1. – 4. are all that are required to prove
the so-called Gauss-Markov Theorem, that OLS
is Best Linear Unbiased Estimator (SW Section
5.5)
79
5. ut~Normal
With this assumption, OLS is minimum volatility
estimator among all consistent estimators.
Many variables are Normal
http://rba.gov.au/Statistics/AlphaListing/index.html
The assumption „delivers‟ a known distribution of OLS
estimators (a „t‟ distribution) if n is small. But if n is large
(>30) the OLS estimators become Normal, so it is
unnecessary. This is due to the Central Limit Theorem
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
http://www.rand.org/statistics/applets/clt.html
80
Assessment of Assumptions
1. E(ut)=0 harmless if model has constant
2. E(ut2)=σ2 =SER2 not too serious since
3. E(utus)=0 t≠s OLS still unbiased
4. Cov(Xt, ut)=0 Serious – see next lecture
5. ut~Normal Nice Property to have, but if
sample size is big it doesn’t matter
We assume 2. and 3. hold, or just adjust the
standard errors. We‟ll also assume n is large and
we‟ll always keep a constant, so 5. and 1. are not
relevant. This lecture, we assume 4 holds.
81
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

82
With OLS Assumptions the CLT
Gives Us the Distribution of \hat{\beta}_1

\hat{\beta}_1 \sim N\!\left(\beta_1, \; SE(\hat{\beta}_1)^2\right)

83
t-distribution (small n) vs Normal

[Figure: density of the t-distribution (small n) plotted against the standard normal]

84
EVIEWs output gives us SE(B1)

Dependent Variable: TESTSCR


Method: Least Squares
Date: 06/04/08 Time: 22:13
Sample: 1 420
Included observations: 420

Coefficient Std. Error t-Statistic Prob.

C 698.9330 9.467491 73.82451 0.0000


STR -2.279808 0.479826 -4.751327 0.0000

R-squared 0.051240 Mean dependent var 654.1565


Adjusted R-squared 0.048970 S.D. dependent var 19.05335
S.E. of regression 18.58097 Akaike info criterion 8.686903
Sum squared resid 144315.5 Schwarz criterion 8.706143
Log likelihood -1822.250 Hannan-Quinn criter. 8.694507
F-statistic 22.57511 Durbin-Watson stat 0.129062
Prob(F-statistic) 0.000003

87
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

88
EVIEWs Output Can be Summarized
in Two Lines
Put standard errors in parentheses below the estimated
coefficients to which they apply.
\widehat{TestScore} = 698.9 - 2.28\,STR,   R^2 = .05,  SER = 18.6
                      (9.4675)   (0.4798)

(The textbook reports heteroskedasticity-robust standard errors of 10.4 and 0.52; the EViews output above gives 9.4675 and 0.4798.)

This expression gives a lot of information:
The estimated regression line is \widehat{TestScore} = 698.9 - 2.28\,STR
The standard error of \hat{\beta}_0 is 9.4675 (10.4 with the robust correction)
The standard error of \hat{\beta}_1 is 0.4798 (0.52 with the robust correction)
The R^2 is .05; the standard error of the regression is 18.6


89
We Only Need Two Numbers For
Hypothesis Testing
For hypothesis testing about the slope we only need two of these numbers: \hat{\beta}_1 = -2.28 and its standard error SE(\hat{\beta}_1).
90
Remember Hypothesis Testing?
1. H0 = null hypothesis = „status quo‟ belief = what
you believe without good reason to doubt it.
2. H1 = alternative hypothesis = what you believe if
you reject H0
3. Collect evidence and create a calculated Test
Statistic
4. Decide on a significance level = test size = Prob(type I error) = \alpha
5. The test size defines a rejection region and a
critical value (the changeover point)
6. Reject H0 if Test Statistic lies in Rejection Region
91
Hypothesis Testing and the Standard Error of \hat{\beta}_1 (Section 5.1)
The objective is to test a hypothesis, like \beta_1 = 0, using data – to reach a tentative conclusion whether the (null) hypothesis is correct or incorrect.
General setup
Null hypothesis and two-sided alternative:
H_0: \beta_1 = \beta_{1,0}  vs.  H_1: \beta_1 \ne \beta_{1,0}
where \beta_{1,0} is the hypothesized value under the null.
Null hypothesis and one-sided alternative:
H_0: \beta_1 = \beta_{1,0}  vs.  H_1: \beta_1 < \beta_{1,0}
92
General approach: construct the t-statistic, and compute the p-value (or compare to the N(0,1) critical value)

In general:  t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}

where the SE of the estimator is the square root of an estimator of the variance of the estimator.

For testing \beta_1:  t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}

where SE(\hat{\beta}_1) is the square root of an estimator of the variance of the sampling distribution of \hat{\beta}_1.

Comparing the distance between the estimate and your hypothesized value is obvious; doing it in units of volatility is less so.
93
Summary: To test H_0: \beta_1 = \beta_{1,0} vs. H_1: \beta_1 \ne \beta_{1,0},
Construct the t-statistic

t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}

Reject at the 5% significance level if |t| > 1.96
This procedure relies on the large-n approximation; typically n = 30 is large enough for the approximation to be excellent.

94
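A sketch of this recipe using the slide's numbers for the test-score regression:

```python
# Two-sided t-test of H0: beta_1 = 0 using the estimates from the slides.
b1_hat, se_b1, b1_null = -2.28, 0.52, 0.0

t = (b1_hat - b1_null) / se_b1      # ≈ -4.38
reject_5pct = abs(t) > 1.96         # compare with the 5% two-sided critical value

print(round(t, 2), reject_5pct)
```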
p-values are another method

See textbook pp 72-81, and look up p-value in the index

1. What values of the test statistic would make you more


determined to reject the null than you are now?
2. If the null is true, what is the probability of obtaining
those values? This is the p-value.
“the p-value, also called the significance probability [not in
QBA] is the probability of drawing a statistic at least as
adverse to the null hypothesis as the one you actually
computed in your sample, assuming the null hypothesis is
correct” pg. 73
95
p-values are another method
See textbook pp 72-81, and look it up in the index

For a two-sided test, the p-value is p = Pr[|t| > |t^{act}|] = the probability in the tails of the normal outside |t^{act}|;
you reject at the 5% significance level if the p-value is < 5% (or < 1% or < 10%, depending on the test size)
REJECT H_0 IF p-value < \alpha
96
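A sketch of the p-value calculation for the test-score slope, using the large-n normal approximation:

```python
# Two-sided p-value from the standard normal approximation.
from scipy.stats import norm

t_act = -4.38
p_value = 2 * norm.sf(abs(t_act))   # Pr[|t| > |t_act|] in both tails

print(p_value)                      # about 1.2e-05, so reject at any conventional level
```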
Example: Test Scores and STR,
California data
Estimated regression line: TestScore = 698.9 – 2.28STR
Regression software reports the standard errors:
(The standard errors are corrected for heteroskedasticity)
SE(\hat{\beta}_0) = 10.4    SE(\hat{\beta}_1) = 0.52

t-statistic testing \beta_{1,0} = 0:   t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} = \frac{-2.28 - 0}{0.52} = -4.38

The 1% two-sided critical value is 2.58, so we reject the null at the 1% significance level.
Alternatively, we can compute the p-value…
97
The p-value based on the large-n standard normal approximation
to the t-statistic is 0.00001 (10–5)
98
Hypothesis Testing Can be Tricky
Dependent Variable: TESTSCR
Method: Least Squares
Date: 06/05/08 Time: 22:35
Sample: 1 420
Included observations: 420

Coefficient Std. Error t-Statistic Prob.

C 655.1223 1.126888 581.3553 0.0000


COMPUTER -0.003183 0.002106 -1.511647 0.1314

'Prob' only equals the p-value for a two-sided test

99
Try These Hypotheses
(a) H_0: \beta_1 = 0, H_1: \beta_1 > 0, with \alpha = .05, using the critical-values approach

(b) H_0: \beta_1 = 0, H_1: \beta_1 < 0, with \alpha = .05, using the critical-values approach

(c) H_0: \beta_1 = 0, H_1: \beta_1 \ne 0, with \alpha = .05, using the critical-values approach

(d) H_0: \beta_1 = 0, H_1: \beta_1 > 0, with \alpha = .05, using the p-value approach

(e) H_0: \beta_1 = 0, H_1: \beta_1 < 0, with \alpha = .05, using the p-value approach

(f) H_0: \beta_1 = 0, H_1: \beta_1 \ne 0, with \alpha = .05, using the p-value approach

(g) H_0: \beta_1 = -.05, H_1: \beta_1 < -.05, with \alpha = .10


100
H0:B1=0, H1:B1>0, with =.05 using the critical-values approach

101
(b) H0:B1=0, H1:B1<0, with =.05 using the critical-values approach

102
(c) H0:B1=0, H1:B1≠0, with =.05 using the critical-values approach

103
(d) H0:B1=0, H1:B1>0, with =.05 using the p-value approach

This is very hard to do with p-values!

104
(e) H0:B1=0, H1:B1<0, with =.05 using the p-value approach

105
(f) H0:B1=0, H1:B1≠0, with =.05 using the p-value approach

106
(g) H0:B1=-0.05, H1:B1<-0.05, with =.10

107
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

108
With OLS Assumptions the CLT
Gives Us the Distribution of \hat{\beta}_1

\hat{\beta}_1 \sim N\!\left(\beta_1, \; SE(\hat{\beta}_1)^2\right)

109
Confidence Intervals

\hat{\beta}_1 \sim N\!\left(\beta_1, \; SE(\hat{\beta}_1)^2\right)

\hat{\beta}_1 - \beta_1 \sim N\!\left(0, \; SE(\hat{\beta}_1)^2\right)

\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim N(0, 1)
110
95% Confidence Intervals Catch the
True Parameter 95% of the Time
Prob\!\left(-1.96 \le \frac{\hat{\beta} - \beta}{SE(\hat{\beta})} \le 1.96\right) \approx .95
Prob\!\left(-1.96\,SE(\hat{\beta}) \le \hat{\beta} - \beta \le 1.96\,SE(\hat{\beta})\right) \approx .95
Prob\!\left(\hat{\beta} - 1.96\,SE(\hat{\beta}) \le \beta \le \hat{\beta} + 1.96\,SE(\hat{\beta})\right) \approx .95
So, the probability that \beta will be captured by the random interval \hat{\beta} \pm 1.96\,SE(\hat{\beta}) is 0.95.
http://bcs.whfreeman.com/bps4e/content/cat_010/applets/confidenceinterval.html
111
Confidence Intervals are
Reasonable Ranges
If we cannot reject H_0: \beta = \beta^* in favour of H_1: \beta \ne \beta^* at, say, 5%,
it implies
\left|\frac{\hat{\beta} - \beta^*}{SE(\hat{\beta})}\right| \le 1.96,   i.e.   \hat{\beta} - 1.96\,SE(\hat{\beta}) \le \beta^* \le \hat{\beta} + 1.96\,SE(\hat{\beta})
But this just says \beta^* must lie in a 95% CI.
Going the other way, we can define a 1 - \alpha confidence interval as the range of values that could not be rejected as nulls in a two-sided test of significance with test size \alpha.
112
Confidence interval example: Test Scores and STR
Estimated regression line: TestScore = 698.9 – 2.28 STR

SE(\hat{\beta}_0) = 10.4    SE(\hat{\beta}_1) = 0.52

95% confidence interval for \beta_1:

\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1) = -2.28 \pm 1.96 \times 0.52 = (-3.30, -1.26)

113
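A quick check of the interval reported above:

```python
# 95% confidence interval for beta_1 from the slide's estimates.
b1_hat, se_b1 = -2.28, 0.52

lower = b1_hat - 1.96 * se_b1    # ≈ -3.30
upper = b1_hat + 1.96 * se_b1    # ≈ -1.26

print(round(lower, 2), round(upper, 2))
```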
If You Make 1→1 Associations, Use
Simple Regression, not Correlation
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

But be careful of the Simple Regression assumption


Cov(Xt, ut)=0

114
Chapter 6

Introduction to
Multiple Regression

115
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

116
It‟s all about u
(SW Section 6.1)

The error u arises because of factors that influence Y but are not
included in the regression function; so, there are always omitted
variables.

Sometimes, the omission of those variables can lead to bias in the


OLS estimator. This occurs because the assumption
4. Cov(Xt, ut)=0
Is violated

117
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

118
Omitted variable bias=OVB
The bias in the OLS estimator that occurs as a result of an
omitted factor is called omitted variable bias.
Let y = \beta_0 + \beta_1 x + u, and let u = f(Z)
Omitted variable bias is a problem if the omitted factor "Z" is:

1. A determinant of Y (i.e. Z is part of u); and

2. Correlated with the regressor X (i.e. corr(Z, X) \ne 0)

Both conditions must hold for the omission of Z to result in


omitted variable bias.
119
What Causes Long Life?
Gapminder (http://www.gapminder.org/world) is an online
applet that contains demographic information about each
country in the world.
Suppose that we are interested in predicting life
expectancy, and think that both income per capita and the
number of physicians per 1000 people would make good
indicators.
Our first step would be to graph these predictors against
life expectancy
We find that both are positively correlated with life expectancy

120
…Doctors or Income or Both?
Simple Linear Regression only allows us to use
one of these predictors to estimate life expectancy.
But income per capita is correlated with the
number of physicians per 1000 people. Suppose
the truth is:
Life=B0+B1Income+B2Doctors+u but you run

Life=B0+B1Income+u* (u*=B2Doctors+u)

121
OVB=„Double Counting‟
B1 is the impact of Income on Life, holding
everything else constant including the residual
But if correlation exists between the Doctors (in
the residual) and income (rIncDoct≠0 ), and, if the
true impact of Doctors (B2≠0) is non-zero, then B1
counts both effects – it „double counts‟

Life=B0+B1Income+u* (u*=B2Doctors+u)

122
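A small simulation sketch of the 'double counting' idea (all numbers here are invented for illustration): when doctors is omitted, the income coefficient in the short regression absorbs the omitted effect.

```python
# Sketch: omitted variable bias. 'doctors' is correlated with 'income' and
# affects 'life'; leaving it out biases the income coefficient.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
income = rng.normal(30, 10, n)
doctors = 0.1 * income + rng.normal(0, 1, n)                 # correlated with income
life = 60 + 0.2 * income + 3.0 * doctors + rng.normal(0, 2, n)

X_full = np.column_stack([np.ones(n), income, doctors])      # includes doctors
X_short = np.column_stack([np.ones(n), income])              # omits doctors

b_full, *_ = np.linalg.lstsq(X_full, life, rcond=None)
b_short, *_ = np.linalg.lstsq(X_short, life, rcond=None)

print(b_full[1])    # close to the true 0.2
print(b_short[1])   # close to 0.2 + 3.0*0.1 = 0.5: the 'double counting'
```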
Our Test score Reg has OVB
In the test score example:
1. English language deficiency (whether the student is learning
English) plausibly affects standardized test scores: Z is a
determinant of Y.
2. Immigrant communities tend to be less affluent and thus
have smaller school budgets – and higher STR: Z is
correlated with X.

Accordingly, ˆ1 is biased.

123
What is the bias? We have a formula
STR is larger for those classes with a higher PctEL (both being a
feature of poorer areas), the correlation between STR and PctEL
will be positive
PctEL appears in u with a negative sign in front of it – higher
PctEL leads to lower scores. Therefore the correlation between
STR and u[ minus PctEL], must be negative (ρXu < 0).
Here is the formula (standard deviations are always positive):

Bias = E(\hat{\beta}_1) - \beta_1 = \rho_{Xu}\,\frac{\sigma_u}{\sigma_X}

– the volatility of the error and of the included variable matter, and here \rho_{Xu} < 0.
So the coefficient on the student-teacher ratio is negatively biased by the exclusion of the percentage of English learners. It is 'too big' in absolute value.
124
Including PctEL Solves Problem
Some ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which treatment
(STR) is randomly assigned: then PctEL is still a determinant
of TestScore, but PctEL is uncorrelated with STR. (But this is
unrealistic in practice.)
2. Adopt the “cross tabulation” approach, with finer gradations
of STR and PctEL – within each group, all classes have the
same PctEL, so we control for PctEL (But soon we will run
out of data, and what about other determinants like family
income and parental education?)
3. Use a regression in which the omitted variable (PctEL) is no
longer omitted: include PctEL as an additional regressor in a
multiple regression.
125
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

126
The Population Multiple Regression
Model (SW Section 6.2)
Consider the case of two regressors:
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i,   i = 1, …, n

Y is the dependent variable
X_1, X_2 are the two independent variables (regressors)
(Y_i, X_{1i}, X_{2i}) denote the i-th observation on Y, X_1, and X_2
\beta_0 = unknown population intercept
\beta_1 = effect on Y of a change in X_1, holding X_2 constant
\beta_2 = effect on Y of a change in X_2, holding X_1 constant
u_i = the regression error (omitted factors)
127
Partial Derivatives in Multiple
Regression = Cet. Par. in Economics
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i,   i = 1, …, n

We can use calculus to interpret the coefficients:

\beta_1 = \frac{\partial Y}{\partial X_1}, holding X_2 constant = ceteris paribus

\beta_2 = \frac{\partial Y}{\partial X_2}, holding X_1 constant = ceteris paribus

\beta_0 = predicted value of Y when X_1 = X_2 = 0.

128
The OLS Estimator in Multiple
Regression (SW Section 6.3)
With two regressors, the OLS estimator solves:

\min_{b_0, b_1, b_2} \sum_{i=1}^{n} \bigl[Y_i - (b_0 + b_1 X_{1i} + b_2 X_{2i})\bigr]^2

The OLS estimator minimizes the average squared difference


between the actual values of Yi and the prediction (predicted
value) based on the estimated line.
This minimization problem is solved using calculus
This yields the OLS estimators of 0 , 1 and 2.

129
Multiple regression in EViews
Dependent Variable: TESTSCR
Method: Least Squares
Sample: 1 420
Included observations: 420
White Heteroskedasticity-Consistent Standard Errors & Covariance

TESTSCR=C(1)+C(2)*STR+C(3)*EL_PCT

Coefficient Std. Error t-Statistic Prob.


C(1) 686.0322 8.728224 78.59930 0.0000
C(2) -1.101296 0.432847 -2.544307 0.0113
C(3) -0.649777 0.031032 -20.93909 0.0000

R-squared 0.426431 Mean dependent var 654.1565


Adjusted R-squared 0.423680 S.D. dependent var 19.05335
S.E. of regression 14.46448 Akaike info criterion 8.188387
Sum squared resid 87245.29 Schwarz criterion 8.217246
Log likelihood -1716.561 Durbin-Watson stat 0.685575

TestScore = 686.0 – 1.10 STR – 0.65 PctEL

More on this printout later… 130


Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

131
Measures of Fit for Multiple
Regression (SW Section 6.4)
R2 now becomes the square of the correlation coefficient
between y and predicted y.

It is still the proportional reduction in the residual sum of


squares as we move from modeling y with just a sample
mean, to modeling it with a group of variables.

132
R^2 and \bar{R}^2
The R^2 is the fraction of the variance explained – same definition as in regression with a single regressor:

R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS},

where ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{\hat{Y}})^2,   SSR = \sum_{i=1}^{n} \hat{u}_i^2,   TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2.

The R2 always increases when you add another regressor


(why?) – a bit of a problem for a measure of “fit”

133
R^2 and \bar{R}^2
The \bar{R}^2 (the "adjusted R^2") corrects this problem by "penalizing" you for including another regressor – the \bar{R}^2 does not necessarily increase when you add another regressor.

Adjusted R^2:
\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\cdot\frac{SSR}{TSS} = 1 - \frac{n-1}{n-k-1}\,(1 - R^2)

Note that \bar{R}^2 < R^2; however, if n is large the two will be very close.
134
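A sketch of these two formulas, checked against the two-regressor EViews printout a few slides back (SSR = 87245.29, R^2 = 0.426431, n = 420, k = 2 gives an adjusted R^2 of about 0.4237, matching the reported 0.423680):

```python
# Sketch: R^2 and adjusted R^2 from the sums of squares.
def r2_and_adjusted_r2(ssr, tss, n, k):
    r2 = 1 - ssr / tss
    adj_r2 = 1 - (n - 1) / (n - k - 1) * (ssr / tss)
    return r2, adj_r2

# TSS is recovered from SSR and R^2 (TSS = SSR / (1 - R^2)).
tss = 87245.29 / (1 - 0.426431)
print(r2_and_adjusted_r2(ssr=87245.29, tss=tss, n=420, k=2))
```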
Measures of fit, ctd.
Test score example:

(1) \widehat{TestScore} = 698.9 - 2.28\,STR,
    R^2 = .05, SER = 18.6

(2) \widehat{TestScore} = 686.0 - 1.10\,STR - 0.65\,PctEL,
    R^2 = .426, \bar{R}^2 = .424, SER = 14.5

What – precisely – does this tell you about the fit of regression
(2) compared with regression (1)?
Why are the R2 and the R 2 so close in (2)?
135
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

136
Sampling Distribution Depends on
Least Squares Assumptions (SW Section 6.5)
yi=B0+B1x1i+B2x2i+……Bkxki+ui
• E(ut)=0
1. E(ut2)=σ2 =SER2 (note: not σ2t – invariant)
2. E(utus)=0 t≠s
3. Cov(Xt, ut)=0
4. ut~Normal plus
5. There is no perfect multicollinearity

137
Assumption #4: There is no perfect multicollinearity
Perfect multicollinearity is when one of the regressors is an
exact linear function of the other regressors.

Example: Suppose you accidentally include STR twice:

138
Perfect multicollinearity is when one of the regressors is an
exact linear function of the other regressors.
In such a regression (where STR is included twice), 1 is the
effect on TestScore of a unit change in STR, holding STR
constant (???)
The Standard Errors become Infinite when perfect
multicollinearity exists

139
OLS Wonder Equation
SE(b_i) \approx \frac{S_{\hat{u}}}{S_{x_i}} \cdot \frac{1}{\sqrt{n\,\bigl(1 - R^2_{x_i \text{ on other } X\text{'s}}\bigr)}}

• Multicollinearity raises R^2_{x_i \text{ on other } X\text{'s}} and therefore increases the variance of b_i
• Perfect multicollinearity (R^2 = 1) makes regression impossible
• Don't expect lower standard errors just because you add more variables: the more you add, the higher the R^2 in the denominator becomes (it always rises with extra variables), which pushes SE(b_i) up.
140
Quality of Slope Estimate (R2
and Suˆ fixed)
[Figure: three scatterplots with fitted lines. With n = 6 and a small spread S_{x_i}^2, SE(b_i) is high; with n = 20 and the same spread, SE(b_i) is low; with n = 6 but a larger spread, SE(b_i) is also low.]

141
The Sampling Distribution of the
OLS Estimator (SW Section 6.6)
Under the Least Squares Assumptions,
• The exact (finite sample) distribution of \hat{\beta}_1 has mean \beta_1, and var(\hat{\beta}_1) is inversely proportional to n; so too for \hat{\beta}_2.
• Other than its mean and variance, the exact (finite-n) distribution of \hat{\beta}_1 is very complicated; but for large n…
• \hat{\beta}_1 is consistent: \hat{\beta}_1 \xrightarrow{p} \beta_1 (law of large numbers)
• \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} is approximately distributed N(0, 1) (CLT)
• So too for \hat{\beta}_2, …, \hat{\beta}_k


Conceptually, there is nothing new here!
142
Multicollinearity, Perfect and
Imperfect (SW Section 6.7)
Some more examples of perfect multicollinearity
The example from earlier: you include STR twice.
Second example: regress TestScore on a constant, D, and Bel, where D_i = 1 if STR ≤ 20 (= 0 otherwise) and Bel_i = 1 if STR > 20 (= 0 otherwise), so Bel_i = 1 - D_i. There is perfect multicollinearity because Bel + D = 1 (the '1' variable for the constant).
To fix this, drop the constant

143
Perfect multicollinearity, ctd.
Perfect multicollinearity usually reflects a mistake in the
definitions of the regressors, or an oddity in the data
If you have perfect multicollinearity, your statistical software
will let you know – either by crashing or giving an error
message or by “dropping” one of the variables arbitrarily
The solution to perfect multicollinearity is to modify your list
of regressors so that you no longer have perfect
multicollinearity.

144
Imperfect multicollinearity
Imperfect and perfect multicollinearity are quite different despite
the similarity of the names.

Imperfect multicollinearity occurs when two or more regressors


are very highly correlated.
Why this term? If two regressors are very highly
correlated, then their scatterplot will pretty much look like a
straight line – they are collinear – but unless the correlation
is exactly 1, that collinearity is imperfect.

145
Imperfect multicollinearity, ctd.
Imperfect multicollinearity implies that one or more of the
regression coefficients will be imprecisely estimated.
Intuition: the coefficient on X1 is the effect of X1 holding X2
constant; but if X1 and X2 are highly correlated, there is very
little variation in X1 once X2 is held constant – so the data are
pretty much uninformative about what happens when X1
changes but X2 doesn‟t, so the variance of the OLS estimator
of the coefficient on X1 will be large.
Imperfect multicollinearity (correctly) results in large
standard errors for one or more of the OLS coefficients as
described by the OLS wonder equation
Next topic: hypothesis tests and confidence intervals…

146
Portion of X that “explains” Y

High R2
Y
For any two circles,
the overlap tells the
size of the R2

147
Portion of X that “explains” Y

LowR2
Y
For any two circles,
the overlap tells the
size of the R2

148
Adding Another X Increases R2

X2

X1

Now the R is the overlap of both X1


and X2 with Y
149
Imperfect (but high)
multicollinearity
Since X_2 and X_1 share a lot of the same information, adding X_2 allows us to work out independent effects better, but we realize we don't have much information (area) to do this with. Larger n makes all circles bigger and, as before, the overlap tells the size of R^2.
[Venn diagram: circles for Y, X_1 and X_2, with X_1 and X_2 overlapping heavily]
SE(b_1) \approx \frac{S_{\hat{u}}}{S_{x_1}} \cdot \frac{1}{\sqrt{n\,(1 - R^2_{x_1 \text{ on } x_2})}}
150
Chapter 7:
Multiple Regression:
Multiple Coefficient Testing

151
Multiple Coefficients Tests?

y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + … + \beta_k x_{kt} + e_t

We know how to obtain estimates of the


coefficients, and each one is a ceteris paribus („all
other things equal‟) effect
Why would we want to do hypotheses tests about
groups of coefficients?

152
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

Example 1: Consider the statement that „this whole


model is worthless‟.
One way of making that statement mathematically
formal is to say
\beta_1 = \beta_2 = … = \beta_k = 0
because if this is true then none of the variables x1, x2.
. xk helps explain y
153
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

Example 2: Suppose y is the share of the population that votes


for the ruling party and x1 and x2 is the spending on TV and
radio advertising.
The Prime Minister might want to know if TV is more effective
than radio, as measured by the impact on the share of the
popular vote for of an extra dollar spent on each. The way to
write this mathematically is

\beta_1 > \beta_2

154
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

Example 3: Suppose y is the growth in GDP, x1 and x2 are the


cash rate one and two quarters ago, and that all the other X‟s are
different macroeconomic variables.
Suppose we are interested in testing the effectiveness of
monetary policy. One way of doing this is asking if the cash
rate at any lag has an impact on GDP growth. Mathematically,
this is

\beta_1 = \beta_2 = 0
155
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

In each case, we are interested in making


statements about groups of coefficients.
What about just looking at the estimates?
Same problem as in t-testing. You ought to care
about reliability.
What about sequential testing?
errors compound, even if possible (SW Sect. 7.2)
156
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

The so-called F-test can do all of these


restrictions, except for example 2.
Before turning to the F-test, let‟s do example 2,
which can be done with a t-test

157
Example 2 Solution
y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + … + \beta_k x_{kt} + e_t
If \theta = \beta_1 - \beta_2 then \beta_1 = \theta + \beta_2. Sub this in:
y_t = \beta_0 + (\theta + \beta_2) x_{1t} + \beta_2 x_{2t} + … + e_t
    = \beta_0 + \theta x_{1t} + \beta_2 (x_{1t} + x_{2t}) + … + e_t
So, to test \beta_1 > \beta_2, just run a new regression including x_1 + x_2 instead of x_2 (everything else is left the same) and do a t-test of H_0: \theta = 0 vs. H_1: \theta > 0. Naturally, if you accept H_1: \theta > 0, this implies \beta_1 - \beta_2 > 0, which implies \beta_1 > \beta_2.
This technique is called reparameterization

158
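A sketch of the trick with simulated data (the variable names and true coefficients are invented for illustration): regressing y on x_1 and (x_1 + x_2) delivers a t-test on \theta = \beta_1 - \beta_2.

```python
# Sketch: reparameterization for testing beta_1 - beta_2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.5 * x2 + rng.normal(size=n)   # true theta = 0.8 - 0.5 = 0.3

# Regress y on x1 and (x1 + x2); the coefficient on x1 is now theta.
X = sm.add_constant(np.column_stack([x1, x1 + x2]))
fit = sm.OLS(y, X).fit()

print(fit.params[1], fit.pvalues[1])   # estimate of theta and its two-sided p-value
```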
Restricted Regressions

yt = 0 + 1x1t + 2x2t +.. kxkt + et

One more thing before we do the F-test, we


must define a „restricted regression‟. This is
just the model you get when a hypothesis is
assumed true

159
Restricted Regression: Example 1

y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + … + \beta_k x_{kt} + e_t

Example 1: Consider the statement that 'this whole model is worthless'.
If \beta_1 = \beta_2 = … = \beta_k = 0
then the model is
y_t = \beta_0 + e_t
and the restricted regression would be an OLS regression of y on a constant. The estimate for the constant will just be the sample mean of y.
160
Restricted Regression: Example 3

y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + … + \beta_k x_{kt} + e_t

If \beta_1 = \beta_2 = 0, then the model is
y_t = \beta_0 + \beta_3 x_{3t} + … + \beta_k x_{kt} + e_t
and the restricted regression is an OLS regression of y on a constant and x_3 to x_k

161
Properties of Restricted
Regressions
Imposing a restriction always increases the residual
sum of squares, since you are forcing the estimates to
take the values implied by the restriction, rather than
letting OLS choose the values of the estimates to
minimize the SSR
If the SSR increases a lot, it implies that the restriction is relatively 'unbelievable'. That is, the model fits a lot worse with the restriction imposed.
This last point is the basic intuition of the F-test – impose the restriction and see if the SSR goes up 'too much'.
http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html

162
The F-test
To test a restriction we need to run the restricted
regression as well as the unrestricted regression (i.e.
the original regression). Let q be the number of
restrictions.
Intuitively, we want to know if the change in SSR is
big enough to suggest the restriction is wrong

F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)},   where r is restricted and ur is unrestricted
163
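A sketch of the formula as a small helper (q restrictions, k regressors in the unrestricted model):

```python
# Sketch: F statistic from the restricted and unrestricted sums of squared residuals.
def f_stat(ssr_r, ssr_ur, q, n, k):
    # Relative increase in SSR from imposing the q restrictions,
    # scaled by the unrestricted fit per degree of freedom.
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
```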
The F statistic
The F statistic is always positive, since the SSR
from the restricted model can‟t be less than the
SSR from the unrestricted
Essentially the F statistic is measuring the relative
increase in SSR when moving from the unrestricted
to restricted model
q = number of restrictions

164
The F statistic (cont)
To decide if the increase in SSR when we move to
a restricted model is “big enough” to reject the
restrictions, we need to know about the sampling
distribution of our F stat
Not surprisingly, F ~ Fq,n-k-1, where q is referred to
as the numerator degrees of freedom and n – k-1 as
the denominator degrees of freedom

165
The F statistic Reject H0 at
significance level
f(F) if F > c

0 c F
fail to reject reject
166
Equivalently, using p-values
f(F) Reject H0if p-value <

0 c F
fail to reject reject
167
The R^2 form of the F statistic
Because the SSRs may be large and unwieldy, an alternative form of the formula is useful.
We use the fact that SSR = TSS(1 - R^2) for any regression, so we can substitute in for SSR_r and SSR_{ur}:

F = \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n - k - 1)},   where again r is restricted and ur is unrestricted
168
Overall Significance (example 1)

A special case of exclusion restrictions is to test H_0: \beta_1 = \beta_2 = … = \beta_k = 0
R^2 = 0 for a model with only an intercept.
This is because the OLS estimator is then just the sample mean, implying TSS = SSR.
The F statistic is then

F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}
169
F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}

Dependent Variable: TESTSCR
Method: Least Squares
Date: 06/05/08 Time: 15:29
Sample: 1 420
Included observations: 420

Coefficient Std. Error t-Statistic Prob.

C 675.6082 5.308856 127.2606 0.0000


MEAL_PCT -0.396366 0.027408 -14.46148 0.0000
AVGINC 0.674984 0.083331 8.100035 0.0000
STR -0.560389 0.228612 -2.451272 0.0146
EL_PCT -0.194328 0.031380 -6.192818 0.0000

R-squared 0.805298 Mean dependent var 654.1565


Adjusted R-squared 0.803421 S.D. dependent var 19.05335
S.E. of regression 8.447723 Akaike info criterion 7.117504
Sum squared resid 29616.07 Schwarz criterion 7.165602
Log likelihood -1489.676 Hannan-Quinn criter. 7.136515
F-statistic 429.1152 Durbin-Watson stat 1.545766
Prob(F-statistic) 0.000000

[.8053/4]/[{1-.8053}/(420-5)] = 429
170
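A quick check of the overall-significance F statistic against the printout above:

```python
# Overall F statistic via the R^2 form, using the reported R-squared.
r2, k, n = 0.805298, 4, 420
f = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f, 1))   # ≈ 429.1, matching the printout's F-statistic
```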
General Linear Restrictions
The basic form of the F statistic will work for any
set of linear restrictions
First estimate the unrestricted model and then
estimate the restricted model
In each case, make note of the SSR
Imposing the restrictions can be tricky – will likely
have to redefine variables again

171
F Statistic Summary
Just as with t statistics, p-values can be calculated
by looking up the percentile in the appropriate F
distribution
If only one exclusion is being tested, then F = t2,
and the p-values will be the same
F-tests are done mechanically – you don‟t have to
do the restricted regressions (though you have to
understand how to do them for this course).

172
F-tests are Easy in EVIEWs
To test hypotheses like these in EVIEWs, use the
Wald test. After you run your regression, type
„View, Coefficient tests, Wald‟
Try testing a single restriction (which you can use
a t-test for) and see that t2=F, and, that the p-
values are the same.
Try testing all the coefficients except the intercept
are zero, and compare it with the F-test
automatically calculated in EVIEWs.
SW discusses the shortcomings of F-tests at
length. They crucially depend upon the
assumption of homoskedasticity.

173
Start Big and Go Small
General to Specific Modeling relies upon the fact that
omitted variable bias is a serious problem.
Start with a very big model to avoid OVB
Do t-tests on individual coefficients. Delete the most
insignificant, run the model again, delete the most
insignificant variable, run the model again, and so
on….until every individual coefficient is significant.
Finally, Do an F-test on the original model excluding all
the coefficients required to get to your final model at once.
If the null is accepted, you have verified the model.
Test for Hetero, and correct for it if need be.
174
Chapter 8

Nonlinear Regression
Functions

175
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

176
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

177
Nonlinear Regression Population Regression
Functions – General Ideas (SW Section 8.1)
If a relation between Y and X is nonlinear:

The effect on Y of a change in X depends on the value of X –


that is, the marginal effect of X is not constant
A linear regression is mis-specified – the functional form is
wrong
The estimator of the effect on Y of X is biased – it needn‟t
even be right on average.
The solution to this is to estimate a regression function that is
nonlinear in X
178
Nonlinear Functions of a Single
Independent Variable (SW Section 8.2)
We‟ll look at two complementary approaches:

1. Polynomials in X
The population regression function is approximated by a
quadratic, cubic, or higher-degree polynomial

2. Logarithmic transformations
Y and/or X is transformed by taking its logarithm
this gives a “percentages” interpretation that makes sense
in many applications
179
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

180
2. Polynomials in X
Approximate the population regression function by a polynomial:

Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + … + \beta_r X_i^r + u_i

This is just the linear multiple regression model – except that


the regressors are powers of X!
Estimation, hypothesis testing, etc. proceeds as in the
multiple regression model using OLS
The coefficients are difficult to interpret, but the regression
function itself is interpretable
181
Example: the TestScore – Income
relation
Incomei = average district income in the ith district
(thousands of dollars per capita)

Quadratic specification:

TestScore_i = \beta_0 + \beta_1\,Income_i + \beta_2\,(Income_i)^2 + u_i

Cubic specification:

TestScore_i = \beta_0 + \beta_1\,Income_i + \beta_2\,(Income_i)^2 + \beta_3\,(Income_i)^3 + u_i
182
Estimation of the quadratic
specification in EViews
Dependent Variable: TESTSCR
Method: Least Squares Create a quadratic regressor
Sample: 1 420
Included observations: 420
White Heteroskedasticity-Consistent Standard Errors & Covariance
TESTSCR=C(1)+C(2)*AVGINC + C(3)*AVGINC*AVGINC

Coefficient Std. Error t-Statistic Prob.


C(1) 607.3017 2.901754 209.2878 0.0000
C(2) 3.850995 0.268094 14.36434 0.0000
C(3) -0.042308 0.004780 -8.850509 0.0000

R-squared 0.556173 Mean dependent var 654.1565


Adjusted R-squared 0.554045 S.D. dependent var 19.05335
S.E. of regression 12.72381 Akaike info criterion 7.931944
Sum squared resid 67510.32 Schwarz criterion 7.960803
Log likelihood -1662.708 Durbin-Watson stat 0.951439

Test the null hypothesis of linearity against the alternative that


the regression function is a quadratic….
183
Interpreting the estimated
regression function:
(a) Plot the predicted values
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2
(2.9) (0.27) (0.0048)

184
Interpreting the estimated
regression function, ctd:
(b) Compute “effects” for different values of X

TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2


(2.9) (0.27) (0.0048)

Predicted change in TestScore for a change in income from


$5,000 per capita to $6,000 per capita:

\Delta TestScore = (607.3 + 3.85 \times 6 - 0.0423 \times 6^2) - (607.3 + 3.85 \times 5 - 0.0423 \times 5^2) = 3.4
185
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

Predicted “effects” for different values of X:

Change in Income ($1000 per capita) TestScore


from 5 to 6 3.4
from 25 to 26 1.7
from 45 to 46 0.0

The “effect” of a change in income is greater at low than high


income levels (perhaps, a declining marginal benefit of an
increase in school budgets?)
Caution! What is the effect of a change from 65 to 66?
Don’t extrapolate outside the range of the data!

186
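A quick check of the three entries in the table above, using the estimated quadratic:

```python
# Predicted TestScore changes from the estimated quadratic in income.
def test_score(income):
    return 607.3 + 3.85 * income - 0.0423 * income ** 2

print(round(test_score(6) - test_score(5), 1))    # ≈ 3.4
print(round(test_score(26) - test_score(25), 1))  # ≈ 1.7
print(round(test_score(46) - test_score(45), 1))  # ≈ 0.0
```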
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

Predicted “effects” for different values of X:

Change in Income ($1000 per capita) TestScore


from 5 to 6 3.4
from 25 to 26 1.7
from 45 to 46 0.0

Alternatively, dTestscore/dIncome = 3.85-.0846 (Income)


gives the same numbers (approx)

187
Summary: polynomial regression
functions
Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + … + \beta_r X_i^r + u_i

Estimation: by OLS after defining new regressors


Coefficients have complicated interpretations
To interpret the estimated regression function:
plot predicted values as a function of x
compute predicted Y/ X at different values of x
Hypotheses concerning degree r can be tested by t- and F-
tests on the appropriate (blocks of) variable(s).
Choice of degree r
plot the data; t- and F-tests, check sensitivity of estimated
effects; judgment.
188
A Final Warning: Polynomials
Can Fit Too Well
When fitting a polynomial regression function, we
need to be careful not to fit too many terms, despite
the fact that a higher order polynomial will always
fit better.
If we do fit too many terms, then any prediction
may become unrealistic.
The following applet lets us explore fitting
different polynomials to some data.
http://www.scottsarra.org/math/courses/na/nc/polyRegression.html

189
3. Are Polynomials Enough?
We can investigate the appropriateness of a
regression function by graphing the regression
function over the top of the scatterplot.
For some models, we may need to transform the
data
For example, take logs of the response variable
The site below allows us to do this, exploring some
common regression functions
http://www.ruf.rice.edu/%7Elane/stat_sim/transformations/index.html

190
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

191
3. Logarithmic functions of Y and/or X
ln(X) = the natural logarithm of X
Logarithmic transforms permit modeling relations in
“percentage” terms (like elasticities), rather than linearly.

Here's why:  \frac{d\,\ln(x)}{dx} = \frac{1}{x},  so  \Delta \ln(x) \approx \frac{\Delta x}{x} = proportional change in x

Numerically:
ln(1.01) - ln(1) = .00995 - 0 = .00995 (correct %: .01);
ln(40) - ln(45) = 3.6889 - 3.8067 = -.1178 (correct %: -.1111)
192
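A quick numerical check of these log differences:

```python
# Log differences approximate proportional changes.
import numpy as np

print(np.log(1.01) - np.log(1.0))   # 0.00995, close to the exact 1% change
print(np.log(40) - np.log(45))      # -0.1178, vs the exact -11.11% change
```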
Three log regression specifications:

Case             Population regression function

I. linear-log    Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i
II. log-linear   \ln(Y_i) = \beta_0 + \beta_1 X_i + u_i
III. log-log     \ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i

The interpretation of the slope coefficient differs in each case.


The interpretation is found by applying the general “before
and after” rule: “figure out the change in Y for a given change
in X.”
193
Summary: Logarithmic
transformations
Three cases, differing in whether Y and/or X is transformed
by taking logarithms.
The regression is linear in the new variable(s) ln(Y) and/or
ln(X), and the coefficients can be estimated by OLS.
Hypothesis tests and confidence intervals are now
implemented and interpreted “as usual.”
The interpretation of \beta_1 differs from case to case.
Choice of specification should be guided by judgment (which
interpretation makes the most sense in your application?),
tests, and plotting predicted values
194
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

195
Regression when X is Binary
(Section 5.3)

Sometimes a regressor is binary:


X = 1 if small class size, = 0 if not
X = 1 if female, = 0 if male
X = 1 if treated (experimental drug), = 0 if not

Binary regressors are sometimes called “dummy” variables.

So far, \beta_1 has been called a "slope," but that doesn't make sense if X is binary.

How do we interpret regression with a binary regressor?


196
Interpreting regressions with a
binary regressor
Y_i = \beta_0 + \beta_1 X_i + u_i, where X is binary (X_i = 0 or 1):

When X_i = 0:  Y_i = \beta_0 + u_i
  the mean of Y_i is \beta_0
  that is, E(Y_i | X_i = 0) = \beta_0

When X_i = 1:  Y_i = \beta_0 + \beta_1 + u_i
  the mean of Y_i is \beta_0 + \beta_1
  that is, E(Y_i | X_i = 1) = \beta_0 + \beta_1

so:
\beta_1 = E(Y_i | X_i = 1) - E(Y_i | X_i = 0) = population difference in group means
197
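A sketch illustrating this result with simulated data (the dummy and the coefficient values are invented): the OLS slope on a binary regressor reproduces the difference in group means.

```python
# Sketch: OLS with a binary regressor equals the difference in group means.
import numpy as np

rng = np.random.default_rng(2)
d = rng.integers(0, 2, size=500)                  # hypothetical dummy (e.g. small class)
y = 650 + 7 * d + rng.normal(0, 19, size=500)     # true group difference is 7

X = np.column_stack([np.ones(d.size), d])
b0_hat, b1_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(b1_hat, y[d == 1].mean() - y[d == 0].mean())   # identical (up to rounding)
```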
Interactions Between Independent
Variables (SW Section 8.3)
Perhaps a class size reduction is more effective in some
circumstances than in others…
Perhaps smaller classes help more if there are many English
learners, who need individual attention
That is, \frac{\Delta TestScore}{\Delta STR} might depend on PctEL
More generally, \frac{\Delta Y}{\Delta X_1} might depend on X_2
How to model such "interactions" between X_1 and X_2?
We first consider binary X's, then continuous X's
198
(a) Interactions between two binary
variables
Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + u_i

D_{1i}, D_{2i} are binary
\beta_1 is the effect of changing D_1 = 0 to D_1 = 1. In this specification, this effect doesn't depend on the value of D_2.
To allow the effect of changing D_1 to depend on D_2, include the "interaction term" D_{1i} \times D_{2i} as a regressor:

Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3\,(D_{1i} \times D_{2i}) + u_i

199
Interpreting the coefficients:
Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3\,(D_{1i} \times D_{2i}) + u_i

It can be shown that

\frac{\Delta Y}{\Delta D_1} = \beta_1 + \beta_3 D_2

The effect of D_1 depends on D_2 (what we wanted)
\beta_3 = increment to the effect of D_1 from a unit change in D_2

200
Example: TestScore, STR, English
learners
Let
HiSTR = 1 if STR ≥ 20 (= 0 if STR < 20)   and   HiEL = 1 if PctEL ≥ 10 (= 0 if PctEL < 10)

\widehat{TestScore} = 664.1 - 18.2\,HiEL - 1.9\,HiSTR - 3.5\,(HiSTR \times HiEL)
                      (1.4)     (2.3)       (1.9)       (3.1)

“Effect” of HiSTR when HiEL = 0 is –1.9


“Effect” of HiSTR when HiEL = 1 is –1.9 – 3.5 = –5.4
Class size reduction is estimated to have a bigger effect when
the percent of English learners is large
This interaction isn't statistically significant: t = -3.5/3.1 = -1.13, so |t| < 1.96
201
(b) Interactions between continuous
and binary variables
Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i

D_i is binary, X is continuous
As specified above, the effect on Y of X (holding D constant) is \beta_1, which does not depend on D
To allow the effect of X to depend on D, include the "interaction term" D_i \times X_i as a regressor:

Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3\,(D_i \times X_i) + u_i

202
Binary-continuous interactions: the
two regression lines
Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3\,(D_i \times X_i) + u_i

Observations with D_i = 0 (the "D = 0" group):
Y_i = \beta_0 + \beta_1 X_i + u_i      (the D = 0 regression line)

Observations with D_i = 1 (the "D = 1" group):
Y_i = \beta_0 + \beta_1 X_i + \beta_2 + \beta_3 X_i + u_i
    = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_i + u_i      (the D = 1 regression line)
203
Binary-continuous interactions, ctd.
[Figure: the D = 0 and D = 1 lines in three cases: \beta_3 = 0 (same slope, different intercepts), all \beta_i non-zero (different slopes and intercepts), \beta_2 = 0 (same intercept, different slopes)]

204
Interpreting the coefficients:
Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3\,(X_i \times D_i) + u_i

Or, using calculus,

\frac{\Delta Y}{\Delta X} = \beta_1 + \beta_3 D

The effect of X depends on D (what we wanted)
\beta_3 = increment to the effect of X from a change in the level of D from D = 0 to D = 1

205
Example: TestScore, STR, HiEL (= 1 if PctEL ≥ 10)
\widehat{TestScore} = 682.2 - 0.97\,STR + 5.6\,HiEL - 1.28\,(STR \times HiEL)
                      (11.9)    (0.59)     (19.5)      (0.97)

When HiEL = 0:
\widehat{TestScore} = 682.2 - 0.97\,STR
When HiEL = 1:
\widehat{TestScore} = 682.2 - 0.97\,STR + 5.6 - 1.28\,STR = 687.8 - 2.25\,STR
Two regression lines: one for each HiEL group.
Class size reduction is estimated to have a larger effect when
the percent of English learners is large.
206
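A sketch of the two implied regression lines using the estimates above (HiEL = 0 versus HiEL = 1):

```python
# Predicted test scores from the binary-continuous interaction model.
b0, b1, b2, b3 = 682.2, -0.97, 5.6, -1.28

def predicted_score(str_, hi_el):
    return b0 + b1 * str_ + b2 * hi_el + b3 * (str_ * hi_el)

# Slopes: -0.97 when HiEL = 0, and -0.97 - 1.28 = -2.25 when HiEL = 1.
print(predicted_score(20, 0), predicted_score(20, 1))
```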
Example, ctd: Testing hypotheses
\widehat{TestScore} = 682.2 - 0.97\,STR + 5.6\,HiEL - 1.28\,(STR \times HiEL)
                      (11.9)    (0.59)     (19.5)      (0.97)
The two regression lines have the same slope \Leftrightarrow the coefficient on STR \times HiEL is zero: t = -1.28/0.97 = -1.32
The two regression lines have the same intercept \Leftrightarrow the coefficient on HiEL is zero: t = 5.6/19.5 = 0.29
The two regression lines are the same \Leftrightarrow the population coefficient on HiEL = 0 and the population coefficient on STR \times HiEL = 0: F = 89.94 (p-value < .001) !!
We reject the joint hypothesis but neither individual hypothesis (how can this be?)

207
Summary: Nonlinear Regression
Functions
Using functions of the independent variables such as ln(X)
or X1 X2, allows recasting a large family of nonlinear
regression functions as multiple regression.
Estimation and inference proceed in the same way as in
the linear multiple regression model.
Interpretation of the coefficients is model-specific, but the
general rule is to compute effects by comparing different
cases (different value of the original X‟s)
Many nonlinear specifications are possible, so you must
use judgment:
What nonlinear effect you want to analyze?
What makes sense in your application?
208
Chapter 9

Misleading Statistics

209
Statistics Means Description and
Inference
Descriptive Statistics is about describing datasets.
Various visual tricks can distort these descriptions
Inferential Statistics is about statistical inference.
You know something about tricks to distort
inference (eg. Putting in lots of variables to raise
R2 or lowering to get in a variable you want).

210
Pitfalls of Analysis
There are several ways that misleading statistics
can occur (which affect both inferential and descriptive statistics):
Obtaining flawed data
Not understanding the data
Not choosing appropriate displays of data
Fitting an inappropriate model
Drawing incorrect conclusions from analysis.

211
Poor Displays of Data: Chart
Junk

Source: Wainer (1984), How to display data badly


212
Poor Displays of Data: 2D
picture

213
Poor Displays of Data: Axes

Increments of 100,000

A jump in the scale from


800,000 to 1,500,000

214
How to Display Data
• The golden rule for displaying data in a graph is to
keep it simple
• Graphs should not have any chart junk.
– “minimise the ratio of ink to data” - Tufte
• Axes should be chosen so they do not inflate or deflate
the differences between observations
– Where possible, start the Y-axis at 0
– If this is not possible then you should consider graphing the
change in the observation from one period to the next
• Some general tips on how to properly display data can
be found at
http://lilt.ilstu.edu/gmklass/pos138/datadisplay/sections/goodcharts.htm

215
How to Display Data

216
Incorrect Conclusions: Causality
Correlation: 0.848
Excess money supply (%)          Increase in prices two years later (%)
1965   4.7                       1967   2.5
1966   1.9                       1968   4.7
1967   7.8                       1969   5.4
1968   4.0                       1970   6.4
1969   1.3                       1971   9.4
1970   7.8                       1972   7.1
1971  11.4                       1973   9.2
1972  23.4                       1974  16.1
1973  22.2                       1975  24.2

Source: Grenville and Macfarlane (1988) 217


Accompanying Letter
Sir,
Professor Lord Kaldor today (March 31) states that
“there is no historical evidence whatever” that the money
supply determines the future movement of prices with
a time lag of two years. May I refer Professor Kaldor to
your article in The Times of July 13, 1976.

Data

If one calculates the correlation between these two sets


of figures the coefficient r=0.848 and since there are seven
degrees of freedom the P value is less than 0.01. If Mr
Rees-Mogg‟s figures are correct, this would appear to a
biologist to be a highly significant correlation, for it means
that the probability of the correlation occurring by chance
is less than one in a hundred. Most betting men would
think that those were impressive odds.
Until Professor Kaldor can show a fallacy in the figures,
I think Mr Rees-Mogg has fully established his point.
Yours faithfully,
IVOR H. MILLS,
University of Cambridge Clinical School,
Department of Medicine,
218
Response
Sir,
Professor Mills today (April 4) uses correlation
analysis in your columns to attempt to resolve the
theoretical dispute over the cause(s) of inflation. He
cites a correlation coefficient of 0.848 between the rate
of inflation and the rate of change of “excess” money
supply two years before.
We were rather puzzled by this for we have always
believed that it was Scottish Dysentery that kept prices
down (with a one-year lag, of course). To reassure
ourselves, we calculated the correlation between the
following sets of figures:

219
Incorrect Conclusions: Causality
Correlation: -0.868
Cases of Dysentery in Scotland ('000)    Increase in prices one year later (%)
1966   4.3                               1967   2.5
1967   4.5                               1968   4.7
1968   3.7                               1969   5.4
1969   5.3                               1970   6.4
1970   3.0                               1971   9.4
1971   4.1                               1972   7.1
1972   3.2                               1973   9.2
1973   1.6                               1974  16.1
1974   1.5                               1975  24.2
Source: Grenville and Macfarlane (1988)                    220
A Final Warning
We have to inform you that the correlation coefficient
is -0.868 (which is statistically slightly more significant
than that obtained by Professor Mills). Professor Mills says
that “Until … a fallacy in the figures [can be shown], I
think Mr Rees-Mogg has fully established his point.” By
the same argument, so have we.
Yours faithfully.

G. E. J. LLEWELLYN, R. M. WITCOMB.
Faculty of Economics and Politics,
221
