
Chapter 4

Linear Regression with


One Regressor
('Simple Regression')
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation

2
Does Having Too Many Students
Per Teacher Lower Test Marks?

4
Scatterplots are Pictures of 1→1
Association

5
Is there a Number for This
Relationship?

6
What about Mean? Variance?

7
Treat this as a Dataset on Student-teacher ratio (STR), called 'X'

8
Treat this as a Dataset on Student-teacher ratio (STR), called 'X'
Imagine Falling Rain

9
Collapse onto 'X' (horizontal) axis

10
Ignore 'Y' (vertical)

11
Sample Mean

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

12
Sample Variance

S_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

13
Standard error/deviation is the
square root of the variance

S_x = \sqrt{S_x^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}

14
(It is very close to a typical departure of x from its mean:
'standard' = 'typical'; 'deviation/error' = departure from the mean)

S_x \approx \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}

15
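To make these formulas concrete, here is a minimal sketch in Python (the numbers are invented for illustration, not the California data) computing the three quantities just defined.

```python
# A minimal sketch: sample mean, variance, and standard deviation of a
# small, made-up STR sample (not the course's California dataset).
import numpy as np

x = np.array([19.0, 21.5, 20.3, 18.7, 22.1])   # hypothetical student-teacher ratios

x_bar = x.mean()          # sample mean: sum(x_i)/n
s2_x = x.var(ddof=1)      # sample variance: sum((x_i - x_bar)^2)/(n-1)
s_x = np.sqrt(s2_x)       # standard deviation: square root of the variance

print(x_bar, s2_x, s_x)
```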
Treat this as a Dataset on Test score, called 'Y'

16
Collapse onto Y axis

17
Calculate Mean and Variance

\bar{y}, \; S_y^2

18
Is there a Number for This
Relationship? Not Yet

19
Break up All Observations
into 4 Quadrants
[Figure: scatterplot split at (\bar{x}, \bar{y}) into quadrants II | I (top row) and III | IV (bottom row)]

20
Fill In the Signs of Deviations from
Means for Different Quadrants
Quadrant I: x_i - \bar{x} > 0 and y_i - \bar{y} > 0
Quadrant II: x_i - \bar{x} < 0 and y_i - \bar{y} > 0
Quadrant III: x_i - \bar{x} < 0 and y_i - \bar{y} < 0
Quadrant IV: x_i - \bar{x} > 0 and y_i - \bar{y} < 0

21
The Products are Positive in I and III

(x_i - \bar{x})(y_i - \bar{y}) > 0 in quadrants I and III

23
The Products are Negative in II and IV

(x_i - \bar{x})(y_i - \bar{y}) < 0 in quadrants II and IV

24
Sample Covariance, Sxy, describes
the Relationship between X and Y
S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
If Sxy > 0 most data lies in I and III:
This concurs with our visual common sense because
it looks like a positive relationship
If Sxy < 0 most data lies in II and IV
This concurs with our visual common sense because
it looks like a negative relationship
If S_xy = 0, the data is 'evenly spread' across I-IV
25
What About Our Data?
[Quadrant scatterplot of the STR-Test score data]

26
Large Negative Sxy

27
Large Positive Sxy

28
Zero Sxy

29
Our Data has a Mild Negative
Covariance Sxy<0
[Quadrant scatterplot of the STR-Test score data]

30
Correlation, rXY, is a Measure of
Relationship that is Unit-less
r_{XY} = \frac{S_{XY}}{S_X S_Y}

It can be proved that it lies between -1 and 1:
-1 \le r_{XY} \le 1
It has the same sign as SXY so ….

31
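A minimal sketch of these two formulas in Python, again with made-up numbers rather than the California data:

```python
# Sketch: sample covariance S_xy and the unit-free correlation r_xy.
import numpy as np

x = np.array([19.0, 21.5, 20.3, 18.7, 22.1])        # e.g. STR (illustrative)
y = np.array([660.0, 648.5, 655.2, 662.3, 645.1])   # e.g. test scores (illustrative)

s_xy = np.cov(x, y, ddof=1)[0, 1]   # (1/(n-1)) * sum((x_i - x_bar)(y_i - y_bar))
r_xy = np.corrcoef(x, y)[0, 1]      # S_xy / (S_x * S_y), always lies in [-1, 1]

print(s_xy, r_xy)
```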
Mild Negative Correlation
r_{XY} = -0.2264
[Quadrant scatterplot of the STR-Test score data]

32
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation

33
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation

But…
How much does Y change when X changes?
What is a good guess of Y if X =25?
What does correlation = -.2264 mean anyway?
34
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation

35
What is Simple Regression?

Simple regression allows us to answer all three questions:
“How much does Y change when X changes?”
“What is a good guess of Y if X =25?”
“What does correlation = -.2264 mean anyway?”
…by fitting a straight line to data
on two variables, Y and X.

Y = b_0 + b_1 X
36
Y = b_0 + b_1 X

37
We Get our Guessed Line Using '(Ordinary) Least Squares' [OLS]
OLS minimises the squared difference between a
regression line and the observations.
We can view these squared differences as squares.
This task then becomes the minimisation of the
area of the squares.

Applet: http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html

38
Measures of Fit
(Section 4.3)

The regression R2 can be seen from the applet


http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html
It is the proportional reduction in the sum of squares as one moves from modeling Y by a constant (with LS estimator being the sample mean, and sum of squares equal to the 'total sum of squares', TSS) to a line: R^2 = [TSS - 'sum of squares'] / TSS
If the model fits perfectly, 'sum of squares' = 0 and R^2 = 1
If the model does no better than a constant, it = TSS and R^2 = 0
The standard error of the regression (SER) measures the
magnitude of a typical regression residual in the units of Y.

39
The Standard Error of the
Regression (SER)
The SER measures the spread of the distribution of u. The SER
is (almost) the sample standard deviation of the OLS residuals:
SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (\hat{u}_i - \bar{\hat{u}})^2}

    = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat{u}_i^2}

(the second equality holds because \bar{\hat{u}} = \frac{1}{n}\sum_{i=1}^{n} \hat{u}_i = 0)

40
SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat{u}_i^2}

The SER:
has the units of u, which are the units of Y
measures the average “size” of the OLS residual (the average
“mistake” made by the OLS regression line)
Don't worry about the n-2 (instead of n-1 or n) – the reason is too technical, and doesn't matter if n is large.

41
How the Computer Did it
(SW Key Concept 4.2)

42
The OLS Line has a Small Negative
Slope

Estimated slope: \hat{\beta}_1 = -2.28
Estimated intercept: \hat{\beta}_0 = 698.9
Estimated regression line: \widehat{TestScore} = 698.9 - 2.28\,STR


43
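How the computer gets these two numbers can be reproduced by hand from the covariance and variance: b_1 = S_xy / S_x^2 and b_0 = \bar{y} - b_1\bar{x}. A sketch with illustrative numbers; run on the actual California data these formulas return 698.9 and -2.28.

```python
# Sketch: OLS slope and intercept for a simple regression, computed by hand.
import numpy as np

str_ = np.array([19.0, 21.5, 20.3, 18.7, 22.1])       # hypothetical STR values
score = np.array([660.0, 648.5, 655.2, 662.3, 645.1]) # hypothetical test scores

b1 = np.cov(str_, score, ddof=1)[0, 1] / np.var(str_, ddof=1)  # slope = S_xy / S_x^2
b0 = score.mean() - b1 * str_.mean()                           # intercept = y_bar - b1*x_bar

print(b0, b1)
```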
Interpretation of the estimated slope
and intercept
Test score= 698.9 – 2.28 STR
Districts with one more student per teacher on average have
test scores that are 2.28 points lower.
That is, \frac{\Delta TestScore}{\Delta STR} = -2.28
The intercept (taken literally) means that, according to this
estimated line, districts with zero students per teacher would
have a (predicted) test score of 698.9.
This interpretation of the intercept makes no sense – it
extrapolates the line outside the range of the data – here, the
intercept is not economically meaningful.
44
Remember Calculus?
Test score = 698.9 – 2.28 STR
Differentiation gives \frac{d\,TestScore}{d\,STR} = -2.28
'd' means 'infinitely small change', but for a 'very small change', called '\Delta', it will still be pretty close to the truth. So, an approximation is:
\frac{\Delta TestScore}{\Delta STR} = -2.28
How to interpret this? Take the denominator over to the other side:
\Delta TestScore = -2.28 \cdot \Delta STR
So, if STR goes up by one, Test score falls by 2.28.
If STR goes up by, say, 20, Test score falls by 2.28 × 20 = 45.6
45
Predicted values & residuals:

One of the districts in the data set is Antelope, CA, for which
STR = 19.33 and Test Score = 657.8
predicted value: \hat{Y}_{Antelope} = 698.9 - 2.28 \times 19.33 = 654.8
residual: \hat{u}_{Antelope} = 657.8 - 654.8 = 3.0
46
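A quick numeric check of the Antelope calculation above:

```python
# Check of the predicted value and residual for Antelope, CA.
b0, b1 = 698.9, -2.28
str_antelope, score_antelope = 19.33, 657.8

predicted = b0 + b1 * str_antelope        # 698.9 - 2.28*19.33 ≈ 654.8
residual = score_antelope - predicted     # 657.8 - 654.8 ≈ 3.0

print(round(predicted, 1), round(residual, 1))
```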
R 2 and SER evaluate the Model

TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6


By using STR, you only reduce the sum of squares by 5%
compared with just ‘modeling’ Test score by its average. That is,
STR only explains a small fraction of the variation in test scores.
The typical residual size is about 19 points (SER = 18.6), which looks large.
47
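A sketch of how R^2 and the SER would be computed from a fitted line's residuals (any y and \hat{y} arrays; the n-2 matches the simple-regression SER defined above):

```python
# Sketch: R^2 and SER from actual and fitted values of a simple regression.
import numpy as np

def r2_and_ser(y, y_hat):
    u_hat = y - y_hat                      # OLS residuals
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ssr = np.sum(u_hat ** 2)               # sum of squared residuals
    r2 = 1 - ssr / tss                     # proportional reduction in sum of squares
    ser = np.sqrt(ssr / (len(y) - 2))      # SER uses n - 2 in simple regression
    return r2, ser
```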
Seeing R2 and the SER in EVIEWs
Dependent Variable: TESTSCR
Method: Least Squares
Sample: 1 420
Included observations: 420
TESTSCR=C(1)+C(2)*STR

Coefficient Std. Error t-Statistic Prob.


C(1) 698.9330 9.467491 73.82451 0.0000
C(2) -2.279808 0.479826 -4.751327 0.0000

R-squared 0.051240 Mean dependent var 654.1565


Adjusted R-squared 0.048970 S.D. dependent var 19.05335
S.E. of regression 18.58097 Akaike info criterion 8.686903
Sum squared resid 144315.5 Schwarz criterion 8.706143
Log likelihood -1822.250 Durbin-Watson stat 0.129062

= 698.9 – 2.28.STR

48
Recap
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes?
What is a good guess of Y if X =25?
What does correlation = -.2264 mean anyway?

49
Recap
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes? \Delta Y = b_1 \Delta x
What is a good guess of Y if X = 25? b_0 + b_1(25)
What does correlation = -.2264 mean anyway?

50
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes? \Delta Y = b_1 \Delta x
What is a good guess of Y if X = 25? b_0 + b_1(25)
What does correlation = -.2264 mean anyway?
Surprise: R^2 = r_{XY}^2
51
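The 'surprise' can be checked numerically: in a simple regression with an intercept, R^2 equals the squared correlation between Y and X. A sketch with illustrative data:

```python
# Sketch verifying R^2 = r_xy^2 for a simple regression (made-up data).
import numpy as np

x = np.array([19.0, 21.5, 20.3, 18.7, 22.1])
y = np.array([660.0, 648.5, 655.2, 662.3, 645.1])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r_xy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r2, r_xy ** 2))   # True
```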
Chapter 5

Regression with a Single


Regressor: Hypothesis Tests
What is Simple Regression?
We‟ve used Simple regression as a means of
describing an apparent relationship between two
variables. This is called descriptive statistics.
Simple regression also allows us to estimate, and
make inferences, under the OLS assumptions,
about the slope coefficients of an underlying
model. We do this, as before, by fitting a straight
line to data on two variables, Y and X. This is
called inferential statistics.

54
The Underlying Model
(or „Population Regression Function‟)

Y_i = \beta_0 + \beta_1 X_i + u_i,   i = 1, …, n

X is the independent variable or regressor
Y is the dependent variable
\beta_0 = intercept
\beta_1 = slope
u_i = the regression error
The regression error consists of omitted factors, or possibly
measurement error in the measurement of Y. In general, these
omitted factors are other factors that influence Y, other than
the variable X
55
What Does it Look Like in This Case?

Y_i = \beta_0 + \beta_1 X_i + u_i,   i = 1, …, n

X is the STR
Y is the Test score
\beta_0 = intercept
\beta_1 = \frac{\Delta TestScore}{\Delta STR} = change in test score for a unit change in STR
If we also guess \beta_0 we can also predict Test score when STR has a particular value.
Clearly, we want good guesses (estimates) of \beta_0 and \beta_1.
56
A Picture is Worth 1000 Words

57
From Now on we Use 'b' or '\hat{\beta}' to Signify our Guesses, or 'Estimates', of the Slope or Intercept. We Never See the True Line.

[Figure: scatterplot with the fitted line b_0 + b_1 x and two residuals \hat{u}_1, \hat{u}_2 marked]
58
From Now on we Use 'b' or '\hat{\beta}' to Signify our Estimates of the Slope or Intercept, and \hat{u} for Guesses of u. We Never See the True Line or the u's.

[Figure: the same scatterplot, fitted line b_0 + b_1 x with residuals \hat{u}_1, \hat{u}_2]

59
Our Estimators are Really Random
Least squares estimators have a distribution; they are
different every time you take a different sample. (like an
average of 5 heights, or 7 exam marks)
The estimators are Random Variables. A random variable generates numbers with a central measure called a mean and a volatility called the standard error.
Least squares estimators b_0 & b_1 have means \beta_0 & \beta_1
Hypothesis testing:
  e.g. How do we test whether the slope \beta_1 is zero, or -37?
Confidence intervals:
  e.g. What is a reasonable range of guesses for the slope \beta_1?

60
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

61
62
Outline
1. OLS Assumptions (Very Technical)
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

63
Outline
1. OLS Assumptions (When will OLS be ‘good’?)
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

64
Estimator Distributions Depend on
Least Squares Assumptions
A key part of the model is the assumptions made
about the residuals ut for t=1,2….n.

1. E(ut)=0
2. E(ut2)=σ2 =SER2 (note: not σ2t – invariant)
3. E(utus)=0 t≠s
4. Cov(Xt, ut)=0
5. ut~Normal
65
SW has different assumptions;
Use mine for any Discussions
The conditional distribution of u given X has
mean zero, that is, E(u|X = x) = 0. (a combination
of 1. and 4.)
(Xi,Yi), i =1,…,n, are i.i.d. (unnecessary in many
applications)
Large outliers in X and/or Y are rare. (technical
assumption)

66
How reasonable
are these assumptions?
To answer, we need to understand them.

1. E(ut)=0
2. E(ut2)=σ2 =SER2 (note: not σ2t – invariant)
3. E(utus)=0 t≠s
4. Cov(Xt, ut)=0
5. ut~Normal

67
It‟s All About u

u is everything left out of the model

E(ut)=0
1. E(ut2)=σ2 =SER2 (note: not σ2t – invariant)
2. E(utus)=0 t≠s
3. Cov(Xt, ut)=0
4. ut~Normal

68
1. E(ut)=0 is not a big deal
Providing the model has a constant, this is not a
restrictive assumption.
If 'all the other influences' don't have a zero mean, the estimated constant will just adjust to the point where u does have a zero mean.
Really, \beta_0 + u could be thought of as everything else that affects y apart from x.

69
2. E(ut2)=σ2 =SER2 is Controversial

If this assumption holds, the errors are said to be


homoskedastic
If it is violated, the errors are said to be
heteroskedastic (hetero for short)
There are many conceivable forms of hetero, but
perhaps the most common is when the variance
depends upon the value of x

70
Hetero Related to X is very Common
Homoskedastic Heteroskedastic

71
Our Data Looks OK, but Don‟t be Complacent

Homoskedastic Heteroskedastic

72
3. E(utus)=0 t≠s

A violation of this is called autocorrelation


If the underlying model generates data for a time
series, it is highly likely that „left out‟ variables
will be autocorrelated (i.e. z depends on lagged z;
most time series are like this) and so u will be too.
But if the model describes a cross-section
assumption 3 is likely to hold.

73
Aside: Hetero and Auto are not a
Disaster
Hetero plagues cross-sectional data, Auto plagues
time series.
Remarkably, Heteroskedasticity and
Autocorrelation do not bias the Least Squares
Estimators.
This is a very strange result!

74
Hetero Doesn‟t Bias
y

Draw a least squares


line through these points

75
If You Could See the True Line
You‟d Realize hetero is bad for OLS

. y (a) homoskedasticity (b) heteroskedasticity

76
But OLS is Still Unbiased!
In case (b), OLS is still unbiased because the next draw
is just as likely to find the third error above the true
line, pulling up the (negative) slope of the least squares
line. On average, the true line would be revealed
with many samples.
y (a) (b)

x
But we will make an adjustment to our analysis later: OLS is no longer 'best', which means minimum variance.
77
Conquer Hetero and Auto with Just
One Click
SW recommend you correct standard errors for
hetero and auto. In EVIEWs you do this by:
estimate/options/heteroskedasticity consistent
coefficient covariance/ input: leave white ticked if
only worried about hetero. Tick Newey West to
correct for both.
Because OLS is unbiased, the correction only
occurs for the standard errors.
Sometimes, we will use standard errors corrected
in this way
78
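For readers working outside EViews, a comparable one-click correction exists in Python's statsmodels. This is a sketch with simulated heteroskedastic data, not part of the course's EViews workflow; the variable names and lag choice are illustrative.

```python
# Sketch: heteroskedasticity-robust (White) and HAC (Newey-West) standard errors
# with statsmodels, as an alternative to the EViews options described above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(20, 2, size=200)           # hypothetical regressor
u = rng.normal(0, 1, size=200) * x        # heteroskedastic error: spread grows with x
y = 700 - 2.3 * x + u

X = sm.add_constant(x)
ols = sm.OLS(y, X)

white = ols.fit(cov_type="HC1")                                  # White correction (hetero only)
newey_west = ols.fit(cov_type="HAC", cov_kwds={"maxlags": 4})    # corrects for hetero and auto

print(white.bse)        # robust standard errors
print(newey_west.bse)   # HAC standard errors
```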
4. Cov(Xt, ut)=0

This will be discussed extensively next lecture


When there is only one variable in a regression, it
is highly likely that that variable will be correlated
with a variable that is left out of the model, which
is implicitly in the error term.
Before proceeding with assumption 5, it is worth
stating that 1. – 4. are all that are required to prove
the so-called Gauss-Markov Theorem, that OLS
is Best Linear Unbiased Estimator (SW Section
5.5)
79
5. ut~Normal
With this assumption, OLS is minimum volatility
estimator among all consistent estimators.
Many variables are Normal
http://rba.gov.au/Statistics/AlphaListing/index.html
The assumption „delivers‟ a known distribution of OLS
estimators (a „t‟ distribution) if n is small. But if n is large
(>30) the OLS estimators become Normal, so it is
unnecessary. This is due to the Central Limit Theorem
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
http://www.rand.org/statistics/applets/clt.html
80
Assessment of Assumptions
1. E(ut)=0 harmless if model has constant
2. E(ut2)=σ2 =SER2 not too serious since
3. E(utus)=0 t≠s OLS still unbiased
4. Cov(Xt, ut)=0 Serious – see next lecture
5. ut~Normal Nice Property to have, but if
sample size is big it doesn’t matter
We assume 2. and 3. hold, or just adjust the
standard errors. We‟ll also assume n is large and
we‟ll always keep a constant, so 5. and 1. are not
relevant. This lecture, we assume 4 holds.
81
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

82
With OLS Assumptions the CLT
Gives Us the Distribution of \hat{\beta}_1

\hat{\beta}_1 \sim N\!\left(\beta_1, \; SE(\hat{\beta}_1)^2\right)

83
t-distribution (small n) vs Normal

[Figure: density of the t-distribution (small n) plotted against the standard normal]

84
EVIEWs output gives us SE(B1)

Dependent Variable: TESTSCR


Method: Least Squares
Date: 06/04/08 Time: 22:13
Sample: 1 420
Included observations: 420

Coefficient Std. Error t-Statistic Prob.

C 698.9330 9.467491 73.82451 0.0000


STR -2.279808 0.479826 -4.751327 0.0000

R-squared 0.051240 Mean dependent var 654.1565


Adjusted R-squared 0.048970 S.D. dependent var 19.05335
S.E. of regression 18.58097 Akaike info criterion 8.686903
Sum squared resid 144315.5 Schwarz criterion 8.706143
Log likelihood -1822.250 Hannan-Quinn criter. 8.694507
F-statistic 22.57511 Durbin-Watson stat 0.129062
Prob(F-statistic) 0.000003

87
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

88
EVIEWs Output Can be Summarized
in Two Lines
Put standard errors in parentheses below the estimated
coefficients to which they apply.
\widehat{TestScore} = 698.9 - 2.28\,STR,   R^2 = .05,  SER = 18.6
                      (9.4675)   (0.4798)

(The textbook reports heteroskedasticity-robust standard errors of 10.4 and 0.52; the EViews output above gives 9.4675 and 0.4798.)

This expression gives a lot of information:
The estimated regression line is \widehat{TestScore} = 698.9 - 2.28\,STR
The standard error of \hat{\beta}_0 is 9.4675 (10.4 with the robust correction)
The standard error of \hat{\beta}_1 is 0.4798 (0.52 with the robust correction)
The R^2 is .05; the standard error of the regression is 18.6


89
We Only Need Two Numbers For
Hypothesis Testing
For hypothesis testing about the slope we only need two of these numbers: \hat{\beta}_1 = -2.28 and its standard error SE(\hat{\beta}_1).
90
Remember Hypothesis Testing?
1. H0 = null hypothesis = „status quo‟ belief = what
you believe without good reason to doubt it.
2. H1 = alternative hypothesis = what you believe if
you reject H0
3. Collect evidence and create a calculated Test
Statistic
4. Decide on a significance level = test size = Prob(type I error) = \alpha
5. The test size defines a rejection region and a
critical value (the changeover point)
6. Reject H0 if Test Statistic lies in Rejection Region
91
Hypothesis Testing and the Standard Error of \hat{\beta}_1 (Section 5.1)
The objective is to test a hypothesis, like \beta_1 = 0, using data – to reach a tentative conclusion whether the (null) hypothesis is correct or incorrect.
General setup
Null hypothesis and two-sided alternative:
H_0: \beta_1 = \beta_{1,0}  vs.  H_1: \beta_1 \ne \beta_{1,0}
where \beta_{1,0} is the hypothesized value under the null.
Null hypothesis and one-sided alternative:
H_0: \beta_1 = \beta_{1,0}  vs.  H_1: \beta_1 < \beta_{1,0}
92
General approach: construct the t-statistic, and compute the p-value (or compare to the N(0,1) critical value)

In general:  t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}

where the SE of the estimator is the square root of an estimator of the variance of the estimator.

For testing \beta_1:  t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}

where SE(\hat{\beta}_1) is the square root of an estimator of the variance of the sampling distribution of \hat{\beta}_1.

Comparing the distance between the estimate and your hypothesized value is obvious; doing it in units of volatility is less so.
93
Summary: To test H_0: \beta_1 = \beta_{1,0} vs. H_1: \beta_1 \ne \beta_{1,0},
Construct the t-statistic

t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}

Reject at the 5% significance level if |t| > 1.96
This procedure relies on the large-n approximation; typically n = 30 is large enough for the approximation to be excellent.

94
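A sketch of this recipe using the slide's numbers for the test-score regression:

```python
# Two-sided t-test of H0: beta_1 = 0 using the estimates from the slides.
b1_hat, se_b1, b1_null = -2.28, 0.52, 0.0

t = (b1_hat - b1_null) / se_b1      # ≈ -4.38
reject_5pct = abs(t) > 1.96         # compare with the 5% two-sided critical value

print(round(t, 2), reject_5pct)
```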
p-values are another method

See textbook pp 72-81, and look up p-value in the index

1. What values of the test statistic would make you more


determined to reject the null than you are now?
2. If the null is true, what is the probability of obtaining
those values? This is the p-value.
“the p-value, also called the significance probability [not in
QBA] is the probability of drawing a statistic at least as
adverse to the null hypothesis as the one you actually
computed in your sample, assuming the null hypothesis is
correct” pg. 73
95
p-values are another method
See textbook pp 72-81, and look it up in the index

For a two-sided test, the p-value is p = Pr[|t| > |t^{act}|] = the probability in the tails of the normal outside |t^{act}|;
you reject at the 5% significance level if the p-value is < 5% (or < 1% or < 10%, depending on the test size)
REJECT H_0 IF p-value < \alpha
96
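A sketch of the p-value calculation for the test-score slope, using the large-n normal approximation:

```python
# Two-sided p-value from the standard normal approximation.
from scipy.stats import norm

t_act = -4.38
p_value = 2 * norm.sf(abs(t_act))   # Pr[|t| > |t_act|] in both tails

print(p_value)                      # about 1.2e-05, so reject at any conventional level
```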
Example: Test Scores and STR,
California data
Estimated regression line: TestScore = 698.9 – 2.28STR
Regression software reports the standard errors:
(The standard errors are corrected for heteroskedasticity)
SE(\hat{\beta}_0) = 10.4    SE(\hat{\beta}_1) = 0.52

t-statistic testing \beta_{1,0} = 0:   t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} = \frac{-2.28 - 0}{0.52} = -4.38

The 1% two-sided critical value is 2.58, so we reject the null at the 1% significance level.
Alternatively, we can compute the p-value…
97
The p-value based on the large-n standard normal approximation
to the t-statistic is 0.00001 (10–5)
98
Hypothesis Testing Can be Tricky
Dependent Variable: TESTSCR
Method: Least Squares
Date: 06/05/08 Time: 22:35
Sample: 1 420
Included observations: 420

Coefficient Std. Error t-Statistic Prob.

C 655.1223 1.126888 581.3553 0.0000


COMPUTER -0.003183 0.002106 -1.511647 0.1314

'Prob' only equals the p-value for a two-sided test

99
Try These Hypotheses
(a) H_0: \beta_1 = 0, H_1: \beta_1 > 0, with \alpha = .05, using the critical-values approach

(b) H_0: \beta_1 = 0, H_1: \beta_1 < 0, with \alpha = .05, using the critical-values approach

(c) H_0: \beta_1 = 0, H_1: \beta_1 \ne 0, with \alpha = .05, using the critical-values approach

(d) H_0: \beta_1 = 0, H_1: \beta_1 > 0, with \alpha = .05, using the p-value approach

(e) H_0: \beta_1 = 0, H_1: \beta_1 < 0, with \alpha = .05, using the p-value approach

(f) H_0: \beta_1 = 0, H_1: \beta_1 \ne 0, with \alpha = .05, using the p-value approach

(g) H_0: \beta_1 = -.05, H_1: \beta_1 < -.05, with \alpha = .10


100
H0:B1=0, H1:B1>0, with =.05 using the critical-values approach

101
(b) H0:B1=0, H1:B1<0, with =.05 using the critical-values approach

102
(c) H0:B1=0, H1:B1≠0, with =.05 using the critical-values approach

103
(d) H0:B1=0, H1:B1>0, with =.05 using the p-value approach

This is very hard to do with p-values!

104
(e) H0:B1=0, H1:B1<0, with =.05 using the p-value approach

105
(f) H0:B1=0, H1:B1≠0, with =.05 using the p-value approach

106
(g) H0:B1=-0.05, H1:B1<-0.05, with =.10

107
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

108
With OLS Assumptions the CLT
Gives Us the Distribution of \hat{\beta}_1

\hat{\beta}_1 \sim N\!\left(\beta_1, \; SE(\hat{\beta}_1)^2\right)

109
Confidence Intervals

\hat{\beta}_1 \sim N\!\left(\beta_1, \; SE(\hat{\beta}_1)^2\right)

\hat{\beta}_1 - \beta_1 \sim N\!\left(0, \; SE(\hat{\beta}_1)^2\right)

\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim N(0, 1)
110
95% Confidence Intervals Catch the
True Parameter 95% of the Time
Prob\!\left(-1.96 \le \frac{\hat{\beta} - \beta}{SE(\hat{\beta})} \le 1.96\right) \approx .95
Prob\!\left(-1.96\,SE(\hat{\beta}) \le \hat{\beta} - \beta \le 1.96\,SE(\hat{\beta})\right) \approx .95
Prob\!\left(\hat{\beta} - 1.96\,SE(\hat{\beta}) \le \beta \le \hat{\beta} + 1.96\,SE(\hat{\beta})\right) \approx .95
So, the probability that \beta will be captured by the random interval \hat{\beta} \pm 1.96\,SE(\hat{\beta}) is 0.95.
http://bcs.whfreeman.com/bps4e/content/cat_010/applets/confidenceinterval.html
111
Confidence Intervals are
Reasonable Ranges
If we cannot reject H_0: \beta = \beta^* in favour of H_1: \beta \ne \beta^* at, say, 5%,
it implies
\left|\frac{\hat{\beta} - \beta^*}{SE(\hat{\beta})}\right| \le 1.96,   i.e.   \hat{\beta} - 1.96\,SE(\hat{\beta}) \le \beta^* \le \hat{\beta} + 1.96\,SE(\hat{\beta})
But this just says \beta^* must lie in a 95% CI.
Going the other way, we can define a 1 - \alpha confidence interval as the range of values that could not be rejected as nulls in a two-sided test of significance with test size \alpha.
112
Confidence interval example: Test Scores and STR
Estimated regression line: TestScore = 698.9 – 2.28 STR

SE(\hat{\beta}_0) = 10.4    SE(\hat{\beta}_1) = 0.52

95% confidence interval for \beta_1:

\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1) = -2.28 \pm 1.96 \times 0.52 = (-3.30, -1.26)

113
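A quick check of the interval reported above:

```python
# 95% confidence interval for beta_1 from the slide's estimates.
b1_hat, se_b1 = -2.28, 0.52

lower = b1_hat - 1.96 * se_b1    # ≈ -3.30
upper = b1_hat + 1.96 * se_b1    # ≈ -1.26

print(round(lower, 2), round(upper, 2))
```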
If You Make 1→1 Associations, Use
Simple Regression, not Correlation
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals

But be careful of the Simple Regression assumption


Cov(Xt, ut)=0

114
Chapter 6

Introduction to
Multiple Regression

115
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

116
It‟s all about u
(SW Section 6.1)

The error u arises because of factors that influence Y but are not
included in the regression function; so, there are always omitted
variables.

Sometimes, the omission of those variables can lead to bias in the


OLS estimator. This occurs because the assumption
4. Cov(Xt, ut)=0
Is violated

117
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

118
Omitted variable bias=OVB
The bias in the OLS estimator that occurs as a result of an
omitted factor is called omitted variable bias.
Let y = \beta_0 + \beta_1 x + u, and let u = f(Z)
Omitted variable bias is a problem if the omitted factor "Z" is:

1. A determinant of Y (i.e. Z is part of u); and

2. Correlated with the regressor X (i.e. corr(Z, X) \ne 0)

Both conditions must hold for the omission of Z to result in


omitted variable bias.
119
What Causes Long Life?
Gapminder (http://www.gapminder.org/world) is an online
applet that contains demographic information about each
country in the world.
Suppose that we are interested in predicting life
expectancy, and think that both income per capita and the
number of physicians per 1000 people would make good
indicators.
Our first step would be to graph these predictors against
life expectancy
We find that both are positively correlated with life expectancy

120
…Doctors or Income or Both?
Simple Linear Regression only allows us to use
one of these predictors to estimate life expectancy.
But income per capita is correlated with the
number of physicians per 1000 people. Suppose
the truth is:
Life=B0+B1Income+B2Doctors+u but you run

Life=B0+B1Income+u* (u*=B2Doctors+u)

121
OVB=„Double Counting‟
B1 is the impact of Income on Life, holding
everything else constant including the residual
But if correlation exists between the Doctors (in
the residual) and income (rIncDoct≠0 ), and, if the
true impact of Doctors (B2≠0) is non-zero, then B1
counts both effects – it „double counts‟

Life=B0+B1Income+u* (u*=B2Doctors+u)

122
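A small simulation sketch of the 'double counting' idea (all numbers here are invented for illustration): when doctors is omitted, the income coefficient in the short regression absorbs the omitted effect.

```python
# Sketch: omitted variable bias. 'doctors' is correlated with 'income' and
# affects 'life'; leaving it out biases the income coefficient.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
income = rng.normal(30, 10, n)
doctors = 0.1 * income + rng.normal(0, 1, n)                 # correlated with income
life = 60 + 0.2 * income + 3.0 * doctors + rng.normal(0, 2, n)

X_full = np.column_stack([np.ones(n), income, doctors])      # includes doctors
X_short = np.column_stack([np.ones(n), income])              # omits doctors

b_full, *_ = np.linalg.lstsq(X_full, life, rcond=None)
b_short, *_ = np.linalg.lstsq(X_short, life, rcond=None)

print(b_full[1])    # close to the true 0.2
print(b_short[1])   # close to 0.2 + 3.0*0.1 = 0.5: the 'double counting'
```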
Our Test score Reg has OVB
In the test score example:
1. English language deficiency (whether the student is learning
English) plausibly affects standardized test scores: Z is a
determinant of Y.
2. Immigrant communities tend to be less affluent and thus
have smaller school budgets – and higher STR: Z is
correlated with X.

Accordingly, ˆ1 is biased.

123
What is the bias? We have a formula
STR is larger for those classes with a higher PctEL (both being a
feature of poorer areas), the correlation between STR and PctEL
will be positive
PctEL appears in u with a negative sign in front of it – higher
PctEL leads to lower scores. Therefore the correlation between
STR and u[ minus PctEL], must be negative (ρXu < 0).
Here is the formula (standard deviations are always positive):

Bias = E(\hat{\beta}_1) - \beta_1 = \rho_{Xu}\,\frac{\sigma_u}{\sigma_X}

– the volatility of the error and of the included variable matter, and here \rho_{Xu} < 0.
So the coefficient on the student-teacher ratio is negatively biased by the exclusion of the percentage of English learners. It is 'too big' in absolute value.
124
Including PctEL Solves Problem
Some ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which treatment
(STR) is randomly assigned: then PctEL is still a determinant
of TestScore, but PctEL is uncorrelated with STR. (But this is
unrealistic in practice.)
2. Adopt the “cross tabulation” approach, with finer gradations
of STR and PctEL – within each group, all classes have the
same PctEL, so we control for PctEL (But soon we will run
out of data, and what about other determinants like family
income and parental education?)
3. Use a regression in which the omitted variable (PctEL) is no
longer omitted: include PctEL as an additional regressor in a
multiple regression.
125
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

126
The Population Multiple Regression
Model (SW Section 6.2)
Consider the case of two regressors:
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i,   i = 1, …, n

Y is the dependent variable
X_1, X_2 are the two independent variables (regressors)
(Y_i, X_{1i}, X_{2i}) denote the i-th observation on Y, X_1, and X_2
\beta_0 = unknown population intercept
\beta_1 = effect on Y of a change in X_1, holding X_2 constant
\beta_2 = effect on Y of a change in X_2, holding X_1 constant
u_i = the regression error (omitted factors)
127
Partial Derivatives in Multiple
Regression = Cet. Par. in Economics
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i,   i = 1, …, n

We can use calculus to interpret the coefficients:

\beta_1 = \frac{\partial Y}{\partial X_1}, holding X_2 constant = ceteris paribus

\beta_2 = \frac{\partial Y}{\partial X_2}, holding X_1 constant = ceteris paribus

\beta_0 = predicted value of Y when X_1 = X_2 = 0.

128
The OLS Estimator in Multiple
Regression (SW Section 6.3)
With two regressors, the OLS estimator solves:

\min_{b_0, b_1, b_2} \sum_{i=1}^{n} \bigl[Y_i - (b_0 + b_1 X_{1i} + b_2 X_{2i})\bigr]^2

The OLS estimator minimizes the average squared difference


between the actual values of Yi and the prediction (predicted
value) based on the estimated line.
This minimization problem is solved using calculus
This yields the OLS estimators of 0 , 1 and 2.

129
Multiple regression in EViews
Dependent Variable: TESTSCR
Method: Least Squares
Sample: 1 420
Included observations: 420
White Heteroskedasticity-Consistent Standard Errors & Covariance

TESTSCR=C(1)+C(2)*STR+C(3)*EL_PCT

Coefficient Std. Error t-Statistic Prob.


C(1) 686.0322 8.728224 78.59930 0.0000
C(2) -1.101296 0.432847 -2.544307 0.0113
C(3) -0.649777 0.031032 -20.93909 0.0000

R-squared 0.426431 Mean dependent var 654.1565


Adjusted R-squared 0.423680 S.D. dependent var 19.05335
S.E. of regression 14.46448 Akaike info criterion 8.188387
Sum squared resid 87245.29 Schwarz criterion 8.217246
Log likelihood -1716.561 Durbin-Watson stat 0.685575

TestScore = 686.0 – 1.10 STR – 0.65 PctEL

More on this printout later… 130


Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

131
Measures of Fit for Multiple
Regression (SW Section 6.4)
R2 now becomes the square of the correlation coefficient
between y and predicted y.

It is still the proportional reduction in the residual sum of


squares as we move from modeling y with just a sample
mean, to modeling it with a group of variables.

132
R^2 and \bar{R}^2
The R^2 is the fraction of the variance explained – same definition as in regression with a single regressor:

R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS},

where ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{\hat{Y}})^2,   SSR = \sum_{i=1}^{n} \hat{u}_i^2,   TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2.

The R2 always increases when you add another regressor


(why?) – a bit of a problem for a measure of “fit”

133
R^2 and \bar{R}^2
The \bar{R}^2 (the "adjusted R^2") corrects this problem by "penalizing" you for including another regressor – the \bar{R}^2 does not necessarily increase when you add another regressor.

Adjusted R^2:
\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\cdot\frac{SSR}{TSS} = 1 - \frac{n-1}{n-k-1}\,(1 - R^2)

Note that \bar{R}^2 < R^2; however, if n is large the two will be very close.
134
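A sketch of these two formulas, checked against the two-regressor EViews printout a few slides back (SSR = 87245.29, R^2 = 0.426431, n = 420, k = 2 gives an adjusted R^2 of about 0.4237, matching the reported 0.423680):

```python
# Sketch: R^2 and adjusted R^2 from the sums of squares.
def r2_and_adjusted_r2(ssr, tss, n, k):
    r2 = 1 - ssr / tss
    adj_r2 = 1 - (n - 1) / (n - k - 1) * (ssr / tss)
    return r2, adj_r2

# TSS is recovered from SSR and R^2 (TSS = SSR / (1 - R^2)).
tss = 87245.29 / (1 - 0.426431)
print(r2_and_adjusted_r2(ssr=87245.29, tss=tss, n=420, k=2))
```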
Measures of fit, ctd.
Test score example:

(1) \widehat{TestScore} = 698.9 - 2.28\,STR,
    R^2 = .05, SER = 18.6

(2) \widehat{TestScore} = 686.0 - 1.10\,STR - 0.65\,PctEL,
    R^2 = .426, \bar{R}^2 = .424, SER = 14.5

What – precisely – does this tell you about the fit of regression
(2) compared with regression (1)?
Why are the R2 and the R 2 so close in (2)?
135
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator

136
Sampling Distribution Depends on
Least Squares Assumptions (SW Section 6.5)
yi=B0+B1x1i+B2x2i+……Bkxki+ui
• E(ut)=0
1. E(ut2)=σ2 =SER2 (note: not σ2t – invariant)
2. E(utus)=0 t≠s
3. Cov(Xt, ut)=0
4. ut~Normal plus
5. There is no perfect multicollinearity

137
Assumption #4: There is no perfect multicollinearity
Perfect multicollinearity is when one of the regressors is an
exact linear function of the other regressors.

Example: Suppose you accidentally include STR twice:

138
Perfect multicollinearity is when one of the regressors is an
exact linear function of the other regressors.
In such a regression (where STR is included twice), 1 is the
effect on TestScore of a unit change in STR, holding STR
constant (???)
The Standard Errors become Infinite when perfect
multicollinearity exists

139
OLS Wonder Equation
SE(b_i) \approx \frac{S_{\hat{u}}}{S_{x_i}} \cdot \frac{1}{\sqrt{n\,\bigl(1 - R^2_{x_i \text{ on other } X\text{'s}}\bigr)}}

• Multicollinearity raises R^2_{x_i \text{ on other } X\text{'s}} and therefore increases the variance of b_i
• Perfect multicollinearity (R^2 = 1) makes regression impossible
• Don't expect lower standard errors just because you add more variables: the more you add, the higher the R^2 in the denominator becomes (it always rises with extra variables), which pushes SE(b_i) up.
140
Quality of Slope Estimate (R2
and Suˆ fixed)
[Figure: three scatterplots with fitted lines. With n = 6 and a small spread S_{x_i}^2, SE(b_i) is high; with n = 20 and the same spread, SE(b_i) is low; with n = 6 but a larger spread, SE(b_i) is also low.]

141
The Sampling Distribution of the
OLS Estimator (SW Section 6.6)
Under the Least Squares Assumptions,
• The exact (finite sample) distribution of \hat{\beta}_1 has mean \beta_1, and var(\hat{\beta}_1) is inversely proportional to n; so too for \hat{\beta}_2.
• Other than its mean and variance, the exact (finite-n) distribution of \hat{\beta}_1 is very complicated; but for large n…
• \hat{\beta}_1 is consistent: \hat{\beta}_1 \xrightarrow{p} \beta_1 (law of large numbers)
• \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} is approximately distributed N(0, 1) (CLT)
• So too for \hat{\beta}_2, …, \hat{\beta}_k


Conceptually, there is nothing new here!
142
Multicollinearity, Perfect and
Imperfect (SW Section 6.7)
Some more examples of perfect multicollinearity
The example from earlier: you include STR twice.
Second example: regress TestScore on a constant, D, and Bel, where D_i = 1 if STR ≤ 20 (= 0 otherwise) and Bel_i = 1 if STR > 20 (= 0 otherwise), so Bel_i = 1 - D_i. There is perfect multicollinearity because Bel + D = 1 (the '1' variable for the constant).
To fix this, drop the constant

143
Perfect multicollinearity, ctd.
Perfect multicollinearity usually reflects a mistake in the
definitions of the regressors, or an oddity in the data
If you have perfect multicollinearity, your statistical software
will let you know – either by crashing or giving an error
message or by “dropping” one of the variables arbitrarily
The solution to perfect multicollinearity is to modify your list
of regressors so that you no longer have perfect
multicollinearity.

144
Imperfect multicollinearity
Imperfect and perfect multicollinearity are quite different despite
the similarity of the names.

Imperfect multicollinearity occurs when two or more regressors


are very highly correlated.
Why this term? If two regressors are very highly
correlated, then their scatterplot will pretty much look like a
straight line – they are collinear – but unless the correlation
is exactly 1, that collinearity is imperfect.

145
Imperfect multicollinearity, ctd.
Imperfect multicollinearity implies that one or more of the
regression coefficients will be imprecisely estimated.
Intuition: the coefficient on X1 is the effect of X1 holding X2
constant; but if X1 and X2 are highly correlated, there is very
little variation in X1 once X2 is held constant – so the data are
pretty much uninformative about what happens when X1
changes but X2 doesn‟t, so the variance of the OLS estimator
of the coefficient on X1 will be large.
Imperfect multicollinearity (correctly) results in large
standard errors for one or more of the OLS coefficients as
described by the OLS wonder equation
Next topic: hypothesis tests and confidence intervals…

146
Portion of X that “explains” Y

High R2
Y
For any two circles,
the overlap tells the
size of the R2

147
Portion of X that “explains” Y

LowR2
Y
For any two circles,
the overlap tells the
size of the R2

148
Adding Another X Increases R2

X2

X1

Now the R is the overlap of both X1


and X2 with Y
149
Imperfect (but high)
multicollinearity
Since X_2 and X_1 share a lot of the same information, adding X_2 allows us to work out independent effects better, but we realize we don't have much information (area) to do this with. Larger n makes all circles bigger and, as before, the overlap tells the size of R^2.
[Venn diagram: circles for Y, X_1 and X_2, with X_1 and X_2 overlapping heavily]
SE(b_1) \approx \frac{S_{\hat{u}}}{S_{x_1}} \cdot \frac{1}{\sqrt{n\,(1 - R^2_{x_1 \text{ on } x_2})}}
150
Chapter 7:
Multiple Regression:
Multiple Coefficient Testing

151
Multiple Coefficients Tests?

y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + … + \beta_k x_{kt} + e_t

We know how to obtain estimates of the


coefficients, and each one is a ceteris paribus („all
other things equal‟) effect
Why would we want to do hypotheses tests about
groups of coefficients?

152
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

Example 1: Consider the statement that „this whole


model is worthless‟.
One way of making that statement mathematically
formal is to say
\beta_1 = \beta_2 = … = \beta_k = 0
because if this is true then none of the variables x1, x2.
. xk helps explain y
153
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

Example 2: Suppose y is the share of the population that votes


for the ruling party and x1 and x2 is the spending on TV and
radio advertising.
The Prime Minister might want to know if TV is more effective
than radio, as measured by the impact on the share of the
popular vote for of an extra dollar spent on each. The way to
write this mathematically is

\beta_1 > \beta_2

154
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

Example 3: Suppose y is the growth in GDP, x1 and x2 are the


cash rate one and two quarters ago, and that all the other X‟s are
different macroeconomic variables.
Suppose we are interested in testing the effectiveness of
monetary policy. One way of doing this is asking if the cash
rate at any lag has an impact on GDP growth. Mathematically,
this is

\beta_1 = \beta_2 = 0
155
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

In each case, we are interested in making


statements about groups of coefficients.
What about just looking at the estimates?
Same problem as in t-testing. You ought to care
about reliability.
What about sequential testing?
errors compound, even if possible (SW Sect. 7.2)
156
Multiple Coefficients Tests

yt = 0 + 1x1t + 2x2t +.. kxkt + et

The so-called F-test can do all of these


restrictions, except for example 2.
Before turning to the F-test, let‟s do example 2,
which can be done with a t-test

157
Example 2 Solution
y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + … + \beta_k x_{kt} + e_t
If \theta = \beta_1 - \beta_2 then \beta_1 = \theta + \beta_2. Sub this in:
y_t = \beta_0 + (\theta + \beta_2) x_{1t} + \beta_2 x_{2t} + … + e_t
    = \beta_0 + \theta x_{1t} + \beta_2 (x_{1t} + x_{2t}) + … + e_t
So, to test \beta_1 > \beta_2, just run a new regression including x_1 + x_2 instead of x_2 (everything else is left the same) and do a t-test of H_0: \theta = 0 vs. H_1: \theta > 0. Naturally, if you accept H_1: \theta > 0, this implies \beta_1 - \beta_2 > 0, which implies \beta_1 > \beta_2.
This technique is called reparameterization

158
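A sketch of the trick with simulated data (the variable names and true coefficients are invented for illustration): regressing y on x_1 and (x_1 + x_2) delivers a t-test on \theta = \beta_1 - \beta_2.

```python
# Sketch: reparameterization for testing beta_1 - beta_2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.5 * x2 + rng.normal(size=n)   # true theta = 0.8 - 0.5 = 0.3

# Regress y on x1 and (x1 + x2); the coefficient on x1 is now theta.
X = sm.add_constant(np.column_stack([x1, x1 + x2]))
fit = sm.OLS(y, X).fit()

print(fit.params[1], fit.pvalues[1])   # estimate of theta and its two-sided p-value
```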
Restricted Regressions

yt = 0 + 1x1t + 2x2t +.. kxkt + et

One more thing before we do the F-test, we


must define a „restricted regression‟. This is
just the model you get when a hypothesis is
assumed true

159
Restricted Regression: Example 1

y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + … + \beta_k x_{kt} + e_t

Example 1: Consider the statement that 'this whole model is worthless'.
If \beta_1 = \beta_2 = … = \beta_k = 0
then the model is
y_t = \beta_0 + e_t
and the restricted regression would be an OLS regression of y on a constant. The estimate for the constant will just be the sample mean of y.
160
Restricted Regression: Example 3

y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + … + \beta_k x_{kt} + e_t

If \beta_1 = \beta_2 = 0, then the model is
y_t = \beta_0 + \beta_3 x_{3t} + … + \beta_k x_{kt} + e_t
and the restricted regression is an OLS regression of y on a constant and x_3 to x_k

161
Properties of Restricted
Regressions
Imposing a restriction always increases the residual
sum of squares, since you are forcing the estimates to
take the values implied by the restriction, rather than
letting OLS choose the values of the estimates to
minimize the SSR
If the SSR increases a lot, it implies that the restriction is relatively 'unbelievable'. That is, the model fits a lot worse with the restriction imposed.
This last point is the basic intuition of the F-test – impose the restriction and see if the SSR goes up 'too much'.
http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html

162
The F-test
To test a restriction we need to run the restricted
regression as well as the unrestricted regression (i.e.
the original regression). Let q be the number of
restrictions.
Intuitively, we want to know if the change in SSR is
big enough to suggest the restriction is wrong

F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)},   where r is restricted and ur is unrestricted
163
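A sketch of the formula as a small helper (q restrictions, k regressors in the unrestricted model):

```python
# Sketch: F statistic from the restricted and unrestricted sums of squared residuals.
def f_stat(ssr_r, ssr_ur, q, n, k):
    # Relative increase in SSR from imposing the q restrictions,
    # scaled by the unrestricted fit per degree of freedom.
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
```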
The F statistic
The F statistic is always positive, since the SSR
from the restricted model can‟t be less than the
SSR from the unrestricted
Essentially the F statistic is measuring the relative
increase in SSR when moving from the unrestricted
to restricted model
q = number of restrictions

164
The F statistic (cont)
To decide if the increase in SSR when we move to
a restricted model is “big enough” to reject the
restrictions, we need to know about the sampling
distribution of our F stat
Not surprisingly, F ~ Fq,n-k-1, where q is referred to
as the numerator degrees of freedom and n – k-1 as
the denominator degrees of freedom

165
The F statistic Reject H0 at
significance level
f(F) if F > c

0 c F
fail to reject reject
166
Equivalently, using p-values
f(F) Reject H0if p-value <

0 c F
fail to reject reject
167
The R^2 form of the F statistic
Because the SSRs may be large and unwieldy, an alternative form of the formula is useful.
We use the fact that SSR = TSS(1 - R^2) for any regression, so we can substitute in for SSR_r and SSR_{ur}:

F = \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n - k - 1)},   where again r is restricted and ur is unrestricted
168
Overall Significance (example 1)

A special case of exclusion restrictions is to test H_0: \beta_1 = \beta_2 = … = \beta_k = 0
R^2 = 0 for a model with only an intercept.
This is because the OLS estimator is then just the sample mean, implying TSS = SSR.
The F statistic is then

F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}
169
F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}

Dependent Variable: TESTSCR
Method: Least Squares
Date: 06/05/08 Time: 15:29
Sample: 1 420
Included observations: 420

Coefficient Std. Error t-Statistic Prob.

C 675.6082 5.308856 127.2606 0.0000


MEAL_PCT -0.396366 0.027408 -14.46148 0.0000
AVGINC 0.674984 0.083331 8.100035 0.0000
STR -0.560389 0.228612 -2.451272 0.0146
EL_PCT -0.194328 0.031380 -6.192818 0.0000

R-squared 0.805298 Mean dependent var 654.1565


Adjusted R-squared 0.803421 S.D. dependent var 19.05335
S.E. of regression 8.447723 Akaike info criterion 7.117504
Sum squared resid 29616.07 Schwarz criterion 7.165602
Log likelihood -1489.676 Hannan-Quinn criter. 7.136515
F-statistic 429.1152 Durbin-Watson stat 1.545766
Prob(F-statistic) 0.000000

[.8053/4]/[{1-.8053}/(420-5)] = 429
170
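A quick check of the overall-significance F statistic against the printout above:

```python
# Overall F statistic via the R^2 form, using the reported R-squared.
r2, k, n = 0.805298, 4, 420
f = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f, 1))   # ≈ 429.1, matching the printout's F-statistic
```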
General Linear Restrictions
The basic form of the F statistic will work for any
set of linear restrictions
First estimate the unrestricted model and then
estimate the restricted model
In each case, make note of the SSR
Imposing the restrictions can be tricky – will likely
have to redefine variables again

171
F Statistic Summary
Just as with t statistics, p-values can be calculated
by looking up the percentile in the appropriate F
distribution
If only one exclusion is being tested, then F = t2,
and the p-values will be the same
F-tests are done mechanically – you don‟t have to
do the restricted regressions (though you have to
understand how to do them for this course).

172
F-tests are Easy in EVIEWs
To test hypotheses like these in EVIEWs, use the
Wald test. After you run your regression, type
„View, Coefficient tests, Wald‟
Try testing a single restriction (which you can use
a t-test for) and see that t2=F, and, that the p-
values are the same.
Try testing all the coefficients except the intercept
are zero, and compare it with the F-test
automatically calculated in EVIEWs.
SW discusses the shortcomings of F-tests at
length. They crucially depend upon the
assumption of homoskedasticity.

173
Start Big and Go Small
General to Specific Modeling relies upon the fact that
omitted variable bias is a serious problem.
Start with a very big model to avoid OVB
Do t-tests on individual coefficients. Delete the most
insignificant, run the model again, delete the most
insignificant variable, run the model again, and so
on….until every individual coefficient is significant.
Finally, Do an F-test on the original model excluding all
the coefficients required to get to your final model at once.
If the null is accepted, you have verified the model.
Test for Hetero, and correct for it if need be.
174
Chapter 8

Nonlinear Regression
Functions

175
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

176
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

177
Nonlinear Regression Population Regression
Functions – General Ideas (SW Section 8.1)
If a relation between Y and X is nonlinear:

The effect on Y of a change in X depends on the value of X –


that is, the marginal effect of X is not constant
A linear regression is mis-specified – the functional form is
wrong
The estimator of the effect on Y of X is biased – it needn‟t
even be right on average.
The solution to this is to estimate a regression function that is
nonlinear in X
178
Nonlinear Functions of a Single
Independent Variable (SW Section 8.2)
We‟ll look at two complementary approaches:

1. Polynomials in X
The population regression function is approximated by a
quadratic, cubic, or higher-degree polynomial

2. Logarithmic transformations
Y and/or X is transformed by taking its logarithm
this gives a “percentages” interpretation that makes sense
in many applications
179
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

180
2. Polynomials in X
Approximate the population regression function by a polynomial:

Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + … + \beta_r X_i^r + u_i

This is just the linear multiple regression model – except that


the regressors are powers of X!
Estimation, hypothesis testing, etc. proceeds as in the
multiple regression model using OLS
The coefficients are difficult to interpret, but the regression
function itself is interpretable
181
Example: the TestScore – Income
relation
Incomei = average district income in the ith district
(thousands of dollars per capita)

Quadratic specification:

TestScore_i = \beta_0 + \beta_1\,Income_i + \beta_2\,(Income_i)^2 + u_i

Cubic specification:

TestScore_i = \beta_0 + \beta_1\,Income_i + \beta_2\,(Income_i)^2 + \beta_3\,(Income_i)^3 + u_i
182
Estimation of the quadratic
specification in EViews
Dependent Variable: TESTSCR
Method: Least Squares Create a quadratic regressor
Sample: 1 420
Included observations: 420
White Heteroskedasticity-Consistent Standard Errors & Covariance
TESTSCR=C(1)+C(2)*AVGINC + C(3)*AVGINC*AVGINC

Coefficient Std. Error t-Statistic Prob.


C(1) 607.3017 2.901754 209.2878 0.0000
C(2) 3.850995 0.268094 14.36434 0.0000
C(3) -0.042308 0.004780 -8.850509 0.0000

R-squared 0.556173 Mean dependent var 654.1565


Adjusted R-squared 0.554045 S.D. dependent var 19.05335
S.E. of regression 12.72381 Akaike info criterion 7.931944
Sum squared resid 67510.32 Schwarz criterion 7.960803
Log likelihood -1662.708 Durbin-Watson stat 0.951439

Test the null hypothesis of linearity against the alternative that


the regression function is a quadratic….
183
Interpreting the estimated
regression function:
(a) Plot the predicted values
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2
(2.9) (0.27) (0.0048)

184
Interpreting the estimated
regression function, ctd:
(b) Compute “effects” for different values of X

TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2


(2.9) (0.27) (0.0048)

Predicted change in TestScore for a change in income from


$5,000 per capita to $6,000 per capita:

\Delta TestScore = (607.3 + 3.85 \times 6 - 0.0423 \times 6^2) - (607.3 + 3.85 \times 5 - 0.0423 \times 5^2) = 3.4
185
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

Predicted “effects” for different values of X:

Change in Income ($1000 per capita) TestScore


from 5 to 6 3.4
from 25 to 26 1.7
from 45 to 46 0.0

The “effect” of a change in income is greater at low than high


income levels (perhaps, a declining marginal benefit of an
increase in school budgets?)
Caution! What is the effect of a change from 65 to 66?
Don’t extrapolate outside the range of the data!

186
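A quick check of the three entries in the table above, using the estimated quadratic:

```python
# Predicted TestScore changes from the estimated quadratic in income.
def test_score(income):
    return 607.3 + 3.85 * income - 0.0423 * income ** 2

print(round(test_score(6) - test_score(5), 1))    # ≈ 3.4
print(round(test_score(26) - test_score(25), 1))  # ≈ 1.7
print(round(test_score(46) - test_score(45), 1))  # ≈ 0.0
```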
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

Predicted “effects” for different values of X:

Change in Income ($1000 per capita) TestScore


from 5 to 6 3.4
from 25 to 26 1.7
from 45 to 46 0.0

Alternatively, dTestscore/dIncome = 3.85-.0846 (Income)


gives the same numbers (approx)

187
Summary: polynomial regression
functions
Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + … + \beta_r X_i^r + u_i

Estimation: by OLS after defining new regressors


Coefficients have complicated interpretations
To interpret the estimated regression function:
plot predicted values as a function of x
compute predicted Y/ X at different values of x
Hypotheses concerning degree r can be tested by t- and F-
tests on the appropriate (blocks of) variable(s).
Choice of degree r
plot the data; t- and F-tests, check sensitivity of estimated
effects; judgment.
188
A Final Warning: Polynomials
Can Fit Too Well
When fitting a polynomial regression function, we
need to be careful not to fit too many terms, despite
the fact that a higher order polynomial will always
fit better.
If we do fit too many terms, then any prediction
may become unrealistic.
The following applet lets us explore fitting
different polynomials to some data.
http://www.scottsarra.org/math/courses/na/nc/polyRegression.html

189
3. Are Polynomials Enough?
We can investigate the appropriateness of a
regression function by graphing the regression
function over the top of the scatterplot.
For some models, we may need to transform the
data
For example, take logs of the response variable
The site below allows us to do this, exploring some
common regression functions
http://www.ruf.rice.edu/%7Elane/stat_sim/transformations/index.html

190
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

191
3. Logarithmic functions of Y and/or X
ln(X) = the natural logarithm of X
Logarithmic transforms permit modeling relations in
“percentage” terms (like elasticities), rather than linearly.

Here's why:  \frac{d\,\ln(x)}{dx} = \frac{1}{x},  so  \Delta \ln(x) \approx \frac{\Delta x}{x} = proportional change in x

Numerically:
ln(1.01) - ln(1) = .00995 - 0 = .00995 (correct %: .01);
ln(40) - ln(45) = 3.6889 - 3.8067 = -.1178 (correct %: -.1111)
192
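A quick numerical check of these log differences:

```python
# Log differences approximate proportional changes.
import numpy as np

print(np.log(1.01) - np.log(1.0))   # 0.00995, close to the exact 1% change
print(np.log(40) - np.log(45))      # -0.1178, vs the exact -11.11% change
```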
Three log regression specifications:

Case             Population regression function

I. linear-log    Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i
II. log-linear   \ln(Y_i) = \beta_0 + \beta_1 X_i + u_i
III. log-log     \ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i

The interpretation of the slope coefficient differs in each case.


The interpretation is found by applying the general “before
and after” rule: “figure out the change in Y for a given change
in X.”
193
Summary: Logarithmic
transformations
Three cases, differing in whether Y and/or X is transformed
by taking logarithms.
The regression is linear in the new variable(s) ln(Y) and/or
ln(X), and the coefficients can be estimated by OLS.
Hypothesis tests and confidence intervals are now
implemented and interpreted “as usual.”
The interpretation of \beta_1 differs from case to case.
Choice of specification should be guided by judgment (which
interpretation makes the most sense in your application?),
tests, and plotting predicted values
194
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments


2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions

195
Regression when X is Binary
(Section 5.3)

Sometimes a regressor is binary:


X = 1 if small class size, = 0 if not
X = 1 if female, = 0 if male
X = 1 if treated (experimental drug), = 0 if not

Binary regressors are sometimes called “dummy” variables.

So far, \beta_1 has been called a "slope," but that doesn't make sense if X is binary.

How do we interpret regression with a binary regressor?


196
Interpreting regressions with a
binary regressor
Y_i = \beta_0 + \beta_1 X_i + u_i, where X is binary (X_i = 0 or 1):

When X_i = 0:  Y_i = \beta_0 + u_i
  the mean of Y_i is \beta_0
  that is, E(Y_i | X_i = 0) = \beta_0

When X_i = 1:  Y_i = \beta_0 + \beta_1 + u_i
  the mean of Y_i is \beta_0 + \beta_1
  that is, E(Y_i | X_i = 1) = \beta_0 + \beta_1

so:
\beta_1 = E(Y_i | X_i = 1) - E(Y_i | X_i = 0) = population difference in group means
197
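A sketch illustrating this result with simulated data (the dummy and the coefficient values are invented): the OLS slope on a binary regressor reproduces the difference in group means.

```python
# Sketch: OLS with a binary regressor equals the difference in group means.
import numpy as np

rng = np.random.default_rng(2)
d = rng.integers(0, 2, size=500)                  # hypothetical dummy (e.g. small class)
y = 650 + 7 * d + rng.normal(0, 19, size=500)     # true group difference is 7

X = np.column_stack([np.ones(d.size), d])
b0_hat, b1_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(b1_hat, y[d == 1].mean() - y[d == 0].mean())   # identical (up to rounding)
```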
Interactions Between Independent
Variables (SW Section 8.3)
Perhaps a class size reduction is more effective in some
circumstances than in others…
Perhaps smaller classes help more if there are many English
learners, who need individual attention
That is, \frac{\Delta TestScore}{\Delta STR} might depend on PctEL
More generally, \frac{\Delta Y}{\Delta X_1} might depend on X_2
How to model such "interactions" between X_1 and X_2?
We first consider binary X's, then continuous X's
198
(a) Interactions between two binary
variables
Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + u_i

D_{1i}, D_{2i} are binary
\beta_1 is the effect of changing D_1 = 0 to D_1 = 1. In this specification, this effect doesn't depend on the value of D_2.
To allow the effect of changing D_1 to depend on D_2, include the "interaction term" D_{1i} \times D_{2i} as a regressor:

Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3\,(D_{1i} \times D_{2i}) + u_i

199
Interpreting the coefficients:
Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3\,(D_{1i} \times D_{2i}) + u_i

It can be shown that

\frac{\Delta Y}{\Delta D_1} = \beta_1 + \beta_3 D_2

The effect of D_1 depends on D_2 (what we wanted)
\beta_3 = increment to the effect of D_1 from a unit change in D_2

200
Example: TestScore, STR, English
learners
Let
HiSTR = 1 if STR ≥ 20 (= 0 if STR < 20)   and   HiEL = 1 if PctEL ≥ 10 (= 0 if PctEL < 10)

\widehat{TestScore} = 664.1 - 18.2\,HiEL - 1.9\,HiSTR - 3.5\,(HiSTR \times HiEL)
                      (1.4)     (2.3)       (1.9)       (3.1)

“Effect” of HiSTR when HiEL = 0 is –1.9


“Effect” of HiSTR when HiEL = 1 is –1.9 – 3.5 = –5.4
Class size reduction is estimated to have a bigger effect when
the percent of English learners is large
This interaction isn't statistically significant: t = -3.5/3.1 = -1.13, so |t| < 1.96
201
(b) Interactions between continuous
and binary variables
Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i

D_i is binary, X is continuous
As specified above, the effect on Y of X (holding D constant) is \beta_1, which does not depend on D
To allow the effect of X to depend on D, include the "interaction term" D_i \times X_i as a regressor:

Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3\,(D_i \times X_i) + u_i

202
Binary-continuous interactions: the
two regression lines
Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3\,(D_i \times X_i) + u_i

Observations with D_i = 0 (the "D = 0" group):
Y_i = \beta_0 + \beta_1 X_i + u_i      (the D = 0 regression line)

Observations with D_i = 1 (the "D = 1" group):
Y_i = \beta_0 + \beta_1 X_i + \beta_2 + \beta_3 X_i + u_i
    = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_i + u_i      (the D = 1 regression line)
203
Binary-continuous interactions, ctd.
[Figure: the D = 0 and D = 1 lines in three cases: \beta_3 = 0 (same slope, different intercepts), all \beta_i non-zero (different slopes and intercepts), \beta_2 = 0 (same intercept, different slopes)]

204
Interpreting the coefficients:
Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3\,(X_i \times D_i) + u_i

Or, using calculus,

\frac{\Delta Y}{\Delta X} = \beta_1 + \beta_3 D

The effect of X depends on D (what we wanted)
\beta_3 = increment to the effect of X from a change in the level of D from D = 0 to D = 1

205
Example: TestScore, STR, HiEL (= 1 if PctEL ≥ 10)
\widehat{TestScore} = 682.2 - 0.97\,STR + 5.6\,HiEL - 1.28\,(STR \times HiEL)
                      (11.9)    (0.59)     (19.5)      (0.97)

When HiEL = 0:
\widehat{TestScore} = 682.2 - 0.97\,STR
When HiEL = 1:
\widehat{TestScore} = 682.2 - 0.97\,STR + 5.6 - 1.28\,STR = 687.8 - 2.25\,STR
Two regression lines: one for each HiEL group.
Class size reduction is estimated to have a larger effect when
the percent of English learners is large.
206
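A sketch of the two implied regression lines using the estimates above (HiEL = 0 versus HiEL = 1):

```python
# Predicted test scores from the binary-continuous interaction model.
b0, b1, b2, b3 = 682.2, -0.97, 5.6, -1.28

def predicted_score(str_, hi_el):
    return b0 + b1 * str_ + b2 * hi_el + b3 * (str_ * hi_el)

# Slopes: -0.97 when HiEL = 0, and -0.97 - 1.28 = -2.25 when HiEL = 1.
print(predicted_score(20, 0), predicted_score(20, 1))
```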
Example, ctd: Testing hypotheses
\widehat{TestScore} = 682.2 - 0.97\,STR + 5.6\,HiEL - 1.28\,(STR \times HiEL)
                      (11.9)    (0.59)     (19.5)      (0.97)
The two regression lines have the same slope \Leftrightarrow the coefficient on STR \times HiEL is zero: t = -1.28/0.97 = -1.32
The two regression lines have the same intercept \Leftrightarrow the coefficient on HiEL is zero: t = 5.6/19.5 = 0.29
The two regression lines are the same \Leftrightarrow the population coefficient on HiEL = 0 and the population coefficient on STR \times HiEL = 0: F = 89.94 (p-value < .001) !!
We reject the joint hypothesis but neither individual hypothesis (how can this be?)

207
Summary: Nonlinear Regression
Functions
Using functions of the independent variables such as ln(X)
or X1 X2, allows recasting a large family of nonlinear
regression functions as multiple regression.
Estimation and inference proceed in the same way as in
the linear multiple regression model.
Interpretation of the coefficients is model-specific, but the
general rule is to compute effects by comparing different
cases (different value of the original X‟s)
Many nonlinear specifications are possible, so you must
use judgment:
What nonlinear effect you want to analyze?
What makes sense in your application?
208
Chapter 9

Misleading Statistics

209
Statistics Means Description and
Inference
Descriptive Statistics is about describing datasets.
Various visual tricks can distort these descriptions
Inferential Statistics is about statistical inference.
You know something about tricks to distort
inference (eg. Putting in lots of variables to raise
R2 or lowering to get in a variable you want).

210
Pitfalls of Analysis
There are several ways that misleading statistics
can occur (which affect both inferential and descriptive statistics):
Obtaining flawed data
Not understanding the data
Not choosing appropriate displays of data
Fitting an inappropriate model
Drawing incorrect conclusions from analysis.

211
Poor Displays of Data: Chart
Junk

Source: Wainer (1984), How to display data badly


212
Poor Displays of Data: 2D
picture

213
Poor Displays of Data: Axes

Increments of 100,000

A jump in the scale from


800,000 to 1,500,000

214
How to Display Data
• The golden rule for displaying data in a graph is to
keep it simple
• Graphs should not have any chart junk.
– “minimise the ratio of ink to data” - Tufte
• Axes should be chosen so they do not inflate or deflate
the differences between observations
– Where possible, start the Y-axis at 0
– If this is not possible then you should consider graphing the
change in the observation from one period to the next
• Some general tips on how to properly display data can
be found at
http://lilt.ilstu.edu/gmklass/pos138/datadisplay/sections/goodcharts.htm

215
How to Display Data

216
Incorrect Conclusions: Causality
Correlation: 0.848
Excess money supply (%)          Increase in prices two years later (%)
1965   4.7                       1967   2.5
1966   1.9                       1968   4.7
1967   7.8                       1969   5.4
1968   4.0                       1970   6.4
1969   1.3                       1971   9.4
1970   7.8                       1972   7.1
1971  11.4                       1973   9.2
1972  23.4                       1974  16.1
1973  22.2                       1975  24.2

Source: Grenville and Macfarlane (1988) 217


Accompanying Letter
Sir,
Professor Lord Kaldor today (March 31) states that
“there is no historical evidence whatever” that the money
supply determines the future movement of prices with
a time lag of two years. May I refer Professor Kaldor to
your article in The Times of July 13, 1976.

Data

If one calculates the correlation between these two sets


of figures the coefficient r=0.848 and since there are seven
degrees of freedom the P value is less than 0.01. If Mr
Rees-Mogg‟s figures are correct, this would appear to a
biologist to be a highly significant correlation, for it means
that the probability of the correlation occurring by chance
is less than one in a hundred. Most betting men would
think that those were impressive odds.
Until Professor Kaldor can show a fallacy in the figures,
I think Mr Rees-Mogg has fully established his point.
Yours faithfully,
IVOR H. MILLS,
University of Cambridge Clinical School,
Department of Medicine,
218
Response
Sir,
Professor Mills today (April 4) uses correlation
analysis in your columns to attempt to resolve the
theoretical dispute over the cause(s) of inflation. He
cites a correlation coefficient of 0.848 between the rate
of inflation and the rate of change of “excess” money
supply two years before.
We were rather puzzled by this for we have always
believed that it was Scottish Dysentery that kept prices
down (with a one-year lag, of course). To reassure
ourselves, we calculated the correlation between the
following sets of figures:

219
Incorrect Conclusions: Causality
Correlation: -0.868
Cases of Dysentery in Scotland ('000)    Increase in prices one year later (%)
1966   4.3                               1967   2.5
1967   4.5                               1968   4.7
1968   3.7                               1969   5.4
1969   5.3                               1970   6.4
1970   3.0                               1971   9.4
1971   4.1                               1972   7.1
1972   3.2                               1973   9.2
1973   1.6                               1974  16.1
1974   1.5                               1975  24.2
Source: Grenville and Macfarlane (1988)                    220
A Final Warning
We have to inform you that the correlation coefficient
is -0.868 (which is statistically slightly more significant
than that obtained by Professor Mills). Professor Mills says
that “Until … a fallacy in the figures [can be shown], I
think Mr Rees-Mogg has fully established his point.” By
the same argument, so have we.
Yours faithfully.

G. E. J. LLEWELLYN, R. M. WITCOMB.
Faculty of Economics and Politics,
221
