
Lecture 1: Causality and Endogeneity

Mark Schankerman
January 10, 2011
Contents

1 Causality in Econometrics
  1.1 Random Assignment
  1.2 Non-Experimental Data
  1.3 Correlation
  1.4 Partial Effects
  1.5 Are Partial Effects Causal?
2 Specification of the Econometric Model
3 Endogeneity
  3.1 Omitted Variable Bias (OVB)
    3.1.1 Proxy Variables
    3.1.2 Optional: Omitted Variable Effect on Variances
    3.1.3 Optional: Inclusion of Irrelevant Regressors
  3.2 Measurement Errors
  3.3 Simultaneity
1 Causality in Econometrics

You have studied how OLS controls for the effects of other variables on y. Does this mean that the effect of some regressor x_j on y, when estimated by OLS, can be given a causal interpretation? Before we address this issue we will first attempt to define what we mean by causality.

We start with a definition of causality borrowed from the experimental sciences. An effect is said to be causal if it is the result of a controlled experiment where everything else is the same and only the treatment dosage changes among observations. We can then be sure that the observed effect is a result of the different treatments and of nothing else.
Suppose you want to check if a new drug improves health. We are interested in the causal effect of the drug since we want to be sure that it is the drug affecting health outcomes and not other things that change among patients taking the drug. You run an experiment where you have 100 identical individuals and you give the drug to 50 of them (the treated group). The effect of the drug is the difference between the average health outcome of the 50 individuals in the treated group and the 50 individuals in the control group (those who did not receive the drug). Let ȳ_1 denote the average health outcome when the drug is received and let ȳ_0 denote the average health outcome when the drug is not received. Then the effect of the drug is estimated by the difference in sample means between the treated and control groups,

    ȳ_1 − ȳ_0    (1)

This difference estimates the causal effect of the drug because it cannot be driven by other things: by construction all the individuals are identical. This naive estimator of the treatment effect is correct because of the experimental design.¹
1.1 Random Assignment

Finding identical individuals is practically impossible, so experiments are usually done in a different way. The key issue in an experiment is to assign the treatment (e.g., the drug) in a random way. This is called a randomized experiment, and we say that the treatment was randomly assigned.

We now have 100 possibly not identical individuals and we assign the drug randomly to 50 of them. Expression (1) still estimates the causal effect of the drug. Why? Random assignment ensures that receiving the drug is not systematically correlated with other variables affecting the health outcome (e.g., age, gender, health history, etc.). So even if the individuals differ between the two groups, random assignment ensures that the average characteristics of the treatment and control groups are the same (same average age, same proportion of males, same average health history, etc.). We say that the other characteristics are balanced.

¹In fact, since all individuals are identical, comparing any two individuals should result in the same effect (up to random noise). There is no difference between individual and average behavior.

The only systematic difference between the
treatment and control groups is the treatment itself and nothing else. This is what is needed to estimate the average causal effect of the treatment. That is, we estimate an average treatment effect, and this is all that we can aspire to when individuals are not identical.

To be precise, we will never be able to estimate the causal effect for a particular individual because a particular individual is either treated or not treated, and therefore we observe only one value of y for this person (i.e., the value when treated or when not treated). Having data on y before and after the experiment does not solve the problem unless we can control for all other things occurring between the two time periods that can affect y.

When comparing treated to control individuals we want to hold all other factors (or characteristics) fixed, or keep all other things equal. In a randomized experiment this is always the case, because the experiment is designed so as to ensure that all other factors are, on average, the same between the treated and control populations. The difference between treated and control average outcomes, equation (1), thus estimates an average causal effect because the average difference is the result of the different treatments and of nothing else. It is therefore easy to estimate the causal effect when treatment assignment is random. Indeed, we could define causality as:

A causal effect can be defined as the effect obtained when the treatment is randomly assigned.
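The logic of random assignment can be checked in a short simulation. The sketch below uses made-up numbers (the effect size, sample size, and age dependence are illustrative assumptions, not from the lecture): individuals differ in age, the treatment is a coin flip, and the difference in sample means, expression (1), recovers the assumed effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # large sample so averages settle down
true_effect = 2.0                # assumed causal effect of the drug

# Individuals differ (age affects health), but treatment is a coin flip.
age = rng.normal(50, 10, n)
treated = rng.integers(0, 2, n)
health = 10 - 0.1 * age + true_effect * treated + rng.normal(0, 1, n)

# Expression (1): difference in sample means, treated minus control.
naive = health[treated == 1].mean() - health[treated == 0].mean()
print(naive)   # close to 2.0: randomization balances age across groups
```

Because assignment is independent of age, the age distributions of the two groups are balanced on average, and the comparison of means isolates the treatment.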
1.2 Non-Experimental Data

But if the treatment (e.g., the drug, the amount of education, etc.) is distributed or assigned in a non-random way, estimator (1) usually does not estimate the causal effect. When the assignment is not random, the treatment and control groups may differ in other characteristics (not only in the treatment). If these characteristics affect the outcome variable then the simple comparison of averages also picks up this effect, in addition to the causal effect.

Example 1 Effect of an R&D grant on R&D investment. Suppose large firms receive most of the R&D grants (grants are non-randomly assigned) and that large firms also invest more resources in R&D than other firms (say because the technological size of their projects is larger). Then the difference between the mean R&D investment of the firms that received the R&D grant (the treated firms) and those that did not receive the grant (the control firms) overestimates the causal effect of the grant on R&D investment. Firm size is a characteristic that affects both the extent of R&D expenditures and the probability of receiving an R&D grant.

Randomized experiments are unusual in the social sciences. For ethical/moral reasons we do not randomly select people to attend school for different numbers of years. Nor do we give away government money randomly. The data available to us are non-experimental and usually come from government agencies or private companies.
The challenge of empirical work in the social sciences is to uncover causal relationships using non-experimental data. This is why it is challenging to do convincing empirical research in the social sciences (yes ... in the natural/medical sciences it is easier). Nevertheless, the idealized experiment is a useful concept for understanding when an estimated partial effect is causal. Essentially, we will need to argue that even though the treatment (years of education, grant, etc.) was not randomly allocated to the subjects we can, under certain conditions, treat it as if it were randomly assigned.
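A minimal simulation in the spirit of Example 1 (all magnitudes below are illustrative assumptions): when larger firms both invest more in R&D and are more likely to receive the grant, the difference in means, expression (1), overstates the true effect of the grant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
grant_effect = 1.0               # assumed true effect of the grant

size = rng.normal(0, 1, n)       # firm size (standardized)
# Non-random assignment: larger firms are more likely to get a grant.
grant = (size + rng.normal(0, 1, n) > 0).astype(float)
rd = 5 + 2.0 * size + grant_effect * grant + rng.normal(0, 1, n)

# Expression (1) applied to non-experimental data.
naive = rd[grant == 1].mean() - rd[grant == 0].mean()
print(naive, grant_effect)   # naive sits far above the true effect
```

The gap between the two printed numbers is exactly the size difference between treated and control firms loading onto the grant dummy.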
1.3 Correlation

Before we examine how econometrics deals with non-experimental data we clarify the difference between correlation and causality. Causality is not the same as correlation. Recall that the correlation between two random variables x_j and y is

    ρ_j = Cov(x_j, y) / √(Var(x_j) Var(y))    (2)

The correlation is the covariance between x_j and y normalized by the standard deviations. Simply finding a correlation between x_j and y is not enough to conclude that a change in x_j causes a change in y. It would be enough if x_j were randomly assigned. But correlation can be the result, for example, of a third factor a. That is, x_j is correlated with a and a is correlated with y. Then x_j and y may be correlated, when a is not accounted for, but x_j does not necessarily cause y.
Example 2 Persons with higher intellectual ability (a) study more years (x_j) and also earn higher incomes (y). The data will show a positive correlation between number of years of education and income, which does not necessarily reflect a causal effect from education to income. Because education is not randomly assigned, the positive correlation can be the result of other factors (gender, ability, location, etc.) and the challenge is to understand whether it reflects, in addition, a causal effect. In this example, the correlation between y and x_j arises because of common factors affecting both y and x_j that are omitted from, or not accounted for in, the analysis. In short, omitted factors can give rise to a correlation that has nothing to do with causality.
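The ability story can be mimicked numerically. In the sketch below (all parameter values are assumed for illustration), education has no causal effect on income at all, yet equation (2) delivers a sizeable positive correlation because ability drives both variables.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

a = rng.normal(0, 1, n)          # unobserved third factor ("ability")
educ = a + rng.normal(0, 1, n)   # ability raises years of schooling
income = 2.0 * a + rng.normal(0, 1, n)   # income depends on ability ONLY

# Equation (2): covariance scaled by the standard deviations.
rho = np.cov(educ, income)[0, 1] / (educ.std() * income.std())
print(rho)   # clearly positive although educ has zero causal effect here
```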
Example 3 (Levitt, 1997) City-level data in the U.S. usually show a positive correlation between police and crime. The estimated correlation between crime and police officers is 0.86. This positive correlation could be due to city size, but even after controlling for population size we get a positive correlation between crime per capita and police officers per capita of 0.37.
This is an example of another problem we face when trying to uncover causal relationships. Both variables may be simultaneously determined, and the observed data reflect their equilibrium values. We will later analyze in detail a demand and supply example where price and quantity are simultaneously determined. It is possible that a larger police force does indeed reduce crime, but it is also likely that the level of crime affects the number of police officers assigned to a city.² Thus, we observe a positive correlation in the data because cities with more crime have larger police forces. In short, simultaneity among the dependent and independent variables is another source of correlation among these variables that is not related to causality. Levitt (American Economic Review, 1997) shows that accounting for this simultaneity implies that the causal effect of police on crime is negative despite the positive correlation. Here also correlation does not equal causality.
As these examples show, we should never infer causality from correlation. Anything is possible. We mentioned before that we are interested in the causal connection between two variables because policy changes should be based on knowledge of causal relationships and not merely correlations. In addition, economic theories (implicitly) talk about causal relationships, not correlations, and therefore knowledge of the existence (and/or strength) of a causal relationship can serve to test economic theories. Also, in many instances economic theory may be ambiguous as to the effect of a policy change. For example, do higher taxes increase tax revenue? Do higher R&D grants increase company-financed R&D investment? If we estimate the causal effect of a policy change we can evaluate the effectiveness of such a policy change and make informed recommendations. For this it is crucial to know whether x_j causes y and not merely whether x_j and y are correlated. Unless causality can be established, the estimated correlation has little interest for economists.

If an estimated relationship can be given a causal interpretation then we can use the estimated causal effects to answer "what if" questions (what happens to R&D investment if R&D subsidies are increased, all other factors remaining unchanged?), which is crucial for recommending and evaluating policies. One goal of the course is to understand under what conditions it is possible to estimate causal effects with non-experimental data, and how to do it.
1.4 Partial Effects

In economics we are interested in the change in the mean of y due to a change in x_j, holding all other relevant factors constant (the partial effect of x_j on E(y|x)):

    ∂E(y|x)/∂x_j    (3)

for continuous x_j.

For example, in the model E(y|x_1, x_2) = β_1 x_1 + β_2 x_2 + β_3 x_2², the partial effect of x_2 on y is ∂E(y|x)/∂x_2 = β_2 + 2β_3 x_2. One thing that people do not often realize is that, in this case, we are not that much interested in β_2 and β_3 per se because, by themselves, they do not tell us much. Our interest is in β_2 + 2β_3 x_2 and in tracing how this partial effect varies with x_2. If, for example, x_2 measures size, then we would like to know whether the effect of x_2 on y differs with size.
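A quick numerical check of this point, with assumed coefficient values: OLS recovers the coefficients of the quadratic model, and the partial effect β_2 + 2β_3 x_2 must then be evaluated at chosen values of x_2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
b1, b2, b3 = 1.0, 0.5, -0.2      # assumed population coefficients

x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = b1 * x1 + b2 * x2 + b3 * x2**2 + rng.normal(0, 1, n)

# OLS of y on (1, x1, x2, x2^2); an intercept is included for safety.
X = np.column_stack([np.ones(n), x1, x2, x2**2])
_, b1_hat, b2_hat, b3_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The partial effect of x2 is b2 + 2*b3*x2: it varies with x2 itself.
for x2_val in (-1.0, 0.0, 1.0):
    print(x2_val, b2_hat + 2 * b3_hat * x2_val)
```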
When x_j is discrete, partial effects are computed by comparing E(y|x) at different settings of x_j, holding all other variables fixed. For example, if x_j is a 0/1 binary (dummy) variable, then its partial effect is the change in E(y|x = x_0) when only x_j changes, say, from 0 to 1:

    E(y | x_1 = x_10, ..., x_j = 1, ..., x_k = x_k0) − E(y | x_1 = x_10, ..., x_j = 0, ..., x_k = x_k0)

²It would be silly to treat the police force as randomly assigned across cities.
Example 4 Suppose that E(y|x_1, x_2) = β_1 x_1 + β_2 x_2, and that x_2 is a dummy variable. Then the partial effect of x_2 is

    E(y|x_1, x_2 = 1) − E(y|x_1, x_2 = 0) = β_2

In this simple example the partial effect of x_2 does not depend on x_1.
Example 5 If the model now is E(y|x_1, x_2) = β_1 x_1 + β_2 x_2 + β_3 x_1 x_2, the partial effect of x_2 is

    E(y|x_1, x_2 = 1) − E(y|x_1, x_2 = 0) = β_2 + β_3 x_1

which depends on x_1. The partial effect of x_1 is

    ∂E(y|x_1, x_2 = 1)/∂x_1 = β_1 + β_3   or   ∂E(y|x_1, x_2 = 0)/∂x_1 = β_1,

depending on the choice of x_2.
Example 6 Suppose x is just the dummy variable D for receiving a drug treatment and we are interested in E(y|D). We already know that the CEF is linear,

    E(y|D) = β_1 + β_2 D

and the partial effect of D on y is β_2:

    E(y|D = 1) − E(y|D = 0) = β_2

which is the population version of the difference in means between the treated and control groups, equation (1).

This example shows that β_2 can be equivalently estimated by a simple OLS regression with a single (dummy) regressor (and intercept term). This would also be true if we were interested in E(y|D, a) and assumed that E(y|D, a) = β_1 + β_2 D + λa.
Note: Sometimes the dependent variable is in (natural) logs, e.g., E(ln y|x) = β_1 + β_2 ln x_2 + β_3 x_3, and the partial effect of interest is ∂E(ln y|x)/∂ ln x_2 = β_2, which is the elasticity of y with respect to x_2.
1.5 Are Partial Effects Causal?

We can always estimate the partial effect ∂E(y|x)/∂x_j by OLS. But can the (estimated) partial effect always be interpreted as the causal effect of x_j on y? The partial effect of x_j on y can be given a causal interpretation if we control for all other things affecting both y and x_j. This is what the other x's in the model are supposed to do.³

³We will later see that sometimes even this may not be enough (because of errors in variables or simultaneity).

The difficult part
in applied econometrics is to know what the other things that need to be controlled for are, and how to measure them! This is what we mean by specification of the econometric model: the choice of regressors (and functional form) that will allow us to interpret the estimated partial effects as causal effects.

Economic models usually focus on the relationship of interest, say between y and x_1, but abstract from many other variables (call them x_2) that also affect y and may be correlated with x_1. What is wrong with estimating E(y|x_1) instead of E(y|x_1, x_2)? Nothing is wrong with this ... it is just not interesting, because the estimated effects will likely not be causal unless we can argue that x_1 is not correlated with x_2.⁴ This is not always easy to do.
For a causal interpretation we need to have a situation where the only systematic difference between those individuals treated (by x_j) and the non-treated is the treatment itself, and not other variables that also affect outcomes. This is guaranteed to happen in randomized experiments, when x_j is randomly assigned to the observations. Recall that a causal effect can be defined as the effect obtained when the treatment is randomly assigned. Random assignment guarantees that the treatment (e.g., schooling) is uncorrelated with anything else. With non-experimental data, however, we need to make sure that there are no other average differences between the subjects. We do this by controlling for other factors. With non-experimental data we interpret a partial effect as being causal when we can be sure that the treatment (regressor) was assigned as if in a randomized experiment.

Let us go over the thought process involved in specifying an econometric model for the effect of R&D subsidies on R&D expenditures.
Example 1: R&D grants and investment in R&D

We have non-experimental data on y = company-financed R&D investment and D = dummy variable for receiving an R&D grant. Let

    E(y|D) = β_1 + β_2 D

be the relationship between R&D investment and R&D grants. Is it a causal relationship? The partial effect of a change in D from D = 0 to D = 1 is

    β_2 = E(y|D = 1) − E(y|D = 0)    (4)

which can be estimated by the difference in sample means between treated and control groups, equation (1), or by an OLS regression of y on D (and a constant term).

⁴If x_1 is correlated with x_2 but x_2 does not affect y, then this does not affect the causal interpretation (x_2 is said to be an irrelevant variable).

For this partial effect to be causal, we need to believe that the R&D grants received are unrelated to other characteristics of the firm that affect R&D investment, such as firm size, industry, whether the firm exports or not, etc. If this is true, then β_2 represents a causal effect because the other characteristics affecting y are on average the same between
the firms receiving the grant and those that did not receive the grant. It is as if D was assigned randomly to firms.

But if the R&D grant received is correlated with firm size, or with industry affiliation, or with any other firm characteristic that also affects the dependent variable, then β_2 (or its estimate) will not be the causal effect of the R&D grant, since it will also reflect the effect of other characteristics correlated with D that are also affecting R&D investment.

Suppose that larger firms (in terms of sales or employment) are more likely to receive R&D grants and that larger firms also do more R&D. In this case, an estimate of (4) will be an (upward) biased estimate of the causal effect of the R&D grant on R&D investment.
Suppose now that instead of focusing on E(y|D) we focus on an expectation that conditions also on other firm characteristics:

    E(y|D, x_2, ..., x_k)

where x_2, ..., x_k are controls (firm size, industry, export status, etc.).

E(y|D, x_2, ..., x_k) is a multivariate regression function. Note that even if we are interested in the relationship between y and D only, we want to control for other variables that affect both D and y in order to be able to argue that we are estimating the causal effect of D on y. To be clear, even after controlling for x_2, ..., x_k (which is a finite list of factors) we need to assume that D is not systematically related to other unobserved factors (i.e., not among x_2, ..., x_k) affecting y, for the partial effect to be causal. We can then run a regression of y on D, x_2, ..., x_k and interpret the estimated partial effect as causal.⁵

The general message here is that, when the regressor of interest is a choice made by an economic agent (e.g., years of education, number of patents, investment in R&D, labor force participation), we need to control for those factors affecting the agent's choice that also affect the dependent variable. Otherwise we will not be able to infer causality from the choice variable to the outcome variable. The problem is that many of these factors are unobserved. Many methodological advances are motivated by the need to deal with unobserved variables in econometric models.
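Continuing the simulated R&D setting from Section 1.2 (all numbers are illustrative assumptions): adding the confounder as a control moves the estimate from the biased short-regression value back toward the true effect.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
grant_effect = 1.0               # assumed true effect of the grant

size = rng.normal(0, 1, n)
grant = (size + rng.normal(0, 1, n) > 0).astype(float)   # bigger => likelier
rd = 5 + 2.0 * size + grant_effect * grant + rng.normal(0, 1, n)

# Regression on (1, D) only: conflates the grant and size effects.
short = np.linalg.lstsq(np.column_stack([np.ones(n), grant]),
                        rd, rcond=None)[0][1]
# Regression on (1, D, size): the confounder is held fixed.
long_ = np.linalg.lstsq(np.column_stack([np.ones(n), grant, size]),
                        rd, rcond=None)[0][1]
print(short, long_)   # short is biased upward; long is close to 1.0
```

Here size is the only confounder, so conditioning on it makes the grant as good as randomly assigned; with unobserved confounders this repair is unavailable.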
2 Specification of the Econometric Model

The example in the previous section makes clear that our ability to interpret the partial effect as a causal effect depends on the specification of the CEF, i.e., on what other variables we are able to control for. This is the tricky part of doing econometrics. How many or which x's are enough? What does "enough" mean? To answer these questions is to address a core issue in applied econometric work: the specification of the econometric model.

⁵Provided the functional form of E(y|D, x_2, ..., x_k) is specified correctly. There is a recent large literature on how to estimate E(y|D = 1, x_2, ..., x_k) − E(y|D = 0, x_2, ..., x_k) without assuming a parametric functional form for the conditional expectations. Much of this is accomplished by matching a treated unit (with D = 1) to one or more control units (with D = 0) having the same values of x_2, ..., x_k.

Let us analyze this issue deeper using the standard wage model as an example. For simplicity, let us abstract from demographic characteristics (gender, origin, location, etc.) by assuming that our population is homogeneous in this sense.⁶
Let

    y = log wage
    x_1 = (s, exp, exp²)
    x_2 = a

where s is years of schooling, exp is years of on-the-job experience, and a is ability. We assume linear CEF's. We have two specifications of the model, with and without ability a:

    E(y|x_1, x_2) = E(y|s, exp, a) = β_1 + β_2 s + β_3 exp + β_4 exp² + β_5 a    (5)

    E(y|x_1) = E(y|s, exp) = π_1 + π_2 s + π_3 exp + π_4 exp²    (6)

Note that we have used different notation for the parameters since the CEF is determined by a different joint PDF in each equation. Both specifications can be written in error form with an error that is mean independent of the regressors by construction:

    y = β_1 + β_2 s + β_3 exp + β_4 exp² + β_5 a + u,    E(u|s, exp, a) = 0    (7)

    y = π_1 + π_2 s + π_3 exp + π_4 exp² + v,    E(v|s, exp) = 0    (8)
Because of the error assumption, in both cases regressing y against its own regressors gives OLS estimators which are unbiased and consistent estimators of the parameters of that model. The real question we face is this:

    Is β or π of interest?

"Interesting" here means that the partial effect can be given a causal interpretation. Can we assert that controlling for experience only makes the allocation of schooling years random? The short model does not control for natural ability while the long one does. Is it more believable to interpret β_2, rather than π_2, as a causal effect because the long regression controls for ability, which is deemed very important both in the determination of years of education and wage?
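The contrast between the long specification (5), which controls for ability, and the short one (6), which does not, can be made concrete with simulated data (the coefficients, the ability-schooling link, and the error scales below are all assumptions). Both regressions estimate their own CEF correctly; they simply estimate different parameters.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

a = rng.normal(0, 1, n)                    # ability, unobserved in (8)
s = 12 + a + rng.normal(0, 1, n)           # schooling rises with ability
ex = rng.uniform(0, 20, n)                 # experience, independent here
logw = 0.06 * s + 0.04 * ex - 0.001 * ex**2 + 0.10 * a \
       + rng.normal(0, 0.3, n)

base = [np.ones(n), s, ex, ex**2]
# Long regression includes ability; short regression omits it.
beta2 = np.linalg.lstsq(np.column_stack(base + [a]), logw, rcond=None)[0][1]
pi2 = np.linalg.lstsq(np.column_stack(base), logw, rcond=None)[0][1]
print(beta2, pi2)   # beta2 near 0.06; pi2 also absorbs the ability effect
```

In this design the schooling coefficient in the short regression exceeds the causal one because schooling proxies partly for omitted ability.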
We know there are a lot of other factors that affect education and also affect wages (e.g., family background, type of education received, occupation, etc.) so that, in principle, we can start with a richer model and then argue that β_2 above is also not causal. This is precisely the main issue that we have to tackle when examining an empirical study. We have to believe that we have controlled for enough factors so as to interpret the partial effects as causal effects. This is what is meant by a correct specification of the model.

⁶In reality, these characteristics, which certainly affect wages and education, are usually controlled for through demographic variables.
The problem with equation (8) is that the economic model says that the conditional expectation of interest is E(ln w|s, exp, a), but in fact we estimate E(ln w|s, exp). In general, the parameters in E(ln w|s, exp) differ from the parameters in E(ln w|s, exp, a), so it should not be surprising that regressing y on (1, s, exp, exp²) does not estimate β, the parameters of interest.

The moral of this example is that once we specify the CEF, the error will be mean independent of the regressors and therefore we will estimate the parameters of the CEF correctly. The question is whether those parameters (partial effects) are interesting. And this depends on what you put in the CEF. In other words, when regressing y on x we always estimate some partial effects, but are these partial effects causal effects? That is, are we estimating β or π?

It should be clear that we do not need data on all variables affecting y (this would be impossible), but we require that the x's being used in the regression were determined in a way which is uncorrelated with other unobserved factors affecting the dependent variable (and implicitly embedded in the error term). This depends on which x's we include in the regression. In other words, the interpretation of the partial effects as causal effects relies on how well the model is specified, i.e., on the choice of x's in E(y|x).
3 Endogeneity

Consider the following model in the population:

    y = β_1 + β_2 x_2 + ⋯ + β_k x_k + u = xβ + u    (9)

Writing an equation like (9) is meaningless unless we make some assumptions regarding the relationship between u and the regressors. In the classical regression model the first assumption is concerned with this issue, namely

    E(u|x) = 0    (10)

This says that the error term u is mean-independent of x. (10) is essential for proving unbiasedness of the OLS estimator. If E(u|x) = 0, iterated expectations imply that for any regressor x_j we have E(u|x_j) = 0, j = 1, ..., k. We call the k assumptions E(u|x_j) = 0 the identifying assumptions (or orthogonality conditions) because without them OLS does not estimate β in an unbiased fashion. If they hold, OLS is unbiased.

Consistency of OLS depends on a weaker independence assumption between u and the x_j's, namely E(x_j u) = 0. Because E(u) = 0, condition E(x_j u) = 0 implies Cov(x_j, u) = 0, i.e., x_j and u are uncorrelated.

In econometrics, an explanatory variable x_j is said to be endogenous in equation (9) if

    E(x_j u) ≠ 0

That is, a regressor is endogenous if it is correlated with the error u. If x_j is uncorrelated with u, we say that x_j is exogenous in equation (9).
If a regressor is endogenous then b, the OLS estimator of β, is not consistent. It is important to understand that endogeneity of a single regressor usually makes the OLS estimator of all k parameters inconsistent. The extent to which this occurs depends on the correlation between the endogenous variable and the other regressors. Suppose the last regressor, x_k, is endogenous, E(x_k u) ≠ 0, but that the others are exogenous. Examining the OLS consistency proof we arrive at

    plim b = β + (plim X′X/n)⁻¹ plim (1/n) X′U
           = β + [E(x′x)]⁻¹ E(x′u)
           = β + [E(x′x)]⁻¹ (0, ..., 0, E(x_k u))′

so unless the first k − 1 elements in the k-th column of [E(x′x)]⁻¹ are zero, endogeneity of x_k affects the plim of the other coefficient estimators.
This definition of endogeneity/exogeneity implies that we are concerned only with the consistency of the OLS estimator. Exogeneity implies consistency but not unbiasedness of OLS. In some textbooks you might find a stricter definition of exogeneity, namely that E(u|x_j) = 0, and in this case an exogenous regressor would also imply unbiasedness of OLS. Notice also that the definition of endogeneity/exogeneity refers to a specific econometric model: x_j can be exogenous in one equation but endogenous in another.
Endogeneity cannot be tested directly. Condition E(x_j u) = 0 is not directly verifiable because u is not observed. Using the OLS residuals û_i instead of u is pointless because the residuals are always orthogonal to x_j by construction:

    Σ_{i=1}^{n} û_i x_ij = 0    for any j = 1, ..., k,

regardless of the correlation between u and x_j.
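This orthogonality is easy to verify numerically. In the sketch below the fitted model is deliberately unrelated to the outcome, yet the residuals are orthogonal to every regressor anyway, so nothing about E(x_j u) can be learned from them.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = rng.normal(size=n)           # y need not even be related to X

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

# sum_i uhat_i * x_ij = 0 for every regressor j, by construction of OLS.
print(X.T @ resid)   # essentially a vector of zeros
```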
We argued that we cannot test for exogeneity directly with the data at hand. This is not precisely true. With additional information (on instrumental variables) we can design specification tests which test for E(x_j u) = 0. We study this (Hausman) test later. In any case, we usually rely on economic theory and on the intuition underlying the econometric model to assess whether the regressors are endogenous or exogenous. It is therefore important to understand the sources of endogeneity. In most econometric applications, endogeneity usually arises in one of three ways:

1. Omitted variables
2. Measurement error
3. Simultaneity
3.1 Omitted Variable Bias (OVB)

Suppose theory tells us that the relevant object of study is the conditional expectation of y given two variables x and q. For some reason, we omit q from the regression. How is the OLS estimator of the coefficient of x affected?

Why would we omit q from the regression if theory tells us to include it? The typical reason is either that we lack data to measure it or that the variable is intrinsically unmeasurable. For example, when estimating the effect of schooling on wages using household data we would like to control for the effect of the firm in which the individual works, because firms may have different wage policies. Unfortunately, household data sets do not carry information on the employing firm. We would also like to control for the individual's natural ability since it certainly affects wages, but ability is intrinsically unobserved and, possibly, unmeasurable.
Let the model be

    E(y|x_1, ..., x_k, q) = β_1 + β_2 x_2 + ⋯ + β_k x_k + γq    (11)

where q is the variable that will be omitted from the regression. We are interested in the β_j's, which are the partial effects of the observed explanatory variables holding the other explanatory variables constant, including the unobservable q.

Model (11) in error form is

    y = β_1 + β_2 x_2 + ⋯ + β_k x_k + γq + v,    E(v|x_1, x_2, ..., x_k, q) = 0    (12)

If we regress y on the observable variables only we are, in effect, putting the unobservable q into the error term:

    y = β_1 + β_2 x_2 + ⋯ + β_k x_k + u,    u = γq + v    (13)

which is an equation like (9).

Now, v has zero mean and is uncorrelated with all the x_j's (and q); v is sometimes called the structural error. We can also assume E(q) = 0 because an intercept is always included in the regression. Thus, E(u) = 0. But for u to be uncorrelated with each regressor x_1, ..., x_k it must be that q is uncorrelated with each of them. If q is correlated with any of the regressors, then so will u be, and we have an endogeneity problem: OLS would not be estimating β consistently.
To understand this omitted variable bias, denote the best linear projection (BLP) of q onto x by

    L(q|x) = xδ,    δ = [E(x′x)]⁻¹ E(x′q).

We can therefore write

    q = δ_1 + δ_2 x_2 + ⋯ + δ_k x_k + r    (14)

where, by definition of a linear projection, E(r) = 0 and Cov(x_j, r) = 0 for each j = 1, ..., k. Plugging equation (14) into (12) gives

    y = (β_1 + γδ_1) + x_2(β_2 + γδ_2) + ⋯ + x_k(β_k + γδ_k) + γr + v    (15)
The error term in this equation, γr + v, is uncorrelated with all the regressors. That is, the x's are exogenous in this equation, so OLS consistently estimates the coefficients of this regression. Thus, as discussed in Section 2, the OLS estimator of the coefficients in the model excluding q, which we denote by b*, would consistently estimate the parameters π and not β, namely

    plim b*_j = β_j + γδ_j = π_j    (16)

which says that the OLS estimator estimates the direct effect of x_j on y plus an indirect effect amounting to the effect of x_j on the unobserved q (δ_j) times the effect of q on y (γ), i.e., γδ_j.

If we suspect that a relevant variable is omitted from the specification, then it is not surprising that some regressors are endogenous. The economic reason is that these regressors may be the result of choices made by individuals or firms, and these choices may also be affected by q (which is known to the individual or firm but is not observed by the econometrician). In this case, a correlation between the included and excluded variables results (e.g., schooling and ability, inputs and managerial qualities, etc.).
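Equation (16) can be verified in a simulation with assumed values for the coefficient on x_2, the effect of the omitted variable, and the projection slope of the omitted variable on x_2.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500_000
beta2, gamma, delta2 = 1.0, 0.8, 0.5     # assumed coefficients

x2 = rng.normal(0, 1, n)
q = delta2 * x2 + rng.normal(0, 1, n)    # BLP of q on x2 has slope 0.5
y = 2.0 + beta2 * x2 + gamma * q + rng.normal(0, 1, n)

# Short regression omits q, so q is pushed into the error term.
b2_short = np.linalg.lstsq(np.column_stack([np.ones(n), x2]),
                           y, rcond=None)[0][1]
print(b2_short, beta2 + gamma * delta2)   # equation (16): both near 1.4
```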
We can use equation (16) to get an idea of the direction of the bias γδ_j, if not its magnitude. Usually we do have a good prior about the sign of γ. However, in attempting to sign the δ_j's we should remember that it is not enough to have an idea about the sign of the simple correlation coefficient between q and x_j; δ_j refers to the partial correlation between q and x_j.

The only two cases in which omitting a variable has no effect on the OLS estimates of the included variables are:

1. When the omitted variable has no effect on y, i.e., γ = 0, so that q is not really an omitted variable. We say that q is not relevant.

2. When the omitted variable is orthogonal to the included variables, that is, when E(x′q) = 0, because this implies δ = 0.
For completeness, we now show the relationship between the OLS estimator in the
regression when q is included (sometimes referred to as the long regression) and the
OLS estimator from the regression when q is excluded (the short regression). To do
this we write Y = Xb + Qc + V̂, where Q is the n×1 vector of observations on q, (b', c)
is the (k+1)×1 OLS estimator and V̂ is the n×1 vector of OLS residuals in the long
regression. Now, regressing Y on X only (the short regression) gives

    b* = (X'X)⁻¹X'Y = (X'X)⁻¹X'(Xb + Qc + V̂)
       = b + (X'X)⁻¹X'Qc + (X'X)⁻¹X'V̂
       = b + δ̂c

because X'V̂ = 0 by construction, and δ̂ = (X'X)⁻¹X'Q is the k×1 vector of OLS
estimators of the coefficients in the linear projection of q on x₁, ..., x_k (see equation
(14)). This relationship is the sample counterpart of (16).
The estimates of the short regression estimate the direct (partial) effect of x on y
plus the indirect effect on y that operates through the correlation between x and q.⁷
Clearly b* is inconsistent,

    plim b* = β + γδ = π    (17)

which is (16) in vector form.
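The relation plim b*_j = β_j + γδ_j is easy to check by simulation. The sketch below is not part of the original notes; the parameter values and variable names are illustrative. It generates data with one included regressor x and an omitted q whose linear projection on x has slope δ, then compares the short-regression OLS slope with β + γδ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta, gamma, delta = 1.0, 2.0, 0.5   # y = beta*x + gamma*q + v

x = rng.normal(size=n)
q = delta * x + rng.normal(size=n)   # projection of q on x has slope delta
v = rng.normal(size=n)
y = beta * x + gamma * q + v

# short regression: y on x only, omitting q
b_short = np.polyfit(x, y, 1)[0]

print(b_short)   # close to beta + gamma*delta = 2.0, not to beta = 1.0
```

With these values the indirect effect γδ = 1.0, so the short-regression slope roughly doubles the true direct effect.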
3.1.1 Proxy Variables

The effect of omitting a relevant variable can be reduced if we have a proxy variable
for the unobserved variable q. A proxy variable z should satisfy two requirements:

1.
    E(y|x, q, z) = E(y|x, q)    (18)

2.
    E(q|x, z) = E(q|z) = θ₀ + θ₁z    (19)

Condition 1 means that z does not play a role in explaining y once x and q have
been controlled for as in model (11). This is not a strong assumption since q, and not z,
is what affects y (otherwise z would be part of x). In the wage-education example, let q
be ability and z be an IQ test score. The model says that it is ability that affects wages;
the IQ score should not matter for wages given ability.

Condition 2 is more important and says that q is not correlated with x once we
partial out (account for) z. That is, the BLP of q on (z, x) in error form is

    q = θ₀ + θ₁z + r    (20)

with E(r) = 0 and Cov(z, r) = 0 by definition. Condition 2 assumes, in fact, that z
accounts for all the possible correlation between q and the x's:

    Cov(x_j, r) = 0,  j = 1, ..., k
To see how using the proxy z instead of q affects the estimated coefficients, we
replace q in (12) with equation (20) to get an estimable equation

    y = (β₁ + γθ₀) + x₂β₂ + ... + x_kβ_k + γθ₁z + (γr + v)

The composite error u = γr + v is uncorrelated with the regressors under the
assumptions made. Condition 1 ensures that z is uncorrelated with v, while z is uncorrelated
with r by construction. The x_j's are uncorrelated with the structural error v, and
condition 2 ensures that the x_j's are uncorrelated with r. Thus, we know that the OLS
regression

    y on 1, x₂, ..., x_k, z    (21)
7 In the analogy with calculus, b* estimates the total derivative while b estimates the partial derivative.
gives consistent estimators of

    (β₁ + γθ₀), β₂, ..., β_k, γθ₁

We can estimate β consistently if we use a proxy variable (except for β₁ and γ, of
course). If one of the x_j's, say x_k, does not satisfy condition 2, then

    q = θ₀ + λ_k x_k + θ₁z + r    (22)

and (21) would be estimating

    (β₁ + γθ₀), β₂, ..., (β_k + γλ_k), γθ₁

consistently.
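The proxy logic can also be verified numerically. The following sketch (not from the notes; the data-generating values are made up) builds q = θ₀ + θ₁z + r with r independent of x, so condition 2 holds, and shows that the regression including z recovers β while the short regression does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta, gamma = 1.0, 2.0
theta0, theta1 = 0.5, 1.5

z = rng.normal(size=n)               # proxy, e.g. an IQ score
r = rng.normal(size=n)               # projection error, independent of x
q = theta0 + theta1 * z + r          # unobserved ability
x = 0.8 * z + rng.normal(size=n)     # x correlated with q only through z
v = rng.normal(size=n)
y = beta * x + gamma * q + v

# regression including the proxy: coefficient on x is consistent for beta,
# coefficient on z estimates gamma*theta1
X = np.column_stack([np.ones(n), x, z])
b_proxy = np.linalg.lstsq(X, y, rcond=None)[0]

# short regression omitting both q and z: coefficient on x is biased
b_short = np.polyfit(x, y, 1)[0]

print(b_proxy[1], b_proxy[2], b_short)
```

Note that, as in the text, the coefficient on z estimates γθ₁ rather than γ itself, so the proxy rescues β but not γ.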
3.1.2 Optional: Omitted Variable Effect on Variances

Sometimes it is useful to know which estimator has a smaller variance: the one based
on the long regression or the one based on the short regression. This is not a very
interesting question, since we are comparing a consistent estimator with an inconsistent
one, but something useful will come out of this exercise.

Recall that the variances of b and b* can be written as

    V(b|X, Q) = σ²(X'M_q X)⁻¹,  M_q = I − Q(Q'Q)⁻¹Q'
    V(b*|X, Q) = σ²(X'X)⁻¹

where Q is the n×1 vector of observations on q. We condition on X and Q to ensure
that σ² = V(v|x, q) is the same in both expressions. This implies

    V(b|X, Q) − V(b*|X, Q) = σ²[(X'M_q X)⁻¹ − (X'X)⁻¹]
In order to evaluate the sign (in a matrix sense) of this expression we appeal
to the following result. Let A, B be two positive definite matrices. If B − A is positive
semidefinite then so is A⁻¹ − B⁻¹. In our case, both matrices in question are positive definite.
Taking B = V(b*|X, Q)⁻¹ and A = V(b|X, Q)⁻¹,

    B − A = σ⁻²[(X'X) − (X'M_q X)]
          = σ⁻²X'(I − M_q)X
          = σ⁻²X'Q(Q'Q)⁻¹Q'X

Now,

    X'Q(Q'Q)⁻¹Q'X = X'QQ'X / (Q'Q)

Q'Q = Σᵢ qᵢ² is a positive scalar and the numerator is clearly a positive semidefinite matrix.
Then V(b*|X, Q)⁻¹ − V(b|X, Q)⁻¹ is positive semidefinite and therefore so is V(b) − V(b*),
i.e., V(b) − V(b*) ≥ 0 in a matrix sense.
The conclusion is that the OLS estimator of β in the long regression has a
larger covariance matrix than in the short regression.

Consider the particular case y = α + βx + γq + v. The regressors are a constant,
the scalar x and the omitted variable q. Let S_xx = Σᵢ(xᵢ − x̄)². It is well known that
in the short regression (the regression excluding q) the variance of b* is

    V(b*|x, q) = σ²/S_xx

while the variance of b in the long regression (including x and q) is

    V(b|x, q) = σ²/Σᵢêᵢ² = σ²/[S_xx(1 − R²_xq)]

where ê is the OLS residual in the regression of x on (1, q), and R²_xq is the R² from that
regression, R²_xq = 1 − Σᵢêᵢ²/S_xx. Because 0 ≤ R²_xq ≤ 1, the variance in the long regression
is larger than the variance in the short regression. The higher the correlation between x
and q, the larger the variance of b.

The upshot of this discussion is that omitting relevant variables produces biased
estimators, but with a variance that is no larger than the one obtained from the long
regression.⁸
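The variance expression V(b|x, q) = σ²/[S_xx(1 − R²_xq)] can be checked numerically. This is a sketch with illustrative parameter values, not part of the original notes; it evaluates both analytic formulas on one simulated sample:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
sigma2 = 1.0                              # V(v|x, q), taken as known here

q = rng.normal(size=n)
x = 0.6 * q + rng.normal(size=n)          # x correlated with the omitted q

S_xx = np.sum((x - x.mean()) ** 2)
var_short = sigma2 / S_xx                 # V(b*|x, q) = sigma^2 / S_xx

# residual of x on (1, q) and the R^2 of that auxiliary regression
e = x - np.polyval(np.polyfit(q, x, 1), q)
R2_xq = 1.0 - np.sum(e ** 2) / S_xx
var_long = sigma2 / (S_xx * (1.0 - R2_xq))

print(var_short, var_long, R2_xq)         # var_long exceeds var_short
```

The gap between the two variances grows with R²_xq, matching the statement that a higher correlation between x and q inflates V(b).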
3.1.3 Optional: Inclusion of Irrelevant Regressors

Suppose that the true model is as in (11) but with γ = 0. That is, we assume in fact
that E(y|x, q) = xβ; the variable q is irrelevant or ignorable in this model. Nevertheless
we regress y on x and on q. There is no problem with including the irrelevant variable
q. Our estimate of γ, the coefficient of q, should be close to zero. Indeed, we know that
E(c) = 0 because we know that in this particular case the value of γ happens to be zero.
s² is also an unbiased estimator of σ².

Hence, there is nothing wrong with the inclusion of irrelevant variables in terms of
bias. We are simply not using some correct information about the value of γ. In contrast,
when omitting relevant variables, we impose incorrect information on our estimation,
i.e., we assume that γ = 0 when this is in fact not true.

Nevertheless, there should be something that stops us from adding variables to
the regression model; otherwise, we would just keep adding regressors. As seen in the
previous subsection, the cost of adding irrelevant regressors is in terms of the
precision of the estimator: the variance of the estimator when q is included is larger
than when q is omitted.
8 However, the estimated variance of the OLS estimator in the short regression is biased upwards
because the estimator of σ² from the short regression, based on Σᵢûᵢ², is biased upwards: it estimates
the variance of the composite error γr + v, not just that of v.
3.2 Measurement Errors

A typical problem in applied econometrics is that the data are usually measured with
errors. Even though these errors may average to zero, the OLS estimator will be inconsistent
when errors of measurement are present. To see this point let us analyze a simple
model with a single regressor,

    y = α + βx* + v,  E(v|x*) = 0    (23)

but the true regressor x* is not observed, so this equation cannot be estimated.
Instead we observe x, where

    x = x* + ε,  E(ε|x*) = 0    (24)

ε is the measurement error, and therefore we say that x* is measured with error.
In the classical errors-in-variables (EIV) formulation it is assumed that

    (εᵢ, vᵢ)' ~ i.i.d. (0, diag(σ_εε, σ_vv))

that is, the measurement error and the structural error are mutually uncorrelated. In order
to understand the effect of using x instead of x*, we replace x* in (23) with x − ε to obtain
an estimable equation,

    y = α + β(x − ε) + v
      = α + βx − βε + v
      = α + βx + u,  u = v − βε    (25)

Hence,

    E(xu) = E[(x* + ε)(v − βε)] = −βσ_εε ≠ 0

in general.
Thus, x is endogenous in (25) and OLS would not give a consistent estimator of β.

Note that the errors-in-variables problem can be interpreted as an omitted variable
case: if we added ε to the regression then the problem would disappear. In other words, if
we observed the measurement error then we could recover the true regressor.

Recall that the OLS estimator of β in (25) is b = β + Σᵢ(xᵢ − x̄)uᵢ / Σᵢ(xᵢ − x̄)². The plim of b is
obtained as follows:

    plim b = β + [plim (1/n)Σᵢ(xᵢ − x̄)uᵢ] / [plim (1/n)Σᵢ(xᵢ − x̄)²]
           = β + Cov(x, u)/V(x)
           = β − βσ_εε/[V(x*) + σ_εε]
           = β [ V(x*)/(V(x*) + σ_εε) ]
Note that in this case the OLS estimator underestimates the true parameter
if β > 0, and overestimates β if β < 0. So it is generally said that classical EIV
makes the OLS estimator attenuated towards zero. This is a powerful result
because we not only know that the OLS estimator is inconsistent but also the direction
of the inconsistency. In this case, the bias or inconsistency depends on the ratio of the
error variance to the true variance.
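The attenuation factor V(x*)/(V(x*) + σ_εε) is straightforward to reproduce by simulation. The sketch below uses made-up variances and is not part of the original notes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
beta = 2.0
var_xstar, var_eps = 1.0, 0.5      # variances of x* and of the measurement error

x_star = rng.normal(scale=np.sqrt(var_xstar), size=n)
eps = rng.normal(scale=np.sqrt(var_eps), size=n)
x = x_star + eps                   # observed, mismeasured regressor
y = beta * x_star + rng.normal(size=n)

b_ols = np.polyfit(x, y, 1)[0]     # OLS of y on the noisy x
attenuation = var_xstar / (var_xstar + var_eps)

print(b_ols, beta * attenuation)   # both close to 2 * (2/3)
```

Halving the noise variance moves the slope back toward β, which is exactly the "ratio of error variance to true variance" statement above.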
Milton Friedman's celebrated insight as to why empirical estimates of the marginal
propensity to consume (MPC) were too low can be interpreted in the light of the classical
EIV bias. The true regressor in the consumption function should be permanent income
(x*), but we use current measured income (x), which is a noisy indicator of the true
regressor x*.

In the standard wage equation, omitting ability from the regression would bias
upwards the estimated return to schooling. If, however, schooling is mismeasured, and
there are indications of this, then this would bias its coefficient downwards. In fact, some
researchers argue that the two biases more or less cancel each other out, so that OLS gives
the right estimate of the return to schooling after all!
Sometimes there is confusion between the EIV formulation and the specification of
x as a proxy variable for x* (which is unobserved). Recall that a proxy variable x satisfies
x* = θ₀ + θ₁x + r with Cov(x, r) = 0. The proxy x and the error r are uncorrelated,
but in the EIV model x and the error ε are correlated. What differentiates the two
cases is the assumed correlation between the observed variable and an error term that
is added to the structural error term of the regression (and not whether x is on the left
hand side or the right hand side of an equation).

In the event that there are more regressors in the regression (with or without
measurement errors), the general result is that all the estimators are potentially inconsistent,
even if only one variable is badly measured. The OLS estimator of the mismeasured
regressor is still attenuated towards zero, but it is harder to sign the direction of the bias
in the other (correctly measured) variables, as this depends on the correlation between
the regressors. What is important to remember is that measurement error in one variable
can potentially contaminate the estimators of the coefficients of all the other variables.

Exercise 1 Consider the case of the dependent variable being measured with error, y =
y* + η, where y* is the true value but we observe it with error η. Are there conditions on
η under which OLS is consistent? What are they?
3.3 Simultaneity

Simultaneity arises when at least one of the explanatory variables is partially determined
by y. If, say, x_k is determined partly as a function of y, then we will show that this
usually induces a correlation between x_k and u, which makes x_k endogenous in the
regression for y.

For example, if y is a city's crime rate and x_k is the size of its police force, the size of
the police force is partly determined by the crime rate. If the amount of labor determines
production but production also determines the demand for labor, then both production
and labor are simultaneously determined.

Suppose the equation of interest is

    y = xβ + u    (26)

but the last regressor is partly determined by y (and other regressors z, which can also
include the other k − 1 x's),

    x_k = αy + zδ + e    (27)

We ask ourselves: is x_k uncorrelated with u? Think about increasing u. This increase
in u increases y directly through the structural equation (26). But an increase in y
also affects x_k through the structural equation (27) when α ≠ 0. Thus, u and x_k will be
correlated when y helps to determine x_k. This is the simultaneity problem.

Examples:

1. A production function, where a technology shock induces more production from
given inputs but also higher levels of optimal inputs

2. A demand function where price is a regressor (or a supply function where quantity
is a regressor), and both price and quantity depend on both supply and demand
We now use this second example to make this descriptive argument formally:

Example 7 Supply and Demand

    D: q = zβ + αp + u_D
    S: p = wγ + δq + u_S

Assume that E(u_D) = E(u_S) = 0, E(z u_D) = E(z u_S) = 0 and E(w u_D) =
E(w u_S) = 0. That is, z (income and other demographic variables) and w (wages and
other cost-related variables) are exogenous in both the demand and supply equations. We
will show that the price regressor (p) in the demand equation is endogenous.

    E(p u_D) = E[(wγ + δq + u_S) u_D]
             = δE(q u_D) + E(u_S u_D)
             = δE[(zβ + αp + u_D) u_D] + E(u_S u_D)
             = δα E(p u_D) + δV(u_D) + Cov(u_S, u_D)

    ⟹ E(p u_D) = [δV(u_D) + Cov(u_S, u_D)] / (1 − δα) ≠ 0

even if the demand and supply errors (shocks) are uncorrelated. It is just as easy to show
that the quantity regressor in the supply equation is also endogenous.
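The formula E(p u_D) = δV(u_D)/(1 − δα) (with uncorrelated shocks) can be verified by simulating the two equations and solving them jointly for the equilibrium price and quantity. The parameter values below are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
alpha, beta = -1.0, 1.0    # demand: q = z*beta + alpha*p + u_D
gamma, delta = 1.0, 0.5    # supply: p = w*gamma + delta*q + u_S

z = rng.normal(size=n)     # demand shifter (exogenous)
w = rng.normal(size=n)     # supply shifter (exogenous)
u_D = rng.normal(size=n)
u_S = rng.normal(size=n)   # uncorrelated with u_D here

# solve the two structural equations for the equilibrium (p, q):
# p(1 - delta*alpha) = w*gamma + delta*z*beta + delta*u_D + u_S
p = (w * gamma + delta * z * beta + delta * u_D + u_S) / (1 - delta * alpha)
q = z * beta + alpha * p + u_D

cov_p_uD = np.cov(p, u_D)[0, 1]
theory = delta * np.var(u_D) / (1 - delta * alpha)

print(cov_p_uD, theory)    # both close to 0.5/1.5
```

Since the covariance is nonzero, OLS on the demand equation would treat p as exogenous when it is not, which is the simultaneity bias discussed above.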