
ECO 7377

Microeconometrics
Daniel L. Millimet
Southern Methodist University
Fall 2011
DL Millimet (SMU) ECO 7377 Fall 2011 1 / 407
Introduction

Applied research in economics can be loosely classified into two types:
1. Descriptive analysis
2. Causal analysis

While the first is important and useful, the second is of primary interest.

Causal analysis is needed to predict the impact of changing circumstances or policies, or for the evaluation of existing policies or interventions.

Prior to conducting, or when reviewing, causal analyses, questions that need to be answered:
1. What is the causal relationship of interest? [Is it economically interesting?]
2. What is the identification strategy?
3. What is the method of statistical inference?
Several statistical issues are confronted when answering these questions in economic research:

Specification of the causal relationship of interest entails more than just defining x and y ... lots of parameters could be estimated
- Heterogeneous vs. homogeneous effects
- Know what you are estimating
- To whom does it apply?
- What question does it answer?

Statistical inference is often difficult and overlooked
- Spherical vs. non-spherical errors
- Derivation/computation of estimated asymptotic variances of estimators
Identification of the causal relationship of interest frequently encounters
- Selection issues
  - Self-selection (endogeneity)
  - Sample selection (missing data, attrition)
- Measurement issues
  - Classical vs. non-classical error
  - Dependent vs. independent variable
  - Continuous vs. discrete variables
- Modeling issues
  - Functional form (P, SNP, NP)
  - Role of space (spillovers, spatial correlation)
  - Consistency with theory
Dissertation considerations (applied work):

What's the question? Is it economically interesting?

What's the identification strategy (if question is causal)?
- Selection on observables vs. unobservables
- Parameter of interest

What's the data requirement? Is it feasible?

Has it been done? Is there value added?
- Tension between hot topics and ability to contribute
Dissertation Writing Advice

Be organized
- Outline paper before writing
- Most papers have a common structure
  - Abstract: Very important. Be concise. No abbreviations, notation. Include the motivation, punchline.
  - Intro: Outline the question. Explain why we care, and what is new in the paper. Give a slightly longer summary than the abstract of what is done in the paper, and emphasize the major findings.
  - Lit review (may be incorporated in intro if short)
  - Theoretical model: Be only as complicated as necessary. Understand ramifications of assumptions. If innovation is in the empirics, theory is only needed if it adds something not well understood.
  - Empirical model: Be clear. Understand where identification comes from. Consider relevant specification tests. Acknowledge deficiencies, circumstances under which estimates are inconsistent.
  - Data: Explain the sample selection criteria and variables used. If building on an existing literature, note any differences between the sample selection criteria and those used in existing papers.
  - Results: Be sure to spend enough time discussing the actual results. If results differ from existing literature, try to pin down the reason(s) why.
  - Conclusion: Emphasize importance of new findings, as well as shortcomings of the current paper. Discuss potential future work still to be done. End on a positive note.
- Put discussions in relevant sections
  - Avoid discussing the same point in multiple locations
  - Discuss data in data section; discuss results in results section; most econometric issues belong in the empirical model section
Be considerate to your readers
- Invest the time to proofread the paper many times; if you are unwilling to go through your paper carefully, why should others invest their time?
  - Pascal: "The letter I have written today is longer than usual because I lacked the time to make it shorter."
  - Quintilian: "One should aim not at being possible to understand, but at being impossible to misunderstand."
- Spell check, grammar check, check formatting issues, check spacing, check indenting, etc.
- Define notation, abbreviations, etc.
- Avoid redundant notation, excessive notation, awkward notation, etc.
- Avoid overly critical remarks about other papers; other authors are not idiots, and may be your referees
- Tables should be easy to read, and self-explanatory (need to refer back to the text should be kept to a minimum); include notes under the tables to explain things; avoid using abbreviations for variable names unless necessary
- References should be double-checked; be sure they are accurate and all are included in the bibliography
Be professional (this is not a term paper)
- Avoid unsubstantiated claims, sweeping or grand statements, and generalizations
- Be upfront; do not hide assumptions/restrictions hoping they will be overlooked, and justify their use
- Do not be unnecessarily complex in order to feel smart or show off (see Siegfried 1970)
  - Da Vinci: "Simplicity is the ultimate sophistication."
  - Einstein: "Any fool can make things bigger, more complex, and more violent. It takes a touch of genius - and a lot of courage - to move in the opposite direction."
  - Fowler: "Any one who wishes to become a good writer should endeavour, before he allows himself to be tempted by the more showy qualities, to be direct, simple, brief, vigorous, and lucid."
  - Mingus: "Making the simple complicated is commonplace; making the complicated simple, awesomely simple, that's creativity."
  - Jefferson: "The most valuable of all talents is that of never using two words when one will do."
- Avoid contractions
- Be consistent with the use of "I" or "we" if the paper uses first person; consistent with present vs. past tense
Plagiarism
- Be careful, be ethical!
- Give credit where credit is due; cite others' ideas (in parentheses, not footnotes)
  - Milton: "Copy from one, it's plagiarism; copy from two, it's research."
  - Donatus: "Perish those who said our good things before we did."
  - Kuralt: "I could tell you which writers' rhythms I am imitating. It's not exactly plagiarism, it's falling in love with good language and trying to imitate it."
- Any statement in a paper should fit one of the following categories: (i) factual (agreeable to any reader), or (ii) debatable (but then references in support, or it should be supported by the work done in the paper itself, or it should be written in the appropriate language: "If one believes X, then Y.")
  - But, any statement should be in your own words, or should be in quotations
What to include?
- Dissertation chapters can/should be longer than papers submitted for publication
- Chapters may include greater detail on:
  - Literature review
  - Data construction
  - Empirical methodology
Bootstrap
Introduction

General structure of estimation:

population $\Rightarrow \theta$
$\downarrow$
random sample $\Rightarrow \hat{\theta}$

Problem: $\hat{\theta}$ is an estimate; need to assess its dbn for proper inference

Solutions
- Asymptotic theory
- Simulation methods $\Rightarrow$ bootstrap

Stata: -bootstrap-, -bsample-
Idea
- Re-sample (with replacement) from the random sample multiple times and assess the dbn of the estimates

population $\Rightarrow \theta$
$\downarrow$
random sample $\Rightarrow \hat{\theta}$
$\downarrow$
bootstrap sample $\Rightarrow \hat{\theta}^{*}$

- Results in a vector of estimates, $\hat{\theta}^{*}_{b}$, $b = 1, ..., B$, where $B$ is the # of bootstrap repetitions

Many different bootstrap methods
- Parametric vs. nonparametric
- Resampling algorithms
  - iid
  - Block/cluster
  - Sub-sampling (M/N)
- Imposing the null or not imposing
Bootstrap
Confidence Intervals

Consider a regression model

$y_i = x_i \beta + \varepsilon_i$

Problem: given sample estimates, $\hat{\beta}$, need to obtain std errors or confidence intervals
There are two common sampling methods
1. Resampling the data
2. Resampling the errors

Data
- Resample (with replacement) observations $(y_i, x_i) \Rightarrow \{y^{*}_i, x^{*}_i\}_{i=1}^{N}$
- Estimate the original model (OLS) on the re-sampled data set $\Rightarrow \hat{\beta}^{*}$
- Repeat $B$ times $\Rightarrow \hat{\beta}^{*}_{b}$, $b = 1, ..., B$
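The data (pairs) bootstrap above can be sketched as follows. This is an illustrative Python translation, not from the notes (which use Stata's -bootstrap-); the data-generating process, seed, and B are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): y = 1 + 2x + noise
N = 200
x = rng.normal(size=N)
y = 1 + 2 * x + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_hat = ols(X, y)

# Pairs bootstrap: resample rows (y_i, x_i) with replacement, re-estimate
B = 500
beta_star = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, N, size=N)       # N row indices, with replacement
    beta_star[b] = ols(X[idx], y[idx])

se_boot = beta_star.std(axis=0, ddof=1)    # bootstrap std errors
```

Because whole (y_i, x_i) pairs are resampled, no model for the errors is imposed.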
Residuals
- Given $\hat{\beta}$ from OLS on original sample, obtain residuals $\Rightarrow \hat{\varepsilon}_i$, $i = 1, ..., N$
- Resample (with replacement) a vector of $N$ residuals $\Rightarrow \varepsilon^{*}_i$, $i = 1, ..., N$
  - This represents a random draw from the (nonparametric) empirical dbn of the residuals
- Alternative (parametric):
  - Estimate $\hat{\sigma}^2 = \frac{1}{N-K} \sum_i \hat{\varepsilon}^2_i$
  - Draw $N$ random numbers, $\varepsilon^{*}_i$, $i = 1, ..., N$, from $N(0, \hat{\sigma}^2)$
- Generate $y^{*}_i = x_i \hat{\beta} + \varepsilon^{*}_i$ (which imposes $\beta = \hat{\beta}$)
- Regress $y^{*}$ on $x$ by OLS $\Rightarrow \hat{\beta}^{*}$
- Repeat $B$ times $\Rightarrow \hat{\beta}^{*}_{b}$, $b = 1, ..., B$

Resampling data is typically preferred since it is less model dependent
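The residual bootstrap can be sketched in the same way (an illustrative Python translation, not from the notes; data and B are arbitrary). The parametric alternative is shown in comments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data
N, K = 200, 2
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1 + 2 * x + rng.normal(size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat                   # residuals from the original fit

B = 500
beta_star = np.empty((B, K))
for b in range(B):
    # Nonparametric: draw from the empirical dbn of the residuals
    eps_star = rng.choice(resid, size=N, replace=True)
    # Parametric alternative:
    #   sigma2 = (resid ** 2).sum() / (N - K)
    #   eps_star = rng.normal(0, np.sqrt(sigma2), size=N)
    y_star = X @ beta_hat + eps_star       # imposes beta = beta_hat
    beta_star[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
```

Note that the regressors X are held fixed across bootstrap samples; only the errors are redrawn.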
What to do with $\hat{\theta}^{*}_{b}$, $b = 1, ..., B$? Several options...

Obtain std error for original sample estimate, $\hat{\theta}$, given by

$se(\hat{\theta}) = \sqrt{\frac{1}{B-1} \sum_b \left( \hat{\theta}^{*}_{b} - \bar{\theta}^{*} \right)^2}$

Obtain symmetric CI using normal approximation

$\hat{\theta} \pm t_{1-\frac{\alpha}{2}, B-1} \, se(\hat{\theta})$

Obtain asymmetric CI using percentile method

$\left[ \hat{\theta}^{*}_{\frac{\alpha}{2}}, \; \hat{\theta}^{*}_{1-\frac{\alpha}{2}} \right]$

where the subscript refers to the quantile of the empirical dbn of $\hat{\theta}^{*}$
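The bootstrap standard error, normal-approximation CI, and percentile CI above can be sketched for a simple statistic, the sample mean (illustrative Python, not from the notes; the 1.96 critical value approximates $t_{0.975, B-1}$ for large B):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5, scale=2, size=500)   # illustrative sample
theta_hat = x.mean()

B = 2000
theta_star = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])

# Bootstrap standard error: sd of the B bootstrap estimates
se = theta_star.std(ddof=1)

# Symmetric CI via normal approximation (1.96 ~ t_{1-alpha/2,B-1} for large B)
ci_normal = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)

# Asymmetric CI via percentile method: quantiles of the bootstrap dbn
ci_pct = (np.quantile(theta_star, 0.025), np.quantile(theta_star, 0.975))
```

For a symmetric statistic like this the two intervals nearly coincide; they diverge when the bootstrap dbn is skewed.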
Obtain asymmetric bias corrected and accelerated CIs (BC$_a$)
- Calculate

$z_0 = \Phi^{-1}\left( \frac{1}{B} \sum_b I\left( \hat{\theta}^{*}_{b} \leq \hat{\theta} \right) \right)$ (median bias)

$a = \frac{\sum_i \left( \bar{\theta}_J - \hat{\theta}_{J(i)} \right)^3}{6 \left[ \sum_i \left( \bar{\theta}_J - \hat{\theta}_{J(i)} \right)^2 \right]^{3/2}}$ (acceleration parameter)

where $\hat{\theta}_{J(i)}$ is the jackknife estimate (omitting obs $i$ from original sample) and $\bar{\theta}_J$ is the mean of the jackknife estimates
- Calculate lower and upper quantiles

$p_1 = \Phi\left( z_0 + \frac{z_0 - z_{1-\frac{\alpha}{2}}}{1 - a(z_0 - z_{1-\frac{\alpha}{2}})} \right); \quad p_2 = \Phi\left( z_0 + \frac{z_0 + z_{1-\frac{\alpha}{2}}}{1 - a(z_0 + z_{1-\frac{\alpha}{2}})} \right)$

where $z_{1-\frac{\alpha}{2}}$ is the $(1-\alpha/2)$th quantile of the std normal distribution
- CI given by $\left[ \hat{\theta}^{*}_{p_1}, \; \hat{\theta}^{*}_{p_2} \right]$
Notes:
- BC CI obtained by setting $a = 0$
- BC$_a$ requires $B > 1000$
- $z_0 = 0$ when $\hat{\theta}$ = median of $\hat{\theta}^{*}$
- $a$ reflects the rate of change of the standard error of $\hat{\theta}$ with respect to the true value, $\theta$
  - The standard normal approximation assumes that the standard error is invariant with respect to the true value
  - The acceleration parameter corrects for deviations in practice
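The BC$_a$ recipe can be sketched end-to-end for the mean of a skewed sample (illustrative Python, not from the notes; the bisection-based normal quantile is a crude stand-in to keep the sketch dependency-free):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
x = rng.exponential(size=200)              # skewed sample (illustrative)
theta_hat = x.mean()

B = 2000
theta_star = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def Phi_inv(p):
    """Standard normal quantile by bisection (adequate for a sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2

# Median-bias correction z0
z0 = Phi_inv((theta_star <= theta_hat).mean())

# Acceleration a from the jackknife (leave-one-out) estimates
jack = np.array([np.delete(x, i).mean() for i in range(x.size)])
d = jack.mean() - jack
a = (d ** 3).sum() / (6 * ((d ** 2).sum()) ** 1.5)

alpha = 0.05
z = Phi_inv(1 - alpha / 2)
p1 = Phi(z0 + (z0 - z) / (1 - a * (z0 - z)))
p2 = Phi(z0 + (z0 + z) / (1 - a * (z0 + z)))
ci_bca = (np.quantile(theta_star, p1), np.quantile(theta_star, p2))
```

Setting a = 0 in the last step yields the plain BC interval.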
Obtain asymmetric CI using bootstrap-t
- When estimating the model on the re-sampled data, collect the t-statistics obtained from testing $H_o: \theta^{*} = \hat{\theta}$

$t^{*} = \frac{\hat{\theta}^{*} - \hat{\theta}}{se(\hat{\theta}^{*})}$

- Yields $t^{*}_{b}$, $b = 1, ..., B$
- Define $t^{*}_{\alpha}$ s.t. $\frac{1}{B} \sum_b I(t^{*}_{b} \leq t^{*}_{\alpha}) = \alpha$; $t^{*}_{\alpha}$ is the $\alpha$th quantile of the empirical dbn of $t^{*}$
- CI given by

$\left[ \hat{\theta} - t^{*}_{1-\frac{\alpha}{2}} \, se(\hat{\theta}), \; \hat{\theta} - t^{*}_{\frac{\alpha}{2}} \, se(\hat{\theta}) \right]$

(since $t^{*}_{\alpha/2}$ is typically negative, the upper endpoint lies above $\hat{\theta}$)
- Notes
  - Method assumes $se(\hat{\theta}^{*})$ is known based on asymptotic theory
  - If unknown, then use double bootstrap
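The bootstrap-t (percentile-t) interval can be sketched for the mean, where the asymptotic standard error $s/\sqrt{N}$ is available within each bootstrap sample (illustrative Python, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(size=100)              # illustrative skewed sample
N = x.size
theta_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(N)        # asymptotic se of the mean

B = 2000
t_star = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=N, replace=True)
    se_b = xb.std(ddof=1) / np.sqrt(N)     # se within the bootstrap sample
    t_star[b] = (xb.mean() - theta_hat) / se_b   # t-stat centered at theta_hat

# Percentile-t CI: pivot on the quantiles of the bootstrap t distribution
alpha = 0.05
lo = theta_hat - np.quantile(t_star, 1 - alpha / 2) * se_hat
hi = theta_hat - np.quantile(t_star, alpha / 2) * se_hat
```

When no analytical se is available inside each bootstrap sample, se_b itself must be bootstrapped, which is the double bootstrap on the next slide.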
Obtain asymmetric CI using bootstrap-t with double bootstrap
- Estimate original model by OLS $\Rightarrow \hat{\theta}$
- Obtain bootstrap samples, estimate by OLS, form $t^{*}$ given by

$t^{*} = \frac{\hat{\theta}^{*} - \hat{\theta}}{se(\hat{\theta}^{*})}$

- Since the denominator is not known, resample from the bootstrap sample $B_2$ times $\Rightarrow \hat{\theta}^{**}_{b}$, $b = 1, ..., B_2$
- Obtain the estimated std error of $\hat{\theta}^{*}$ as the std deviation of the $B_2$ estimates
- Repeat process $B_1$ times
- Obtain CI as above, but with $se(\hat{\theta})$ replaced by the std deviation of the $B_2$ estimates of $\hat{\theta}^{*}$
Example: $x \sim N(0, 1)$, $N = 1000$, $\bar{x} \stackrel{a}{\sim} N(0, 0.001)$

[Figure: bootstrap vs. asymptotic dbn of the sample mean, for Reps = 20, 100, 500, and 1000]
Bootstrap
Imposing the Null

Goal: estimate the model, derive some estimate or test statistic, and you wish to test whether the true value of the parameter is equal to some value or derive a p-value associated with the test statistic

Strategy
- When re-sampling the data, generate new data sets where the null is true (imposed)
- Estimate the original model on the re-sampled data
- Compare the value of the test statistics obtained from the re-sampled data sets with the value of the test statistic from the original sample
- If the test statistic from the original sample is very different (statistically), then it is unlikely the null is true in the original sample
Regression example

Model

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)$

Hypothesis of interest:

$H_o: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
Algorithm
- Estimate model on original data $\Rightarrow \hat{\beta}_0, \hat{\beta}_1 \Rightarrow t_{\hat{\beta}_1}$ (t-statistic for $\beta_1$)
- Obtain the residuals $\Rightarrow \hat{\varepsilon}_i$, $i = 1, ..., N$
- Resample (with replacement) a vector of $N$ residuals $\Rightarrow \varepsilon^{*}_i$, $i = 1, ..., N$
  - This represents a random draw from the (nonparametric) empirical dbn of the residuals
- Alternative (parametric):
  - Estimate $\hat{\sigma}^2 = \frac{1}{N-K} \sum_i \hat{\varepsilon}^2_i$
  - Draw $N$ random numbers, $\varepsilon^{*}_i$, $i = 1, ..., N$, from $N(0, \hat{\sigma}^2)$
- Generate $y^{*}_i = \hat{\beta}_0 + 0 \cdot x_i + \varepsilon^{*}_i = \hat{\beta}_0 + \varepsilon^{*}_i$ (which imposes $\beta_1 = 0$)
- Regress $y^{*}$ on $x$ by OLS $\Rightarrow t^{*}_{\hat{\beta}_1}$
- Repeat $B$ times $\Rightarrow t^{*}_{\hat{\beta}_1, b}$, $b = 1, ..., B$
- Obtain p-value as

p-value $= \frac{1}{B} \sum_b I\left( |t^{*}_{\hat{\beta}_1}| > |t_{\hat{\beta}_1}| \right)$

- Reject null if $p < \alpha < 0.5$, where $\alpha$ is the significance level
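The algorithm above can be sketched in Python (illustrative, not from the notes; here the data are generated with $\beta_1 = 0$ so the null is actually true):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200
x = rng.normal(size=N)
y = 1.0 + 0.0 * x + rng.normal(size=N)     # true beta1 = 0 (illustrative)
X = np.column_stack([np.ones(N), x])

def ols_t(X, y):
    """OLS coefficients and the slope t-statistic (homoskedastic se)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])
    V = s2 * np.linalg.inv(X.T @ X)
    return b, b[1] / np.sqrt(V[1, 1])

b_hat, t_obs = ols_t(X, y)
resid = y - X @ b_hat

B = 999
t_star = np.empty(B)
for b in range(B):
    eps = rng.choice(resid, size=N, replace=True)
    y_star = b_hat[0] + eps                # imposes beta1 = 0 in the DGP
    _, t_star[b] = ols_t(X, y_star)

p_value = np.mean(np.abs(t_star) > np.abs(t_obs))
```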
Distributional example

Want to test equality of CDFs of two random variables (e.g., wages of job training participants and non-participants)

Data sample
- $x_i$, $i = 1, ..., N$, is random sample of one variable (participants), with CDF $F(x)$
- $y_i$, $i = 1, ..., M$, is random sample of another variable (non-participants), with CDF $G(y)$

Hypothesis of interest:

$H_o: F = G$
$H_1: F \neq G$
Algorithm
- Estimate empirical CDF in each sample: $\hat{F}(x)$ and $\hat{G}(y)$
- Compute test statistic

$d = \sqrt{\frac{NM}{N+M}} \max_{z \in Supp(X,Y)} \left| \hat{F}(z) - \hat{G}(z) \right|$

- Pool data, re-sample (with replacement), sample size $= N + M \Rightarrow q_1, ..., q_{N+M}$
- Split the sample: denote first $N$ obs from $F$; final $M$ obs from $G$ (imposes $F = G$)
- Compute $d^{*}$
- Repeat $B$ times $\Rightarrow d^{*}_{b}$, $b = 1, ..., B$
- Obtain p-value as

p-value $= \frac{1}{B} \sum_b I(d^{*} > d)$

- Reject null if $p < \alpha < 0.5$, where $\alpha$ is the significance level
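This pooled-resampling test of CDF equality can be sketched as follows (illustrative Python, not from the notes; both samples are drawn from the same distribution here, so the null holds by construction):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=150)                   # "participants" (illustrative)
y = rng.normal(size=120)                   # "non-participants"
N, M = x.size, y.size

def ks_stat(a, b):
    """sqrt(NM/(N+M)) * max |F_hat(z) - G_hat(z)| over the pooled support."""
    z = np.sort(np.concatenate([a, b]))
    F = np.searchsorted(np.sort(a), z, side="right") / a.size
    G = np.searchsorted(np.sort(b), z, side="right") / b.size
    return np.sqrt(a.size * b.size / (a.size + b.size)) * np.abs(F - G).max()

d_obs = ks_stat(x, y)

pooled = np.concatenate([x, y])
B = 999
d_star = np.empty(B)
for b in range(B):
    q = rng.choice(pooled, size=N + M, replace=True)  # resampling the pool imposes F = G
    d_star[b] = ks_stat(q[:N], q[N:])      # first N obs play F, last M play G

p_value = np.mean(d_star > d_obs)
```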
Bootstrap
Other Issues

Non-iid data

All previous discussion assumes iid data since re-sampling occurs without regard to any dependence across observations

If there exists some sort of dependence in the data, then resample blocks or clusters of data

Example #1: Time series data with serial correlation
- Model

$y_t = x_t \beta + \varepsilon_t, \quad t = 1, ..., T$

- Resample blocks of length $l$ by drawing obs randomly from $t = 1, ..., T - l$
- If obs $t'$ is chosen for the bootstrap sample, also include obs $t = t' + 1, ..., t' + (l - 1)$
- Draw $T/l$ obs so final bootstrap sample size remains $T$
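The block scheme above can be sketched for the mean of a serially correlated series (illustrative Python, not from the notes; the AR(1) errors, T, and block length l are arbitrary, and block starts are drawn 0-indexed):

```python
import numpy as np

rng = np.random.default_rng(7)
T, l = 200, 10                             # sample length, block length (T/l integer)
e = np.zeros(T)
for t in range(1, T):                      # AR(1) errors => serial dependence
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = 1.0 + e

B = 500
mean_star = np.empty(B)
for b in range(B):
    starts = rng.integers(0, T - l + 1, size=T // l)   # draw T/l block starts
    idx = np.concatenate([np.arange(s, s + l) for s in starts])
    mean_star[b] = y[idx].mean()           # bootstrap sample again has length T

se_block = mean_star.std(ddof=1)
```

Drawing whole blocks preserves the within-block dependence that an iid resample would destroy.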
Example #2: Panel data
- For example, individuals within hhs, or employees within firms, or individuals over time
- Model

$y_{if} = x_{if} \beta + \varepsilon_{if}, \quad i = 1, ..., N$

where $i$ represents individuals and $f$ represents firms
- Several individuals are sampled from each of $F < N$ firms
- Generate bootstrap samples by resampling (with replacement) the $F$ firms
- If firm $f$ is chosen for the bootstrap sample, include all employees $i$ from that firm
- If identical number of employees from each firm are in the sample, then bootstrap samples are still of size $N$

Blocks/clusters are chosen such that data are iid across blocks
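The cluster scheme can be sketched as follows (illustrative Python, not from the notes; the firm-effect DGP and cluster sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
F, n_f = 30, 5                             # firms, employees per firm
firm_effect = rng.normal(size=F)           # within-firm dependence
firm_id = np.repeat(np.arange(F), n_f)
y = 2.0 + firm_effect[firm_id] + rng.normal(size=F * n_f)

B = 500
mean_star = np.empty(B)
for b in range(B):
    firms = rng.integers(0, F, size=F)     # resample firms with replacement
    # include every employee of each drawn firm
    idx = np.concatenate([np.where(firm_id == f)[0] for f in firms])
    mean_star[b] = y[idx].mean()

se_cluster = mean_star.std(ddof=1)
```

With equal-sized firms each bootstrap sample again has F * n_f observations, as the slide notes.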
Sub-sampling (Politis and Romano 1992, 1994)
M of N re-sampling with or without replacement
Evaluate a statistic of interest at subsamples of the data
Use these subsampled values to build up an estimated sampling
distribution
The consistency properties of this sampling distribution hold for
dependent data under very weak assumptions and even in situations
where the bootstrap collapses
Jackknife estimation

Leave-one-out estimation

Algorithm
- Estimate model using original sample $\Rightarrow \hat{\theta}$ (if OLS model, say)
- Omit obs $i$ and re-estimate model on sample of $N - 1$ obs $\Rightarrow \hat{\theta}_{(i)}$
- Repeat omitting each $i$ once (implies $N$ estimations)
- Standard error obtained as

$se(\hat{\theta}) = \sqrt{\frac{N-1}{N} \sum_i \left( \hat{\theta}_{(i)} - \bar{\theta}_{(\cdot)} \right)^2}$

where $\bar{\theta}_{(\cdot)}$ is the mean of the $N$ leave-one-out estimates

In some situations, delete-d jackknife achieves superior performance
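The leave-one-out algorithm can be sketched for the sample mean (illustrative Python, not from the notes). For the mean, the jackknife formula reproduces the usual $s/\sqrt{N}$ exactly, which makes a convenient check:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(loc=3, scale=1.5, size=100)
N = x.size
theta_hat = x.mean()

# Leave-one-out estimates theta_(i), one per omitted observation
theta_jack = np.array([np.delete(x, i).mean() for i in range(N)])

# Jackknife standard error: (N-1)/N times the sum of squared deviations
# of the leave-one-out estimates from their mean
se_jack = np.sqrt((N - 1) / N * ((theta_jack - theta_jack.mean()) ** 2).sum())
```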
Failure of the bootstrap or jackknife ...

Resampling methods are not guaranteed to work; theoretical justification is needed

Most common case of failure occurs when the parameter of interest is a non-smooth function of the data (e.g., median vs. mean)
Example: $x \sim N(0, 1)$, $N = 1000$, $med(x) \stackrel{a}{\sim} N(0, 0.00157)$

[Figure: bootstrap vs. asymptotic dbn of the sample median, for Reps = 20, 100, 500, and 1000]
How to choose B?
Andrews & Buchinsky
Davidson & MacKinnon
Causation
Introduction

General goal of most (applied) econometrics exercises is to distinguish between causation and correlation

Many empirical questions of concern to economists and/or policymakers pertain to the causal effect of a program or policy

Statistical and econometric literature analyzing causation has seen tremendous growth over the past several decades

Central problem concerns evaluation of the causal effect of exposure to a treatment or program by a set of units on some outcome
- In economics, these units are economic agents such as individuals, hhs, firms, geographical areas, etc.
- The effect of an exposure is only well-defined if the comparison is also defined; typically the comparison is defined as "not exposed," but sometimes it is not obvious (particularly with non-binary treatments)
Philosophy of causality...
- Rich literature in analytic philosophy on causality
- Two main approaches to defining causality:
  - Regularity approaches: Hume: "We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second." (from An Enquiry Concerning Human Understanding, section VII)
  - Counterfactual approaches: Hume: "Or, in other words, where, if the first object had not been, the second never had existed." (from An Enquiry Concerning Human Understanding, section VII)
Regularity approach: a minimal constant conjunction between the two objects (Suppes: a probabilistic association between the two objects, which cannot be explained away by other factors)
- Basic idea behind Granger causality
- Difficulty: what are the other factors? Limiting to only observable factors is unsatisfying... if some factors are unobservable, then what?
- Example...
  - $C$ is a potential cause of $E$ if $\Pr(E|C) > \Pr(E|\text{not } C)$
  - May be spurious if there exists some factor $B$ s.t. $\Pr(E|C) > \Pr(E|\text{not } C)$ and $\Pr(E|C, B) = \Pr(E|\text{not } C, B)$ (e.g., $E$ = wages, $C$ = educ, $B$ = ability)
  - May also be a spurious zero correlation if there exists some factor $B$ s.t. $\Pr(E|C) = \Pr(E|\text{not } C)$ and $\Pr(E|C, B) > \Pr(E|\text{not } C, B)$ (e.g., $E$ = wages, $C$ = training, $B$ = shock)
  - $B$ is known as a confounder or confounding variable

Be wary: correlation does not imply causation as things are not always as they seem ... and the truth may be difficult to see ...
Counterfactual approach: Lewis (1973) proposes to imagine a range of possible worlds
- Holland (1986, 2003): a treatment (cause) is a potential manipulation that one can imagine
  - NO CAUSATION WITHOUT MANIPULATION
  - Gender, race are not treatments?!? (see Greiner and Rubin 2011)
- Imbens and Wooldridge (2009):
  - "A CRITICAL FEATURE IS THAT, IN PRINCIPLE, EACH UNIT CAN BE EXPOSED TO MULTIPLE LEVELS OF THE TREATMENT."
- Angrist and Pischke (2009): a treatment should be manipulatable conditional on other factors $\Rightarrow \Pr(C|B), \Pr(\text{not } C|B) \in (0, 1)$
  - NO FUNDAMENTALLY UNIDENTIFIED QUESTIONS
  - Example: school start age = biological age - time in school; if $B$ = bio age, time in school, then school age is not an identifiable treatment

Microeconometrics today emphasizes the counterfactual view
- Greiner & Rubin (2011): "For analysts from a variety of fields, the intensely practical goal of causal inference is to discover what would happen if we changed the world in some way."

Econometric methods are categorized by the type of selection involved

Selection types
- Selection on observables: all potential $B$s are observed
- Selection on unobservables: some potential $B$s are unobserved
Causation
Potential Outcomes Model

Most causal research is couched in the potential outcomes framework

Typically referred to as the Rubin Causal Model (RCM); attributed to Neyman (1923, 1935), Fisher (1935), Roy (1951), Quandt (1972, 1988), Rubin (1974)

Notation

$y_{1i}$ = outcome of observation $i$ with treatment
$y_{0i}$ = outcome of observation $i$ without treatment
$D_i$ = treatment indicator ...

$D_i = \begin{cases} 1 & \text{treated} \\ 0 & \text{untreated} \end{cases}$

$\{y_{1i}, y_{0i}, D_i\}$ is a draw from the population of interest
$\{y_1, y_0, D\}$ is a sample from the population of interest
Notes
- Key insight is to model not just the observed outcome for each unit $i$, but also the unobserved potential outcomes
- Implicit in this representation is the Stable Unit Treatment Value Assumption (SUTVA, Rubin 1978), which assumes that the outcome of obs $i$ with and without the treatment does not vary depending on the treatment assignment of all other agents (rules out general equilibrium or indirect effects)
  - Allows one to write potential outcomes solely as a function of own treatment assignment

$y_{0i} = y_i(D_1, D_2, ..., D_{i-1}, 0, D_{i+1}, ..., D_N) = y_i(0)$
$y_{1i} = y_i(D_1, D_2, ..., D_{i-1}, 1, D_{i+1}, ..., D_N) = y_i(1)$

  - Imbens & Wooldridge (2009) provide some references to papers looking at GE effects; see also Ferracci et al. (2009), Heckman et al. (1999), Lewis (1963)
- Also implicit and sometimes lumped into SUTVA is the assumption that the mechanism for assigning treatments does not affect potential outcomes (rules out Hawthorne effects, whereby agents may act differently if they know they are being observed)
Parameters of interest

$\tau_i = y_{1i} - y_{0i}$ = treatment effect for obs $i$
- This is a random variable as it is obs-specific
- Can summarize the distribution of this variable by focusing on different aspects

$\tau_{ATE} = E[\tau_i] = E[y_1 - y_0]$
$\tau_{ATT} = E[\tau_i | D = 1] = E[y_1 - y_0 | D = 1]$
$\tau_{ATU} = E[\tau_i | D = 0] = E[y_1 - y_0 | D = 0]$

Notes: Different parameters answer different questions, may be useful for different policy conclusions, and may require different assumptions to identify
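These parameters can be made concrete with a small simulation in which both potential outcomes are generated, so the three averages can be computed directly (illustrative Python, not from the notes; the DGP and selection rule are arbitrary choices that build in heterogeneous gains and self-selection):

```python
import numpy as np

rng = np.random.default_rng(10)
N = 100_000                                # large simulated "population"
y0 = rng.normal(0, 1, size=N)              # potential outcome without treatment
y1 = y0 + 1 + rng.normal(0, 1, size=N)     # heterogeneous gains
tau = y1 - y0                              # unit-level treatment effect

# Self-selection: units with larger gains are more likely to take treatment
D = (tau + rng.normal(0, 1, size=N) > 1).astype(int)

ate = tau.mean()                           # E[y1 - y0]
att = tau[D == 1].mean()                   # E[y1 - y0 | D = 1]
atu = tau[D == 0].mean()                   # E[y1 - y0 | D = 0]
```

Because selection favors high-gain units, ATT > ATE > ATU here; with random assignment the three would coincide.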
Three other parameters that often appear
1. Local Average Treatment Effect (Imbens & Angrist 1994, Angrist et al. 1996)
  - Defined as $\tau_{LATE} = E[y_1 - y_0 | i \in S]$, where $S$ refers to some specified subpopulation
2. Marginal Treatment Effect (Heckman & Vytlacil 1999, 2001, 2005, 2007)
  - Defined later
3. Policy Relevant Treatment Effect (Heckman & Vytlacil 2001)
  - Defined as $\tau_{PRTE} = E[y^P - y^{NP}]$, where $P$ ($NP$) refers to the state where the program is fully (not) implemented
  - With the program, all agents have access to the program, but may choose not to participate
  - Implies

$\tau_{PRTE} = E[y^P_1 | D^P = 1] \Pr(D^P = 1) + E[y^P_0 | D^P = 0] \Pr(D^P = 0) - E[y^{NP}]$
$= E[y_1 - y_0 | D^P = 1] \Pr(D^P = 1)$

where $y^P_0$, $y^P_1$, and $y^{NP}$ are the three potential outcomes, $D^P$ is the treatment indicator in the world with the program, and the second line follows if one assumes policy invariance (i.e., potential outcomes are unaffected by the existence of the program)
Relationship among the parameters
- Let

$y_{1i} = E[y_1] + \varepsilon_{i1}$
$y_{0i} = E[y_0] + \varepsilon_{i0}$

- This implies

$\tau_i = y_{1i} - y_{0i} = E[y_1 - y_0] + \varepsilon_{i1} - \varepsilon_{i0} = \tau_{ATE} + \varepsilon_{i1} - \varepsilon_{i0}$

and

$\tau_{ATT} = \tau_{ATE} + E[\varepsilon_{i1} - \varepsilon_{i0} | D = 1]$
$\tau_{ATU} = \tau_{ATE} + E[\varepsilon_{i1} - \varepsilon_{i0} | D = 0]$

where $E[\varepsilon_{i1} - \varepsilon_{i0} | D = j]$ is the average, obs-specific gain from treatment for group $j$
Can re-define any of the above parameters for sub-populations defined on the basis of attributes, $x$

$\tau_{ATE}(x) = E[y_1 - y_0 | x]$
$\tau_{ATT}(x) = E[y_1 | x, D = 1] - E[y_0 | x, D = 1]$
$\tau_{ATU}(x) = E[y_1 | x, D = 0] - E[y_0 | x, D = 0]$

The previous unconditional parameters are obtained by integrating over the dbn of $x$ in the relevant population

$\tau_{ATE} = \int \tau_{ATE}(x) f(x) dx$
$\tau_{ATT} = \int \tau_{ATT}(x) f(x | D = 1) dx$
$\tau_{ATU} = \int \tau_{ATU}(x) f(x | D = 0) dx$
Aside

While the preceding parameters, based on differences in expectations, are the near universal focus in economics, this need not be the case

Can also define treatment effects based on ratios

$\tau_{RATE} = E[y_1] / E[y_0]$
$\tau_{RATT} = E[y_1 | D = 1] / E[y_0 | D = 1]$
$\tau_{RATU} = E[y_1 | D = 0] / E[y_0 | D = 0]$

These are referred to as relative treatment effects (and prior parameters are referred to as absolute or differenced treatment effects)

Note, however, that relative effects lack a bit of intuitive appeal since if we define $\tau_i = y_{1i} / y_{0i}$, then $E[\tau_i] = E[y_{1i} / y_{0i}] \neq E[y_1] / E[y_0]$, and same for RATT and RATU
Evaluation Problem

Only observe one potential outcome at a point in time for any observation

Implies...

Attributes of $i$: $\{y_{1i}, y_{0i}, D_i\}$
Observed for $i$: $\{y_i, D_i\}$

where $y_i = D_i y_{1i} + (1 - D_i) y_{0i}$ = observed outcome for observation $i$

Missing potential outcome is the missing counterfactual
- Holland (1986) refers to this as the "fundamental problem of causal inference"
- Because of this, the central issue in the RCM is the relationship between treatment assignment and potential outcomes
  - Typically referred to as the treatment assignment rule
  - Growing literature on assignment rules (Manski 2000, 2004; Pepper 2002, 2003; Dehejia 2005; Lechner & Smith 2007)
Example #1... ATT
- Consider estimating $\tau_{ATT} = E[y_1 | D = 1] - E[y_0 | D = 1]$
- $E[y_1 | D = 1]$ can be estimated from the data, but one does not observe $E[y_0 | D = 1]$
- If one uses outcomes of the untreated, we can define

$\tilde{\tau}_{ATT} = E[y_1 | D = 1] - E[y_0 | D = 0]$

- Which implies selection bias equal to

$\tilde{\tau}_{ATT} = E[y_1 | D = 1] - E[y_0 | D = 1] + E[y_0 | D = 1] - E[y_0 | D = 0]$
$\Rightarrow \text{bias} = \tilde{\tau}_{ATT} - \tau_{ATT} = E[y_0 | D = 1] - E[y_0 | D = 0]$

- This is generally non-zero, and may be decomposed into 3 components (Heckman et al. 1996, 1998):
1. Self-selection into treatment in a manner related to outcome in the untreated state
2. Observables, $x$, impacting outcome may not overlap at certain values across the treatment and control groups
3. Even with overlap, the distribution of $x$ may vary across the treatment and control groups
Example #2... ATE
- Consider estimating $\tau_{ATE} = E[y_1] - E[y_0]$
- Neither unconditional expectation can be estimated from the data
- If one uses conditional expectations, we can define

$\tilde{\tau}_{ATE} = E[y_1 | D = 1] - E[y_0 | D = 0]$

- Which implies selection bias equal to

$\tilde{\tau}_{ATE} - \tau_{ATE} = E[y_1 | D = 1] - E[y_0 | D = 0] - (E[y_1] - E[y_0])$
$\Rightarrow \text{bias} = (E[y_1 | D = 1] - E[y_1 | D = 0])[1 - \Pr(D = 1)] + (E[y_0 | D = 1] - E[y_0 | D = 0]) \Pr(D = 1)$

which is a weighted average of the selection bias for the ATT and ATU

Question: How does one circumvent the missing counterfactual problem to estimate $\tau_{ATE}$, $\tau_{ATT}$, $\tau_{ATU}$, or any other summary statistic of the distribution of $\tau$?
Early Example of Potential Outcomes: Roy Model (Roy 1951)

As noted previously, at the heart of the RCM is the interplay between assignment of treatments, potential outcomes, and observed outcomes

Problem is one of self-selection; highlighted in a very clever fashion in Roy (1951)

Specific issue in Roy (1951) was occupational choice
- Individuals have potential earnings associated with different occupation choices
- Realized earnings reflect the chosen occupation

Example

Suppose

$\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \Sigma \right)$
Unconditional outcome distributions look like

[Figure: Unconditional Distributions of Potential Outcomes — kernel densities of $y_0$ and $y_1$; simulated data, 1000 obs, rho = 0.7]
Conditional distributions

Depends on
- Who selects into treatment or control group, and
- Correlation of potential outcomes

Positive correlation in above example ($\rho = 0.7$)
Positive selection: Assume those above the mean in the $y_1$ distribution select into treatment

[Figure: Conditional Distributions of Potential Outcomes — kernel densities of $y_0$ and $y_1$; simulated data, 1000 obs, rho = 0.7; positive selection into treatment]
Negative selection: Assume those below the mean in the $y_1$ distribution select into treatment

[Figure: Conditional Distributions of Potential Outcomes — kernel densities of $y_0$ and $y_1$; simulated data, 1000 obs, rho = 0.7; negative selection into treatment]
Random assignment:

[Figure: Conditional Distributions of Potential Outcomes — kernel densities of $y_0$ and $y_1$; simulated data, 1000 obs, rho = 0.7; random assignment into treatment]

Lesson to be learned: observed distributions are not the unconditional distributions
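The lesson can be reproduced numerically: simulate correlated potential outcomes as in the example and impose positive selection (illustrative Python, not from the notes; the means (0, 1), $\rho = 0.7$, and sample size follow the figure notes above):

```python
import numpy as np

rng = np.random.default_rng(11)
N = 100_000
rho = 0.7
cov = [[1, rho], [rho, 1]]                 # unit variances, correlation 0.7
y0, y1 = rng.multivariate_normal([0, 1], cov, size=N).T

# Positive selection: those above the mean of the y1 distribution take treatment
D = (y1 > 1).astype(int)

# Observed (conditional) means differ from the unconditional ones
obs_treated_mean = y1[D == 1].mean()       # exceeds E[y1] = 1
obs_control_mean = y0[D == 0].mean()       # falls below E[y0] = 0
```

The observed treated outcomes overstate E[y1] and the observed control outcomes understate E[y0], exactly the distortion shown in the positive-selection figure.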
Roy Model

Two occupations: hunter, fisher

Potential incomes

$y_d = g_d(x) + \varepsilon_d, \quad d = 0 \text{ (h)}, 1 \text{ (f)}$

Decision rule

$D = I(y_1 - y_0 > 0) = I(g_1(x) - g_0(x) + \varepsilon_1 - \varepsilon_0 > 0)$

Observed income

$y = D y_1 + (1 - D) y_0$

Treatment assignment depends on observables, $x$, and unobservables, $\varepsilon_1 - \varepsilon_0$

Notes:
1. $Cov(D, \varepsilon_1 - \varepsilon_0) \neq 0$ referred to as essential heterogeneity (Heckman et al. 2006)
2. $Cov(D, \varepsilon_1 - \varepsilon_0) \neq 0 \Rightarrow Cov(D, D(\varepsilon_1 - \varepsilon_0)) \neq 0$
Generalized Roy Model

Replace income maximization decision rule with a more general rule

Decision rule

$D = I(h(x) - u > 0)$

When $D$ is a voluntary program (e.g., job training), $u$ may reflect (i) costs of participation and (ii) foregone earnings (opportunity costs)

Implies that treatment assignment depends on observables, $x$, and unobservables, $u$
- Essential heterogeneity implies $Corr(u, \varepsilon_d) \neq 0 \; \forall d$
Moving Forward

Guided by the potential outcomes framework, figure out conditions under which different estimators may provide consistent estimates of the ATE, ATT, ATU, etc.

Key points:
- Given the missing counterfactual problem, any estimator of the causal effects of a treatment must rely on some assumptions
- Different estimators rely on different assumptions and thus should not be expected to yield similar estimates unless the identifying assumptions of each hold in the data
- While extraneous assumptions may be testable (overidentifying restrictions), not all assumptions can be tested
- Different estimators estimate different aspects of the dbn of $\tau$ and thus answer different questions
Causation
Random Experiments

First solution is to randomize treatment assignment

Generally speaking, randomization is the preferred solution; often called the "gold standard"

Reason: randomization ensures that treatment assignment is independent of potential outcomes in expectation

Freedman (2006): "Experiments offer more reliable evidence on causation than observational studies."

Imbens (2009): "More generally, and this is the key point, in a situation where one has control over the assignment mechanism, there is little to gain, and much to lose, by giving that up through allowing individuals to choose their own treatment regime. Randomization ensures exogeneity of key variables, where in a corresponding observational study one would have to worry about their endogeneity."
That said, not everyone is convinced by experiments (without doing some more mental work)

"Much of the criticism about experiments is about the difficulty of generalizing from the evaluation of one particular program to predicting what would happen to this program in a different context. Clearly, without theory to guide us on why a result extends from one context to another, it is difficult to jump directly to a policy conclusion. However, when experiments are motivated by a theory, the results of experiments (not only on the final outcomes, but on the entire chain of intermediate outcomes that led to the endpoint of interest) serve as a test of some of the implications of that theory. The combination of data points then eventually provides sufficient evidence to make policy recommendations."

Duflo (2010),
http://www.aeaweb.org/econwhitepapers/white_papers/Esther_Duflo.pdf
"From an ex post evaluation standpoint, a carefully planned experiment using random assignment of program status represents the ideal scenario, delivering highly credible causal inference. But from an ex ante evaluation standpoint, the causal inferences from a randomized experiment may be a poor forecast of what were to happen if the program were to be scaled up."

DiNardo & Lee (2011)

Ex post evaluation answers the question: What happened? (descriptive)

Ex ante evaluation answers the question: What would happen? (predictive)
Randomization may occur at different stages:
1. Population-level: randomize among agents in the population; typically not feasible since it would entail compelling treatment by some
2. Eligibility-level: randomize among the population of eligibles by randomly denying eligibility to a subset
3. Application-level: randomize among the population of program applicants by randomly accepting/rejecting a subset

Stage at which randomization occurs generally affects what can be learned unless additional assumptions are made
Assumptions (with population-level randomization)

(A.i) $\{y, D\}$ is an iid sample from the population
(A.ii) $y_0, y_1 \perp D$
(A.iii) $\Pr(D = 1) \in (0, 1)$

Notes
- (A.i) implies SUTVA
- (A.ii) implies $E[y_1 \mid D = 1] = E[y_1 \mid D = 0] = E[y_1]$; similarly for $E[y_0]$
- (A.ii) also implies ATE = ATT = ATU since
$$\underbrace{E[y_1 - y_0]}_{ATE} = \underbrace{E[y_1 - y_0 \mid D = 1]}_{ATT} = \underbrace{E[y_1 - y_0 \mid D = 0]}_{ATU}$$
- (A.ii) relies on perfect compliance; imperfect compliance may invalidate the assumption if such non-compliance is related to potential outcomes
  - Difference in experimental means based on initial assignment still yields an estimate of the intent to treat under imperfect compliance; may actually be more policy relevant
- (A.iii) ensures all agents have some probability of receiving and not receiving the treatment
- Population-level randomization is feasible if compensation is offered to ensure compliance and this compensation does not affect $y_0$ and $y_1$
Estimation

$$\hat{\tau}_{ATE} = \widehat{E}[y_i \mid D = 1] - \widehat{E}[y_i \mid D = 0] = \frac{\sum_{i=1}^{N} y_i I[D_i = 1]}{\sum_{i=1}^{N} I[D_i = 1]} - \frac{\sum_{i=1}^{N} y_i I[D_i = 0]}{\sum_{i=1}^{N} I[D_i = 0]}$$
$$\overset{p}{\to} E[y_i \mid D = 1] - E[y_i \mid D = 0] = E[D_i y_{1i} + (1 - D_i) y_{0i} \mid D = 1] - E[D_i y_{1i} + (1 - D_i) y_{0i} \mid D = 0]$$
$$= E[y_{1i} \mid D = 1] - E[y_{0i} \mid D = 0] = E[y_{1i}] - E[y_{0i}] = \tau_{ATE}$$
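As a concrete illustration, the difference-in-means estimator above can be sketched in a few lines of Python. The simulated sample, sample size, and the homogeneous true effect of 2 are all hypothetical choices for the sketch, not course material:

```python
import random

def ate_diff_in_means(y, d):
    """tau_hat_ATE: difference in mean outcomes between treated and control."""
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# Hypothetical randomized experiment with a homogeneous treatment effect of 2
random.seed(0)
N = 20000
y0_pot = [random.gauss(1, 1) for _ in range(N)]   # potential outcome y0
y1_pot = [yi + 2 for yi in y0_pot]                # potential outcome y1
d = [random.randint(0, 1) for _ in range(N)]      # population-level randomization
y = [y1_pot[i] if d[i] == 1 else y0_pot[i] for i in range(N)]  # observed outcome

tau_hat = ate_diff_in_means(y, d)   # consistent for the true ATE of 2
```

Because assignment is independent of $(y_0, y_1)$, the same statistic estimates the ATE, ATT, and ATU here.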
Properties
- Unbiased
- Consistent
- Asymptotically normal
- Nonparametrically identified: no parametric or functional form assumptions needed

Notes
- (A.ii) may be replaced by a mean independence assumption: $E[y_j \mid D = j] = E[y_j]$, $j = 0, 1$
- Randomization succeeds by balancing (in expectation) both observable and unobservable attributes of participants in the treatment and control group
- Randomization can be assessed by testing for differences in the joint dbn of predetermined attributes across the treatment and control groups
- Randomization at the eligibility or application stage only yields an estimate of the ATT, which does not equal the ATE unless (i) treatment effects are homogeneous or (ii) agents do not become eligible or apply due to unobserved, observation-specific gains to the treatment, $\varepsilon_1 - \varepsilon_0$
Selection on Observables

Randomization is typically not feasible in economics

Applied economists typically must rely on observational (or non-experimental) data

Data structure is now given by...

    attributes of i:  $\{y_{1i}, y_{0i}, D_i, x_i\}$
    observed for i:   $\{y_i, D_i, x_i\}$

where $x_i$ is a vector of observable attributes of i
Selection on Observables
Strong Ignorability

Assumptions
(A.i) iid sample: $\{y, D, x\}$ is an iid sample from the population
(A.ii) Conditional independence or unconfoundedness: $y_0, y_1 \perp D \mid x$
(A.iii) Common support or overlap: $\Pr(D = 1 \mid x) \in (0, 1)$

Note: the CIA is sometimes referred to as the selection on observables (or observed variables) assumption because if D is a deterministic fn of x, then the CIA will hold. However, the CIA is broader than this case; D may also depend on unobservables as long as these unobservables are not correlated with potential outcomes.
Notes...

(A.i) implies SUTVA

(A.ii) implies
$$\Pr(D_i = 1 \mid x_i, y_{1i}, y_{0i}) = \Pr(D_i = 1 \mid x_i)$$

(A.iii) ensures one observes agents with a particular x in both the treatment and control groups

(A.ii), (A.iii) $\Rightarrow$ strong ignorability (Rosenbaum & Rubin 1983)
- x's must be pre-determined (i.e., unaffected by treatment assignment); if some x's are directly affected by D or by the anticipation of D, then conditioning on them will mask (at least) some of the effect of the treatment
- Implies estimation under strong ignorability requires that an instrument exist, but it is not required to be observed (or even known), such that conditional on x, D is random rather than deterministic
- There may not exist any vector x in a particular data set for a particular treatment such that strong ignorability holds
- There is some tension between (A.ii) and (A.iii); some x's may perfectly predict treatment assignment (invalidating CS), but omission may invalidate the CIA... hence, the need for the implicit IV
Nonparametric identification

Estimation
$$\hat{\tau}_{ATE}(x) = \widehat{E}[y_i \mid x_i = x, D = 1] - \widehat{E}[y_i \mid x_i = x, D = 0] = \frac{\sum_{i=1}^{N} y_i I[x_i = x, D_i = 1]}{\sum_{i=1}^{N} I[x_i = x, D_i = 1]} - \frac{\sum_{i=1}^{N} y_i I[x_i = x, D_i = 0]}{\sum_{i=1}^{N} I[x_i = x, D_i = 0]}$$
$$\overset{p}{\to} E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] = E[y_{1i} \mid x_i = x, D = 1] - E[y_{0i} \mid x_i = x, D = 0] = E[y_{1i} \mid x_i = x] - E[y_{0i} \mid x_i = x]$$

and then
$$\hat{\tau}_{ATE} = \widehat{E}[\hat{\tau}_{ATE}(x)] = \int \hat{\tau}_{ATE}(x) f(x) dx = \frac{1}{N} \sum_i \hat{\tau}_{ATE}(x_i)$$

Similar story for other parameters, except the final step uses $f(x \mid D = 1)$ or $f(x \mid D = 0)$
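With discrete x, the cell-by-cell estimator and its aggregation over $f(x)$ or $f(x \mid D = 1)$ can be sketched as follows. This is a minimal illustration with made-up data, not course code:

```python
from collections import defaultdict

def cell_effects(y, d, x):
    """tau_hat(x): within-cell difference in mean outcomes by treatment status."""
    cells = defaultdict(lambda: [0.0, 0, 0.0, 0])  # x -> [sum y|D=1, n1, sum y|D=0, n0]
    for yi, di, xi in zip(y, d, x):
        c = cells[xi]
        if di == 1:
            c[0] += yi; c[1] += 1
        else:
            c[2] += yi; c[3] += 1
    # only cells observed in both groups (common support) are identified
    return {xi: c[0] / c[1] - c[2] / c[3]
            for xi, c in cells.items() if c[1] > 0 and c[3] > 0}

def aggregate(tau_x, x, d, group=None):
    """Average tau_hat(x) over f(x) (group=None, ATE) or f(x|D=group) (ATT/ATU)."""
    xs = [xi for xi, di in zip(x, d) if group is None or di == group]
    return sum(tau_x[xi] for xi in xs) / len(xs)

x = [0, 0, 0, 0, 1, 1, 1, 1]
d = [1, 1, 0, 0, 1, 0, 0, 0]
y = [3, 3, 1, 1, 5, 2, 2, 2]
tau_x = cell_effects(y, d, x)          # {0: 2.0, 1: 3.0}
tau_ate = aggregate(tau_x, x, d)       # 2.5
tau_att = aggregate(tau_x, x, d, 1)    # 7/3: treated obs sit mostly in cell x=0
```

The ATE and ATT differ here only because the two cells carry different weight in $f(x)$ versus $f(x \mid D = 1)$.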
Caveats
- If x takes on many values (even if still discrete), there may be a small sample size for any particular value, x, leading to high variance for $\hat{\tau}_{ATE}(x)$
- If x is continuous, then this estimator cannot be used since the probability of observing more than one obs with the same value of x is zero
- Possible solution: functional form assumptions
Final Note

The CIA is not testable except by conducting random experiments for comparison

One common test employed entails testing for differences in pre-treatment outcomes conditional on x between the to-be-treated and the controls
- Intuition: if D is uncorrelated with unobservables related to the outcome conditional on x, then pre-treatment outcomes should be unrelated to (future) D conditional on x
- Heckman et al. (1999) refers to this as the alignment fallacy
- In particular, a test based on outcomes more than one period in the past is misleading if shocks are serially correlated and agents self-select into the treatment group due to an adverse shock in the period directly before treatment
- In general, the test is useful if it rejects the independence of D and y conditional on x in periods prior to treatment; if it fails to reject, then the test is ambiguous
Selection on Observables
Strong Ignorability: Regression

Previous results showed that
$$\tau_{ATE}(x) = E[y_{1i} \mid x_i = x] - E[y_{0i} \mid x_i = x] = E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0]$$

Implies the key is to estimate the regression function $E[y_i \mid x_i, D_i]$
Assumptions
(A.iv) Separability:
$$y_{0i} = \mu_0(x_i) + \varepsilon_{0i}$$
$$y_{1i} = \mu_1(x_i) + \varepsilon_{1i}$$
where $E[\varepsilon_1 \mid x] = E[\varepsilon_0 \mid x] = E[\varepsilon_1 - \varepsilon_0 \mid x] = 0$

(A.v) Functional forms:
(A.va) Constant treatment effect
$$\mu_0(x_i) = \alpha_0 + x_i \beta$$
$$\mu_1(x_i) = \alpha_1 + x_i \beta$$
(A.vb) Heterogeneous treatment effects
$$\mu_0(x_i) = \alpha_0 + x_i \beta_0$$
$$\mu_1(x_i) = \alpha_1 + x_i \beta_1$$
Implications...

Given (A.i), (A.ii), (A.iv), and (A.va)...
$$E[y_i \mid x_i, D = 0] = \alpha_0 + x_i \beta + E[\varepsilon_{0i} \mid x_i, D = 0]$$
$$E[y_i \mid x_i, D = 1] = \alpha_1 + x_i \beta + E[\varepsilon_{1i} \mid x_i, D = 1]$$

implies
$$\tau_{ATE}(x) = E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] = \alpha_1 - \alpha_0 = \tau_{ATE} = \tau_{ATT} = \tau_{ATU}$$
Given (A.i), (A.ii), (A.iv), and (A.vb)...
$$E[y_i \mid x_i, D = 0] = \alpha_0 + x_i \beta_0 + E[\varepsilon_{0i} \mid x_i, D = 0]$$
$$E[y_i \mid x_i, D = 1] = \alpha_1 + x_i \beta_1 + E[\varepsilon_{1i} \mid x_i, D = 1]$$

implies
$$\tau_{ATE}(x) = E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] = (\alpha_1 - \alpha_0) + x(\beta_1 - \beta_0)$$

and
$$\tau_{ATE} = \int \tau_{ATE}(x) f(x) dx = (\alpha_1 - \alpha_0) + E[x](\beta_1 - \beta_0)$$
$$\tau_{ATT} = \int \tau_{ATE}(x) f(x \mid D = 1) dx = (\alpha_1 - \alpha_0) + E[x \mid D = 1](\beta_1 - \beta_0)$$
$$\tau_{ATU} = \int \tau_{ATE}(x) f(x \mid D = 0) dx = (\alpha_1 - \alpha_0) + E[x \mid D = 0](\beta_1 - \beta_0)$$
Estimation... Given (A.i), (A.ii), (A.iv), and (A.va)

Estimate via OLS
$$y_i = y_{0i} + D_i (y_{1i} - y_{0i})$$
$$= \alpha_0 + x_i \beta + \varepsilon_{0i} + D_i (\alpha_1 + x_i \beta + \varepsilon_{1i} - \alpha_0 - x_i \beta - \varepsilon_{0i})$$
$$= \alpha_0 + x_i \beta + (\alpha_1 - \alpha_0) D_i + [\varepsilon_{0i} + D_i (\varepsilon_{1i} - \varepsilon_{0i})]$$
$$= \alpha_0 + x_i \beta + \tau_{ATE} D_i + \zeta_i$$

The coefficient on D is an unbiased estimate of the causal parameter, and $\tau_{ATE} = \tau_{ATT} = \tau_{ATU}$
Estimation... Given (A.i), (A.ii), (A.iv), and (A.vb)...

Estimate via OLS
$$y_i = \alpha_0 + x_i \beta_0 + (\alpha_1 - \alpha_0) D_i + x_i D_i (\beta_1 - \beta_0) + [\varepsilon_{0i} + D_i (\varepsilon_{1i} - \varepsilon_{0i})]$$
$$= \alpha_0 + x_i \beta_0 + \delta_0 D_i + x_i D_i \delta_1 + \zeta_i$$

Estimates given by
$$\hat{\tau}_{ATE}(x) = \hat{\delta}_0 + x \hat{\delta}_1$$
$$\hat{\tau}_{ATE} = \hat{\delta}_0 + \bar{x} \hat{\delta}_1$$
$$\hat{\tau}_{ATT} = \hat{\delta}_0 + \bar{x}_1 \hat{\delta}_1$$
$$\hat{\tau}_{ATU} = \hat{\delta}_0 + \bar{x}_0 \hat{\delta}_1$$
where $\bar{x}_j = \sum_i x_i I[D_i = j] / \sum_i I[D_i = j]$, $j = 0, 1$
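Since the fully interacted regression is numerically equivalent to fitting separate regressions within the two groups, the estimators above can be illustrated with group-wise simple OLS. The sketch below assumes a scalar x; the noiseless data borrow the DGP quoted in the scatter-plot example later in the deck, E[y|x,D=0] = 1 + 1x and E[y|x,D=1] = 1.5 + 2.5x:

```python
def simple_ols(x, y):
    """Closed-form OLS for y = a + b*x; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b * xbar, b

# Noiseless illustration: E[y|x,D=0] = 1 + 1x, E[y|x,D=1] = 1.5 + 2.5x
x1, x0 = [0.4, 0.6, 0.8], [0.1, 0.2, 0.3]
y1 = [1.5 + 2.5 * xi for xi in x1]
y0 = [1.0 + 1.0 * xi for xi in x0]

a0, b0 = simple_ols(x0, y0)   # untreated group: alpha0, beta0
a1, b1 = simple_ols(x1, y1)   # treated group: alpha1, beta1

tau = lambda x: (a1 - a0) + x * (b1 - b0)        # tau_hat_ATE(x)
tau_ate = tau(sum(x1 + x0) / len(x1 + x0))       # at the overall mean of x
tau_att = tau(sum(x1) / len(x1))                 # at the mean of x among treated
tau_atu = tau(sum(x0) / len(x0))                 # at the mean of x among untreated
```

With no noise, the group regressions recover the DGP exactly, so $\hat{\tau}_{ATE}(x) = 0.5 + 1.5x$ evaluated at the three means gives 1.1, 1.4, and 0.8.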
Alternatively, estimate via OLS
$$y_i = \alpha_0 + (x_i - \bar{x}) \beta_0 + (\alpha_1 - \alpha_0) D_i + (x_i - \bar{x}) D_i (\beta_1 - \beta_0) + [\varepsilon_{0i} + D_i (\varepsilon_{1i} - \varepsilon_{0i})]$$
$$= \alpha_0 + (x_i - \bar{x}) \beta_0 + \delta_0 D_i + (x_i - \bar{x}) D_i \delta_1 + \zeta_i$$

Estimates given by
$$\hat{\tau}_{ATE}(x) = \hat{\delta}_0 + (x - \bar{x}) \hat{\delta}_1$$
$$\hat{\tau}_{ATE} = \hat{\delta}_0$$
$$\hat{\tau}_{ATT} = \hat{\delta}_0 + \bar{x}_1 \hat{\delta}_1$$
$$\hat{\tau}_{ATU} = \hat{\delta}_0 + \bar{x}_0 \hat{\delta}_1$$
where $\bar{x}_j = \sum_i (x_i - \bar{x}) I[D_i = j] / \sum_i I[D_i = j]$, $j = 0, 1$
Notes
- Inclusion of $\bar{x}$ on the RHS introduces the problem of a generated regressor; OLS std errors are incorrect, but the effect is generally minor
- Standard errors of the estimators obtained via delta method or bootstrap
- Prior to implementing the regression approach, it is useful to examine the normalized differences in x across the treatment and control groups
  - Normalized difference for a particular x is given by
$$\Delta_x = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{\sigma^2_{x,1} + \sigma^2_{x,0}}}$$
  - If $|\Delta_x| > 0.25$, regression results are sensitive to the functional form assumptions in (A.va) and (A.vb); see Imbens & Wooldridge (2009)
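A quick sketch of the normalized difference, using population variances (sample variances would be an equally reasonable reading of the formula; the toy data are invented):

```python
from statistics import mean, pvariance

def normalized_difference(x_treated, x_control):
    """Normalized difference of a covariate across treatment and control groups."""
    gap = mean(x_treated) - mean(x_control)
    return gap / (pvariance(x_treated) + pvariance(x_control)) ** 0.5

# |Delta_x| > 0.25 flags sensitivity to the linear functional form assumptions
delta_x = normalized_difference([1, 2, 3], [0, 1, 2])   # ~0.87 -> poorly balanced
```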
Selection on Observables
Strong Ignorability: Matching

Preliminaries
- Matching methods were quite popular, and still are to a large extent
- (Incorrectly) viewed by many as a magic bullet for the estimation of treatment effects, as a way to mimic randomized experiments
- In practice, only as good as the underlying assumptions
- Matching when identifying assumptions are violated may yield worse estimates than without matching

Assumptions required: (A.i), (A.ii), and (A.iii)
- Technicality #1: only need $y_0 \perp D \mid x$ to estimate the ATT; $y_1 \perp D \mid x$ to estimate the ATU
- Technicality #2: (really) only need $E[y_j \mid x, D = j] = E[y_j \mid x, D = j']$, $j, j' = 0, 1$ to estimate the ATE; $E[y_0 \mid x, D = 1] = E[y_0 \mid x, D = 0]$ to estimate the ATT; $E[y_1 \mid x, D = 0] = E[y_1 \mid x, D = 1]$ to estimate the ATU
Comparison to regression approach
- No functional form assumptions: if the CIA holds, but (A.va) or (A.vb) do not, then matching will be consistent and OLS will not
- Matching weights observations differently, giving more weight to those deemed most similar
- Matching requires, and thus highlights problems due to, CS

[Figure: simulated treated and untreated units with group-specific regression lines; DGP: E[y|x,D=0] = 1 + 1x, E[y|x,D=1] = 1.5 + 2.5x, sigma = 0.25]
  - CS is violated, but OLS simply extrapolates from each group to estimate the missing counterfactual at a particular value of x
  - If the linear regression specification is not globally accurate, then regression may yield severe bias (see earlier discussion on normalized differences)
The fallacy (perhaps!) of extrapolation
Estimation

Parameters
$$\tau_{ATE} = E[y_1 - y_0]$$
$$\tau_{ATT} = E[y_1 - y_0 \mid D = 1]$$
$$\tau_{ATU} = E[y_1 - y_0 \mid D = 0]$$

Infeasible estimators
$$\hat{\tau}_{ATE} = \frac{1}{N} \sum_i (y_{1i} - y_{0i})$$
$$\hat{\tau}_{ATT} = \frac{1}{\sum_i I[D_i = 1]} \sum_i (y_{1i} - y_{0i}) I[D_i = 1]$$
$$\hat{\tau}_{ATU} = \frac{1}{\sum_i I[D_i = 0]} \sum_i (y_{1i} - y_{0i}) I[D_i = 0]$$
Feasible estimators
$$\hat{\tau}_{ATT} = \frac{1}{\sum_i I[D_i = 1]} \sum_i (y_{1i} - \hat{y}_{i0}) I[D_i = 1]$$
$$\hat{\tau}_{ATU} = \frac{1}{\sum_i I[D_i = 0]} \sum_i (\hat{y}_{i1} - y_{0i}) I[D_i = 0]$$
$$\hat{\tau}_{ATE} = \frac{\sum_i I[D_i = 1]}{N} \hat{\tau}_{ATT} + \frac{\sum_i I[D_i = 0]}{N} \hat{\tau}_{ATU}$$
where $\hat{y}_{i0}$, $\hat{y}_{i1}$ are estimates of the missing counterfactuals, obtained as
$$\hat{y}_{i0} = \frac{1}{\sum_{l \mid D_l = 0} \omega_{il}} \sum_{l \mid D_l = 0} \omega_{il} y_{l0}$$
$$\hat{y}_{i1} = \frac{1}{\sum_{l \mid D_l = 1} \omega_{il}} \sum_{l \mid D_l = 1} \omega_{il} y_{l1}$$
where $\omega_{il}$ = weight given to observation l by observation i
Feasible estimation is accomplished by replacing the missing counterfactual with a weighted average of outcomes from the corresponding group

Formally, all matching estimators take the form
$$\hat{\tau}_{ATT} = \frac{1}{N_1} \sum_{i \mid D_i = 1} \left[ y_{1i} - \frac{1}{\sum_{l \mid D_l = 0} \omega_{il}} \sum_{l \mid D_l = 0} \omega_{il} y_{l0} \right]$$
$$\hat{\tau}_{ATU} = \frac{1}{N_0} \sum_{i \mid D_i = 0} \left[ \frac{1}{\sum_{l \mid D_l = 1} \omega_{il}} \sum_{l \mid D_l = 1} \omega_{il} y_{l1} - y_{0i} \right]$$
$$\hat{\tau}_{ATE} = \frac{N_1}{N} \hat{\tau}_{ATT} + \frac{N_0}{N} \hat{\tau}_{ATU}$$
where $N_j = \sum_i I[D_i = j]$, $j = 0, 1$

Matching estimators differ in terms of how the weights are specified and what exactly is matched on
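The general form above, with a pluggable weight function $\omega_{il}$, can be sketched directly; each specific scheme that follows then amounts to a choice of `weights`. The toy data are invented, and uniform weights are used only to verify the mechanics (they reduce the ATT estimator to a simple difference in means):

```python
def matching_att(y, d, weights):
    """Generic matching estimator of the ATT.

    For each treated i, the missing counterfactual y0 is imputed as a
    normalized weighted average of untreated outcomes, where weights(i, l)
    is the weight observation i places on untreated observation l.
    """
    treated = [i for i, di in enumerate(d) if di == 1]
    untreated = [l for l, dl in enumerate(d) if dl == 0]
    gaps = []
    for i in treated:
        w = [weights(i, l) for l in untreated]
        y0_hat = sum(wl * y[l] for wl, l in zip(w, untreated)) / sum(w)
        gaps.append(y[i] - y0_hat)
    return sum(gaps) / len(gaps)

y = [3.0, 5.0, 1.0, 2.0, 3.0]
d = [1, 1, 0, 0, 0]
att = matching_att(y, d, lambda i, l: 1.0)   # uniform weights: 4 - 2 = 2
```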
Selection on Observables
Strong Ignorability: Matching (Weighting Schemes)

Exact matching or cell matching

Assuming x contains only discrete variables, assign positive weight only to observations with identical values of x

Let there be K distinct values (or combinations) of x's indexed by k = 1, ..., K (i.e., K cells)

$N_{0k}$, $N_{1k}$ = the number of untreated, treated obs in cell k

Estimators given by
$$\hat{\tau}_{ATT} = \sum_k \frac{N_{1k}}{N_1} \left[ \frac{\sum_{i \in k \mid D_i = 1} y_{1i}}{N_{1k}} - \frac{\sum_{l \in k \mid D_l = 0} y_{l0}}{N_{0k}} \right]$$
$$\hat{\tau}_{ATU} = \sum_k \frac{N_{0k}}{N_0} \left[ \frac{\sum_{l \in k \mid D_l = 1} y_{l1}}{N_{1k}} - \frac{\sum_{i \in k \mid D_i = 0} y_{0i}}{N_{0k}} \right]$$
which reflect different weighted averages of the average treatment effect within the K cells
Estimator is subject to the curse of dimensionality

With high-dimensional x, or if x contains continuous variables, inexact matching algorithms are useful

Asymptotically, all inexact matching estimators are equivalent since the inexactness disappears as $N \to \infty$

In finite samples, different inexact matching algorithms may yield quite different estimates

A newly proposed middle ground between exact and inexact matching is known as coarsened exact matching (CEM)
- Intuition: round x to fewer distinct values, then match exactly on the coarsened data
- Developed by King et al.
- See -cem- in Stata
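The CEM intuition — coarsen, then match exactly — can be sketched like this. The cutpoints and data are invented for illustration; the actual -cem- algorithm chooses coarsenings more carefully:

```python
def coarsen(v, cutpoints):
    """Bin index of a continuous value: number of cutpoints at or below v."""
    return sum(1 for c in cutpoints if v >= c)

x = [0.10, 0.45, 0.52, 0.90, 0.48, 0.61]
d = [1, 1, 1, 0, 0, 0]
bins = [coarsen(xi, [0.5]) for xi in x]   # coarsen x into [0, 0.5) and [0.5, 1]

# exact matching then proceeds on the coarsened bins; cells lacking either
# treated or untreated observations are discarded (common support)
matched_cells = {b for b in set(bins)
                 if any(di == 1 for bi, di in zip(bins, d) if bi == b)
                 and any(di == 0 for bi, di in zip(bins, d) if bi == b)}
```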
Inexact matching

Requires a measure of distance between any two observations, i and l
- Euclidean-type distance metrics are of the form
$$d_{il} = (x_i - x_l)' W (x_i - x_l)$$
where common choices for W are
  1. $W = I$ (identity matrix)
  2. $W = \Sigma^{-1}$, where $\Sigma$ is the sample variance-covariance matrix of x (Mahalanobis metric)
  3. W is a diagonal matrix with the variances of x along the diagonal, zeros on the off-diagonal (Abadie & Imbens 2002, 2006)
  4. Zhao (2004) proposes other alternatives
- Propensity score methods compute the distance based on differences in the probability of being in the treatment group given x
$$p(x) = \Pr(D = 1 \mid x) \in [0, 1]$$
where the distance between two observations is
$$d_{il} = |p(x_i) - p(x_l)|$$
- If $y_0, y_1 \perp D \mid x \Rightarrow y_0, y_1 \perp D \mid p(x)$, which follows from the fact that $D \perp x \mid p(x)$ (Rosenbaum & Rubin 1983)
Euclidean-type distance metrics and the propensity score are both a means to circumvent dimensionality, as d is a scalar

No one method is superior; the goal is to balance the x's ... discussed later (Ho et al. 2007)
- In this sense, matching is not an estimator per se, but can be viewed as a way of pre-processing the data prior to applying some estimator
- Similar to a type of outlier analysis

Given $d_{il}$, several weighting schemes are frequently used
- Let C(0) represent a neighborhood around 0 for each i
- Observations given positive weight by i are those included in the set $A_i$, where
$$A_i = \{ l \mid D_l \neq D_i,\; d_{il} \in C(0) \}$$

Focusing on propensity score estimators, we can re-write this as
$$A_i = \{ l \mid D_l \neq D_i,\; p(x_l) \in C(p(x_i)) \}$$
where $C(p(x_i))$ represents a neighborhood around $p(x_i)$
Single nearest neighbor matching

Sets
$$C(p(x_i)) = \min_l |d_{il}| \;\Rightarrow\; \omega_{il} = \begin{cases} 1 & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$

Intuition: l has the closest propensity score to i, but with different treatment assignment
k-nearest neighbor matching

Sets
$$C(p(x_i)) = k\text{-}\min_l |d_{il}| \;\Rightarrow\; \omega_{il} = \begin{cases} 1/k & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$

Intuition: compute the average of the k closest obs to i in terms of the propensity score, but with different treatment assignment than i
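A sketch of k-nearest neighbor matching on the propensity score for the ATT. The scores p are taken as already estimated (here just made-up numbers), and ties are broken by sort order:

```python
def knn_att(y, d, p, k):
    """k-nearest-neighbor propensity score matching estimator of the ATT."""
    untreated = [l for l, dl in enumerate(d) if dl == 0]
    gaps = []
    for i, di in enumerate(d):
        if di == 1:
            # the k untreated obs with propensity scores closest to p[i]
            nearest = sorted(untreated, key=lambda l: abs(p[l] - p[i]))[:k]
            gaps.append(y[i] - sum(y[l] for l in nearest) / k)
    return sum(gaps) / len(gaps)

y = [10.0, 9.0, 7.0, 6.0, 1.0]
d = [1, 1, 0, 0, 0]
p = [0.80, 0.60, 0.75, 0.55, 0.30]   # hypothetical estimated propensity scores
att_1nn = knn_att(y, d, p, k=1)
att_2nn = knn_att(y, d, p, k=2)
```

Single nearest neighbor (k = 1) uses only the closest untreated match per treated obs; k = 2 averages over the two closest.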
Caliper or radius matching (Cochran & Rubin 1973)

Sets
$$C(p(x_i)) = \{ p(x_l) \mid |d_{il}| < \varepsilon \}$$
for a specified value of $\varepsilon$
$$\Rightarrow\; \omega_{il} = \begin{cases} 1/k_i & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$

Intuition: compute the average over all $k_i$ obs that differ from i in terms of the propensity score by less than $\varepsilon$, but with different treatment assignment than i
Kernel matching (Smith & Todd 2005)

Sets
$$C(p(x_i)) = \left\{ p(x_l) : \left| \frac{p(x_l) - p(x_i)}{a_N} \right| \leq 1 \right\}$$
$$\Rightarrow\; \omega_{il} = \begin{cases} \dfrac{G\left( \frac{p(x_l) - p(x_i)}{a_N} \right)}{\sum_{l' \mid D_{l'} = 0} G\left( \frac{p(x_{l'}) - p(x_i)}{a_N} \right)} & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$
where $G(\cdot)$ is the kernel function and $a_N$ is the bandwidth

Intuition: compute a weighted average over all $k_i$ obs that receive positive weight given the choice of $G(\cdot)$ and $a_N$, but with different treatment assignment than i
- $G(\cdot)$ must integrate to one, $a_N \to 0$ as $N \to \infty$, and $a_N N \to \infty$
- Ex: quartic kernel
$$G(s) = \begin{cases} \frac{15}{16} (1 - s^2)^2 & \text{if } |s| \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
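Kernel matching with the quartic kernel can be sketched as below, reusing the same made-up outcomes and propensity scores as in the k-NN sketch; the bandwidth of 0.2 is an arbitrary choice:

```python
def quartic(s):
    """Quartic (biweight) kernel: (15/16)(1 - s^2)^2 on |s| <= 1, else 0."""
    return 15.0 / 16.0 * (1.0 - s * s) ** 2 if abs(s) <= 1.0 else 0.0

def kernel_att(y, d, p, a_n):
    """Kernel matching on the propensity score with bandwidth a_n (ATT)."""
    untreated = [l for l, dl in enumerate(d) if dl == 0]
    gaps = []
    for i, di in enumerate(d):
        if di == 1:
            w = [quartic((p[l] - p[i]) / a_n) for l in untreated]
            if sum(w) > 0:   # skip treated obs with no untreated obs in the window
                y0_hat = sum(wl * y[l] for wl, l in zip(w, untreated)) / sum(w)
                gaps.append(y[i] - y0_hat)
    return sum(gaps) / len(gaps)

y = [10.0, 9.0, 7.0, 6.0, 1.0]
d = [1, 1, 0, 0, 0]
p = [0.80, 0.60, 0.75, 0.55, 0.30]
att = kernel_att(y, d, p, a_n=0.2)
```

Unlike k-NN, every untreated obs inside the bandwidth contributes, down-weighted by its distance in the propensity score.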
Local linear matching (Smith & Todd 2005)

Sets
$$C(p(x_i)) = \left\{ p(x_l) : \left| \frac{p(x_l) - p(x_i)}{a_N} \right| \leq 1 \right\}$$
$$\Rightarrow\; \omega_{il} = \begin{cases} \dfrac{G_{il} \sum_{l' \mid D_{l'} = 0} G_{il'} (p_{l'} - p_i)^2 - \left[ G_{il} (p_l - p_i) \right] \sum_{l' \mid D_{l'} = 0} G_{il'} (p_{l'} - p_i)}{\sum_{l \mid D_l = 0} G_{il} \sum_{l' \mid D_{l'} = 0} G_{il'} (p_{l'} - p_i)^2 - \left[ \sum_{l' \mid D_{l'} = 0} G_{il'} (p_{l'} - p_i) \right]^2} & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$
where $G_{il} = G\left( \frac{p_l - p_i}{a_N} \right)$

Intuition: similar to kernel matching, but differs in the handling of the weights assigned to obs when obs are distributed asymmetrically around i or when there are gaps in the distribution of the propensity score
Stratification or interval matching

Differs from the above schemes (although it can be written as a matching estimator)

Unit interval is divided into K intervals, the average outcome of treated and untreated is computed within each interval, and $\hat{\tau}_{ATE}(k)$ is computed within each interval

Finally
$$\hat{\tau}_{ATT} = \sum_k \frac{N_{1k}}{N_1} \hat{\tau}_{ATE}(k)$$
$$\hat{\tau}_{ATU} = \sum_k \frac{N_{0k}}{N_0} \hat{\tau}_{ATE}(k)$$
$$\hat{\tau}_{ATE} = \frac{\sum_i I[D_i = 1]}{N} \hat{\tau}_{ATT} + \frac{\sum_i I[D_i = 0]}{N} \hat{\tau}_{ATU}$$

Stata: -psmatch2- or -nnmatch-
Selection on Observables
Strong Ignorability: Matching (Comparison of Matching Methods)

Asymptotically, all methods are consistent if the assumptions hold and the bandwidth satisfies the requisite criteria

In finite samples, the choice may matter

Single nearest neighbor matching minimizes bias since it only uses the closest match; however, Frölich's (2004) MC analysis shows it fares poorly in practice

If the sample size is large and the propensity score is evenly dispersed across the unit interval, k-nearest neighbor matching may be ideal

If the sample size is large and the propensity score is asymmetrically distributed, kernel matching may be ideal (weights obs according to closeness)

If many obs have a propensity score close to the boundary (zero or one), LL matching may be ideal

Stratification methods face the problem of arbitrarily choosing K
Selection on Observables
Strong Ignorability: Matching (Regression Adjustment)

Various methods combine matching estimators with regression methods

Regression, then matching (Smith & Todd 2005)
- Regress $y_i$ on (some) $x_i$ for the treated and untreated samples, obtain residuals, and use the residuals to compute the matching estimators

Matching, then regression (Ho et al. 2007)
- Match to obtain the missing counterfactual for each obs, then regress $y_i$ on $D_i$ and (some) $x_i$ using the matched sample
- Standard errors are an issue here, as the usual OLS SEs are incorrect (more below)
Selection on Observables
Strong Ignorability: Matching in Practice

Several practical issues are confronted when implementing matching estimators:
1. Restriction to the common support
2. Does inexact matching balance the covariates, x?
3. Which variables belong in x?
4. Inference
5. Failure of the CIA
Selection on Observables
Strong Ignorability: Matching (Common Support)

Defined as
$$S_p = \{ p(x) : f(p \mid D = 1) > 0 \text{ and } f(p \mid D = 0) > 0 \}$$

Matching estimates are only defined at values of $p(x) \in S_p$

In practice, may want to exclude obs outside $S_p$

To do so requires an estimate
$$\hat{S}_p = \{ p(x) : \hat{f}(p \mid D = 1) > 0 \text{ and } \hat{f}(p \mid D = 0) > 0 \}$$

Smith & Todd (2005) recommend using NP density estimators to estimate $f(\cdot)$
$$\Rightarrow\; \hat{f}(p \mid D = j) = \sum_{i \mid D_i = j} G\left( \frac{p(x_i) - p}{a_N} \right), \quad j = 0, 1$$
- See -kdensity- in Stata
Imprecise alternative
$$\hat{S}_p = \left\{ p(x) : p \in \left[ \max \left\{ \min_{i \mid D_i = 0} p(x_i),\; \min_{i \mid D_i = 1} p(x_i) \right\},\; \min \left\{ \max_{i \mid D_i = 0} p(x_i),\; \max_{i \mid D_i = 1} p(x_i) \right\} \right] \right\}$$
- Simpler alternative
- Excludes obs just outside the CS for whom close matches exist
- Does not address "holes" in the interior of the dbn

Note: imposing the CS changes the interpretation of the parameters being estimated (e.g., $\hat{\tau}_{ATE}$ becomes the ATE for treated individuals with a propensity score in a particular region)

Trimming: Smith & Todd (2005) recommend reducing the CS to
$$\hat{S}_p = \{ p(x) : \hat{f}(p \mid D = 1) > q \text{ and } \hat{f}(p \mid D = 0) > q \}, \quad q \in (0, 1)$$

Dealing with limited overlap; see Crump et al. 2009
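The min-max version of the common support can be sketched as follows (the propensity scores are hypothetical):

```python
def minmax_support(p, d):
    """Min-max common support: overlap of the treated and untreated score ranges."""
    p1 = [pi for pi, di in zip(p, d) if di == 1]
    p0 = [pi for pi, di in zip(p, d) if di == 0]
    return max(min(p0), min(p1)), min(max(p0), max(p1))

p = [0.90, 0.60, 0.50, 0.20, 0.70]
d = [1, 1, 0, 0, 0]
lo, hi = minmax_support(p, d)          # (0.6, 0.7)
keep = [lo <= pi <= hi for pi in p]    # obs retained for estimation
```

Note how the treated obs at 0.90 and the untreated obs at 0.50 and 0.20 are discarded, even though 0.50 is quite close to a treated score — the "imprecise alternative" caveat above.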
Selection on Observables
Strong Ignorability: Matching (Balancing)

Matching mimics a randomized experiment in that conditioning on p(x) should balance x across the treated and untreated groups

Equivalently, the problem is reduced to a series of quasi-random experiments at each value of p(x)... hence, an IV exists which exogenously determines treatment assignment conditional on p(x)

Rosenbaum & Rubin (1983) prove that
$$x \perp D \mid p(x)$$
which implies
$$E[x \mid p(x), D = 0] = E[x \mid p(x), D = 1]$$

This holds regardless of whether the CIA holds

Balancing tests seek to gauge this

Note: this highlights that p(x) is simply a means to balance the x's; the goal of p(x) is not to model treatment choice (more below)
Stratification tests (e.g., Dehejia & Wahba 1999, 2002)
- Estimate the propensity score
- Divide the data into K intervals based on $\hat{p}(x)$
- Test for equal means (or other moments) of each x across the treated and control groups within each stratum
  - See -ttest- in Stata
- Test x's individually or jointly using the Hotelling $T^2$ test
  - See -hotel- in Stata
- Add higher order or interaction terms of x's failing the test, and repeat
- Problem: how to choose K?
  - Too small $\Rightarrow$ typically always reject equality
  - Too large $\Rightarrow$ rarely reject equality
Standardized differences
- Average difference in each x, where the weights from matching are used, normalized by the pooled SD of x in the full sample
- Example: ATT
$$SDIFF(x_m) = 100 \cdot \frac{\frac{1}{N_1} \sum_{i \mid D_i = 1} \left( x_{mi} - \sum_{l \mid D_l = 0} \omega_{il} x_{ml} \right)}{\sqrt{\frac{Var_{i \mid D_i = 1}(x_{mi}) + Var_{l \mid D_l = 0}(x_{ml})}{2}}}$$
- Problem: how large is too large? Rosenbaum & Rubin (1985) suggest 20 is large
- Perhaps the criterion should be more strict for variables thought to be more important in a particular application
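The ATT standardized difference can be sketched as below; the matching weights enter through a `weights(i, l)` function, which is normalized inside the loop. Uniform weights are used in the toy check, which is just the unmatched comparison:

```python
from statistics import mean, variance

def sdiff_att(xm, d, weights):
    """Standardized difference (x100) for covariate xm under ATT matching weights."""
    t = [i for i, di in enumerate(d) if di == 1]
    u = [l for l, dl in enumerate(d) if dl == 0]
    gaps = []
    for i in t:
        w = [weights(i, l) for l in u]
        gaps.append(xm[i] - sum(wl * xm[l] for wl, l in zip(w, u)) / sum(w))
    pooled_sd = ((variance([xm[i] for i in t])
                  + variance([xm[l] for l in u])) / 2) ** 0.5
    return 100.0 * mean(gaps) / pooled_sd

xm = [1.0, 2.0, 3.0, 0.0, 1.0, 2.0]
d = [1, 1, 1, 0, 0, 0]
sd = sdiff_att(xm, d, lambda i, l: 1.0)   # 100: means differ by one pooled SD
```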
Hotelling $T^2$ test
- Test the joint null of equal (weighted) means across the treatment and control groups
- Example: ATT
$$T^2 = (\bar{x}_1 - \bar{x}_0)' \hat{\Sigma}^{-1} (\bar{x}_1 - \bar{x}_0)$$
where $\bar{x}_1$ = vector of (unweighted) means from the treatment group and $\bar{x}_0$ = vector of weighted means from the untreated group, weighted by $\omega_{il}$
- Test may be conservative since the estimation of the weights is not accounted for

Regression-based test
- Estimate the propensity score
- Regress each x on a polynomial of p(x), D, and D interacted with the same polynomial of p(x)...
$$x_i = \gamma_0 + \sum_{s=1}^{S} \gamma_s p(x_i)^s + \delta_0 D_i + \sum_{s=1}^{S} \delta_s D_i p(x_i)^s + \nu_i$$
and test $H_0: \delta_0 = \delta_1 = \cdots = \delta_S = 0$
- Regression may be unweighted or weighted, assigning weight $\omega_l = \sum_{i \mid D_i = 1} \omega_{il}$ to each untreated obs (when the focus is on the ATT)
Selection on Observables
Strong Ignorability: Matching (Variable Selection)

The CIA is a strong assumption that places great demands on the data

Two issues
- What variables to include in x?
- What functional form to use; should x include higher order or interaction terms of the variables?

The CIA will certainly hold if x includes all variables that determine both outcomes and participation, but is this required?

Rubin and Thomas (1996) favor including variables in the propensity score model unless there is consensus that they do not belong

HIT (1997), HIST (1998), Heckman and Smith (1999), Lechner (2002), Smith & Todd (2005)
- Estimators are sensitive to the variables included in x
- Bias likely to result if x is too crude
Brookhart et al. (2006)
- Variables related to outcomes should always be included
- Variables weakly related to the outcome, even if strongly related to treatment assignment, should be excluded, as their inclusion results in higher mean squared error of the treatment effect estimate

Zhao (2007)
- Including irrelevant variables does not, by itself, yield biased estimates
- Over-fitting the propensity score model may be counterproductive

Wooldridge (2009), Pearl (2009)
- Consider classes of variables whose inclusion leads to bias
- Primary example is instrumental variables

Hirano et al. (2003)
- Using the true propensity score is inefficient even when it is known
- May imply that over-fitting the propensity score model may have little negative consequence in practice

Note: the goal of the PS model is not to find the best predictor of D
- Generally, variables that impact participation and not outcomes should be excluded; inclusion will exacerbate the CS problem
- Pseudo-$R^2$ criteria should not be used to judge the PS model
Millimet & Tchernis (2009)
- MC analysis of matching and weighting estimators (discussed later)
- Estimate the propensity score using a series logit estimator
$$\Pr(D = 1) = \frac{\exp \left( \phi_0 + \sum_{s=1}^{S} \phi_s x^s \right)}{1 + \exp \left( \phi_0 + \sum_{s=1}^{S} \phi_s x^s \right)}$$
where for sufficiently large S and appropriate coefficients, $\phi$, any participation function may be approximated
- SLE $\Rightarrow$ $\hat{\phi}$ estimated via ML
- Assess the impact of
  - Including irrelevant and excluding relevant higher order terms of variables that impact outcomes and participation
  - Including irrelevant and excluding relevant higher order terms of variables that impact outcomes only
  - Including irrelevant and excluding relevant higher order terms of variables that impact participation only
- Little impact of over-fitting
  - Asymptotic variance of nonparametric estimators is dominated by bias terms (Ichimura & Linton 2005)
  - Over-fitting minimizes the bias
  - Also, the normalized weighting estimator is preferable (discussed later)
DiNardo & Lee (2011) criticize us and show instances where adding x may exacerbate bias
- Their examples are instances where the CIA does not hold, but one applies an estimator that requires the CIA (such as matching)
- Thus, the matching estimator is already biased
- In this case, adding an additional covariate may increase or decrease the bias even if x belongs in the model
- That said, this is not the case examined in our work; we assume the CIA holds

Shaikh et al. (2009) propose a specification test of the propensity score model
- Informal test based on an "eyeball" comparison of the dbn of p(x) in the treatment and control groups
- Formal test procedure also provided
Selection on Observables
Strong Ignorability: Matching (Standard Errors)

Non-smooth matching estimators
- Correct standard errors are not feasible in this case
- Usual t-test for the difference in mean outcomes across the matched treated and untreated groups ignores the estimation of the propensity score and the nature of matching
- Problem due to estimation of the propensity score disappears asymptotically
- Eichler & Lechner (2001) suggest that N must be in the 1000s before this bias disappears

Bootstrap methods are feasible for smooth matching estimators (e.g., kernel matching), but there is no formal evidence

Abadie & Imbens (2006) provide asymptotic standard errors for non-propensity score matching estimators; work in progress focuses on propensity score matching estimators

Must be careful when bootstrapping data with choice-based sampling
Selection on Observables
Strong Ignorability: Matching (Misc. Implementation Issues)

Replacement?
- Single, k-nearest neighbor matching may be done with or without replacement
- Without replacement implies results are sensitive to the sort order of the data
- With replacement reduces bias (by improving match quality), but is less efficient (by using less of the data)

Estimation of the propensity score
- Typically probit or logit is used $\Rightarrow$ semiparametric estimator
- NP methods are available as well
Bandwidth Selection
- In NP work, bandwidth choice is typically much more important than the choice of kernel function
- Methods generally fall into three categories
  1. Ad hoc, combined with sensitivity analysis
  2. Rule-of-thumb approaches (Silverman 1986)
$$a_N \approx 1.06 \hat{\sigma} N^{-1/5}$$
  3. Data-driven methods (e.g., cross-validation)
- Leave-one-out cross-validation (e.g., ATT)
  - Perform a NP regression of y on p(x) using all untreated obs except l and a candidate bandwidth, $a_b$
  - Predict $\hat{y}_l$
  - Repeat for all l, $l = 1, ..., N_0$
  - Calculate the MSE
$$MSE(a_b) = \frac{1}{N_0} \sum_{l \mid D_l = 0} (y_l - \hat{y}_l)^2$$
  - Repeat for all candidate bandwidths $a_b$, $b = 1, ..., B$
  - Choose $a_b^*$ to minimize $MSE(a_b)$
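The leave-one-out procedure can be sketched as follows. The quartic kernel, candidate bandwidths, and made-up untreated sample are all arbitrary choices for the illustration:

```python
def quartic(s):
    """Quartic kernel: (15/16)(1 - s^2)^2 on |s| <= 1, else 0."""
    return 15.0 / 16.0 * (1.0 - s * s) ** 2 if abs(s) <= 1.0 else 0.0

def loo_cv_mse(p0, y0, a):
    """Leave-one-out MSE of a kernel regression of y on p among untreated obs."""
    sq_errs = []
    for l in range(len(p0)):
        num = den = 0.0
        for j in range(len(p0)):
            if j != l:
                w = quartic((p0[j] - p0[l]) / a)
                num += w * y0[j]
                den += w
        if den > 0:   # obs with no neighbors in the window cannot be predicted
            sq_errs.append((y0[l] - num / den) ** 2)
    return sum(sq_errs) / len(sq_errs) if sq_errs else float("inf")

p0 = [round(0.1 * k, 1) for k in range(1, 10)]   # untreated propensity scores
y0 = [pi * pi for pi in p0]                      # untreated outcomes
candidates = [0.05, 0.15, 0.45]
best = min(candidates, key=lambda a: loo_cv_mse(p0, y0, a))
```

Here 0.05 leaves every observation without neighbors (MSE of infinity), while 0.45 oversmooths at the boundaries, so cross-validation picks the middle bandwidth.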
Selection on Observables
Strong Ignorability: Matching (Sensitivity to Unobservables)

The CIA is not testable

Applied literature does/should assess the robustness of matching estimators

Several currently available techniques
- Rosenbaum bounds
- Simulation methods (Ichino et al. 2008)
- Minimum bias approach (Millimet & Tchernis 2011)
- Difference-in-differences matching
- Assuming SOO = SOU (Altonji et al. 2005; discussed later)
- Bayesian sensitivity analysis (de Luna & Lundin 2009)
Rosenbaum Bounds

Method of assessing the sensitivity of matching estimators to an unobserved confounder (Rosenbaum 2002)

Assume
$$p(x_i) = F(x_i \beta + \gamma u_i) = \frac{\exp(x_i \beta + \gamma u_i)}{1 + \exp(x_i \beta + \gamma u_i)}$$
where u is an unobserved binary variable and F is the logistic CDF

Implications
- Odds ratio for obs i is
$$\frac{p(x_i)}{1 - p(x_i)} = \exp(x_i \beta + \gamma u_i)$$
- Odds ratio for obs i relative to obs i'
$$\frac{\frac{p(x_i)}{1 - p(x_i)}}{\frac{p(x_{i'})}{1 - p(x_{i'})}} = \frac{\exp(x_i \beta + \gamma u_i)}{\exp(x_{i'} \beta + \gamma u_{i'})} = \exp(\gamma (u_i - u_{i'})) \text{ if } x_i = x_{i'}$$
- Thus, two observationally identical obs have different probabilities of being treated if $\gamma \neq 0$ and $u_i \neq u_{i'}$
How does inference regarding the treatment effect parameters change as γ and u_i − u_i′ change?
- Since u is binary, u_i − u_i′ ∈ {−1, 0, 1}
- Implies
  1/exp(γ) ≤ { p(x_i) / [1 − p(x_i)] } / { p(x_i′) / [1 − p(x_i′)] } ≤ exp(γ)
  where
  - exp(γ) = 1 ⇒ no selection bias
  - exp(γ) ↑ ⇒ greater selection bias
- Rosenbaum bounds compute bounds on the significance level of the matching estimate as exp(γ) changes values
  - If the matching estimate is statistically insignificant even when exp(γ) is close to 1, then the treatment effect is not robust
  - If the matching estimate is statistically significant even when exp(γ) is large, then the treatment effect is not sensitive to hidden bias

Stata: -rbounds-, -mhbounds-
Ichino et al. (2008) Approach
- Nannicini (2007) and Ichino et al. (2008) propose an alternative method of assessing the robustness of ATT estimates obtained under the CIA
- The sensitivity analysis is performed by comparing the baseline matching estimate to estimates obtained after additionally conditioning upon a simulated confounder
- The distribution of the simulated variable can be constructed to capture different hypotheses regarding the nature of potential confounders
Setup
- The parameter of interest is
  τ_ATT = E[y_1 − y_0 | D = 1]
- Accordingly, y_0 ⊥ D | x denotes the required CIA
- Suppose that this condition is not met, but if an unobservable, U, is added then a stronger CIA holds:
  y_0 ⊥ D | x, U
- Implies
  E[y_0 | D = 1, x] ≠ E[y_0 | D = 0, x]
  E[y_0 | D = 1, x, U] = E[y_0 | D = 0, x, U]
Solution
- Simulate the potential confounder and use it as a matching covariate
  - For simplicity, the potential outcomes and the confounding variable are assumed to be binary
  - Conditional independence of U and x is also assumed
  - Hence, the distribution of U is fully characterized by the choice of the following four parameters
    p_ij = Pr(U = 1 | D = i, y = j) = Pr(U = 1 | D = i, y = j, x), with i, j ∈ {0, 1}
  - Given the parameters p_ij, a value of U is simulated for each observation depending on D, y
- τ_ATT is then estimated with U as an additional matching covariate
- For a given set of the parameters p_ij, many simulations are performed, τ̂_ATT is computed for each simulation, and the mean/sd of the estimates is reported
Choosing p_ij...
- It is essential to consider useful potential confounders
- Calibrated confounders: choose p_ij to make the distribution of U similar to the empirical distribution of observable binary covariates
- Killer confounders: search over different p_ij for the existence of a U which makes τ̂_ATT = 0
- One can also simulate other meaningful confounders by setting the parameters p_ij and p_i., where p_i. can be computed as
  p_i. = Pr(U = 1 | D = i) = Σ_{j=0}^{1} p_ij Pr(y = j | D = i), with i ∈ {0, 1}
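The confounder-simulation step can be sketched as follows. This is a hedged toy version of the idea, not the -sensatt- implementation: D and y are simulated rather than real data, and the p_ij values are illustrative choices satisfying p_01 > p_00 and p_1. > p_0.

```python
# Simulate a binary confounder U with Pr(U=1 | D=i, y=j) = p_ij;
# U would then be added to the matching covariates (toy sketch).
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
D = rng.integers(0, 2, N)                # binary treatment indicator
y = rng.integers(0, 2, N)                # binary outcome
p_ij = {(0, 0): 0.2, (0, 1): 0.5,        # p_01 > p_00: positive outcome effect
        (1, 0): 0.6, (1, 1): 0.7}        # implies p_1. > p_0.: positive selection

probs = np.array([p_ij[(d, out)] for d, out in zip(D, y)])
U = (rng.uniform(size=N) < probs).astype(int)   # simulated confounder

# sanity check: empirical Pr(U=1 | D=0, y=1) should be close to p_01 = 0.5
p01_hat = U[(D == 0) & (y == 1)].mean()
```

In the actual procedure this draw is repeated many times, re-estimating τ̂_ATT with U as an extra matching covariate on each draw.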
Common case
- The typical scenario in applied work has τ̂_ATT > 0 in the baseline model
- Thus, concern centers on a potential confounder that has a positive effect on both the untreated outcome and selection into treatment
- Ichino et al. prove that
  1. p_01 > p_00 ⇒ Pr(y_0 = 1 | D = 0, U = 1, x) > Pr(y_0 = 1 | D = 0, U = 0, x)
     where p_01 = Pr(U = 1 | D = 0, y = 1) and p_00 = Pr(U = 1 | D = 0, y = 0)
  2. p_1. > p_0. ⇒ Pr(D = 1 | U = 1, x) > Pr(D = 1 | U = 0, x)
     where p_1. = Pr(U = 1 | D = 1) and p_0. = Pr(U = 1 | D = 0)
- Accordingly, by choosing p_01 > p_00 and setting p_1. > p_0., a confounder is simulated such that it has a positive effect on both y_0 and D even after conditioning on x
What do these p's represent?
- The differences
  d = p_01 − p_00
  s = p_1. − p_0.
  only depict the sign of U's outcome and selection effects
- The size of these effects must be evaluated after conditioning on x, to account for the association between U and x that shows up in the data
- Thus, at every iteration, logit models for Pr(y = 1 | D = 0, U, x) and Pr(D = 1 | U, x) are estimated
  - The average odds ratios of U,
    [ Pr(y = 1 | D = 0, U = 1, x) / Pr(y = 0 | D = 0, U = 1, x) ] / [ Pr(y = 1 | D = 0, U = 0, x) / Pr(y = 0 | D = 0, U = 0, x) ]
    and
    [ Pr(D = 1 | U = 1, x) / Pr(D = 0 | U = 1, x) ] / [ Pr(D = 1 | U = 0, x) / Pr(D = 0 | U = 0, x) ]
    are reported as the outcome and selection effects of the simulated confounder, and reflect the strength of U

Stata: -sensatt-
Minimum Bias Approach
- Intuition: trim the sample on the basis of p(x) to minimize the bias from a failure of the CIA
- Assume (A.iv) plus unobservables are trivariate normal: (ε_0, ε_1, u)′ ~ N_3(0, Σ), where
  Σ = ⎡ σ_0²  σ_01  σ_0u ⎤
      ⎢  ·    σ_1²  σ_1u ⎥
      ⎣  ·     ·     1   ⎦
  and u is the error from the treatment assignment equation
  D*_i = h(x_i) − u_i
  where D* is latent treatment assignment
The bias of the ATT at some value of the propensity score, p(x), is given by
  B_ATT[p(x)] = τ̂_ATT[p(x)] − τ_ATT[p(x)] = ρ_0u σ_0 φ(Φ⁻¹(p(x))) / { p(x)[1 − p(x)] }
where
- ρ_0u = selection on unobservables affecting the outcome in the untreated state
- φ and Φ are the standard normal PDF and CDF
- τ̂_ATT is some propensity score based estimator

B_ATT[p(x)] is minimized at p*(x) = 0.5
For the ATE,
  B_ATE[p(x)] = { ρ_0u σ_0 + [1 − p(x)] ρ_Δu σ_Δ } · φ(Φ⁻¹(p(x))) / { p(x)[1 − p(x)] }
where
- Δ = ε_1 − ε_0 = unobserved, individual-specific gain from treatment
- ρ_Δu = selection on unobserved, individual-specific gains

⇒ The bias-minimizing propensity score, p*(x), depends on the error correlation structure

Similar results appear in Black & Smith (2004) and Heckman and Navarro-Lozano (2004)
Minimum-biased (MB) estimation technique
- Stage 1: Estimate the propensity score (e.g., probit model)
- Stage 2: Retain only those observations with a propensity score, p̂(x_i), within a fixed neighborhood around p*(x), the bias-minimizing propensity score
- Stage 3: Estimate the ATE or ATT using any propensity-score based estimator that relies on CI, using this sub-sample

Notes:
- The estimator is biased, but it minimizes the bias
- For the ATT... this is straightforward as we know that p*(x) = 0.5
- For the ATE... p*(x) is unknown; it depends on the error correlations
- If the treatment effect is heterogeneous, then the interpretation changes; the estimand may not be economically interesting
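The three stages above reduce to a one-line trimming rule plus any CIA-based estimator on the retained sub-sample. A minimal illustrative sketch, assuming simulated data, a stand-in for the first-stage propensity score estimates, and an arbitrary neighborhood half-width of 0.1:

```python
# Minimum-biased ATT sketch: trim to propensity scores near p* = 0.5,
# then apply a simple mean-difference estimator on the trimmed sample.
import numpy as np

rng = np.random.default_rng(2)
N = 5_000
phat = rng.uniform(0.05, 0.95, N)          # stand-in for Stage 1 estimates
D = (rng.uniform(size=N) < phat).astype(int)
y = 1.0 * D + rng.normal(size=N)           # true effect = 1, no confounding here

window = 0.1                               # neighborhood half-width (a tuning choice)
keep = np.abs(phat - 0.5) <= window        # Stage 2: trim around p*(x) = 0.5

yk, Dk = y[keep], D[keep]
att_mb = yk[Dk == 1].mean() - yk[Dk == 0].mean()   # Stage 3 on the sub-sample
```

In practice Stage 3 would be a matching or weighting estimator rather than a raw mean difference; the trimming step is what the MB approach adds.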
For the ATE, add Stage 1.5: estimate the error correlations
- Feasible if one also imposes (A.va) or (A.vb)
- Estimate via OLS (discussed in more detail later)
  y_i = μ_0 + (μ_1 − μ_0)D_i + x_i β_0 + x_i D_i (β_1 − β_0)
        + θ_0 (1 − D_i) [ φ(x_i γ) / (1 − Φ(x_i γ)) ] + θ_1 D_i [ φ(x_i γ) / Φ(x_i γ) ] + η_i
  where φ(·)/Φ(·) is the inverse Mills ratio and
  θ_0 = ρ_0u σ_0
  θ_1 = ρ_0u σ_0 + ρ_Δu σ_Δ
- Replacing γ with γ̂ from the first-stage probit yields consistent estimates of ρ_0u σ_0 and ρ_Δu σ_Δ

Millimet & Tchernis (2009) find that trimming is inefficient when the CIA holds, but is more robust to (some) mis-specifications
Difference-in-Differences Matching
- All matching estimators are biased if unobservables invalidate the CIA
- Formally (e.g., ATT)
  τ_ATT(p(x)) = { E[y_1 | p(x), D = 1] − E[y_0 | p(x), D = 0] }
                + { E[y_0 | p(x), D = 0] − E[y_0 | p(x), D = 1] }
  where matching estimators are based on
  τ̂_ATT(p(x)) = E[y_1 | p(x), D = 1] − E[y_0 | p(x), D = 0]
  which implies
  bias = τ̂_ATT(p(x)) − τ_ATT(p(x))
       = E[y_0 | p(x), D = 1] (counterfactual) − E[y_0 | p(x), D = 0] (observed)
  which is zero under the CIA
Rearranging terms yields
  τ_ATT(p(x)) = τ̂_ATT(p(x)) − bias

This suggests a bias-corrected estimator is feasible if the bias can be consistently estimated

One might assume the bias equals the difference in mean outcomes prior to treatment:
  bias = E[y_0t | p(x), D = 1] − E[y_0t | p(x), D = 0]
       ≟ E[y_0t′ | p(x), D = 1] − E[y_0t′ | p(x), D = 0]
where t′ < t, t′ precedes the treatment, and t is post-treatment
Implies
  τ̂_ATT(p(x)) − bias
    = E[y_1t | p(x), D = 1] − E[y_0t | p(x), D = 0]
      − { E[y_0t′ | p(x), D = 1] − E[y_0t′ | p(x), D = 0] }
    = E[y_1t − y_0t′ | p(x), D = 1] − E[y_0t − y_0t′ | p(x), D = 0]

and τ̂_ATT(p(x)) = τ_ATT(p(x)) requires
  E[y_0t − y_0t′ | p(x), D = 1] = E[y_0t − y_0t′ | p(x), D = 0]
which is different than the original CIA
Implementation: difference the data ∀i, then match

DID matching requires the original CIA be replaced with
  Δy_0, Δy_1 ⊥ D | p(x)

Intuition:
- DID matching requires the change in potential outcomes to be independent of treatment assignment given the PS
- Equivalently, there are no time-varying unobservables correlated with both outcomes and treatment assignment given x

Smith & Todd (2005) find DID matching to be more robust, but conclusions are application-specific
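The "difference, then match" implementation can be sketched directly. A toy version on simulated panel data with a unit fixed effect that would bias a cross-sectional match; 1-nearest-neighbor matching with replacement on the propensity score is assumed for simplicity:

```python
# Difference-in-differences matching sketch: difference each unit's outcome
# (post minus pre), then 1-NN match treated to untreated on the propensity score.
import numpy as np

rng = np.random.default_rng(3)
N = 2_000
phat = rng.uniform(0.1, 0.9, N)
D = (rng.uniform(size=N) < phat).astype(int)
alpha = rng.normal(size=N)                        # time-invariant unobservable
y_pre = alpha + rng.normal(0, 0.1, N)             # pre-treatment outcome
y_post = alpha + 1.0 * D + rng.normal(0, 0.1, N)  # true ATT = 1

dy = y_post - y_pre                               # step 1: difference the data
p1, dy1 = phat[D == 1], dy[D == 1]
p0, dy0 = phat[D == 0], dy[D == 0]

# step 2: match each treated unit to its nearest untreated unit on phat
nn = np.abs(p1[:, None] - p0[None, :]).argmin(axis=1)
att_did = (dy1 - dy0[nn]).mean()
```

Differencing removes `alpha` exactly, which is why only time-varying unobservables threaten this estimator.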
Selection on Observables
Strong Ignorability: Inverse Propensity Score Weighting (IPW) Estimators

- An alternative to matching estimators, but still relies on knowing/estimating the propensity score
- Identities
  E[ Dy / p(x) ] = E[ Dy_1 / p(x) ]
                 = E[ E[ Dy_1 / p(x) | x ] ]
                 = E[ (1/p(x)) E[Dy_1 | x] ]
         (CIA)   = E[ (1/p(x)) E[D | x] E[y_1 | x] ]
                 = E[ (p(x)/p(x)) E[y_1 | x] ]
                 = E[ E[y_1 | x] ] = E[y_1]
  and, similarly,
  E[ (1 − D)y / (1 − p(x)) ] = E[y_0]
Parameters of interest (Horvitz & Thompson 1952)
  τ_ATE = E[ Dy/p(x) − (1 − D)y/(1 − p(x)) ]
        = E[ (D − p(x)) y / { p(x)[1 − p(x)] } ]
  τ_ATT = (1 / E[p(x)]) E[ p(x) { Dy/p(x) − (1 − D)y/(1 − p(x)) } ]
        = (1 / E[p(x)]) E[ (D − p(x)) y / (1 − p(x)) ]
  τ_ATU = (1 / E[1 − p(x)]) E[ [1 − p(x)] { Dy/p(x) − (1 − D)y/(1 − p(x)) } ]
        = (1 / E[1 − p(x)]) E[ (D − p(x)) y / p(x) ]

Proof: Wooldridge (2002, p. 613)
Estimation

Unnormalized estimators
  τ̂_ATE = (1/N) Σ_i [ D_i y_i / p̂(x_i) − (1 − D_i) y_i / (1 − p̂(x_i)) ]
        = (1/N) Σ_i [D_i − p̂(x_i)] y_i / { p̂(x_i)[1 − p̂(x_i)] }
  τ̂_ATT = [ (1/N) Σ_i p̂(x_i) ]⁻¹ (1/N) Σ_i p̂(x_i) [ D_i y_i / p̂(x_i) − (1 − D_i) y_i / (1 − p̂(x_i)) ]
        = [ (1/N) Σ_i p̂(x_i) ]⁻¹ (1/N) Σ_i [D_i − p̂(x_i)] y_i / [1 − p̂(x_i)]
  τ̂_ATU = [ (1/N) Σ_i (1 − p̂(x_i)) ]⁻¹ (1/N) Σ_i [1 − p̂(x_i)] [ D_i y_i / p̂(x_i) − (1 − D_i) y_i / (1 − p̂(x_i)) ]
        = [ (1/N) Σ_i (1 − p̂(x_i)) ]⁻¹ (1/N) Σ_i [D_i − p̂(x_i)] y_i / p̂(x_i)
Normalized estimators (Hirano and Imbens 2001)
- τ̂_ATE is the difference in two weighted averages, where the weights are
  D_i / [N p̂(x_i)]  and  (1 − D_i) / { N [1 − p̂(x_i)] }
- Problem: the weights may not sum to unity
- HI assign weights normalized by the sum of the inverse propensity score weights within the treated and untreated groups
- The unnormalized estimator assigns equal weights of 1/N to each observation
- Normalized estimator (e.g., for the ATE)
  τ̂_ATE = [ Σ_i D_i y_i / p̂(x_i) ] / [ Σ_i D_i / p̂(x_i) ]
          − [ Σ_i (1 − D_i) y_i / (1 − p̂(x_i)) ] / [ Σ_i (1 − D_i) / (1 − p̂(x_i)) ]
- Tends to be more stable in practice as it restricts weights to ≤ 1; Millimet & Tchernis (2009) and Busso et al. (2011) find it performs better

Standard errors are obtained via the bootstrap
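The unnormalized and normalized ATE estimators above translate line-for-line into code. An illustrative sketch on simulated data, using the true propensity score in place of a first-stage estimate for clarity:

```python
# Unnormalized vs. normalized (Hirano-Imbens) IPW estimators of the ATE.
import numpy as np

rng = np.random.default_rng(4)
N = 20_000
x = rng.normal(size=N)
p = 1 / (1 + np.exp(-x))                  # propensity score
D = (rng.uniform(size=N) < p).astype(int)
y = 2.0 * D + x + rng.normal(size=N)      # true ATE = 2

# Unnormalized: equal 1/N weights
ate_unnorm = np.mean(D * y / p - (1 - D) * y / (1 - p))

# Normalized: weights rescaled to sum to one within each group
ate_norm = (np.sum(D * y / p) / np.sum(D / p)
            - np.sum((1 - D) * y / (1 - p)) / np.sum((1 - D) / (1 - p)))
```

With an estimated p̂(x) the same two lines apply; in small samples with extreme scores the normalized version is typically the more stable of the two.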
Selection on Observables
Strong Ignorability: Regression (Again)

- Use the propensity score as a control variable in a regression
- Assumptions
  (A.vi) E[y_1 − y_0 | x] is uncorrelated with Var(D | x) = p(x)[1 − p(x)]
  (A.vii) E[y_1 | p(x)] and E[y_0 | p(x)] are linear in p(x)
- (A.vi) has no good interpretation
- (A.vii) replaces the functional form assumptions discussed in the previous regression approach
Estimation

Given (A.ii) and (A.vi)...
- Estimate via OLS
  y_i = α_0 + α_1 D_i + γ p̂(x_i) + ε_i
- Estimates given by
  τ̂_ATE = τ̂_ATT = τ̂_ATU = α̂_1
  which is consistent and asymptotically normal if p̂(x_i) is consistent and asymptotically normal
- Proof: see Wooldridge (2002)
Given (A.ii) and (A.vii)...
- Estimate via OLS
  y_i = α_0 + α_1 D_i + γ_0 p̂(x_i) + γ_1 [ p̂(x_i) − μ̂_p ] D_i + ε_i
  where μ̂_p = (1/N) Σ_i p̂(x_i)
- Estimates given by
  τ̂_ATE(x) = α̂_1 + γ̂_1 [ p̂(x) − μ̂_p ]
  τ̂_ATE = α̂_1
  τ̂_ATT = α̂_1 + γ̂_1 x̄_1
  τ̂_ATU = α̂_1 + γ̂_1 x̄_0
  where x̄_j = Σ_i [ p̂(x_i) − μ̂_p ] I[D_i = j] / Σ_i I[D_i = j], j = 0, 1
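The regression above is a single OLS fit. An illustrative sketch on simulated data, with a uniform stand-in for the estimated propensity scores (so the coefficient names α_1, γ_0, γ_1 map onto columns of the design matrix):

```python
# Regression-on-the-propensity-score estimator under (A.vii): OLS of y on
# D, phat, and the demeaned-phat interaction; the D coefficient is the ATE.
import numpy as np

rng = np.random.default_rng(5)
N = 10_000
phat = rng.uniform(0.1, 0.9, N)            # stand-in for estimated scores
D = (rng.uniform(size=N) < phat).astype(int)
y = 1.0 * D + 2.0 * phat + rng.normal(size=N)   # true ATE = 1

mu_p = phat.mean()
X = np.column_stack([np.ones(N), D, phat, (phat - mu_p) * D])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ate_hat = coef[1]                          # alpha_1-hat = tau_ATE-hat
```

The ATT/ATU versions add γ̂_1 times the within-group mean of the demeaned score, per the formulas above.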
Given (A.ii) and a weaker version of (A.vii)...
- Estimate via OLS
  y_i = α_0 + α_1 D_i + Σ_{k=1}^{K} γ_0k p̂(x_i)^k + Σ_{k=1}^{K} γ_1k [ p̂(x_i)^k − μ̂_p^k ] D_i + ε_i
  where
  μ̂_p^k = (1/N) Σ_i p̂(x_i)^k, k = 1, ..., K
  and K is a low order number
- Estimates given by
  τ̂_ATE(x) = α̂_1 + Σ_{k=1}^{K} γ̂_1k [ p̂(x)^k − μ̂_p^k ]
  τ̂_ATE = α̂_1
  τ̂_ATT = α̂_1 + Σ_{k=1}^{K} γ̂_1k x̄_1^k
  τ̂_ATU = α̂_1 + Σ_{k=1}^{K} γ̂_1k x̄_0^k
  where x̄_j^k = Σ_i [ p̂(x_i)^k − μ̂_p^k ] I[D_i = j] / Σ_i I[D_i = j], j = 0, 1; k = 1, ..., K
Selection on Observables
Strong Ignorability: Double-Robust Estimators

- Robins and Rotnitzky (1995), Lunceford and Davidian (2004), and others discuss DR estimators
- DR estimators combine regression and weighting estimators and are "double robust" because they are consistent as long as either the regression specification for the outcome or the propensity score specification is correct
Estimation

OLS estimation
  y_i = α_0 + x_i β + α_1 D_i + γ_0 D_i / p̂(x_i) + γ_1 (1 − D_i) / (1 − p̂(x_i)) + ε_i
  τ̂_ATE = α̂_1 + (1/N) Σ_i [ γ̂_0 D_i / p̂(x_i) − γ̂_1 (1 − D_i) / (1 − p̂(x_i)) ]
  τ̂_ATT = α̂_1 + (1/N_1) Σ_{i: D_i = 1} [ γ̂_0 D_i / p̂(x_i) − γ̂_1 (1 − D_i) / (1 − p̂(x_i)) ]
  τ̂_ATU = α̂_1 + (1/N_0) Σ_{i: D_i = 0} [ γ̂_0 D_i / p̂(x_i) − γ̂_1 (1 − D_i) / (1 − p̂(x_i)) ]
WLS estimation: ATE
  y_i = α_0 + x_i β + α_1 D_i + ε_i
where the weights are
  ω_i = √( D_i / p̂(x_i) + (1 − D_i) / (1 − p̂(x_i)) )
and different weights are used for the ATT, ATU (given above)

Augmented IPW: ATE (Lunceford and Davidian 2004; Glynn and Quinn 2010)
  τ̂_ATE = (1/N) Σ_i [ ( D_i y_i − (D_i − p̂(x_i)) ĝ_1(x_i) ) / p̂(x_i)
                      − ( (1 − D_i) y_i + (D_i − p̂(x_i)) ĝ_0(x_i) ) / (1 − p̂(x_i)) ]
where g_0(x_i) and g_1(x_i) are estimated via separate OLS regressions of y on x
- See -dr- in Stata
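The augmented IPW formula above can be sketched directly: fit the two group-specific outcome regressions, then plug their predictions into the weighted average. Illustrative simulated data; the true propensity score is used for clarity, and `ols_fit` is my helper name:

```python
# Augmented IPW (double-robust) estimator of the ATE (toy sketch).
import numpy as np

rng = np.random.default_rng(6)
N = 10_000
x = rng.normal(size=N)
p = 1 / (1 + np.exp(-x))
D = (rng.uniform(size=N) < p).astype(int)
y = 1.0 * D + x + rng.normal(size=N)       # true ATE = 1

def ols_fit(xs, ys):
    """OLS of ys on a constant and xs; returns a prediction function."""
    X = np.column_stack([np.ones(len(xs)), xs])
    b, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return lambda xn: b[0] + b[1] * xn

g1 = ols_fit(x[D == 1], y[D == 1])         # outcome model, treated
g0 = ols_fit(x[D == 0], y[D == 0])         # outcome model, untreated

ate_aipw = np.mean((D * y - (D - p) * g1(x)) / p
                   - ((1 - D) * y + (D - p) * g0(x)) / (1 - p))
```

Consistency survives if either the outcome regressions or the propensity model is misspecified, which is the "double robustness" being exploited.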
Selection on Observables
Strong Ignorability: Decomposition of Treatment Effects

- Flores & Flores-Lagunes (2009) provide a framework to decompose τ_k into a direct effect of D and an indirect effect that operates through some causal mechanism, S
- Setup
  - S ∈ {0, 1} is a post-treatment, mechanism variable
  - S_0, S_1 are the potential values of S associated with D = 0 and D = 1
  - S = D S_1 + (1 − D) S_0 is the realized value of S
- Example: D = 1 if student i attends a private HS, 0 otherwise; S = 1 if student i obtains a college degree, 0 otherwise; y = earnings as an adult
Composite potential outcomes for y are defined as y(D, S_D′), D, D′ ∈ {0, 1}
- y(1, S_1) = potential outcome associated with D = 1 and S_1, the realized value of the mechanism variable S when D = 1
- y(0, S_0) = potential outcome associated with D = 0 and S_0, the realized value of the mechanism variable S when D = 0
- y(1, S_0) = potential outcome associated with D = 1 and S_0, the realized value of the mechanism variable S when D = 0
Decomposing τ_ATE
  τ_ATE = E[y(1, S_1)] − E[y(0, S_0)]
        = { E[y(1, S_1)] − E[y(1, S_0)] }   (A)
          + { E[y(1, S_0)] − E[y(0, S_0)] } (B)
where A represents the indirect effect of D on y operating through S, and B represents the direct effect of D on y fixing S at the non-treatment value

The authors refer to
- A as the individual causal mechanism effect
- B as the net average treatment effect

Note, B still reflects two effects of D on y
1. Effects of D on y operating independently of S
2. Effects of D on y operating through a change in the return to S (i.e., even though the level of S is held fixed, the effect of S on y may change due to D)
Assumptions
(DTE.i) Independence of Treatment: {y(1, S_1), y(0, S_0), y(1, S_0), S_0, S_1} ⊥ D
(DTE.ii) Conditional Independence of Potential Mechanisms: {y(1, S_1), y(0, S_0), y(1, S_0)} ⊥ {S_0, S_1} | x
(DTE.iii) Constant Functional Form: if E[y(1, S_1) | S_1 = s_1, x] = f_1(S_1, x), then E[y(1, S_0) | S_0 = s_0, x] = f_1(S_0, x)

- (DTE.iii) implies that the functional form relating S and x to y when D = 1 is the same regardless of whether S = S_1 or S = S_0
- Under (DTE.i)–(DTE.iii), τ_ATE and B can be estimated, and then A can be backed out
- An extension to the case where (DTE.i) only holds conditional on x is also presented
Selection on Observables
Non-Binary Treatments: Multi-Valued Treatments

- Suppose the treatment can take on many discrete values
  D ∈ 𝒟 = {d_0, d_1, d_2, ..., d_J}
  (e.g., years of education)
- y_j = potential outcome for treatment j = 0, 1, ..., J
- Parameters of interest
  τ_ATE(j, j′) = E[y_j − y_j′], j, j′ ∈ 𝒟
  τ_ATE(j, j′ | D ∈ {j, j′}) = E[y_j − y_j′ | D ∈ {j, j′}], j, j′ ∈ 𝒟
  τ_ATT(j, j′) = E[y_j − y_j′ | D = j], j, j′ ∈ 𝒟
- The dose-response function reflects the unconditional expectation of potential outcomes at each dose: E[y_j] ∀j
Now, there are J missing counterfactuals
- D_ji = indicator that obs i receives treatment j:
  D_ji = 1 if D_i = j; 0 otherwise
- y_i = observed outcome for i:
  y_i = Σ_{j=0}^{J} y_ji D_ji
Identification of the dose-response function
- Unconditional independence
  {y_j}_{j ∈ 𝒟} ⊥ D
- Strong unconfoundedness (Rosenbaum & Rubin 1983)
  {y_j}_{j ∈ 𝒟} ⊥ D | x
  ⇒ treatment assignment is conditionally independent of all potential outcomes
- Weak unconfoundedness (Imbens 2000)
  y_j ⊥ D_j | x ∀j
  ⇒ assignment to any particular treatment is conditionally independent of that treatment's potential outcome
Implication of weak unconfoundedness
  E[y_j | x] = E[y | D_j = 1, x] = E[y | D = j, x]
  ⇒ E[y_j] = E[ E[y | D = j, x] ]

⇒ one may estimate the conditional dose-response function by estimating the mean outcome given treatment assignment and x, and then obtain the population dose-response function by averaging over the distribution of x

  ⇒ E[y_j − y_j′] = E[ E[y_j − y_j′ | x] ]
                  = E[ E[y | D = j, x] − E[y | D = j′, x] ]
Example
- Let x = gender (M, F); 𝒟 = years of schooling (0, 1, ..., 21)
- E[y_j] is obtained by
  - Computing the average value of y for the sub-sample with D_ji = 1 and x = M ⇒ ȳ_j^M
  - Computing the average value of y for the sub-sample with D_ji = 1 and x = F ⇒ ȳ_j^F
  - Obtaining the proportions of M and F in the entire sample ⇒ p^M, p^F
  - Computing p^M ȳ_j^M + p^F ȳ_j^F
- Obtain E[y_j′] similarly
- Compute the difference
- Other parameters can be estimated by using the proportions of M and F in various sub-samples (e.g., D ∈ {j, j′} only)
Generalized propensity score
- Definition
  r(j, x) = Pr(D = j | x) = E[D_j | x]
- r(j, x) may be estimated given data on D, x (MNL, MNP, ordered logit/probit)
- Imbens (2000) shows that weak unconfoundedness ⇒
  y_j ⊥ D_j | r(j, x) ∀j
  and
  E[y_j | r(j, x)] = E[y | D_j = 1, r(j, x)] = E[y | D = j, r(j, x)]
  and
  E[y_j] = E[ E[y | D = j, r(j, x)] ]
- The above result requires r(j, x) > 0 along the entire support of x
Estimation
- Given weak unconfoundedness and assuming r(j, x) > 0 for the entire support of x, then
  E[ D_j y / r(j, x) ] = E[y_j]
- Estimator
  Ê[y_j] = (1/N) Σ_i D_ji y_i / r̂(j, x_i)
  which is analogous to the weighting estimator defined previously in the binary treatment case
- The analogous normalized weighting estimator is given by
  Ê[y_j] = [ Σ_i D_ji y_i / r̂(j, x_i) ] [ Σ_i D_ji / r̂(j, x_i) ]⁻¹
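The normalized weighting estimator of E[y_j] extends the binary IPW code almost verbatim. An illustrative sketch with three treatment levels, one binary covariate, and known generalized propensity scores r(j, x) (the values in `r` are my own choices):

```python
# Normalized weighting estimator of E[y_j] for a multi-valued treatment.
import numpy as np

rng = np.random.default_rng(7)
N = 30_000
x = rng.integers(0, 2, N)                           # one binary covariate
# r(j, x): Pr(D = j | x) for treatments j = 0, 1, 2 (illustrative values)
r = np.array([[0.5, 0.3, 0.2],                      # x = 0
              [0.2, 0.3, 0.5]])                     # x = 1
D = np.array([rng.choice(3, p=r[xi]) for xi in x])
y = D + x + rng.normal(0, 0.5, N)                   # E[y_j] = j + E[x] = j + 0.5

def Ey_j(j):
    """Normalized weighting estimate of E[y_j]."""
    Dj = (D == j).astype(float)
    w = Dj / r[x, j]                                # weight by 1 / r(j, x_i)
    return np.sum(w * y) / np.sum(w)

dose_response = [Ey_j(j) for j in range(3)]         # estimated E[y_j], j = 0, 1, 2
```

With an estimated r̂(j, x) from, e.g., a multinomial logit, only the `r` lookup changes.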
Selection on Observables
Non-Binary Treatments: Continuous Treatments

- Suppose 𝒟 is an interval [d̲, d̄] and D has a continuous dbn on 𝒟 (e.g., income)
- y_j = potential outcome for treatment j ∈ 𝒟
- D_j is not useful since j takes on an infinite number of values
- Weak unconfoundedness can be re-stated as
  y_j ⊥ D | x ∀j ∈ 𝒟
  in contrast to strong unconfoundedness, which requires {y_j}_{j ∈ 𝒟}, the full set of potential outcomes, to be conditionally independent
Generalized propensity score
- Now defined as the conditional density of D given x:
  r(j, x) = f(j | x)
- Implication (Hirano & Imbens 2004)
  y_j ⊥ D | r(j, x) ∀j
- Estimation is based on
  E[y_j] = E[ E[y | D = j, r(j, x)] ]
- Since D is continuous, estimation entails
  - Estimation of r(j, x)
  - Estimating E[y | D = j, r(j, x)] by regressing y on D and r̂(j, x)
  - Averaging Ê[y | D = j, r(j, x)] over the dbn of x (at a fixed value of j)
- Weighting estimator version: see Robins (1998), Hernan et al. (2000)
- See -doseresponse- in Stata
Stratification estimator version (Imai & van Dyk 2004)
- Regress D on x via OLS ⇒ Ê[D | x] = x δ̂
- Split the sample into K strata of equal size based on x δ̂
- Within each stratum, model y as a function of D (and perhaps x to further control for differences in x)
  - y continuous: regress y on D and x
  - y binary: probit/logit
  - y ordered: oprobit/ologit
  - y count: poisson, NB
  ⇒ τ̂_ATE,k is given by the coefficient on D
- Obtain the overall ATE as
  τ̂_ATE = Σ_k (N_k / N) τ̂_ATE,k
- Generalizable to the multiple treatment case (e.g., two continuous treatments: income, educ)
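The stratification steps above can be sketched end-to-end. This is an illustrative toy version on simulated data (continuous y, one covariate, K = 5 strata), not the authors' implementation:

```python
# Imai-van Dyk stratification sketch: regress D on x, stratify on fitted
# values, estimate the within-stratum D coefficient, average with N_k/N weights.
import numpy as np

rng = np.random.default_rng(8)
N = 10_000
x = rng.normal(size=N)
D = 0.8 * x + rng.normal(size=N)           # continuous treatment
y = 1.0 * D + x + rng.normal(size=N)       # true effect per unit of D = 1

# Stage 1: OLS of D on x -> fitted values x*delta-hat
X1 = np.column_stack([np.ones(N), x])
delta, *_ = np.linalg.lstsq(X1, D, rcond=None)
fit = X1 @ delta

# Stage 2: K strata of equal size based on the fitted values
K = 5
cuts = np.quantile(fit, np.linspace(0, 1, K + 1)[1:-1])
strata = np.searchsorted(cuts, fit)

# Stage 3: within each stratum, OLS of y on D and x; weight by N_k / N
ate_hat = 0.0
for k in range(K):
    m = strata == k
    Xk = np.column_stack([np.ones(m.sum()), D[m], x[m]])
    bk, *_ = np.linalg.lstsq(Xk, y[m], rcond=None)
    ate_hat += (m.sum() / N) * bk[1]
```

For binary, ordered, or count y only the within-stratum model changes, per the list above.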
Selection on Observables
Dynamic Matching

- Pertains to situations where agents receive an initial treatment or not, and then have the option of receiving a second treatment if they receive the first treatment
- Many employment or job training programs, or treatments within schools, operate in this manner
- Need to carefully consider the parameter of interest in these applications, as well as the CIA at different stages of the problem
- See work by Lechner (2009, JBES), Lechner and Miquel (2010, EE), Cooley et al. (2010), or Behrman et al. (2004, ReStat)
Selection on Observables
Regression Discontinuity

- This estimator returns us to the class of binary treatments
- First introduced in Thistlethwaite & Campbell (1960)
- Two classes of models: sharp, fuzzy
- Sharp RD is a selection-on-observables estimator, but is not based on strong ignorability (in fact, it precludes it)
- Fuzzy RD is a selection-on-unobservables estimator (discussed later in the course)
- Note: there is also recent work on the Regression Kink Design (Card, Lee, & Pei 2009)
RD setup
- Agents self-select into the treatment group
- Selection is done at least in part on the basis of an observed continuous variable, s
  - s is referred to as the score, running variable, or forcing variable
- s may directly impact potential outcomes as well
- There exists a discrete jump in Pr(D = 1) at a known value, s̄

Thus, s and s̄ are both known to the econometrician
Sharp RD model
(SRD.i) Treatment assignment is a deterministic function of s (with a known threshold, s̄):
  D_i = D(s_i) = 1 if s_i > s̄; 0 otherwise
(SRD.ii) Positive density at the threshold: f_S(s̄) > 0
(SRD.iii) Outcomes are continuous in s, at least around s̄
(SRD.iv) For each agent, the dbn of s is continuous, at least around s̄

Notes
- (SRD.ii) implies we see agents near s̄
- (SRD.iii) precludes discontinuities in y at s̄ due to reasons other than changes in D
- (SRD.iv) implies that agents cannot perfectly manipulate s to ensure s ≷ s̄
  - This is crucial to give the setup the interpretation of a random experiment in the neighborhood of s̄
Notes (cont.)
- y_0, y_1 ⊥ D | s follows from (SRD.i)
- All RD estimators require the existence of the following limits
  D⁺ = lim_{s↓s̄} Pr(D = 1 | s)
  D⁻ = lim_{s↑s̄} Pr(D = 1 | s)
  with D⁺ ≠ D⁻
  - (SRD.i) implies D⁺ = 1 and D⁻ = 0
- The common support condition is necessarily violated since
  Pr(D = 1) = 1 if s_i > s̄; 0 otherwise
  which implies that Pr(D = 1 | s) ∉ (0, 1) ∀s
Parameter of interest
  τ_ATE(s̄) = E[y_1 − y_0 | s̄]
           = lim_{s↓s̄} E[y | s] − lim_{s↑s̄} E[y | s]

DiNardo & Lee (2011) advocate a different interpretation
- They argue that RD estimates a weighted average of τ_i, where the weights are proportional to the probability that an agent's s_i is in the neighborhood of s̄
Estimation
- Use only the sub-sample with s_i ∈ (s̄ − ε, s̄ + ε) for small ε
  - Similar s ⇒ similar observations
  - Compute the mean difference in outcomes across treatment groups
    τ̂_ATE(s̄) = Ê[y_i | s_i ∈ (s̄, s̄ + ε), D = 1] − Ê[y_i | s_i ∈ (s̄ − ε, s̄), D = 0]
      = Σ_{i=1}^{N} y_i I[s_i ∈ (s̄, s̄ + ε), D_i = 1] / Σ_{i=1}^{N} I[s_i ∈ (s̄, s̄ + ε), D_i = 1]
        − Σ_{i=1}^{N} y_i I[s_i ∈ (s̄ − ε, s̄), D_i = 0] / Σ_{i=1}^{N} I[s_i ∈ (s̄ − ε, s̄), D_i = 0]
      →p E[y_i | s_i ∈ (s̄, s̄ + ε), D = 1] − E[y_i | s_i ∈ (s̄ − ε, s̄), D = 0]
      = E[y_1i | s_i ∈ (s̄, s̄ + ε), D = 1] − E[y_0i | s_i ∈ (s̄ − ε, s̄), D = 0]
      = E[y_1i | s_i ∈ (s̄, s̄ + ε)] − E[y_0i | s_i ∈ (s̄ − ε, s̄)]
      ≠ lim_{s↓s̄} E[y | s] − lim_{s↑s̄} E[y | s] for fixed ε > 0
This is essentially a kernel estimator with a uniform kernel over the interval (s̄, s̄ + ε) or (s̄ − ε, s̄), which entails a non-negligible bias for ε > 0

Example: if y is increasing in s, then
- Ê[y_i | s_i ∈ (s̄, s̄ + ε), D = 1] will overestimate lim_{s↓s̄} E[y | s]
- Ê[y_i | s_i ∈ (s̄ − ε, s̄), D = 0] will underestimate lim_{s↑s̄} E[y | s]
⇒ τ̂_ATE(s̄) will be biased up
Regression approach
- Model
  y_i = δ D_i + ε_i
  where D = treatment indicator, δ = parameter of interest
- The model is not estimable via OLS since Cov(D, ε) ≠ 0
- However, E[ε | D, s] = E[ε | s]
- Implies δ is estimable if the model is augmented with a sufficiently flexible function of s to proxy for E[ε | s]:
  y_i = δ D_i + k(s_i) + η_i
  where Cov(D, η) = 0
- What is k(s)?
  - Linear: k(s) = βs (Goldberger 1972; Cain 1975)
  - Quadratic: k(s) = β_1 s + β_2 s² (Berk & Rauma 1983; van der Klaauw 2000)
  - Semiparametric: k(s) = Σ_{m=1}^{M} β_m s^m, with M chosen by cross-validation (Trochim 1984; van der Klaauw 2000)
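The regression approach is a single OLS fit once k(s) is chosen. A sketch using the same data-generating process as the lecture's example (s ~ U(0,1), D = I(s > 0.5), y = s + D + e, so δ = 1) and a linear k(s); the contrast with the naive regression of y on D alone shows the bias the control function removes:

```python
# Sharp-RD regression approach: OLS of y on D and k(s) = beta*s (linear here).
import numpy as np

rng = np.random.default_rng(9)
N = 5_000
s = rng.uniform(0, 1, N)
D = (s > 0.5).astype(float)
y = s + 1.0 * D + rng.normal(0, 0.2, N)    # true delta = 1

# Augmented model: y on constant, D, and s
X = np.column_stack([np.ones(N), D, s])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_hat = coef[1]                        # estimate of the jump at s-bar

# Naive OLS of y on D alone is biased upward because Cov(D, s) > 0
Xn = np.column_stack([np.ones(N), D])
delta_naive = np.linalg.lstsq(Xn, y, rcond=None)[0][1]
```

Higher-order k(s) terms are additional columns in `X`, with the order chosen by cross-validation as noted above.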
Example:
[Figure: scatter of outcome vs. score with two fitted lines — OLS of y on D only, and OLS of y on s & D. Note: s ~ U(0,1); D(s) = I(s > 0.5); y = s + D + e; δ = 1]
Notes
- Testing of some of the underlying assumptions is feasible
  - Examine the density of s to look for evidence of a discontinuity at s̄, suggesting manipulation by agents (McCrary 2008)
  - Look for the existence of discontinuities in predetermined variables at s̄ (similar to assessing balancing of predetermined variables in randomized experiments)
- If the treatment effect is heterogeneous, then RD estimates a unique parameter (discussed above) that may be uninteresting
  - This is an example of a local average treatment effect (LATE)
  - May be a policy-relevant parameter if the question is the impact of a marginal change in an eligibility cut-off, s̄
- Applications: financial aid, GED, Clean Air Act attainment status
- See -rd- in Stata
Selection on Observables
Distributional Approaches

- The analysis to this point has focused on mean effects of treatments
- Averages may mask a lot of heterogeneity
- Distributional methods seek to assess the effects of treatments on other quantities
- The traditional approach is quantile regression (QR)
- More recent approaches have been couched in the potential outcomes framework and focus on quantile treatment effects (QTE)
Selection on Observables
Distributional Approaches: Quantile Regression

Motivation
- QR provides a convenient linear framework for assessing the impact of changes in a vector of covariates on the quantiles of the dependent variable
- Equivalently, QR allows estimation of linear conditional quantile functions
- Analogous to linear regression, which estimates the conditional mean function
- Common applications
  - Studies of wage determination
  - Studies of student achievement

Notation
- F(y) = CDF of y
- Q_τ(y) = τth quantile of the random variable y, given by
  Q_τ(y) = inf{ y : F(y) ≥ τ }
(Unconditional) quantiles as a minimization problem
- Prior to discussing QR, it is useful to view unconditional quantiles as the solution to a minimization problem
- Example: median
  Q_0.5(y) = arg min_b Σ_i |y_i − b|
  - The solution depends on the sign of the residuals, not the magnitude
  - y = {99, 100, 101} ⇒ Q_0.5(y) = 100; y = {99, 100, 150} ⇒ Q_0.5(y) = 100, as moving b closer to 150 reduces that residual, but increases the sum of the other two residuals by twice as much
  - Implies the median is less sensitive to outliers than the mean
- General formula for any quantile τ ∈ (0, 1)
  Q_τ(y) = arg min_b { Σ_{i: y_i ≥ b} τ |y_i − b| + Σ_{i: y_i < b} (1 − τ) |y_i − b| }
  - Quantiles other than the median are defined as the arg min of a weighted sum of the absolute residuals
  - Intuition: say τ = 0.75 and b = the median; then the problem puts more weight on residuals above b, which pushes the solution to the minimization problem above the median
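The weighted-absolute-residual objective above (equivalently, the check function ρ_τ(e) = [τ − 1(e < 0)]e) can be minimized numerically to recover a sample quantile. A toy sketch using a crude grid search purely for illustration — real QR solvers use linear programming:

```python
# An unconditional quantile as the minimizer of the check-function objective.
import numpy as np

rng = np.random.default_rng(10)
y = rng.normal(size=5_000)

def check_objective(b, tau):
    """Sum of rho_tau(e) = [tau - 1(e < 0)] * e over residuals e = y - b."""
    e = y - b
    return np.sum((tau - (e < 0)) * e)

tau = 0.75
grid = np.linspace(-3, 3, 1201)            # crude grid search, for illustration
obj = [check_objective(b, tau) for b in grid]
q_hat = grid[int(np.argmin(obj))]          # matches the empirical 0.75 quantile
```

Note both branches of the objective are non-negative: e > 0 contributes τe and e < 0 contributes (τ − 1)e > 0, exactly the weighting in the display above.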
QR model (Koenker & Bassett 1978)
- Replace b in the previous problem with a linear function of covariates
  β̂_τ = arg min_β (1/N) { Σ_{i: y_i ≥ x_i β} τ |y_i − x_i β| + Σ_{i: y_i < x_i β} (1 − τ) |y_i − x_i β| }
  which may be rewritten as
  β̂_τ = arg min_β (1/N) Σ_i ρ_τ(ε_τi)
  where ρ_τ(ε_τi) is known as the check function, defined as
  ρ_τ(ε_τi) = [τ − I(ε_τi < 0)] ε_τi
  and ε_τi is the residual for i
- The preceding objective fn is equivalent (after some algebra) to
  β̂_τ = arg min_β (1/N) Σ_i [ τ − 1/2 + (1/2) sgn(y_i − x_i β) ] (y_i − x_i β)
- Error distribution
  - Key assumption: Q_τ(ε_τ | x) = 0
  - No other assumption about the distribution
Estimation
- The objective fn is not differentiable ⇒ standard optimization methods are not viable
- Solved using linear programming methods
- GMM estimation is also feasible (Buchinsky 1998)
- Special case: median regression
  - Corresponds to the QR model with τ = 0.5; β̂ obtained from
    β̂_0.5 = arg min_β (1/N) Σ_i |y_i − x_i β|
  - Analogous to OLS, but β̂ minimizes the sum of absolute errors instead of the sum of squared errors
  - Also known as the LAD (Least Absolute Deviations) estimator
  - A useful alternative to OLS, particularly when the distribution of the error term is symmetric (so the conditional mean and median are equal), yet outliers are a concern
  - Also useful when y is imputed for some obs
Inference
- Using a GMM framework, one can show
  √N (β̂_τ − β_τ) →d N(0, Ω_τ)
  where
  Ω_τ = λ²(τ) (x′x)⁻¹
  λ²(τ) = τ(1 − τ) / f²(F⁻¹(τ))
  and f(F⁻¹(τ)) denotes the density of the error distribution evaluated at the τth quantile
- Intuition:
  - Estimation of the τth conditional quantile uses only obs near the τth quantile
  - Asymptotically, obs are added in this range in a manner proportional to f(F⁻¹(τ)), assuming iid errors
- Utilizing the asymptotic formula for inference is difficult in practice
- Bootstrap methods provide a simpler alternative (Buchinsky 1998)
Results
- The parameters of interest are the partial derivatives of the conditional quantile fn w.r.t. x
  ∂Q_τ(y | x) / ∂x_k
  which equals β_τk if x enters linearly
- Presentation of results
  - Difficult, as there are a large number of results that can be obtained (i.e., β_τk, k = 1, ..., K and τ ∈ (0, 1))
  - Possibilities
    - A typical table of coefficient estimates at several quantiles (typically τ = 0.10, 0.25, 0.50, 0.75, and 0.90)
    - Graph the conditional quantile fns against x_k if there is one x that is the focus of the paper (again, typically for a few quantiles)
    - Graph β̂_τk vs. τ for several different x's on one graph (only works if x_k enters linearly)
Sequential estimation
- In practice, one typically wishes to estimate β_τ for multiple values of τ
- The estimates are not independent since they are obtained from the same data
- Estimating one equation at a time, however, is efficient unless there are cross-equation restrictions (e.g., one might wish for a type of smooth coefficient model)

Stata: -qreg-, -bsqreg-, -sqreg-, -grqreg- (for graphing), -qcount- (for count data models), -lqreg- (for logistic models)
Selection on Observables
Distributional Approaches: Quantile Treatment Effects

Notation
- $y_{1i}, y_{0i}$ = potential outcomes for $i$
- $D_{i}$ = binary indicator of treatment assignment
- $F_{j}(y) = \Pr[y_{ji} < y]$, $j = 0, 1$ = CDFs of potential outcomes
- $y_{j}^{\tau} = \inf\{y_{j} : F_{j}(y) \geq \tau\}$ = quantiles of potential outcome dbns

Parameters of interest
$$\delta_{QTE}^{\tau} = E[y_{1}^{\tau} - y_{0}^{\tau}], \quad \tau \in (0,1)$$
$$\delta_{QTT}^{\tau} = E[y_{1}^{\tau} - y_{0}^{\tau} \,|\, D = 1], \quad \tau \in (0,1)$$
$$\delta_{QTU}^{\tau} = E[y_{1}^{\tau} - y_{0}^{\tau} \,|\, D = 0], \quad \tau \in (0,1)$$
Interpretation

Constant treatment effect assumption
- $y_{1i} = y_{0i} + \delta \ \forall i$
- Implies $F_{1}^{-1}(\tau) = F_{0}^{-1}(\tau) + \delta$

[Figure: CDFs $F(y)$ of $y_{0}$ and $y_{1}$; NOTE: $y_{0} \sim N(0,1)$, $y_{1} = y_{0} + 1$, so $F_{1}$ is a parallel rightward shift of $F_{0}$]

- $\delta_{QTE}^{\tau} = \delta_{QTT}^{\tau} = \delta_{QTU}^{\tau} = \delta \ \forall \tau \in (0,1)$
Heterogeneous treatment effects
- $y_{1i} = y_{0i} + \delta_{i}$
- Perfect rank correlation (Heckman et al. 1997)
  - Definition: $F_{1}(y_{1i}) = F_{0}(y_{0i}) \ \forall i$
  - Intuition: each observation lies in the identical quantile in both potential outcome dbns, which implies that $y_{1}$ is a monotone transformation of $y_{0}$
  - Implication: $\delta_{QTE}^{\tau} = E[y_{1}^{\tau} - y_{0}^{\tau}] = Q_{\tau}(\delta)$, which is the $\tau$th quantile of the dbn of $\delta$, which implies that QTEs identify the distribution of the treatment effect, BUT this requires a strong assumption about the joint dbn of potential outcomes
- No perfect rank correlation
  - No assumption about the joint dbn of potential outcomes
  - Implication: $\delta_{QTE}^{\tau} = E[y_{1}^{\tau} - y_{0}^{\tau}] \neq Q_{\tau}(\delta)$, which implies that QTEs identify the difference in the two marginal dbns of the potential outcomes, BUT say nothing about the dbn of actual treatment effects ... QTEs reflect the effect of $D$ on quantiles of the potential outcome dbns, NOT on observations at particular quantiles.
Example #1...

ID   y0   y1   δ
1    1    2    1
2    2    4    2
3    3    6    3
4    4    8    4
5    5    10   5

Rank preservation holds; $\delta_{i}$ varies

CDFs of $y_{0}$, $y_{1}$ are not identical $\Rightarrow \delta_{QTE}^{\tau}$ varies with $\tau$; $\delta_{QTE}^{\tau} = Q_{\tau}(\delta)$
Example #2...

ID   y0   y1   δ
1    1    1    0
2    2    4    2
3    3    3    0
4    4    2    -2
5    5    5    0

Rank preservation is violated; $\delta_{i}$ varies

CDFs of $y_{0}$, $y_{1}$ are identical $\Rightarrow \delta_{QTE}^{\tau} = 0 \ \forall \tau$; $\delta_{QTE}^{\tau} \neq Q_{\tau}(\delta)$
Estimation

Identification assumptions: strong ignorability (CIA, CS)

$y_{i} = D_{i}y_{1i} + (1 - D_{i})y_{0i}$ = observed outcome

$\hat{\delta}^{\tau}$ obtained using sample analogues of $y_{1}^{\tau}$ and $y_{0}^{\tau}$

Obtain $\hat{F}_{j}(y)$, $j = 0, 1$:
$$\hat{F}_{j}(y) = \frac{\sum_{i} I(D_{i} = j)\, I(y_{i} \leq y)}{\sum_{i} I(D_{i} = j)} \quad \text{(unconditional)}$$
$$\hat{F}_{j}(y) = \frac{\sum_{i \in j} \hat{\omega}_{i}\, I(y_{i} \leq y)}{\sum_{i \in j} \hat{\omega}_{i}} \quad \text{(covariates)}$$
with weights
$$\hat{\omega}_{i} = \frac{D_{i}}{\hat{p}(x_{i})} + \frac{1 - D_{i}}{1 - \hat{p}(x_{i})} \quad \text{(QTE)}$$
$$\hat{\omega}_{i} = D_{i} + \frac{\hat{p}(x_{i})(1 - D_{i})}{1 - \hat{p}(x_{i})} \quad \text{(QTT)}$$
$$\hat{\omega}_{i} = \frac{[1 - \hat{p}(x_{i})]D_{i}}{\hat{p}(x_{i})} + 1 - D_{i} \quad \text{(QTU)}$$
where $p(x_{i})$ is the propensity score and $x$ is the vector such that CIA holds
$\hat{y}_{1}^{\tau} = \inf\{y : \hat{F}_{1}(y) \geq \tau\}$; similarly for $\hat{y}_{0}^{\tau}$

Implies $\hat{\delta}^{\tau} = \hat{y}_{1}^{\tau} - \hat{y}_{0}^{\tau}$ (for QTE, QTT, or QTU, depending on the weights used)

Inference based on bootstrap
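The weighted-CDF estimator above can be sketched in a few lines of numpy. This is a simulated-data illustration with a known constant treatment effect of 1 and, for simplicity, the true propensity score in place of an estimated one; every number here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))            # true propensity score Pr(D=1|x)
d = rng.uniform(size=n) < p
y0 = x + rng.normal(size=n)
y1 = y0 + 1.0                        # constant treatment effect of 1
y = np.where(d, y1, y0)              # observed outcome

# inverse propensity weights for the QTE
w1 = d / p                           # recovers the y1 marginal dbn
w0 = (1 - d) / (1 - p)               # recovers the y0 marginal dbn

def weighted_quantile(y, w, tau):
    # invert the weighted empirical CDF: smallest y with F_hat(y) >= tau
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / w.sum()
    return y[order][np.searchsorted(cdf, tau)]

tau = 0.5
qte = weighted_quantile(y, w1, tau) - weighted_quantile(y, w0, tau)
```

Swapping in the QTT or QTU weights from the previous slide changes only the definitions of `w1` and `w0`; bootstrap inference would wrap the whole computation in a resampling loop.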
Test of equal CDFs (Abadie 2002)
- Equivalent to a test of $H_{o}: \delta^{\tau} = 0 \ \forall \tau \in (0,1)$
- Utilize the Kolmogorov-Smirnov statistic
$$d_{eq} = \sqrt{N/2}\, \sup_{y} |F_{1}(y) - F_{0}(y)|$$
- Compute
$$\hat{d}_{eq} = \sqrt{N/2}\, \max_{k} \left| \hat{F}_{1}(y_{k}) - \hat{F}_{0}(y_{k}) \right|$$
for a grid of points, $k = 1, \dots, K$, in the support of $y_{i}$
- Inference for the test of equality using bootstrap

Stata: -dbn- (my code)
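A minimal sketch of this test on simulated data: compute the KS statistic over the pooled support, then bootstrap from the pooled sample to mimic the null of equal CDFs. The sample sizes and the shift of 0.3 are hypothetical, and the scale factor here uses the general two-sample form $\sqrt{nm/(n+m)}$, which reduces to $\sqrt{N/2}$ with equal group sizes.

```python
import numpy as np

rng = np.random.default_rng(2)
y1 = rng.normal(0.3, 1, 400)   # treated outcomes (shifted)
y0 = rng.normal(0.0, 1, 400)   # control outcomes

def ks_stat(a, b):
    # max gap between the two empirical CDFs over the pooled support
    grid = np.sort(np.concatenate([a, b]))
    fa = np.searchsorted(np.sort(a), grid, side="right") / a.size
    fb = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return np.sqrt(a.size * b.size / (a.size + b.size)) * np.abs(fa - fb).max()

d_obs = ks_stat(y1, y0)

# bootstrap under H0: resample from the pooled sample so both groups share a CDF
pooled = np.concatenate([y1, y0])
d_boot = []
for _ in range(500):
    s = rng.choice(pooled, pooled.size, replace=True)
    d_boot.append(ks_stat(s[:y1.size], s[y1.size:]))
p_value = np.mean(np.array(d_boot) >= d_obs)
```

A small `p_value` rejects equality of the two potential outcome CDFs; with covariates, the ECDFs would be replaced by the inverse-propensity-weighted versions from the estimation slide.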
Selection on Observables
Distributional Approaches: Stochastic Dominance

In the event the QTEs differ in sign or significance across the dbn, may be interested in ranking dbns

Definitions
- First Order Stochastic Dominance: $Y_{1}$ FSD $Y_{0}$ iff
$$F_{1}(y) \leq F_{0}(y) \ \forall y \in \mathcal{Y}$$
with strict inequality for some $y$ (where $\mathcal{Y}$ is the union of the supports for $Y_{1}$ and $Y_{0}$), or
$$y_{1}^{\tau} \geq y_{0}^{\tau} \ \forall \tau \in [0,1]$$
with strict inequality for some $\tau$
- Second Order Stochastic Dominance: $Y_{1}$ SSD $Y_{0}$ iff
$$\int_{-\infty}^{y} F_{1}(t)\,dt \leq \int_{-\infty}^{y} F_{0}(t)\,dt \ \forall y \in \mathcal{Y}, \quad \text{or} \quad \int_{0}^{\tau} y_{1}^{t}\,dt \geq \int_{0}^{\tau} y_{0}^{t}\,dt \ \forall \tau \in [0,1]$$
with strict inequality for some $y$ or $\tau$
Example: FSD... ($y_{1} \sim N(1,1)$; $y_{0} \sim N(0,1)$)

[Figure: left panel plots the control and treatment CDFs, $F(x)$, over the support; right panel plots the quantile treatment effect (treatment $-$ control) against the quantile, positive at every quantile]
Example: SSD... ($y_{1} \sim N(0.25, 0.25)$; $y_{0} \sim N(0,1)$)

[Figure: left panel plots the control and treatment CDFs, which cross; right panel plots the quantile treatment effect against the quantile, positive at low quantiles and negative at high quantiles]
FSD $\Rightarrow$ SSD

Third and higher order rankings exist

Any two dbns can be ranked at some order of SD

Implications
- Notation
  - $\mathcal{U}_{1}$ = class of social welfare fns that are increasing in $y$
  - $\mathcal{U}_{2}$ = sub-class of $\mathcal{U}_{1}$ that includes all social welfare fns that are also concave in $y$
- $Y_{1}$ FSD $Y_{0}$ $\Rightarrow$ $Y_{1}$ is at least as preferred by all welfare functions in $\mathcal{U}_{1}$, with strict inequality holding for some welfare function in the class
- $Y_{1}$ SSD $Y_{0}$ $\Rightarrow$ $Y_{1}$ is at least as preferred by all welfare functions in $\mathcal{U}_{2}$, with strict inequality holding for some welfare function in the class
Test statistics
$$d = \min \sup_{z} [F(z) - G(z)]$$
$$s = \min \sup_{z} \int_{-\infty}^{z} [F(t) - G(t)]\,dt$$
where the min is taken over the orderings $(F, G)$ and $(G, F)$

Tests are based on estimates of $d$ and $s$ using the empirical CDFs
- Unconditional, or
- Inverse propensity score weighted

Inference using bootstrap (simple and/or more complex methods)
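The first-order statistic $d$ is cheap to compute from empirical CDFs. A simulated sketch using the FSD example above ($y_1 \sim N(1,1)$ dominating $y_0 \sim N(0,1)$; sample sizes and grid are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
yt = rng.normal(1, 1, 2000)   # treatment: N(1, 1)
yc = rng.normal(0, 1, 2000)   # control:  N(0, 1)

# empirical CDFs evaluated on a common grid
grid = np.linspace(-4, 5, 500)
Ft = np.searchsorted(np.sort(yt), grid, side="right") / yt.size
Fc = np.searchsorted(np.sort(yc), grid, side="right") / yc.size

# d = min over the two orderings of sup_z [F(z) - G(z)]
d = min((Ft - Fc).max(), (Fc - Ft).max())
# under FSD of treatment over control, Ft <= Fc everywhere, so
# sup(Ft - Fc) is near zero (sampling noise only) and d is near zero,
# while sup(Fc - Ft) is large
```

Bootstrap inference would resample both groups and ask how often $d$ values this small arise when neither dbn dominates.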
Selection on Unobservables

When all $x$s required for CIA to hold are not observed, then one enters the selection on unobservables world

Implies unobservable attributes of obs $i$ are correlated with both potential outcomes and treatment assignment of obs $i$

In general, this implies
$$E[y_{j} \,|\, x, D = j] \neq E[y_{j} \,|\, x, D = j'], \quad j, j' = 0, 1$$

In a regression framework, with functional form assumptions, this implies
$$y_{i} = D_{i}y_{1i} + (1 - D_{i})y_{0i} = \alpha_{0} + x_{i}\beta_{0} + (\alpha_{1} - \alpha_{0})D_{i} + x_{i}D_{i}(\beta_{1} - \beta_{0}) + [\varepsilon_{0i} + D_{i}(\varepsilon_{1i} - \varepsilon_{0i})]$$
where SOU results if
- $Cov(D, \varepsilon_{0}) \neq 0$ $\Rightarrow$ selection on unobservables impacting outcome in untreated state, or
- $Cov(D, \varepsilon_{1} - \varepsilon_{0}) \neq 0$ $\Rightarrow$ presence of and selection on unobserved, obs-specific gains from treatment
Possible solutions
1. Bound treatment effects (set identification as opposed to point identification) under minimal assumptions
2. Utilize panel data
3. Utilize exclusion restrictions (i.e., instrumental variables)
4. Model dependence between treatment and unobservables $\Rightarrow$ control function approach
5. Other methods that find identification elsewhere
Selection on Unobservables
Bounding Treatment Effects

Recall, the ATE:
$$\begin{aligned}
\delta_{ATE}(x) &= E[y_{1} - y_{0} \,|\, x] = E[y_{1} \,|\, x] - E[y_{0} \,|\, x]\\
&= \{E[y_{1} \,|\, x, D = 1]\Pr(D = 1|x) + E[y_{1} \,|\, x, D = 0]\Pr(D = 0|x)\}\\
&\quad - \{E[y_{0} \,|\, x, D = 1]\Pr(D = 1|x) + E[y_{0} \,|\, x, D = 0]\Pr(D = 0|x)\}\\
&= \{g_{1}(x) - E[y_{0} \,|\, x, D = 1]\}p(x) + \{E[y_{1} \,|\, x, D = 0] - g_{0}(x)\}[1 - p(x)]
\end{aligned}$$
where $p(x)$, the propensity score, and $g_{j}(x) = E[y_{j} \,|\, x, D = j]$, $j = 0, 1$, are all observable from the data
Similar derivation for the other two primary mean treatment effect parameters:
$$\delta_{ATT}(x) = g_{1}(x) - E[y_{0} \,|\, x, D = 1]$$
$$\delta_{ATU}(x) = E[y_{1} \,|\, x, D = 0] - g_{0}(x)$$

Thus, without additional information, no parameter is identified

Early bounding approach outlined in Smith and Welch (1986)
- Objective was to estimate the average wage for blacks accounting for selection into the LF:
$$E[w] = E[w \,|\, LF = 1]\Pr(LF = 1) + E[w \,|\, LF = 0]\Pr(LF = 0)$$
where $E[w \,|\, LF = 0]$ is not observed
- Solution: $E[w \,|\, LF = 0] = \rho\, E[w \,|\, LF = 1]$, $\rho \in [0.5, 1]$
- In treatment effects context, can specify $E[y_{d} \,|\, D = d'] = \rho\, E[y_{d} \,|\, D = d]$ for different values of $\rho$, where $d \neq d'$
- Rosenbaum (2002) summarizes other papers that bound causal effects by varying the unobserved parameters
More recent approaches focus on adding assumptions to tighten the bounds on the parameter of interest

Notation (Lechner 1999; Manski 1990)
- $L_{1}, L_{0}$ = lower bounds of the support of $y_{1}, y_{0}$, respectively
- $U_{1}, U_{0}$ = upper bounds of the support of $y_{1}, y_{0}$, respectively
- $B_{k}^{L}, B_{k}^{U}$ = lower, upper bounds, respectively, of treatment effect $k$ ($k$ = ATE, ATT, or ATU)
- $w_{k} = B_{k}^{U} - B_{k}^{L}$ = width of bounds for treatment effect $k$
Trivial case
- No additional information:
$$B_{k}^{L} = L_{1} - U_{0}, \qquad B_{k}^{U} = U_{1} - L_{0}$$
$$w_{k} = (U_{1} - L_{0}) - (L_{1} - U_{0}) = (U_{1} - L_{1}) + (U_{0} - L_{0})$$
- Example: $y$ is binary (e.g., employment after job training program)
$$L_{1} = L_{0} = 0, \quad U_{1} = U_{0} = 1 \;\Rightarrow\; B_{k}^{L} = -1, \quad B_{k}^{U} = 1, \quad w_{k} = 2$$
Tightening bounds with data

Use sample data
- $p(x)$, $g_{0}(x)$, $g_{1}(x)$ may be consistently estimated from the data by
  - Sample means
  - Nonparametric smoothing methods
  - Parametric methods
New bounds with sample data
- $\delta_{ATE}(x)$:
$$B_{ATE}^{L} = [\hat{g}_{1}(x) - U_{0}]\hat{p}(x) + [L_{1} - \hat{g}_{0}(x)][1 - \hat{p}(x)]$$
$$B_{ATE}^{U} = [\hat{g}_{1}(x) - L_{0}]\hat{p}(x) + [U_{1} - \hat{g}_{0}(x)][1 - \hat{p}(x)]$$
$$w_{ATE} = (U_{1} - L_{1})[1 - \hat{p}(x)] + (U_{0} - L_{0})\hat{p}(x)$$
- $\delta_{ATT}(x)$:
$$B_{ATT}^{L} = \hat{g}_{1}(x) - U_{0}, \qquad B_{ATT}^{U} = \hat{g}_{1}(x) - L_{0}, \qquad w_{ATT} = U_{0} - L_{0}$$
- $\delta_{ATU}(x)$:
$$B_{ATU}^{L} = L_{1} - \hat{g}_{0}(x), \qquad B_{ATU}^{U} = U_{1} - \hat{g}_{0}(x), \qquad w_{ATU} = U_{1} - L_{1}$$
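The ATE formulas above reduce to arithmetic once the observable pieces are estimated. A simulated sketch for a binary outcome, using sample means for $g_1$, $g_0$, and $p$ (the DGP and probabilities here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
d = rng.uniform(size=n) < 0.4          # treatment indicator
# observed binary outcome: Pr(y=1) = 0.6 if treated, 0.5 if not
y = np.where(d, rng.uniform(size=n) < 0.6, rng.uniform(size=n) < 0.5).astype(float)

p = d.mean()                            # Pr(D = 1)
g1 = y[d].mean()                        # E[y | D = 1]
g0 = y[~d].mean()                       # E[y | D = 0]
L, U = 0.0, 1.0                         # support endpoints of a binary outcome

# no-assumption bounds on the ATE: the unobserved counterfactual means are
# replaced by the worst-case support endpoints
B_L = (g1 - U) * p + (L - g0) * (1 - p)
B_U = (g1 - L) * p + (U - g0) * (1 - p)
width = B_U - B_L                       # equals U - L = 1 for binary y
```

The width no longer depends on the unknown means at all, which is exactly the "sample data cuts the width in half" point on the next slide.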
Example: $y$ is binary $\Rightarrow w_{k} = 1 \ \forall k$ (sample data cuts the width in half)

Note: Bounds necessarily include zero
- Cannot rule out a zero average treatment effect
- Can exclude some extreme values
- Full characterization of the bounds should also account for uncertainty in the variables belonging in $x$ and the model used to estimate $g_{0}(x)$, $g_{1}(x)$, and $p(x)$ (Heckman et al. 1999)
  - While bounds conditional on $x$ and a model, $m$, all have width one, the exact bounds are affected
- Kreider, Pepper, and co-authors incorporate measurement error in $D$ into the bounds (discussed later)
Tightening bounds with assumptions

Assume $\delta_{ATT}(x) = \delta_{ATU}(x)$
- Calculate bounds for $\delta_{ATT}(x)$ and $\delta_{ATU}(x)$
- New bounds include only the intersection of the two bounds
- Example: if
$$\delta_{ATT}(x) \in [-0.25, 0.75], \qquad \delta_{ATU}(x) \in [-0.75, 0.25]$$
then the new bounds are $[-0.25, 0.25]$
- Note: still necessarily include zero since the bounds on $\delta_{ATT}(x)$, $\delta_{ATU}(x)$ both include zero
Level-set restrictions: treatment effects are constant $\forall x \in A_{0} \subseteq A$ (the support of $x$)
- Calculate bounds for $\delta_{k}(x) \ \forall x \in A_{0}$
- New bounds include only the intersection of these bounds
- Example ($\delta_{ATE}$): if
$$\delta_{ATE}(x_{a}) \in [-0.25, 0.75], \qquad \delta_{ATE}(x_{b}) \in [-0.75, 0.25]$$
where $x_{a}, x_{b} \in A_{0}$, then the new bounds are $[-0.25, 0.25]$
- Note: still necessarily include zero since the bounds on $\delta_{k}(x)$ include zero $\forall x$
- Formally:
$$B_{k}^{L}(A_{0}) = \sup_{x \in A_{0}} B_{k}^{L}(x), \qquad B_{k}^{U}(A_{0}) = \inf_{x \in A_{0}} B_{k}^{U}(x)$$
$$w_{k}(A_{0}) = B_{k}^{U}(A_{0}) - B_{k}^{L}(A_{0})$$
Level-set restrictions: expected outcomes are constant $\forall x \in A_{0,1} \subseteq A$ (for $y_{1}$) and $\forall x \in A_{0,0} \subseteq A$ (for $y_{0}$)
- Implies
$E[y_{1}|x]$ is constant $\forall x \in A_{0,1}$; $E[y_{0}|x]$ is constant $\forall x \in A_{0,0}$
$\Rightarrow$ Bounds become
$$B_{ATE}^{L}(x_{0}) = \sup_{x \in A_{0,1}} \left\{\hat{g}_{1}(x)\hat{p}(x) + L_{1}[1 - \hat{p}(x)]\right\} - \inf_{x \in A_{0,0}} \left\{\hat{g}_{0}(x)[1 - \hat{p}(x)] + U_{0}\hat{p}(x)\right\}$$
$$B_{ATE}^{U}(x_{0}) = \inf_{x \in A_{0,1}} \left\{\hat{g}_{1}(x)\hat{p}(x) + U_{1}[1 - \hat{p}(x)]\right\} - \sup_{x \in A_{0,0}} \left\{\hat{g}_{0}(x)[1 - \hat{p}(x)] + L_{0}\hat{p}(x)\right\}$$
$$B_{ATT}^{L}(x_{0}) = \sup_{x \in A_{0,1}} \hat{g}_{1}(x) - U_{0}, \qquad B_{ATT}^{U}(x_{0}) = \inf_{x \in A_{0,1}} \hat{g}_{1}(x) - L_{0}$$
$$B_{ATU}^{L}(x_{0}) = L_{1} - \inf_{x \in A_{0,0}} \hat{g}_{0}(x), \qquad B_{ATU}^{U}(x_{0}) = U_{1} - \sup_{x \in A_{0,0}} \hat{g}_{0}(x)$$
where $x_{0} \in A_{0,1} \cap A_{0,0}$
Assumption: positive selection
- Implies
$$E[y_{1} \,|\, x, D = 1] \geq E[y_{0} \,|\, x, D = 1]$$
which means that the treated only join the treatment group if there are non-negative gains on average
- Bounds become
$$B_{ATE}^{L} = [L_{1} - \hat{g}_{0}(x)][1 - \hat{p}(x)]$$
$$B_{ATE}^{U} = [\hat{g}_{1}(x) - L_{0}]\hat{p}(x) + [U_{1} - \hat{g}_{0}(x)][1 - \hat{p}(x)]$$
$$B_{ATT}^{L} = 0, \qquad B_{ATT}^{U} = \hat{g}_{1}(x) - L_{0}$$
- Does not affect the bounds on $\delta_{ATU}(x)$
Combining assumptions/restrictions:
$$B_{k,combine}^{L} = \max_{p \in \mathcal{P}} \left\{B_{k,p}^{L}\right\}, \qquad B_{k,combine}^{U} = \min_{p \in \mathcal{P}} \left\{B_{k,p}^{U}\right\}$$
where $\mathcal{P}$ is the set of restrictions being combined

Inference via bootstrap
- Yields confidence intervals for the bounds, not the treatment effect
- For example, a 90% CI implies that the probability that the true bounds lie in the CI is 90%; the probability that the true treatment effect lies in the CI is even higher (see also Imbens & Manski (2004))
Tightening bounds (again)

Manski (1990), Manski & Pepper (2000) consider additional assumptions:
1. Instrument: $E[y_{j} \,|\, z] = E[y_{j}]$, $j = 0, 1$
2. Monotone Instrument: $z_{1} \leq z \leq z_{2} \Rightarrow E[y_{j} \,|\, Z = z_{1}] \leq E[y_{j} \,|\, Z = z] \leq E[y_{j} \,|\, Z = z_{2}]$, $j = 0, 1$
3. Monotone Treatment Selection: $E[y_{j} \,|\, D = 1] \geq E[y_{j} \,|\, D = 0]$, $j = 0, 1$
4. Monotone Treatment Response: $y_{0} \leq y_{1} \Rightarrow E[y_{0}] \leq E[y_{1}]$

where $x$ is omitted for notational convenience
Use of an instrument
- $E[y_{j} \,|\, z] = E[y_{j}]$, $j = 0, 1$, implies
$$E[y_{j}] \in \Big[ \sup_{z} \big\{E[y \,|\, D = j, Z = z]\Pr(D = j \,|\, Z = z) + L_{j}\Pr(D \neq j \,|\, Z = z)\big\},$$
$$\inf_{z} \big\{E[y \,|\, D = j, Z = z]\Pr(D = j \,|\, Z = z) + U_{j}\Pr(D \neq j \,|\, Z = z)\big\} \Big]$$
- Bounds for $\delta_{ATE}$ become
$$B_{ATE}^{L} = \sup_{z} \left\{\hat{g}_{1}(z)\hat{p}(z) + L_{1}[1 - \hat{p}(z)]\right\} - \inf_{z} \left\{\hat{g}_{0}(z)[1 - \hat{p}(z)] + U_{0}\hat{p}(z)\right\}$$
$$B_{ATE}^{U} = \inf_{z} \left\{\hat{g}_{1}(z)\hat{p}(z) + U_{1}[1 - \hat{p}(z)]\right\} - \sup_{z} \left\{\hat{g}_{0}(z)[1 - \hat{p}(z)] + L_{0}\hat{p}(z)\right\}$$
- Bounds are tighter than the worst case bounds if $p(z) \neq \Pr(D = 1)$; i.e., $z$ is correlated with treatment assignment
Use of a monotone instrument (MIV)
- $z_{1} \leq z \leq z_{2} \Rightarrow E[y_{j} \,|\, Z = z_{1}] \leq E[y_{j} \,|\, Z = z] \leq E[y_{j} \,|\, Z = z_{2}]$, $j = 0, 1$
  - Weaker assumption than the prior, mean independence assumption
  - Implies that potential outcomes are non-decreasing in $z$
- Implies
$$E[y_{j}] \in \Big[ \sum_{z \in \mathcal{Z}} \Pr(Z = z) \sup_{z_{1} \leq z} \big\{E[y \,|\, D = j, Z = z_{1}]\Pr(D = j \,|\, Z = z_{1}) + L_{j}\Pr(D \neq j \,|\, Z = z_{1})\big\},$$
$$\sum_{z \in \mathcal{Z}} \Pr(Z = z) \inf_{z_{2} \geq z} \big\{E[y \,|\, D = j, Z = z_{2}]\Pr(D = j \,|\, Z = z_{2}) + U_{j}\Pr(D \neq j \,|\, Z = z_{2})\big\} \Big]$$
- Bounds derived based on this
Monotone treatment selection (MTS)
- $E[y_{j} \,|\, D = 1] \geq E[y_{j} \,|\, D = 0]$, $j = 0, 1$, implies that the treated group has weakly higher potential outcomes in all treatment states
- Plausible in certain cases when one does not condition on $x$ and $x$ is correlated with both $D$ and $y_{j}$ in the same direction
- Implies
$$E[y_{j}] \in \big[E[y \,|\, D = j]\Pr(D \geq j) + L_{j}\Pr(D < j),\; E[y \,|\, D = j]\Pr(D \leq j) + U_{j}\Pr(D > j)\big]$$

Monotone treatment response (MTR)
- $y_{0} \leq y_{1} \Rightarrow E[y_{0}] \leq E[y_{1}]$ implies we know the sign of the treatment effect (inclusive of zero)
- Implies $\delta_{ATE} \geq 0$
- Stronger than the positive selection assumption previously as that only applied to the sub-sample with $D = 1$

MIV can be combined with MTS, MTR

Methodology can also be combined with assumptions concerning measurement error (discussed later)

Stata: -bpbounds- (related)
Selection on Unobservables
Altonji et al. Approach

Altonji et al. (2005) offer two approaches to assess the sensitivity of estimates obtained under the SOO assumption when this assumption is false

Approach #1 is applicable to the case of a binary outcome

Approach #2 is applicable regardless of the type of outcome

Krauth (2011) attempts to extend the approach
Approach #1: Bivariate probit model

Model:
$$y_{i}^{*} = x_{i}\beta + \alpha D_{i} + \varepsilon_{i}$$
$$D_{i}^{*} = x_{i}\gamma + u_{i}$$
where $(\varepsilon, u) \sim N(0, 0, 1, 1, \rho)$ and
$$y = \begin{cases} 1 & \text{if } y^{*} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad D = \begin{cases} 1 & \text{if } D^{*} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Estimation by ML:
$$\begin{aligned}
\ln \mathcal{L} &= \sum_{i: y=1, D=1} \ln[\Phi_{2}(x_{i}\beta + \alpha,\, x_{i}\gamma,\, \rho)] + \sum_{i: y=1, D=0} \ln[\Phi_{2}(x_{i}\beta,\, -x_{i}\gamma,\, -\rho)]\\
&\quad + \sum_{i: y=0, D=1} \ln[\Phi_{2}(-(x_{i}\beta + \alpha),\, x_{i}\gamma,\, -\rho)] + \sum_{i: y=0, D=0} \ln[\Phi_{2}(-x_{i}\beta,\, -x_{i}\gamma,\, \rho)]
\end{aligned}$$

Model is technically identified with no exclusion restriction, but treat as unidentified

Assessing the treatment effect as $\rho$ varies provides evidence of sensitivity to selection on unobservables

Constraining $\rho > 0$ $\Rightarrow$ positive selection; $\rho < 0$ $\Rightarrow$ negative selection
Approach #2: SOU relative to SOO

Intuition is to assess how much SOU, relative to the amount of SOO, is needed to fully explain the observed positive association between $D$ and $y$

If
(AET.i) Random observables: $x$ is a random subset of all factors, $w$, influencing $y$
(AET.ii) Equally important factors: the number of elements in $w$ is large and no single factor has an undue influence on $y$
(AET.iii) Relationship between $x$ and unobservables: slightly weaker technical assumption than independence between $x$ and the remaining elements of $w$
then one should expect the amount of selection controlled for by $x$ to equal the amount of selection on unobservables

Implies that if the amount of SOU needed to explain the observed association is less than the amount of SOO, the estimated treatment effect should not be viewed as robust
Model for outcome:
$$y_{i} = x_{i}\beta + \alpha D_{i} + \varepsilon_{i}$$

The (normalized) amount of SOU is given by
$$\frac{E[\varepsilon \,|\, D = 1] - E[\varepsilon \,|\, D = 0]}{Var(\varepsilon)}$$

The (normalized) amount of SOO, ignoring the impact of $D$, is given by
$$\frac{E[x\beta \,|\, D = 1] - E[x\beta \,|\, D = 0]}{Var(x\beta)}$$

The goal is to assess how large SOU must be relative to SOO to fully account for the positive treatment effect estimated under exogeneity
Express actual treatment participation as
$$D_{i} = x_{i}\gamma + \upsilon_{i}$$

The plim of the OLS estimator of $\alpha$ is
$$\text{plim}\; \hat{\alpha} = \alpha + \frac{Cov(\upsilon, \varepsilon)}{Var(\upsilon)} = \alpha + \frac{Var(D)}{Var(\upsilon)}\left\{E[\varepsilon \,|\, D = 1] - E[\varepsilon \,|\, D = 0]\right\}$$

Under the assumption that SOO = SOU, the asymptotic bias term is
$$\frac{Cov(\upsilon, \varepsilon)}{Var(\upsilon)} = \frac{Var(D)}{Var(\upsilon)}\left\{\frac{E[x\beta \,|\, D = 1] - E[x\beta \,|\, D = 0]}{Var(x\beta)}\right\}Var(\varepsilon)$$
This bias can be consistently estimated under $H_{o}: \alpha = 0$

The ratio $\hat{\alpha}/\widehat{bias}$ indicates how much larger SOU needs to be relative to SOO to entirely explain the treatment effect

A small ratio $\Rightarrow$ the treatment effect is highly sensitive to selection on unobservables; a ratio $\gg 1$ implies the treatment effect is robust

Algorithm:
1. Estimate $Var(D)$ from the sample
2. Estimate the treatment eqtn via LPM $\Rightarrow \widehat{Var}(\upsilon)$
3. Estimate the outcome eqtn via OLS restricting $\alpha = 0$ $\Rightarrow x\hat{\beta}$, $\widehat{Var}(x\hat{\beta})$, $\widehat{Var}(\varepsilon)$
4. Obtain sample means of $x\hat{\beta}$ in the treatment and control groups $\Rightarrow \widehat{E}[x\hat{\beta} \,|\, D = 1]$, $\widehat{E}[x\hat{\beta} \,|\, D = 0]$
5. Estimate the outcome eqtn via OLS $\Rightarrow \hat{\alpha}$
6. Compute the ratio $\hat{\alpha}/\widehat{bias}$
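The six steps above can be sketched on simulated data. All coefficients in this DGP (a true $\alpha$ of 0.5, positive selection on $x$) are hypothetical, and the point is only the mechanics of the ratio, not its magnitude.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
x = rng.normal(size=n)
d = (0.5 * x + rng.normal(size=n) > 0).astype(float)   # treatment eqtn
y = 1.0 * x + 0.5 * d + rng.normal(size=n)             # outcome; true alpha = 0.5

# steps 1-2: Var(D) and Var(v) from an LPM of D on x
X = np.column_stack([np.ones(n), x])
g = np.linalg.lstsq(X, d, rcond=None)[0]
v = d - X @ g
var_d, var_v = d.var(), v.var()

# step 3: outcome eqtn with alpha restricted to 0
b0 = np.linalg.lstsq(X, y, rcond=None)[0]
xb = X @ b0
eps = y - xb

# step 4: normalized SOO from means of x*beta_hat by treatment status
soo = (xb[d == 1].mean() - xb[d == 0].mean()) / xb.var()

# implied asymptotic bias under SOO = SOU
bias = (var_d / var_v) * soo * eps.var()

# steps 5-6: unrestricted OLS for alpha_hat, then the ratio
alpha_hat = np.linalg.lstsq(np.column_stack([X, d]), y, rcond=None)[0][-1]
ratio = alpha_hat / bias
```

In applied work each quantity would come from the estimated regressions on real data; here `ratio` simply illustrates step 6.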
Notes:
- If $y$ is binary, estimate the outcome eqtn via probit in step 3 $\Rightarrow Var(\varepsilon) = 1$
- AET methods have relatively little to say about the economic significance of the treatment effect unless one makes assumptions about the amount of SOU
Selection on Unobservables
Panel Data

Refer to ECO 6375 for a panel data refresher...

Panel data is useful for addressing selection on unobservables that are invariant along a certain dimension

Thus, panel data methods provide a solution to selection on unobservables in only certain situations

Notation
- Population regression fn given by $E[y \,|\, x_{1}, \dots, x_{K}, c]$
- $x_{k}$, $k = 1, \dots, K$, are observable (to the econometrician)
- $c$ is an unobservable (to the econometrician) variable

Assuming linearity:
$$E[y \,|\, x_{1}, \dots, x_{K}, c] = \beta_{0} + x\beta + c$$
Error form of the model:
$$y = \beta_{0} + x\beta + c + \varepsilon$$
where $c$ is the unobserved effect and $\varepsilon$ is the idiosyncratic error

Time-series or cross-section models are forced to include $c$ in the error term (referred to as the composite error):
$$y_{i} = \beta_{0} + x_{i}\beta + \nu_{i}, \qquad \nu_{i} = c_{i} + \varepsilon_{i}$$
$$y_{t} = \beta_{0} + x_{t}\beta + \nu_{t}, \qquad \nu_{t} = c_{t} + \varepsilon_{t}$$
Model:
$$y_{it} = \beta_{0} + x_{it}\beta + c_{i} + \varepsilon_{it}$$
- Unobserved effect is assumed to be time invariant (assuming a traditional panel where $t$ represents time)
- $x$ may include time dummies or a time trend, etc.

Problem: given the presence of $c_{i}$, how can we recover consistent estimates of $\beta_{0}$, $\beta$?

Estimation techniques
- Assuming $Cov(x, c) = 0$
  - Pooled OLS (POLS)
  - Random effects (RE)
- Assuming $Cov(x, c) \neq 0$
  - Least squares dummy variable model (LSDV)
  - Fixed effects (FE)
  - First-differencing (FD)
Selection on Unobservables
Panel Data: Treatment Effects Models

Structural model:
$$y_{it} = c_{i} + \lambda_{t} + x_{it}\beta + \delta D_{it} + \varepsilon_{it}, \quad i = 1, \dots, N;\; t = 1, \dots, T$$
where $\lambda_{t}$ are time dummies

Special case
- Setup
  - $T = 2$
  - $D_{i1} = 0 \ \forall i$
  - $D_{i2} \in \{0, 1\} \ \forall i$
  - Assume no $x$s
- FE or FD estimation $\Rightarrow$
$$\hat{\delta} = E[\Delta y \,|\, D_{2} = 1] - E[\Delta y \,|\, D_{2} = 0]$$
- Known as the difference-in-differences estimator
Visual representation of the special case
$$y_{it} = c_{i} + \lambda_{t} + x_{it}\beta + \delta D_{it} + \varepsilon_{it}$$
- Expected outcomes by period and treatment status:

           t = 1               t = 2
  D = 0    c_0 + λ_1           c_0 + λ_2
  D = 1    c_1 + λ_1           c_1 + λ_2 + δ

- Implies
$$E[\Delta y \,|\, D_{2} = 1] = (c_{1} + \lambda_{2} + \delta) - (c_{1} + \lambda_{1}) = \delta + \lambda_{2} - \lambda_{1}$$
$$E[\Delta y \,|\, D_{2} = 0] = (c_{0} + \lambda_{2}) - (c_{0} + \lambda_{1}) = \lambda_{2} - \lambda_{1}$$
which implies
$$\delta = E[\Delta y \,|\, D_{2} = 1] - E[\Delta y \,|\, D_{2} = 0]$$
[Figure: $Y_{0}$ and $Y_{1}$ plotted against period ($-1$, 0, 1), with the before-after estimator, cross-section estimator, and DID indicated. Note: Illustration of Three Common Estimators.]
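The two-period DID algebra above reduces to a difference of mean changes. A simulated sketch in which the unit effect is deliberately correlated with treatment (so the cross-section comparison is biased but DID is not); the effect size of 1.5 and all other numbers are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4_000
d2 = rng.uniform(size=n) < 0.5            # treated in period 2?
c = rng.normal(size=n) + 0.8 * d2         # unit effect, correlated with treatment
lam1, lam2, delta = 0.0, 0.4, 1.5         # time effects and true treatment effect

y_t1 = c + lam1 + rng.normal(size=n)                 # period 1 (no one treated)
y_t2 = c + lam2 + delta * d2 + rng.normal(size=n)    # period 2

# DID: difference of mean first-differences between groups;
# c_i drops out of the within-unit change, lam_2 - lam_1 drops out of the
# between-group difference
dy = y_t2 - y_t1
did = dy[d2].mean() - dy[~d2].mean()
```

The naive period-2 cross-section comparison, `y_t2[d2].mean() - y_t2[~d2].mean()`, would absorb the 0.8 gap in unit effects and overstate the effect.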
Beyond the special case
- The special case is useful to gain the intuition, not required
- In general, as long as $D_{it}$ is time-varying for some units $i$, then $\delta$ can be estimated by any panel data method given the required assumptions are met
- If selection into treatment is only on observables (not $c_{i}$), then POLS or RE may be consistent and efficient
- If selection into treatment is also on time invariant unobservables ($c_{i}$), then POLS and RE are inconsistent, but FE or FD are consistent if other assumptions are met
- Important to remember: FE/FD is not a magic bullet (Duflo et al. 2004)
  - FE and FD require strict exogeneity; rules out Ashenfelter's Dip $\Rightarrow Cov(D_{it}, \varepsilon_{it-1}) \neq 0$
  - Rules out selection on contemporaneous shocks $\Rightarrow Cov(D_{it}, \varepsilon_{it}) \neq 0$
  - Key: requires treated and untreated to follow the same time trend in the absence of treatment
  - Diff-in-diff-in-diff may be an option
- With heterogeneous treatment effects, FE identifies the ATT
Timing issues (Laporte & Windmeijer 2005)

The previous model restricts $D$ to a one-time intercept shift, $\delta$

In certain applications, agents may anticipate treatment and alter behavior prior to actual treatment; or, the response may occur with a lag; or, some combination of both

Examples: policy changes announced, but not implemented until a future date; or, lags in adjustment to policy changes

General structural model:
$$y_{it} = c_{i} + \lambda_{t} + x_{it}\beta + \sum_{l=1}^{L_{0}} \theta_{l} D_{it}^{+l} + \delta_{0} D_{it} + \sum_{l=1}^{L_{1}} \delta_{l} D_{it}^{-l} + \varepsilon_{it}$$
where
- $D_{it}^{+l} = D_{i,t+l}$ (treatment assignment $l$ periods in the future)
- $D_{it}^{-l} = D_{i,t-l}$ (treatment assignment $l$ periods in the past)
- $\theta_{l}$ reflects anticipatory effects of treatment
- $\delta_{l}$ reflects lagged effects of treatment
- $\delta_{0}$ reflects instantaneous effects of treatment
Specification test

If anticipatory and/or lagged effects occur, but the simple model of a one-time effect is estimated, then FE and FD will yield (statistically) different estimates:
$$E[\hat{\delta}_{FD}] = \delta_{0} - \theta_{1}$$
$$E[\hat{\delta}_{FE}] = \sum_{t} \omega_{t}(\delta_{0+} - \bar{\theta})$$
where
- $\delta_{0+}$ = average of $\delta_{0}, \delta_{1}, \dots, \delta_{L_{1}}$
- $\bar{\theta}$ = average of $\theta_{1}, \dots, \theta_{L_{0}}$
and $\omega_{t}$ are weights
$H_{o}: \delta_{FD} = \delta_{FE} \iff H_{o}: \pi = 0$
$$\begin{pmatrix} y_{it} - y_{it-1} \\ y_{it} - \bar{y}_{i} \end{pmatrix} = \begin{pmatrix} x_{it} - x_{it-1} \\ x_{it} - \bar{x}_{i} \end{pmatrix}\beta + \begin{pmatrix} D_{it} - D_{it-1} \\ D_{it} - \bar{D}_{i} \end{pmatrix}\delta + \begin{pmatrix} 0 \\ D_{it} - \bar{D}_{i} \end{pmatrix}\pi + \begin{pmatrix} \Delta\varepsilon_{it} \\ \varepsilon_{it} - \bar{\varepsilon}_{i} \end{pmatrix}$$

Estimate via OLS, look at the confidence interval on $\hat{\pi}$

Lee and Huang (2011) extend the existing literature on dynamic treatment effects to allow for anticipatory behavior
Autoregressive Model

Fixed effects models require $D_{it}$ to be time-varying for some $i$

If $D$ is time invariant $\forall i$, it is still possible to identify the effect of the program under the common treatment effect assumption

Structural model:
$$y_{it} = \lambda_{t} + x_{it}\beta + \delta D_{i} + \nu_{it}$$
$$\nu_{it} = \rho\nu_{it-1} + \varepsilon_{it}$$
where $\varepsilon_{it}$ is iid with mean zero and $\delta$ is the homogeneous treatment effect

Quasi-FD yields
$$y_{it} = \tilde{\lambda}_{t} + (x_{it} - \rho x_{it-1})\beta + (1 - \rho)\delta D_{i} + \rho y_{it-1} + \varepsilon_{it}$$

OLS is consistent if (i) $x$ are strictly exogenous and (ii) $D$ is uncorrelated with $\varepsilon$ (e.g., post-treatment shocks are not forecastable and therefore do not affect past treatment decisions)
Comparative Case Study Approach

Provides an alternative to DD when
- Treatment occurs at an aggregate level
- Typically only a single observation is treated and a lengthy history of pre-treatment data is available for the treated and the pool of controls

Examples:
- Mariel Cuban Boat Lift (Card 1990)
- State minimum wage (Card & Krueger 1994)

Solution
- Construct a synthetic control, which is a weighted average of the available controls, to estimate the missing counterfactual in the post-treatment period(s)
- Weights are chosen by matching pre-treatment covariates and outcomes
- Allows for differential time trends in treatment and control observations
  - By matching pre-treatment outcomes, one is implicitly matching on the time-invariant unobserved effect
  - Thus, it does not matter if the unobserved effect has differential effects over time if the time-specific effect is a common factor
Model
- $y_{it}$ is the observed outcome for obs $i$, $i = 1, \dots, J + 1$, in period $t = 1, \dots, T_{o}, \dots, T$
- Obs 1 is treated; the remaining obs $2, \dots, J + 1$ are never treated
- Timing of treatment effects
  1. No Anticipatory Effects: $T_{o}$ is the period prior to obs 1 being treated
  2. Anticipatory Effects: $T_{o}$ is the period prior to any anticipatory effects for obs 1 beginning
- Outcomes in the absence of treatment:
$$y_{it} = y_{it}^{N} = \delta_{t} + \theta_{t}Z_{i} + \lambda_{t}u_{i} + \varepsilon_{it}$$
- Outcomes with treatment:
$$y_{it} = y_{it}^{I} = y_{it}^{N} + \alpha_{it}$$
The synthetic control is defined as
$$\sum_{j=2}^{J+1} w_{j}y_{jt} = \sum_{j=2}^{J+1} w_{j}(\delta_{t} + \theta_{t}Z_{j} + \lambda_{t}u_{j} + \varepsilon_{jt})$$
where $w_{j}$ is the weight given to control $j$ and
- $\sum_{j=2}^{J+1} w_{j} = 1$
- $w_{j} \geq 0 \ \forall j$

Conditional on the choice of weights, $w_{j}^{*}$, the period-specific treatment effect is estimated as
$$\hat{\alpha}_{1t} = y_{1t} - \sum_{j=2}^{J+1} w_{j}^{*}y_{jt}$$

Requires a SUTVA-type assumption that the treatment does not impact outcomes in the control pool
Weights are chosen to match moments of the data in periods $t \leq T_{o}$
- Define
$$\bar{y}_{i}^{K} = \sum_{s=1}^{T_{o}} k_{s}y_{is}$$
where $K = (k_{1}, \dots, k_{T_{o}})$ is a vector of weights and thus $\bar{y}_{i}^{K}$ represents a particular linear combination of pre-treatment outcomes for obs $i$
- Given $M$ unique linear combinations, define the vector of pre-treatment variables for obs 1 as
$$X_{1} = (Z_{1}', \bar{y}_{1}^{K_{1}}, \dots, \bar{y}_{1}^{K_{M}})$$
with dimension $R \times 1$
- Define the $R \times J$ matrix of variables for the remaining obs $i$, $i = 2, \dots, J + 1$, as $X_{0}$, where column $j$ is given by $(Z_{j+1}', \bar{y}_{j+1}^{K_{1}}, \dots, \bar{y}_{j+1}^{K_{M}})$
- Weights are chosen to minimize some distance function
$$||X_{1} - X_{0}W||_{V} = \sqrt{(X_{1} - X_{0}W)'V(X_{1} - X_{0}W)}$$
where $V$ is an $R \times R$ symmetric, positive semidefinite matrix
- In practice, $V$ is chosen to minimize the MSE of the pre-intervention predictions
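A stripped-down sketch of the weight choice with simulated data: one treated unit, three controls, a single common factor, and $V = I$. The unit effects and a coarse grid search over the simplex stand in for the nested $V$-weighted optimization that -synth- actually performs; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
T0 = 20                                  # pre-treatment periods
u = np.array([1.0, 0.2, 1.1, 2.0])       # unit effects: obs 0 treated, 1..3 controls
lam = rng.normal(size=T0)                # common time factors lambda_t
Y = lam[:, None] * u[None, :] + 0.05 * rng.normal(size=(T0, 4))

X1 = Y[:, 0]            # pre-treatment outcomes of the treated unit
X0 = Y[:, 1:]           # pre-treatment outcomes of the controls

# choose w >= 0 with sum(w) = 1 minimizing ||X1 - X0 w|| by grid search
best, w_star = np.inf, None
g = np.linspace(0, 1, 101)
for w1 in g:
    for w2 in g:
        if w1 + w2 > 1:
            continue
        w = np.array([w1, w2, 1 - w1 - w2])
        loss = np.sum((X1 - X0 @ w) ** 2)
        if loss < best:
            best, w_star = loss, w
```

Because the only systematic variation is `lam * u`, the chosen weights should combine the controls' unit effects (0.2, 1.1, 2.0) into roughly the treated unit's 1.0, which is the sense in which matching pre-treatment outcomes matches the unobserved effect.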
Inference is handled by
- Re-doing the analysis, treating each obs $i$, $i = 2, \dots, J + 1$, as treated after period $T_{o}$ and the remaining obs as the pool of potential controls
- This yields a dbn of treatment effect estimates under $H_{o}$ of no treatment effect
- If the actual estimates of $\hat{\alpha}_{1t}$ look very different, this is evidence of a statistically meaningful treatment effect

Code is available in Stata at http://www.mit.edu/~jhainm/synthpage.html.
Example: Abadie et al. (2010)
Selection on Unobservables
Instrumental Variables

Refer to ECO 6374 for a refresher on the basics...

Terminology
- Structural model:
$$y_{i} = \beta_{0} + \beta_{1}x_{i} + \varepsilon_{i}$$
- First-stage model:
$$x_{i} = \pi_{0} + \pi_{1}z_{i} + u_{i}$$
- Reduced form model:
$$y_{i} = (\beta_{0} + \beta_{1}\pi_{0}) + \beta_{1}\pi_{1}z_{i} + (\varepsilon_{i} + \beta_{1}u_{i}) = \gamma_{0} + \gamma_{1}z_{i} + \nu_{i}$$
Goal: devise an alternative estimation technique to obtain consistent estimates when $E[\varepsilon \,|\, x] \neq 0$
- Solution: identify $\beta$ from exogenous variation in $x$ isolated using instruments, $z$
- $z$ is a valid IV for $x$ iff
  (IV.i) First-stage: $E[z'x] \neq 0$
  (IV.ii) Exogeneity: $E[z'\varepsilon] = 0$
  (IV.iii) Exclusion: $E[y \,|\, x, z] = E[y \,|\, x]$
  where $z$ and $x$ are both $N \times K$ matrices
- Exogenous $x$s serve as instruments for themselves
- Need a unique instrument for each endogenous var

Stata: -ivreg2-, -xtivreg2-
Several issues remain under scrutiny in the literature
1. Choice of estimation technique
2. Properties and inference with weak IVs $\Rightarrow E[z'x] \approx 0$
3. Properties and inference with endogenous IVs $\Rightarrow E[z'\varepsilon] \neq 0$
Selection on Unobservables
Estimators

1. IV
2. Two-Stage Least Squares (TSLS or 2SLS)
3. Nagar
4. Split-sample or Two-Sample IV (data set #1: $\{x, z\}_{i=1}^{N_{1}}$; data set #2: $\{y, z\}_{i=1}^{N_{2}}$)
5. JIVE
6. LIML
7. Fuller (modified LIML)
8. GMM
Selection on Unobservables
Estimators: IV Estimator

The estimator is given by
$$y = x\beta + \varepsilon$$
$$\Rightarrow z'y = z'x\beta + z'\varepsilon$$
$$\Rightarrow \beta = (z'x)^{-1}z'y \text{ if } z'\varepsilon = 0$$
$$\Rightarrow \hat{\beta}_{IV} = (z'x)^{-1}z'y$$

The estimated asymptotic variance is given by
$$\widehat{Var}(\hat{\beta}_{IV}) = \hat{\sigma}^{2}(z'x)^{-1}(z'z)(x'z)^{-1}; \qquad \hat{\sigma}^{2} = \frac{1}{N - K}\sum \hat{\varepsilon}_{i}^{2}$$
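The sandwich algebra above is two matrix solves in practice. A simulated sketch with one endogenous regressor and a known coefficient of 2 (all DGP numbers hypothetical), contrasting the IV estimator with the inconsistent OLS one:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
z = rng.normal(size=n)                   # instrument
u = rng.normal(size=n)
e = 0.8 * u + rng.normal(size=n)         # structural error, correlated with x via u
x = 0.5 * z + u                           # first stage: z shifts x; u makes x endogenous
y = 2.0 * x + e                           # true beta_1 = 2

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)[1]   # biased upward: Cov(x, e) > 0
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)[1]    # (z'x)^{-1} z'y, consistent
```

Here `(z'x)` is square because there is exactly one instrument per endogenous variable; the overidentified case is handled by TSLS on the next slide.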
Selection on Unobservables
Estimators: Two-Stage Least Squares

The IV estimator requires 1 instrument per endogenous variable; otherwise $z'x$ is an $L \times K$ matrix ($L > K$) with rank $K$, and the inverse does not exist

Discarding the additional IVs is probably inefficient

TSLS is an alternative estimator that does not face this problem

In multivariate regression, this is formalized as
- First-stage:
$$\hat{x} = z(z'z)^{-1}z'x$$
and replacing $z$ with $\hat{x}$ in the IV estimator
- Estimator now given by
$$\hat{\beta}_{TSLS} = (\hat{x}'\hat{x})^{-1}\hat{x}'y = [x'z(z'z)^{-1}z'x]^{-1}x'z(z'z)^{-1}z'y$$
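Extending the previous simulated example to the overidentified case (two instruments, one endogenous regressor; same hypothetical coefficient of 2) shows the two-step mechanics directly:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
z = rng.normal(size=(n, 2))                       # two instruments
u = rng.normal(size=n)
e = 0.8 * u + rng.normal(size=n)
x = 0.4 * z[:, 0] + 0.4 * z[:, 1] + u             # one endogenous regressor
y = 2.0 * x + e

Z = np.column_stack([np.ones(n), z])              # L = 3 columns
X = np.column_stack([np.ones(n), x])              # K = 2 columns: overidentified

# first stage: project X on Z; second stage: regress y on the fitted values
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_tsls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)[1]
```

Note `Xhat.T @ X` equals `Xhat.T @ Xhat` because the projection matrix is idempotent, so both forms of the TSLS formula give the same answer.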
Notes ...

In a multiple regression...
- With multiple endogenous vars, need at least as many IVs as endogenous $x$s; do not interpret as "this IV for this $x$, that IV for that $x$"
- Where the second-stage contains other exogenous vars, these vars must be included in the first-stage

If strictly more IVs than endogenous vars, then
- Model is overidentified (as opposed to exactly identified)
- Enables additional tests for instrument validity

Estimators are CAN, but biased
- Intuition behind the bias is that the first-stage OLS estimates, $\hat{\pi}$, are correlated with the error term from the structural model, $\varepsilon$, which implies that the fitted values, $\hat{x}$, are also correlated with $\varepsilon$

Incorrectly treating other covariates in the model as exogenous $\Rightarrow$ inconsistent estimates if the instrument(s) are correlated with these covariates
Selection on Unobservables
Estimators: JIVE, SSIV, Nagar

Breaking the correlation between $\hat{\pi}$ and $\varepsilon$ is the motivation behind JIVE and SSIV

SSIV (Angrist & Krueger 1992, 1995)
- Approach
  - Divide the sample into two groups: $i = 1, \dots, N_{1}$ and $i = N_{1} + 1, \dots, N$
  - Estimate the first-stage using the $N_{2}$ obs, $i = N_{1} + 1, \dots, N$
  - Predict $x$ out-of-sample for the first $N_{1}$ obs
  - Estimate the second-stage using the first $N_{1}$ obs
- Estimators
$$\hat{\beta}_{SSIV} = (\hat{x}_{21}'\hat{x}_{21})^{-1}\hat{x}_{21}'y_{1}$$
$$\hat{\beta}_{USSIV} = (\hat{x}_{21}'x_{1})^{-1}\hat{x}_{21}'y_{1}$$
where $\hat{x}_{21} = z_{1}(z_{2}'z_{2})^{-1}z_{2}'x_{2}$ and subscript 1 (2) refers to estimation on $i = 1, \dots, N_{1}$ ($i = N_{1} + 1, \dots, N$)
- SSIV uses OLS in the second-stage; USSIV stands for Unbiased SSIV and uses IV in the second-stage
JIVE
- Approach
  - Estimate the first-stage using $N - 1$ obs
  - Predict $x$ out-of-sample for the excluded obs
  - Repeat for all $N$ obs and estimate the second-stage using all $N$ obs
- Estimators
$$\hat{\beta}_{JIVE} = (\tilde{x}'\tilde{x})^{-1}\tilde{x}'y$$
$$\hat{\beta}_{UJIVE} = (\tilde{x}'x)^{-1}\tilde{x}'y = (x'C_{J}'x)^{-1}x'C_{J}'y$$
where $\tilde{x}$ is the matrix whose $i$th row is $z_{i}\hat{\pi}_{-i}$, $\hat{\pi}_{-i}$ is the vector of first-stage coefs with obs $i$ removed, and
$$C_{J} = (I - D_{P_{z}})^{-1}(P_{z} - D_{P_{z}}), \qquad D_{P_{z}} = \text{diag}(P_{z}), \qquad P_{z} = z(z'z)^{-1}z'$$
- JIVE uses OLS in the second-stage; UJIVE stands for Unbiased JIVE and uses IV in the second-stage
- Stata: -jive-
The Nagar estimator is a bias-corrected TSLS estimator
- Nagar (1959), Hahn & Hausman (2002)
- Estimator given by
$$\hat{\beta}_{N} = \left[x'\left(P_{z} - \frac{K}{N}I_{N}\right)x\right]^{-1}x'\left(P_{z} - \frac{K}{N}I_{N}\right)y$$
where $K$ = # IVs and $P_{z} = z(z'z)^{-1}z'$
- Hahn & Hausman (2002) discuss the poor performance of the Nagar estimator when the model is close to being unidentified
Selection on Unobservables
Estimators: LIML, Fuller, and k-Class Estimators

k-class estimators can all be written as
$\widehat{\beta}_k = [x'(I_N - kM_z)x]^{-1}x'(I_N - kM_z)y$
for different values of k, where $M_z = I_N - z(z'z)^{-1}z'$
- $k = 0 \Rightarrow$ OLS
- $k = 1 \Rightarrow$ TSLS
- $k = \lambda \Rightarrow$ LIML
- $k = \lambda - \frac{\alpha}{N - L} \Rightarrow$ Fuller
- $k = 1 + \frac{L - K}{N} \Rightarrow$ Nagar
For LIML, $\lambda$ is a minimum eigenvalue
For Fuller, $\alpha$ is user-specified (typically 1) and L = # of included + excluded instruments
For Nagar, $L - K$ = # of over-identifying restrictions
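The k-class formula nests the familiar estimators; a minimal numerical check (simulated data, illustrative values) that k = 0 reproduces OLS and k = 1 removes most of the endogeneity bias:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1500
z = rng.normal(size=(N, 2))
common = rng.normal(size=N)
x = (z @ np.array([0.7, 0.5]) + common + rng.normal(size=N)).reshape(-1, 1)
y = 1.0 * x[:, 0] - common + rng.normal(size=N)   # true beta = 1

Mz = np.eye(N) - z @ np.linalg.solve(z.T @ z, z.T)   # annihilator of z

def k_class(k):
    # beta_k = [x'(I - k M_z)x]^{-1} x'(I - k M_z)y
    A = np.eye(N) - k * Mz
    return np.linalg.solve(x.T @ A @ x, x.T @ A @ y).item()

beta_ols = k_class(0.0)    # k = 0 reproduces OLS (biased here)
beta_tsls = k_class(1.0)   # k = 1 reproduces TSLS
print(beta_ols, beta_tsls)
```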
Selection on Unobservables
IV: Specification Tests

Much specification testing is required when utilizing IV in applied research
Types of tests available
- Tests of endogeneity: $E[x'\varepsilon] \overset{?}{=} 0$
- Tests of instrument relevance: $E[z'x] \overset{?}{=} 0$
- Tests of overidentification: $E[z'\varepsilon] \overset{?}{=} 0$ (partial test only)
- Tests for weak instruments: $E[z'x] \approx 0$
Covered in ECO 6374
With weak IVs, some recommend LIML, others Fuller, others UJIVE, others TSLS (which tends to have a larger bias, similar RMSE)
Selection on Unobservables
IV: Imperfect Instruments

Recent work has explored what can be learned if z is an imperfect instrumental variable (IIV)
Two possible imperfections:
1. z is also endogenous
2. z is not excludable from the second-stage
Nevo & Rosen (2010) and Ashley (2009) address endogeneity
Conley et al. (2010) address excludability
Note: These are intimately related since if z is incorrectly treated as excludable, then it will be correlated with the second-stage composite error that now includes the error and z
Nevo & Rosen (2010) ...
Setup
- Model given by
  $y_i = \beta x_i + w_i\gamma + \varepsilon_i$
  where x is a single endogenous regressor, w is exogenous (or alternatively endogenous with valid instruments), and z is a $1 \times k_z$ vector of imperfect instruments for x
- z is an imperfect IV (IIV) in the sense that it is also correlated with $\varepsilon$
- Assumptions:
  (IIV.i) Sign of correlation: $\rho_{x\varepsilon}\rho_{z_j\varepsilon} \geq 0$, $j = 1, \ldots, k_z$
  (IIV.ii) Degree of endogeneity: $|\rho_{x\varepsilon}| \geq |\rho_{z_j\varepsilon}|$, $j = 1, \ldots, k_z$
  (IIV.iii) True model: $y_i = \beta x_i + w_i\gamma + \varepsilon_i$
  (IIV.ii) contrasts with the classical IV assumption that $\rho_{z_j\varepsilon} = 0$
Define
$\lambda_j^* = \frac{\rho_{z_j\varepsilon}}{\rho_{x\varepsilon}}$
which is in the unit interval under (IIV.i), (IIV.ii)
If $\lambda_j^*$ were known, then a valid IV for x is
$V_j(\lambda_j^*) = \sigma_x z_j - \lambda_j^*\sigma_{z_j}x$
However, $\lambda^* = [\lambda_1^* \cdots \lambda_{k_z}^*]$ is unknown, but lies in the unit cube in $\mathbb{R}^{k_z}$-space
Intuitively, searching over feasible values of $\lambda^*$, one may bound $\beta$
Consider $k_z = 1$
- Partial out the effects of w by defining
  $\widetilde{y}_i = y_i - w_i[(w'w)^{-1}w'y]$
  $\widetilde{x}_i = x_i - w_i[(w'w)^{-1}w'x]$
  (Note: If w is endogenous with valid IVs, then the OLS coefficients are replaced by IV coefficients.)
- Under (IIV.i)-(IIV.iii) and assuming without loss of generality that $\sigma_{x\varepsilon} \geq 0$, obtain the following bounds:
  - Case I. $(\sigma_{z\widetilde{x}}\sigma_x - \sigma_{x\widetilde{x}}\sigma_z)\sigma_{z\widetilde{x}} > 0$:
    $\beta \in [\beta_{IV}^{V(1)}, \beta_{IV}^{z}]$ if $\sigma_{z\widetilde{x}} < 0$; $\beta \in [\beta_{IV}^{z}, \beta_{IV}^{V(1)}]$ if $\sigma_{z\widetilde{x}} > 0$
  - Case II. $(\sigma_{z\widetilde{x}}\sigma_x - \sigma_{x\widetilde{x}}\sigma_z)\sigma_{z\widetilde{x}} \leq 0$:
    $\beta \in [\max\{\beta_{IV}^{z}, \beta_{IV}^{V(1)}\}, \infty)$ if $\sigma_{z\widetilde{x}} < 0$; $\beta \in (-\infty, \min\{\beta_{IV}^{z}, \beta_{IV}^{V(1)}\}]$ if $\sigma_{z\widetilde{x}} > 0$
Additional work to bound $\gamma$ is also possible
Extension to $k_z > 1$
- Bounds can be tightened by obtaining bounds for each z individually and then computing the final bounds as the intersection of the $k_z$ bounds
- Formally
  - For each $z_j$, obtain $B_j^* = [\beta_j^l, \beta_j^u]$
  - Final bounds given by $\beta \in \left[\max_j \beta_j^l, \min_j \beta_j^u\right]$
  - In Case II, these bounds are one-sided; one trick may be to try and define a new IV that is a weighted average of two of the IVs such that $(\sigma_{q\widetilde{x}}\sigma_x - \sigma_{x\widetilde{x}}\sigma_q)\sigma_{q\widetilde{x}} > 0$, where $q_i = \omega z_{ji} + (1-\omega)z_{j'i}$
- Need to be careful, though, and make sure different z's estimate the same parameter (discussed later)
Conley et al. (2010) ...
Setup
$y_i = x_i\beta + z_i\gamma + \varepsilon_i$
$x_i = z_i\pi + u_i$
where x is a $k_x$-dimensional vector of endogenous regressors, z is a $k_z$-dimensional vector of instruments, $k_z \geq k_x$, and $E[z'\varepsilon] = 0$
Classical IV requires the assumption that $\gamma = 0$
- With $k_x = k_z = 1$, we have
  $\text{plim}\ \widehat{\beta}_{IV} = \beta + \frac{\sigma_{z\nu}}{\sigma_{xz}} = \beta + \gamma\frac{\sigma_z^2}{\sigma_{xz}} = \beta + \frac{\gamma}{\pi}$
  where $\nu = z_i\gamma + \varepsilon_i$ is the composite error
- Thus, IV is asymptotically biased when $\gamma \neq 0$ and the bias is decreasing in $\pi$ and increasing in $\gamma$
- Authors refer to deviations from $\gamma = 0$ as "plausible exogeneity"
Approach
- Track estimates $\widehat{\beta}(\gamma) = \widehat{\beta}_{IV} - \gamma/\widehat{\pi}$ for different values of $\gamma$
- Estimates will be more sensitive to $\gamma$ the weaker the first-stage relationship
Authors present several possible methods of inference, only some presented here
Method #1. Union of CIs with Support Assumption
- Suppose the true value of $\gamma = \gamma_0 \in G \subset \mathbb{R}^{k_z}$, with known bounds
- If $\gamma_0$ were known, then IV/TSLS applied to
  $y_i - z_i\gamma_0 = x_i\beta + \varepsilon_i$
  using z as instruments is consistent for $\beta$
- With $\gamma_0$ unknown, but contained in $G \subset \mathbb{R}^{k_z}$, one can
  - Apply IV/TSLS to a grid of values for $\gamma$ from G
  - For each value, $\gamma_s$, $s = 1, \ldots, S$, obtain the $(1-\alpha)$% CI for $\beta$
  - Compute a final CI as the union of these S CIs
    $CI(1-\alpha) = \bigcup_{\gamma \in G} CI(1-\alpha, \gamma)$
    which has an asymptotic coverage probability $\geq 1-\alpha$
  - If some prior info, may want to weight different $\gamma$'s differently
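The union-of-CIs idea can be sketched in a few lines; a minimal simulation where the support bound [-0.2, 0.2], the true direct effect of z, and all other values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4000
z = rng.normal(size=N)
common = rng.normal(size=N)
x = 0.9 * z + common + rng.normal(size=N)
gamma0 = 0.1                                  # true (nonzero) direct effect of z
y = 1.0 * x + gamma0 * z - common + rng.normal(size=N)   # true beta = 1

def iv_beta_se(y_adj):
    # Just-identified IV of y_adj on x with instrument z, plus its std error
    beta = (z @ y_adj) / (z @ x)
    e = y_adj - beta * x
    se = np.sqrt((e @ e) / (N - 1)) * np.sqrt(z @ z) / abs(z @ x)
    return beta, se

# Union of 95% CIs over the support assumption gamma in [-0.2, 0.2]
lo, hi = np.inf, -np.inf
for g in np.linspace(-0.2, 0.2, 41):
    b, se = iv_beta_se(y - g * z)
    lo, hi = min(lo, b - 1.96 * se), max(hi, b + 1.96 * se)
print(lo, hi)
```

The resulting interval is wide relative to any single IV confidence interval, which is the price of only bounding the exclusion violation rather than assuming it away.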
Method #2. Local-to-Zero Approximation
- $\gamma$ is treated as unknown, but coming from a known dbn
  $\gamma = \frac{\eta}{\sqrt{N}}, \quad \eta \sim G$
  where prior info on $\gamma$ translates to knowing the dbn G
- The normalization by $\sqrt{N}$ ensures that uncertainty about z being a valid instrument and sampling error are of the same order and so both factor into the asymptotic dbn of $\widehat{\beta}$
- Assuming $\eta \sim N(\mu_\eta, \Omega_\eta)$ leads to the following approximate dbn
  $\widehat{\beta} \sim N(\beta + A\mu_\eta,\ V_{IV} + A\Omega_\eta A')$
  where $A = (x'z(z'z)^{-1}z'x)^{-1}x'z$
- If $\mu_\eta = 0$, then this approach simply leads to a revised variance for the IV/TSLS estimator
Stata ado files available on Conley's website
Selection on Unobservables
IV: Heterogeneous Treatment Effects

Assume a binary endogenous regressor, D, and a binary instrument, z
Motivation arises from the fact that the treatment effect may vary across i and agents may act on observation-specific gains when making the treatment decision
Admitting this possibility implies that one must think more carefully about what parameter one is estimating
Linear model
Setup (from earlier potential outcomes framework)
$y_i = y_{0i} + D_i(y_{1i} - y_{0i})$
$= \mu_0 + x_i\beta + \varepsilon_{0i} + D_i(\mu_1 + x_i\beta + \varepsilon_{1i} - \mu_0 - x_i\beta - \varepsilon_{0i})$
$= \mu_0 + x_i\beta + (\mu_1 - \mu_0 + \varepsilon_{1i} - \varepsilon_{0i})D_i + \varepsilon_{0i}$
$= x_i\beta + \tau_iD_i + \varepsilon_i$
(with the intercept $\mu_0$ absorbed into $x_i\beta$ and $\varepsilon_i \equiv \varepsilon_{0i}$)
Define $\tau_i = (\mu_1 - \mu_0) + (\varepsilon_{1i} - \varepsilon_{0i}) = \tau + \tau_i^*$
Substitution implies
$y_i = x_i\beta + \tau D_i + (\tau_i^*D_i + \varepsilon_i)$
where $\tau_i^*D_i + \varepsilon_i$ is the composite error term, which differs from the usual error term for the treated
A valid IV in the homogeneous treatment effects setup requires
$E[\varepsilon_i|x_i, D_i, z_i] = E[\varepsilon_i|x_i, D_i]$
but now
$E[\tau_i^*D_i + \varepsilon_i|x_i, D_i, z_i] = E[\tau_i^*D_i + \varepsilon_i|x_i, D_i]$
is required
Thus, z must be
- Correlated with $D_i$ (as usual)
- Uncorrelated with the error term from the structural model and individual-specific gains (or losses) from treatment
  - Not possible unless (i) $\tau_i^* = 0\ \forall i$ (implying a constant treatment effect) or (ii) $\tau_i^* \perp D_i | x_i$ (implying that agents either do not know or do not act on specific gains ... no "essential heterogeneity")
  - Model with $\tau_i^*$ and $D_i$ correlated is known as the Correlated Random Coefficients (CRC) model
Much more restrictive requirement
- Example: if z is an exogenous variable representing the cost of participation in the treatment (e.g., distance to job training center), then high z will lead to no participation unless the benefit from participation, $\tau_i^*$, is very high; if z is low, one will participate whether $\tau_i^*$ is low or high $\Rightarrow$ positive correlation between z and $\tau_i^*$ conditional on $D_i$
If z is uncorrelated with $\varepsilon$, but correlated with $\tau_i^*$, then IV estimates are still useful, but identify a different parameter
Parameter known as the local average treatment effect (LATE)
Formally, given the model (ignoring x)
$y_i = \alpha + \tau D_i + (\tau_i^*D_i + \varepsilon_i)$
and an instrument, z, we have
$\text{plim}\ \widehat{\tau}_{OLS} = \frac{Cov(y, D)}{Var(D)} = \tau + \frac{Cov(\varepsilon, D) + Cov(\tau^*D, D)}{Var(D)} \neq \tau$
$\text{plim}\ \widehat{\tau}_{IV} = \frac{Cov(y, z)}{Cov(D, z)} = \tau + \frac{Cov(\varepsilon, z) + Cov(\tau^*D, z)}{Cov(D, z)} = \tau + \frac{Cov(\tau^*D, z)}{Cov(D, z)} \neq \tau$
where the last inequality holds unless (i) $\tau_i^* = 0\ \forall i$ or (ii) $\tau_i^* \perp D_i | x_i$ (as stated above)
How do we interpret $\widehat{\tau}_{IV}$?
LATE
Assume a binary endogenous regressor, D, and a binary instrument, z, and no other covariates (for simplicity)
Four potential subpopulations

                        z = 0    z = 1
  Never Takers (NT)     D = 0    D = 0
  Defiers (DF)          D = 1    D = 0
  Compliers (C)         D = 0    D = 1
  Always Takers (AT)    D = 1    D = 1

Compliers are the key, as their treatment status varies with the instrument
Recall, the Wald estimator
$\widehat{\tau}_{IV} = \frac{E[y|z=1] - E[y|z=0]}{\Pr(D=1|z=1) - \Pr(D=1|z=0)}$
Numerator terms may be expressed as
$E[y|z=j] = E[y_1|AT]\Pr(AT) + E[y_j|C]\Pr(C) + E[y_{(1-j)}|DF]\Pr(DF) + E[y_0|NT]\Pr(NT), \quad j = 0, 1$
Denominator terms may be expressed as
$\Pr[D=1|z=j] = \Pr[D=1|z=j, AT]\Pr(AT) + \Pr[D=1|z=j, C]\Pr(C) + \Pr[D=1|z=j, DF]\Pr(DF) + \Pr[D=1|z=j, NT]\Pr(NT), \quad j = 0, 1$
$= \begin{cases} \Pr(AT) + \Pr(C) & \text{if } j = 1 \\ \Pr(AT) + \Pr(DF) & \text{if } j = 0 \end{cases}$
Wald estimator reduces to
$\widehat{\tau}_{IV} = \frac{\{E[y_1|C]\Pr(C) + E[y_0|DF]\Pr(DF)\} - \{E[y_0|C]\Pr(C) + E[y_1|DF]\Pr(DF)\}}{\Pr(C) - \Pr(DF)}$
which is a weighted average of the treatment effect for compliers and the negative of the treatment effect for defiers
Assumptions
(LATE.i) Independence: $y_0, y_1, D_0, D_1 \perp z$, where $D_j$, $j = 0, 1$, are potential treatment assignments
(LATE.ii) Exclusion: $E[y_0|z] = E[y_0]$; $E[y_1|z] = E[y_1]$
(LATE.iii) First-Stage/Compliers: $\Pr(C) > 0 \Rightarrow \Pr(D=1|z)$ is a non-trivial function of z
(LATE.iv) Monotonicity: $\Pr(D_i=1|z_i=1) > \Pr(D_i=1|z_i=0)\ \forall i \Rightarrow \Pr(DF) = 0$
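Under (LATE.i)-(LATE.iv) the Wald ratio recovers the complier effect rather than the population ATE. A minimal simulation (type shares and effect sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20000
z = rng.integers(0, 2, size=N)

# Three types under monotonicity (no defiers)
type_ = rng.choice(['AT', 'NT', 'C'], size=N, p=[0.2, 0.3, 0.5])
D = np.where(type_ == 'AT', 1, np.where(type_ == 'NT', 0, z))

# Heterogeneous gains: compliers gain 2, everyone else gains 1
tau = np.where(type_ == 'C', 2.0, 1.0)
y = tau * D + rng.normal(size=N)

wald = (y[z == 1].mean() - y[z == 0].mean()) / (D[z == 1].mean() - D[z == 0].mean())
print(wald)   # near the complier effect (2), not the population ATE (1.5)
```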
Imposing these assumptions $\Rightarrow$
$\widehat{\tau}_{IV} = \widehat{\tau}_{LATE} = E[y_1 - y_0|C]$
which is a parameter defined with respect to a particular instrument
Comments
- LATE is a well-defined economic parameter
- Whether it is an interesting parameter is a different matter
- Not possible to know who the compliers are in the data
- Interpretation is similar, but derivation more complex, if D or z is continuous
  - Continuous z estimates the local instrumental variable (LIV) parameter (Heckman and Vytlacil 1999)
- With multiple instruments, things become thorny ... different instruments, even if all valid, potentially identify different parameters!
  - No reason why different IV estimates should be the same
  - Using multiple IVs yields a weighted average of different LATEs
DiNardo & Lee (2011) provide an alternative interpretation of the IV estimand
- They replace the monotonicity assumption with what they call a "probabilistic monotonicity" assumption
- The result is that $\widehat{\tau}_{IV}$ is shown to be a weighted average of $\tau_i$ where the weights are proportional to the increase in $\Pr(D_i=1|z_i=1) - \Pr(D_i=1|z_i=0)$
  - Under the monotonicity assumption,
    $\Pr(D_i=1|z_i=1) - \Pr(D_i=1|z_i=0) = \begin{cases} 0 & \text{if type = AT, NT} \\ 1 & \text{if type = C} \end{cases}$
    so that only compliers receive positive weight
  - This follows from the assumption that D is a deterministic fn of z
  - Probabilistic monotonicity relaxes this assumption and allows D to be a nondecreasing fn of z (conditional on type)
Not possible to infer anything about $\tau_{ATE}$, $\tau_{ATT}$, or $\tau_{ATU}$ without additional assumptions about how compliers compare to the rest of the population
- Vytlacil et al. (2009) working on when one can learn the sign of $\tau_{ATE}$
- DiNardo & Lee (2011) discuss extrapolating to the $\tau_{ATE}$
- Heckman et al. (2010) propose two tests of the CRC assumption $H_o: \tau_i^* \perp D_i | x_i$
  - Test #1 based on comparison of different (valid) IV estimates; under $H_o$ different IVs provide consistent estimates of the same parameter even if they lead to different sub-populations of compliers
  - Test #2 based on testing for a linear relationship between y and the estimated propensity score conditional on x
Selection on Unobservables
IV: Finding Instruments

Economic theory ... what determines participation, but not outcomes?
Exogenous variation in program availability (across space or over time) ... must be exogenous
Natural experiments ... twins, sex composition, miscarriages, Mariel Cuban boatlift, Russian immigration to Israel
Randomized experiments (even if imperfect compliance) ... Project STAR
Fuzzy regression discontinuity design
Recall from the sharp RD case that we require the existence of the following limits
$D^+ = \lim_{s \downarrow \bar{s}} \Pr(D=1|s)$
$D^- = \lim_{s \uparrow \bar{s}} \Pr(D=1|s)$
and $D^+ \neq D^-$
- Sharp RD setup implies $D^+ = 1$ and $D^- = 0$
- Fuzzy RD setup implies $1 \geq D^+ > D^- \geq 0$
Formally
(FRD.i) Treatment assignment is a discontinuous function of s (with a known threshold, $\bar{s}$)
$D_i = D(s_i, \xi_i)$
where
$\Pr(D=1) = \Pr(D=1|s \geq \bar{s})\Pr(s \geq \bar{s}) + \Pr(D=1|s < \bar{s})\Pr(s < \bar{s})$
(FRD.ii) Positive density at the threshold: $f_S(\bar{s}) > 0$
(FRD.iii) Outcomes are continuous in s at least around $\bar{s}$ and do not depend on whether $s \gtrless \bar{s}$
(FRD.iv) For each agent, the dbn of s is continuous at least around $\bar{s}$
Notes
- Endogenous treatment variable, D, depends on the observed score variable, s, and a stochastic element, $\xi$
- Discrete jump in $\Pr(D=1)$ at $\bar{s}$
- Example: $\Pr(D=1) = \max\{0, 0.5s + 0.25 \cdot I(s > 0.5)\}$
  [Figure: $\Pr(D=1)$ plotted against s on [0, 1], showing a discrete jump at the threshold s = 0.5]
Implies $D_i = E[D|s_i] + \xi_i$, where $Cov(\varepsilon, \xi) \neq 0$
OLS estimation of
$y_i = x_i\beta + \tau D_i + f(s_i) + \varepsilon_i$
where x is a vector of exogenous controls, is biased, even with a flexible function of s included
Solution
- Estimate the propensity score, where f(s) is included along with the indicator $I(s > \bar{s})$ $\Rightarrow$ $\widehat{p}(D)$
- Estimate by OLS
  $y_i = x_i\beta + \tau\widehat{p}(D_i) + f(s_i) + \varepsilon_i$
- Equivalent to TSLS, with $I(s > \bar{s})$ as the instrument, when f(s) is chosen parametrically
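A minimal fuzzy-RD sketch via TSLS, using the threshold indicator as the instrument and a linear f(s) (the design and all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 20000
s = rng.uniform(size=N)
above = (s > 0.5).astype(float)

# Fuzzy design: crossing the threshold raises Pr(D = 1) by 0.5
D = (rng.uniform(size=N) < 0.25 + 0.5 * above).astype(float)
y = 1.0 * D + 2.0 * s + rng.normal(size=N)   # true treatment effect = 1

# Just-identified TSLS: instrument D with I(s > 0.5), control for f(s) = s
X = np.column_stack([np.ones(N), D, s])
Z = np.column_stack([np.ones(N), above, s])
beta = np.linalg.solve(Z.T @ X, Z.T @ y)
print(beta[1])
```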
Interpretation
- Typical interpretation: RD identifies the LATE at $\bar{s}$
- DiNardo & Lee (2011) interpret the estimated parameter as a weighted average of $\tau_i$ where the weights are proportional to (i) the probability of $s_i$ being in the neighborhood of $\bar{s}$ and (ii) the influence of crossing the threshold, $\bar{s}$, on the probability of receiving the treatment
Selection on Unobservables
Methods Not Requiring Exclusion Restrictions

Several methods exist that do not rely on a typical exclusion restriction for identification
1. Heckman bivariate normal selection model
2. Millimet & Tchernis (2011) bias-corrected estimator
3. Higher moments
4. Covariance restrictions
All such methods must replace the assumption concerning an exclusion restriction with some other identifying assumption (there is no such thing as a free lunch)
Selection on Unobservables
Heckman Bivariate Normal Selection Model

Requires fairly strong parametric assumptions to circumvent the selection on unobservables problem
Also useful to solve problems of non-random sample selection (discussed later)
Treatment effects model with common effect
Setup
$y_{0i} = x_i\beta_0 + \varepsilon_i$
$y_{1i} = x_i\beta_1 + \varepsilon_i$
$y_i = D_iy_{1i} + (1 - D_i)y_{0i}$
$D_i^* = z_i\gamma + u_i$
$D_i = \begin{cases} 1 & \text{if } D_i^* > 0 \\ 0 & \text{if } D_i^* \leq 0 \end{cases}$
Notes
- $\varepsilon_i$ = common error component (or "common effect") in both potential outcome equations
- $\beta$'s allowed to differ across outcome equations
- $D_i^*$ = latent indicator of treatment status
- Model rules out the selection on observables assumption since unobservables associated with treatment status, u, are correlated with unobservables affecting outcomes conditional on x
Assumptions
(BVN.i) $\varepsilon, u \sim N_2(0, 0, \sigma_\varepsilon^2, \sigma_u^2, \rho)$
(BVN.ii) $\varepsilon, u \perp x, z$
(BVN.iii) $\sigma_u^2 = 1$
Parameters of interest
- Given the setup, the individual-specific treatment effect is given by
  $\tau_i = y_{1i} - y_{0i} = x_i(\beta_1 - \beta_0)$
- Average treatment effects are
  $\tau_{ATE} = E[\tau_i] = E[x_i](\beta_1 - \beta_0)$
  $\tau_{ATT} = E[\tau_i|D_i = 1] = E[x_i|D_i = 1](\beta_1 - \beta_0)$
  $\tau_{ATU} = E[\tau_i|D_i = 0] = E[x_i|D_i = 0](\beta_1 - \beta_0)$
- Implies consistent estimates of all three parameters require consistent estimates of $\beta_0, \beta_1$
- Two naive options:
  - Split sample into D = 1 and D = 0, and regress y on x via OLS in each sub-sample
  - Pool sample, regress y on x, Dx
- Under selection on unobservables, neither option produces consistent estimates
Conditional expectations (following from the properties of conditional normal random variables)
- Of the outcome in the treated state for the treated
  $E[y_i|D_i=1, x_i, z_i] = x_i\beta_1 + E[\varepsilon_i|u_i > -z_i\gamma]$
  $= x_i\beta_1 + \rho\sigma_\varepsilon\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$
  $= x_i\beta_1 + \rho\sigma_\varepsilon[\lambda(z_i\gamma)]$
  where $\lambda(\cdot)$ is known as the Inverse Mills Ratio
- Of the outcome in the untreated state for the untreated
  $E[y_i|D_i=0, x_i, z_i] = x_i\beta_0 + E[\varepsilon_i|u_i \leq -z_i\gamma]$
  $= x_i\beta_0 + \rho\sigma_\varepsilon\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]$
- Given $Corr(\varepsilon, u) \neq 0$, the error term is no longer well-behaved
Estimation: Method #1
Estimate the outcome equation for the treated and the untreated separately via OLS
Consistent estimates of $\beta_0, \beta_1$ require inclusion of the selection terms
Selection terms are estimable by
1. Estimating a probit model for treatment assignment $\Rightarrow \widehat{\gamma}$
2. Estimating the selection terms
   $\left[\frac{\phi(z_i\widehat{\gamma})}{\Phi(z_i\widehat{\gamma})}\right]$ and $\left[\frac{-\phi(z_i\widehat{\gamma})}{1 - \Phi(z_i\widehat{\gamma})}\right]$
3. Including these as additional covariates in each second-stage regression
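A minimal two-step sketch for the untreated-state regression (the probit is fit by hand via Fisher scoring; the data-generating design, including an exclusion restriction in z, is an illustrative assumption):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(6)
N = 5000

def Phi(v):   # standard normal CDF
    return 0.5 * (1.0 + np.vectorize(erf)(v / np.sqrt(2.0)))

def phi(v):   # standard normal pdf
    return np.exp(-v**2 / 2.0) / np.sqrt(2.0 * np.pi)

x = rng.normal(size=N)
eps = rng.normal(size=N)
u = 0.6 * eps + 0.8 * rng.normal(size=N)                   # Corr(eps,u)=0.6, Var(u)=1
z = np.column_stack([np.ones(N), x, rng.normal(size=N)])   # col 3: exclusion restriction
D = (z @ np.array([0.0, 0.5, 0.8]) + u > 0).astype(float)
y = 2.0 + 1.0 * x + eps                                    # untreated-state outcome

# Step 1: probit for D via Fisher scoring
g = np.zeros(3)
for _ in range(25):
    idx = z @ g
    score_w = phi(idx) * (D - Phi(idx)) / (Phi(idx) * (1 - Phi(idx)))
    info_w = phi(idx)**2 / (Phi(idx) * (1 - Phi(idx)))
    g += np.linalg.solve(z.T @ (info_w[:, None] * z), z.T @ score_w)

# Step 2: selection term for the untreated, then corrected OLS on the D = 0 sample
idx = z @ g
mills0 = -phi(idx) / (1 - Phi(idx))   # E[eps | u <= -z*gamma] up to rho*sigma_eps
sel = D == 0
Xc = np.column_stack([np.ones(sel.sum()), x[sel], mills0[sel]])
coef = np.linalg.lstsq(Xc, y[sel], rcond=None)[0]
print(coef)   # approx (2, 1, 0.6): intercept, slope on x, rho*sigma_eps
```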
Upon estimation of $\widehat{\beta}_0, \widehat{\beta}_1$ ...
- Predict $\widehat{y}_{1i}, \widehat{y}_{0i}\ \forall i$
- Estimate the treatment effect parameters
  $\widehat{\tau}_{ATE} = \overline{\widehat{y}_{1i} - \widehat{y}_{0i}}$
  $\widehat{\tau}_{ATT} = \overline{\widehat{y}_{1i} - \widehat{y}_{0i}}$
  $\widehat{\tau}_{ATU} = \overline{\widehat{y}_{1i} - \widehat{y}_{0i}}$
  where ATE computes the mean for the entire sample, and the latter two compute means using only the treated and untreated, respectively
- Equivalently,
  $\widehat{\tau}_{ATE} = \bar{x}(\widehat{\beta}_1 - \widehat{\beta}_0)$
  $\widehat{\tau}_{ATT} = \bar{x}_1(\widehat{\beta}_1 - \widehat{\beta}_0)$
  $\widehat{\tau}_{ATU} = \bar{x}_0(\widehat{\beta}_1 - \widehat{\beta}_0)$
  where $\bar{x}$ is the sample mean, and $\bar{x}_k$, $k = 0, 1$, is the sample mean in the sub-sample with $D = k$
Estimation: Method #2
Estimate a single outcome equation with no restriction
$y_i = x_i\beta_0 + x_iD_i(\beta_1 - \beta_0) + \theta_1D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + \theta_0(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right] + \zeta_i$
This does not impose the restriction that the coefficient on both selection terms should be the same: $\rho\sigma_\varepsilon$
Thus, testing $H_o: \theta_0 = \theta_1$ constitutes a specification test of the underlying model
Note
$\zeta_i = \varepsilon_i - \theta_1D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] - \theta_0(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]$
$= \varepsilon_i - D_iE[\varepsilon_i|D_i = 1] - (1 - D_i)E[\varepsilon_i|D_i = 0]$
which is a well-behaved error term since the portion of the error term that is correlated with treatment assignment now appears in the model in the form of the selection correction terms
Estimation: Method #3
Estimate a single outcome equation imposing the restriction that $\theta_0 = \theta_1 = \theta$
$y_i = x_i\beta_0 + x_iD_i(\beta_1 - \beta_0) + \theta\left\{D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + (1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]\right\} + \zeta_i$
Efficiency gain if, in fact, the restriction is true
Estimation: Method #4
Maximum likelihood estimation of the system of three equations
Above estimators are known as the "control function" approach since the selection terms control for selection on unobservables
ML is not a control function approach, but rather directly incorporates the covariance structure of the errors into the estimation by jointly estimating the system of equations
Benefits: yields an estimate of $\rho$ along with a std error; more efficient if the parametric assumptions are true
Cost: results are less robust if the parametric assumptions of the model are violated
Comments
There is no instrument or exclusion restriction required for identification
- Identification arises from the non-linearity of the selection correction terms, which in turn arises from the assumption of bivariate normality
- Exclusion restrictions (a variable in z not in x) would be nice
Semi-parametric versions exist
- Relax dependence on bivariate normality
- Require exclusion restrictions
- One version includes a polynomial of the propensity score in the regression model; the motivation is to include a flexible functional form to capture the selection terms without reliance on bivariate normality
Bivariate probit treatment effects model
- Similar to above models, except the outcome of interest is binary (e.g., employment following a job training program)
- Similar estimation to above by ML, except the likelihood is based on a bivariate probit model (same as in Altonji et al. (2005) unconstrained bivariate probit model)
Aside:
Typical IV estimator can also be implemented using a control function approach
- TSLS estimator of the model
  $y_i = \beta_1x_{1i} + x_{2i}\beta_2 + \varepsilon_i$
  $x_{1i} = z_i\pi_1 + x_{2i}\pi_2 + u_i$
  is equivalent to OLS estimation of
  $y_i = \beta_1x_{1i} + x_{2i}\beta_2 + \theta\widehat{u}_i + \zeta_i$
  where $u_i$ is replaced with the OLS estimate of the first-stage residual, $\widehat{u}_i$
- Since $\widehat{u}_i = x_{1i} - z_i\widehat{\pi}_1 - x_{2i}\widehat{\pi}_2$, this is not linearly independent of $x_2$ unless $\widehat{\pi}_1 \neq 0$
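The TSLS/control-function equivalence is exact in the linear model and easy to verify numerically; a minimal sketch (simulated data, illustrative values):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1000
z = rng.normal(size=N)
x2 = rng.normal(size=N)
common = rng.normal(size=N)
x1 = 0.7 * z + 0.3 * x2 + common + rng.normal(size=N)
y = 1.0 * x1 + 0.5 * x2 - common + rng.normal(size=N)

X = np.column_stack([np.ones(N), x1, x2])
Z = np.column_stack([np.ones(N), z, x2])

# TSLS (just identified): beta = (Z'X)^{-1} Z'y
beta_tsls = np.linalg.solve(Z.T @ X, Z.T @ y)

# Control function: OLS of y on (1, x1, x2, uhat), uhat = first-stage residual
uhat = x1 - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x1)
beta_cf = np.linalg.lstsq(np.column_stack([X, uhat]), y, rcond=None)[0]
print(beta_tsls[1], beta_cf[1])   # identical coefficient on x1
```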
Treatment effects model without the common effect assumption
Relaxation of the common effect assumption allows for heterogeneous effects of the treatment even conditional on x
Setup
$y_{0i} = x_i\beta_0 + \varepsilon_{0i}$
$y_{1i} = x_i\beta_1 + \varepsilon_{1i} = x_i\beta_1 + [(\varepsilon_{1i} - \varepsilon_{0i}) + \varepsilon_{0i}] = x_i\beta_1 + [\eta_i + \varepsilon_{0i}]$
$y_i = D_iy_{1i} + (1 - D_i)y_{0i}$
$D_i^* = z_i\gamma + u_i$
$D_i = \begin{cases} 1 & \text{if } D_i^* > 0 \\ 0 & \text{if } D_i^* \leq 0 \end{cases}$
Notes
- $\eta_i$ = obs-specific gain to treatment (conditional on x)
- $\tau_i = y_{1i} - y_{0i} = x_i(\beta_1 - \beta_0) + \eta_i$ (heterogeneous treatment effects given x)
- Selection into treatment may depend on either $\varepsilon_{0i}$ (untreated outcome level given x) or $\eta_i$ (obs-specific gains given x)
- Otherwise, intuition is identical to the common effect version
Assumptions (replaces (BVN.i))
(BVN.i') $\varepsilon_0, \varepsilon_1, u \sim N(0, \Sigma)$, where
$\Sigma = \begin{bmatrix} \sigma_0^2 & \sigma_{01} & \sigma_{0u} \\ \cdot & \sigma_1^2 & \sigma_{1u} \\ \cdot & \cdot & 1 \end{bmatrix}$
Conditional expectations
$E[\varepsilon_{0i}|D_i=1, x_i, z_i] = \rho_{0u}\sigma_0\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$
$E[\eta_i|D_i=1, x_i, z_i] = \rho_{\eta u}\sigma_\eta\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$
$E[\varepsilon_{0i}|D_i=0, x_i, z_i] = \rho_{0u}\sigma_0\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]$
Estimation
Generalization of the previous two-step approach in the common effect model
Estimating equation
$y_i = x_i\beta_0 + x_iD_i(\beta_1 - \beta_0) + \theta_1D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + \theta_0(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right] + \zeta_i$
where
$\theta_1 = \rho_{0u}\sigma_0 + \rho_{\eta u}\sigma_\eta$
$\theta_0 = \rho_{0u}\sigma_0$
Selection terms obtained by estimating a first-stage probit model for D
ML estimation of the entire model is feasible, but it requires estimation of a trivariate normal dbn (computationally difficult)
$\sigma_{01}$ is not identified since we never observe $y_1$ and $y_0$ for the same i
Upon estimation of $\widehat{\beta}_0, \widehat{\beta}_1$ ...
- Predict $\widehat{y}_{1i}, \widehat{y}_{0i}\ \forall i$
- Estimate $\tau_{ATE}$
  $\widehat{\tau}_{ATE} = \overline{\widehat{y}_{1i} - \widehat{y}_{0i}} = \bar{x}\left(\widehat{\beta}_1 - \widehat{\beta}_0\right)$
  where ATE computes the mean for the entire sample
- ATT is given by
  $\widehat{\tau}_{ATT} = E_{x_i|D_i=1}[x_i(\beta_1 - \beta_0)] + E_{\eta_i|D_i=1}[\eta_i]$
  $= E_{x_i|D_i=1}[x_i(\beta_1 - \beta_0)] + E_{z_i|D_i=1}\left\{\rho_{\eta u}\sigma_\eta\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]\right\}$
  - If there is no selection on unobservable gains, then $\rho_{\eta u} = 0$ $\Rightarrow$ common effect model
  - $\theta_1 - \theta_0 = \rho_{\eta u}\sigma_\eta$ $\Rightarrow$ $\widehat{\theta}_1 - \widehat{\theta}_0$ gives the sign of the selection on gains (which one expects to be positive if obs know their unobservable gains)
  - Estimate obtained by replacing expectations with sample averages within the treatment group
- ATU obtained in similar fashion, but average over x, z in the control group
Stata: -treatreg-, -biprobit-
Selection on Unobservables
Millimet & Tchernis (2011)

Builds on the minimum biased approach (discussed earlier) by offering a bias-corrected procedure
Recall, under certain assumptions the bias of the ATT, ATE at some value of the propensity score, p(x), is given by
$B_{ATT}[p(x)] = \rho_{0u}\sigma_0\frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}$
$B_{ATE}[p(x)] = \{\rho_{0u}\sigma_0 + [1 - p(x)]\rho_{\eta u}\sigma_\eta\}\left\{\frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}\right\}$
where
- $\rho_{0u}$ = selection on unobservables affecting the outcome in the untreated state
- $\rho_{\eta u}$ = selection on unobserved, individual-specific gains
$B_{ATT}[p(x)]$ is minimized at $p^*(x) = 0.5$; $B_{ATE}[p(x)]$ does not have a unique minimum
Minimum-biased (MB) estimation technique
- Stage 1: Estimate the propensity score (e.g., probit model)
- Stage 2: Retain only those observations with a propensity score, $\widehat{p}(x_i)$, within a fixed neighborhood around $p^*(x)$, the bias-minimizing propensity score
- Stage 3: Estimate the ATE or ATT using any propensity-score based estimator that relies on CIA using this sub-sample
For ATE, add Stage 1.5: Estimate the error correlations using the Heckman BVN model
BC estimator amends the previous MB estimator by removing the estimated bias
$\widehat{\tau}_{BC}^k = \widehat{\tau}^k - \int\widehat{B}^k[p(x_i)]f^k(x)dx, \quad k = ATE, ATT, ATU$
where $f^k(x)$ is the appropriate dbn needed to estimate parameter k
Millimet & Tchernis (2011) find some benefit to this estimator, particularly in large samples, using MC
Selection on Unobservables
Higher Moments: Lewbel (2010) approach

Originally proposed as a solution to measurement error, but potentially applicable to more general dependence between x and $\varepsilon$ (Lewbel 1997, 2010)
Setup
- Structural model
  $y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$
- First-stage model
  $D_i = x_i\pi + u_i$
  where
  - x includes the intercept
  - $Cov(\varepsilon, u) \neq 0$
D may be discrete or continuous
Potential instruments for D include $(z_i - \bar{z})\widehat{u}_i$, where $z \subseteq x$
Estimation requires consistently estimating the first-stage and replacing u with $\widehat{u}$
Validity of the IVs requires
(HM.i) $E[z'u^2] \neq 0$
(HM.ii) $E[z'\varepsilon u] = 0$
Restrictions are satisfied if, say,
$\varepsilon_i = \gamma_\varepsilon c_i + \nu_i$
$u_i = \gamma_uc_i + \widetilde{u}_i$
where $c_i$ is a homoskedastic common factor and the sole source of correlation between $\varepsilon$ and u, and $\widetilde{u}$ is heteroskedastic with variance depending on z
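A minimal sketch of the heteroskedasticity-based instrument under the factor structure just described (the functional forms and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 5000
xz = rng.normal(size=N)          # exogenous regressor; also drives Var(u)
c = rng.normal(size=N)           # homoskedastic common factor (endogeneity source)
u = c + np.exp(0.5 * xz) * rng.normal(size=N)    # heteroskedastic first-stage error
D = 0.5 * xz + u
y = 1.0 * D + 0.5 * xz + c + rng.normal(size=N)  # eps = c + noise, so Cov(eps, u) != 0

# First stage: OLS of D on (1, xz) -> residual uhat
Zx = np.column_stack([np.ones(N), xz])
uhat = D - Zx @ np.linalg.solve(Zx.T @ Zx, Zx.T @ D)

# Lewbel-type instrument: (z - zbar) * uhat
iv = (xz - xz.mean()) * uhat
X = np.column_stack([np.ones(N), D, xz])
Z = np.column_stack([np.ones(N), iv, xz])
beta = np.linalg.solve(Z.T @ X, Z.T @ y)
print(beta[1])   # near the true coefficient of 1 despite no outside instrument
```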
Selection on Unobservables
Higher Moments: Klein & Vella (2009, 2010); Farré et al. (2010)

Setup as in the prior model
- Structural model
  $y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$
- First-stage model
  $D_i = x_i\pi + u_i$
  where
  - x includes the intercept
  - $Cov(\varepsilon, u) \neq 0$
D may be discrete or continuous
Identification assumptions
(KV.i) $\varepsilon_i = S_\varepsilon(z_i)\varepsilon_i^*$ and/or $u_i = S_u(z_i)u_i^*$, where $z \subseteq x$, such that $S_\varepsilon(z_i)/S_u(z_i)$ varies across i
(KV.ii) $E[\varepsilon_i^*u_i^*] = \rho$, which is constant
Under (KV.i) and (KV.ii), the structural model may be re-written as
$y_i = \beta_1D_i + x_i\beta_2 + \rho\left[\frac{S_\varepsilon(z_i)}{S_u(z_i)}u_i\right] + \widetilde{\varepsilon}_i$
where $\widetilde{\varepsilon}_i$ is now a well-behaved error term
The term in brackets acts as a control function since it controls for the selection bias such that, conditional on this term and x, D is no longer correlated with the error term
Klein & Vella (2009) propose a semiparametric estimator of the model
Farré et al. (2010) outline a parametric estimator
Parametric Estimation
Assuming
$S_\varepsilon(z_i) = \sqrt{\exp(z_i\delta_\varepsilon)}$
$S_u(z_i) = \sqrt{\exp(z_i\delta_u)}$
the structural model becomes
$y_i = \beta_1D_i + x_i\beta_2 + \rho\left[\frac{\sqrt{\exp(z_i\delta_\varepsilon)}}{\sqrt{\exp(z_i\delta_u)}}u_i\right] + \widetilde{\varepsilon}_i$
Estimate the first-stage by OLS $\Rightarrow \widehat{u}$
Estimate by OLS
$\ln(\widehat{u}_i^2) = z_i\delta_u + \kappa_i$
and form
$\widehat{S}_u(z_i) = \sqrt{\exp(z_i\widehat{\delta}_u)}$
Substitute $\widehat{u}$ and $\widehat{S}_u(z_i)$ into the structural model and estimate the remaining parameters by NLS
$y_i = \beta_1D_i + x_i\beta_2 + \rho\left[\frac{\sqrt{\exp(z_i\delta_\varepsilon)}}{\widehat{S}_u(z_i)}\widehat{u}_i\right] + \widetilde{\varepsilon}_i$
While one could stop, performance is perhaps improved by adding additional steps
- Given NLS estimates of $\beta_1$ and $\beta_2$ $\Rightarrow \widehat{\varepsilon}$
- Estimate by OLS
  $\ln(\widehat{\varepsilon}_i^2) = z_i\delta_\varepsilon + \kappa_i$
  and form
  $\widehat{S}_\varepsilon(z_i) = \sqrt{\exp(z_i\widehat{\delta}_\varepsilon)}$
- Estimate by OLS
  $y_i = \beta_1D_i + x_i\beta_2 + \rho\left[\frac{\widehat{S}_\varepsilon(z_i)}{\widehat{S}_u(z_i)}\widehat{u}_i\right] + \widetilde{\varepsilon}_i$
Obtain std errors via bootstrap
Selection on Unobservables
Higher Moments: Klein & Vella (2009)

Setup
- Structural model
  $y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$
- First-stage model
  $D_i^* = x_i\pi_1 + u_i$
  where x contains an intercept
When D is binary, one may estimate the first-stage via probit and form an instrument using the propensity score, $\widehat{p}(x)$
Even with no exclusion restriction, $\widehat{p}(x)$ is correlated with D and linearly independent of x (since $\widehat{p}(x) = \Phi(x\widehat{\pi})$)
However, most of this non-linearity occurs in the tails
Additional non-linearity of the IV may be induced if one uses a heteroskedastic probit to form the IV
- $\sigma_u$ is modeled as $\exp(x\delta)$
- $\widehat{p}(x) = \Phi(x\widehat{\pi}/\exp(x\widehat{\delta}))$
- Additional non-linearity is roughly equivalent to using higher-order terms of x as exclusion restrictions
Klein & Vella (2009) also propose a semiparametric version
Selection on Unobservables
Higher Moments: Vella & Verbeek (1997); Rummery et al. (1999)

Vella and Verbeek (1997) propose an alternative IV strategy that may also be valid with heteroskedastic errors
Known as Rank Order IV
Setup as in the prior models
$y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$
$D_i = x_i\pi + u_i$
where
- x includes the intercept
- $Cov(\varepsilon, u) \neq 0$
D may be discrete or continuous
Identification assumptions
(ROIV.i) An agent's level of unobserved heterogeneity responsible for $Cov(\varepsilon, u) \neq 0$ does not impact y; rather, only the agent's relative position or rank order matters
(ROIV.ii) Data can be partitioned into subsets such that agents may be paired across subsets in a manner leading to pairs with identical ranks in their respective subsets but different levels of D
For example, if y is wages, D is participation in a training program, and endogeneity is due to unobserved work ethic, then
- (ROIV.i) implies that the level of one's work ethic does not impact wages, but only the fraction of workers whose work ethic one exceeds
  - I.e., one's level of work ethic is irrelevant; only one's percentile in the dbn of work ethic matters
- (ROIV.ii) implies we can divide the data (say, by region) such that across regions individuals at the same percentile of the dbn of work ethic within their region have different values of D
To proceed, partition the data into mutually exclusive groups, $s = 1, \ldots, S$, on the basis of some attribute, $q_i$ (which may be a subset of x)
Notation
- Define $F(\cdot|q_i)$ as the CDF of u given q
- Let $c_i = F(u_i|q_i)$ be the rank order of obs i in its partition
(ROIV.i) may be expressed formally as
$E[\varepsilon_i|x_i, D_i, u_i, q_i] = E[\varepsilon_i|u_i, q_i] = E[\varepsilon_i|c_i] = m(c_i)$
where $m(\cdot)$ is some fn mapping c to y
- This condition states that $E[\varepsilon_i|u_i, q_i]$ depends only on u and q through the rank order, $c_i$
- Vella & Verbeek (1997) refer to this as the "order restriction"
The order restriction is useful for identifying the model since it implies that agents from different partitions, $q_i \neq q_j$, but with identical rank orders, $c_i = c_j$, are identical along the unobserved dimension responsible for the endogeneity
To be useful, however, this requires an additional assumption, (ROIV.ii), such that these comparable pairs of agents have different values of D
Estimation
Re-write the structural model as

    y_i = β₁D_i + x_iβ₂ + m(c_i) + η_i

where η is now a well-behaved error term; m(c) is another example of a control function, but c and m(·) are unknown
Estimate c_i by
- Estimating the first-stage model via OLS ⇒ û_i
- Estimating ĉ_i nonparametrically using the empirical CDF within each of the S partitions based on q
Approximate m(c) using a finite-order polynomial in c
Alternatively, one may estimate the original structural model

    y_i = β₁D_i + x_iβ₂ + ε_i

by IV with the instrument given by the residual, ν̂, obtained after OLS estimation of the model

    D_i = δ₀ + δ₁ĉ_i + ν_i
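The estimation steps above can be sketched as follows. This is a minimal numpy simulation under assumed functional forms (the DGP, partition variable, and cubic polynomial for m(c) are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 6000, 6
q = rng.integers(0, S, n)                        # partition attribute (e.g., region)
u = rng.normal(size=n) + 0.5 * q                 # first-stage error; dbn of u varies by q
D = 1.0 + 0.8 * q + u                            # endogenous regressor
eps = 0.7 * (u - 0.5 * q) + rng.normal(size=n)   # endogeneity: eps depends on u only via its rank within q
y = 2.0 * D + eps                                # structural model, beta1 = 2

# Step 1: first-stage OLS residuals u_hat
X1 = np.column_stack([np.ones(n), q])
u_hat = D - X1 @ np.linalg.lstsq(X1, D, rcond=None)[0]

# Step 2: rank order c_hat = empirical CDF of u_hat within each partition
c_hat = np.empty(n)
for s in range(S):
    m = q == s
    c_hat[m] = (np.argsort(np.argsort(u_hat[m])) + 1) / (m.sum() + 1)

# Step 3: OLS with a finite-order polynomial in c_hat as the control function
X2 = np.column_stack([np.ones(n), D, c_hat, c_hat**2, c_hat**3])
beta = np.linalg.lstsq(X2, y, rcond=None)[0]     # beta[1] should be close to 2
```

Identification comes from individuals in different partitions with the same rank order but different values of D, exactly as (ROIV.ii) requires.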
Selection on Unobservables
Covariance Restrictions

Setup
- Structural model

    y_i = β₀ + β₁D_i + x_iβ₂ + ε_i

- First-stage model

    D_i = γ₀ + x_iγ₁ + u_i

- Reduced form model

    y_i = (β₀ + β₁γ₀) + x_i(β₁γ₁ + β₂) + (ε_i + β₁u_i)
        = π₀ + π₁x_i + ω_i

With no IV, estimable quantities include: γ₀, γ₁, π₀, π₁
- These four quantities are functions of five structural parameters: β₀, β₁, β₂, γ₀, γ₁
- Thus, the model is under-identified
What about the covariance matrix of the system of reduced form eqtns? β₁ also shows up there

    y_i = π₀ + π₁x_i + (ε_i + β₁u_i)
    D_i = γ₀ + x_iγ₁ + u_i

Assume ε, u ~ N₂(0, 0, σ²_ε, σ²_u, ρ); then ω, u are also mean zero with covariance matrix

    Ω = [ σ²_ε + β₁²σ²_u + 2β₁ρσ_εσ_u    ρσ_εσ_u + β₁σ²_u ]   [ ω₁₁  ω₁₂ ]
        [            ·                        σ²_u        ] = [  ·   ω₂₂ ]

Three quantities are estimable based on MLE of the system: ω₁₁, ω₁₂, ω₂₂
- These 3 quantities are functions of 4 structural parameters: β₁, σ_ε, σ_u, ρ
- Thus, the model remains under-identified
Intuition: place restrictions on other parameters in Ω in order to identify β₁ from the cov matrix; the intercept and slope parameters are then all identified as well
Model is then estimated via ML

    ln L = Σ_i { (1/2) ln|Ω⁻¹| − (1/2) ε_i′Ω⁻¹ε_i }

where ε_i is the vector of errors for obs i
Note: If D is instead modelled as a LDV, then the likelihood must be factored appropriately to account for the fact that one eqtn has a discrete outcome
Realistic restrictions may be easier to devise if one adds additional outcomes that also depend on the same endogenous regressor
- Ex: K = 2

    y_{1i} = π₁₀ + π₁₁x_i + (ε_{1i} + β₁₁u_i)
    y_{2i} = π₂₀ + π₂₁x_i + (ε_{2i} + β₂₁u_i)
    D_i = γ₀ + x_iγ₁ + u_i

which entails

    Ω = [ ω₁₁  ω₁₂  ω₁₃ ]
        [  ·   ω₂₂  ω₂₃ ]
        [  ·    ·   ω₃₃ ]

with

    ω₁₁ = σ²_{ε₁} + β₁₁²σ²_u + 2β₁₁σ_{ε₁u}
    ω₁₂ = σ_{ε₁ε₂} + β₁₁σ_{ε₂u} + β₂₁σ_{ε₁u} + β₁₁β₂₁σ²_u
    ω₁₃ = σ_{ε₁u} + β₁₁σ²_u
    ω₂₂ = σ²_{ε₂} + β₂₁²σ²_u + 2β₂₁σ_{ε₂u}
    ω₂₃ = σ_{ε₂u} + β₂₁σ²_u
    ω₃₃ = σ²_u

- If y₁, y₂ are similar (e.g., two anthropometric measures), might impose ρ₁ = ρ₂ (the correlations of ε₁ and ε₂ with u) and might have a strong prior for ρ₁₂
Types of restrictions
Altonji et al. (2005)-type restrictions: impose values for ρ and track estimates of β₁
Factor Structure
- Add additional outcomes
- Decompose errors as

    ε_{ki} = λ_kθ_i + ζ_{ki},  k = 1, ..., K
    u_i = λ_uθ_i + ξ_i

where θ has unit var (a normalization, not an assumption), θ, ζ, ξ are assumed to be independent, and the λ are known as factor loadings
- Factor structure assumes all cross-eqtn correlation is through θ
- Parameters to be estimated from Ω: σ_{ζk}, λ_k, β_{1k}, λ_u, σ_ξ
  - This is 3K + 2 parameters in total
  - Estimable quantities from Ω is (K + 1)K/2
  - (K + 1)K/2 ≥ 3K + 2 ⇒ K ≥ 6
Hogan and Rigobon (2003), Rigobon (2003) propose an Identification through Heteroskedasticity estimator that is very similar
Selection on Unobservables
Distributional Approaches

Relatively recent work has begun to address endogeneity in the context of distributional models
Other estimators not discussed here
1. Fixed effect QR models (Koenker 2004)
2. Nonparametric bounds applied to QR models (Giustinelli 2011)
Selection on Unobservables
Distributional Approaches: Changes-in-Changes

Recall, standard DID strategy
- Assume treatment group observed pre- and post-intervention
- Assume control group observed in same time periods
- Assume treatment and control groups follow same time trend absent treatment
- Estimate treatment effect by the additional change over time in the treatment group relative to the control group
Idea is extendable beyond just average treatment effects
Model does require panel data or repeated cross-sections
Setup (Athey & Imbens 2006)
Notation
- Individual i belongs to a group G_i ∈ {0, 1}, where G = 1 is the treatment group
- Individual i observed at time T_i ∈ {0, 1}
- y^N_i, y^I_i = potential outcomes in the non-treated (N) and treated (intervention, I) states
- y_i = (1 − I_i)y^N_i + I_i y^I_i = observed outcome, where I_i = treatment (intervention) indicator
- I_i = G_iT_i
Standard DID
- Untreated outcome

    y^N_i = α + βT_i + γG_i + ε_i

- Constant treatment effect assumption

    τ = y^I_i − y^N_i

- Combining the above two assumptions yields

    y_i = α + βT_i + γG_i + τI_i + ε_i

where
  - τ = ATE with the constant treatment effect assumption
  - τ = ATT with a heterogeneous treatment effect assumption
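The standard DID estimator above reduces to a difference of four cell means; a minimal sketch on simulated repeated cross-sections (DGP values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
G = rng.integers(0, 2, n)                 # group indicator
T = rng.integers(0, 2, n)                 # time indicator
# y = alpha + beta*T + gamma*G + tau*G*T + noise, with tau = 2.0
y = 1.0 + 0.5 * T + 0.3 * G + 2.0 * G * T + rng.normal(0, 0.1, n)

def cell_mean(g, t):
    return y[(G == g) & (T == t)].mean()

# DID: extra change over time in the treatment group relative to the control group
tau_hat = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
```

The same number is the OLS coefficient on the interaction G·T in the regression displayed above.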
Generalizing the standard model
- Untreated outcome

    y^N_i = h(U_i, T_i)

where
  - h(u, t) is increasing in u
  - u_i = unobservable attribute of i
  - y^N is identical across individuals within a time period with identical u, irrespective of G
- Dbn of u may vary by G, but not over time within G ⇒ u_i ⊥ T_i | G_i
- In the absence of treatment...
  - Any differences in outcomes across groups are entirely due to differences in the dbn of u across groups
  - Any changes in outcomes within groups over time are due to differences in h(u, 0) and h(u, 1) [i.e., since unobservables do not change over time, the effect of unobservables on the untreated outcome must change over time]
- Treated outcome

    y^I_i = h^I(U_i, T_i)

where h^I(u, t) is increasing in u
Changes-in-changes model
Notation
- Conditional dbns

    y^N_{gt} ~ y^N | G = g, T = t
    y^I_{gt} ~ y^I | G = g, T = t
    y_{gt} ~ y | G = g, T = t
    U_g ~ U | G = g

- Inverse CDFs

    F⁻¹_y(q) = inf{y : F_y(y) ≥ q}

Goal
- Devise a set of assumptions to identify the dbn of y^N_{11}, F_{y^N,11}, which is (one of) the distributions of missing counterfactuals
- Observable dbns include: F_{y^N,10}, F_{y^I,11}, F_{y^N,00}, and F_{y^N,01}
Assumptions
(CIC.i) Model: y^N = h(U, T)
(CIC.ii) Strict monotonicity: h(u, t) is strictly increasing in u for t ∈ {0, 1}
(CIC.iii) Time invariance within groups: U ⊥ T | G
(CIC.iv) Support: supp(U₁) ⊆ supp(U₀)
Estimator
Counterfactual CDF

    F̂_{y^N,11}(y) = F_{y,10}(F⁻¹_{y,00}(F_{y,01}(y)))

which is estimable using empirical CDFs
Treatment effect estimate

    τ̂^CIC_q = F̂⁻¹_{y^I,11}(q) − F̂⁻¹_{y^N,11}(q)

Note, τ̂^CIC_q is the difference in two QTE (Firpo 2007) estimates

    τ̂^CIC_q = τ̂^QTE_{q,1} − τ̂^QTE_{q′,0}

where
- τ̂^QTE_{q,1} is the change over time in y at quantile q for the G = 1 group
- τ̂^QTE_{q′,0} is the change over time in y at quantile q′ for the G = 0 group, where q′ is the quantile in the G = 0, T = 0 dbn corresponding to the value of y associated with quantile q in the G = 1, T = 0 dbn
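The CIC counterfactual can be computed directly from empirical CDFs and quantile functions. A minimal sketch on simulated data (the DGP, with h(u,t) = u + t·u², is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
u0 = rng.uniform(0, 1, n)            # control-group draws of U
u1 = rng.uniform(0, 1, n) ** 0.5     # treatment-group draws of U (different dbn)
y00, y01 = u0, u0 + u0**2            # control group: h(u,0), h(u,1)
y10 = u1                             # treated group, period 0
y11 = u1 + u1**2 + 0.5               # treated group, period 1: h(u,1) plus effect of 0.5

def ecdf(sample, v):
    return np.searchsorted(np.sort(sample), v, side="right") / len(sample)

# Counterfactual quantile: invert F_hat_{yN,11}(y) = F_{y,10}(F^{-1}_{y,00}(F_{y,01}(y)))
q = 0.5
v = np.quantile(y10, q)              # F^{-1}_{y,10}(q)
r = ecdf(y00, v)                     # F_{y,00}(v)
cf = np.quantile(y01, r)             # F^{-1}_{y,01}(r) = counterfactual q-th quantile
tau_cic = np.quantile(y11, q) - cf   # should be close to the true QTE of 0.5
```

The composition of empirical CDFs maps a treated-group rank in period 0 into the control group, moves it through time, and maps it back, exactly as in the formula above.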
Alternative estimator
- QDID treatment effect estimator

    τ̂^QDID_q = F̂⁻¹_{y^I,11}(q) − F̂⁻¹_{y^N,11}(q)

where

    F̂⁻¹_{y^N,11}(q) = F⁻¹_{y,10}(q) + [F⁻¹_{y,01}(q) − F⁻¹_{y,00}(q)]

which corresponds to

    τ̂^QDID_q = τ̂^QTE_{q,1} − τ̂^QTE_{q,0}

where τ̂^QTE_{q,1}, τ̂^QTE_{q,0} is the change over time in y at quantile q for G = 1, 0, respectively
- Relies on (perhaps) unrealistic assumptions
Counterfactual CDF for the control group

    F̂_{y^I,01}(y) = F_{y,00}(F⁻¹_{y,10}(F_{y,11}(y)))

Treatment effect estimate

    τ̂^CIC_{q,0} = F̂⁻¹_{y^I,01}(q) − F⁻¹_{y^N,01}(q)
Notes
Athey & Imbens (2006) discuss extensions to
- Discrete outcomes
- Multiple groups and multiple time periods
- Incorporating covariates
  - Semiparametric specification of potential outcomes

        y^N = h(u, t) + xβ
        y^I = h^I(u, t) + xβ

    where U ⊥ T, X | G
  - OLS estimation of outcomes

        y_i = D_iδ + x_iβ + ε_i

    where D = [GT  (1 − G)T  G(1 − T)  (1 − G)(1 − T)]
  - Perform CIC estimation on

        ỹ_i = y_i − x_iβ̂ = D_iδ + ε̃_i

  - Inverse propensity score weighting alternative?
Panel data allows additional flexibility, but repeated cross sections are sufficient
Inference
- Athey & Imbens (2006) prove asymptotic normality and derive the asymptotic variance
- Bootstrap alternative?
Selection on Unobservables
Distributional Approaches: IV Quantile Regression

Recall, QR model (Koenker & Bassett 1978)
- Assuming linear conditional quantiles, estimation is

    (α̂, β̂) = arg min_{α,β} (1/N) [ Σ_{i: y_i ≥ αD_i + x_iβ} τ|y_i − αD_i − x_iβ| + Σ_{i: y_i < αD_i + x_iβ} (1 − τ)|y_i − αD_i − x_iβ| ]

- May be rewritten as

    (α̂, β̂) = arg min_{α,β} (1/N) Σ_i ρ_τ(ε_i)

where ρ_τ(ε_i) is the check function, defined as

    ρ_τ(ε_i) = [τ − I(ε_i < 0)]ε_i

and ε_i is the residual for i
Parameters of interest are the partial derivatives of the conditional quantile fn w.r.t. x

    ∂E[Q_τ(y|x, D)]/∂x_k

which equals β_k if x enters linearly
For discrete regressors, parameters give the expected change in the conditional quantile fn

    α_τ = E[Q_τ(y|x, D = 1)] − E[Q_τ(y|x, D = 0)]
QR model is biased and inconsistent if D is endogenous
Recall, potential outcomes setup
- y_d, d = 0, 1, are potential outcomes associated with D = 0, 1, respectively
- q(d, x, τ) = conditional quantile fn of potential outcomes
- α_τ = q(1, x, τ) − q(0, x, τ) = QTE (parameter of interest)
IV-QR model (Chernozhukov & Hansen 2005, 2006)
Express the conditional quantile fn as

    y_d = q(d, x, u_d),  u_d ~ U[0, 1]

where q(d, x, τ) is the conditional τth-quantile of potential outcome y_d
Linear (in parameters) conditional quantile fn implies

    q(d, x, τ) = α_τD_i + x_iβ_τ
Assumptions
(IV-QR.i) Potential outcomes: given X = x, for each d, y_d = q(d, x, u_d), where u_d ~ U[0, 1] and q(d, x, τ) is strictly increasing in τ
(IV-QR.ii) Independence: given X = x, u_d ⊥ Z
(IV-QR.iii) Selection: given X = x, Z = z, D = δ(z, x, ν) for unknown fn δ(·) and random vector ν
(IV-QR.iv) Rank similarity: given X = x, Z = z, u_d ~ u_{d′} ∀d, d′
(IV-QR.v) Observed data: y = q(d, x, u_d), D = δ(z, x, ν), x, and z
Note: rank similarity is a bit weaker than rank invariance (whereby U_d = U_{d′} ∀d, d′), and requires that U_d, U_{d′} be equal in distribution only (thus, they may be considered equal ex ante, but are allowed to differ ex post)
Estimation
Consider the objective fn

    (1/N) Σ_i ρ_τ(ε_i)

where

    ρ_τ(ε_i) = [τ − I(ε_i < 0)]ε_i
    ε_i = y_i − αD_i − x_iβ − γD̂_i

and D̂_i is the predicted value from the first-stage regression of D on x, z
Given a correctly specified structural model, γ̂ should equal zero
Algorithm
1. Define a grid of possible values of α: α_j, j = 1, ..., J
2. For each α_j, estimate a QR model with y_i − α_jD_i as the dependent variable and x, D̂_i as covariates
3. Obtain estimates β̂_j, γ̂_j, j = 1, ..., J
4. Choose α̂_j and β̂_j to minimize |γ̂_j|
Inference via sub-sampling or the typical, nonparametric iid bootstrap, as in the QR model
Can test interesting hypotheses (α_τ = 0; α_τ constant ∀τ; SD; exogeneity)
Easily extendable to multiple endogenous variables, but the grid search increases exponentially
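The grid-search logic can be sketched in numpy on simulated data. Note the inner quantile regression is replaced here by an iteratively reweighted least squares (IRLS) approximation to median (τ = 0.5) regression; a real application would use a proper QR solver, and the DGP is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)
z = rng.normal(size=n)                    # instrument
v = rng.normal(size=n)
u = 0.8 * v + 0.6 * rng.normal(size=n)    # structural error correlated with v
D = z + v                                 # endogenous: Cov(D, u) > 0
y = 1.0 * D + 0.5 * x + u                 # true alpha = 1.0

# First stage: predicted D from regression on (1, x, z)
X1 = np.column_stack([np.ones(n), x, z])
Dhat = X1 @ np.linalg.lstsq(X1, D, rcond=None)[0]

def median_reg(X, yv, iters=60):
    """Approximate LAD (tau = 0.5 QR) via iteratively reweighted least squares."""
    b = np.linalg.lstsq(X, yv, rcond=None)[0]
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(yv - X @ b), 1e-6)
        WX = X * w[:, None]
        b = np.linalg.solve(X.T @ WX, WX.T @ yv)
    return b

# Grid over alpha; keep |gamma_hat|, the coefficient on Dhat
X2 = np.column_stack([np.ones(n), x, Dhat])
grid = np.arange(0.0, 2.01, 0.25)
gammas = [abs(median_reg(X2, y - a * D)[2]) for a in grid]
alpha_hat = grid[int(np.argmin(gammas))]  # alpha minimizing |gamma_hat|
```

At the true α the dependent variable y − αD no longer depends on D̂, so the coefficient on D̂ collapses toward zero, which is what the grid search exploits.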
Selection on Unobservables
Distributional Approaches: Stochastic Dominance

Recall, previous definitions for stochastic dominance
- First Order Stochastic Dominance: Y₁ FSD Y₀ iff

    F₁(y) ≤ F₀(y) ∀y ∈ 𝒴

with strict inequality for some y (where 𝒴 is the union of the supports of Y₁ and Y₀), or

    y^τ₁ ≥ y^τ₀ ∀τ ∈ [0, 1]

with strict inequality for some τ
- Second Order Stochastic Dominance: Y₁ SSD Y₀ iff

    ∫_{−∞}^y F₁(t)dt ≤ ∫_{−∞}^y F₀(t)dt ∀y ∈ 𝒴

with strict inequality for some y, or

    ∫₀^τ y^t₁ dt ≥ ∫₀^τ y^t₀ dt ∀τ ∈ [0, 1]

with strict inequality for some τ
Recall, previous tests for stochastic dominance
- Test statistics

    d = min sup_z [F(z) − G(z)]
    s = min sup_z ∫_{−∞}^z [F(t) − G(t)] dt

where the min is taken over F − G and G − F
- Tests are based on estimates of d and s using the empirical CDFs
  - Unconditional, or
  - Inverse propensity score weighted
Previous methods assume selection on observables
Failure of this assumption invalidates causal conclusions
Solution (Abadie 2002; Imbens & Rubin 1997)
With a binary IV, Z, the potential distributions of the outcome variable are identified for the subpopulation of compliers
Z_i satisfies the following three assumptions:
- Independence: y₀_i, y₁_i, D₀_i, D₁_i ⊥ Z_i
- Correlation: Pr(Z_i = 1) ∈ (0, 1) and Pr(D₀_i = 1) < Pr(D₁_i = 1)
- Monotonicity: Pr(D₀_i ≤ D₁_i) = 1
where:
  - y₀, y₁ are potential outcomes (subscripts refer to treatment status)
  - D₀, D₁ are potential treatments (subscripts refer to instrument status)
SD tests comparing the distribution of outcomes across the samples with Z = 0 and Z = 1 identify the causal effect of D on y for compliers
Define the CDFs of potential outcomes for compliers as

    F^C₁(y) = E[I(y₁_i ≤ y) | D₁_i = 1, D₀_i = 0]
    F^C₀(y) = E[I(y₀_i ≤ y) | D₁_i = 1, D₀_i = 0]

Abadie (2002) shows

    F^C₁(y) − F^C₀(y) = K·[F₁(y) − F₀(y)]

where
- F₁(y), F₀(y) are the CDFs for the Z = 1, Z = 0 samples
- K = 1/(E[D|Z = 1] − E[D|Z = 0]) < ∞
Implies SD tests on F̂₁(y), F̂₀(y) yield valid inference for the SD rankings of F̂^C₁(y), F̂^C₀(y)
Different Zs yield different results if the treatment effect varies across the population
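Abadie's complier CDF contrast is computable directly from sample moments. A minimal sketch with simulated principal strata (the strata shares and effect size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30000
# Principal strata: compliers (60%), never-takers (25%), always-takers (15%)
strata = rng.choice(["c", "n", "a"], size=n, p=[0.6, 0.25, 0.15])
Z = rng.integers(0, 2, n)
D = np.where(strata == "a", 1, np.where(strata == "n", 0, Z))   # monotonicity holds
base = rng.normal(size=n)
y = base + 1.0 * D * (strata == "c")    # treatment raises complier outcomes by 1

K = 1.0 / (D[Z == 1].mean() - D[Z == 0].mean())   # = 1/Pr(complier) in expectation

def ecdf(sample, pts):
    return np.searchsorted(np.sort(sample), pts, side="right") / len(sample)

pts = np.linspace(-2, 2, 9)
diff = K * (ecdf(y[Z == 1], pts) - ecdf(y[Z == 0], pts))   # F1C(y) - F0C(y)
```

Since the (assumed) treatment effect is positive for compliers, the scaled CDF difference is weakly negative everywhere, i.e., the treated complier dbn first-order dominates.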
Data Issues

Data issues are a fact of life
Frequently encountered are problems pertaining to missing or contaminated data
Sample selection concerns missing data on the dependent variable
Contaminated data refers to a scenario where one is interested in the marginal distribution of a potentially mismeasured variable
Measurement error more generally refers to mismeasured dependent or independent variables
Data Issues
Sample Selection

Population model

    y_i = x_iβ + ε_i,  ε_i ~ N(0, σ²)

Given a random sample, {y_i, x_i}, i = 1, ..., N, OLS is consistent and efficient if the usual assumptions are satisfied
Problem arises when data on y are only available for a non-random sample
- Let S_i = 1 if y_i is observed; S_i = 0 if y_i is unobserved
Note: While the exposition uses a cross-section, a common source of (non-random) selection is attrition in panel data; particularly important in firm-level studies where attrition may be due to firms exiting the market
Example: Certain subpopulations may not be representative of the
population
Implies the following data structure
- Have data on a random sample, {y_i, x_i, S_i}, i = 1, ..., N, but y_i = · if S_i = 0
- Can only use M = Σ_i S_i observations to estimate any model
- Examples
  - Wages only observed for workers
  - Firm profits only observed for firms that remain in business
  - Test scores only observed for test takers
  - House prices only observed for houses on the market (sold?)
Issue
- Is OLS still unbiased and consistent?
- Answer: depends
Heckman Model (Heckman 1979)
Setup

    y_i = x_iβ + ε_i
    S*_i = z_iγ + u_i
    S_i = 1 if S*_i > 0; S_i = 0 if S*_i ≤ 0
    y_i = · if S_i = 0
    (ε_i, u_i) ~ N₂(0, 0, σ²_ε, 1, ρ)

x, z are exogenous
Problem
- E[y|x] = xβ, but

    E[y|x, z, S = 1] = xβ + E[ε|x, z, u > −zγ]
                     = xβ + ρσ_ε·φ(zγ)/Φ(zγ)

where φ(zγ)/Φ(zγ) is the Inverse Mills Ratio from before
- Implies that E[y|x, S = 1] = xβ iff ρ = 0
- OLS estimation of

    y_i = x_iβ + ε_i

using only the M observations omits the IMR term, which implies that

    ε_i = ρσ_ε·φ(zγ)/Φ(zγ) + η_i

which is not mean zero, and is not independent of x, unless ρ = 0
Solution
- Estimate the IMR (using i = 1, ..., N)
  - Estimate a probit model, where S is the dependent variable and z are the covariates ⇒ γ̂
  - Obtain ÎMR_i = φ(z_iγ̂)/Φ(z_iγ̂)
- Regress y_i on x_i, ÎMR_i via OLS (using i = 1, ..., M)
- Known as the Heckman two-step method
- Test of endogenous selection

    H₀: θ = 0
    Hₐ: θ ≠ 0

where θ is the coefficient on the IMR
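The two-step procedure above can be sketched in numpy: a probit first stage (fit here by Fisher scoring), the IMR, then OLS on the selected sample. The DGP, with ρσ_ε = 0.8 and an exclusion restriction w, is an illustrative assumption:

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(5)
n = 20000
x = rng.normal(size=n)
w = rng.normal(size=n)                     # exclusion restriction: in z, not in x
u = rng.normal(size=n)
eps = 0.8 * u + 0.6 * rng.normal(size=n)   # Corr(eps, u) > 0, sigma_eps = 1
S = (0.5 + 1.0 * x + 1.0 * w + u > 0).astype(float)
y = 1.0 + 2.0 * x + eps                    # true beta on x = 2.0

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
phi = lambda t: np.exp(-0.5 * t**2) / sqrt(2.0 * pi)

# Step 1: probit of S on z = (1, x, w) via Fisher scoring
Z = np.column_stack([np.ones(n), x, w])
g = np.zeros(3)
for _ in range(25):
    idx = Z @ g
    P = np.clip(Phi(idx), 1e-10, 1 - 1e-10)
    p = phi(idx)
    score = Z.T @ (p * (S - P) / (P * (1.0 - P)))
    info = (Z * (p**2 / (P * (1.0 - P)))[:, None]).T @ Z
    g = g + np.linalg.solve(info, score)

imr = phi(Z @ g) / Phi(Z @ g)              # estimated Inverse Mills Ratio

# Step 2: OLS of y on (1, x, IMR) using only the selected (S = 1) sample
sel = S == 1
X = np.column_stack([np.ones(int(sel.sum())), x[sel], imr[sel]])
b = np.linalg.lstsq(X, y[sel], rcond=None)[0]   # b[2] estimates rho*sigma_eps
```

OLS of y on x alone within the selected sample would be biased; including the IMR restores consistency, and a t-test on b[2] is the endogenous-selection test above.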
Notes
- Usual OLS standard errors are incorrect since the IMR is predicted; must account for the additional uncertainty due to estimation of γ
- Other complications in the derivation of standard errors
- Need an exclusion restriction(s)
  - A variable in z not in x
  - Otherwise the model is identified from the non-linearity of the IMR, which arises solely from the assumption of joint normality
  - However, even though technically identified from the non-linearity, substantial collinearity in practice makes identification questionable
- Model can be estimated in one step by ML
  - More efficient if the model assumptions are valid
  - Less robust in general since more dependent on functional form assumptions
Stata: -heckman-, -heckman2-
QR alternative
Assume the latent outcome is

    y*_i = x_iβ + u_i

y* is unobserved; instead observe

    y_i = y*_i if observed; · otherwise

QR model estimated using data {ỹ_i, x_i}, where

    ỹ_i = y_i if observed; min{y_i} otherwise

yields

    β̂_τ = arg min_β (1/N) Σ_i ρ_τ(ỹ_i − x_iβ)

which is consistent as long as all missing values of y*_i ≤ Q_τ(y*|x)
More generally, the QR model estimated using data {ỹ_i, x_i}, where

    ỹ_i = y_i if observed; imputed value otherwise

yields

    β̂_τ = arg min_β (1/N) Σ_i ρ_τ(ỹ_i − x_iβ)

which is consistent as long as the imputed values lie on the correct side of Q_τ(y*|x)
Example: [figure comparing fitted lines under selection]
Lines shown: 'true' OLS fitted line; 'true' LAD fitted line; OLS fitted line, y > 0 only; LAD fitted line
NOTE: x~U[0,1]; ystar = −0.25 + x + e; e~N(0, 0.25²); y = ystar if ystar > 0.
LAD fitted line obtained by first replacing y = 10 if ystar > true LAD line, −10 otherwise.
Multiple selection criteria
Setup

    y_i = x_iβ + ε_i
    S*_{1i} = z_{1i}γ₁ + u_{1i};  S_{1i} = 1 if S*_{1i} > 0, 0 if S*_{1i} ≤ 0
    S*_{2i} = z_{2i}γ₂ + u_{2i};  S_{2i} = 1 if S*_{2i} > 0, 0 if S*_{2i} ≤ 0
    y_i = · if S_{1i}S_{2i} ≠ 1
    (ε_i, u_{1i}, u_{2i}) ~ N₃(0, 0, 0, σ²_ε, 1, 1, ρ₁, ρ₂, ρ₁₂)

x, z are exogenous
Estimation
- Same as above, except with two IMR terms

    ÎMR_{1i} = φ(z_{1i}γ̂₁)/Φ(z_{1i}γ̂₁);  ÎMR_{2i} = φ(z_{2i}γ̂₂)/Φ(z_{2i}γ̂₂)

- Coefficients on each IMR term are ρ₁σ_ε and ρ₂σ_ε
Examples
- Grameen Bank: only observe the outcome of credit amount if the village contains a bank, and income makes one eligible
- Child care: only observe the price paid for child care if work and use market-based day care
Regime switching models
Setup

    S*_i = z_iγ + u_i
    S_i = 1 if S*_i > 0; 0 if S*_i ≤ 0
    y_i = x_iβ₁ + ε_{1i} if S_i = 1
        = x_iβ₀ + ε_{0i} if S_i = 0

which is the previous model for treatment effects
Applicable to any situation where one thinks the determinants of the outcome (i.e., β) differ across groups or regimes
May be extended to multiple regimes

    S*_i = z_iγ + u_i
    S_i = 0 if S*_i ≤ 0
        = 1 if S*_i ∈ (0, μ₁]
        = 2 if S*_i ∈ (μ₁, μ₂]
        ...
        = K if S*_i > μ_{K−1}

    y_i = x_iβ₀ + ε_{0i} if S_i = 0
        = x_iβ₁ + ε_{1i} if S_i = 1
        ...
        = x_iβ_K + ε_{Ki} if S_i = K
Estimate each regime separately

    y_i = x_iβ_k + ρ_{εu,k}σ_{ε_k}·ÎMR_{ki} + η_{ki}

where

    ÎMR_{ki} = −φ(z_iγ̂)/[1 − Φ(z_iγ̂)]                                                        if S_i = 0
             = [φ(μ̂_{k−1} − z_iγ̂) − φ(μ̂_k − z_iγ̂)]/[Φ(μ̂_k − z_iγ̂) − Φ(μ̂_{k−1} − z_iγ̂)]   if S_i = k ∈ {1, 2, ..., K − 1}
             = φ(μ̂_{K−1} − z_iγ̂)/[1 − Φ(μ̂_{K−1} − z_iγ̂)]                                    if S_i = K

and μ₀ ≡ 0 and γ is estimated via ordered probit
Examples
- Wages by firm size (Main & Reilly 1993)
- Various outcomes by education or household size
Regime switching models with unknown switch point
Setup

    S*_i = z_iγ + u_i
    S_i = 1 if S*_i > c; 0 if S*_i ≤ c
    y_i = x_iβ₁ + ε_{1i} if S_i = 1
        = x_iβ₀ + ε_{0i} if S_i = 0

where S* is observed, but c is unknown
Estimation
- ML, where c is an unknown parameter
- Grid search:
  - Estimate the model for several plausible values of c
  - ĉ and the resulting estimates β̂ are those that minimize the total SSE
- Examples
  - Wages of PT vs. FT (Hotchkiss 1991)
  - Outcomes of DCs vs. LDCs
  - Stock market performance of large vs. small firms
Separate literature on selection models with panel data
Bounding distributions (Blundell et al. 2007)
Notation
- W* = latent outcome variable
- E = selection indicator
- W = outcome variable, where W = W* if E = 1; · otherwise
- X = covariate vector
Goal: bound the CDF F(w|x) given the observable CDF F(w|x, E = 1)
Examples:
- Dbn of wages under full employment
- Dbn of child health under full HI coverage
- Dbn of student achievement under universal attendance at public schools
- Dbn of test scores on college entrance exams with full participation
Worst case bounds
Identity

    F(w|x) = F(w|x, E = 1)p(x) + F(w|x, E = 0)[1 − p(x)]

where p(x) = Pr(E = 1|x)
F(w|x, E = 0) is unknown, but must lie in the unit interval
Replacing F(w|x, E = 0) with zero and one yields

    F(w|x, E = 1)p(x) ≤ F(w|x) ≤ F(w|x, E = 1)p(x) + [1 − p(x)]

Example (ignoring x):
- F(10|E = 1) = 0.4
- Pr(E = 1) = 0.9
⇒ F(10) ∈ [0.36, 0.46]
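The worst-case identity is a one-line computation; a minimal sketch reproducing the numerical example above:

```python
def worst_case_bounds(F_obs, p):
    """Worst-case bounds on F(w) given F(w|E=1) = F_obs and p = Pr(E=1)."""
    return F_obs * p, F_obs * p + (1.0 - p)

lo, hi = worst_case_bounds(0.4, 0.9)   # the slide's example: bounds [0.36, 0.46]
```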
Can be rewritten in terms of bounds on quantiles

    w_{q,l}(x) ≤ w_q(x) ≤ w_{q,u}(x)

where
- w_q(x) = qth quantile of F(w|x)
- w_{q,l}(x) is the value of w that solves

    q = F(w|x, E = 1)p(x) + [1 − p(x)]
    ⇒ w = F⁻¹( [q − (1 − p(x))]/p(x) | x, E = 1 )

- w_{q,u}(x) is the value of w that solves

    q = F(w|x, E = 1)p(x)
    ⇒ w = F⁻¹( q/p(x) | x, E = 1 )
Example
- q = 0.5, p(x) = 0.9
- w_{q,l}(x) = F⁻¹(q′′|x, E = 1), where q′′ = (0.5 − 0.1)/0.9 = 0.4/0.9 ≈ 0.44
- w_{q,u}(x) = F⁻¹(q′|x, E = 1), where q′ = 0.5/0.9 ≈ 0.56
⇒ bounds on the median are given by the values of the observed conditional dbn at (roughly) the 44th and 56th quantiles
Notes
- Bounds cannot be used to determine if selection is non-random; they only assess the possible consequences
- Bounds are only estimable for q ∈ [1 − p(x), p(x)]
- Bounds converge to point estimates as p(x) → 1
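The quantile version of the bounds maps directly to sample quantiles of the observed (E = 1) data; a minimal sketch using a uniform observed dbn as the illustrative input:

```python
import numpy as np

def quantile_bounds(w_obs, q, p):
    """Bounds on the q-th quantile of F(w) from the observed (E=1) sample and p = Pr(E=1).
    Assumes 1 - p <= q <= p, so both adjusted quantiles lie in [0, 1]."""
    lo = np.quantile(w_obs, (q - (1.0 - p)) / p)
    hi = np.quantile(w_obs, q / p)
    return lo, hi

w = np.linspace(0.0, 1.0, 10001)       # observed dbn approx. U(0,1)
lo, hi = quantile_bounds(w, 0.5, 0.9)  # median bounded by the 4/9 and 5/9 quantiles
```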
Positive selection
Stochastic dominance
- One characterization of positive selection is to assume that F(w|x, E = 1) FSD F(w|x, E = 0)

    ⇒ F(w|x, E = 1) ≤ F(w|x, E = 0) ∀w, ∀x

- Equivalent to Pr(E = 1|W ≤ w, x) ≤ Pr(E = 1|W > w, x)
- Bounds on F(w|x) become

    F(w|x, E = 1) ≤ F(w|x) ≤ F(w|x, E = 1)p(x) + [1 − p(x)]

since the missing term, F(w|x, E = 0), is now bounded from below at F(w|x, E = 1)
Example (ignoring x):
- F(10|E = 1) = 0.4
- Pr(E = 1) = 0.9
⇒ F(10) ∈ [0.4, 0.46], whereas the worst-case bounds were [0.36, 0.46]
Median restriction
- A weaker characterization is to assume (conditional on x) that w_{0.5(E=1)} ≥ w_{0.5(E=0)}
- Equivalent to Pr(E = 1|W ≤ w_{0.5(E=1)}, x) ≤ Pr(E = 1|W > w_{0.5(E=1)}, x)
- Bounds on F(w|x) become

    F(w|x, E = 1)p(x) ≤ F(w|x) ≤ F(w|x, E = 1)p(x) + [1 − p(x)]                     if w < w_{0.5(E=1)}
    F(w|x, E = 1)p(x) + 0.5[1 − p(x)] ≤ F(w|x) ≤ F(w|x, E = 1)p(x) + [1 − p(x)]    if w ≥ w_{0.5(E=1)}

- Bounds are tightened (relative to worst case) only above the median since the missing term, F(w|x, E = 0), is now bounded from below at 0.5 for w ≥ w_{0.5(E=1)} (instead of zero)
Exclusion restriction
Conditional independence
- Assume z satisfies

    F(w|x, z) = F(w|x) ∀w, x, z

- Bounds on F(w|x) become

    max_z F(w|x, z, E = 1)p(x, z) ≤ F(w|x) ≤ min_z { F(w|x, z, E = 1)p(x, z) + [1 − p(x, z)] }

- If conditional independence is not true, the bounds may cross; failure of the bounds to cross does not prove conditional independence holds
Monotonicity
- Higher values of z improve the dbn in a FSD sense

    F(w|x, z′) ≤ F(w|x, z′′) ∀w, x, z′, z′′ s.t. z′ ≥ z′′

- Bounds on F(w|x, z₁) become

    max_{z ≥ z₁} F(w|x, z, E = 1)p(x, z) ≤ F(w|x, z₁) ≤ min_{z ≤ z₁} { F(w|x, z, E = 1)p(x, z) + [1 − p(x, z)] }

- Bounds on F(w|x) are obtained by integrating over the dbn of z; entails computing the weighted average of the upper and lower bounds across the different values, z₁, where the weights are the sample proportions, Pr(z = z₁|x)
Bounding differences in QTEs across groups accounting for non-random selection
Notation
- D ∈ {0, 1} indexes groups
- T ∈ {0, 1} indexes time period
Bounds on QTEs across groups in a given time period

    w_{q,l}(1, T) − w_{q,u}(0, T) ≤ w_q(1, T) − w_q(0, T) ≤ w_{q,u}(1, T) − w_{q,l}(0, T)

Bounds on QTEs across time for a given group

    w_{q,l}(D, 1) − w_{q,u}(D, 0) ≤ w_q(D, 1) − w_q(D, 0) ≤ w_{q,u}(D, 1) − w_{q,l}(D, 0)
Bounds on diff-in-diff QTEs across groups

    [w_q(1, 1) − w_q(0, 1)] − [w_q(1, 0) − w_q(0, 0)] ∈ [LB, UB]

where

    LB = [w_{q,l}(1, 1) − w_{q,u}(0, 1)] − [w_{q,u}(1, 0) − w_{q,l}(0, 0)]
    UB = [w_{q,u}(1, 1) − w_{q,l}(0, 1)] − [w_{q,l}(1, 0) − w_{q,u}(0, 0)]

- Example: Change in the median wage gap between males and females over the period T = 0 to T = 1
Level set restrictions
- Assume the diff-in-diff QTE, [w_q(1, 1) − w_q(0, 1)] − [w_q(1, 0) − w_q(0, 0)], is constant across different values of some covariate x ∈ A
- Calculate LB(x), UB(x) ∀x ∈ A
- New LB, UB given by

    LB = max_{x∈A} LB(x)
    UB = min_{x∈A} UB(x)

Test statistics derived in Blundell et al. for bounds crossings, and for whether the observed conditional distribution, F(w|x, E = 1), lies within the bounds
Inference via bootstrap
Bounding differences in average treatment effects across groups accounting for non-random selection
- Lechner and Melly (2007)
- Imai (2008)
- Lee (2009)
- Huber and Mellace (2011)
Data Issues
Contamination

Horowitz and Manski (1995); see also Chen et al. (JEL 2011)
Goal is to bound the marginal distribution of y*, where

    y_i = d_i y*_i + (1 − d_i)ỹ_i

where y* is the true value, ỹ is the mismeasured value, and d = 1 in the absence of contamination (0 otherwise)
Add more!
Data Issues
Measurement Error

Refer to ECO 6374 for a refresher on basics...
Problem: sometimes (often!) data are measured imprecisely; see Bound et al. (2001), Millimet (2011)
Data Issues
ME: Classical Errors-in-Variables (CEV) Model

Continuous dependent variable

    y_i (observed) = y*_i (actual) + μ_i (ME)

- Assumptions
(CEV.i) True model: y*_i = α + βx*_i + ε_i
(CEV.ii) Normality and Mean Zero: μ_i ~ N(0, σ²_μ)
(CEV.iii) Independence: Cov(x*, μ) = 0
- Implications
  - OLS unbiased, consistent
  - Standard errors are correct
  - ↓ R², ↑ standard errors due to the extra noise in the data
Continuous independent variable

    x_i (observed) = x*_i (actual) + ν_i (ME)

- Assumptions (in addition to previous assumptions)
(CEV.iv) Independence: Cov(ε, ν) = 0
- Implications
  - OLS biased, inconsistent unless β = 0
  - β̂_OLS suffers from attenuation bias
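Attenuation bias is easy to see in simulation. A minimal sketch (the DGP is an illustrative assumption; with Var(x*) = Var(ν) = 1, plim β̂_OLS = β·Var(x*)/[Var(x*) + Var(ν)] = β/2):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200000
x_star = rng.normal(0, 1, n)              # true regressor
nu = rng.normal(0, 1, n)                  # classical ME, same variance as x*
x = x_star + nu                           # observed, mismeasured regressor
y = 1.0 + 2.0 * x_star + rng.normal(0, 1, n)   # true beta = 2

slope = np.cov(x, y)[0, 1] / np.var(x)    # OLS slope of y on mismeasured x
# plim = 2 * 1/(1 + 1) = 1.0: attenuated toward zero
```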
Data Issues
ME: Binary Dependent Variable (Hausman et al. 1998)

True model

    D*_i = x*_iβ + ε_i

where * on a variable indicates correctly measured
Given a random sample {D*_i, x*_i}, i = 1, ..., N, assume the logit model is consistent and efficient
- Logit probabilities

    Pr(D* = 1|x) = exp(x_iβ)/[1 + exp(x_iβ)]
    Pr(D* = 0|x) = 1/[1 + exp(x_iβ)]

- Estimation by ML

    ln L = Σ I[D* = 1] ln[Pr(D* = 1|x)] + I[D* = 0] ln[Pr(D* = 0|x)]
With measurement error, do not observe D*_i
- Instead one observes D_i
- Introduce the following notation

    α₀ = Pr(D_i = 1|D*_i = 0)
    α₁ = Pr(D_i = 0|D*_i = 1)

- α₀, α₁ depend on D*, but not on x_i
Estimation
- Probabilities of observed responses

    Pr(D = 1|x) = Pr(D_i = 1|D*_i = 0)Pr(D*_i = 0|x) + Pr(D_i = 1|D*_i = 1)Pr(D*_i = 1|x)
                = α₀ + (1 − α₀ − α₁)·exp(x_iβ)/[1 + exp(x_iβ)]
    Pr(D = 0|x) = 1 − Pr(D = 1|x)
                = 1 − α₀ − (1 − α₀ − α₁)·exp(x_iβ)/[1 + exp(x_iβ)]

- Estimation by ML

    ln L = Σ I[D = 1] ln[Pr(D = 1|x)] + I[D = 0] ln[Pr(D = 0|x)]

- Extension to probit is trivial
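The observed-response likelihood above can be coded directly. A minimal numpy sketch that only evaluates the log-likelihood on simulated misclassified data (the misclassification rates are illustrative assumptions; a full MLE would maximize over (α₀, α₁, β) subject to α₀ + α₁ < 1):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50000
x = rng.normal(size=n)
beta0, beta1 = -0.2, 1.0
a0, a1 = 0.2, 0.1                        # true misclassification rates
p_true = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
D_star = (rng.uniform(size=n) < p_true).astype(int)
flip = np.where(D_star == 0, rng.uniform(size=n) < a0, rng.uniform(size=n) < a1)
D = np.where(flip, 1 - D_star, D_star)   # observed, misclassified indicator

def loglik(a0_, a1_, b0, b1):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    p_obs = a0_ + (1.0 - a0_ - a1_) * p  # Pr(D = 1 | x) from the slide
    return np.sum(D * np.log(p_obs) + (1 - D) * np.log(1.0 - p_obs))

ll_true = loglik(0.2, 0.1, beta0, beta1)   # correct misclassification model
ll_naive = loglik(0.0, 0.0, beta0, beta1)  # ignores misclassification
```

With a large sample, the likelihood at the true (α₀, α₁) dominates the naive model that sets them to zero, which is what drives estimation of the α's.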
Identification
- In the linear probability model (LPM), the conditional expectation is given by

    E[D|x] = Pr(D_i = 1|D*_i = 0)Pr(D*_i = 0|x) + Pr(D_i = 1|D*_i = 1)Pr(D*_i = 1|x)
           = α₀ + (1 − α₀ − α₁)(x_iβ)
           = α₀ + (1 − α₀ − α₁)(β₀ + x_iβ₁)
           = [α₀ + (1 − α₀ − α₁)β₀] + x_i(1 − α₀ − α₁)β₁

which makes clear that identification of α₀, α₁, and β arises from the non-linearity of probit/logit, in addition to ...
- Monotonicity assumption: α₀ + α₁ < 1
- Semiparametric alternatives available
Data Issues
ME: Binary Independent Variable

True model

    y*_i = α + βD*_i + ε_i,  ε_i ~ N(0, σ²_ε)

where * on a variable indicates correctly measured
Given a random sample {y*_i, D*_i}, i = 1, ..., N, assume OLS is consistent and efficient
With measurement error, do not observe D*_i
Instead one observes D_i, where

    D_i (observed) = D*_i (true) + ν_i (ME)

which implies that ν ∈ {0, 1} if D* = 0, and ν ∈ {−1, 0} if D* = 1
Thus, the measurement error is
- Not normally distributed (violates CEV.ii)
- Negatively correlated with D* (violates CEV.iii)
Assumptions
(BME.i) Non-differential classification errors: E[y|D*] = E[y|D, D*]
(BME.ii) D* ⊥ ε
(BME.iii) Cov(D, D*) > 0
(BME.iv) Cov(D*, ν) < 0
Given (BME.i)–(BME.iv), the asymptotic bias is given by

    plim(β̂_OLS) = β·(σ_{D*D*} + σ_{D*ν})/(σ_{D*D*} + σ_{νν} + 2σ_{D*ν})

Results in attenuation bias for β if σ_{νν} + σ_{D*ν} > 0
Likely true for any mismeasured bounded variable
Millimet (2011) conducts a MC study comparing common treatment effect estimators (β = 1)
Partial solutions (Aigner 1973; Bollinger 1996; Black et al. 2000)
Reverse regression
- Estimate via OLS

    D_i = θ₀ + θ₁y*_i + ω_i

- plim given by

    plim(θ̂⁻¹_{1,OLS}) = (β²σ_{D*D*} + σ²_ε)/[β(σ_{D*D*} + σ_{D*ν})]

which is biased up in absolute value
- Implies β̂_{D*,OLS} ∈ [β̂_OLS, θ̂⁻¹_{1,OLS}], where β̂_{D*,OLS} is the OLS estimate if D* were observed (Frisch bounds)
- If R² is low, then the bounds obtained using reverse regression may be uninformative
- IV estimation also yields an upper bound (not a consistent estimate!) that may be more informative in many cases
- Inconsistency of IV results from the fact that any instrument correlated with D* will most likely be correlated with ν since Cov(D*, ν) ≠ 0
Improved lower bound obtained by estimating

    y*_i = α + β₀I[D_i = 0, D′_i = 1] + β₁I[D_i = 1, D′_i = 0] + β₂I[D_i = 1, D′_i = 1] + ε_i

where D′_i is a second mis-measured indicator
- If the measurement errors are independent conditional on actual treatment assignment, D*_i, then

    0 < |E[β̂_OLS]| < |E[β̂_{2,OLS}]| < |β|

Bound β̂_{D*,OLS} under various assumptions concerning the severity of measurement error (papers by Kreider and Pepper)
Full Solutions
Point estimates possible using a method-of-moments framework
Brachet (2008) proposes the following algorithm
1. Estimate the Hausman et al. misclassification probit, including an instrument z in the first stage
2. Replace D with P̂r(D*_i = 1|x, z) in the second stage
McCarthy & Tchernis (2011) consider a similar approach in a Bayesian framework
Partial solutions (Kreider & Pepper 2007)
Utilize a non-regression approach to bound the effect of a mis-measured binary treatment
Authors do not wish to invoke (BME.i), which implies that mis-reporting is independent of outcomes conditional on the truth
Notation
- y ∈ {0, 1} is a binary outcome (correctly measured)
- D* ∈ {0, 1} is the true binary treatment
- D ∈ {0, 1} is the reported binary treatment
- Z ∈ {0, 1}, where Z = 1 if D = D* and 0 otherwise
Estimand of interest: Δ = Pr(y = 1|D* = 1) − Pr(y = 1|D* = 0)
Data provide an estimate of Pr(y = 1|D)
Manipulation yields

    Pr(y = 1 | D* = 1) = Pr(y = 1, D* = 1) / Pr(D* = 1)
                       = [Pr(y = 1, D = 1) + Pr(y = 1, D = 0, Z = 0) − Pr(y = 1, D = 1, Z = 0)]
                         / [Pr(D = 1) + Pr(D = 0, Z = 0) − Pr(D = 1, Z = 0)]

where Pr(D = 1, Z = 0) is a false positive and Pr(D = 0, Z = 0) is a false negative

Data provide estimates of Pr(y = 1, D = 1), Pr(D = 1)

Other elements are unknown, but bounded by the unit interval
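Because this is a counting identity, it can be verified exactly on any simulated sample (all rates below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# simulated data (treatment share, outcome rates, and the 15% error
# rate are illustrative)
d_star = rng.random(n) < 0.4                     # true treatment D*
y = rng.random(n) < np.where(d_star, 0.7, 0.3)   # binary outcome
z = rng.random(n) > 0.15                         # Z = 1: report accurate
d = np.where(z, d_star, ~d_star)                 # reported treatment D

def pr(mask):
    return mask.mean()

# numerator and denominator of Pr(y=1 | D*=1), rewritten in terms of
# observables plus the false-positive / false-negative terms
num = pr(y & d) + pr(y & ~d & ~z) - pr(y & d & ~z)
den = pr(d) + pr(~d & ~z) - pr(d & ~z)

assert np.isclose(num, pr(y & d_star))   # identity holds exactly in-sample
assert np.isclose(den, pr(d_star))
print(num / den)                         # equals Pr(y=1 | D*=1) in-sample
```

The partial-identification problem arises only because the Z = 0 terms are not observed; the algebra itself involves no approximation.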
Lower-Bound Accurate Reporting Rate
- Assume Pr(Z = 1) ≥ v
- Can show that

    Pr(y = 1 | D* = 1) ∈ [ (Pr(y = 1, D = 1) − θ) / (Pr(D = 1) − 2θ + (1 − v)),
                           (Pr(y = 1, D = 1) + ψ) / (Pr(D = 1) + 2ψ − (1 − v)) ]

  where

    θ = min{(1 − v), Pr(y = 1, D = 1)}    if Pr(y = 1, D = 1) − Pr(y = 0, D = 1) − (1 − v) ≤ 0
        max{0, (1 − v) − Pr(y = 0, D = 0)}    otherwise
    ψ = min{(1 − v), Pr(y = 1, D = 0)}    if Pr(y = 1, D = 1) − Pr(y = 0, D = 1) + (1 − v) ≥ 0
        max{0, (1 − v) − Pr(y = 0, D = 1)}    otherwise

- Bounds for Pr(y = 1 | D* = 0) are obtained by replacing D with 1 − D
- Bounds for each term obtained by replacing elements with sample analogs
- Bounds for Δ obtained using relevant upper and lower bounds for each term
- When v = 1, bounds collapse to a point estimate
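A sketch of the bound calculation on simulated data. The symbols θ and ψ stand in for adjustment terms whose original notation did not survive extraction, so treat this as an illustration of the structure (point identification at v = 1, widening intervals as v falls) rather than a faithful reproduction of Kreider & Pepper's formulas:

```python
import numpy as np

def kp_bounds(y, d, v):
    """Bounds on Pr(y=1 | D*=1) when at least a fraction v of reports
    are accurate (formulas as transcribed above)."""
    p11 = np.mean((y == 1) & (d == 1))
    p01 = np.mean((y == 0) & (d == 1))
    p10 = np.mean((y == 1) & (d == 0))
    p00 = np.mean((y == 0) & (d == 0))
    pd1 = np.mean(d == 1)
    e = 1.0 - v                                 # maximal error share
    theta = min(e, p11) if p11 - p01 - e <= 0 else max(0.0, e - p00)
    psi   = min(e, p10) if p11 - p01 + e >= 0 else max(0.0, e - p01)
    lb = (p11 - theta) / (pd1 - 2 * theta + e)
    ub = (p11 + psi) / (pd1 + 2 * psi - e)
    return lb, ub

rng = np.random.default_rng(3)
n = 50_000
d = (rng.random(n) < 0.5).astype(int)                          # reported D
y = (rng.random(n) < np.where(d == 1, 0.6, 0.4)).astype(int)   # outcome

print(kp_bounds(y, d, v=1.0))   # collapses to a point: Pr(y=1 | D=1)
print(kp_bounds(y, d, v=0.9))   # weaker assumption, wider interval
```

Setting v = 1 forces θ = ψ = 0, so both endpoints reduce to Pr(y = 1, D = 1)/Pr(D = 1), the usual sample conditional probability.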
Partial Verification
- Might assume a lower bound for accuracy among some sub-group whose status is more certain, W = 1
- Assume Pr(Z = 1 | W = 1) ≥ v_w
- Can show that

    Pr(y = 1 | D* = 1) ∈
      [ (Pr(y = 1, D = 1, W = 1) − θ)
          / (Pr(D = 1, W = 1) + Pr(y = 0, W = 0) − 2θ + (1 − v_w)·Pr(W = 1)),
        (Pr(y = 1, D = 1, W = 1) + Pr(y = 1, W = 0) + ψ)
          / (Pr(D = 1, W = 1) + Pr(y = 1, W = 0) + 2ψ − (1 − v_w)·Pr(W = 1)) ]

  where

    θ = min{(1 − v_w)·Pr(W = 1), Pr(y = 1, D = 1)}    if λ ≤ 0
        max{0, (1 − v_w)·Pr(W = 1) − Pr(y = 0, D = 0, W = 1)}    otherwise
    ψ = min{(1 − v_w)·Pr(W = 1), Pr(y = 1, D = 0)}    if λ′ ≥ 0
        max{0, (1 − v_w)·Pr(W = 1) − Pr(y = 0, D = 1, W = 1)}    otherwise
    λ  = Pr(y = 1, D = 1, W = 1) − Pr(y = 0, D = 1, W = 1) − Pr(y = 0, W = 0) − (1 − v_w)·Pr(W = 1)
    λ′ = Pr(y = 1, D = 1, W = 1) − Pr(y = 0, D = 1, W = 1) + Pr(y = 1, W = 0) + (1 − v_w)·Pr(W = 1)

- If v_w = 1, then one has full verification for an observed sub-sample ⇒ bounds are tightened
Combine the prior assumptions with a Monotone IV assumption to possibly further tighten the bounds

MIV Assumption
- ∃ x s.t.

    x₀ ∈ [x₁, x₂] ⇒ Pr(y = 1 | D*, x₀) ∈ [Pr(y = 1 | D*, x₁), Pr(y = 1 | D*, x₂)]

- Implies that Pr(y = 1 | D*, x) is weakly monotonically increasing in x
- Proceed by
  - Computing bounds conditional on different values of x
  - Obtaining unconditional bounds by integrating over the dbn of x

Kreider & Hill (2009), Kreider et al. (2011) combine this methodology on reporting errors with prior methods on bounding treatment effects under SOU

Imai & Yamamoto (2010) offer a similar analysis in poli sci
Partial solutions (Battistin & Sianesi 2009)

Consider ME of a binary or multi-valued treatment in the context of propensity score estimators

Setup
(MPS.i) CIA given no ME:  y₀, y₁ ⊥ D* | x
(MPS.ii) CS given no ME:  p*(x) = Pr(D* = 1 | x) ∈ (0, 1) ∀ x
- D* is not observed; instead D is, where D_i ≠ D*_i for at least some i

Estimation based on D yields

    τ_ATE = E{ E[y | D = 1, x] − E[y | D = 0, x] }

where the outer expectation is over S, where

    S = { x : p(x) = Pr(D = 1 | x) ∈ (0, 1) }

In contrast, estimation based on D* ⇒ τ*_ATE
Notation
- (Mis)classification probabilities given by

    π_jj′(x) = Pr(D* = j | D = j′, x),  j, j′ ∈ {0, 1}

  - π₁₀ = proportion of incorrectly reported zeros
  - π₀₁ = proportion of incorrectly reported ones
- Condensed notation for correct reporting rates

    π₀₀(x) = π₀(x) = Pr(D* = 0 | D = 0, x)
    π₁₁(x) = π₁(x) = Pr(D* = 1 | D = 1, x)

- Matrix of (mis)classification probabilities can be written in terms of π₀, π₁

    Λ(x) = [ π₀(x)       1 − π₀(x) ]
           [ 1 − π₁(x)   π₁(x)     ]
Assumptions
(MPS.iii) Non-differential classification errors:  E[y | D*, x] = E[y | D, D*, x]
(MPS.iv) Informative reported treatment status:  π₀(x) + π₁(x) − 1 ≠ 0

Outcomes conditional on D can be written as a weighted average of outcomes conditional on D*

    [ E[y | D = 0, x] ]          [ E[y | D* = 0, x] ]
    [ E[y | D = 1, x] ] = Λ(x) · [ E[y | D* = 1, x] ]

    [ E[y | D* = 0, x] ]            [ E[y | D = 0, x] ]
  ⇒ [ E[y | D* = 1, x] ] = Λ(x)⁻¹ · [ E[y | D = 1, x] ]

provided det[Λ(x)] = π₀(x) + π₁(x) − 1 ≠ 0

Two cases satisfy (MPS.iv)
- Minimal classification errors: π₀(x) + π₁(x) > 1
- Severe classification errors: π₀(x) + π₁(x) < 1
The bias when using D is

    τ_ATE(x) = [π₀(x) + π₁(x) − 1]·τ*_ATE(x)

Implications:
- τ_ATE(x) is unbiased if π₀ = π₁ = 1
- τ_ATE(x) suffers from attenuation bias if π₀(x) + π₁(x) > 1
- τ_ATE(x) suffers from attenuation bias AND sgn[τ_ATE(x)] ≠ sgn[τ*_ATE(x)] if π₀(x) + π₁(x) < 1
- τ_ATE(x) = −τ*_ATE(x) if π₀ = π₁ = 0
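The matrix relation and the resulting bias factor are easy to verify numerically (the reporting rates π₀, π₁ and the conditional means below are illustrative):

```python
import numpy as np

pi0, pi1 = 0.9, 0.8             # Pr(D*=0|D=0,x), Pr(D*=1|D=1,x): illustrative
Ey_true = np.array([1.0, 3.0])  # E[y|D*=0,x], E[y|D*=1,x] -> true tau(x) = 2

Lam = np.array([[pi0, 1 - pi0],
                [1 - pi1, pi1]])
Ey_obs = Lam @ Ey_true          # (E[y|D=0,x], E[y|D=1,x])

tau_obs = Ey_obs[1] - Ey_obs[0]
tau_true = Ey_true[1] - Ey_true[0]

# bias factor: tau_ATE(x) = (pi0 + pi1 - 1) * tau*_ATE(x)
assert np.isclose(tau_obs, (pi0 + pi1 - 1) * tau_true)
# inverting Lambda(x) recovers the D*-conditional means
assert np.allclose(np.linalg.solve(Lam, Ey_obs), Ey_true)
print(tau_obs)   # 0.7 * 2 = 1.4 (up to float rounding)
```

Here π₀ + π₁ − 1 = 0.7 > 0, the "minimal classification errors" case, so the observed effect is attenuated but keeps its sign.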
The bias of the unconditional ATE, τ_ATE, also depends on the erroneous determination of the CS
- Can show that

    p(x) = [p*(x) − (1 − π₀(x))] / [π₀(x) + π₁(x) − 1]

- This implies that boundary values of p(x) can be obtained even if p*(x) ∈ (0, 1):

    p(x) = 0  ⇔  π₀(x) = 1 − p*(x)
    p(x) = 1  ⇔  π₁(x) = p*(x)

To ensure one does not utilize a different CS based on D, must assume
(MPS.v) π₀(x) ≠ 1 − p*(x) and π₁(x) ≠ p*(x)
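A small check of the support mapping and its boundary cases (all values illustrative):

```python
def p_obs(p_star, pi0, pi1):
    """Reported-treatment propensity p(x) implied by the true
    propensity p*(x) and the correct-reporting rates (slide formula)."""
    return (p_star - (1 - pi0)) / (pi0 + pi1 - 1)

p_star = 0.3
print(p_obs(p_star, pi0=0.8, pi1=0.9))          # interior value
print(p_obs(p_star, pi0=1 - p_star, pi1=0.9))   # ~0: boundary despite p* in (0,1)
print(p_obs(p_star, pi0=0.8, pi1=p_star))       # ~1: boundary despite p* in (0,1)
```

The last two calls are exactly the configurations that (MPS.v) rules out: they push the reported-treatment propensity to 0 or 1 even though the true propensity is interior.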
Estimation
- Under (MPS.i)–(MPS.v)

    τ*_ATE = ∫_S ω(x)·τ_ATE(x)·f(x) dx
           = τ_ATE + ∫_S [ω(x) − 1]·τ_ATE(x)·f(x) dx

  where

    ω(x) = [Pr(D = 1) / Pr(D* = 1)] · { 1 + (1/p(x)) · [1 − π₀(x)] / [π₀(x) + π₁(x) − 1] }

    Pr(D* = 1) = ∫_S [1 − π₀(x)]·f(x) dx + ∫_S [π₀(x) + π₁(x) − 1]·p(x)·f(x) dx

- Shows that τ*_ATE can be obtained from an appropriately weighted average of τ_ATE(x)
- Weights depend on π₀(x), π₁(x)
Notes
- Bounds obtained by computing τ̂*_ATE(π₀, π₁) over a grid of values and obtaining the lower and upper bounds
  - Restrictions on possible values of the πs can be imposed based on prior info
  - τ̂*_ATE(π₀, π₁) can be obtained using any propensity-score based estimator
  - In their paper, they use a (5 strata) stratification estimator and assume (π₀, π₁) are stratum-specific
- Extension to multi-valued treatments provided as well
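A stylized sketch of the grid procedure, using only the conditional bias factor from the earlier slide and ignoring the common-support correction (the stratum values, weights, and grid range are illustrative; Battistin & Sianesi's actual estimator re-estimates the propensity-score weighting for each (π₀, π₁) pair):

```python
import numpy as np

# observed stratum-level quantities (numbers purely illustrative)
tau_x = np.array([0.10, 0.15, 0.20, 0.25, 0.30])   # tau_ATE(x) by stratum
f_x   = np.array([0.2, 0.2, 0.2, 0.2, 0.2])        # stratum weights

def tau_star(pi0, pi1):
    # undo the conditional bias factor stratum by stratum, holding
    # (pi0, pi1) constant across strata for simplicity
    return float(np.sum(f_x * tau_x / (pi0 + pi1 - 1)))

# prior info: restrict to the "minimal classification error" region
grid = np.linspace(0.8, 1.0, 21)
vals = [tau_star(p0, p1) for p0 in grid for p1 in grid]
lb, ub = min(vals), max(vals)
print(lb, ub)   # interval contains the error-free value at pi0 = pi1 = 1
```

Tighter prior restrictions on (π₀, π₁) shrink the grid and therefore the reported interval, which is the sense in which the bounds reflect beliefs about the severity of misclassification.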
Data Issues
ME: Missing Binary Independent Variable

Molinari (2010) applies a similar bounding approach to analyze the case where D is missing, possibly non-randomly, due to subject non-response
- Examples:
  - Respondents refuse to answer questions concerning drug use, welfare use, etc.
Millimet (2011) MC study also compares common treatment effect estimators when y or x is measured with error (do not forget the rest of the data! ... = 1)
Data Issues
ME: Persistence of Treatment Effects

Often neglected in applied research is the question of whether treatment effects are persistent

Clearly relevant for policymakers; an investment that improves outcomes for one period only has different benefits than an investment that yields a permanent improvement in outcomes

Jacob et al. (2010) propose an interesting method to estimate the degree of persistence in a treatment effect (under certain circumstances)

Method relies on the preceding analysis of measurement error
Setup

    y_it = y^L_it + y^S_it

where y is the outcome, which is decomposed into a LR component, y^L, and a SR component, y^S
- The two components are given by

    y^S_it = β_S·D_it + ε^S_it
    y^L_it = λ·y^L_i,t−1 + β_L·D_it + ε^L_it

  where D is a treatment (binary, discrete, or continuous)
- Interpretation of parameters
  - λ = persistence of the LR component of y (by definition, the SR component completely decays each period)
  - β_S, β_L = the (common) treatment effects on y^S, y^L

Goal: say something about λ, β_S, and β_L
Consider trying to estimate the LR component equation

    y^L_it = λ·y^L_i,t−1 + β_L·D_it + ε^L_it

Problem: y^L_it, y^L_i,t−1 are unobserved; only y is observed

Some algebra yields

    y_it − y^S_it = λ·(y_i,t−1 − y^S_i,t−1) + β_L·D_it + ε^L_it
    ⇒ y_it = λ·y_i,t−1 + β_L·D_it + [y^S_it − λ·y^S_i,t−1 + ε^L_it]

Notes
- Cov(y_i,t−1, y^S_i,t−1) ≠ 0 ... y^S_i,t−1 is analogous to ME in the desired covariate, y^L_i,t−1
- Cov(D_it, y^S_it) ≠ 0 if β_S ≠ 0

Circumvent this second issue by incorporating D_it into the error term

    y_it = λ·y_i,t−1 + [β_L·D_it + y^S_it − λ·y^S_i,t−1 + ε^L_it]
         = λ·y_i,t−1 + ζ_it
Comparison of estimators ...

OLS yields

    plim λ̂_OLS = λ·[σ²_{y^L} / (σ²_{y^L} + σ²_{y^S})] < λ

using the CEV formula discussed previously

IV using y_i,t−2 as an instrument yields

    plim λ̂_IV,1 = λ

if Cov(y_i,t−2, ε^L_it) = Cov(y_i,t−2, ε^S_it) = Cov(y_i,t−2, ε^S_i,t−1) = Cov(y_i,t−2, D_it) = 0, implying that y_i,t−2 is predetermined and uncorrelated with future treatment status
IV using D_i,t−1 as an instrument

    plim λ̂_IV,2 = Cov(y_it, D_i,t−1) / Cov(y_i,t−1, D_i,t−1)
                 = Cov(λ·y_i,t−1 + β_L·D_it + y^S_it − λ·y^S_i,t−1 + ε^L_it, D_i,t−1) / Cov(y_i,t−1, D_i,t−1)
                 = λ + Cov(β_L·D_it + y^S_it − λ·y^S_i,t−1 + ε^L_it, D_i,t−1) / Cov(y_i,t−1, D_i,t−1)

- Assume Cov(D_i,t−1, D_it) = Cov(D_i,t−1, ε^S_it) = Cov(D_i,t−1, ε^L_it) = 0
- But, Cov(D_i,t−1, y^S_i,t−1) ≠ 0 ⇒ D_i,t−1 is not a valid IV

    plim λ̂_IV,2 = λ − λ·Cov(y^S_i,t−1, D_i,t−1) / Cov(y_i,t−1, D_i,t−1)
                 = λ·[1 − Cov(y^S_i,t−1, D_i,t−1) / Cov(y_i,t−1, D_i,t−1)]
                 = λ·[1 − β_S·Var(D_i,t−1) / ((β_S + β_L)·Var(D_i,t−1))]
                 = λ·[β_L / (β_S + β_L)]
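The three probability limits can be checked by Monte Carlo (parameter values are illustrative; D is drawn i.i.d. so that Cov(D_it, D_i,t−1) = 0 holds as required):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 200_000, 4
lam, bS, bL = 0.8, 1.0, 1.0            # illustrative parameter values

D  = rng.normal(size=(n, T))           # i.i.d. treatment: Cov(D_it, D_i,t-1) = 0
eS = rng.normal(size=(n, T))
eL = rng.normal(size=(n, T))

yS = bS * D + eS                       # SR component: decays fully each period
yL = np.zeros((n, T))                  # LR component: AR(1) with persistence lam
yL[:, 0] = bL * D[:, 0] + eL[:, 0]
for t in range(1, T):
    yL[:, t] = lam * yL[:, t - 1] + bL * D[:, t] + eL[:, t]
y = yL + yS

def cov(a, b):
    return float(np.mean((a - a.mean()) * (b - b.mean())))

yt, yt1, yt2 = y[:, 3], y[:, 2], y[:, 1]     # y_t, y_{t-1}, y_{t-2}
ols = cov(yt, yt1) / cov(yt1, yt1)           # attenuated (CEV-type bias)
iv1 = cov(yt, yt2) / cov(yt1, yt2)           # y_{t-2} as IV: consistent for lam
iv2 = cov(yt, D[:, 2]) / cov(yt1, D[:, 2])   # D_{t-1} as IV: lam * bL/(bS+bL)

print(ols, iv1, iv2)   # roughly: ols < lam, iv1 ~ 0.8, iv2 ~ 0.4
```

With λ = 0.8 and β_S = β_L, the second IV converges to λ·β_L/(β_S + β_L) = 0.4, so the ratio iv2/iv1 estimates the relative contribution of D to the LR component, as the next slide notes.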
Notes:
- Combination of OLS and IV1 can estimate the relative contribution of y^L to y
- Combination of IV1 and IV2 can estimate the relative contribution of D to the LR component
- x's can be incorporated by redefining ε^L_it = x_it·γ_L + η^L_it
- Model requires Cov(D_i,t−1, D_it) = 0, ruling out treatments which persist themselves (e.g., treaties)
  - Examples (perhaps): class size, R&D (?)
In conclusion, listen to the words of Sims (2010):

"Natural, quasi-, and computational experiments, as well as regression discontinuity design (RDD), can all, when well applied, be useful, but none are panaceas... Because we are not an experimental science, we face difficult problems of inference. The same data generally are subject to multiple interpretations. It is not that we learn nothing from data, but that we have at best the ability to use data to narrow the range of substantive disagreement. We are always combining the objective information in the data with judgment, opinion and/or prejudice to reach conclusions...

Natural experiments, difference-in-difference, and regression discontinuity design are good ideas. They have not taken the con out of econometrics; in fact, as with any popular econometric technique, they in some cases have become the vector by which con is introduced into applied studies. Furthermore, over-enthusiasm about these methods, when it leads to claims that single-equation linear models with sandwiched errors are all we ever really need, can lead to our training applied economists who do not understand how to fully model a dataset."
In light of these sentiments, recall the points made at the start of this course:

Prior to conducting, or when reviewing, causal analyses, questions that need to be answered:
1. What is the causal relationship of interest? [Is it economically interesting?]
2. What is the identification strategy?
3. What parameter are you actually estimating?
4. To whom does the parameter apply?
5. What question does the analysis answer?
6. What is the method of statistical inference?

While applied work is open to multiple interpretations, these interpretations and objections to research are lessened when one is precise in answering these questions.