
Econ 5025 Applied Econometrics



York University
Department of Economics
Professor Xianghong Li

Practice Problems

Appendix A (review)

1. Suppose the following equation describes the relationship between the average
number of classes missed during a semester (missed) and the distance from school
(distance, measured in miles):

\[\mathit{missed} = 3 + 0.2\,\mathit{distance}\]

a. Sketch this line, being sure to label the axes. How do you interpret the
intercept in this equation?
b. What is the average number of classes missed for someone who lives five
miles away?
c. What is the difference in the average number of classes missed for
someone who lives 10 miles away and someone who lives 20 miles away?

2. In Example A.2, the quantity of compact discs was related to price and income
by
\[\mathit{quantity} = 120 - 9.8\,\mathit{price} + .03\,\mathit{income}.\]
What is the demand for CDs if price = 15 and income = 200? What does this suggest
about using linear functions to describe demand curves?

3. Suppose the unemployment rate in the United States goes from 6.4% in one year
to 5.6% in the next.
a. What is the percentage point decrease in the unemployment rate?
b. By what percentage has the unemployment rate fallen?

4. Suppose that the return from holding a particular firm's stock goes from 15% in
one year to 18% in the following year. The majority shareholder claims that the
stock return only increased by 3%, while the chief executive officer claims that
the return on the firm's stock has increased by 20%. Reconcile their
disagreement.

5. Suppose that Person A earns $35,000 per year and Person B earns $42,000.
a. Find the exact percentage by which Person B's salary exceeds Person A's.
b. Now, use the difference in natural logs to find the approximate percentage
difference.
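
The comparison asked for in parts (a) and (b) can be sketched in Python. This is only an illustration: the salaries below are hypothetical values rather than the ones in the problem, and the function names are assumptions made for this sketch.

```python
import math


def exact_pct_diff(base, other):
    """Exact percentage by which `other` exceeds `base`."""
    return 100.0 * (other - base) / base


def log_pct_diff(base, other):
    """Approximate percentage difference: 100 times the difference in natural logs."""
    return 100.0 * (math.log(other) - math.log(base))


# Hypothetical salaries chosen only for illustration.
salary_low, salary_high = 40_000, 44_000
exact = exact_pct_diff(salary_low, salary_high)
approx = log_pct_diff(salary_low, salary_high)
print(f"exact: {exact:.2f}%, log approximation: {approx:.2f}%")
```

The two numbers are close when the gap is small; the log approximation drifts further from the exact figure as the gap widens.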

6. Suppose the following model describes the relationship between annual salary
(salary) and the number of previous years of labour market experience (exper):

\[\log(\mathit{salary}) = 10.6 + 0.027\,\mathit{exper}\]

a. What is salary when exper = 0? When exper = 5? (Hint: You will need to
exponentiate.)
b. Use equation (A.28) to approximate the percentage increase in salary
when exper increases by five years.
c. Use the results of (a) to compute the exact percentage difference in salary
when exper = 5 versus exper = 0. Comment on how this compares with the
approximation in (b).
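
A minimal sketch of the approximate-versus-exact comparison in a log-level model follows. The coefficients `b0` and `b1` and the change `dx` are hypothetical values chosen for illustration, not the ones in the equation above, and the helper `level` is an assumed name.

```python
import math

# Hypothetical log-level model: log(y) = b0 + b1 * x (values are illustrative only).
b0, b1 = 2.0, 0.05
dx = 4  # change in x


def level(x):
    """Recover y from the log-level equation by exponentiating."""
    return math.exp(b0 + b1 * x)


approx_pct = 100.0 * b1 * dx                      # 100 * b1 * dx approximation
exact_pct = 100.0 * (level(dx) / level(0) - 1.0)  # exact change from x = 0 to x = dx
print(f"approximate: {approx_pct:.2f}%, exact: {exact_pct:.2f}%")
```

Because the exact change is 100[exp(b1 dx) - 1], the intercept cancels and the approximation understates the exact change whenever b1 dx is positive.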

7. Let grthemp denote the proportionate growth in employment, at the county level,
from 1990 to 1995, and let salestax denote the county sales tax rate, stated as a
proportion. Interpret the intercept and slope in the equation
\[\mathit{grthemp} = .043 - .78\,\mathit{salestax}.\]

8. Suppose the yield of a certain crop (in bushels per acre) is related to fertilizer
amount (in pounds per acre) as
\[\mathit{yield} = 120 + .19\sqrt{\mathit{fertilizer}}.\]
a. Graph this relationship by plugging in several values for fertilizer.
b. Describe how the shape of this relationship compares with a linear
relationship between yield and fertilizer.

Chapter 1

1. Suppose that you are asked to conduct a study to determine whether smaller class
sizes lead to improved student performance of fourth graders.
a. If you could conduct any experiment you want, what would you do? Be
specific.
b. More realistically, suppose you can collect observational data on several
thousand fourth graders in a given state. You can obtain the size of their
fourth-grade class and a standardized test score taken at the end of fourth
grade. Why might you expect a negative correlation between class size
and test score?
c. Would a negative correlation based on observational data necessarily
show that smaller class sizes cause better performance? Explain.

2. A justification for job training programs is that they improve worker productivity.
Suppose that you are asked to evaluate whether more job training makes workers
more productive. However, rather than having data on individual workers, you
have access to data on manufacturing firms in Ohio. In particular, for each firm,
you have information on hours of job training per worker (training) and number
of nondefective items produced per worker hour (output).

a. Carefully state the ceteris paribus thought experiment underlying this
policy question.
b. Does it seem likely that a firm's decision to train its workers will be
independent of worker characteristics? What are some of those measurable
and unmeasurable worker characteristics?
c. Name a factor other than worker characteristics that can affect worker
productivity.
d. If you find a positive correlation between output and training, would you
have convincingly established that job training makes workers more
productive? Explain.

3. Suppose at your university you are asked to find the relationship between weekly
hours spent studying (study) and weekly hours spent working (work). Does it
make sense to characterize the problem as inferring whether study causes work
or work causes study? Explain.

Computer exercises:

C1.1 Use the data in WAGE1 for this exercise.
a. Find the average education level in the sample. What are the lowest and
highest years of education?
b. Find the average hourly wage in the sample. Does it seem high or low?
c. The wage data are reported in 1976 dollars. Using the Economic Report of
the President (2004 or later), obtain and report the Consumer Price Index
(CPI) for the years 1976 and 2003.
d. Use the CPI values from part (c) to find the average hourly wage in 2003
dollars. Now does the average hourly wage seem reasonable?
e. How many women are in the sample? How many men?

C1.2 Use the data in BWGHT to answer this question.
a. How many women are in the sample, and how many report smoking
during pregnancy?
b. What is the average number of cigarettes smoked per day? Is the average
a good measure of the typical woman in this case? Explain.
c. Among women who smoked during pregnancy, what is the average
number of cigarettes smoked per day? How does this compare with your
answer from (b), and why?
d. Find the average of fatheduc in the sample. Why are only 1,192
observations used to compute this average?
e. Report the average family income and its standard deviation in dollars.

C1.3 The data in MEAP01 are for the state of Michigan in the year 2001. Use these data
to answer the following questions.
a. Find the largest and smallest values of math4. Does the range make
sense? Explain.
b. How many schools have a perfect pass rate on the math test? What
percentage is this of the total sample?
c. How many schools have math pass rates of exactly 50 percent?
d. Compare the average pass rates for the math and reading scores. Which
test is harder to pass?
e. Find the correlation between math4 and read4. What do you conclude?
f. The variable exppp is expenditure per pupil. Find the average of exppp
along with its standard deviation. Would you say there is wide variation in
per pupil spending?
g. Suppose School A spends $6,000 per student and School B spends $5,500
per student. By what percentage does School A's spending exceed School
B's? Compare this to \(100 \cdot [\log(6000) - \log(5500)]\), which is the
approximate percentage difference based on the difference in the natural
logs. (See Section A.4 in Appendix A.)

C1.4 The data in JTRAIN2 come from a job training experiment conducted for
low-income men during 1976-1977; see LaLonde (1986).
a. Use the indicator variable train to determine the fraction of men receiving
job training.
b. The variable re78 is earnings from 1978, measured in thousands of 1982
dollars. Find the average of re78 for the sample of men receiving job
training and the sample not receiving job training. Is the difference
economically large?
c. The variable unem78 is an indicator of whether a man is unemployed
in 1978. What fraction of the men who received job training are unemployed?
What about for men who did not receive job training? Comment on the
difference.
d. From parts (b) and (c), does it appear that the job training program was
effective? What would make our conclusions more convincing?


Chapter 2

1. Let kids denote the number of children ever born to a woman, and let educ denote
years of education for the woman. A simple model relating fertility to years of
education is
\[\mathit{kids} = \beta_0 + \beta_1\,\mathit{educ} + u,\]
where u is the unobserved error.

a. Name a few factors that may be contained in u. Are these likely to be
correlated with level of education?
b. Will a simple regression analysis uncover the ceteris paribus effect of
education on fertility? Explain.

2. In the simple linear regression model \(y = \beta_0 + \beta_1 x + u\), suppose that \(E(u) \neq 0\).
Letting \(\alpha_0 = E(u)\), show that the model can always be rewritten with the same
slope, but a new intercept and error, where the new error has a zero expected
value.

3. The following table contains the ACT scores and the GPA (grade point average)
for eight college students. Grade point average is based on a four-point scale and
has been rounded to one digit after the decimal.

Student GPA ACT
1 2.8 21
2 3.4 24
3 3.0 26
4 3.5 27
5 3.6 29
6 3.0 25
7 2.7 25
8 3.7 30

a. Estimate the relationship between GPA and ACT using OLS; that is, obtain the
intercept and slope estimates (formulas: equations 2.19 and 2.17) in
\[\widehat{\mathit{GPA}} = \hat\beta_0 + \hat\beta_1\,\mathit{ACT}.\]
b. Compute the fitted values and residuals for each observation, and verify that
the residuals (approximately) sum to zero.
c. What is the predicted value of GPA when ACT = 20?
d. How much of the variation in GPA for these eight students is explained by
ACT? Explain.
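
The steps in parts (a), (b), and (d) can be sketched in Python using the usual simple-regression formulas (slope equal to the sample covariance of x and y over the sample variance of x, intercept equal to the mean of y minus the slope times the mean of x). The data below are hypothetical and are not the table above; the function name `ols_simple` is an assumption made for this sketch.

```python
def ols_simple(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, not the GPA/ACT table.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]
b0, b1 = ols_simple(x, y)

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# SST is the total sum of squares; SSR is the residual sum of squares.
y_bar = sum(y) / len(y)
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum(u ** 2 for u in residuals)
r_squared = 1 - ssr / sst
print(b0, b1, sum(residuals), r_squared)
```

The residuals sum to zero only up to floating-point rounding, which is why the problem says "approximately".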

4. The data set BWGHT.RAW contains data on births to women in the United
States. Two variables of interest are the dependent variable, infant birth weight in
ounces (bwght) and an explanatory variable, average number of cigarettes the
mother smoked per day during pregnancy (cigs). The following simple regression
was estimated using data on n = 1388 births:

\[\widehat{\mathit{bwght}} = 119.77 - 0.514\,\mathit{cigs}\]

a. What is the predicted birth weight when cigs = 0? What about when cigs =
20 (one pack per day)? Comment on the difference.
b. Does this simple regression necessarily capture a causal relationship
between the child's birth weight and the mother's smoking habits?
Explain.
c. To predict a birth weight of 125 ounces, what would cigs have to be?
Comment.
d. The proportion of women in the sample who do not smoke while pregnant
is about 0.85. Does this help reconcile your finding from (c)?

5. In the linear consumption function
\[\widehat{\mathit{cons}} = \hat\beta_0 + \hat\beta_1\,\mathit{inc},\]
the (estimated) marginal propensity to consume (MPC) out of income is simply the
slope, \(\hat\beta_1\), while the average propensity to consume (APC) is
\[\widehat{\mathit{cons}}/\mathit{inc} = \hat\beta_0/\mathit{inc} + \hat\beta_1.\]
Using observations for 100 families on annual income and consumption (both
measured in dollars), the following equation is obtained:

\[\widehat{\mathit{cons}} = -124.84 + 0.853\,\mathit{inc}\]
\[n = 100, \quad R^2 = 0.692\]
a. Interpret the intercept in this equation, and comment on its sign and
magnitude.
b. What is the predicted consumption when family income is $30,000?
c. With inc on the x-axis, draw a graph of the estimated MPC and APC starting
at the annual income level of $1000.

6. Using the data from 1988 for houses sold in Andover, Massachusetts, from Kiel
and McClain (1995), the following equation relates housing price (price) to the
distance from a recently built garbage incinerator (dist):

\[\widehat{\log(\mathit{price})} = 9.40 + 0.312\,\log(\mathit{dist})\]
\[n = 135, \quad R^2 = 0.162\]
a. Interpret the coefficient on log(dist). Is the sign of this estimate what you
expect it to be?
b. Do you think simple regression provides an unbiased estimator of the ceteris
paribus elasticity of price with respect to dist? (Think about the city's decision
on where to put the incinerator.)

7. For the population of firms in the chemical industry, let rd denote annual
expenditures on research and development, and let sales denote annual sales (both
are in millions of dollars). Write down a regression model that uses sales to
explain the variation in rd. Your model should imply a constant elasticity between
rd and sales. Which parameter is the elasticity?


Computer Exercises

C2.1 The data in 401K.RAW are a subset of data analyzed by Papke (1995) to study the
relationship between participation in a 401(k) pension plan and the generosity of the plan.
The variable prate is the percentage of eligible workers with an active account; this is the
variable we would like to explain. The measure of generosity is the plan match rate,
mrate. This variable gives the average amount the firm contributes to each worker's plan
for each $1 contribution by the worker. For example, if mrate = 0.5, then a $1
contribution by the worker is matched by a 50-cent contribution by the firm.

a. Find the average participation rate and the average match rate in the sample of
plans.
b. Read the STATA output of the following simple regression equation
\[\widehat{\mathit{prate}} = \hat\beta_0 + \hat\beta_1\,\mathit{mrate},\]
and report the results along with the sample size and R-squared.
c. Interpret the intercept in your equation. Interpret the coefficient on mrate.
d. Find the predicted prate when mrate = 3.5. Is this a reasonable prediction?
Explain what is happening here.
e. How much of the variation in prate is explained by mrate?

C2.2 The data set in CEOSAL2.RAW contains information on chief executive officers
for U.S. corporations. The variable salary is annual compensation, in thousands of
dollars, and ceoten is prior number of years as company CEO.
a. Find the average salary and the average tenure in the sample.
b. How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is
the longest tenure as a CEO?
c. Read the STATA output of the following simple regression model
\[\log(\mathit{salary}) = \beta_0 + \beta_1\,\mathit{ceoten} + u\]
and write down the sample regression function. What is the (approximate)
predicted percentage increase in salary given one more year as a CEO?


C2.3 Use the data in WAGE2.RAW to estimate a simple regression explaining monthly
salary (wage) in terms of IQ score (IQ).
a. Find the average salary and average IQ in the sample. What is the sample
standard deviation of IQ? (IQ scores are standardized so that the average in
the population is 100 with a standard deviation equal to 15.)
b. I estimated a simple regression model where a one-point increase in IQ
changes wage by a constant dollar amount. Use the STATA output to find the
predicted increase in wage for an increase in IQ of 15 points. Does IQ explain
most of the variation in wage?
c. I then estimated a model where each one-point increase in IQ has the same
percentage effect on wage. If IQ increases by 15 points, what is the
approximate percentage increase in predicted wage? Calculate the same effect
without using the approximation and compare the two results.

C2.4 We used the data in MEAP93.RAW for Example 2.12. Let math10 denote the
percentage of tenth graders at a high school receiving a passing score on a standardized
mathematics exam. Now we want to explore the relationship between the math pass rate
(math10) and spending per student (expend).
a. Do you think each additional dollar spent has the same effect on the pass rate,
or does a diminishing effect seem more appropriate? Explain.
b. I estimated the model
\[\mathit{math10} = \beta_0 + \beta_1 \log(\mathit{expend}) + u.\]
Read the STATA output and write down the sample regression function,
including the sample size and R-squared.
c. How big is the estimated spending effect? Namely, if spending increases by
10 percent, what is the estimated percentage point increase in math10?

Chapter 3
1. Using the data in GPA2.RAW on 4,137 college students, the following equation
was estimated by OLS:

\[\widehat{\mathit{colgpa}} = 1.392 - .0135\,\mathit{hsperc} + .00148\,\mathit{sat}\]
\[n = 4{,}137, \quad R^2 = .273,\]

where colgpa is measured on a four-point scale, hsperc is percentile in the high
school graduating class (defined so that, for example, hsperc = 5 means the top five
percent of the class), and sat is the combined math and verbal scores on the
student achievement test.

a. Why does it make sense for the coefficient on hsperc to be negative?
b. What is the predicted college GPA when hsperc = 20 and sat = 1050?
c. Suppose that two high school graduates, A and B, graduated in the same
percentile from high school, but student A's SAT score was 140 points
higher (about one standard deviation in the sample). What is the predicted
difference in college GPA for these two students?
d. Holding hsperc fixed, what difference in SAT scores leads to a predicted
colgpa difference of 0.50, or one-half of a grade point?
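
Predictions and ceteris paribus differences from a fitted equation like this one can be sketched as below. The coefficients are taken from the estimated equation in this problem (with the negative sign on hsperc that part (a) refers to); the inputs are hypothetical and differ from the values asked about, and `predict_colgpa` is an assumed helper name.

```python
# Coefficients from the estimated equation in this problem.
INTERCEPT, B_HSPERC, B_SAT = 1.392, -0.0135, 0.00148


def predict_colgpa(hsperc, sat):
    """Predicted college GPA from the fitted equation."""
    return INTERCEPT + B_HSPERC * hsperc + B_SAT * sat


# Hypothetical inputs (not the values asked about in the problem).
gpa_a = predict_colgpa(hsperc=10, sat=1000)
gpa_b = predict_colgpa(hsperc=10, sat=1100)

# Holding hsperc fixed, the predicted difference depends only on the SAT gap.
diff = gpa_b - gpa_a
print(round(gpa_a, 3), round(diff, 3))
```

The difference equals the sat coefficient times the SAT gap, which is the same logic needed to invert the relationship in part (d).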

2. The data in WAGE2.RAW on working men were used to estimate the following
equation:

\[\widehat{\mathit{educ}} = 10.36 - .094\,\mathit{sibs} + .131\,\mathit{meduc} + .210\,\mathit{feduc}\]
\[n = 722, \quad R^2 = .214,\]

where educ is years of schooling, sibs is number of siblings, meduc is mother's
years of schooling, and feduc is father's years of schooling.

a. Does sibs have the expected effect? Explain. Holding meduc and feduc
fixed, by how much does sibs have to increase to reduce predicted years of
education by one year? (A noninteger answer is acceptable here.)
b. Discuss the interpretation of the coefficient on meduc.
c. Suppose that Man A has no siblings, and his mother and father each have
12 years of education. Man B has no siblings, and his mother and father
each have 16 years of education. What is the predicted difference in years
of education between B and A?

3. The median starting salary for new law school graduates is determined by
\[\log(\mathit{salary}) = \beta_0 + \beta_1\,\mathit{LSAT} + \beta_2\,\mathit{GPA} + \beta_3 \log(\mathit{libvol}) + \beta_4 \log(\mathit{cost}) + \beta_5\,\mathit{rank} + u,\]

where LSAT is the median LSAT score for the graduating class, GPA is the median
college GPA for the class, libvol is the number of volumes in the law school library,
cost is the annual cost of attending law school, and rank is a law school ranking (with
rank=1 being the best).

a. Explain why we expect \(\beta_5 \le 0\).
b. What signs do you expect for the other slope parameters? Justify your
answers.
c. Using the data in LAWSCH85.RAW, the estimated equation is

\[\widehat{\log(\mathit{salary})} = 8.34 + .0047\,\mathit{LSAT} + .248\,\mathit{GPA} + .095 \log(\mathit{libvol}) + .038 \log(\mathit{cost}) - .0033\,\mathit{rank}\]
\[n = 136, \quad R^2 = .842\]

What is the predicted ceteris paribus difference in salary for schools with a
median GPA different by one point? (Report your answer as a percentage.)
d. Interpret the coefficient on the variable log(libvol).
e. Would you say it is better to attend a better ranked law school? How much
is a difference in ranking of 20 worth in terms of predicted starting salary?

4. In a study relating college grade point average to time spent in various activities,
you distribute a survey to several students. The students are asked how many
hours they spend each week in four activities: studying, sleeping, working, and
leisure. Any activity is put into one of the four categories, so that for each student,
the sum of hours in the four activities must be 168.
a. In the model

\[\mathit{GPA} = \beta_0 + \beta_1\,\mathit{study} + \beta_2\,\mathit{sleep} + \beta_3\,\mathit{work} + \beta_4\,\mathit{leisure} + u,\]

does it make sense to hold sleep, work, and leisure fixed, while changing
study?
b. Explain why this model violates Assumption MLR.3.
c. How could you reformulate the model so that its parameters have a useful
interpretation and it satisfies Assumption MLR.3?


5. Consider the multiple regression model containing three independent variables,
under Assumptions MLR.1 through MLR.4:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u.\]
You are interested in estimating the sum of the parameters on \(x_1\) and \(x_2\); call this
\(\theta_1 = \beta_1 + \beta_2\).
a. Show that \(\hat\theta_1 = \hat\beta_1 + \hat\beta_2\) is an unbiased estimator of \(\theta_1\).
b. Find \(\mathrm{Var}(\hat\theta_1)\) in terms of \(\mathrm{Var}(\hat\beta_1)\), \(\mathrm{Var}(\hat\beta_2)\), and \(\mathrm{Corr}(\hat\beta_1, \hat\beta_2)\).

6. Which of the following can cause OLS estimators to be biased?
a. Heteroskedasticity.
b. Omitting an important variable.
c. A sample correlation coefficient of .95 between two independent variables
both included in the model.

7. Suppose that average worker productivity at manufacturing firms (avgprod)
depends on two factors, average hours of training (avgtrain) and average worker
ability (avgabil):
\[\mathit{avgprod} = \beta_0 + \beta_1\,\mathit{avgtrain} + \beta_2\,\mathit{avgabil} + u.\]
Assume that this equation satisfies MLR.1 through MLR.4. If grants have been
given to firms whose workers have less than average ability, so that avgtrain and
avgabil are negatively correlated, what is the likely bias in \(\tilde\beta_1\) obtained from the
simple regression of avgprod on avgtrain? (Use terminology such as
upward bias, downward bias, or biased toward zero.)

8. Suppose that you are interested in estimating the ceteris paribus relationship
between y and \(x_1\). For this purpose, you can collect data on two control variables,
\(x_2\) and \(x_3\). (For concreteness, you might think of y as final exam score, \(x_1\) as class
attendance, \(x_2\) as GPA up to the previous semester, and \(x_3\) as SAT or ACT score.)
Let \(\tilde\beta_1\) be the simple regression estimate from y on \(x_1\) and let \(\hat\beta_1\) be the multiple
regression estimate from y on \(x_1\), \(x_2\), \(x_3\).
a. If \(x_1\) is highly correlated with \(x_2\) and \(x_3\) in the sample, and \(x_2\) and \(x_3\) have
large partial effects on y, would you expect \(\tilde\beta_1\) and \(\hat\beta_1\) to be similar or
very different? Explain.
b. If \(x_1\) is almost uncorrelated with \(x_2\) and \(x_3\), but \(x_2\) and \(x_3\) are highly
correlated, will \(\tilde\beta_1\) and \(\hat\beta_1\) tend to be similar or very different? Explain.
c. If \(x_1\) is highly correlated with \(x_2\) and \(x_3\), and \(x_2\) and \(x_3\) have small partial
effects on y, would you expect \(\mathrm{se}(\tilde\beta_1)\) or \(\mathrm{se}(\hat\beta_1)\) to be smaller? Explain.
d. If \(x_1\) is almost uncorrelated with \(x_2\) and \(x_3\), \(x_2\) and \(x_3\) have large partial
effects on y, and \(x_2\) and \(x_3\) are highly correlated, would you expect \(\mathrm{se}(\tilde\beta_1)\)
or \(\mathrm{se}(\hat\beta_1)\) to be smaller? Explain.

9. Suppose the population model is
\[y = \beta_0 + \beta_1 x + u.\]
The key condition needed for OLS to consistently estimate the \(\beta\) parameters is that the error
term has mean zero and is uncorrelated with the regressor:
\[E(u) = 0, \quad E(xu) = 0.\]
Show that the zero conditional mean assumption \(E(u \mid x) = 0\) is stronger than the
above condition. (In fact, given the zero conditional mean assumption, you can
show that the error term is uncorrelated with any function of x.)

10. Derivations related to OLS estimators
a. Deriving OLS estimator for a simple regression (p. 29)
b. Show that \(\bar{\hat y} = \bar y\)
c. Show that \(\sum_{i=1}^{n} \hat u_i \hat y_i = 0\)
d. Show that SST = SSE + SSR (page 39)
e. Partialling out interpretation of multiple regression
Suppose the population regression is
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + u_i.\]
Claim: \(\hat\beta_1\) from this multiple regression is equal to \(\hat\gamma_1\) from the following two steps
(partialling out procedures).
Step 1: regress \(x_{i1}\) on \(x_{i2}, \dots, x_{ik}\) with an intercept to get the regression residual \(\hat r_{i1}\).
Step 2: regress \(y_i\) on \(\hat r_{i1}\) with an intercept:
\[y_i = \gamma_0 + \gamma_1 \hat r_{i1} + e_i.\]
Then we claim \(\hat\beta_1 = \hat\gamma_1\), where
\[\hat\gamma_1 = \frac{\sum_{i=1}^{n} \hat r_{i1} y_i}{\sum_{i=1}^{n} \hat r_{i1}^2}.\]

According to (2.19) on page 29, for the simple regression in step 2, we have
\[\hat\gamma_1 = \frac{\sum_{i=1}^{n} (\hat r_{i1} - \bar{\hat r}_1)(y_i - \bar y)}{\sum_{i=1}^{n} (\hat r_{i1} - \bar{\hat r}_1)^2}.\]

Show that
\[\frac{\sum_{i=1}^{n} (\hat r_{i1} - \bar{\hat r}_1)(y_i - \bar y)}{\sum_{i=1}^{n} (\hat r_{i1} - \bar{\hat r}_1)^2} = \frac{\sum_{i=1}^{n} \hat r_{i1} y_i}{\sum_{i=1}^{n} \hat r_{i1}^2}\]
(you need \(\sum_{i=1}^{n} \hat r_{i1} = 0\), thus \(\bar{\hat r}_1 = 0\)).

Show that \(\hat\beta_1 = \hat\gamma_1\) (Appendix 3A.2 on page 113).
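
The partialling-out claim can be checked numerically before proving it. The sketch below uses hypothetical data with only two regressors (k = 2), so that the multiple-regression slope has a simple closed form in deviations from means; the helper names `demean` and `dot` are assumptions made for this illustration.

```python
def demean(v):
    """Subtract the sample mean from every element."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def dot(a, b):
    """Sum of elementwise products."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Hypothetical data with two regressors (k = 2), for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

# Multiple regression of y on x1 and x2 (with intercept): closed-form slope
# on x1 from the normal equations written in deviations from means.
d1, d2, dy = demean(x1), demean(x2), demean(y)
s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
s1y, s2y = dot(d1, dy), dot(d2, dy)
beta1_multiple = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# Step 1: regress x1 on x2 (with intercept) and keep the residuals r1.
slope_12 = s12 / s22
r1 = [a - slope_12 * b for a, b in zip(d1, d2)]  # residuals have mean zero

# Step 2: regress y on r1; since r1 sums to zero, slope = sum(r1*y) / sum(r1^2).
beta1_partial = dot(r1, y) / dot(r1, r1)

print(beta1_multiple, beta1_partial)
```

The two printed values agree up to rounding, which is the numerical counterpart of the equality to be shown.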



11. Omitted variable bias in OLS estimators:
Suppose the true population model is
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^* + u.\]
We assume this model satisfies the assumption \(E(u \mid x_1, x_2^*) = E(u) = 0\). Our
primary interest is in \(\beta_1\), the partial effect of \(x_1\) on y. For example, y is hourly
wage (or log of hourly wage), \(x_1\) is education, and \(x_2^*\) is innate ability. In order to
get an unbiased estimator of \(\beta_1\), we should run a regression of y on \(x_1\) and \(x_2^*\).
However, \(x_2^*\) is not observed. If we regress y on \(x_1\) only, the estimator of \(\beta_1\)
from this regression will suffer from omitted variable bias. Suppose
\(E(x_2^* \mid x_1) = \alpha_0 + \alpha_1 x_1\). Derive the bias in the estimator of \(\beta_1\) from a simple
regression of y on \(x_1\) only.


Computer exercises

C3.1 A problem of interest to health officials (and others) is to determine the effects of
smoking during pregnancy on infant health. One measure of infant health is birth weight;
a birth weight that is too low can put an infant at risk for contracting various illnesses.
Since factors other than cigarette smoking that affect birth weight are likely to be
correlated with smoking, we should take those factors into account. For example, higher
income generally results in access to better prenatal care, as well as better nutrition for
the mother. A regression model that recognizes that is
\[\mathit{bwght} = \beta_0 + \beta_1\,\mathit{cigs} + \beta_2\,\mathit{faminc} + u,\]
where birth weight (bwght ) is in ounces, cigs is average number of cigarettes the mother
smoked per day during pregnancy and family income (faminc) is in thousands.
a. What is the most likely sign for \(\beta_2\)?
b. Do you think cigs and faminc are likely to be correlated? Explain why the
correlation might be positive or negative.
c. I estimate the equation with and without faminc, using the data in
BWGHT.RAW. Use STATA output to report the results in equation form,
including the sample size and R-squared. Discuss the results, focusing on
whether adding faminc substantially changes the estimated effect of cigs on
bwght.
d. Interpret the coefficient of faminc in the multiple regression. Do you think this
effect is practically large?

C3.2 I use the data in HPRICE1.RAW to estimate the following model:
\[\mathit{price} = \beta_0 + \beta_1\,\mathit{sqrft} + \beta_2\,\mathit{bdrms} + u,\]
where price is the house price measured in thousands of dollars, sqrft is square footage of
the house and bdrms is number of bedrooms.
a. Write out the sample regression function using the STATA output.
b. What is the estimated increase in price for a house with one more bedroom,
holding square footage constant?
c. What is the estimated increase in price for a house with an additional bedroom
that is 140 square feet in size? Compare this to your answer in part (b).
d. What percentage of the variation in price is explained by square footage and
number of bedrooms?
e. The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the
predicted selling price for this house from the OLS regression line.
f. The actual selling price of the first house in the sample was $300,000 (so
price = 300). Find the residual for this house. Does it suggest that the buyer
underpaid or overpaid for the house?

C3.3 The file CEOSAL2.RAW contains data on 177 chief executive officers and can be
used to examine the effects of firm performance on CEO salary. The variable salary is
annual compensation, in thousands of dollars, ceoten is prior number of years as
company CEO, profits is firm profit in millions, mktval is firm market value in millions,
sales is firm sales in millions.
a. I estimate a model relating annual salary to firm sales and market value
making the model of the constant elasticity variety for both independent
variables. Write the SRF using the STATA output.
b. Then I add profits to the model in (a). I cannot include this variable in
logarithmic form, why? Would you say that these firm performance variables
explain most of the variation in CEO salaries?
c. Subsequently I add the variable ceoten to the model in (b). What is the
estimated percentage return for another year of CEO tenure, holding other
factors fixed?
d. Find the sample correlation coefficient between the variables log(mktval) and
profits. Are these variables highly correlated? What does this say about the
OLS estimators?

C3.4 The data in ATTEND.RAW are used for this exercise.
a. Report the minimum, maximum, and average values for the variables atndrte,
priGPA, and ACT.
b. I estimate the model
\[\mathit{atndrte} = \beta_0 + \beta_1\,\mathit{priGPA} + \beta_2\,\mathit{ACT} + u.\]
Write the SRF using the STATA output. Interpret the intercept. Does it have a
useful meaning?
c. Discuss the estimated slope coefficients. Are there any surprises?
d. What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you
think of this result?
e. If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1
and ACT = 26, what is the predicted difference in their attendance rates?

C3.5 The data set in WAGE2.RAW is used for this problem.
First I run a simple regression of IQ on educ to obtain the slope coefficient, say, \(\tilde\alpha_1\). Then
I run the simple regression of log(wage) on educ, and obtain the slope coefficient, \(\tilde\beta_1\).
Subsequently I run the multiple regression of log(wage) on educ and IQ, and obtain the
slope coefficients, \(\hat\beta_1\) and \(\hat\beta_2\), respectively.
Based on the above regression results, verify that \(\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\alpha_1\).
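
The relationship to be verified holds exactly in any sample: the simple-regression slope equals the multiple-regression slope on the included variable plus the multiple-regression slope on the second variable times the slope from regressing the second variable on the first. A minimal sketch on hypothetical data (stand-ins for educ, IQ, and log(wage), not the WAGE2 data; all names are assumptions made for this illustration):

```python
def demean(v):
    """Subtract the sample mean from every element."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def dot(a, b):
    """Sum of elementwise products."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Hypothetical stand-ins for educ, IQ, and log(wage); not the WAGE2 data.
x1 = [10.0, 12.0, 12.0, 14.0, 16.0, 18.0]
x2 = [90.0, 95.0, 105.0, 100.0, 115.0, 120.0]
y = [6.2, 6.5, 6.6, 6.8, 7.1, 7.3]

d1, d2, dy = demean(x1), demean(x2), demean(y)
s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
s1y, s2y = dot(d1, dy), dot(d2, dy)
det = s11 * s22 - s12 ** 2

beta1_hat = (s22 * s1y - s12 * s2y) / det  # multiple regression slope on x1
beta2_hat = (s11 * s2y - s12 * s1y) / det  # multiple regression slope on x2
beta1_tilde = s1y / s11                    # simple regression of y on x1
aux_slope = s12 / s11                      # simple regression of x2 on x1

print(beta1_tilde, beta1_hat + beta2_hat * aux_slope)
```

With the STATA output, the same check amounts to plugging the three reported slopes into the right-hand side and comparing with the simple-regression slope.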

C3.6 The data in MEAP93.RAW are used to estimate the following regression.
a. I estimate the model
\[\mathit{math10} = \beta_0 + \beta_1 \log(\mathit{expend}) + \beta_2\,\mathit{lnchprg} + u.\]
Report the SRF, including the sample size and R-squared.
b. What do you make of the intercept in (a)? In particular, does it make sense to set
the two explanatory variables to zero? [Hint: Recall that log(1)=0.]
c. Now I run the simple regression of math10 on log(expend), and compare the
slope coefficient with the estimate obtained in (a). Is the estimated spending
effect now larger or smaller than in (a)?
d. Report the correlation between lexpend = log(expend) and lnchprg. Does its
sign make sense to you?
e. Use (d) to explain your findings in (c).

C3.7 I use the data in DISCRIM.RAW for this question. These are zip code-level data on
prices for various items at fast-food restaurants, along with characteristics of the zip code
population, in New Jersey and Pennsylvania. The idea is to see whether fast-food
restaurants charge higher prices in areas with a larger concentration of blacks.
a. Report the sample mean of prpblck and income, along with their standard
deviations. Can you deduce the units of measurement of prpblck and income?
b. Consider a model to explain the price of soda, psoda, in terms of the
proportion of the population that is black and median income:
\[\mathit{psoda} = \beta_0 + \beta_1\,\mathit{prpblck} + \beta_2\,\mathit{income} + u.\]
Report the SRF, including the sample size and R-squared. Interpret the
coefficient on prpblck. Do you think the effect of prpblck on the price of soda is
economically large? (Compare two hypothetical communities, one
100% white and the other 100% black.)
c. Compare the estimate from (b) with the simple regression estimate from
psoda on prpblck. Is the discrimination effect larger or smaller when you
control for income?
d. A model with constant price elasticity with respect to income may be more
appropriate. Report estimates of the model
\[\log(\mathit{psoda}) = \beta_0 + \beta_1\,\mathit{prpblck} + \beta_2 \log(\mathit{income}) + u.\]
If prpblck increases by .20 (20 percentage points), what is the estimated
percentage change in psoda?
e. Now add the variable prppov to the regression in (d). What happens to
\(\hat\beta_{\mathit{prpblck}}\)?
f. Report the correlation between log(income) and prppov. Is it roughly what
you expected?
g. Evaluate the following statement: "Because log(income) and prppov are so
highly correlated, they have no business being in the same regression."

Chapter 4

1. Consider an equation to explain salaries of CEOs in terms of annual firm sales,
return on equity (roe, in percentage), and return on the firm's stock (ros, in
percentage):
\[\log(\mathit{salary}) = \beta_0 + \beta_1 \log(\mathit{sales}) + \beta_2\,\mathit{roe} + \beta_3\,\mathit{ros} + u.\]
a. State the null hypothesis that, after controlling for sales and roe, ros has
no effect on CEO salary. State the alternative that better stock market
performance (higher ros) increases a CEO's salary.
b. Using the data in CEOSAL1.RAW, the following SRF was obtained by
OLS:

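\[\widehat{\log(\mathit{salary})} = \underset{(.32)}{4.32} + \underset{(.035)}{.280}\,\log(\mathit{sales}) + \underset{(.0041)}{.0174}\,\mathit{roe} + \underset{(.00054)}{.00024}\,\mathit{ros}\]
\[n = 209, \quad R^2 = .283.\]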

What is the effect of ros on the predicted salary if ros increases by 50
percentage points? Does ros have a practically large effect on salary?

c. Test the null hypothesis that ros has no effect on salary against the
alternative that ros has a positive effect. Carry out the test at the 10%
significance level.
d. Would you include ros in a final model explaining CEO compensation in
terms of firm performance? Explain.
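
The mechanics of the one-sided test in part (c) can be sketched as follows. The estimate and standard error are hypothetical values, not those in the equation above, and the critical value comes from the standard normal distribution as a large-sample stand-in for the t distribution; that substitution is an assumption of this sketch, and a t table with the appropriate degrees of freedom should be used for an exact answer.

```python
from statistics import NormalDist

# Hypothetical estimate and standard error (not taken from the equation above).
coef, se = 0.50, 0.20
t_stat = coef / se  # t statistic for H0: coefficient = 0

# One-sided 10% critical value from the standard normal distribution, used
# here as a large-sample stand-in for the t distribution (an assumption).
critical = NormalDist().inv_cdf(0.90)
reject = t_stat > critical
print(f"t = {t_stat:.2f}, critical value = {critical:.3f}, reject H0: {reject}")
```

For a one-sided alternative that the coefficient is positive, the null is rejected only when the t statistic exceeds the critical value; a negative t statistic can never lead to rejection.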

2. The variable rdintens is expenditures on research and development (R&D) as a
percentage of sales. Sales are measured in millions of dollars. The variable
profmarg is profits as a percentage of sales.
Using the data in RDCHEM.RAW for 32 firms in the chemical industry, the
following equation is estimated:

$$\widehat{\mathit{rdintens}} = \underset{(1.369)}{.472} + \underset{(.216)}{.321}\,\log(\mathit{sales}) + \underset{(.046)}{.050}\,\mathit{profmarg}$$
$$n = 32,\quad R^2 = .099.$$

a. Interpret the coefficient on log(sales). In particular, if sales increases by 10%,
what is the estimated effect on rdintens? Is this an economically large effect?
b. Test the hypothesis that R&D intensity does not change with sales against the
alternative that it does increase with sales. Do the test at the 5% and 10%
levels.
c. Interpret the coefficient on profmarg. Is it economically large?
d. Does profmarg have a statistically significant effect on rdintens?

3. Are rent rates influenced by the student population in a college town? Let rent be
the average monthly rent paid on rental units in a college town in the United
States. Let pop denote the total city population, avginc the average city income,
and pctstu the student population as a percentage of the total population. One
model to test for a relationship between rent rates and percentage of students in
overall population is
$$\log(\mathit{rent}) = \beta_0 + \beta_1 \log(\mathit{pop}) + \beta_2 \log(\mathit{avginc}) + \beta_3\,\mathit{pctstu} + u.$$
a. State the null hypothesis that size of the student body relative to the
population has no ceteris paribus effect on monthly rents. State the
alternative that there is an effect.
b. What signs do you expect for $\beta_1$ and $\beta_2$?
c. The equation estimated using 1990 data from RENTAL.RAW for 64
college towns is

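$$\widehat{\log(\mathit{rent})} = \underset{(.844)}{.043} + \underset{(.039)}{.066}\,\log(\mathit{pop}) + \underset{(.081)}{.507}\,\log(\mathit{avginc}) + \underset{(.0017)}{.0056}\,\mathit{pctstu}$$
$$n = 64,\quad R^2 = .458.$$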


What is wrong with the statement: "A 10% increase in population is
associated with about a 6.6% increase in rent"?
d. Test the hypothesis stated in (a) at the 1% level.

4. Consider the estimated equation from Example 4.3, which can be used to study
the effect of skipping class on college GPA:

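$$\widehat{\mathit{colGPA}} = \underset{(.33)}{1.39} + \underset{(.094)}{.412}\,\mathit{hsGPA} + \underset{(.011)}{.015}\,\mathit{ACT} - \underset{(.026)}{.083}\,\mathit{skipped}$$
$$n = 141,\quad R^2 = .234$$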

a. Find the 95% confidence interval for $\beta_{hsGPA}$.
b. Can you reject the null hypothesis $H_0: \beta_{hsGPA} = .4$ against the two-sided
alternative at the 5% level?
c. Can you reject the null hypothesis $H_0: \beta_{hsGPA} = 1$ against the two-sided
alternative at the 5% level?

5. In section 4.5, we used as an example testing the rationality of assessments of
housing prices. There, we used a log-log model in price and assess [see equation
(4.47)]. Here, we use a level-level specification.

a. In the simple regression model
$$\mathit{price} = \beta_0 + \beta_1\,\mathit{assess} + u,$$
the assessment is rational if $\beta_1 = 1$ and $\beta_0 = 0$. The estimated equation is

$$\widehat{\mathit{price}} = \underset{(16.27)}{-14.47} + \underset{(.049)}{.976}\,\mathit{assess}$$
$$n = 88,\quad \mathrm{SSR} = 165{,}644.51,\quad R^2 = .820$$

First, test the hypothesis that $H_0: \beta_0 = 0$ against a two-sided
alternative. Then, test $H_0: \beta_1 = 1$ against a two-sided alternative. What
do you conclude?
b. To test the joint hypothesis that $\beta_0 = 0$ and $\beta_1 = 1$, we need the SSR in the
restricted model. This amounts to computing $\sum_{i=1}^{n} (\mathit{price}_i - \mathit{assess}_i)^2$, where
n = 88, since the residuals in the restricted model are just $\mathit{price}_i - \mathit{assess}_i$.
(No estimation is needed for the restricted model because both parameters
are specified under $H_0$.) This turns out to yield SSR = 209,448.99. Carry
out the F test for the joint hypothesis. Is the null hypothesis rejected at the
1% level?
c. Now, test $H_0: \beta_2 = 0$, $\beta_3 = 0$, and $\beta_4 = 0$ in the model
$$\mathit{price} = \beta_0 + \beta_1\,\mathit{assess} + \beta_2\,\mathit{lotsize} + \beta_3\,\mathit{sqrft} + \beta_4\,\mathit{bdrms} + u.$$
The R-squared from estimating this model using the same 88 houses is
.829. Can we reject the null hypothesis at the 10% level?
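
The F statistics needed in parts b and c of problem 5 can be computed directly from the numbers reported there. The following is a minimal sketch, not a model answer: the helper names `f_from_ssr` and `f_from_r2` are made up for illustration, and the degrees of freedom assume the unrestricted models include an intercept. The resulting values still have to be compared with critical values from an F table.

```python
def f_from_ssr(ssr_r, ssr_ur, q, df_ur):
    """F statistic from restricted and unrestricted sums of squared residuals."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)


def f_from_r2(r2_ur, r2_r, q, df_ur):
    """R-squared form of the F statistic (same dependent variable in both models)."""
    return ((r2_ur - r2_r) / q) / ((1.0 - r2_ur) / df_ur)


n = 88  # number of houses in the sample

# Part b: two restrictions (beta0 = 0, beta1 = 1) on the simple regression,
# which has n - 2 residual degrees of freedom.
f_b = f_from_ssr(ssr_r=209448.99, ssr_ur=165644.51, q=2, df_ur=n - 2)

# Part c: three exclusion restrictions; the unrestricted model has four
# slopes plus an intercept, so n - 5 residual degrees of freedom.
f_c = f_from_r2(r2_ur=0.829, r2_r=0.820, q=3, df_ur=n - 5)

print(f"F statistic for part b: {f_b:.2f}")
print(f"F statistic for part c: {f_c:.2f}")
```

The R-squared form is used for part c because only the two R-squared values are reported; it applies here since both models have price as the dependent variable.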

6. Consider the multiple regression model with three independent variables, under
the classical linear model assumptions MLR.1 through MLR.6:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u.$$
You would like to test the null hypothesis $H_0: \beta_1 - 3\beta_2 = 1$.
a. Let $\hat{\beta}_1$ and $\hat{\beta}_2$ denote the OLS estimators of $\beta_1$ and $\beta_2$. Find
$\mathrm{Var}(\hat{\beta}_1 - 3\hat{\beta}_2)$ in terms of the variances of $\hat{\beta}_1$ and $\hat{\beta}_2$ and the covariance
between them. What is the standard error of $\hat{\beta}_1 - 3\hat{\beta}_2$?
b. Write the t statistic for testing $H_0: \beta_1 - 3\beta_2 = 1$.
c. Define $\theta_1 = \beta_1 - 3\beta_2$ and $\hat{\theta}_1 = \hat{\beta}_1 - 3\hat{\beta}_2$. Write a regression equation
involving $\beta_0$, $\theta_1$, $\beta_2$, and $\beta_3$ that allows you to directly obtain $\hat{\theta}_1$ and its
standard error.
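
The variance of a linear combination of estimators, as asked for in problem 6, follows the usual quadratic-form rule. The sketch below illustrates it for weights (1, -3); the function name `se_linear_combination` and every numeric value are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def se_linear_combination(weights, cov_matrix):
    """Standard error of sum_j w_j * b_j, given the estimators' covariance matrix."""
    variance = 0.0
    for i, w_i in enumerate(weights):
        for j, w_j in enumerate(weights):
            variance += w_i * w_j * cov_matrix[i][j]
    return math.sqrt(variance)


# Hypothetical variances and covariance of (b1_hat, b2_hat), for illustration only.
cov = [[0.04, 0.01],
       [0.01, 0.09]]
weights = [1.0, -3.0]  # the combination b1_hat - 3 * b2_hat

se = se_linear_combination(weights, cov)

# Hypothetical point estimates; the null value of the combination is 1.
b1_hat, b2_hat = 2.5, 0.4
t_stat = (b1_hat - 3.0 * b2_hat - 1.0) / se

print(f"se = {se:.4f}, t = {t_stat:.4f}")
```

With these weights the double loop reduces to Var(b1) + 9 Var(b2) - 6 Cov(b1, b2). In practice the covariance is rarely reported, which is why part c asks for a reparameterized regression instead.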


7. The following table was created based on results from three regressions using the
data in CEOSAL2.RAW:

Dependent Variable: log(salary)

Independent Variables     (1)        (2)        (3)
log(sales)                .224       .158       .188
                          (.027)     (.040)     (.040)
log(mktval)               _______    .112       .100
                                     (.050)     (.049)
profmarg                  _______    .0023      .0022
                                     (.0022)    (.0021)
ceoten                    _______    _______    .0171
                                                (.0055)
comten                    _______    _______    -.0092
                                                (.0033)
intercept                 4.94       4.62       4.57
                          (0.20)     (0.25)     (0.25)
Observations              177        177        177
R-squared                 .281       .304       .353

The variable mktval is market value of the firm, profmarg is the profit as a percentage
of sales, ceoten is years as CEO with the current company, and comten is total years
with the company.
a. Comment on the effect of profmarg on CEO salary based on the second and
third regressions in the table.
b. Based on the third regression in the table, does market value have a significant
effect in a two-sided test? Explain.
c. Interpret the coefficients on ceoten and comten in the third regression. Are the
variables statistically significant for a two-sided test at the 5% level?
d. What do you make of the fact that longer tenure with the company, holding
the other factors fixed, is associated with a lower salary?


Computer exercises

C4.1 The following model can be used to study whether campaign expenditures affect
election outcomes:
$$\mathit{voteA} = \beta_0 + \beta_1 \log(\mathit{expendA}) + \beta_2 \log(\mathit{expendB}) + \beta_3\,\mathit{prtystrA} + u,$$
where voteA is the percentage of the vote received by Candidate A, expendA and expendB
are campaign expenditures by Candidates A and B, and prtystrA is a measure of party
strength for Candidate A (the percentage of the most recent presidential vote that went to
A's party).
a. What is the interpretation of $\beta_1$?
b. In terms of the parameters, state the null hypothesis that a 1% increase in A's
expenditures is offset by a 1% increase in B's expenditures.
c. I estimate the given model using the data in VOTE1.RAW. Report the SRF
with standard errors in parentheses. Is A's expenditures variable statistically
significant? What about B's expenditures? Can you use these results to test the
hypothesis in (b)?
d. Write down the model that directly gives the t statistic for testing the
hypothesis in (b).

C4.2 Use the data in LAWSCH85.RAW for this exercise.
a. Using the same model as problem 3 of chapter 3, state the null hypothesis that
the rank of law schools has no ceteris paribus effect on median starting salary
and a one-sided alternative hypothesis.
b. Based on the STATA output, interpret the rank coefficient. Can you reject the
null hypothesis in a) at the 5% level?
c. Are features of the incoming class of students, LSAT and GPA, individually or
jointly significant for explaining salary? (To account for missing data on LSAT
and GPA, I estimated the restricted model using only observations for which LSAT
and GPA are not missing.)
d. Test whether the size of the entering class (clsize) or the size of the faculty
(faculty) needs to be added to this equation by carrying out a single test at the
5% level. (Again I accounted for missing data on clsize and faculty.)

C4.3 Use the data in MLB1.RAW for this exercise.
a. I estimate the model in equation (4.31) and drop the variable rbisyr. What
happens to the statistical significance of hrunsyr? What about the size of the
coefficient on hrunsyr?
b. I then add the variables runsyr (runs per year), fldperc (fielding percentage),
and sbasesyr (stolen bases per year) to the model in (a). Which of these
factors are individually significant? Interpret the significant coefficient(s).
c. In the model in (b), test the joint significance of bavg, fldperc, and sbasesyr.

C4.4 Use the data in WAGE2.RAW for this exercise.
a. Consider the standard wage equation
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{tenure} + u.$$
State the null hypothesis that another year of general workforce experience
has the same effect on log(wage) as another year of tenure with the current
employer.
b. Test the null hypothesis in (a) against a two-sided alternative, at the 5%
significance level, by constructing a 95% confidence interval. What do you
conclude?

C4.5 Refer to the example used in Section 4.4. I will use the data set TWOYEAR.RAW.
a. The variable phsrank is the person's high school percentile. (A larger number
is better. For example, 90 means you are ranked better than 90 percent of your
graduating class.) Find the smallest, largest, and average phsrank in the
sample.
b. I then add phsrank to equation (4.26) and estimate the new model. Report the
OLS estimates in the usual form. Is phsrank statistically significant? How
much is 10 percentage points of high school rank worth in terms of wage?
c. Does adding phsrank to (4.26) substantively change the conclusions on the
returns to two- and four-year colleges? Explain.

C4.6 Use the data in DISCRIM.RAW to answer this question. (See also Computer
Exercise C3.7 in Chapter 3.)
a. I estimate the model using STATA
$$\log(\mathit{psoda}) = \beta_0 + \beta_1\,\mathit{prpblck} + \beta_2 \log(\mathit{income}) + \beta_3\,\mathit{prppov} + u.$$
Report the SRF with standard errors, number of observations, and $R^2$. Is $\hat{\beta}_1$
statistically different from zero at the 5% level against a two-sided
alternative? What about at the 1% level?
b. What is the correlation between log(income) and prppov? For both variables,
report the t statistics and two-sided p-values.
c. To the regression in (a), add the variable log(hseval) (hseval is
median housing value at zipcode level). Interpret its coefficient and report the
two-sided p-value for $H_0: \beta_{\log(hseval)} = 0$.
d. In the regression in (c), what happens to the individual statistical significance
of log(income) and prppov? Are these variables jointly significant? (Compute
a p-value.) What do you make of your answers?
e. Given the results of the previous regressions, which one would you report as
most reliable in determining whether the racial makeup of a zip code
influences local fast-food prices? What is the effect of prpblck on price of
soda based on the model you picked as the most reliable?

C4.7 Use the data in HPRICE1.dta to answer this question. We set a population model

$$\log(\mathit{price}) = \beta_0 + \beta_1\,\mathit{sqrft} + \beta_2\,\mathit{bdrms} + u$$

a. You are interested in estimating and obtaining a confidence interval for the
percentage change in price when a 150-square-foot bedroom is added to a
house. In decimal form, this is $\theta_1 = 150\beta_1 + \beta_2$. Use the data to estimate $\theta_1$.
b. Write $\beta_2$ in terms of $\theta_1$ and $\beta_1$ and plug this into the regression equation
above.
c. Use the new regression you get in b) to obtain a standard error for $\hat{\theta}_1$ and use
this standard error to construct a 95% confidence interval.
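
The substitution described in parts b and c of C4.7 can be written out as follows. This is a sketch of the algebra only, using the population model and the definition of $\theta_1$ given in the exercise:

$$\theta_1 = 150\beta_1 + \beta_2 \;\Longrightarrow\; \beta_2 = \theta_1 - 150\beta_1$$
$$\log(\mathit{price}) = \beta_0 + \beta_1\,\mathit{sqrft} + (\theta_1 - 150\beta_1)\,\mathit{bdrms} + u = \beta_0 + \beta_1(\mathit{sqrft} - 150\,\mathit{bdrms}) + \theta_1\,\mathit{bdrms} + u$$

Regressing log(price) on the constructed variable (sqrft - 150 bdrms) and on bdrms should therefore report $\hat{\theta}_1$ as the coefficient on bdrms, together with its standard error.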

Chapter 5
Computer exercises
C5.1 Use the data in WAGE1.dta for this exercise.
a. Estimate the equation
$$\mathit{wage} = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{tenure} + u.$$
Save the residuals and plot a histogram.
b. Repeat part (a), but with log(wage) as the dependent variable.
c. Would you say that Assumption MLR.6 is closer to being satisfied for the
level-level model or the log-level model?

C5.2 Use the data in GPA2.dta for this exercise.
a. Using all 4,137 observations, estimate the equation
$$\mathit{colgpa} = \beta_0 + \beta_1\,\mathit{hsperc} + \beta_2\,\mathit{sat} + u$$
and report the results.
b. Reestimate the equation in part (a), using the first 2,070 observations.
c. Find the ratio of the standard errors on hsperc from parts (a) and (b). Compare
this with the result from equation (5.10) in the book.

Chapter 6

1. The following SRF was estimated using the data in CEOSAL.RAW:

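$$\widehat{\log(\mathit{salary})} = \underset{(.324)}{4.322} + \underset{(.033)}{.276}\,\log(\mathit{sales}) + \underset{(.0129)}{.0215}\,\mathit{roe} - \underset{(.00026)}{.00008}\,\mathit{roe}^2$$
$$n = 209,\quad R^2 = .282.$$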


This model allows roe to have a diminishing effect on log(salary). Is this
generality necessary? Explain why or why not.

2. Let $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ be the OLS estimates from the regression of $y_i$ on $x_{i1}, \ldots, x_{ik}$,
$i = 1, 2, \ldots, n$. For nonzero constants $c_0, c_1, \ldots, c_k$, argue that the OLS intercept
and slopes from the regression of $c_0 y_i$ on $c_1 x_{i1}, \ldots, c_k x_{ik}$, $i = 1, 2, \ldots, n$, are given
by $\tilde{\beta}_0 = c_0 \hat{\beta}_0$, $\tilde{\beta}_1 = (c_0/c_1)\hat{\beta}_1$, ..., $\tilde{\beta}_k = (c_0/c_k)\hat{\beta}_k$.
(Hint: Use the fact that the $\hat{\beta}_j$ solve the first order conditions in (3.13), and the
$\tilde{\beta}_j$ must solve the first order conditions involving the rescaled dependent and
independent variables.)

3. Using the data in RDCHEM.dta, the following equation was obtained by OLS:

$$\widehat{\mathit{rdintens}} = \underset{(0.429)}{2.613} + \underset{(0.00014)}{0.00030}\,\mathit{sales} - \underset{(0.0000000037)}{0.0000000070}\,\mathit{sales}^2$$
$$n = 32,\quad R^2 = .1484$$

a. At what point does the marginal effect of sales on rdintens become
negative?
b. Would you keep the quadratic term in the model? Explain.
c. Define salesbil as sales measured in billions of dollars: salesbil =
sales/1,000. Rewrite (without re-estimating the model) the estimated
equation with salesbil and $\mathit{salesbil}^2$ as the independent variables. Be sure
to report standard errors and the R-squared.
d. For the purpose of reporting the result, which equation do you prefer?
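
The arithmetic behind parts a and c of problem 3 can be sketched as below. It assumes the quadratic term enters with a negative sign, which is what part a implies when it asks where the marginal effect turns negative. The helper names `turning_point` and `rescale_quadratic` are hypothetical.

```python
def turning_point(b_linear, b_quadratic):
    """Value of x at which the slope b_linear + 2 * b_quadratic * x equals zero."""
    return -b_linear / (2.0 * b_quadratic)


def rescale_quadratic(b_linear, b_quadratic, factor):
    """Coefficients on x/factor and (x/factor)**2 implied by those on x and x**2."""
    return b_linear * factor, b_quadratic * factor ** 2


# Point estimates and standard errors copied from the estimated equation
# (negative sign on the quadratic term is assumed).
b_sales, b_sales_sq = 0.00030, -0.0000000070
se_sales, se_sales_sq = 0.00014, 0.0000000037

# Sales level (in millions of dollars) where the marginal effect reaches zero.
sales_star = turning_point(b_sales, b_sales_sq)

# salesbil = sales / 1,000, so estimates and standard errors scale by
# 1,000 (linear term) and 1,000**2 (quadratic term).
b_bil, b_bil_sq = rescale_quadratic(b_sales, b_sales_sq, 1000.0)
se_bil, se_bil_sq = rescale_quadratic(se_sales, se_sales_sq, 1000.0)

print(f"turning point: {sales_star:.2f} million dollars")
print(f"salesbil: {b_bil:.4f} ({se_bil:.4f}), salesbil^2: {b_bil_sq:.4f} ({se_bil_sq:.4f})")
```

Because each estimate and its standard error are multiplied by the same factor, the t statistics are unchanged by the rescaling, and so is the R-squared.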

4. The following model allows the return to education to depend upon the total
amount of both parents' education, called pareduc:
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{educ} \cdot \mathit{pareduc} + \beta_3\,\mathit{exper} + \beta_4\,\mathit{tenure} + u.$$
a. Use calculus to show that the return to another year of education in this
model is roughly
$$\Delta \log(\mathit{wage}) / \Delta \mathit{educ} = \beta_1 + \beta_2\,\mathit{pareduc}.$$
What sign do you expect for $\beta_2$? Why?

b. Using the data in WAGE2.RAW, the estimated equation is

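$$\widehat{\log(\mathit{wage})} = \underset{(.13)}{5.65} + \underset{(.010)}{.047}\,\mathit{educ} + \underset{(.00021)}{.00078}\,\mathit{educ} \cdot \mathit{pareduc} + \underset{(.004)}{.019}\,\mathit{exper} + \underset{(.003)}{.010}\,\mathit{tenure}$$
$$n = 722,\quad R^2 = .169$$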

(Only 722 observations contain full information on parents' education.)
Interpret the coefficient on the interaction term. It might help to choose two
specific values for pareduc, for example, pareduc = 32 if both parents have a
college education, or pareduc = 24 if both parents have a high school
education, and to compare the estimated return to educ.

c. When pareduc is added as a separate variable to the equation, we get:

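$$\widehat{\log(\mathit{wage})} = \underset{(.38)}{4.94} + \underset{(.027)}{.097}\,\mathit{educ} + \underset{(.017)}{.033}\,\mathit{pareduc} - \underset{(.0012)}{.0016}\,\mathit{educ} \cdot \mathit{pareduc} + \underset{(.004)}{.020}\,\mathit{exper} + \underset{(.003)}{.010}\,\mathit{tenure}$$
$$n = 722,\quad R^2 = .174$$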


Does the estimated return to education now depend positively on parent
education? Test the null hypothesis that the return to education does not depend
on parent education.

5. In example 4.2, where the percentage of students receiving a passing score on a
tenth-grade math exam (math10) is the dependent variable, does it make sense to
include sci10 (the percentage of tenth graders passing a science exam) as an
additional explanatory variable?

6. When $\mathit{atndrte}^2$ and $\mathit{ACT} \cdot \mathit{atndrte}$ are added to the equation estimated in (6.19), the
R-squared becomes 0.232. Are these additional terms jointly significant at the
10% level? Would you include them in the model?

7. Suppose we want to estimate the effects of alcohol consumption (alcohol) on
college grade point average (colGPA). In addition to collecting information on
grade point average and alcohol usage, we also obtain attendance information
(say, percentage of lectures attended, called attend). A standardized test score
(say, SAT) and high school GPA (hsGPA) are also available.
a. Should we include attend along with alcohol as explanatory variables in a
multiple regression model? (Think about how you would interpret
$\beta_{alcohol}$.)
b. Should SAT and hsGPA be included as explanatory variables? Explain.

Computer exercises

C6.1 I use the data in KIELMC.RAW for the year 1981 to run the following regressions.
The data are for houses that sold during 1981 in North Andover, Massachusetts; 1981
was the year construction began on a local garbage incinerator.
a. To study the effects of the incinerator location on housing price, consider the
simple regression model
$$\log(\mathit{price}) = \beta_0 + \beta_1 \log(\mathit{dist}) + u,$$
where price is housing price in dollars and dist is distance from the house to
the incinerator measured in feet. Interpreting this equation causally, what sign
do you expect for $\beta_1$ if the presence of the incinerator depresses housing
prices?
b. I estimate this simple equation. Report the regression results and interpret the
results.
c. To the simple regression model in (a), I add the variables log(intst), log(area),
log(land), rooms, baths, and age, where intst is distance from the home to the
interstate (highway) measured in feet, area is square footage of the house,
land is the lot size in square feet, rooms is total number of rooms, baths is
number of bathrooms, and age is age of the house in years. Now, what do you
conclude about the effects of the incinerator?
d. Next I add $[\log(\mathit{intst})]^2$ to the model from c). Now what happens? What do
you conclude about the importance of functional form?
e. Is the square of log(dist) significant when I add it to the model in d)?


C6.2 I use the data in WAGE1.RAW for this exercise.
a. I estimate the equation
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{exper}^2 + u.$$
Report the results using the usual format.
b. Is $\mathit{exper}^2$ statistically significant at the 1% level?
c. Find the return to the fifth year of experience. What is the return to the
twentieth year of experience? (not using approximations)
d. At what value of exper does additional experience actually lower predicted
log(wage)? How many people have more experience in this sample?
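
The calculations in parts c and d of C6.2 can be sketched as follows. The coefficients are hypothetical placeholders, not estimates from WAGE1.RAW, and the helper names are made up for illustration. The sketch reads "the return to the fifth year" as the exact percentage change in predicted wage when exper moves from 4 to 5.

```python
import math


def exact_pct_change(delta_log):
    """Exact percentage change implied by a change in log(wage)."""
    return 100.0 * (math.exp(delta_log) - 1.0)


def return_to_year(b_exper, b_exper_sq, year):
    """Exact % change in predicted wage when exper rises from year - 1 to year."""
    delta_log = b_exper + b_exper_sq * (year ** 2 - (year - 1) ** 2)
    return exact_pct_change(delta_log)


def peak_experience(b_exper, b_exper_sq):
    """Value of exper beyond which more experience lowers predicted log(wage)."""
    return -b_exper / (2.0 * b_exper_sq)


# Hypothetical coefficients on exper and exper**2, for illustration only.
b_exper, b_exper_sq = 0.04, -0.0007

fifth = return_to_year(b_exper, b_exper_sq, 5)
twentieth = return_to_year(b_exper, b_exper_sq, 20)
peak = peak_experience(b_exper, b_exper_sq)

print(f"fifth year: {fifth:.2f}%, twentieth year: {twentieth:.2f}%")
print(f"turning point at exper = {peak:.1f}")
```

With the estimates from part a substituted for the placeholders, the same functions give the quantities the exercise asks for.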

C6.3 Consider a model where the return to education depends upon the amount of work
experience (and vice versa):
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{educ} \cdot \mathit{exper} + u.$$
a. Show that the return to another year of education, holding exper fixed, is
$\beta_1 + \beta_3\,\mathit{exper}$.
b. State the null hypothesis that the return to education does not depend on the
level of exper. What do you think is the appropriate alternative?
c. Test the null hypothesis in (b) against your stated alternative.
d. Let $\theta_1$ denote the return to education. Write down the model that directly
gives the estimate and standard error for $\theta_1$.

C6.4 Use the housing price data in HPRICE1.dta for this exercise.
a. Estimate the model
$$\log(\mathit{price}) = \beta_0 + \beta_1 \log(\mathit{lotsize}) + \beta_2 \log(\mathit{sqrft}) + \beta_3\,\mathit{bdrms} + u$$
and report the results in the usual OLS format (as on page 154).
b. Find the predicted value of log(price) when lotsize = 20,000, sqrft = 2,500, and
bdrms = 4. Using the method of equation (6.43), find the predicted value of
price at the same values of the explanatory variables.

C6.5 Use the data in VOTE1.dta for this exercise.
a. Consider a model with an interaction between expenditures:
$$\mathit{voteA} = \beta_0 + \beta_1\,\mathit{prtystrA} + \beta_2\,\mathit{expendA} + \beta_3\,\mathit{expendB} + \beta_4\,\mathit{expendA} \cdot \mathit{expendB} + u.$$
What is the partial effect of expendB on voteA, holding prtystrA and expendA
fixed? What is the partial effect of expendA on voteA? Is the expected sign for
$\beta_4$ obvious?
b. Estimate the equation in a) and report the results in the usual form. Is the
interaction term statistically significant?
c. Find the average of expendA in the sample. Fix expendA at 300 (for
$300,000). What is the estimated effect of another $100,000 spent by
Candidate B on voteA? Is this a large effect?
d. Now fix expendB at 100. What is the estimated effect of $\Delta\mathit{expendA} = 100$ on
voteA? Is this a large effect?
e. Now, estimate a model that replaces the interaction with shareA, Candidate
A's percentage share of total campaign expenditures. Does it make sense to
hold both expendA and expendB fixed, while changing shareA?
f. In the model from e), find the partial effect of expendB on voteA, holding
prtystrA and expendA fixed. Evaluate this at expendA = 300 and expendB = 0
and comment on the results.

C6.6 Use the data in ATTEND.dta for this exercise.
a. Given the population regression function in Example 6.3, we have
$$\frac{\partial\,\mathit{stndfnl}}{\partial\,\mathit{priGPA}} = \beta_2 + 2\beta_4\,\mathit{priGPA} + \beta_6\,\mathit{atndrte}.$$
Use equation (6.19) to estimate the partial effect when priGPA = 2.59 and
atndrte = 82. Interpret your estimate.
b. Reparameterize the model to capture the above effect by a single parameter
and estimate the reparameterized model:
$$\mathit{stndfnl} = \theta_0 + \theta_1\,\mathit{atndrte} + \theta_2\,\mathit{priGPA} + \theta_3\,\mathit{ACT} + \theta_4(\mathit{priGPA} - 2.59)^2 + \theta_5\,\mathit{ACT}^2 + \theta_6\,\mathit{priGPA}(\mathit{atndrte} - 82) + u,$$
where $\theta_2 = \beta_2 + 2\beta_4(2.59) + \beta_6(82)$. (Note that the intercept has changed, but
this is not important.) Use this to obtain the standard error of $\hat{\theta}_2$. Is it
statistically significant?

C6.7 Use the data in HPRICE1.dta for this exercise.
a. Estimate the model
$$\mathit{price} = \beta_0 + \beta_1\,\mathit{lotsize} + \beta_2\,\mathit{sqrft} + \beta_3\,\mathit{bdrms} + u$$
and report the results in the usual form, including the standard error of the
regression. Obtain predicted price when we plug in lotsize = 10,000,
sqrft = 2,300, and bdrms = 4; round this price to the nearest dollar.
b. Run a regression that allows you to put a 95% confidence interval around the
predicted value in a). Note that your prediction will differ somewhat due to
rounding error.
Chapter 7
1. In example 7.2, let noPC be a dummy variable equal to one if the student does not
own a PC, and zero otherwise.
a. If noPC is used in place of PC in equation 7.6, what happens to the
intercept in the estimated equation? What will be the coefficient on noPC?
b. What will happen to the R-squared if noPC is used in place of PC?
c. Should PC and noPC both be included as independent variables in the
model? Explain.
2. Suppose you collect data from a survey on wages, education, and gender. In
addition, you ask for information about marijuana usage. The original question is:
"On how many separate occasions last month did you smoke marijuana?"
a. Write an equation that would allow you to estimate the effects of
marijuana usage on wage, while controlling for other factors. You should
be able to make statements such as, "Smoking marijuana five more times
per month is estimated to change wage by x%."
b. Write a model that would allow you to test whether drug usage has
different effects on wages for men and women. How could you test that
there are no differences in the effects of drug usage for men and women?
c. Suppose you think it is better to measure marijuana usage by putting people
into one of four categories: nonuser, light user (1 to 5 times per month),
moderate user (6 to 10 times per month), and heavy user (more than 10
times per month). Now write a model that allows you to estimate the
effects of marijuana usage on wage.
d. Using the model in c), explain in detail how to test the null hypothesis that
marijuana usage has no effect on wage. Be very specific and include a
careful listing of degrees of freedom.
e. What are some potential problems with drawing causal inference using the
survey data that you collected?

Computer Exercises
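C 7.1 Use the data in WAGE2.dta for this exercise.
a. Estimate the model
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{tenure} + \beta_4\,\mathit{married} + \beta_5\,\mathit{black} + \beta_6\,\mathit{south} + \beta_7\,\mathit{urban} + u$$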

and report the results in the usual form. Holding other factors fixed, what
is the approximate difference in monthly salary between blacks and non-
blacks? Is this difference statistically significant?
b. Expand the model in a) to allow the return to education to depend on race
and test whether the return to education does depend on race.
c. Again, start with the model in a), but now allow wages to differ across
four groups of people: married and black, married and nonblack, single
and black, and single and nonblack. What is the estimated wage
differential between married blacks and married nonblacks?

C 7.2 Use the data in GPA2.dta for this exercise.
a. Consider the equation
$$\mathit{colgpa} = \beta_0 + \beta_1\,\mathit{hsize} + \beta_2\,\mathit{hsize}^2 + \beta_3\,\mathit{hsperc} + \beta_4\,\mathit{sat} + \beta_5\,\mathit{female} + \beta_6\,\mathit{athlete} + u,$$

where colgpa is cumulative college grade point average, hsize is size of high
school graduating class, in hundreds, hsperc is academic percentile in
graduating class, sat is combined SAT score, female is a binary gender
variable, and athlete is a binary variable, which is one for student-athletes.
What are your expectations for the coefficients in this equation? Which ones
are you unsure about?
b. Estimate the equation in a) and report the results in the usual form. What
is the estimated GPA differential between athletes and nonathletes? Is it
statistically significant?
c. Drop sat from the model and reestimate the equation. Now what is the
estimated effect of being an athlete? Discuss why the estimate is different
than that obtained in b).
d. In the model from a), allow the effect of being an athlete to differ by
gender and test the null hypothesis that there is no ceteris paribus
difference between women athletes and women nonathletes.
e. Does the effect of sat on colgpa differ by gender? Justify your answer.

Chapter 8
Computer Exercises

C 8.1
a. Use the data in HPRICE1.dta to obtain the heteroskedasticity-robust
standard errors for equation (8.17). Discuss any important differences with
the usual standard errors.
b. Repeat a) for equation (8.18).
c. What does this example suggest about heteroskedasticity and the
transformation used for the dependent variable?

Chapter 9
Computer Exercises

C9.1 Let math10 denote the percentage of students at a Michigan high school receiving a
passing score on a standardized math test (see also Example 4.2). We are interested in
estimating the effect of per student spending on math performance. A simple model is
$$\mathit{math10} = \beta_0 + \beta_1 \log(\mathit{expend}) + \beta_2 \log(\mathit{enroll}) + \beta_3\,\mathit{poverty} + u,$$
where poverty is the percentage of students living in poverty.
a. The variable lnchprg is the percentage of students eligible for the federally
funded school lunch program. Why is this a sensible proxy variable for
poverty?
b. Estimate the model with and without lnchprg as an explanatory variable
and report your regression results. Compare the effect of expenditures on
math10 from both regressions.
c. Does it appear that pass rates are lower at larger schools, other factors
being equal? Explain.
d. Interpret the coefficient of lnchprg.
e. What do you make of the substantial increase in $R^2$ after adding lnchprg?

C 9.2 Use the data set WAGE2.dta for this exercise.

a. Use the variable KWW (the "knowledge of the world of work" test score)
as a proxy variable for ability in place of IQ in Example 9.3. What is the
estimated return to education?
b. Now, use IQ and KWW together as proxy variables. What happens to the
estimated return to education?
c. In b), are IQ and KWW individually significant? Are they jointly
significant?

C 9.3 Use the data from JTRAIN.dta for this exercise.
a. Consider the simple regression model
$$\log(\mathit{scrap}) = \beta_0 + \beta_1\,\mathit{grant} + u,$$
where scrap is the firm scrap rate and grant is a dummy variable
indicating whether a firm received a job training grant. Can you think of
some reasons why the unobserved factor in u might be correlated with
grant?
b. Estimate the simple regression model using the data for 1988. (You should
have 54 observations.) Does receiving a job training grant significantly
lower a firm's scrap rate?
c. Now, add as an explanatory variable $\log(\mathit{scrap}_{87})$. How does this change
the estimated effect of grant? Interpret the coefficient on grant. Is it
statistically significant at the 5% level against the one-sided alternative
$H_a: \beta_{grant} < 0$?
d. Test the null hypothesis that the parameter on $\log(\mathit{scrap}_{87})$ is one against
the two-sided alternative. Report the p-value of the test.
e. Repeat c) and d), using heteroskedasticity-robust standard errors, and
briefly discuss any notable differences.

C 9.4 You need to use two data sets for this exercise: JTRAIN2.dta and JTRAIN3.dta.
(Before solving this problem, read the data dictionary regarding both data sets.) The
former is data from a job training experiment, where job training was assigned by
randomization. The latter contains observational data (a random sample from the
population of American men working in 1978), where job training participation was
largely determined by individual choice. The two data sets cover the same time period.
a. In the data set JTRAIN2.dta, what fraction of the men received job
training? What is the fraction in JTRAIN3.dta? Why do you think there is
such a big difference?
b. Using JTRAIN2.dta, run a simple regression of re78 on train. What is the
estimated effect of participating in job training on real earnings?
c. Now add as controls to the regression in b) the variables re74, re75, educ,
age, black, and hisp. Does the estimated effect of job training on re78
change much? How come?
d. Do the regression in b) and c) using the data in JTRAIN3.dta, reporting
only the estimated coefficients on train, along with their t statistics. What
is the effect now of controlling for the extra factors, and why?
e. Define $\mathit{avgre} = (\mathit{re74} + \mathit{re75})/2$. Find the sample averages, standard
deviations, and minimum and maximum values in the two data sets. Are
these data sets representative of the same populations in 1978?
f. Almost 96% of men in the data set JTRAIN2.dta have avgre less than
$10,000. Using only these men, run the regression re78 on train, re74,
re75, educ, age, black, hisp and report the training estimate and its t
statistic. Run the same regression for JTRAIN3.dta, using only men with
avgre less than $10,000. For the subsample of low-income men, how do
the estimated training effects compare across the experimental and
nonexperimental data sets?
g. Now use each data set to run the simple regression re78 on train, but only
for men who were unemployed in 1974 and 1975. How do the training
estimates compare now? If you find the estimate from the observational
data is higher than that from the experimental data, can you think of an
explanation?
h. Using your findings from the previous regressions, discuss the potential
importance of having comparable populations underlying comparisons of
experimental and nonexperimental estimates.

Chapter 13
1. In example 13.1, assume that the averages of all factors other than educ have
remained constant over time and that the average level of education is 12.2 for the
1972 sample and 13.3 in the 1984 sample. Using the estimates in Table 13.1, find
the estimated change in average fertility between 1972 and 1984. (Be sure to
account for the intercept change and the change in average education.)
2. Using the data in KIELMC.dta, the following two equations were estimated using
the years 1978 and 1981:

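$$\widehat{\log(\mathit{price})} = \underset{(.26)}{11.49} - \underset{(.058)}{.547}\,\mathit{nearinc} + \underset{(.080)}{.394}\,\mathit{y81} \cdot \mathit{nearinc}$$
$$n = 321,\quad R^2 = .220$$

$$\widehat{\log(\mathit{price})} = \underset{(.27)}{11.18} + \underset{(.044)}{.563}\,\mathit{y81} - \underset{(.067)}{.403}\,\mathit{y81} \cdot \mathit{nearinc}$$
$$n = 321,\quad R^2 = .337$$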

The estimates on the interaction term $\mathit{y81} \cdot \mathit{nearinc}$ from the above two equations
are very different from that in equation (13.9). Explain the difference between
these two regressions and equation (13.9).
3. Suppose we want to estimate the effect of several variables on annual saving and
that we have a panel data set on individuals collected on January 31, 1990, and
January 31, 1992. If we include a year dummy for 1992 and use first differencing,
can we also include age in the original model (the model before differencing)?
Explain.

Computer Exercises

C13.1
Use the data in FERTIL1.data for this exercise.
a. In the equation estimated in Example 13.1, test whether living
environment at age 16 has an effect on fertility. (the base group is large
city.) Report the value of the F statistic and the p-value.
b. Test whether region of the country at age 16 (South is the base group) has
an effect on fertility.
c. Add the interaction terms $\mathit{y74} \cdot \mathit{educ}$, $\mathit{y76} \cdot \mathit{educ}$, ..., and $\mathit{y84} \cdot \mathit{educ}$ to the
model estimated in Table 13.1. Explain what these terms represent. Are
they jointly significant?
d. Based on the SRF you got in c), find the fertility level in 1984
relative to the base year 1972 at 12 years of education and at the
sample mean of education in 1984. Explain how we can determine whether these
two estimates are statistically significant; you only need to suggest a regression to
run for each case (educ = 12 and educ at the 1984 sample mean).

C13.2
Use the data in CPS78_85.dta for this exercise.
a. How do you interpret the coefficient on $y85$ in equation (13.2)? Does it
have an interesting interpretation? (Be careful here; you must account for
the interaction terms $y85 \cdot educ$ and $y85 \cdot female$.)
b. Holding other factors fixed, what is the estimated percent increase in
nominal wage for a male with 12 years of education over this time period?
Propose a regression to obtain a confidence interval for this estimate.
c. Reestimate equation (13.2) but let all wages be measured in 1978 dollars.
In particular, define the real wage as rwage = wage for 1978 and as rwage
= wage/1.65 for 1985. Now use $\log(rwage)$ in place of $\log(wage)$ in
estimating (13.2). Before running the regression, try to predict which
coefficients will differ from those in equation (13.2).
d. Explain why the $R^2$ from your regression in c) is not the same as in
equation (13.2).
e. Describe how union participation changed from 1978 to 1985.
f. Starting with equation (13.2), test whether the union wage differential
changed over time.
g. Do your findings in e) and f) conflict? Explain.

C13.3
Use the data in KIELMC.dta for this exercise.
a. The variable dist is the distance from each home to the incinerator site, in
feet. Consider the model
$\log(price) = \beta_0 + \delta_0\, y81 + \beta_1 \log(dist) + \delta_1\, y81 \cdot \log(dist) + u$
If building the incinerator reduces the value of homes closer to the site,
what is the sign of $\delta_1$? What does it mean if $\beta_1 > 0$?
b. Estimate the model in a) and report the results in the usual form. Interpret
the coefficient on $y81 \cdot \log(dist)$. What do you conclude?
c. Add age, $age^2$, rooms, baths, $\log(intst)$, $\log(land)$, and $\log(area)$ to the
equation. Now, what do you conclude about the effect of the incinerator
on housing values?
C13.4
For this exercise, we use JTRAIN.dta to determine the effect of the job training grant on
hours of job training per employee. The basic model for the three years is
$hrsemp_{it} = \beta_0 + \delta_1 d88_t + \delta_2 d89_t + \beta_1 grant_{it} + \beta_2 grant_{i,t-1} + \beta_3 \log(employ_{it}) + a_i + u_{it}$
a. Estimate the equation using first differencing. How many firms are used in
the estimation? How many total observations would be used if each firm
had data on all variables for all three time periods?
b. Interpret the coefficient on grant and comment on its significance.
c. Is it surprising that $grant_{-1}$ is insignificant? Explain.
d. Do larger firms train their employees more or less, on average? How big
are the differences in training due to firm size?
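
For readers who want to see the mechanics of first differencing before opening JTRAIN.dta, the following is a minimal sketch on synthetic data. It does not use the JTRAIN variables: the panel dimensions, the single regressor x, and the true slope of 2 are all hypothetical choices made only for illustration. The point is that differencing within each unit removes the unobserved effect a_i, while pooled OLS in levels does not.

```python
import numpy as np

rng = np.random.default_rng(0)

n_firms, n_periods = 200, 3   # hypothetical balanced panel
beta = 2.0                    # hypothetical true slope

a = rng.normal(size=(n_firms, 1))                      # unobserved firm effect a_i
x = a + rng.normal(size=(n_firms, n_periods))          # regressor correlated with a_i
u = rng.normal(scale=0.5, size=(n_firms, n_periods))   # idiosyncratic error u_it
y = 1.0 + beta * x + a + u

# First differences within each firm: a_i (and the intercept) drop out.
dy = np.diff(y, axis=1).ravel()
dx = np.diff(x, axis=1).ravel()

# OLS of the differenced outcome on a constant and the differenced regressor.
X_fd = np.column_stack([np.ones_like(dx), dx])
coef_fd, *_ = np.linalg.lstsq(X_fd, dy, rcond=None)

# Pooled OLS in levels ignores a_i, so its slope is biased in this setup.
X_lv = np.column_stack([np.ones(x.size), x.ravel()])
coef_pooled, *_ = np.linalg.lstsq(X_lv, y.ravel(), rcond=None)

print("first-difference slope:", coef_fd[1])
print("pooled OLS slope:      ", coef_pooled[1])
```

In the actual exercise the differenced equation would also contain the differenced year dummies and the lagged grant variable; those are omitted here to keep the sketch short.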


Chapter 15
1. Consider a simple model to estimate the effect of personal computer (PC)
ownership on college grade point average for graduating seniors at a large public
university:
$GPA = \beta_0 + \beta_1 PC + u$
where PC is a binary variable indicating PC ownership.
a. Why might PC ownership be correlated with u?
b. Explain why PC is likely to be related to parents' annual income. Does
this mean parental income is a good IV for PC? Why or why not?
c. Suppose that, four years ago, the university gave grants to buy computers
to roughly one-half of the incoming students, and the students who
received grants were randomly chosen. Carefully explain how you would
use this information to construct an instrumental variable for PC.

2. Suppose that you wish to estimate the effect of class attendance on student
performance, as in Example 6.3. A basic model is
$stndfnl = \beta_0 + \beta_1 atndrte + \beta_2 priGPA + \beta_3 ACT + u$
a. Let dist be the distance from the student's living quarters to the lecture
hall. Assuming that dist and u are uncorrelated, what other assumption
must dist satisfy in order to be a valid IV for atndrte?
b. Suppose, as in equation (6.18), we add the interaction term
$priGPA \cdot atndrte$. What might be a good IV for $priGPA \cdot atndrte$? [Hint:
if $E(u \mid priGPA, ACT, dist) = 0$, as happens when priGPA, ACT, and dist
are all exogenous, then any function of priGPA and dist is uncorrelated
with u.]

3. Consider the simple regression model
$y = \beta_0 + \beta_1 x + u$
and let z be a binary instrumental variable for x. Use (15.10) to show that the IV
estimator $\hat{\beta}_1$ can be written as
$\hat{\beta}_1 = (\bar{y}_1 - \bar{y}_0) / (\bar{x}_1 - \bar{x}_0)$
where $\bar{y}_0$ and $\bar{x}_0$ are the sample averages of $y_i$ and $x_i$ over the part of the sample
with $z_i = 0$, and where $\bar{y}_1$ and $\bar{x}_1$ are the sample averages of $y_i$ and $x_i$ over the
part of the sample with $z_i = 1$. This estimator, known as a grouping estimator, was
first suggested by Wald (1940).
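
A numerical check can make the algebra in this problem easier to trust. The sketch below uses synthetic data, with a hypothetical true slope of 2 and a hypothetical unobservable v that makes x endogenous. It assumes that (15.10) is the usual simple IV estimator written as a ratio of sample covariances, Cov(z, y) / Cov(z, x); that reading is an assumption, not something stated in the problem text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

z = rng.integers(0, 2, size=n)               # binary instrument
v = rng.normal(size=n)                       # unobservable that enters both x and u
x = 0.5 + 1.0 * z + v + rng.normal(size=n)   # x depends on z and on v
u = v + rng.normal(size=n)                   # u is correlated with x but not with z
y = 1.0 + 2.0 * x + u                        # hypothetical true slope is 2

# Simple IV slope as a ratio of sample covariances (assumed form of (15.10)).
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Grouping (Wald) form: difference in mean y over difference in mean x.
beta_wald = (y[z == 1].mean() - y[z == 0].mean()) / (
    x[z == 1].mean() - x[z == 0].mean()
)

print(beta_iv, beta_wald)
```

The two numbers agree up to floating-point error, which is the equality the problem asks you to prove algebraically.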

4. Refer to equations (5.19) and (15.20). Assume that $\sigma_u = \sigma_x$, so that the
population variation in the error term is the same as it is in x. Suppose that the
instrumental variable, z, is slightly correlated with u: $Corr(z, u) = 0.1$. Suppose
that z and x have a somewhat stronger correlation: $Corr(z, x) = 0.2$.
a. What is the asymptotic bias in the IV estimator?
b. How much correlation would have to exist between u and x before OLS
has more asymptotic bias than 2SLS?
5. The following is a simple model to measure the effect of a school choice program
on standardized test performance (see Rouse [1998] for motivation):
$score = \beta_0 + \beta_1 choice + \beta_2 faminc + u_1$
where score is the score on a statewide test, choice is a binary variable indicating
whether a student attended a choice school in the last year, and faminc is family
income. The IV for choice is grant, the dollar amount granted to students to use
for tuition at choice schools. The grant amount differed by family income level,
which is why we control for faminc in the equation.
a. Even with faminc in the equation, why might choice be correlated with $u_1$?
b. If, within each income class, the grant amounts were assigned randomly,
is grant uncorrelated with $u_1$?
c. Write the reduced form equation for choice. What is needed for grant to
be partially correlated with choice?

6. Suppose that, in equation (15.8), you do not have a good instrumental variable
candidate for skipped. But you have two other pieces of information on students:
combined SAT score and cumulative GPA prior to the semester. What would you
do instead of IV estimation?

Computer Exercises

C15.1
Use the data in WAGE2.dta for this exercise.
a. In Example 15.2, using sibs as an instrument for educ, the IV estimate of the
return to education is 0.122. To convince yourself that using sibs as an IV for
educ is not the same as just plugging sibs in for educ and running an OLS
regression, run the regression of $\log(wage)$ on sibs and explain your findings.
b. The variable brthord is birth order (it is one for a first-born child, two for a
second-born child, and so on). Explain why educ and brthord might be negatively
correlated. Regress educ on brthord to determine whether there is a statistically
significant negative correlation.
c. Use brthord as an IV for educ in equation (15.1). Report and interpret the results.
d. Now, suppose that we include number of siblings as an explanatory variable in the
wage equation; this controls for family background, to some extent:
$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 sibs + u$
Suppose that we want to use brthord as an IV for educ, assuming that sibs is
exogenous. The reduced form for educ is
$educ = \pi_0 + \pi_1 sibs + \pi_2 brthord + v$
State and test the identification assumption.
e. Estimate the wage equation in d) using brthord as an IV for educ (and sibs as its
own IV). Comment on the standard errors for $\hat{\beta}_{educ}$ and $\hat{\beta}_{sibs}$.
f. Using the fitted values from e), $\widehat{educ}$, compute the correlation between $\widehat{educ}$ and
sibs. Use this result to explain your findings from e).

C15.2
Use the data in CARD.dta for this exercise.
a. The equation we estimated in Example 15.4 can be written as
$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \ldots + u$
where the other explanatory variables are listed in Table 15.1. In order for IV to
be consistent, the IV for educ, nearc4, must be uncorrelated with u. Could nearc4
be correlated with things in the error term, such as unobserved ability? Explain.
b. For a subsample of the men in the data set, an IQ score is available. Regress IQ
on nearc4 to check whether average IQ scores vary by whether the man grew up
near a four-year college. What do you conclude?
c. Now, regress IQ on nearc4, smsa66, and the 1966 regional dummy variables
reg662, ..., reg669. Are IQ and nearc4 related after the geographic dummy
variables have been partialled out?
d. From b) and c), what do you conclude about the importance of controlling for
smsa66 and the 1966 regional dummies in the $\log(wage)$ equation?


C15.3
The purpose of this exercise is to compare the estimates and standard errors obtained by
correctly using 2SLS with those obtained using inappropriate procedures. Use the data
file WAGE2.dta.
a. Use a 2SLS routine to estimate the equation
$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + \beta_4 black + u$
where sibs is the IV for educ. Report the results in the usual form.
b. Now, manually carry out 2SLS. That is, first regress educ on sibs, exper, tenure
and black and obtain the fitted value $\widehat{educ}$. Then run the second stage regression of
$\log(wage)$ on $\widehat{educ}$, exper, tenure and black. Verify that the $\hat{\beta}_j$ are identical to
those obtained from a), but that the standard errors are somewhat different. The
standard errors obtained from the second stage regression when manually carrying
out 2SLS are generally inappropriate.
c. Now, use the following two-step procedure, which generally yields inconsistent
parameter estimates of the $\beta_j$, and not just inconsistent standard errors. In step one,
regress educ on sibs only and obtain the fitted value $\widehat{educ}$. (Note that this is an
incorrect first stage regression.) Then, in the second step, run the regression of
$\log(wage)$ on $\widehat{educ}$, exper, tenure and black. Compare the estimate of the return
to education from this incorrect procedure with that from the proper procedure of
a).
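
The contrast between the three procedures in this exercise can be previewed on synthetic data before running it on WAGE2.dta. The sketch below is only an illustration under stated assumptions: x plays the role of the endogenous regressor, w is a single exogenous control, z is the instrument, and every coefficient (including the true effect of 1 for x) is hypothetical. It uses plain NumPy rather than a packaged 2SLS routine, so it reports point estimates only and says nothing about standard errors.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(2)
n = 20_000

w = rng.normal(size=n)                    # exogenous control
z = 0.6 * w + rng.normal(size=n)          # instrument, correlated with the control
ability = rng.normal(size=n)              # unobserved, makes x endogenous
x = 1.0 + 0.8 * z + 0.5 * w + ability + rng.normal(size=n)
u = ability + rng.normal(size=n)
y = 0.5 + 1.0 * x + 1.0 * w + u           # hypothetical true coefficient on x is 1

const = np.ones(n)
Z = np.column_stack([const, z, w])        # instrument plus exogenous regressors
X = np.column_stack([const, x, w])        # structural regressors

# (a) IV / 2SLS in closed form for the just-identified case: (Z'X)^{-1} Z'y.
b_2sls = np.linalg.solve(Z.T @ X, Z.T @ y)

# (b) Manual 2SLS: the first stage uses the instrument AND the exogenous control.
x_hat = Z @ ols(Z, x)
b_manual = ols(np.column_stack([const, x_hat, w]), y)

# (c) Incorrect two-step: the first stage leaves out the exogenous control.
Z_bad = np.column_stack([const, z])
x_bad = Z_bad @ ols(Z_bad, x)
b_wrong = ols(np.column_stack([const, x_bad, w]), y)

print("2SLS:          ", b_2sls[1])
print("manual 2SLS:   ", b_manual[1])
print("wrong 2-step:  ", b_wrong[1])
```

In this setup the manual second stage reproduces the 2SLS point estimate, while the two-step procedure with the incomplete first stage moves the coefficient on x away from its true value. The size and direction of that gap depend on the made-up correlations above and need not match what you find with the wage data.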
