Columbia University
UN3412
Fall 2016
Calculator was once a job description. This problem set gives you an opportunity to do some
calculations on the relation between smoking and lung cancer using a (very) small sample of five
countries. The purpose of this exercise is to illustrate the mechanics of ordinary least squares
(OLS) regression. You will calculate the regression by hand using formulas from class and the
textbook. For these calculations, you may relive history and use long multiplication, long
division, and tables of square roots and logarithms; or you may use an electronic calculator or a
spreadsheet.
The data are summarized in the following table. The variables are per capita cigarette
consumption in 1930 (the independent variable, X) and the death rate from lung cancer in 1950
(the dependent variable, Y). The cancer rates are shown for a later time period because it takes
time for lung cancer to develop and be diagnosed.
Observation #   Country         Cigarettes consumed per capita in 1930 (X)
1               Switzerland     530
2               Finland         1115
3               Great Britain   1145
4               Canada          510
5               Denmark         380
Source: Edward R. Tufte, Data Analysis for Politics and Management, Table 3.3.
1. Use a calculator, a spreadsheet, or hand methods to compute the following; refer to the
textbook for the necessary formulas. (Note: if you use a spreadsheet, attach a printout.)
a) The sample means of X and Y, $\bar{X}$ and $\bar{Y}$.
$\bar{X} = 736$, $\bar{Y} = 276$
b) The standard deviations of X and Y, $s_X$ and $s_Y$.
$s_X = 364.41$, $s_Y = 132.35$
c) The correlation coefficient, r, between X and Y.
$r = 0.92625 \approx 0.93$
d) $\hat{\beta}_1$, the OLS estimated slope coefficient from the regression $Y_i = \beta_0 + \beta_1 X_i + u_i$.
$\hat{\beta}_1 = 0.336418$
e) $\hat{\beta}_0$, the OLS estimated intercept coefficient from the same regression.
$\hat{\beta}_0 = 28.39656$
f) $\hat{Y}_i$, i = 1, ..., n, the predicted values for each country from the regression.
Country         Predicted value
Switzerland     206.6981
Finland         403.5026
Great Britain   413.5952
Canada          199.9697
Denmark         156.2354
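The hand calculations in question 1 can be sketched in a few lines of Python. The X values come from the data table. The Y values are an assumption: the lung cancer column is not reproduced in the table above, so the figures below are taken from Tufte's Table 3.3 and were chosen because they reproduce the reported mean, standard deviation, correlation, and slope. Variable names are illustrative only.

```python
# OLS "by hand" for the five-country sample.
# x is from the data table; y is an ASSUMPTION (Tufte, Table 3.3), since the
# Y column does not appear in the table above.
countries = ["Switzerland", "Finland", "Great Britain", "Canada", "Denmark"]
x = [530, 1115, 1145, 510, 380]   # cigarettes per capita, 1930
y = [250, 350, 465, 150, 165]     # lung cancer death rate, 1950 (assumed)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and cross-products.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample standard deviations use the n - 1 divisor.
s_x = (s_xx / (n - 1)) ** 0.5
s_y = (s_yy / (n - 1)) ** 0.5

r = s_xy / (s_xx * s_yy) ** 0.5          # correlation coefficient
beta1_hat = s_xy / s_xx                  # OLS slope
beta0_hat = y_bar - beta1_hat * x_bar    # OLS intercept
y_hat = [beta0_hat + beta1_hat * xi for xi in x]   # predicted values

for name, fitted in zip(countries, y_hat):
    print(f"{name:14s} {fitted:9.4f}")
```

The predicted values printed here can differ from the table in the fourth decimal place, because the table appears to have been computed from the rounded coefficients.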
2. On graph paper or using a spreadsheet, graph the scatterplot of the five data points and the
regression line. Be sure to label the axes, the data points, the residuals, and the slope and
intercept of the regression line.
3. You are hired by the governor to study whether a tax on liquor has decreased average liquor
consumption in New York. From a random sample of n individuals in New York, you obtain
each person's liquor consumption both for the year before and for the year after the
introduction of the tax. From these data, you compute $Y_i$ = "change in liquor consumption" for
individual i = 1, ..., n. $Y_i$ is measured in ounces, so if, for example, $Y_i = 10$, then individual i
increased his liquor consumption by 10 ounces. Let the parameters $\mu_Y$ and $\sigma_Y^2$ denote the
population mean and variance of Y.
a) You are interested in testing the hypothesis H0 that there was no change in liquor
consumption due to the tax. State this formally in terms of the population
parameters.
b) The alternative, H1, is that there was a decline in liquor consumption; state the
alternative in terms of the population parameters.
c) Suppose that your sample size is n = 900 and you obtain estimates $\hat{\mu}_Y = -32.8$ and
$\hat{\sigma}_Y = 466.4$. Report the t-statistic for testing H0 against H1. Obtain the p-value for
the test [use Table 1 in Stock and Watson, pp. 749-750]. Do you reject at the 5%
level? At the 1% level?
d) Would you say that the estimated fall in consumption is large in magnitude?
Comment on the practical versus statistical significance of this estimate.
e) In your analysis, what has been implicitly assumed about other determinants of
liquor consumption over the two-year period in order to infer causality from the
tax change to liquor consumption?
Solution
a)
b)
c)
d)
e)
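As a rough check on part (c), the t-statistic and one-sided p-value can be sketched as follows. This assumes that 466.4 is the sample standard deviation of Y (not the variance) and that the test is H0: mean change = 0 against H1: mean change < 0. The helper `phi` is an illustrative name for the standard normal CDF, used in place of the textbook table.

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n = 900
mean_change = -32.8
sd_change = 466.4      # ASSUMPTION: interpreted as the sample standard deviation

se = sd_change / math.sqrt(n)          # standard error of the sample mean
t_stat = (mean_change - 0.0) / se      # test of H0: mean change = 0
p_value = phi(t_stat)                  # one-sided, H1: mean change < 0

print(f"SE = {se:.3f}, t = {t_stat:.3f}, one-sided p-value = {p_value:.4f}")
print("reject at 5%:", p_value < 0.05)
print("reject at 1%:", p_value < 0.01)
```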
4. Let Y be a Bernoulli random variable with success probability Pr(Y = 1) = p, and let $Y_1, \ldots, Y_n$
be i.i.d. draws from this distribution. Let $\hat{p}$ be the fraction of successes (1s) in this sample.
a. Show that $\hat{p} = \bar{Y}$.
b. Show that $\hat{p}$ is an unbiased estimator of p.
Solution
Each random draw $Y_i$ from the Bernoulli distribution takes a value of either zero or one, with
probability $\Pr(Y_i = 1) = p$ and $\Pr(Y_i = 0) = 1 - p$. The random variable $Y_i$ has mean
$E(Y_i) = 0 \times \Pr(Y = 0) + 1 \times \Pr(Y = 1) = p$
and variance
$Var(Y_i) = E[(Y_i - \mu_Y)^2] = (0 - p)^2 \Pr(Y = 0) + (1 - p)^2 \Pr(Y = 1) = p^2 (1 - p) + (1 - p)^2 p = p(1 - p)$
a) The fraction of successes is
$\hat{p} = \#(\text{successes})/n = \#(Y_i = 1)/n = \sum_{i=1}^{n} Y_i / n = \bar{Y}$
b) $E(\hat{p}) = E\left(\sum_{i=1}^{n} Y_i / n\right) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{1}{n}(np) = p$
c) $Var(\hat{p}) = Var\left(\sum_{i=1}^{n} Y_i / n\right) = \frac{1}{n^2} \sum_{i=1}^{n} Var(Y_i) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}$
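The mean and variance of the sample fraction can also be verified exactly for a particular case by summing over the binomial distribution of the number of successes. The sketch below uses exact rational arithmetic. The values n = 8 and p = 3/10 are hypothetical, chosen only for illustration, and `p_hat_moments` is an illustrative helper name.

```python
from fractions import Fraction
from math import comb

def p_hat_moments(n, p):
    """Exact mean and variance of p_hat = (number of successes) / n."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for k in range(n + 1):
        # Binomial probability of exactly k successes in n draws.
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)
        value = Fraction(k, n)
        mean += prob * value
        second_moment += prob * value**2
    return mean, second_moment - mean**2

n = 8                   # hypothetical sample size
p = Fraction(3, 10)     # hypothetical success probability
mean, variance = p_hat_moments(n, p)

print("E(p_hat)   =", mean)
print("Var(p_hat) =", variance)
print("p(1-p)/n   =", p * (1 - p) / n)
```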
5. Let $Y_1, Y_2, Y_3, Y_4$ be independently, identically distributed random variables from a
population with mean $\mu$ and variance $\sigma^2$. Let $\bar{Y} = (1/4)(Y_1 + Y_2 + Y_3 + Y_4)$ denote the average of
these four random variables.
a. What are the expected value and variance of $\bar{Y}$ in terms of $\mu$ and $\sigma^2$?
$E[\bar{Y}] = (1/4)\{E[Y_1] + E[Y_2] + E[Y_3] + E[Y_4]\} = \mu$
$Var[\bar{Y}] = (1/16)(4\sigma^2) = \sigma^2/4$
b. Now, consider a different estimator of $\mu$: $\tilde{\mu} = (1/8)Y_1 + (1/8)Y_2 + (1/4)Y_3 + (1/2)Y_4$. This is an
example of a weighted average of the $Y_i$'s. Show that $\tilde{\mu}$ is also an unbiased estimator of
$\mu$. Find the variance of $\tilde{\mu}$.
$E[\tilde{\mu}] = (1/8)E[Y_1] + (1/8)E[Y_2] + (1/4)E[Y_3] + (1/2)E[Y_4] = \mu$
$Var[\tilde{\mu}] = (1/64 + 1/64 + 1/16 + 1/4)\sigma^2 = (11/32)\sigma^2$
c. Based on your answers to parts (a) and (b), which estimator of $\mu$ do you prefer, $\bar{Y}$ or $\tilde{\mu}$?
Both $\bar{Y}$ and $\tilde{\mu}$ are unbiased, since the expected value of each equals the population mean.
$\bar{Y}$ is preferred to $\tilde{\mu}$ since $Var[\bar{Y}] = (8/32)\sigma^2 < (11/32)\sigma^2 = Var[\tilde{\mu}]$.
d. Suppose $Y_1, Y_2, Y_3, Y_4$ follow a Normal distribution with mean $\mu = 5$ and variance $\sigma^2 = 3$.
What are the distributions of $\bar{Y}$ and $\tilde{\mu}$?
Because a linear combination of independent normal random variables is itself normal, $\bar{Y}$ is
normally distributed with mean 5 and variance 3/4, and $\tilde{\mu}$ is normally distributed with mean
5 and variance 33/32.
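The comparison in question 5 rests on two facts for i.i.d. draws: a weighted average is unbiased when the weights sum to one, and its variance is the population variance times the sum of squared weights. A minimal sketch with exact fractions follows; the function name `variance_factor` is illustrative.

```python
from fractions import Fraction as F

def variance_factor(weights):
    """Var(sum of w_i * Y_i) divided by sigma^2, for i.i.d. Y_i."""
    return sum(w * w for w in weights)

equal_weights = [F(1, 4)] * 4                         # the simple average
unequal_weights = [F(1, 8), F(1, 8), F(1, 4), F(1, 2)]  # the weighted average

# Unbiasedness: the weights of each estimator sum to one.
print("sum of weights:", sum(equal_weights), sum(unequal_weights))

# Variance as a multiple of sigma^2.
print("variance factors:", variance_factor(equal_weights),
      variance_factor(unequal_weights))

# Part (d): plug in sigma^2 = 3.
sigma2 = 3
var_simple = variance_factor(equal_weights) * sigma2
var_weighted = variance_factor(unequal_weights) * sigma2
print("variances with sigma^2 = 3:", var_simple, var_weighted)
```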
6. Suppose at Columbia University, grade point average (GPA) and SAT scores are related by
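the conditional expectation E(GPA|SAT) = 0.90 + 0.001 × SAT.
a. Find the expected GPA when SAT = 1600.
E(GPA|SAT=1600) = 0.90 + 0.001 × 1600 = 2.5
b. Find E(GPA|SAT=2200).
E(GPA|SAT=2200) = 0.90 + 0.001 × 2200 = 3.1
c. If the average SAT in the university is 2000, what is the average GPA?
By the law of iterated expectations, E(GPA) = E[E(GPA|SAT)] = 0.90 + 0.001 × E(SAT)
= 0.90 + 0.001 × 2000 = 2.9.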
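Because the conditional expectation is linear in SAT, the unconditional mean GPA is obtained by evaluating the same line at the mean SAT score. A small sketch of the three calculations, with `expected_gpa` as an illustrative function name:

```python
def expected_gpa(sat):
    """Conditional expectation E(GPA | SAT) = 0.90 + 0.001 * SAT."""
    return 0.90 + 0.001 * sat

gpa_1600 = expected_gpa(1600)
gpa_2200 = expected_gpa(2200)

# Linearity means E(GPA) = expected_gpa(E(SAT)).
mean_sat = 2000
mean_gpa = expected_gpa(mean_sat)

print(round(gpa_1600, 2), round(gpa_2200, 2), round(mean_gpa, 2))
```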
7. Let u and X be two random variables that satisfy $E[u|X] = 0$ and $E[u^2|X] = \sigma^2$.
a. Find the unconditional mean and variance of u.
$E[u] = E[E[u|X]] = E[0] = 0$
$E[u^2] = E[E[u^2|X]] = E[\sigma^2] = \sigma^2$, so $Var(u) = E[u^2] - (E[u])^2 = \sigma^2$
b. What is the covariance between u and X?
$Cov(u, X) = E[uX] - E[u]E[X] = E[E[uX|X]] - 0 = E[X \, E[u|X]] = E[X \cdot 0] = E[0] = 0$
8. Adult males are taller, on average, than adult females. Visiting two recent American Youth
         $\bar{Y}$   $s$    $n$
Boys     57.8        3.9    55
Girls    58.4        4.2    57
where $\bar{Y}_{Boys}$ is the sample average height for boys, $n_{Boys}$ is the number of boys in the
sample, and $s^2_{Boys}$ is the sample variance of height of boys.
(a) Let your null hypothesis be that there is no difference in the height of females and males
at this age level. Specify the alternative hypothesis.
Let $\mu_{Boys}$ and $\mu_{Girls}$ denote the mean height of boys and girls in the population. The null and
alternative hypotheses are $H_0: \mu_{Boys} - \mu_{Girls} = 0$ and $H_1: \mu_{Boys} - \mu_{Girls} \neq 0$.
(b) What is the unbiased estimate of the difference in height between boys and girls? Provide
a formula and check the unbiasedness. Calculate the value of this estimate for the given
sample.
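An unbiased estimator of $\mu_{Boys} - \mu_{Girls}$ is $\bar{Y}_{Boys} - \bar{Y}_{Girls}$. By the i.i.d. assumption,
$E[\bar{Y}_{Boys} - \bar{Y}_{Girls}] = E[\bar{Y}_{Boys}] - E[\bar{Y}_{Girls}]$
$= E\left[\frac{1}{n_{Boys}} \sum_{i=1}^{n_{Boys}} Y_{Boys,i}\right] - E\left[\frac{1}{n_{Girls}} \sum_{i=1}^{n_{Girls}} Y_{Girls,i}\right]$
$= \frac{1}{n_{Boys}} \sum_{i=1}^{n_{Boys}} E[Y_{Boys,i}] - \frac{1}{n_{Girls}} \sum_{i=1}^{n_{Girls}} E[Y_{Girls,i}]$
$= \frac{1}{n_{Boys}} \sum_{i=1}^{n_{Boys}} \mu_{Boys} - \frac{1}{n_{Girls}} \sum_{i=1}^{n_{Girls}} \mu_{Girls}$
$= \mu_{Boys} - \mu_{Girls}$
The estimate we get from this sample is $\bar{Y}_{Boys} - \bar{Y}_{Girls} = 57.8 - 58.4 = -0.6$.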
(c) Derive the formula for the variance of the estimate from (b). Calculate the estimate of the
variance for the given sample.
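Let $\sigma^2_{Boys}$ and $\sigma^2_{Girls}$ denote the variance of the height of boys and girls in the population. By
the i.i.d. assumption,
$Var[\bar{Y}_{Boys} - \bar{Y}_{Girls}] = Var[\bar{Y}_{Boys}] + Var[\bar{Y}_{Girls}]$
$= Var\left[\frac{1}{n_{Boys}} \sum_{i=1}^{n_{Boys}} Y_{Boys,i}\right] + Var\left[\frac{1}{n_{Girls}} \sum_{i=1}^{n_{Girls}} Y_{Girls,i}\right]$
$= \frac{1}{n_{Boys}^2} \sum_{i=1}^{n_{Boys}} Var[Y_{Boys,i}] + \frac{1}{n_{Girls}^2} \sum_{i=1}^{n_{Girls}} Var[Y_{Girls,i}]$
$= \frac{1}{n_{Boys}^2} n_{Boys} \sigma^2_{Boys} + \frac{1}{n_{Girls}^2} n_{Girls} \sigma^2_{Girls}$
$= \frac{\sigma^2_{Boys}}{n_{Boys}} + \frac{\sigma^2_{Girls}}{n_{Girls}}$
A natural estimator of $Var[\bar{Y}_{Boys} - \bar{Y}_{Girls}]$ is
$\widehat{Var}[\bar{Y}_{Boys} - \bar{Y}_{Girls}] = \frac{s^2_{Boys}}{n_{Boys}} + \frac{s^2_{Girls}}{n_{Girls}}$
$\widehat{Var}[\bar{Y}_{Boys} - \bar{Y}_{Girls}] = \frac{3.9^2}{55} + \frac{4.2^2}{57} \approx 0.586$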
(d) Create a statistic for testing the hypothesis in (a) using the Central Limit Theorem and the
Law of Large Numbers.
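Let us consider the z-statistic
$z = \frac{\bar{Y}_{Boys} - \bar{Y}_{Girls}}{\sqrt{\frac{\sigma^2_{Boys}}{n_{Boys}} + \frac{\sigma^2_{Girls}}{n_{Girls}}}}$
Suppose $H_0$ is true. Then by the Central Limit Theorem, the z-statistic is approximately
distributed according to $N(0,1)$ in large samples. We do not know $\sigma^2_{Boys}$ and $\sigma^2_{Girls}$, so we
have to replace these unknown parameters with their estimators. This gives us the t-statistic
$t = \frac{\bar{Y}_{Boys} - \bar{Y}_{Girls}}{\sqrt{\frac{s^2_{Boys}}{n_{Boys}} + \frac{s^2_{Girls}}{n_{Girls}}}}$
By the Law of Large Numbers, $s^2_{Boys}$ and $s^2_{Girls}$ are consistent estimators of $\sigma^2_{Boys}$ and $\sigma^2_{Girls}$. Thus, in
large samples the t-statistic is also well approximated by $N(0,1)$. We can use this result to
test $H_0$.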
(e) Calculate the t-statistic for comparing the two means. Is the difference statistically
significant at the 1% level? Which critical value did you use? Why would this number be
smaller if you had assumed a one-sided alternative hypothesis? What is the intuition
behind this?
In the given sample, $t = \frac{-0.6}{\sqrt{0.586}} \approx -0.784$. The 1% critical value for the t-statistic is 2.58
for the two-sided test. Since |t| < 2.58, the difference is not statistically significant at the 1% level.
If we had a one-sided alternative, we should use the one-sided test, whose 1% critical
value is 2.33. Intuitively, by choosing a one-sided alternative, we exclude the other
half as an alternative. Thus, the probability of observing a value as extreme as the
given t-statistic is half as large as under the two-sided alternative. That is, we
are less conservative with the null hypothesis, and our critical value becomes smaller.
(f) Generate a 95% confidence interval for the difference in height.
The 95% confidence interval for $\mu_{Boys} - \mu_{Girls}$ is
$[-0.6 \pm (1.96)\sqrt{0.586}] = [-2.1, 0.9]$
where 1.96 is the critical value for a two-sided test at the 5% level. Note that since zero is included
in this interval, we would conclude that the difference in average heights of boys and
girls is not statistically significant.
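The calculations in parts (b), (c), (e), and (f) can be sketched together. The inputs are the sample figures used in the solution (means 57.8 and 58.4, standard deviations 3.9 and 4.2, sample sizes 55 and 57); the variable names are illustrative.

```python
import math

# Sample summaries used in the solution.
mean_boys, sd_boys, n_boys = 57.8, 3.9, 55
mean_girls, sd_girls, n_girls = 58.4, 4.2, 57

diff = mean_boys - mean_girls                             # part (b)
var_diff = sd_boys**2 / n_boys + sd_girls**2 / n_girls    # part (c)
se_diff = math.sqrt(var_diff)                             # standard error
t_stat = diff / se_diff                                   # part (e)

# Part (f): 95% confidence interval with the two-sided 5% critical value.
ci_low = diff - 1.96 * se_diff
ci_high = diff + 1.96 * se_diff

print(f"difference = {diff:.1f}")
print(f"estimated variance = {var_diff:.3f}, standard error = {se_diff:.4f}")
print(f"t-statistic = {t_stat:.3f}")
print(f"95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
print("significant at 1% (two-sided):", abs(t_stat) > 2.58)
```

Note that 0.586 is the estimated variance of the difference; the standard error that enters the t-statistic and the confidence interval is its square root.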
The following questions will not be graded; they are for you to practice and will be discussed at
the recitation:
9. [Practice question, not graded] SW 2.3
            Rain (X=0)   No Rain (X=1)   Total
Y = 0       0.15         0.07            0.22
Y = 1       0.15         0.63            0.78
Total       0.30         0.70            1.00
Using the random variables X and Y from Table 2.2 (given above), consider two new random
variables W = 3 + 6X and V = 20 - 7Y. Compute:
a) E(W) and E(V).
b) $\sigma_W^2$ and $\sigma_V^2$.
c) $\sigma_{WV}$ and Corr(W, V).
Solution:
(a) E(V) = E(20 - 7Y) = 20 - 7E(Y) = 20 - 7 × 0.78 = 14.54
E(W) = E(3 + 6X) = 3 + 6E(X) = 3 + 6 × 0.70 = 7.2
(b) Var(W) = Var(3 + 6X) = 6^2 Var(X) = 36 × 0.21 = 7.56
Var(V) = Var(20 - 7Y) = (-7)^2 Var(Y) = 49 × 0.1716 = 8.4084
(c) Cov(W,V) = Cov(3 + 6X, 20 - 7Y) = 6 × (-7) × Cov(X,Y) = -42 × 0.084 = -3.528
Corr(W,V) = -3.528 / sqrt(7.56 × 8.4084) = -0.4425 (= -Corr(X,Y))
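The moments of X and Y used above (E(X) = 0.70, Var(X) = 0.21, Var(Y) = 0.1716, Cov(X,Y) = 0.084) follow directly from the joint distribution. The sketch below recomputes them and then applies the rules for linear transformations; the dictionary layout and the helper `expect` are illustrative choices.

```python
import math

# joint[(x, y)] = Pr(X = x, Y = y), read from the table.
joint = {(0, 0): 0.15, (1, 0): 0.07, (0, 1): 0.15, (1, 1): 0.63}

def expect(f):
    """Expected value of f(X, Y) under the joint distribution."""
    return sum(prob * f(x, y) for (x, y), prob in joint.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - e_x) ** 2)
var_y = expect(lambda x, y: (y - e_y) ** 2)
cov_xy = expect(lambda x, y: (x - e_x) * (y - e_y))

# W = 3 + 6X and V = 20 - 7Y.
e_w = 3 + 6 * e_x
e_v = 20 - 7 * e_y
var_w = 6**2 * var_x
var_v = (-7) ** 2 * var_y
cov_wv = 6 * (-7) * cov_xy
corr_wv = cov_wv / math.sqrt(var_w * var_v)

print(f"E(W) = {e_w:.2f}, E(V) = {e_v:.2f}")
print(f"Var(W) = {var_w:.4f}, Var(V) = {var_v:.4f}")
print(f"Cov(W,V) = {cov_wv:.3f}, Corr(W,V) = {corr_wv:.4f}")
```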
10. [Practice question, not graded]
                              Unemployed (Y=0)   Employed (Y=1)   Total
Non-college graduates (X=0)   0.045              0.709            0.754
College graduates (X=1)       0.005              0.241            0.246
Total                         0.050              0.950            1.000
a. Compute E(Y).
b. The unemployment rate is the fraction of the labor force that is unemployed. Show that
the unemployment rate is given by 1-E(Y).
c. Calculate E(Y|X=1) and E(Y|X=0).
d. Calculate the unemployment rate for (i) college graduates and (ii) non-college graduates.
e. A randomly selected member of this population reports being unemployed. What is the
probability that this worker is a college graduate? A non-college graduate?
f. Are educational achievement and employment status independent? Explain.
Solution:
(a) E(Y) = 0 × Pr(Y=0) + 1 × Pr(Y=1) = 0 × 0.050 + 1 × 0.950 = 0.950
(b) Unemployment rate = #(unemployed) / #(labor force)
= Pr(Y=0) = 1 - Pr(Y=1) = 1 - E(Y)
(c) We calculate the conditional probabilities first:
Pr(Y=0|X=0) = Pr(X=0 & Y=0)/Pr(X=0) = 0.045/0.754 = 0.0597
Pr(Y=1|X=0) = Pr(X=0 & Y=1)/Pr(X=0) = 0.709/0.754 = 0.9403
Pr(Y=0|X=1) = Pr(X=1 & Y=0)/Pr(X=1) = 0.005/0.246 = 0.0203
Pr(Y=1|X=1) = Pr(X=1 & Y=1)/Pr(X=1) = 0.241/0.246 = 0.9797
The conditional expectations are:
E(Y|X=1) = 0 × Pr(Y=0|X=1) + 1 × Pr(Y=1|X=1)
= 0 × 0.0203 + 1 × 0.9797 = 0.9797
E(Y|X=0) = 0 × Pr(Y=0|X=0) + 1 × Pr(Y=1|X=0)
= 0 × 0.0597 + 1 × 0.9403 = 0.9403
(d) Use the solution to part (b):
Unemployment rate for college graduates
= 1 - E(Y|X=1) = 1 - 0.9797 = 0.0203
Unemployment rate for non-college graduates
= 1 - E(Y|X=0) = 1 - 0.9403 = 0.0597
(e) The probability that a randomly selected worker who reports being unemployed
is a college graduate is
Pr(X=1|Y=0) = Pr(X=1 & Y=0)/Pr(Y=0) = 0.005/0.050 = 0.1
The probability that this worker is a non-college graduate is
Pr(X=0|Y=0) = 1 - Pr(X=1|Y=0) = 1 - 0.1 = 0.9
(f) Educational achievement and employment status are not independent because they do not satisfy,
for all values of x and y,
Pr(Y = y | X = x) = Pr(Y = y)
For example,
Pr(Y = 0 | X = 0) = 0.0597 ≠ Pr(Y = 0) = 0.050
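The conditional probabilities in parts (c) through (f) are all ratios of a joint probability to a marginal probability. A compact sketch, with illustrative helper names and the joint table stored as a dictionary keyed by (x, y):

```python
# joint[(x, y)] = Pr(X = x, Y = y); X = 1 is a college graduate, Y = 1 is employed.
joint = {(0, 0): 0.045, (0, 1): 0.709, (1, 0): 0.005, (1, 1): 0.241}

def pr_x(x):
    """Marginal probability Pr(X = x)."""
    return sum(prob for (xi, _), prob in joint.items() if xi == x)

def pr_y(y):
    """Marginal probability Pr(Y = y)."""
    return sum(prob for (_, yi), prob in joint.items() if yi == y)

def pr_y_given_x(y, x):
    """Conditional probability Pr(Y = y | X = x)."""
    return joint[(x, y)] / pr_x(x)

def pr_x_given_y(x, y):
    """Conditional probability Pr(X = x | Y = y)."""
    return joint[(x, y)] / pr_y(y)

print("E(Y) =", round(pr_y(1), 3))
print("E(Y|X=1) =", round(pr_y_given_x(1, 1), 4))
print("E(Y|X=0) =", round(pr_y_given_x(1, 0), 4))
print("Pr(college | unemployed) =", round(pr_x_given_y(1, 0), 3))

# Independence would require Pr(Y=0 | X=0) to equal Pr(Y=0).
independent = abs(pr_y_given_x(0, 0) - pr_y(0)) < 1e-12
print("independent:", independent)
```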
11. [Practice question, not graded] SW 2.14 [Hint: Use SW Appendix Table 1.]
In a population E[Y] = 100 and Var(Y) = 43. Use the central limit theorem to answer the
following questions:
a. In a random sample of size n = 100, find $\Pr(\bar{Y} \le 101)$.
b. In a random sample of size n = 165, find $\Pr(\bar{Y} > 98)$.
c. In a random sample of size n = 64, find $\Pr(101 \le \bar{Y} \le 103)$.
Solution:
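a. In a random sample of size n = 100, find $\Pr(\bar{Y} \le 101)$.
$Var(\bar{Y}) = 43/100 = 0.43$, so
$\Pr(\bar{Y} \le 101) = \Pr\left(\frac{\bar{Y} - 100}{\sqrt{0.43}} \le \frac{101 - 100}{\sqrt{0.43}}\right)$
$\approx \Phi(1.525) = 0.9364$
b. In a random sample of size n = 165, find $\Pr(\bar{Y} > 98)$.
$Var(\bar{Y}) = 43/165 = 0.2606$
$\Pr(\bar{Y} > 98) = 1 - \Pr(\bar{Y} \le 98) = 1 - \Pr\left(\frac{\bar{Y} - 100}{\sqrt{0.2606}} \le \frac{98 - 100}{\sqrt{0.2606}}\right)$
$\approx 1 - \Phi(-3.9178) = \Phi(3.9178) \approx 1$
c. In a random sample of size n = 64, find $\Pr(101 \le \bar{Y} \le 103)$.
$Var(\bar{Y}) = 43/64 = 0.6719$
$\Pr(101 \le \bar{Y} \le 103)$
$= \Pr\left(\frac{101 - 100}{\sqrt{0.6719}} \le \frac{\bar{Y} - 100}{\sqrt{0.6719}} \le \frac{103 - 100}{\sqrt{0.6719}}\right)$
$\approx \Phi(3.6599) - \Phi(1.22) = 0.9999 - 0.8888 = 0.1111$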
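The three probabilities can be reproduced without the table by standardizing the sample mean and evaluating the standard normal CDF numerically. This is a sketch of the central limit theorem approximation only; `phi` and `z_score` are illustrative helper names.

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_score(value, mu, var, n):
    """Standardize a value of the sample mean: (value - mu) / sqrt(var / n)."""
    return (value - mu) / math.sqrt(var / n)

mu, var = 100, 43

p_a = phi(z_score(101, mu, var, 100))                 # Pr(Ybar <= 101), n = 100
p_b = 1.0 - phi(z_score(98, mu, var, 165))            # Pr(Ybar > 98), n = 165
p_c = phi(z_score(103, mu, var, 64)) - phi(z_score(101, mu, var, 64))  # n = 64

print(f"(a) {p_a:.4f}")
print(f"(b) {p_b:.4f}")
print(f"(c) {p_c:.4f}")
```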
12. [Practice question, not graded]
         Average salary   Standard deviation   n
Men      $3100            $200                 100
Women    $2900            $320                 64
a. What do these data suggest about wage differences in the firm? Do they represent
statistically significant evidence that wages of men and women are different? (To answer
this question, first state the null and alternative hypotheses; second, compute the relevant
t-statistic; and finally, use the p-value to answer the question.)
b. Do these data suggest that the firm is guilty of gender discrimination in its compensation
policies? Explain.
Solution:
The standard error of $\bar{Y}_1 - \bar{Y}_2$ is
$SE(\bar{Y}_1 - \bar{Y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{200^2}{100} + \frac{320^2}{64}} = 44.721$.
(a)
$H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 > 0$
With the t-statistic $t^{act} = 4.4722$, the p-value for the one-sided test is:
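The standard error, t-statistic, and normal-approximation p-values for the difference in mean salaries can be computed as follows. This is a sketch under the assumption that the table reports the sample mean, sample standard deviation, and sample size for each group, which is how the solution's standard error formula uses them. Both the two-sided and the one-sided p-values are shown; `phi` is an illustrative helper.

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# ASSUMPTION: columns are (mean, standard deviation, n) for each group.
mean_men, sd_men, n_men = 3100, 200, 100
mean_women, sd_women, n_women = 2900, 320, 64

se = math.sqrt(sd_men**2 / n_men + sd_women**2 / n_women)
t_act = (mean_men - mean_women) / se

p_two_sided = 2.0 * phi(-abs(t_act))   # H1: the two means differ
p_one_sided = phi(-t_act)              # H1: men's mean exceeds women's mean

print(f"SE = {se:.3f}, t = {t_act:.4f}")
print(f"two-sided p-value = {p_two_sided:.2e}")
print(f"one-sided p-value = {p_one_sided:.2e}")
```

A small p-value here speaks only to the difference in means; it does not by itself address the question in part (b).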
13. [Practice question, not graded] SW 2.10 [Hint: Use SW Appendix Table 1.]
Compute the following probabilities:
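If $Y \sim N(\mu_Y, \sigma_Y^2)$, then $\frac{Y - \mu_Y}{\sigma_Y} \sim N(0, 1)$.
(a)
$\Pr(Y \le 3) = \Pr\left(\frac{Y - 1}{2} \le \frac{3 - 1}{2}\right) = \Phi(1) = 0.8413$
(b)
$\Pr(Y > 0) = 1 - \Pr(Y \le 0) = 1 - \Pr\left(\frac{Y - 3}{3} \le \frac{0 - 3}{3}\right) = 1 - \Phi(-1) = \Phi(1) = 0.8413$
(c)
$\Pr(40 \le Y \le 52) = \Pr\left(\frac{40 - 50}{5} \le \frac{Y - 50}{5} \le \frac{52 - 50}{5}\right)$
$= \Phi(0.4) - \Phi(-2) = \Phi(0.4) - [1 - \Phi(2)]$
$= 0.6554 - 1 + 0.9772 = 0.6326$
(d)
$\Pr(6 \le Y \le 8) = \Pr\left(\frac{6 - 5}{\sqrt{2}} \le \frac{Y - 5}{\sqrt{2}} \le \frac{8 - 5}{\sqrt{2}}\right)$
$= \Phi(2.1213) - \Phi(0.7071)$
$= 0.9831 - 0.7602 = 0.2229$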
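These table lookups can be checked numerically. The sketch below assumes the four distributions implied by the standardizations in the solution: N(1, 4), N(3, 9), N(50, 25), and N(5, 2). The helper names `phi` and `normal_cdf` are illustrative.

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(value, mean, variance):
    """Pr(Y <= value) for Y ~ N(mean, variance), by standardizing."""
    return phi((value - mean) / math.sqrt(variance))

# ASSUMPTION: means and variances are those implied by the solution steps.
p_a = normal_cdf(3, 1, 4)                               # Pr(Y <= 3)
p_b = 1.0 - normal_cdf(0, 3, 9)                         # Pr(Y > 0)
p_c = normal_cdf(52, 50, 25) - normal_cdf(40, 50, 25)   # Pr(40 <= Y <= 52)
p_d = normal_cdf(8, 5, 2) - normal_cdf(6, 5, 2)         # Pr(6 <= Y <= 8)

for label, prob in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(f"({label}) {prob:.4f}")
```

Small differences from the tabulated answers in the last digit come from rounding in the printed table.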
14. [Practice question, not graded]
c. What is the p-value for the test H0: p = 0.5 vs. H1: p ≠ 0.5?
d. What is the p-value for the test H0: p=0.5 vs. H1:p>0.5?
e. Why do the results from (c) and (d) differ?
f. Did the survey contain statistically significant evidence that the incumbent was ahead of
the challenger at the time of the survey? Explain.
Solution:
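(a) $\hat{p} = \frac{215}{400} = 0.5375$.
(b) $\widehat{var}(\hat{p}) = \frac{\hat{p}(1 - \hat{p})}{n} = \frac{0.5375 \times (1 - 0.5375)}{400} = 6.2148 \times 10^{-4}$. The standard error is
$SE(\hat{p}) = (\widehat{var}(\hat{p}))^{1/2} = 0.0249$.
(c)
$t^{act} = \frac{\hat{p} - p_0}{SE(\hat{p})} = \frac{0.5375 - 0.5}{0.0249} = 1.506$
Because of the large sample size (n = 400), we can use Equation (3.14) in the text to get the
p-value for the test $H_0: p = 0.5$ vs. $H_1: p \neq 0.5$:
Using Equation (3.17) in the text, the p-value for the test $H_0: p = 0.5$ vs. $H_1: p > 0.5$ is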
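The two p-values referred to in parts (c) and (d) can be computed with the large-sample normal approximation: twice the lower tail at -|t| for the two-sided test, and the upper tail at t for the one-sided test. This is a sketch of that approximation and not a reproduction of the textbook equations; `phi` is an illustrative helper.

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n = 400
successes = 215
p0 = 0.5                                  # value of p under H0

p_hat = successes / n
var_p_hat = p_hat * (1.0 - p_hat) / n
se_p_hat = math.sqrt(var_p_hat)

# Using the unrounded standard error gives a t-statistic slightly below the
# 1.506 obtained with SE rounded to 0.0249.
t_act = (p_hat - p0) / se_p_hat

p_two_sided = 2.0 * phi(-abs(t_act))      # H1: p != 0.5
p_one_sided = 1.0 - phi(t_act)            # H1: p > 0.5

print(f"p_hat = {p_hat:.4f}, SE = {se_p_hat:.4f}, t = {t_act:.3f}")
print(f"two-sided p-value = {p_two_sided:.3f}")
print(f"one-sided p-value = {p_one_sided:.3f}")
```

The one-sided p-value is half the two-sided one because the one-sided test counts only the upper tail, which is the point of part (e).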