
Discussion Section 1 ECON 139/239 2010 Summer Term II

1. Getting Started with Stata

The point of this discussion section is to get you started using the statistical software package Stata.

Starting from the Excel dataset cps98.csv (available on the course website), load the data into Stata. This file contains data on the average hourly earnings, education, gender, and age of individuals for a sample of workers in 1998. A quick way to do this is to save the Excel file and then load it into Stata using the `insheet` command. You might find it easier to use the Import command under the File menu.

Use the `sum` command to summarize the variables in the dataset. What are the mean and standard deviation of average hourly earnings in the sample? How many male workers took the survey? What are the average, minimum, and maximum ages of the respondents?

Use the `tab age` command to look at the distribution of age in the sample. What is the mode of this distribution? What is the median (approximately)?

Use the `hist ahe` command to plot the histogram of average hourly earnings. What can you say about the shape of its distribution?

Use the `sum ahe if female == 1` command to calculate the average earnings for females. What are the average earnings for males? Do you find the difference economically significant?

Use the `sum if ahe > 40` command to see who the top earners are: their age, gender, and education level. How about those with hourly earnings less than 3?

Use the `gen ahe2 = ahe*ahe` command to generate a new variable ahe2 equal to hourly wages squared. Use the `scatter ahe2 ahe, title(Ahe2)` command to graph the relationship between ahe and ahe2. Now try adding the options `xlabel(0(2)50) ylabel(0(1000)3000)` after the comma to see how to change the axes in your graph.

Use the `pwcorr` command to calculate the sample correlation between average hourly earnings and age. Does it come out with the expected sign? Plot the relationship with the `scatter` command. Does this correlation change for top and bottom earners?

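2. We know that, by definition,

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}).$$

a. Prove that $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$.

b. Prove that $\sum_{i=1}^{n} (X_i - \bar{X})\, Y_i = (n-1)\, s_{XY}$.

c. Prove that $\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{Y} + 5) = (n-1)\, s_X^2$.

d. Prove that $\sum_{i=1}^{n} \sum_{j=1}^{k} (1 + X_i - \bar{X})(Y_j - \bar{Y})^2 = n(k-1)\, s_Y^2$.

Solution:

a.

$$\sum_i (X_i - \bar{X}) = \sum_i X_i - \sum_i \bar{X} = \sum_i X_i - n\bar{X} = n\bar{X} - n\bar{X} = 0$$

b. We prove one preliminary result:

$$\begin{aligned}
\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_i (X_i Y_i - X_i \bar{Y} - \bar{X} Y_i + \bar{X}\bar{Y}) \\
&= \sum_i X_i Y_i - \bar{Y} \sum_i X_i - \bar{X} \sum_i Y_i + \sum_i \bar{X}\bar{Y} \\
&= \sum_i X_i Y_i - n\bar{X}\bar{Y} - n\bar{X}\bar{Y} + n\bar{X}\bar{Y} \\
&= \sum_i X_i Y_i - n\bar{X}\bar{Y} \\
&= \sum_i X_i Y_i - \bar{X} \sum_i Y_i \\
&= \sum_i (X_i - \bar{X})\, Y_i
\end{aligned}$$

Therefore, we can use $Y_i$ in place of $Y_i - \bar{Y}$, and the result still holds:

$$\sum_i (X_i - \bar{X})\, Y_i = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = (n-1)\, s_{XY}$$

c.

$$\begin{aligned}
\sum_i (X_i - \bar{X})(X_i - \bar{Y} + 5) &= \sum_i (X_i - \bar{X})\, X_i - \sum_i (X_i - \bar{X})\, \bar{Y} + \sum_i (X_i - \bar{X})\, 5 \\
&= \sum_i (X_i - \bar{X})\, X_i - 0 + 0 \\
&= (n-1)\, s_{XX} = (n-1)\, s_X^2
\end{aligned}$$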


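d.

$$\begin{aligned}
\sum_i \sum_j (1 + X_i - \bar{X})(Y_j - \bar{Y})^2 &= \sum_i \sum_j (Y_j - \bar{Y})^2 + \sum_i \sum_j (X_i - \bar{X})(Y_j - \bar{Y})^2 \\
&= n \sum_j (Y_j - \bar{Y})^2 + \sum_i (X_i - \bar{X}) \sum_j (Y_j - \bar{Y})^2 \\
&= n(k-1)\, s_Y^2 + 0
\end{aligned}$$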
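A numeric spot check can make the four identities concrete. The sketch below is in Python rather than Stata and uses made-up numbers: the lists `x`, `y`, and `w` are purely illustrative and are not taken from cps98.csv. It also assumes that in part d the mean and variance of Y are computed over the k values $Y_1, \dots, Y_k$ (here the list `w`). It is a sanity check, not a substitute for the proofs.

```python
from statistics import mean

# Made-up illustrative data (an assumption; not from cps98.csv).
x = [2.0, 4.0, 4.0, 7.0, 9.0]
y = [1.0, 3.0, 2.0, 8.0, 6.0]
w = [5.0, 1.0, 6.0]  # plays the role of Y_1..Y_k in part d, with k = 3


def sample_cov(u, v):
    """Sample covariance with the n - 1 divisor; sample_cov(u, u) is the variance."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / (len(u) - 1)


n, k = len(x), len(w)
x_bar, y_bar, w_bar = mean(x), mean(y), mean(w)

# a. Deviations from the sample mean sum to zero.
part_a = sum(xi - x_bar for xi in x)

# b. sum (X_i - Xbar) * Y_i  versus  (n - 1) * s_XY
part_b_lhs = sum((xi - x_bar) * yi for xi, yi in zip(x, y))
part_b_rhs = (n - 1) * sample_cov(x, y)

# c. sum (X_i - Xbar) * (X_i - Ybar + 5)  versus  (n - 1) * s_X^2
part_c_lhs = sum((xi - x_bar) * (xi - y_bar + 5) for xi in x)
part_c_rhs = (n - 1) * sample_cov(x, x)

# d. double sum over i and j  versus  n * (k - 1) * s_Y^2
part_d_lhs = sum((1 + xi - x_bar) * (wj - w_bar) ** 2 for xi in x for wj in w)
part_d_rhs = n * (k - 1) * sample_cov(w, w)

print(part_a)
print(part_b_lhs, part_b_rhs)
print(part_c_lhs, part_c_rhs)
print(part_d_lhs, part_d_rhs)
```

Each printed pair should agree up to floating-point rounding, and the first value should be zero up to rounding.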

3. Stock & Watson 3.5 (note: part a is a little tricky)

A survey of 1055 registered voters is conducted, and the voters are asked to choose between candidate A and candidate B. Let $p$ denote the fraction of voters in the population who prefer candidate A, and let $\hat{p}$ denote the fraction of voters in the sample who prefer candidate A.

a. You are interested in the competing hypotheses $H_0: p = 0.5$ vs. $H_1: p \neq 0.5$. Suppose you decide to reject $H_0$ if $|\hat{p} - 0.5| > 0.02$.
   i. What is the size of this test?
   ii. Compute the power of this test if $p = 0.53$.

b. In the survey, $\hat{p} = 0.54$.
   i. Test $H_0: p = 0.5$ vs. $H_1: p \neq 0.5$ using a 5% significance level.
   ii. Test $H_0: p = 0.5$ vs. $H_1: p > 0.5$ using a 5% significance level.
   iii. Construct a 95% confidence interval for $p$.
   iv. Construct a 99% confidence interval for $p$.
   v. Construct a 50% confidence interval for $p$.

c. Suppose that the survey is carried out 20 times, using independently selected voters in each survey. For each of these 20 surveys, a 95% confidence interval for $p$ is constructed.
   i. What is the probability that the true value of $p$ is contained in all 20 of these confidence intervals?
   ii. How many of these confidence intervals do you expect to contain the true value of $p$?

d. In survey jargon, the margin of error is $1.96 \times SE(\hat{p})$; that is, it is $\frac{1}{2}$ the length of the 95% confidence interval. Suppose you wanted to design a survey that had a margin of error of at most 1%. That is, you wanted $\Pr(|\hat{p} - p| > 0.01) \le 0.05$. How large should $n$ be if the survey uses simple random sampling?


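Solution:

a. i. The size is given by $\Pr(|\hat{p} - 0.5| > 0.02)$, where the probability is computed assuming that $p = 0.5$.

$$\begin{aligned}
\Pr(|\hat{p} - 0.5| > 0.02) &= 1 - \Pr(-0.02 \le \hat{p} - 0.5 \le 0.02) \\
&= 1 - \Pr\left( \frac{-0.02}{\sqrt{(0.5 \times 0.5)/1055}} \le \frac{\hat{p} - 0.5}{\sqrt{(0.5 \times 0.5)/1055}} \le \frac{0.02}{\sqrt{(0.5 \times 0.5)/1055}} \right) \\
&= 1 - \Pr\left( -1.30 \le \frac{\hat{p} - 0.5}{\sqrt{(0.5 \times 0.5)/1055}} \le 1.30 \right) \\
&= 0.19
\end{aligned}$$

where the final equality uses the central limit theorem approximation (and the normal tables).

ii. The power is given by $\Pr(|\hat{p} - 0.5| > 0.02)$, where the probability is computed assuming that $p = 0.53$.

$$\begin{aligned}
\Pr(|\hat{p} - 0.5| > 0.02) &= 1 - \Pr(-0.02 \le \hat{p} - 0.5 \le 0.02) \\
&= 1 - \Pr\left( \frac{-0.02}{\sqrt{(0.53 \times 0.47)/1055}} \le \frac{\hat{p} - 0.5}{\sqrt{(0.53 \times 0.47)/1055}} \le \frac{0.02}{\sqrt{(0.53 \times 0.47)/1055}} \right) \\
&= 1 - \Pr\left( \frac{-0.05}{\sqrt{(0.53 \times 0.47)/1055}} \le \frac{\hat{p} - 0.53}{\sqrt{(0.53 \times 0.47)/1055}} \le \frac{-0.01}{\sqrt{(0.53 \times 0.47)/1055}} \right) \\
&= 1 - \Pr\left( -3.25 \le \frac{\hat{p} - 0.53}{\sqrt{(0.53 \times 0.47)/1055}} \le -0.65 \right) \\
&= 0.74
\end{aligned}$$

where the final equality uses the central limit theorem approximation (and the normal tables).

b. i. $t = \dfrac{0.54 - 0.5}{\sqrt{(0.54 \times 0.46)/1055}} = 2.61$, and $\Pr(|t| > 2.61) = 0.009$, so the null is rejected at the 5% level.

ii. $\Pr(t > 2.61) = 0.0045$, so the null is rejected at the 5% level.

iii. $0.54 \pm 1.96 \sqrt{(0.54 \times 0.46)/1055} = 0.54 \pm 0.03$, or 0.51 to 0.57.

iv. $0.54 \pm 2.58 \sqrt{(0.54 \times 0.46)/1055} = 0.54 \pm 0.04$, or 0.50 to 0.58.

v. $0.54 \pm 0.67 \sqrt{(0.54 \times 0.46)/1055} = 0.54 \pm 0.01$, or 0.53 to 0.55.

c. i. The probability is 0.95 in any single survey, and there are 20 independent surveys, so the probability is $0.95^{20} = 0.36$.

ii. 95% of the 20 confidence intervals, or 19.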
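The numbers in parts a through c can be reproduced without normal tables. The following is a minimal sketch in Python (not Stata), using only the standard library's `NormalDist` for the normal CDF and its inverse. The helper names `se` and `ci` are hypothetical, introduced only for this illustration. Because it uses unrounded z-values, results may differ from the table-based figures in the third or fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution
n = 1055


def se(p):
    """Standard error of the sample fraction when the fraction is p."""
    return sqrt(p * (1 - p) / n)


# a.i  Size: Pr(|p_hat - 0.5| > 0.02) computed under p = 0.5.
size = 2 * (1 - Z.cdf(0.02 / se(0.5)))

# a.ii Power: same rejection region (p_hat < 0.48 or p_hat > 0.52), under p = 0.53.
s = se(0.53)
power = 1 - (Z.cdf((0.52 - 0.53) / s) - Z.cdf((0.48 - 0.53) / s))

# b.i and b.ii  t-statistic with two-sided and one-sided p-values.
p_hat = 0.54
t = (p_hat - 0.5) / se(p_hat)
p_two_sided = 2 * (1 - Z.cdf(t))
p_one_sided = 1 - Z.cdf(t)


def ci(level):
    """Two-sided confidence interval for p at the given coverage level."""
    half_width = Z.inv_cdf(0.5 + level / 2) * se(p_hat)
    return p_hat - half_width, p_hat + half_width


ci_95, ci_99, ci_50 = ci(0.95), ci(0.99), ci(0.50)

# c. Twenty independent 95% intervals.
prob_all_20 = 0.95 ** 20
expected_covering = 0.95 * 20

print(round(size, 2), round(power, 2), round(t, 2))
print(ci_95, ci_99, ci_50)
print(round(prob_all_20, 2), expected_covering)
```

Rounded to two decimals, the size, power, t-statistic, and interval endpoints should match the values in the solution above.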


d. The relevant equation is $1.96 \times SE(\hat{p}) < 0.01$, or $1.96 \times \sqrt{p(1-p)/n} < 0.01$. Thus $n$ must be chosen so that

$$n > \frac{1.96^2\, p(1-p)}{0.01^2},$$

so the answer depends on the value of $p$. Note that the largest value that $p(1-p)$ can take on is 0.25 (that is, $p = 0.5$ makes $p(1-p)$ as large as possible). Thus if

$$n > \frac{1.96^2 \times 0.25}{0.01^2} = 9604,$$

then the margin of error is less than 0.01 for all values of $p$.
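The sample-size bound in part d is simple arithmetic, and a short Python sketch (not Stata) shows how it varies with $p$. The function name `required_n` and the grid of candidate values are assumptions made for this illustration. The critical value 1.96 and the 0.01 margin come from the problem.

```python
Z_95 = 1.96    # critical value used in the handout
MARGIN = 0.01  # target margin of error


def required_n(p, z=Z_95, margin=MARGIN):
    """Lower bound on n implied by z * sqrt(p * (1 - p) / n) < margin."""
    return z ** 2 * p * (1 - p) / margin ** 2


# Worst case: p * (1 - p) is largest at p = 0.5.
worst_case = required_n(0.5)

# For comparison, the bound at the survey's estimate of 0.54 is slightly smaller.
at_survey_value = required_n(0.54)

# Confirm on a grid of p values that 0.5 gives the largest requirement.
grid = [i / 100 for i in range(1, 100)]
worst_on_grid = max(grid, key=required_n)

print(worst_case, at_survey_value, worst_on_grid)
```

Designing for $p = 0.5$ is the conservative choice: any other value of $p$ needs a smaller sample for the same margin of error.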
