
MA 585: Time Series Analysis and Forecasting

Homework 1

February 12, 2017

Problem a. Using the Massachusetts data set, run three bivariate regressions of the
dependent variable Test Score for 8th graders ("totsc8") on the explanatory variables
(1) percentage of students eligible for free/reduced price lunch ("lnch pct"), (2) per capita
income of the district ("percap"), and (3) percentage of English learners ("pctel"). Relying
on your bivariate regressions, communicate the explanatory power of these three variables.

Solution a. Refer to the attached Appendix for the content of the R Script used to generate
the statistical findings and plots used in this report.
Although the term explanatory power suggests a broad and ambiguous assessment of each
covariate's relationship with the dependent variable, a common and acceptable starting point
is to examine the linearity of the scatterplots, the statistical significance of the regression
coefficients, and the magnitude of R².
The three simple linear regression models involving the required covariates are the following:

(1) y1 = 716.07998 - 1.10041x
(2) y2 = 643.894 + 2.909x
(3) y3 = 702.9918 - 3.6412x

The following are the scatterplots of the data as well as the regression lines mentioned above.

The first scatterplot seems to depict a general linear trend, which calls for further assessment
of the data in order to confirm that linear regression techniques are adequate. The second
scatterplot depicts a weaker linear trend; the weakness of this trend is ambiguous enough
that further assessment of the data is required in order to confirm that linear regression
techniques are adequate. The third scatterplot suggests that the fitted regression line may
not carry any meaningful information about the relationship between the two variables
involved.
With regard to the significance of the regression coefficients and the magnitude of R²:
For (1), the p-value of the regression coefficient (slope) is statistically significant and the R²
comes out to 0.6951.
For (2), the p-value of the regression coefficient (slope) is statistically significant and the R²
comes out to 0.603.
For (3), the p-value of the regression coefficient (slope) is statistically significant and the R²
comes out to 0.2823.
These results coincide nicely with the graphical assessment of the data, as the R² for (3)
is markedly smaller than the R² for (1) and (2). As such, in the context of these regressions,
the explanatory power of (3) is less than that of (1) and (2). At this level of assessment,
however, it is hard to separate the explanatory power of (1) and (2).
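These quantities can be read directly off the fitted model objects; a minimal sketch in R,
assuming reg1 from the Appendix is in memory (reg2 and reg3 are handled identically):

s1 <- summary(reg1)
s1$r.squared                              # R^2, approx. 0.6951 for (1)
s1$coefficients["lnch_pct", "Pr(>|t|)"]   # p-value of the slope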
Problem b. Conduct a residual analysis of each regression testing for (1) heteroscedas-
ticity, (2) non-linearity, and (3) significant outliers. Provide any plots that might help to
communicate your findings.
Solution b. The following are the residual plot, the scatterplot and the residuals vs.
leverage plot for each of the respective regressions.

For (1), the regression with "lnch pct" as the covariate: with regard to the presence of
heteroscedasticity, the residual plot clearly depicts a fanning-out pattern, and hence it is
safe to say that the constant variance assumption required for linear regression is violated.
With regard to non-linearity, as mentioned in solution (a), the scatterplot seems to depict
a general linear trend, which calls for further assessment of the data in order to confirm
that linear regression techniques are adequate.

And with regard to outliers and/or influential cases, case 169 is quite close to the marked
Cook's distance line; however, it is clear from the plot that it stays outside the region in
which a data point would be deemed a significant outlier.

For (2), the regression with "percap" as the covariate: with regard to the presence of
heteroscedasticity, the residual plot depicts a parabolic pattern, and hence it is safe to say
that the constant variance assumption required for linear regression is violated. With regard
to non-linearity, as mentioned in solution (a), the scatterplot seems to depict a general
linear trend, which calls for further assessment of the data in order to confirm that linear
regression techniques are adequate.

And with regard to outliers and/or influential cases, case 208 lies beyond the marked
Cook's distance line, and hence this data point can be deemed influential to the regression
results.
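This can also be checked numerically with base R's cooks.distance(); a sketch using the
common (but not definitive) 4/n rule of thumb as a cutoff:

cd2 <- cooks.distance(reg2)
which.max(cd2)               # index of the largest Cook's distance
which(cd2 > 4 / length(cd2)) # points flagged by the rule of thumb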

And finally, for (3), the regression with "pctel" as the covariate: with regard to the presence
of heteroscedasticity, the residual plot clearly depicts a fanning-out pattern, and hence it is
safe to say that the constant variance assumption required for linear regression is violated.
With regard to non-linearity, as mentioned in solution (a), the scatterplot suggests a clear
lack of linearity, and the fitted regression line may not carry any meaningful information
about the relationship between the two variables involved.

And with regard to outliers and/or influential cases, case 88 lies beyond the marked
Cook's distance line, and hence this data point can be deemed influential to the regression
results.
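A formal complement to these visual heteroscedasticity checks is the Breusch-Pagan test;
a sketch, assuming the lmtest package is installed:

library(lmtest)
bptest(reg1)   # a small p-value is evidence against constant variance
bptest(reg2)
bptest(reg3)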

Problem c. Carry out a multiple regression model with all three covariates. Are all three
covariates statistically significant? If not, carry out a variable selection procedure. Write
down the final model and interpret the fitted model in the context of the problem.

Solution c. The multiple regression model with all three covariates generates the following
output:

Call:
lm(formula = totsc8 ~ lnch_pct + percap + pctel)

Residuals:
Min 1Q Median 3Q Max
-20.748 -5.708 -1.051 5.173 31.945

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 678.59026 3.45214 196.571 <2e-16 ***
lnch_pct -0.72428 0.06996 -10.353 <2e-16 ***
percap 1.69515 0.14760 11.485 <2e-16 ***
pctel -0.25028 0.30738 -0.814 0.417
-----------------------------------------------------
Residual standard error: 8.799 on 176 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.8283,Adjusted R-squared: 0.8253
F-statistic: 282.9 on 3 and 176 DF, p-value: < 2.2e-16

Hence, the model with all three covariates is:

y = 678.59026 - 0.72428x1 + 1.69515x2 - 0.25028x3

It is clear that the covariate "pctel" is not statistically significant. As such, carrying out
a variable selection procedure, namely a stepwise procedure based on AIC, generates the
following output:
Start: AIC=786.81
totsc8 ~ lnch_pct + percap + pctel

           Df Sum of Sq   RSS    AIC
- pctel     1      51.3 13676 785.48
<none>                   13625 786.81
- lnch_pct  1    8296.9 21922 870.41
- percap    1   10210.5 23835 885.47

Step: AIC=785.48
totsc8 ~ lnch_pct + percap

           Df Sum of Sq   RSS    AIC
<none>                   13676 785.48
- percap    1     10510 24186 886.10
- lnch_pct  1     17823 31500 933.66

Call:
lm(formula = totsc8 ~ lnch_pct + percap)

Coefficients:
(Intercept) lnch_pct percap
679.4743 -0.7638 1.6651
----------------------------------------
Residual standard error: 8.79 on 177 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.8276,Adjusted R-squared: 0.8257
F-statistic: 424.9 on 2 and 177 DF, p-value: < 2.2e-16
As shown above, the model generated through the variable selection procedure is the
following:

y = 679.4743 - 0.7638x1 + 1.6651x2

As for the interpretation of this model: as with any adequate multiple regression model,
each coefficient gives the mean change in the response variable for a one-unit change in the
covariate while holding the other covariates in the model constant. Hence, in the context of
this particular model, for a one-unit change in % eligible for free/reduced price lunch, holding
per capita income constant, the mean 8th grade score goes down by 0.7638. And for a
one-unit change in per capita income, holding % eligible for free/reduced price lunch constant,
the mean 8th grade score goes up by 1.6651.
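This reading can be verified with predict(); a sketch using reg.mul2 from the Appendix and
two hypothetical districts that differ only by one unit of lnch_pct:

newd <- data.frame(lnch_pct = c(20, 21), percap = c(18, 18))
diff(predict(reg.mul2, newdata = newd))   # approx. -0.7638, the lnch_pct slope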

Problem d. Using the California data set, repeat (a, b and c) with the dependent variable
Test Score for K-6 & K-8 ("testscr") and the covariates (1) percent qualifying for reduced-
price lunch ("meal pct"), (2) district average income ("avginc"), and (3) percentage of
English learners ("el pct").
Discuss similarities and differences between the two data sets. Based on your final model,
which of the two multiple regression models is better, and in what sense is it better?

Solution d (part a). As with the Massachusetts data set, a common and acceptable
assessment of explanatory power may begin with the linearity of the scatterplots, the
statistical significance of the regression coefficients and the magnitude of R².
The three simple linear regression models involving the required covariates are the following:

(1) y1 = 681.43952 - 0.61029x
(2) y2 = 625.3836 + 1.8785x
(3) y3 = 664.73944 - 0.67116x

The following are the scatterplots of the data as well as the regression lines mentioned above.

The first scatterplot seems to depict a general linear trend, which calls for further assessment
of the data in order to confirm that linear regression techniques are adequate. The second
scatterplot depicts a weak quadratic trend; the weakness of this trend is ambiguous enough
that further assessment of the data is required in order to confirm that linear regression
techniques are adequate. The third scatterplot depicts a weak linear trend, which again
calls for further assessment of the data in order to confirm that linear regression techniques
are adequate.
With regard to the significance of the regression coefficients and the magnitude of R²:
For (1), the p-value of the regression coefficient (slope) is statistically significant and the R²
comes out to 0.7548.
For (2), the p-value of the regression coefficient (slope) is statistically significant and the R²
comes out to 0.5075.
For (3), the p-value of the regression coefficient (slope) is statistically significant and the R²
comes out to 0.4149.
Within the scope of linearity, the statistical significance of the regression coefficients and
the magnitude of R², the explanatory power of (1) seems stronger than that of (2) and (3).
The probable cause of the low R² value for (3) is the large variation of the data points around
the regression line, while the probable cause of the low R² value for (2) is the non-linear
relationship between the variables (see the sketch below).
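One possible (hypothetical) remedial measure for the curvature in (2), not part of the
required models, would be to add a squared avginc term; a sketch, assuming CaliforniaData
is attached as in the Appendix:

reg2.quad <- lm(testscr ~ avginc + I(avginc^2))
summary(reg2.quad)   # compare its R^2 against the linear fit's 0.5075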

Problem d (part b). Conduct a residual analysis of each regression testing for (1) het-
eroscedasticity, (2) non-linearity, and (3) significant outliers. Provide any plots that might
help to communicate your findings.

Solution d (part b). The following are the residual plot, the scatterplot and the residuals
vs. leverage plot for each of the respective regressions.

For (1), the regression with "meal pct" as the covariate: with regard to the presence of
heteroscedasticity, the residual plot depicts a random dispersal of data points bounded by
two horizontal lines, and hence it is safe to say that the constant variance assumption required
for linear regression is not violated. With regard to non-linearity, as mentioned in solution
(a), the scatterplot seems to depict a general linear trend, which calls for further assessment
of the data in order to confirm that linear regression techniques are adequate.

And with regard to outliers and/or influential cases, none of the data points are near the
Cook's distance line. Hence, it is clear from the plot that none of them can be deemed a
significant outlier.

For (2), the regression with "avginc" as the covariate: with regard to the presence of
heteroscedasticity, the residual plot depicts a parabolic pattern, and hence it is safe to say
that the constant variance assumption required for linear regression is violated. With regard
to non-linearity, as mentioned in solution (a), the scatterplot depicts a weak quadratic trend,
which suggests that linear regression techniques are inadequate in the absence of remedial
measures.

And with regard to outliers and/or influential cases, cases 414, 404 and 405 are quite close
to the marked Cook's distance line; however, it is clear from the plot that they stay outside
the region in which a data point would be deemed a significant outlier.

And finally, for (3), the regression with "el pct" as the covariate: with regard to the
presence of heteroscedasticity, the residual plot clearly depicts a fanning-out pattern, and
hence it is safe to say that the constant variance assumption required for linear regression
is violated. With regard to non-linearity, as mentioned in solution (a), the scatterplot
depicts a weak linear trend, which again calls for further assessment of the data in order
to confirm that linear regression techniques are adequate.

And with regard to outliers and/or influential cases, none of the data points are near the
Cook's distance line. Hence, it is clear from the plot that none of them can be deemed a
significant outlier.
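As with the Massachusetts regressions, the visual read can be backed up by the Breusch-
Pagan test; a sketch, again assuming the lmtest package is available:

bptest(reg1.cal)
bptest(reg2.cal)
bptest(reg3.cal)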

Problem d (part c). Carry out a multiple regression model with all three covariates. Are
all three covariates statistically significant? If not, carry out a variable selection procedure.
Write down the final model and interpret the fitted model in the context of the problem.

Solution d (part c). The multiple regression model with all three covariates generates the
following output:

Call:
lm(formula = testscr ~ meal_pct + avginc + el_pct)

Residuals:
Min 1Q Median 3Q Max
-29.4385 -5.3882 -0.0416 5.1426 28.7538

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 663.64701 2.10386 315.443 < 2e-16 ***
meal_pct -0.38638 0.02727 -14.170 < 2e-16 ***
avginc 0.72329 0.08145 8.880 < 2e-16 ***
el_pct -0.20902 0.03099 -6.745 5.13e-11 ***
------------------------------------------------------
Residual standard error: 8.498 on 416 degrees of freedom
Multiple R-squared: 0.8025,Adjusted R-squared: 0.8011
F-statistic: 563.4 on 3 and 416 DF, p-value: < 2.2e-16

Hence, the model with all three covariates is:

y = 663.64701 - 0.38638x1 + 0.72329x2 - 0.20902x3

It is clear that all the covariates are statistically significant.


As for the interpretation of this model: as with any adequate multiple regression model, each
coefficient gives the mean change in the response variable for a one-unit change in the
covariate while holding the other covariates in the model constant. Hence, in the context of
this particular model, for a one-unit change in % eligible for reduced price lunch, holding
district average income and % of English learners constant, the mean test score goes down
by 0.38638. For a one-unit change in district average income, holding % eligible for reduced
price lunch and % of English learners constant, the mean test score goes up by 0.72329.
Finally, for a one-unit change in % of English learners, holding % eligible for reduced price
lunch and district average income constant, the mean test score goes down by 0.20902.
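As a concrete check, a hypothetical district with meal_pct = 50, avginc = 15 and el_pct = 10
gets a predicted mean score of 663.64701 - 0.38638(50) + 0.72329(15) - 0.20902(10), or
approximately 653.09; a sketch with predict():

newd.cal <- data.frame(meal_pct = 50, avginc = 15, el_pct = 10)
predict(regmul.cal, newdata = newd.cal)   # approx. 653.09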

The Massachusetts data set includes 220 observations of 17 variables, and the California
data set includes 420 observations of 18 variables. Although both are very comprehensive
in coverage, the distinct differences between the two data sets lie in the time period in which
the observations were made, the size of the data set, and the school level at which the tests
were administered in the two states. Apart from these differences, both data sets encapsulate
similar information for their respective states.

Overall, both models lack adequacy on multiple levels, namely the presence of heteroscedas-
ticity, the absence of a check on the normality assumption and the retention of significant
outliers. However, assuming that no remedial measures will be taken, the model fit to the
Massachusetts data set seems better than its companion for the following reasons:

- The regression coefficients in the Massachusetts model are all highly significant. Although
the coefficients in the California model are also all significant, one of them ("el pct") is not
as significant as the rest.

- The magnitude of R² in the Massachusetts model is greater than that of the California
model (see the sketch below).
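A sketch of this comparison, assuming both final models are in memory (reg.mul2 for
Massachusetts, regmul.cal for California):

summary(reg.mul2)$adj.r.squared    # approx. 0.8257
summary(regmul.cal)$adj.r.squared  # approx. 0.8011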

Appendix
R Code
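# Note: the script assumes massdata and CaliforniaData are already loaded
# into the workspace (e.g., via read.csv() or load()); the import step was
# not part of the original script.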
attach(massdata)

# Bivariate Regression 1 (Y = "totsc8", X = "lnch_pct")


reg1 = lm(totsc8 ~ lnch_pct)
summary(reg1)
plot(lnch_pct, totsc8,
main = "Test Scores for 8th Graders vs % of students eligible for
free/reduced price lunch",
xlab = "% of Students elg. for free/reduced price lunch",
ylab = "Test Scores (8th Graders)")
abline(reg1, col = "red")
# Bivariate Regression 2 (Y = "totsc8", X = "percap")
reg2 = lm(totsc8 ~ percap)
plot(percap, totsc8,
main = "Test Scores for 8th Graders vs Per Capita Income of District",
xlab = "Per Capita Income of District", ylab = "Test Scores (8th Graders)")
abline(reg2, col = "red")
summary(reg2)
# Bivariate Regression 3 (Y = "totsc8", X = "pctel")
reg3 = lm(totsc8 ~ pctel)
plot(pctel, totsc8,
main = "Test Scores for 8th Graders vs % of English learners",
xlab = "% of English learners", ylab = "Test Scores (8th Graders)")
abline(reg3, col = "red")
summary(reg3)

# residual analysis
plot(reg1)
plot(reg2)
plot(reg3)

# multiple regression
reg.mul = lm(totsc8 ~ lnch_pct + percap + pctel)
summary(reg.mul)

# variable selection
step(reg.mul)

# new multiple regression


reg.mul2 = lm(totsc8 ~ lnch_pct + percap)
summary(reg.mul2)

detach(massdata)
attach(CaliforniaData)

summary(CaliforniaData)
str(CaliforniaData)

# Bivariate Regression 1 (Y = "testscr", X = "meal_pct")


reg1.cal = lm(testscr ~ meal_pct)
summary(reg1.cal)
plot(meal_pct, testscr,
main = "Test Score for K-6 & K-8 vs % elg. for reduced price lunch",
xlab = "% of Students elg. for reduced price lunch",
ylab = "Test Score for K-6 & K-8")
abline(reg1.cal, col = "red")
# Bivariate Regression 2 (Y = "testscr", X = "avginc")
reg2.cal = lm(testscr ~ avginc)
plot(avginc, testscr,
main = "Test Score for K-6 & K-8 vs District Avg. Income",
xlab = "District Avg. Income",
ylab = "Test Score for K-6 & K-8")
abline(reg2.cal, col = "red")
summary(reg2.cal)
# Bivariate Regression 3 (Y = "testscr", X = "el_pct")
reg3.cal = lm(testscr ~ el_pct)
plot(el_pct, testscr, main = "Test Score for K-6 & K-8 vs % of English learners",
xlab = "% of English learners",
ylab = "Test Score for K-6 & K-8")
abline(reg3.cal, col = "red")
summary(reg3.cal)

# residual analysis
plot(reg1.cal)
plot(reg2.cal)
plot(reg3.cal)

# multiple regression
regmul.cal = lm(testscr ~ meal_pct + avginc + el_pct)
summary(regmul.cal)

# variable selection
step(regmul.cal)

str(massdata)
