a.) The data does have 366 rows and 7 columns. There are seven columns rather than
six because, in addition to the six variables, the data has a column for the name of
the metropolitan statistical area, which is not a variable.
msa.data = read.csv("http://www.stat.cmu.edu/~larry/=stat401/bea-2006.csv")
print(head(msa.data))
nrow(msa.data)
## [1] 366
ncol(msa.data)
## [1] 7
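The split between identifier columns and variables can also be checked in code by counting the numeric columns. A minimal sketch on a made-up miniature data frame (`toy` and its values are hypothetical, not the BEA data):

```r
# Hypothetical miniature data frame: one name column plus two variables
toy = data.frame(MSA = c("A", "B", "C"),
                 pcgmp = c(30000, 35000, 40000),
                 pop = c(1e5, 5e5, 2e6))

# Rows and columns, as with nrow() and ncol()
dim(toy)

# Flag which columns are numeric; the name column is not
is.num = sapply(toy, is.numeric)
is.num

# Number of numeric variables
sum(is.num)
```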
b.)
summary(msa.data)
c.) The histogram and boxplot for per-capita GMP look slightly skewed to the right,
and those for population look very strongly skewed to the right.
hist(msa.data$pcgmp)
boxplot(msa.data$pcgmp)
hist(msa.data$pop)
boxplot(msa.data$pop)
hist(msa.data$finance)
hist(msa.data$prof.tech)
hist(msa.data$ict)
hist(msa.data$management)
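One rough way to back up a visual impression of right skew is to compare the mean with the median, since a long right tail usually pulls the mean above the median. The sketch below uses simulated data (the vector `x` is hypothetical and only stands in for a column such as pop):

```r
# Hypothetical right-skewed sample standing in for a column such as pop
set.seed(1)
x = rlnorm(366, meanlog = 12, sdlog = 1)

# For a right-skewed variable the mean usually exceeds the median
c(mean = mean(x), median = median(x))

# A log scale often makes such a variable easier to inspect
hist(log10(x), main = "log10 of a right-skewed variable")
```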
d.) In the bivariate EDA plot of per-capita GMP as a function of population, areas
with low population tend to have low per-capita GMP as well, although there are a
few areas where the per-capita GMP is fairly high despite a very low population.
plot(msa.data$pop, msa.data$pcgmp)
e.) The slope of the least squares line is 0.002416 and the intercept is 31277.57.
b1 = cov(msa.data$pop, msa.data$pcgmp)/var(msa.data$pop)
b0 = mean(msa.data$pcgmp) - b1 * mean(msa.data$pop)
b1
## [1] 0.002416201
b0
## [1] 31277.57
f.) The intercept is 3.128e+04 and the slope is 2.416e-03. This agrees with my
answer in the previous part. They should agree because lm() computes the same
least-squares estimates that the formulas in the previous part give.
lm(msa.data$pcgmp ~ msa.data$pop)
##
## Call:
## lm(formula = msa.data$pcgmp ~ msa.data$pop)
##
## Coefficients:
## (Intercept) msa.data$pop
## 3.128e+04 2.416e-03
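The agreement between the closed-form formulas and lm() can be checked numerically. This is a small sketch on simulated data (the vectors `x` and `y` and the names `b0.hand` and `b1.hand` are hypothetical), not a re-run of the analysis above:

```r
# Hypothetical toy data; any numeric x and y would do
set.seed(2)
x = runif(50, 0, 10)
y = 3 + 2 * x + rnorm(50)

# Least-squares estimates from the closed-form formulas
b1.hand = cov(x, y) / var(x)
b0.hand = mean(y) - b1.hand * mean(x)

# The same estimates from lm()
fit = lm(y ~ x)

# all.equal() compares the two up to numerical tolerance
all.equal(unname(coef(fit)), c(b0.hand, b1.hand))
```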
g.) The line of best fit for the scatter plot is pretty bad. The assumptions of the
simple linear regression model don't appear to hold. The fit is better in some places
than in others: it seems better towards the left of the graph, where the population
levels are really low, and it gets less accurate as we move towards the right of the
graph.
plot(msa.data$pop, msa.data$pcgmp)
abline(lm(msa.data$pcgmp ~ msa.data$pop))
h.) The population is 2361000. The per-capita GMP is 38350. The per-capita GMP
predicted by the model is: b1*2361000 + b0 = 36982.22. Therefore, the residual
(observed minus predicted) for Pittsburgh is 1367.775.
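msa.data[msa.data$MSA == 'Pittsburgh, PA',]
print(b1*2361000 + b0)
## [1] 36982.22
print(38350 - (b1*2361000 + b0))
## [1] 1367.775
i.) The mean squared error of the regression is 70697145.
msa.data$residuals = msa.data$pcgmp - (b0 + b1 * msa.data$pop)
mean(msa.data$residuals^2)
## [1] 70697145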
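Residuals and their mean square can also be taken from a fitted lm object rather than computed from b0 and b1 by hand. The sketch below uses simulated data (`x`, `y`, `res.hand`, and `res.lm` are hypothetical names) to show that the two routes agree:

```r
# Hypothetical toy data
set.seed(3)
x = runif(40, 0, 10)
y = 1 + 0.5 * x + rnorm(40)
fit = lm(y ~ x)

# Residual = observed value minus fitted value
res.hand = y - (coef(fit)[1] + coef(fit)[2] * x)
res.lm = residuals(fit)

# In-sample mean squared error and its square root
mse = mean(res.lm^2)
mse
sqrt(mse)
```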
j.) The residual for Pittsburgh seems relatively small, given that the square root of
the mean squared error is around 6 times the residual for Pittsburgh.
sqrt(mean(msa.data$residuals^2))
## [1] 8408.159
k.) If the assumptions of the simple linear regression model hold, the residuals in the
plot should be neither systematically high nor systematically low. So, the residuals
should be centered on zero, with roughly constant spread, throughout the range of
population. The actual plot is not compatible with these assumptions, since the
residuals don't have a constant spread throughout the range.
plot(msa.data$pop, msa.data$residuals)
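To see what a violation of constant spread looks like, it can help to simulate one. In the sketch below the noise standard deviation grows with the predictor (all names and numbers are hypothetical), and the residual spread is compared between the lower and upper halves of the predictor:

```r
# Hypothetical data whose noise sd grows with x (non-constant variance)
set.seed(4)
x = runif(200, 1, 10)
y = 2 + 3 * x + rnorm(200, sd = 0.5 * x)
fit = lm(y ~ x)
res = residuals(fit)

# Residuals against the predictor, with a reference line at zero
plot(x, res, ylab = "residual")
abline(h = 0, lty = 2)

# Compare residual spread in the lower and upper halves of x
low = x <= median(x)
c(sd.low = sd(res[low]), sd.high = sd(res[!low]))
```

A fan shape in the plot, or a clear gap between the two standard deviations, points to non-constant variance.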
l.) If the assumptions of the simple linear regression model hold, the squared
residuals in the plot should be neither systematically high nor systematically low
anywhere in the range of population. Since squared residuals cannot be negative,
they should scatter around a constant positive level (the noise variance) rather than
around zero. The actual plot is not compatible with these assumptions, since the
squared residuals don't have a constant spread throughout the range.
plot(msa.data$pop, msa.data$residuals^2)
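For comparison, here is a sketch of how squared residuals behave when the constant-variance assumption does hold. The data are simulated with a noise standard deviation of 2 (all names and numbers are hypothetical), so the squared residuals should average out near 4 with no trend in the predictor:

```r
# Hypothetical data with constant noise sd = 2
set.seed(5)
x = runif(300, 0, 10)
y = 1 + 2 * x + rnorm(300, sd = 2)
fit = lm(y ~ x)
res2 = residuals(fit)^2

# Squared residuals with a flat reference line at their mean (the MSE)
plot(x, res2, ylab = "squared residual")
abline(h = mean(res2), lty = 2)

# A smoothed trend; it should stay close to the flat reference line
lines(lowess(x, res2), col = "red")

mean(res2)
```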
m.) For each additional person in the population, the model predicts the per-capita
GMP to be higher by 0.002416 on average.
n.) The model predicts a per-capita GMP of 37223.84.
print(b1*(2361000+10^5) + b0)
## [1] 37223.84
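The same kind of prediction can be obtained with predict() instead of plugging into b1 and b0 by hand. The sketch below uses simulated data (the data frame `toy` and its coefficients are hypothetical) and fits the model with a `data` argument so that predict() can match the `pop` column in `newdata`:

```r
# Hypothetical data frame with the same column names as the analysis
set.seed(6)
toy = data.frame(pop = runif(30, 1e5, 5e6))
toy$pcgmp = 30000 + 0.002 * toy$pop + rnorm(30, sd = 5000)

# Fitting with data = ... keeps the predictor name as "pop"
fit = lm(pcgmp ~ pop, data = toy)

# Predictions at a population and at that population plus 10^5
new.pop = data.frame(pop = c(2361000, 2361000 + 10^5))
pred = predict(fit, newdata = new.pop)
pred

# The two predictions differ by slope * 10^5
diff(pred)
coef(fit)["pop"] * 10^5
```

A model fitted as lm(msa.data$pcgmp ~ msa.data$pop) has no plain `pop` term for `newdata` to match, which is why this sketch uses the formula-plus-data form.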
o.) If we added 10^5 people to the population, the model predicts that the total
value of all goods and services produced for sale in Pittsburgh per person would be
37223.84.