
Homework 3 (Number 3)

a.) The data does have 366 rows and 7 columns. It has seven columns because one of
them holds the name of the metropolitan statistical area, which is an identifier
rather than a variable.

msa.data = read.csv("http://www.stat.cmu.edu/~larry/=stat401/bea-2006.csv")
print(head(msa.data))

##                           MSA pcgmp    pop finance prof.tech     ict management
## 1                 Abilene, TX 24490 158700 0.09750        NA 0.01621         NA
## 2                   Akron, OH 32890 699300 0.12940   0.05440      NA   0.054310
## 3                  Albany, GA 24270 163000 0.08217        NA 0.00708         NA
## 4 Albany-Schenectady-Troy, NY 36840 850300 0.15780   0.09399 0.04511         NA
## 5             Albuquerque, NM 37660 816000 0.15990   0.09978 0.20500   0.006509
## 6              Alexandria, LA 25490 152200 0.09152   0.03790 0.01134   0.015210

nrow(msa.data)

## [1] 366

ncol(msa.data)

## [1] 7

b.)
summary(msa.data)

##                           MSA          pcgmp            pop
##  Abilene, TX                :  1   Min.   :14920   Min.   :   54980
##  Akron, OH                  :  1   1st Qu.:26532   1st Qu.:  135625
##  Albany, GA                 :  1   Median :31615   Median :  231500
##  Albany-Schenectady-Troy, NY:  1   Mean   :32923   Mean   :  680898
##  Albuquerque, NM            :  1   3rd Qu.:38212   3rd Qu.:  530875
##  Alexandria, LA             :  1   Max.   :77860   Max.   :18850000
##  (Other)                    :360
##     finance          prof.tech            ict            management
##  Min.   :0.03845   Min.   :0.01474   Min.   :0.00349   Min.   :0.00042
##  1st Qu.:0.10403   1st Qu.:0.02932   1st Qu.:0.01215   1st Qu.:0.00294
##  Median :0.14140   Median :0.04212   Median :0.02218   Median :0.00651
##  Mean   :0.15082   Mean   :0.04905   Mean   :0.03910   Mean   :0.00908
##  3rd Qu.:0.18122   3rd Qu.:0.05932   3rd Qu.:0.04072   3rd Qu.:0.01191
##  Max.   :0.38480   Max.   :0.19080   Max.   :0.58600   Max.   :0.05431
##  NA's   :12        NA's   :112       NA's   :76        NA's   :157
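
The NA's row of the summary can also be computed directly. The following is a minimal
sketch on a small made-up data frame (the name toy and its values are hypothetical,
not taken from the BEA data):

```r
# Hypothetical toy data frame; the values are illustrative only
toy <- data.frame(
  finance   = c(0.10, NA, 0.08, 0.16),
  prof.tech = c(NA, 0.05, NA, 0.09)
)

# is.na() gives a logical matrix; colSums() counts the TRUE entries per column
na.counts <- colSums(is.na(toy))
print(na.counts)
```

Applied to the homework data, colSums(is.na(msa.data)) should reproduce the NA's
counts shown in the summary.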

c.) The histogram and boxplot of per-capita GMP look slightly skewed to the right,
and those of population look very strongly skewed to the right.
hist(msa.data$pcgmp)

boxplot(msa.data$pcgmp)
hist(msa.data$pop)
boxplot(msa.data$pop)
hist(msa.data$finance)
hist(msa.data$prof.tech)
hist(msa.data$ict)
hist(msa.data$management)
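
One rough numeric check of the skewness described in part c is to compare each mean
with its median, since a mean well above the median suggests a long right tail. This
sketch only reuses the numbers printed by summary() in part b; the variable names are
made up for illustration:

```r
# Means and medians copied from the summary() output in part b
pcgmp.mean <- 32923;  pcgmp.median <- 31615
pop.mean   <- 680898; pop.median   <- 231500

# A mean well above the median is a rough sign of right skew
pcgmp.ratio <- pcgmp.mean / pcgmp.median  # about 1.04
pop.ratio   <- pop.mean / pop.median      # about 2.94
print(c(pcgmp = pcgmp.ratio, pop = pop.ratio))
```

The ratio is close to 1 for per-capita GMP and close to 3 for population, which is in
line with "slightly" versus "very" skewed.
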
d.) In the bivariate EDA plot of per-capita GMP against population, areas with low
population tend to have low per-capita GMP as well, although a few areas combine a
very low population with a fairly high per-capita GMP.
plot(msa.data$pop, msa.data$pcgmp)
e.) The slope of the least squares line is 0.002416 and the intercept is 31277.57.
b1 = cov(msa.data$pop, msa.data$pcgmp)/var(msa.data$pop)
b0 = mean(msa.data$pcgmp) - b1 * mean(msa.data$pop)
b1

## [1] 0.002416201

b0

## [1] 31277.57
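
For reference, the two lines of code above correspond to the standard least squares
estimates. The symbols are notation chosen for this note, not taken from the
assignment: x is population and y is per-capita GMP.

```latex
\hat{\beta}_1 = \frac{\widehat{\mathrm{Cov}}(x, y)}{\widehat{\mathrm{Var}}(x)}
             = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

Both cov() and var() in R divide by n - 1, so that factor cancels in the ratio.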

f.) The intercept is 3.128e+04 and the slope is 2.416e-03. This agrees with my
answer in the previous part, up to rounding. They should agree because lm() fits the
same least squares line, so it computes the same estimates.
lm(msa.data$pcgmp ~ msa.data$pop)

##
## Call:
## lm(formula = msa.data$pcgmp ~ msa.data$pop)
##
## Coefficients:
## (Intercept) msa.data$pop
## 3.128e+04 2.416e-03
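
The agreement between parts e and f can be checked on any data set. Below is a minimal
sketch on simulated data; the sample size, seed, and coefficients are arbitrary
assumptions made for illustration:

```r
# Simulated data; the sample size and true coefficients are arbitrary
set.seed(1)
x <- runif(50, min = 0, max = 100)
y <- 3 + 0.5 * x + rnorm(50)

# Closed-form least squares estimates, as in part e
b1.manual <- cov(x, y) / var(x)
b0.manual <- mean(y) - b1.manual * mean(x)

# The same estimates from lm(), as in part f
fit <- lm(y ~ x)
print(coef(fit))
print(c(b0.manual, b1.manual))
```

The two printed pairs should match up to floating-point error.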

g.) The line of best fit for the scatter plot is pretty bad. The assumptions of the
simple linear regression model do not appear to hold. The fit is better in some
places than in others: it looks better towards the left of the graph, where the
population levels are really low, and gets less accurate as we move towards the right
of the graph.
plot(msa.data$pop, msa.data$pcgmp)
abline(lm(msa.data$pcgmp ~ msa.data$pop))

h.) The population of Pittsburgh is 2361000 and its per-capita GMP is 38350. The
per-capita GMP predicted by the model is b1*2361000 + b0 = 36982.22. Therefore, the
residual for Pittsburgh is 38350 - 36982.22 = 1367.775 (computed from the unrounded
prediction).
msa.data[msa.data$MSA == 'Pittsburgh, PA',]

##                MSA pcgmp     pop finance prof.tech     ict management
## 262 Pittsburgh, PA 38350 2361000  0.2018    0.0777 0.03434    0.02946

print(b1*2361000 + b0)
## [1] 36982.22

print(38350 - (b1*2361000 + b0))

## [1] 1367.775
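
The same calculation can be wrapped in a small function. This is a sketch only:
residual.at is a hypothetical helper, and the coefficients are the rounded values
printed in part e, so the result differs slightly in the last digits from the output
above:

```r
# Rounded estimates copied from part e
b0 <- 31277.57
b1 <- 0.002416201

# Hypothetical helper: residual = observed value minus fitted value
residual.at <- function(x, y, b0, b1) {
  y - (b0 + b1 * x)
}

# Pittsburgh: population 2361000, per-capita GMP 38350
pitt.resid <- residual.at(x = 2361000, y = 38350, b0 = b0, b1 = b1)
print(pitt.resid)  # close to 1367.775; not identical because b0 and b1 are rounded
```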

i.) The mean squared error is 70697145.


msa.data$residuals = msa.data$pcgmp - (b1*msa.data$pop + b0)
mean(msa.data$residuals^2)

## [1] 70697145

j.) The residual for Pittsburgh is relatively small: the square root of the mean
squared error is around 6 times the residual for Pittsburgh.
sqrt(mean(msa.data$residuals^2))

## [1] 8408.159
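
The "around 6 times" comparison in part j can be made explicit with the numbers
already reported in parts h and i (the variable names here are chosen for
illustration):

```r
mse        <- 70697145   # mean squared error from part i
pitt.resid <- 1367.775   # residual for Pittsburgh from part h

rmse  <- sqrt(mse)           # about 8408.159, as printed above
ratio <- rmse / pitt.resid   # about 6.15
print(c(rmse = rmse, ratio = ratio))
```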

k.) If the assumptions of the simple linear regression model hold, the residuals in the
plot should not be either systematically high or low. So, the residuals should be
centered on zero, with a roughly constant spread, throughout the range of
population. The actual plot is not compatible with these assumptions, since the
residuals do not have a constant spread throughout the range.
plot(msa.data$pop, msa.data$residuals)
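
For comparison, here is a sketch of what the residuals look like when the assumptions
do hold. The data are simulated with constant noise; every number in it is an
arbitrary assumption and none comes from the BEA data:

```r
# Simulated data that satisfies the model: linear mean, constant noise sd
set.seed(42)
n <- 1000
x <- runif(n, min = 0, max = 10)
y <- 2 + 1.5 * x + rnorm(n, sd = 1)

res <- residuals(lm(y ~ x))

# Compare the spread of the residuals for small and large x
low  <- res[x <= median(x)]
high <- res[x >  median(x)]
spread.ratio <- sd(high) / sd(low)
print(c(mean.resid = mean(res), spread.ratio = spread.ratio))
```

Here the mean residual is zero up to floating-point error and the spread ratio stays
near 1. A ratio far from 1 on real data would point to non-constant spread.
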
l.) If the assumptions of the simple linear regression model hold, the squared
residuals in the plot should not be either systematically high or low. Since squared
residuals cannot be negative, they should scatter around a constant level (the noise
variance) throughout the range of population, with no trend. The actual plot is not
compatible with these assumptions, since the squared residuals do not have a
constant level or spread throughout the range.
plot(msa.data$pop, msa.data$residuals^2)
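
In symbols, assuming the usual simple linear regression model with noise term
\epsilon and constant noise variance \sigma^2 (standard textbook notation, not taken
from the assignment):

```latex
Y = \beta_0 + \beta_1 X + \epsilon,
\qquad
\mathbb{E}[\epsilon \mid X = x] = 0,
\qquad
\mathbb{E}[\epsilon^2 \mid X = x] = \mathrm{Var}[\epsilon \mid X = x] = \sigma^2
\quad \text{for all } x
```

Under these assumptions, a plot of squared residuals against population would be
expected to scatter around a roughly constant level near \sigma^2.
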
m.) For each additional person in the population, the model predicts that the
per-capita GMP increases by 0.002416.
n.) The model predicts a per-capita GMP of 37223.84.
print(b1*(2361000+10^5) + b0)

## [1] 37223.84

o.) If we added 10^5 people to the population, the model predicts that the total
value of all goods and services produced for sale in Pittsburgh per person would be
37223.84.
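
Because the model is linear, the change in the prediction depends only on the slope.
A short sketch using the rounded slope from part e and the prediction from part h
(the variable names are chosen for illustration):

```r
b1           <- 0.002416201   # slope from part e
old.pred     <- 36982.22      # prediction for Pittsburgh from part h
extra.people <- 10^5

# Adding people shifts the prediction by slope * (number of people added)
change   <- b1 * extra.people   # about 241.62
new.pred <- old.pred + change   # about 37223.84
print(c(change = change, new.pred = new.pred))
```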
