Homework - Week 7

Nathan Otten
April 7, 2019

Problem 3.31

I. Study Design, Aims, and Model

A. The goal of this study is to predict the sales price of a home (Y) as a function of the finished square
feet (X). The data for this study was collected by a city tax assessor on 522 arms-length transactions in
the mid-western United States for the year 2002. For this study, we have taken a random sample of 200
homes from the 522 total observations. The variables in the data include the sales price of the home in
dollars, the finished square feet of the home, the total number of bedrooms and bathrooms, the size of the
garage, the style, and other factors.
B. The sales price is expected to increase as a function of the finished square feet. Our goals are:

1. To test if a linear relationship exists between the two variables.

2. To predict the price of a home that has 1,100 finished square feet and 4,900 finished square feet.

C. The following simple linear regression model is considered.

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

for $i = 1, 2, \ldots, 200$, where $\varepsilon_i \sim N(0, \sigma^2)$ and $\beta_0$, $\beta_1$, and $\sigma^2$ are the parameters of interest.

II. Preliminary Analysis

A. The scatter plot indicates a clear positive relationship between price and square feet (r = 0.877004).
B. The scatterplot appears to show a widening range of price as the finished square feet increases, which
gives the plot a cone-shaped appearance. This may indicate that our assumption of homoscedasticity
is violated, that is, the variance of the $\varepsilon_i$ is not constant. It appears that as X increases, the
variance also increases. This observation is reasonable since other factors such as amenities can also drive
up the price of the home. In other words, smaller homes with expensive amenities, such as a pool, can
have high prices. See the appendix for a more complete analysis of the residuals on this point.

III. Statistical Analysis

Using ordinary least squares, the simple regression model is given by:

Ŷ = −89274.191 + 160.754X

and is shown in Figure 2. In this model, finished square feet explains about 77% of the total variation in the
price of the home ($R^2 = 0.7691$).
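The least-squares estimates and the coefficient of determination quoted above follow from closed-form formulas. The sketch below is only an illustration in Python, not the code used for this report (whose output suggests R); the function name `fit_simple_ols` and the toy data are hypothetical and are not the homework sample.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1, r_squared) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sums of squares and cross-products
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # R^2 = 1 - SSE / SSTO
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1.0 - sse / ssto


# Hypothetical toy data, NOT the 200-home sample analyzed in this report
sq_ft = [1200, 1500, 1800, 2400, 3000]
price = [110000, 150000, 205000, 290000, 400000]

b0, b1, r2 = fit_simple_ols(sq_ft, price)
print(f"b0 = {b0:.1f}, b1 = {b1:.3f}, R^2 = {r2:.4f}")
```

Applied to the actual sample, the same formulas should reproduce the intercept, slope, and $R^2$ reported above, up to rounding.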

1. A two-sided t-test is used to choose between $H_0: \beta_1 = 0$ and $H_1: \beta_1 \neq 0$ with a type I error rate of
$\alpha = 0.05$. Since $b_1 = 160.754$ is more than 25 SE above the value expected under $H_0$, we reject $H_0$ in favor
of $H_1$. Thus, there is evidence of a linear association between price and square footage.

[Figure: Price against Finished Square Feet]

2. For the values X = 1100 and X = 4900, we predict the price of the home to be $87,555.67 and
$698,422.40, respectively.
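Both results in this list can be sanity-checked from the numbers printed in the text alone. In simple linear regression the slope t-statistic satisfies $t^2 = R^2 (n - 2) / (1 - R^2)$, and the point predictions are obtained by substituting into the fitted line. The sketch below uses only the rounded values reported above; the helper names are hypothetical.

```python
import math

# Values as printed in the text (rounded)
B0, B1 = -89274.191, 160.754
R_SQUARED, N = 0.7691, 200


def slope_t_from_r2(r_squared, n):
    """|t| for H0: beta1 = 0 in simple linear regression,
    using the identity t^2 = R^2 (n - 2) / (1 - R^2)."""
    return math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))


def predict(x):
    """Point prediction from the fitted line with the rounded coefficients."""
    return B0 + B1 * x


t_stat = slope_t_from_r2(R_SQUARED, N)
print(f"|t| = {t_stat:.2f}")
for x in (1100, 4900):
    print(f"X = {x}: predicted price = {predict(x):,.2f}")
```

The implied t-statistic falls between 25 and 26, which agrees with the statement that the slope lies more than 25 SE from zero. The predictions differ from the reported figures by a dollar or two, presumably because the printed coefficients are rounded. Both X values lie slightly outside the sampled range of 1,198 to 4,746 square feet shown in the appendix summary, so they are mild extrapolations.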

IV. Conclusion

There is an increasing linear relationship between finished square feet and price of the home, and this
relationship can be modeled by:
Ŷ = −89274.191 + 160.754X

The linear model can be improved upon given that homoscedasticity appears to be violated in the ordinary
least squares model.

Appendix

A. Diagnostics for Predictors

Both sales price and finished square feet appear to be slightly asymmetric, with right-skewed distributions.
There are outliers in the right tail of the distribution for both of these variables. There are 3-4
unusually high values between 2,500 and 4,000 sq. ft., which skew the distribution.

[Figure: distributions of Sales Price and Square Ft.]

## Sales_Price SQ_FT
## Min. :112000 Min. :1198
## 1st Qu.:179400 1st Qu.:1668
## Median :221475 Median :1980
## Mean :268975 Mean :2229
## 3rd Qu.:328500 3rd Qu.:2717
## Max. :830000 Max. :4746

B. Residual Analysis

[Figure: Residuals against X (Residuals vs. Square Feet)]

[Figure: Absolute Values of Residuals against X (Absolute Value of Residuals vs. Square Feet)]

1. Linearity of the Regression Function
Overall, linearity appears to be reasonable for values of X between 1,500 and 3,000. However, the linear
model is suspect outside that range. All of the values below 1,500 are underestimated by the model, and the
values above 3,000 have a wide range and high variance, which creates a lot of noise.

2. Constant Error Variance

The error variance of the model is by no means constant. As X increases, so does the error variance. It is
particularly clear that the residuals between 2,500 and 4,000 vary greatly from 0, while the residuals between
1,500 and 2,000 are clustered tightly around 0. This indicates that the assumption of homoscedasticity
is suspect in this model. More formally, we apply the Breusch-Pagan test to determine if $\sigma^2$ is a function of
X, and at the 5% significance level, we reject the null hypothesis that $\sigma^2$ is not a function of X in favor of
the alternative that $\sigma^2$ is a function of X (Chi-square = 91.74862, p < 2.22e-16).
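For reference, the classical (non-studentized) Breusch-Pagan statistic can be computed by regressing the squared residuals on X: the statistic is $(SSR^*/2) / (SSE/n)^2$, compared against a chi-square distribution with 1 degree of freedom. The sketch below is an assumption about the form of the test, not the code that produced the value above; the data are simulated and all function names are hypothetical.

```python
import math
import random


def _ols(x, y):
    """Least-squares intercept and slope for y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square(1) variable: P(Z^2 > stat)."""
    return math.erfc(math.sqrt(stat / 2.0))


def breusch_pagan(x, y):
    """Classical Breusch-Pagan statistic and p-value for one predictor."""
    n = len(x)
    b0, b1 = _ols(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    # Auxiliary regression of squared residuals on x
    e2 = [e * e for e in resid]
    g0, g1 = _ols(x, e2)
    e2_bar = sum(e2) / n
    ssr_star = sum((g0 + g1 * xi - e2_bar) ** 2 for xi in x)
    stat = (ssr_star / 2.0) / (sse / n) ** 2
    return stat, chi2_sf_1df(stat)


# Simulated data (NOT the homework sample): error SD grows in proportion to x
random.seed(7)
x = [random.uniform(1000, 4500) for _ in range(200)]
y = [50000 + 150 * xi + random.gauss(0, 20 * xi) for xi in x]

stat, p_value = breusch_pagan(x, y)
print(f"Chi-square = {stat:.2f}, p = {p_value:.3g}")
```

A statistic above the 5% critical value of about 3.84 leads to rejecting constant error variance, which is the decision rule applied in the paragraph above.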

3. Detection of Outlying Observations


[Figure: Studentized Residuals against Square Feet]

The scatterplot above shows the studentized residuals plotted against square feet. Again it is clear that
values above about 2,700 have a greater range and variance, because some of the residuals are 3 to 4 standard
deviations from the mean of 0. Dotted lines are drawn on the scatter plot, and values above these lines are
considered outliers. As X increases, we see more outliers in the residuals.
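To make the outlier rule concrete, the sketch below computes internally studentized residuals, $r_i = e_i / \sqrt{MSE\,(1 - h_{ii})}$ with leverage $h_{ii} = 1/n + (X_i - \bar{X})^2 / S_{XX}$, and flags observations beyond a cutoff. Whether the report used this internal version or the deleted (external) version is not stated, and the cutoff of 2 is a hypothetical stand-in for the dotted lines in the plot. The toy data are invented.

```python
import math


def studentized_residuals(x, y):
    """Internally studentized residuals for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - 2)
    out = []
    for xi, e in zip(x, resid):
        leverage = 1.0 / n + (xi - x_bar) ** 2 / s_xx
        out.append(e / math.sqrt(mse * (1.0 - leverage)))
    return out


# Hypothetical toy data: y = 2x with small alternating noise and
# one large positive error at x = 6 (index 5)
x = list(range(1, 12))
noise = [0.1, -0.1, 0.1, -0.1, 0.1, 10.0, 0.1, -0.1, 0.1, -0.1, 0.1]
y = [2 * xi + d for xi, d in zip(x, noise)]

CUTOFF = 2.0  # assumed threshold, not taken from the report
r = studentized_residuals(x, y)
flagged = [i for i, ri in enumerate(r) if abs(ri) > CUTOFF]
print("flagged indices:", flagged)
```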

Independence of Residuals

[Figure: Sequence Plot of Residuals (e_i against Index)]

From our sequence plot of residuals, there appear to be no patterns in the residuals plotted against their
index. The residuals appear to be independent.

Normality of Error Terms

[Figure: Normal Probability Plot of Residuals (Sample Quantiles against Theoretical Quantiles)]

Above is the normal probability plot, which shows that normality is a reasonable assumption for most
of the residuals. However, the outliers mentioned above are an exception to this assumption. We see from
the plot that there are 4-5 outliers where the observed quantiles do not match the theoretical quantiles.
Aside from these extreme values in the tails, the normality assumption generally holds.
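A numerical companion to the plot is the correlation between the ordered residuals and their expected values under normality. The sketch below uses the plotting positions $(k - 0.375)/(n + 0.25)$, which is one common choice and may not match the positions used for the plot in this report; the function names and the example residuals are hypothetical.

```python
import math
from statistics import NormalDist


def expected_normal_values(n, scale=1.0):
    """Approximate expected order statistics of n normal draws,
    using plotting positions (k - 0.375) / (n + 0.25)."""
    nd = NormalDist()
    return [scale * nd.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]


def _pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def normal_plot_correlation(residuals):
    """Correlation between ordered residuals and their expected normal values."""
    ordered = sorted(residuals)
    return _pearson(ordered, expected_normal_values(len(ordered)))


# Hypothetical residuals: well behaved except for one extreme positive value
base = expected_normal_values(20)
with_outlier = base[:-1] + [base[-1] * 10]

print(f"well-behaved residuals: {normal_plot_correlation(base):.3f}")
print(f"with one extreme value: {normal_plot_correlation(with_outlier):.3f}")
```

A correlation close to 1 supports the normality assumption, while a few extreme residuals in the tails pull it down, which matches the pattern described for the plot above.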

Problem 3.32

I. Study Design, Aims, and Model

A. The goal of this study is to predict the prostate-specific antigen (PSA) level (Y) as a function of cancer
volume (X). The data was collected by a university medical urology group from 97 men who were about to
undergo radical prostatectomies. The variables in the data include the PSA level, cancer volume, weight,
age, and other factors.
B. The PSA is expected to increase as a function of the cancer volume. Our goals are:

1. To test if a linear relationship exists between the two variables.

2. To predict the PSA level for a patient with a cancer volume of 20 cc.

C. The following simple linear regression model is considered.

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

for $i = 1, 2, \ldots, 97$, where $\varepsilon_i \sim N(0, \sigma^2)$ and $\beta_0$, $\beta_1$, and $\sigma^2$ are the parameters of interest.

II. Preliminary Analysis

A. The scatter plot indicates a positive relationship between PSA and cancer volume (r = 0.6241506).
B. The scatterplot appears to show a widening range of PSA as the cancer volume increases, which gives
the plot a cone-shaped appearance. This may indicate that our assumption of homoscedasticity is
violated, that is, the variance of the $\varepsilon_i$ is not constant. It appears that as X increases, the variance
also increases. See the appendix for a more complete analysis of the residuals on this point.

III. Statistical Analysis

Using ordinary least squares, the simple regression model is given by:

Ŷ = 1.125 + 3.230X

and is shown in the figure below. In this model, cancer volume explains about 39% of the total variation in
the PSA ($R^2 = 0.3896$).

1. A two-sided t-test is used to choose between $H_0: \beta_1 = 0$ and $H_1: \beta_1 \neq 0$ with a type I error rate of
$\alpha = 0.05$. Since $b_1 = 3.2299$ is more than 7 SE above the value expected under $H_0$, we reject $H_0$ in favor
of $H_1$. Thus, there is evidence of a linear association between PSA and cancer volume.

[Figure: PSA against Cancer Volume]

2. For the value X = 20, we predict a PSA level of Ŷ = 65.72353.
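As a quick arithmetic check, substituting X = 20 into the fitted line with the coefficients as printed (rounded to three decimals) gives nearly the same value; the small gap from 65.72353 is presumably due to that rounding.

```latex
\hat{Y} = 1.125 + 3.230 \times 20 = 1.125 + 64.600 = 65.725 \approx 65.72
```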

IV. Conclusion

There is an increasing linear relationship between cancer volume and PSA, and this relationship can be
modeled by:

Ŷ = 1.125 + 3.230X

The linear model can be improved upon given that homoscedasticity appears to be violated in the ordinary
least squares model.

Appendix

A. Diagnostics for Predictors

Both PSA and cancer volume appear to be asymmetric, with right-skewed distributions. There
are outliers in the right tail of the distribution for both of these variables. There are multiple unusually high
values for both PSA and volume, which skew the distributions.

[Figure: distributions of PSA and Volume]

B. Residual Analysis

[Figure: Residuals against X (Residuals vs. Cancer Volume)]

[Figure: Absolute Values of Residuals against X (Absolute Value of Residuals vs. Cancer Volume)]

1. Linearity of the Regression Function
Overall, linearity appears to be reasonable for smaller values of X. However, the linear model is suspect
for larger values. The values above 5 have a wide range and high variance, which creates a lot of
noise.

2. Constant Error Variance


The error variance of the model is by no means constant. As X increases, so does the error variance. The
residuals between 0 and 5 are tightly clustered around 0, while the residuals greater than 5 are spread widely
around 0. This indicates that the assumption of homoscedasticity is suspect in this model. More
formally, we apply the Breusch-Pagan test to determine if $\sigma^2$ is a function of X, and at the 5% significance
level, we reject the null hypothesis that $\sigma^2$ is not a function of X in favor of the alternative that $\sigma^2$ is a
function of X (Chi-square = 170.1514, p < 2.22e-16).

3. Detection of Outlying Observations


[Figure: Studentized Residuals against Cancer Volume]

The scatterplot above shows the studentized residuals plotted against cancer volume. Again it is clear
that values above about 5 have a greater range and variance, because some of the residuals are between 4 and 6
standard deviations from the mean of 0. Dotted lines are drawn on the scatter plot, and values above these
lines are considered outliers. As X increases, we see more outliers in the residuals. Three outliers in
particular are obvious in the plot.

Independence of Residuals

[Figure: Sequence Plot of Residuals (e_i against Index)]

From our sequence plot, there appears to be a pattern in the residuals: they are tightly
clustered around 0 for index values between 1 and 15, and after that they grow increasingly volatile. This
would seem to violate the assumption of independence in the residuals.

Normality of Error Terms

[Figure: Normal Probability Plot of Residuals (Sample Quantiles against Theoretical Quantiles)]

Above is the normal probability plot, which shows that normality is a reasonable assumption for most
of the residuals. However, the outliers mentioned above are an exception to this assumption. We see from
the plot that there are 4-5 outliers where the observed quantiles do not match the theoretical quantiles.
Aside from these extreme values in the tails, the normality assumption generally holds.
