
TUTORIAL 7: Multiple Linear Regression

I. Multiple Regression
A regression with two or more explanatory variables is called a multiple regression. Multiple
linear regression is an extremely effective tool for answering statistical questions involving many
variables. The procedures PROC REG and PROC GLM can be used to perform regression in
SAS. In this tutorial we concentrate on using PROC REG. Much of the syntax is similar to that
used for fitting simple linear regression models; see Tutorial 6 for a review of this material.

A. PROC REG

PROC REG is the basic SAS procedure for performing regression analysis. The general form of
the PROC REG procedure is:

PROC REG DATA=dataset;
MODEL response_variable = explanatory_variables;
PLOT variable1 * variable2 <options>;
OUTPUT OUT = newdata <options>;
RUN;

The MODEL statement is used to specify the response and explanatory variables to be used in the
regression model. For example, the statement:

MODEL y = x1 x2;

fits a multiple linear regression model with the variable y as the response variable and the
variables x1 and x2 as explanatory variables.
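
In equation form, this statement fits the model

y = β0 + β1x1 + β2x2 + ε,

where β0 is the intercept, β1 and β2 are the regression coefficients for x1 and x2, and ε is the
error term.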

The fit of the model and the model assumptions can be checked graphically using the PLOT
statement. This statement can be used to make all the relevant plots needed for the regression
model. In the regression models we have discussed so far, it is assumed that the errors are
independent and normally distributed with mean 0 and variance σ². After performing regression,
it is necessary to check these assumptions by analyzing the residuals and studying a series of
residual plots. To plot the residuals against the explanatory variables use the statement:

PLOT residual.*(x1 x2);

Note that residual. (the period is required) is the variable name for the residuals created by PROC
REG. To plot the residuals against the predicted values we would use the statement:

PLOT residual.*predicted.;

Note that predicted. (the period is again required) is the variable name for the predicted values
from the regression model.
The OUTPUT statement is used to produce a new data set containing the original data used in the
regression model, as well as the predicted values and residuals. This new data set can, in turn, be
used to produce further diagnostic plots and check the model fit. Several options in the OUTPUT
statement control the contents of the new data set. The statement:

OUTPUT out = outdata r = resid p = yhat;

creates a new data set named outdata which contains the residuals and predicted values. The
residuals are given the name resid, and the predicted values the name yhat. The data set outdata
can then be used to further study the residuals.
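
As a minimal sketch of how the new data set might be used (the outdata and resid names come from
the OUTPUT statement above), PROC UNIVARIATE can be used to examine the saved residuals for
normality:

PROC UNIVARIATE data=outdata NORMAL; /* NORMAL requests tests of normality */
VAR resid;                           /* the residuals saved by OUTPUT */
QQPLOT resid;                        /* normal quantile-quantile plot */
RUN;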

Ex. Data was collected on 15 houses recently sold in a city. It consisted of the sales price (in $),
house size (in square feet), the number of bedrooms, the number of bathrooms, the lot size (in
square feet) and the annual real estate tax (in $).

The following program reads in the data and fits a multiple regression model with price as the
response variable and size and lot as the explanatory variables. It also produces residual plots of
the residuals against both explanatory variables as well as the predicted values.

DATA houses;
INPUT tax bedroom bath price size lot;
DATALINES;
590 2 1 50000 770 22100
1050 3 2 85000 1410 12000
20 3 1 22500 1060 3500
870 2 2 90000 1300 17500
1320 3 2 133000 1500 30000
1350 2 1 90500 820 25700
2790 3 2.5 260000 2130 25000
680 2 1 142500 1170 22000
1840 3 2 160000 1500 19000
3680 4 2 240000 2790 20000
1660 3 1 87000 1030 17500
1620 3 2 118600 1250 20000
3100 3 2 140000 1760 38000
2070 2 3 148000 1550 14000
650 3 1.5 65000 1450 12000
;
RUN;

PROC REG data=houses;
MODEL price = size lot;               /* Model statement */
PLOT residual.*(predicted. size lot); /* Residual plots */
RUN;

This program gives rise to the following output:

Analysis of Variance
                            Sum of          Mean
Source            DF       Squares        Square   F Value   Pr > F
Model              2   44825992653   22412996326     19.10   0.0002
Error             12   14082023347    1173501946
Corrected Total   14   58908016000

Root MSE             34256    R-Square   0.7609
Dependent Mean      122140    Adj R-Sq   0.7211
Coeff Var         28.04684

Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 -61969 32257 -1.92 0.0788
size 1 97.65137 18.16474 5.38 0.0002
lot 1 2.22295 1.13918 1.95 0.0748

We also obtained three residual plots, which are not shown here. The output can be used to test a
variety of hypotheses regarding the model. For example, the Parameter Estimates table shows that
the coefficient corresponding to size is significant (p-value = 0.0002) when controlling for lot
size, while the coefficient corresponding to lot size is not significant (p-value = 0.0748) when
controlling for house size.
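
From the Parameter Estimates table, the fitted regression equation is

price = -61969 + 97.65(size) + 2.22(lot).

As an illustration, a 1500 square foot house on a 20000 square foot lot has a predicted sales
price of roughly -61969 + 97.65(1500) + 2.22(20000) ≈ $128,900.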

B. Testing a subset of variables using a partial F-test

Sometimes we are interested in simultaneously testing whether a certain subset of the coefficients
are equal to 0 (e.g. β3 = β4 = 0). We can do this using a partial F-test. This test compares the
SSE from a reduced model (excluding the parameters we are testing) with the SSE from the full
model (including all of the parameters).
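
In symbols, if the full model contains p explanatory variables fitted to n observations and we
test whether q of the coefficients are 0, the test statistic is

F = [(SSE_reduced - SSE_full)/q] / [SSE_full/(n - p - 1)],

which is compared to an F distribution with q and n - p - 1 degrees of freedom.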

We can perform a partial F-test in PROC REG by including a TEST statement. For example, the
statement

TEST var1=0, var2=0;

tests the null hypothesis that the regression coefficients corresponding to var1 and var2 are both
equal to 0. However, note that any number of variables can be included in the TEST statement.
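
Note also that the TEST statement accepts general linear hypotheses, not only comparisons against
0. For example, the (hypothetical) statement

TEST var1 = var2;

tests the null hypothesis that the two regression coefficients are equal.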

Ex. Housing data cont.

Suppose we include the variables bedroom, bath and size in our model and are interested in
testing whether the number of bedrooms and bathrooms are significant after taking size into
consideration. The following program performs the partial F-test:

PROC REG data=houses;
MODEL price = bedroom bath size; /* Model statement */
TEST bath=0, bedroom=0;          /* Partial F-test */
RUN;

This gives rise to the following output:

Analysis of Variance
                            Sum of          Mean
Source            DF       Squares        Square   F Value   Pr > F
Model              3   43908504107   14636168036     10.73   0.0013
Error             11   14999511893    1363591990
Corrected Total   14   58908016000

Root MSE             36927    R-Square   0.7454
Dependent Mean      122140    Adj R-Sq   0.6759
Coeff Var         30.23321

Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 27923 56306 0.50 0.6297
bedroom 1 -35525 25037 -1.42 0.1836
bath 1 2269.34398 22209 0.10 0.9205
size 1 130.79392 36.20864 3.61 0.0041

Test 1 Results for Dependent Variable price

Mean
Source DF Square F Value Pr > F
Numerator 2 1775487498 1.30 0.3108
Denominator 11 1363591990

The final table shows the results of the partial F-test. Since F = 1.30 (p-value = 0.3108) we
cannot reject the null hypothesis that the coefficients of bedroom and bath are both 0. It appears
that bedroom and bath do not contribute significant information about the sales price once size
has been taken into consideration.
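
As a check, the F statistic is simply the ratio of the two mean squares in the table:
F = 1775487498/1363591990 ≈ 1.30. Here the numerator mean square is (SSE_reduced - SSE_full)/2,
where the reduced model contains only size, and the denominator is the MSE of the full model.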

C. Model Selection

Often we have data on a large number of explanatory variables and wish to construct a regression
model using some subset of them. The use of a subset will make the resulting model easier to
interpret and more manageable, especially if more data is to be collected in the future.
Unnecessary terms in the model may also yield less precise inference.

One approach to model selection is to consider all possible subsets of the pool of explanatory
variables and find the model that best fits the data according to some criterion. Different
criteria may be used to select the best model, such as adjusted R2 or Mallows' Cp. These criteria
assign a score to each model, allowing us to choose the model with the best score.

In SAS we can perform model selection using Mallows' Cp by including the SELECTION= option in the
MODEL statement. The statement:

MODEL y = x1 x2 x3 x4 x5 / selection = cp;

fits every possible model consisting of a subset of the five explanatory variables (x1, x2, x3,
x4 and x5), calculates a Cp score for each one, and lists the models in order of increasing Cp,
so the model that minimizes the score appears first.
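
For reference, Mallows' Cp for a candidate model with p parameters (including the intercept) is

Cp = SSE_p/MSE_full - n + 2p,

where SSE_p is the error sum of squares of the candidate model and MSE_full is the mean squared
error of the model containing all of the explanatory variables. Models with small Cp values close
to p are preferred.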

Ex. Housing data continued.

To determine which subset of the five explanatory variables best models the data using Mallows'
Cp, we can use the following code:

PROC REG data=houses;
MODEL price = tax bedroom bath size lot / selection = cp;
RUN;

This gives rise to the following output:

Number in
Model        C(p)   R-Square   Variables in Model
    3      2.3274     0.8115   tax bedroom size
    2      2.7395     0.7628   tax size
    2      2.8314     0.7609   size lot
    3      3.0608     0.7967   bedroom size lot
    2      3.6142     0.7451   bedroom size
    3      3.9514     0.7787   tax size lot
    4      4.0001     0.8182   tax bedroom size lot
    1      4.1292     0.6943   tax
    3      4.1942     0.7738   bath size lot
    4      4.3138     0.8118   tax bedroom bath size
    3      4.4539     0.7686   tax bath size
    1      4.5857     0.6851   size
    2      4.9046     0.7191   tax bath
    4      4.9963     0.7980   bedroom bath size lot
    4      5.5470     0.7869   tax bath size lot
    3      5.6023     0.7454   bedroom bath size
    2      5.9088     0.6988   bath size
    5      6.0000     0.8182   tax bedroom bath size lot
    2      6.0115     0.6967   tax bedroom
    2      6.1249     0.6944   tax lot
    3      6.8608     0.7199   tax bath lot
    3      6.8755     0.7196   tax bedroom bath
    3      7.9613     0.6977   tax bedroom lot
    4      8.8538     0.7201   tax bedroom bath lot
    3     13.7814     0.5801   bedroom bath lot
    2     16.2074     0.4907   bath lot
    2     18.5511     0.4433   bedroom bath
    1     20.4532     0.3645   bath
    2     23.5548     0.3422   bedroom lot
    1     29.3255     0.1852   lot
    1     31.1563     0.1482   bedroom

From the output we see that the model with tax, bedroom and size minimizes the Cp criterion.
When the number of explanatory variables is large it is often not feasible to fit all possible
models. It is instead more efficient to use a search algorithm to find the best model. A number
of such search algorithms exist, including forward selection, backward elimination and stepwise
regression.

In SAS we can perform model selection using these algorithms by including the SELECTION= option
in the MODEL statement. The statement:

MODEL y = x1 x2 x3 x4 /selection = forward;

selects a model using forward selection. The other algorithms can be used by exchanging forward
with either backward or stepwise in the MODEL statement.
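
As a sketch of how the search can be tuned (the 0.15 thresholds below are purely illustrative),
the SLENTRY= and SLSTAY= options set the significance levels a variable needs to enter and to
remain in the model:

PROC REG data=houses;
/* stepwise selection: a variable enters if its p-value is below
   SLENTRY and is removed if it rises above SLSTAY */
MODEL price = tax bedroom bath size lot / selection = stepwise slentry = 0.15 slstay = 0.15;
RUN;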

D. Variance Inflation Factor

In multiple regression one would like the explanatory variables to be highly correlated with the
response variable. However, it is not desirable for the explanatory variables to be correlated with
one another. Multicollinearity exists when two or more of the explanatory variables used in the
regression model are moderately or highly correlated. The presence of a high degree of multi-
collinearity among the explanatory variables can result in the following problems: (i) the standard
deviation of the regression coefficients may be disproportionately large, (ii) the coefficient
estimates are unstable, and (iii) the regression coefficients may not be interpretable.

A method for detecting the presence of multicollinearity is the variance inflation factor (VIF). A
large VIF (>10) is taken as an indication that multicollinearity may be influencing the estimates.
Including the option VIF in the MODEL statement prints the variance inflation factors for each of
the explanatory variables.
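
For a given explanatory variable xj, the variance inflation factor is

VIF_j = 1/(1 - R_j^2),

where R_j^2 is the R-square obtained from regressing xj on all of the other explanatory variables.
A VIF of 1 means xj is uncorrelated with the other explanatory variables, while large values
indicate that the variance of its estimated coefficient is being inflated by multicollinearity.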

Ex. Housing data continued.

If we want to determine whether there is any multicollinearity present in our model we need to
add the VIF option to the MODEL statement as shown below:

PROC REG data=houses;
MODEL price = size lot /VIF;
RUN;

This gives rise to the same output as in the previous example. The only difference is an
additional column in the Parameter Estimates section showing each explanatory variable's VIF
score.

Parameter Estimates
Parameter Standard Variance
Variable DF Estimate Error t Value Pr > |t| Inflation
Intercept 1 -61969 32257 -1.92 0.0788 0
size 1 97.65137 18.16474 5.38 0.0002 1.03891
lot 1 2.22295 1.13918 1.95 0.0748 1.03891

Note that all the VIF values are rather small, so multicollinearity does not appear to be a
problem in this example.
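
Incidentally, with only two explanatory variables the two VIF values are always identical, since
each equals 1/(1 - r^2), where r is the sample correlation between size and lot; here
1 - 1/1.03891 ≈ 0.037, so the two variables are nearly uncorrelated.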
