Variable Selection in Multiple Regression Modeling

Use multiple regression to answer the following questions:
1) Taking Sales as the target variable, develop a regression model.
2) Which variables should be included, and why?
3) Is the model adequate for future predictions?

Section A : Find Correlations between Predictor Variables

Variable Accounts has significant correlations with variables Time, Poten and Share.
Variable Poten has a significant correlation with variable Time.
Variable Rating has a significant correlation with variable AdvExp.
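Outside SPSS, a correlation screen of this kind can be sketched in Python. This is only an illustration: the function name `significant_pairs`, the 0.05 cut-off and the synthetic columns x1 to x3 are assumptions, and none of the numbers come from the sales data used here.

```python
import numpy as np
from scipy import stats

def significant_pairs(data, names, alpha=0.05):
    """Return (name_a, name_b, r, p) for every column pair with p < alpha."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # Pearson correlation and its two-sided p-value
            r, p = stats.pearsonr(data[:, i], data[:, j])
            if p < alpha:
                pairs.append((names[i], names[j], float(r), float(p)))
    return pairs

# Synthetic stand-in data (NOT the study's data): x2 is built to track x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=40)
x2 = x1 + 0.1 * rng.normal(size=40)
x3 = rng.normal(size=40)
demo = np.column_stack([x1, x2, x3])

pairs = significant_pairs(demo, ["x1", "x2", "x3"])
for a, b, r, p in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}, p = {p:.4f}")
```

Pairs flagged this way are candidates for multicollinearity, which is what the VIF values in Section B examine more directly.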

Section B : Perform Multiple Linear Regression

B.1 : ENTER REGRESSION METHOD (Include ALL Predictor Variables)
Now enter all 8 variables as predictors and Sales as the dependent variable for linear regression. The output is shown below. The Model Summary output shows that the adjusted R² is 0.89, indicating that the model explains 89% of the variability in the data.
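For reference, adjusted R² penalises the ordinary R² for the number of predictors: adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. A small sketch follows; the function name and the inputs (R² = 0.90, n = 25) are hypothetical and are not taken from the output discussed here.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof

# Hypothetical inputs: R^2 = 0.90 with 25 observations and 8 predictors.
example = adjusted_r2(0.90, 25, 8)
print(round(example, 3))
```

Because the penalty grows with k, adding a weak predictor can lower adjusted R² even though R² itself never decreases.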

The ANOVA table below shows that the regression model fits the data adequately.
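The ANOVA table's verdict rests on the overall F-test, which can be computed from R² as F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below assumes SciPy is available; the function name and the inputs are hypothetical, not values from the table.

```python
from scipy import stats

def overall_f_test(r2, n_obs, n_predictors):
    """Overall F statistic and p-value for a regression with an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    f_stat = (r2 / df_model) / ((1.0 - r2) / df_resid)
    # Upper-tail probability of the F distribution
    p_value = stats.f.sf(f_stat, df_model, df_resid)
    return f_stat, p_value

# Hypothetical inputs: R^2 = 0.90 with 25 observations and 8 predictors.
f_stat, p_value = overall_f_test(0.90, 25, 8)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
```

A p-value below 0.05 means that the predictors, taken together, explain significantly more variation than an intercept-only model.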

The Coefficients table below bears out the correlation findings of Section A. Variable Accounts is the most strongly correlated with the other predictors, with a VIF of 5.637.

Variable Share is also highly correlated with the other predictors, with a VIF of 3.395, as is variable Time, with a VIF of 3.356.

Variables Time, Change, Accounts, Work and Rating are to be excluded, since their p-values exceed 0.05. The variables that pass the significance test and are retained in the regression model are therefore Poten, AdvExp and Share.

To check the reliability of this conclusion, we continue with the other methods of multiple regression.
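The VIF figures quoted in this section follow the definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. A VIF of 5.637 therefore corresponds to an R² of about 0.82 for Accounts against the remaining predictors. The NumPy sketch below illustrates the calculation; the function name `vif` and the synthetic columns are assumptions, not the study's data.

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R_j^2), regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Design matrix: intercept plus every other predictor
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic stand-in data (NOT the study's data): b is built to track a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.5 * rng.normal(size=200)
c = rng.normal(size=200)
values = vif(np.column_stack([a, b, c]))
print(np.round(values, 2))
```

In this toy data the two related columns get clearly inflated VIFs, while the independent column stays close to 1.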

B.2: Stepwise Regression method
We apply the stepwise method as shown below.

Now click on Options tab as shown below.

The options tab invokes Stepping Method Criteria as shown.

A variable is entered into the solution if its p-value is less than 0.05 and is removed from the solution if its p-value becomes greater than 0.10. In stepwise regression, the order in which variables are selected is shown in the Coefficients table.
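The entry and removal rule just described can be sketched in Python. This is a simplified illustration and not SPSS's own implementation: the helper names `coef_pvalues` and `stepwise`, the cap on the number of passes and the synthetic data are all assumptions.

```python
import numpy as np
from scipy import stats

def coef_pvalues(X, y):
    """Two-sided t-test p-values for the slopes of an OLS fit with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    t = beta / np.sqrt(np.diag(cov))
    return (2 * stats.t.sf(np.abs(t), dof))[1:]  # drop the intercept

def stepwise(X, y, names, p_enter=0.05, p_remove=0.10):
    selected = []
    for _ in range(10 * X.shape[1]):  # safety cap on the number of passes
        changed = False
        # Entry: add the excluded variable with the smallest p-value, if < p_enter.
        trial = {j: coef_pvalues(X[:, selected + [j]], y)[-1]
                 for j in range(X.shape[1]) if j not in selected}
        if trial:
            best = min(trial, key=trial.get)
            if trial[best] < p_enter:
                selected.append(best)
                changed = True
        # Removal: drop the included variable with the largest p-value, if > p_remove.
        if selected:
            pvals = coef_pvalues(X[:, selected], y)
            worst = int(np.argmax(pvals))
            if pvals[worst] > p_remove:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return [names[j] for j in selected]

# Synthetic stand-in data (NOT the study's data): only x0 and x1 drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)
chosen = stepwise(X, y, ["x0", "x1", "x2", "x3"])
print(chosen)
```

Keeping the entry threshold (0.05) below the removal threshold (0.10) means a variable that has just been entered cannot be removed again in the same pass.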

The selected variables are Accounts, AdvExp, Poten and Share. Each selected variable has a p-value less than 0.05. As shown in the ANOVA table, the four variables are selected in four steps. Since the p-value of the model is less than 0.05, the model fits the data adequately.

Strength of the model is provided in the Summary Table as shown below.

Note that the final model is obtained at the fourth step and explains 88.1% of the variability in the data. It is interesting to observe that the adjusted R² increases while the standard error of the estimate decreases at each step.

B.3: Backward Regression method
Invoke the backward regression as shown below.

Initially, all the variables are entered in the solution. Variables Work, Rating and Accounts are then removed at successive stages, since their respective p-values are greater than 0.10, as shown below.
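In code, backward elimination amounts to repeatedly refitting the model and dropping the variable with the largest p-value until every remaining p-value is at or below the removal threshold. The sketch below is a simplified illustration and not SPSS's implementation; the helper names and the synthetic data are assumptions.

```python
import numpy as np
from scipy import stats

def coef_pvalues(X, y):
    """Two-sided t-test p-values for the slopes of an OLS fit with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    t = beta / np.sqrt(np.diag(cov))
    return (2 * stats.t.sf(np.abs(t), dof))[1:]  # drop the intercept

def backward_eliminate(X, y, names, p_remove=0.10):
    kept = list(range(X.shape[1]))  # start with every predictor in the model
    while kept:
        pvals = coef_pvalues(X[:, kept], y)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= p_remove:
            break  # every remaining variable passes the threshold
        kept.pop(worst)  # drop the weakest variable and refit
    return [names[j] for j in kept]

# Synthetic stand-in data (NOT the study's data): only x0 and x1 drive y.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)
kept_names = backward_eliminate(X, y, ["x0", "x1", "x2", "x3", "x4"])
print(kept_names)
```

Only one variable is removed per pass, because removing one predictor changes the p-values of all the others.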

As seen in Coefficients Table below, all eight predictor variables are entered in the initial solution to find the respective p-values.

Since variable Work has the highest p-value, 0.487, it is excluded from the solution. The solution for the remaining seven variables (excluding Work) is shown below. Since variable Rating now has the highest p-value, 0.440, it is excluded next.

The regression model is then rebuilt with six variables. The output is shown below. Variable Accounts has the highest p-value, 0.224, so it is excluded from the subsequent solution.

Regression model building is now carried out with five variables to find their respective p-values. The output of Coefficients is shown below.

Since all five remaining variables have p-values less than 0.10, no further exclusion is carried out. The ANOVA tables for the four stages of model development are shown below.

Note that the value of F increases with each stage. The five variables in Model 4 have an overall p-value reported as 0.000 (that is, below 0.001), which attests that the model is adequate for predictive analysis. To assess the strength of the model, we look at the Model Summary table.

The five variables in Model 4 give an adjusted R² of 0.893, which signifies very high explanatory power.

B.4: Forward Regression method
The forward selection procedure starts with no independent variables. It adds variables one at a time, using the same procedure as stepwise regression to determine whether an independent variable should be entered into the model. However, the forward selection procedure does not permit a variable to be removed from the model once it has been entered. The procedure stops when the p-value of every independent variable not in the model is greater than the threshold for entry. We apply the forward regression method by invoking the Forward option, as shown below.
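The forward procedure described above can be sketched as a loop that tries each excluded variable alongside those already selected and admits the one with the smallest p-value, provided it is below the entry threshold. This is a simplified illustration, not SPSS's implementation; the helper names, the 0.05 default and the synthetic data are assumptions.

```python
import numpy as np
from scipy import stats

def coef_pvalues(X, y):
    """Two-sided t-test p-values for the slopes of an OLS fit with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    t = beta / np.sqrt(np.diag(cov))
    return (2 * stats.t.sf(np.abs(t), dof))[1:]  # drop the intercept

def forward_select(X, y, names, p_enter=0.05):
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        # p-value of each candidate when added to the current selection
        trial = {j: coef_pvalues(X[:, selected + [j]], y)[-1] for j in remaining}
        best = min(trial, key=trial.get)
        if trial[best] >= p_enter:
            break  # no candidate qualifies; entered variables are never removed
        selected.append(best)
        remaining.remove(best)
    return [names[j] for j in selected]

# Synthetic stand-in data (NOT the study's data): only x0 and x1 drive y.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)
order = forward_select(X, y, ["x0", "x1", "x2", "x3"])
print(order)
```

The returned list preserves the order of entry, which mirrors the step-by-step Coefficients tables in the walkthrough that follows.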

We try each variable in turn as the sole predictor, with Sales as the dependent variable. Variable Accounts has the smallest p-value of the eight predictor variables, so it is entered as the first variable in the solution, as shown below.

Seven variables now remain. Which one should be chosen? Pair each remaining variable in turn with variable Accounts and find its p-value, as shown below.

Note that variable AdvExp, when paired with variable Accounts, yields the lowest p-value. So it is chosen to join the group of selected variables, as seen in the Coefficients table.

Now we have two variables (Accounts, AdvExp) in the list of selected variables. Adding each of the remaining six variables one at a time and finding its p-value gives the following table.

As seen, variable Poten has a p-value of 0.019 when combined with the two selected variables (Accounts, AdvExp). So we add variable Poten to the list of selected variables, as seen in the following Coefficients table.

Next, we combine the remaining five variables one at a time with the three variables already selected (Accounts, AdvExp, Poten) and choose the variable with the lowest p-value, provided it is less than 0.05. The results are shown in the table below.

Variable Share has a p-value of 0.001, so it joins the three variables already selected, as seen in the following Coefficients table.

Next, we combine the remaining four variables one at a time with the four selected variables to find their p-values, which are shown in the following table.

None of the remaining four variables has a p-value less than 0.05, so no further variable is added to the list of selected variables. The ANOVA table for the four models is shown below and attests that Model 4 has the highest F value.

The Model Summary table shows that Model 4 has the lowest standard error and the highest adjusted R², 0.881.

SUMMARY OF FOUR REGRESSION METHODS
Here we compare the four regression methods and list which of the eight variables each method recommends including.

Situation A: Analysis of Regression Model including Four Variables Suggested by Backward Regression

Situation B: Regression Model by including three Variables as Recommended by ENTER method.

Situation C: Regression Model by Including Four Variables as Suggested by (i) Stepwise Method (ii) Forward Method

Comparison of Three Situations
Situation C has the best parameters of the three situations. The model proposed by this situation has the largest adjusted R², 0.881, and the maximum F value, 45.226. However, its standard error of the estimate is greater than that obtained in Situation B.

So the recommended model, according to Situation C, is:

Estimate of Sales = -1441.932 + 0.038 (Poten) + 0.175 (AdvExp) + 190.144 (Share) + 9.214 (Accounts)
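To use the recommended equation for prediction, the coefficients can be wrapped in a small function. The coefficients below are those of the equation above; the function name and the input values are hypothetical and chosen only to show the arithmetic, since the units of each variable are not stated here.

```python
INTERCEPT = -1441.932
COEFFS = {"Poten": 0.038, "AdvExp": 0.175, "Share": 190.144, "Accounts": 9.214}

def predict_sales(poten, adv_exp, share, accounts):
    """Estimated Sales from the four-variable model of Situation C."""
    return (INTERCEPT
            + COEFFS["Poten"] * poten
            + COEFFS["AdvExp"] * adv_exp
            + COEFFS["Share"] * share
            + COEFFS["Accounts"] * accounts)

# Hypothetical inputs, not taken from the data set.
estimate = predict_sales(poten=10000, adv_exp=1000, share=5, accounts=100)
print(round(estimate, 3))
```

Predictions are only as reliable as the fitted model, so inputs far outside the range of the observed data should be treated with caution.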
