
Chapter 14
Multiple Regression

14.1 Multiple Regression Analysis
14.2 Assumptions of the Multiple Regression Model
14.3 Standard Deviation of Errors
14.4 Coefficient of Multiple Determination
14.5 Computer Solution of Multiple Regression
In Chapter 13, we discussed simple linear regression and linear correlation. A simple regression model includes one independent and one dependent variable, and it presents a very simplified scenario of real-world situations. In the real world, a dependent variable is usually influenced by a number of independent variables. For example, the sales of a company's product may be determined by the price of that product, the quality of the product, and the advertising expenditure incurred by the company to promote that product. Therefore, it makes more sense to use a regression model that includes more than one independent variable. Such a model is called a multiple regression model. In this chapter we will discuss multiple regression models.


14.1 Multiple Regression Analysis


The simple linear regression model discussed in Chapter 13 was written as
y = A + Bx + ε
This model includes one independent variable, which is denoted by x, and one dependent variable, which is denoted by y. As we know from Chapter 13, the term represented by ε in the above model is called the random error. Usually a dependent variable is affected by more than one independent variable. When we include two or more independent variables in a regression model, it is called a multiple regression model. Remember, whether it is a simple or a multiple regression model, it always includes one and only one dependent variable.


A multiple regression model with y as a dependent variable and x1, x2, x3, …, xk as independent variables is written as
y = A + B1x1 + B2x2 + B3x3 + … + Bkxk + ε      (1)

where A represents the constant term, B1, B2, B3, …, Bk are the regression coefficients of independent variables x1, x2, x3, …, xk, respectively, and ε represents the random error term. This model contains k independent variables x1, x2, x3, …, and xk.

From model (1), it would seem that multiple regression models can only be used when the relationship between the dependent variable and each independent variable is linear. Furthermore, it also appears as if there can be no interaction between two or more of the independent variables. This is far from the truth. In the real world, a multiple regression model can be much more complex. Discussion of such models is outside the scope of this book. When each term contains a single independent variable raised to the first power, as in model (1), we call it a first-order multiple regression model. This is the only type of multiple regression model we will discuss in this chapter.

In regression model (1), A represents the constant term, which gives the value of y when all independent variables assume zero values. The coefficients B1, B2, B3, …, and Bk are called the partial regression coefficients. For example, B1 is the partial regression coefficient of x1. It gives the change in y due to a one-unit change in x1 when all other independent variables included in the model are held constant. In other words, if we change x1 by one unit but keep x2, x3, …, and xk unchanged, then the resulting change in y is measured by B1. Similarly, the value of B2 gives the change in y due to a one-unit change in x2 when all other independent variables are held constant. In model (1) above, A, B1, B2, B3, …, and Bk are called the true regression coefficients or population parameters.

A positive value for a particular Bi in model (1) will indicate a positive relationship between y and the corresponding xi variable. A negative value for a particular Bi in that model will indicate a negative relationship between y and the corresponding xi variable. Remember that in a first-order regression model such as model (1), the relationship between each xi and y is a straight-line relationship. In model (1), A + B1x1 + B2x2 + B3x3 + … + Bkxk is called the deterministic portion and ε is the stochastic portion of the model.

When we use the t distribution to make inferences about a single parameter of a multiple regression model, the degrees of freedom are calculated as
df = n − k − 1
where n represents the sample size and k is the number of independent variables in the model.

Definition
Multiple Regression Model  A regression model that includes two or more independent variables is called a multiple regression model. It is written as
y = A + B1x1 + B2x2 + B3x3 + … + Bkxk + ε
where y is the dependent variable, x1, x2, x3, …, xk are the k independent variables, and ε is the random error term. When each of the xi variables represents a single variable raised to the first power, as in the above model, this model is referred to as a first-order multiple regression model. For such a model with a sample size of n and k independent variables, the degrees of freedom are
df = n − k − 1

When a multiple regression model includes only two independent variables (with k = 2), model (1) reduces to
y = A + B1x1 + B2x2 + ε
A multiple regression model with three independent variables (with k = 3) is written as
y = A + B1x1 + B2x2 + B3x3 + ε


If model (1) is estimated using sample data, which is usually the case, the estimated regression equation is written as
ŷ = a + b1x1 + b2x2 + b3x3 + … + bkxk      (2)

In equation (2), a, b1, b2, b3, …, and bk are the sample statistics, which are the point estimators of the population parameters A, B1, B2, B3, …, and Bk, respectively. In model (1), y denotes the actual values of the dependent variable for members of the sample. In the estimated model (2), ŷ denotes the predicted or estimated values of the dependent variable. The difference between any pair of y and ŷ values gives the error of prediction. For a multiple regression model,
SSE = Σ(y − ŷ)²
where SSE stands for the error sum of squares. As in Chapter 13, the estimated regression equation (2) is obtained by minimizing the sum of squared errors, that is,
Minimize Σ(y − ŷ)²
The estimated equation (2) obtained by minimizing the sum of squared errors is called the least squares regression equation.

Usually the calculations in a multiple regression analysis are made by using statistical software packages for computers, such as MINITAB, instead of using the formulas manually. Even for a multiple regression equation with two independent variables, the formulas are complex and manual calculations are time consuming. In this chapter we will perform the multiple regression analysis using MINITAB. The solutions obtained by using other statistical software packages such as JMP, SAS, S-Plus, or SPSS can be interpreted the same way. The TI-84 and Excel do not have built-in procedures for the multiple regression model.
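Although this chapter carries out all such calculations in MINITAB, the least squares idea itself is easy to sketch in other software. The following Python fragment, with made-up data and variable names chosen only for illustration, finds the values a, b1, b2 that minimize the sum of squared errors for a two-predictor model.

```python
import numpy as np

# Hypothetical data: y is the dependent variable; x1 and x2 are independent variables.
x1 = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
x2 = np.array([1.0, 4.0, 2.0, 6.0, 3.0, 5.0])
y  = np.array([12.0, 21.0, 18.0, 30.0, 25.0, 33.0])

# Design matrix: a column of 1s for the constant term a, then x1 and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares chooses a, b1, b2 to minimize SSE = sum of (y - y_hat)^2.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef

y_hat = X @ coef
sse = np.sum((y - y_hat) ** 2)
print(f"y_hat = {a:.3f} + {b1:.3f}*x1 + {b2:.3f}*x2   (SSE = {sse:.3f})")
```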

14.2 Assumptions of the Multiple Regression Model

Like a simple linear regression model, a multiple (linear) regression model is based on certain assumptions. The following are the major assumptions for the multiple regression model (1).

Assumption 1: The mean of the probability distribution of ε is zero, that is, E(ε) = 0.

If we calculate the errors for all measurements for a given set of values of the independent variables for a population data set, the mean of these errors will be zero. In other words, while individual predictions will have some amount of error, on average our predictions will be correct. Under this assumption, the mean value of y is given by the deterministic part of regression model (1). Thus,
E(y) = A + B1x1 + B2x2 + B3x3 + … + Bkxk
where E(y) is the expected or mean value of y for the population. This mean value of y is also denoted by μy|x1, x2, …, xk.

Assumption 2: The errors associated with different sets of values of independent variables are independent. Furthermore, these errors are normally distributed and have a constant standard deviation, which is denoted by σ.

Assumption 3: The independent variables are not linearly related. However, they can have a nonlinear relationship. When independent variables are highly linearly correlated, it is referred to as multicollinearity. This assumption is about the nonexistence of the multicollinearity problem. For example, consider the following multiple regression model:
y = A + B1x1 + B2x2 + B3x3 + ε


All of the following linear relationships (and other such linear relationships) between x1, x2, and x3 should be invalid for this model:
x1 = x2 + 4x3    x2 = 5x1 + 2x3    x1 = 3.5x2
If any linear relationship exists, we can substitute one variable for another, which will reduce the number of independent variables to two. However, nonlinear relationships, such as x1 = 4x2² and x2 = 2x1 + 6x3³, between x1, x2, and x3 are permissible. In practice, multicollinearity is a major issue. Examining the correlation for each pair of independent variables is a good way to determine if multicollinearity exists.

Assumption 4: There is no linear association between the random error term ε and each independent variable xi.
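Regarding Assumption 3, one informal way to screen for multicollinearity, as suggested above, is to inspect the pairwise correlations among the independent variables. A minimal Python sketch with hypothetical data follows; the 0.8 cutoff is an arbitrary illustrative threshold, not a rule from the text.

```python
import numpy as np

# Hypothetical observed values of three independent variables.
x1 = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
x2 = np.array([3.0, 8.0, 9.0, 15.0, 17.0, 25.0])   # tracks x1 closely on purpose
x3 = np.array([30.0, 12.0, 25.0, 8.0, 20.0, 5.0])

names = ["x1", "x2", "x3"]
corr = np.corrcoef([x1, x2, x3])          # pairwise correlation matrix
print(np.round(corr, 2))

# Flag highly correlated pairs (|r| > 0.8 is used only as an illustration).
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.8:
            print(f"Possible multicollinearity between {names[i]} and {names[j]}")
```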

14.3 Standard Deviation of Errors


The standard deviation of errors (also called the standard error of the estimate) for the multiple regression model (1) is denoted by σ, and it is a measure of variation among errors. However, when sample data are used to estimate multiple regression model (1), the standard deviation of errors is denoted by se. The formula to calculate se is as follows:
se = √[SSE / (n − k − 1)]    where SSE = Σ(y − ŷ)²

Note that here SSE is the error sum of squares. We will not use this formula to calculate se manually. Rather we will obtain it from the computer solution. Note that many software packages label se as Root MSE, where MSE stands for mean square error.
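As a small numerical illustration of this formula (separate from the text's own example), the Python lines below compute SSE and se from hypothetical observed and predicted values for a model with k = 2 independent variables.

```python
import numpy as np

# Hypothetical observed values, predicted values, and model size.
y     = np.array([10.0, 14.0, 19.0, 23.0, 30.0, 33.0])
y_hat = np.array([11.2, 13.5, 18.8, 24.1, 29.0, 32.4])
n, k = len(y), 2                       # sample size and number of independent variables

sse = np.sum((y - y_hat) ** 2)         # error sum of squares
s_e = np.sqrt(sse / (n - k - 1))       # standard deviation of errors (often labeled Root MSE)
print(round(sse, 3), round(s_e, 3))
```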

14.4 Coefficient of Multiple Determination


In Chapter 13, we denoted the coefficient of determination for a simple linear regression model by r² and defined it as the proportion of the total sum of squares SST that is explained by the regression model. The coefficient of determination for the multiple regression model, usually called the coefficient of multiple determination, is denoted by R² and is defined as the proportion of the total sum of squares SST that is explained by the multiple regression model. It tells us how good the multiple regression model is and how well the independent variables included in the model explain the dependent variable. Like r², the value of the coefficient of multiple determination R² always lies in the range 0 to 1, that is,
0 ≤ R² ≤ 1
Just as in the case of the simple linear regression model, SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the error sum of squares. SST is always equal to the sum of SSE and SSR. They are calculated as follows:
SSE = Σe² = Σ(y − ŷ)²
SST = SSyy = Σ(y − ȳ)²

SSR = Σ(ŷ − ȳ)²
SSR is the portion of SST that is explained by the use of the regression model, and SSE is the portion of SST that is not explained by the use of the regression model. The coefficient of multiple determination is given by the ratio of SSR and SST as follows:
R² = SSR / SST


The coefficient of multiple determination R² has one major shortcoming. The value of R² generally increases as we add more and more explanatory variables to the regression model (even if they do not belong in the model). Just because we can increase the value of R² does not imply that the regression equation with a higher value of R² does a better job of predicting the dependent variable. Such a value of R² will be misleading, and it will not represent the true explanatory power of the regression model. To eliminate this shortcoming of R², it is preferable to use the adjusted coefficient of multiple determination, which is denoted by R̄². Note that R̄² is the coefficient of multiple determination adjusted for degrees of freedom. The value of R̄² may increase, decrease, or stay the same as we add more explanatory variables to our regression model. If a new variable added to the regression model contributes significantly to explaining the variation in y, then R̄² increases; otherwise it decreases. The value of R̄² is calculated as
R̄² = 1 − (1 − R²) · (n − 1)/(n − k − 1)
or, equivalently,
R̄² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]

Thus, if we know R², we can find the value of R̄². Almost all statistical software packages give the values of both R² and R̄² for a regression model. Another property of R̄² to remember is that whereas R² can never be negative, R̄² can be negative.

While a general rule of thumb is that a higher value of R̄² implies that a specific set of independent variables does a better job of predicting a specific dependent variable, it is important to recognize that some dependent variables have a great deal more variability than others. Therefore, R̄² = .30 could imply that a specific model is not a very strong model, but it could be the best possible model in a certain scenario. Many good financial models have values of R̄² below .50.
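For readers who want to verify these formulas numerically, here is a minimal Python sketch using hypothetical sums of squares; it simply evaluates the expressions above.

```python
# Hypothetical sums of squares for a model with n = 12 observations and k = 2 predictors.
sse, sst = 1000.0, 14500.0
n, k = 12, 2

r2 = 1 - sse / sst                                 # equivalent to SSR / SST
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)      # adjusted for degrees of freedom
print(round(r2, 3), round(r2_adj, 3))              # about 0.931 and 0.916
```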

14.5 Computer Solution of Multiple Regression


In this section, we take an example of a multiple regression model, solve it using MINITAB, interpret the solution, and make inferences about the population parameters of the regression model.

EXAMPLE 14-1
Using MINITAB to find a multiple regression equation.

A researcher wanted to find the effect of driving experience and the number of driving violations on auto insurance premiums. A random sample of 12 drivers insured with the same company and having similar auto insurance policies was selected from a large city. Table 14.1 lists
Table 14.1
Monthly Premium (dollars)   Driving Experience (years)   Number of Driving Violations (past 3 years)
148    5     2
76     14    0
100    6     1
126    10    3
194    4     6
110    8     2
114    11    3
86     16    1
198    3     5
92     9     1
70     19    0
120    13    3


the monthly auto insurance premiums (in dollars) paid by these drivers, their driving experiences (in years), and the numbers of driving violations committed by them during the past three years. Using MINITAB, find the regression equation of monthly premiums paid by drivers on the driving experiences and the numbers of driving violations.

Solution  Let
y = the monthly auto insurance premium (in dollars) paid by a driver
x1 = the driving experience (in years) of a driver
x2 = the number of driving violations committed by a driver during the past three years
We are to estimate the regression model
y = A + B1x1 + B2x2 + ε      (3)

The first step is to enter the data of Table 14.1 into the MINITAB spreadsheet as shown in Screen 14.1. Here we have entered the given data in columns C1, C2, and C3 and named them Monthly Premium, Driving Experience, and Driving Violations, respectively.

Screen 14.1

To obtain the estimated regression equation, select Stat > Regression > Regression. In the dialog box you obtain, enter Monthly Premium in the Response box, and Driving Experience and Driving Violations in the Predictors box, as shown in Screen 14.2. Note that you can enter the column names C1, C2, and C3 instead of the variable names in these boxes. Click OK to obtain the output, which is shown in Screen 14.3. From the output given in Screen 14.3, the estimated regression equation is:
ŷ = 110 − 2.75x1 + 16.1x2


Screen 14.2

Screen 14.3
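If MINITAB is not available, the same least squares estimates can be reproduced in other software. The sketch below, offered only as an alternative to Screens 14.2 and 14.3, fits model (3) to the Table 14.1 data with Python's statsmodels package; its coefficients, standard deviation of errors, R², and adjusted R² should agree with the MINITAB output up to rounding.

```python
import numpy as np
import statsmodels.api as sm

# Data from Table 14.1.
premium    = np.array([148, 76, 100, 126, 194, 110, 114, 86, 198, 92, 70, 120], dtype=float)
experience = np.array([5, 14, 6, 10, 4, 8, 11, 16, 3, 9, 19, 13], dtype=float)
violations = np.array([2, 0, 1, 3, 6, 2, 3, 1, 5, 1, 0, 3], dtype=float)

# Design matrix: constant term, then the two predictors.
X = sm.add_constant(np.column_stack([experience, violations]))
results = sm.OLS(premium, X).fit()

print(results.params)       # estimates of A, B1, B2
print(results.summary())    # also reports s_e, R-sq, adjusted R-sq, t statistics, p-values
```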


14.5.1 Estimated Multiple Regression Model

Example 14-2 describes, among other things, how the coefficients of the multiple regression model are interpreted.

EXAMPLE 14-2
Refer to Example 14-1 and the MINITAB solution given in Screen 14.3.
(a) Explain the meaning of the estimated regression coefficients.
(b) What are the values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination?
(c) What is the predicted auto insurance premium paid per month by a driver with seven years of driving experience and three driving violations committed in the past three years?
(d) What is the point estimate of the expected (or mean) auto insurance premium paid per month by all drivers with 12 years of driving experience and 4 driving violations committed in the past three years?

Solution
(a) From the portion of the MINITAB solution that is marked I in Screen 14.3, the estimated regression equation is
ŷ = 110 − 2.75x1 + 16.1x2      (4)
From this equation, a = 110, b1 = −2.75, and b2 = 16.1
Interpreting parts of the MINITAB solution of multiple regression.

We can also read the values of these coefficients from the column labeled Coef in the portion of the output marked II in the MINITAB solution of Screen 14.3. From this column we obtain
a = 110.28, b1 = −2.7473, and b2 = 16.106

Notice that in this column the coefficients of the regression equation appear with more digits after the decimal point. With these coefficient values, we can write the estimated regression equation as
ŷ = 110.28 − 2.7473x1 + 16.106x2      (5)

The value of a = 110.28 in the estimated regression equation (5) gives the value of ŷ for x1 = 0 and x2 = 0. Thus, a driver with no driving experience and no driving violations committed in the past three years is expected to pay an auto insurance premium of $110.28 per month. Again, this is the technical interpretation of a. In reality, that may not be true because none of the drivers in our sample has both zero experience and zero driving violations. As all of us know, some of the highest premiums are paid by teenagers just after obtaining their drivers' licenses.

The value of b1 = −2.7473 in the estimated regression model gives the change in ŷ for a one-unit change in x1 when x2 is held constant. Thus, we can state that a driver with one extra year of experience but the same number of driving violations is expected to pay $2.7473 (or $2.75) less per month for the auto insurance premium. Note that because b1 is negative, an increase in driving experience decreases the premium paid. In other words, y and x1 have a negative relationship.

The value of b2 = 16.106 in the estimated regression model gives the change in ŷ for a one-unit change in x2 when x1 is held constant. Thus, a driver with one extra driving violation during the past three years but with the same years of driving experience is expected to pay $16.106 (or $16.11) more per month for the auto insurance premium.


(b) The values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination are given in part III of the MINITAB solution of Screen 14.3. From this part of the solution,
se = 12.1459, R² = 93.1%, and R̄² = 91.6%

Thus, the standard deviation of errors is 12.1459. The value of R² = 93.1% tells us that the two independent variables, years of driving experience and the number of driving violations, explain 93.1% of the variation in the auto insurance premiums. The value of R̄² = 91.6% is the value of the coefficient of multiple determination adjusted for degrees of freedom. It states that when adjusted for degrees of freedom, the two independent variables explain 91.6% of the variation in the dependent variable.

(c) To find the predicted auto insurance premium paid per month by a driver with seven years of driving experience and three driving violations during the past three years, we substitute x1 = 7 and x2 = 3 in the estimated regression model (5). Thus,
ŷ = 110.28 − 2.7473x1 + 16.106x2 = 110.28 − 2.7473(7) + 16.106(3) = $139.37

Note that this value of ŷ is a point estimate of the predicted value of y, which is denoted by yp. The concept of the predicted value of y is the same as that for a simple linear regression model, discussed in Section 13.8.2 of Chapter 13.

(d) To obtain the point estimate of the expected (mean) auto insurance premium paid per month by all drivers with 12 years of driving experience and four driving violations during the past three years, we substitute x1 = 12 and x2 = 4 in the estimated regression equation (5). Thus,
ŷ = 110.28 − 2.7473x1 + 16.106x2 = 110.28 − 2.7473(12) + 16.106(4) = $141.74

This value of ŷ is a point estimate of the mean value of y, which is denoted by E(y) or μy|x1, x2. The concept of the mean value of y is the same as that for a simple linear regression model, discussed in Section 13.8.1 of Chapter 13.
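Parts (c) and (d) amount to plugging values into equation (5); a short Python check, assuming the rounded coefficients read from Screen 14.3, is shown below.

```python
# Coefficients read from the MINITAB output (Screen 14.3).
a, b1, b2 = 110.28, -2.7473, 16.106

def estimated_premium(experience_years, violations):
    """Point estimate of the monthly premium from estimated equation (5)."""
    return a + b1 * experience_years + b2 * violations

print(round(estimated_premium(7, 3), 2))    # part (c): about 139.37
print(round(estimated_premium(12, 4), 2))   # part (d): about 141.74
```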

14.5.2 Confidence Interval for an Individual Coefficient


The values of a, b1, b2, b3, …, and bk obtained by estimating model (1) using sample data give the point estimates of A, B1, B2, B3, …, and Bk, respectively, which are the population parameters. Using the values of the sample statistics a, b1, b2, b3, …, and bk, we can make confidence intervals for the corresponding population parameters A, B1, B2, B3, …, and Bk, respectively. Because of the assumption that the errors are normally distributed, the sampling distribution of each bi is normal with its mean equal to Bi and standard deviation equal to σbi. For example, the sampling distribution of b1 is normal with its mean equal to B1 and standard deviation equal to σb1. However, usually σ is not known and, hence, we cannot find σbi. Consequently, we use sbi as an estimator of σbi and use the t distribution to determine a confidence interval for Bi. The formula to obtain a confidence interval for a population parameter Bi is given below. This is the same formula we used to make a confidence interval for B in Section 13.5.2 of Chapter 13. The only difference is that to make a confidence interval for a particular Bi for a multiple regression model, the degrees of freedom are n − k − 1.

Confidence Interval for Bi  The (1 − α)100% confidence interval for Bi is given by
bi ± tsbi
The value of t that is used in this formula is obtained from the t distribution table for α/2 area in the right tail of the t distribution curve and (n − k − 1) degrees of freedom. The values of bi and sbi are obtained from the computer solution.


Example 14-3 describes the procedure to make a confidence interval for an individual regression coefficient Bi.

EXAMPLE 14-3
Determine a 95% confidence interval for B1 (the coefficient of experience) for the multiple regression of auto insurance premium on driving experience and the number of driving violations. Use the MINITAB solution of Screen 14.3.

Solution  To make a confidence interval for B1, we use the portion marked II in the MINITAB solution of Screen 14.3. From that portion of the MINITAB solution,
b1 = −2.7473 and sb1 = .9770
Making a confidence interval for an individual coefficient of a multiple regression model.

Note that the value of the standard deviation of b1, sb1 = .9770, is given in the column labeled SE Coef in part II of the MINITAB solution. The confidence level is 95%. The area in each tail of the t distribution curve is obtained as follows:
Area in each tail of the t distribution = α/2 = (1 − .95)/2 = .025
The sample size is 12, which gives n = 12. Because there are two independent variables, k = 2. Therefore,
Degrees of freedom = n − k − 1 = 12 − 2 − 1 = 9
From the t distribution table (Table V of Appendix C), the value of t for .025 area in the right tail of the t distribution curve and 9 degrees of freedom is 2.262. Then, the 95% confidence interval for B1 is
b1 ± tsb1 = −2.7473 ± 2.262(.9770) = −2.7473 ± 2.2100 = −4.9573 to −.5373

Thus, the 95% confidence interval for B1 is −4.96 to −.54. That is, we can state with 95% confidence that for one extra year of driving experience, the monthly auto insurance premium changes by an amount between −$4.96 and −$.54. Note that since both limits of the confidence interval have negative signs, we can also state that for each extra year of driving experience, the monthly auto insurance premium decreases by an amount between $.54 and $4.96.

By applying the procedure used in Example 14-3, we can make a confidence interval for any of the coefficients (including the constant term) of a multiple regression model, such as A and B2 in model (3). For example, the 95% confidence intervals for A and B2, respectively, are
a ± tsa = 110.28 ± 2.262(14.62) = 77.21 to 143.35
b2 ± tsb2 = 16.106 ± 2.262(2.613) = 10.20 to 22.02
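The interval arithmetic above can be reproduced with a few lines of Python, taking b1 and its standard error from Screen 14.3 and using scipy's t distribution in place of Table V; this is only a numerical check of the same formula.

```python
from scipy import stats

b1, se_b1 = -2.7473, 0.9770      # coefficient and SE Coef from the MINITAB output
n, k = 12, 2
conf = 0.95

t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - k - 1)     # about 2.262 for 9 df
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(t_crit, 3), round(lower, 4), round(upper, 4))  # about -4.957 to -0.537
```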

14.5.3 Testing a Hypothesis about an Individual Coefficient


We can perform a test of hypothesis about any of the Bi coefficients of regression model (1) using the same procedure that we used to make a test of hypothesis about B for a simple regression model in Section 13.5.3 of Chapter 13. The only difference is that the degrees of freedom are equal to n − k − 1 for a multiple regression model. Again, because of the assumption that the errors are normally distributed, the sampling distribution of each bi is normal with its mean equal to Bi and standard deviation equal to σbi. However, usually σ is not known and, hence, we cannot find σbi. Consequently, we use sbi as an estimator of σbi, and use the t distribution to perform the test.


Test Statistic for bi

The value of the test statistic t for bi is calculated as
t = (bi − Bi) / sbi

The value of Bi is substituted from the null hypothesis. Usually, but not always, the null hypothesis is H0: Bi = 0. The MINITAB solution contains this value of the t statistic. Example 14-4 illustrates the procedure for testing a hypothesis about a single coefficient.

EXAMPLE 14-4
Testing a hypothesis about a coefficient of a multiple regression model.

Using the 2.5% significance level, can you conclude that the coefficient of the number of years of driving experience in regression model (3) is negative? Use the MINITAB output obtained in Example 14-1 and shown in Screen 14.3 to perform this test.

Solution  From Example 14-1, our multiple regression model (3) is
y = A + B1x1 + B2x2 + ε
where y is the monthly auto insurance premium (in dollars) paid by a driver, x1 is the driving experience (in years), and x2 is the number of driving violations committed during the past three years. From the MINITAB solution, the estimated regression equation is
ŷ = 110.28 − 2.7473x1 + 16.106x2
To conduct a test of hypothesis about B1, we use the portion marked II in the MINITAB solution given in Screen 14.3. From that portion of the MINITAB solution,
b1 = −2.7473 and sb1 = .9770
Note that the value of the standard deviation of b1, sb1 = .9770, is given in the column labeled SE Coef in part II of the MINITAB solution. To make a test of hypothesis about B1, we perform the following five steps.

Step 1. State the null and alternative hypotheses. We are to test whether or not the coefficient of the number of years of driving experience in regression model (3) is negative, that is, whether or not B1 is negative. The two hypotheses are
H0: B1 = 0    H1: B1 < 0
Note that we can also write the null hypothesis as H0: B1 ≥ 0, which states that the coefficient of the number of years of driving experience in regression model (3) is either zero or positive.

Step 2. Select the distribution to use. The sample size is small (n < 30) and σ is not known. The sampling distribution of b1 is normal because the errors are assumed to be normally distributed. Hence, we use the t distribution to make a test of hypothesis about B1.

Step 3. Determine the rejection and nonrejection regions. The significance level is .025. The < sign in the alternative hypothesis indicates that the test is left-tailed. Therefore, the area in the left tail of the t distribution curve is .025. The degrees of freedom are
df = n − k − 1 = 12 − 2 − 1 = 9
From the t distribution table (Table V in Appendix C), the critical value of t for 9 degrees of freedom and .025 area in the left tail of the t distribution curve is −2.262, as shown in Figure 14.1.


Figure 14.1  The rejection region (α = .025) lies in the left tail of the t distribution curve, to the left of the critical value t = −2.262; the nonrejection region lies to its right.

Step 4. Calculate the value of the test statistic and the p-value. The value of the test statistic t for b1 can be obtained from the MINITAB solution given in Screen 14.3. This value is given in the column labeled T and the row named Driving Experience in the portion marked II in that MINITAB solution. Thus, the observed value of t is
t = (b1 − B1) / sb1 = −2.81
Also, in the same portion of the MINITAB solution, the p-value for this test is given in the column labeled P and the row named Driving Experience. This p-value is .020. However, MINITAB always gives the p-value for a two-tailed test. Because our test is one-tailed, the p-value for our test is
p-value = .020 / 2 = .010

Step 5. Make a decision. The value of the test statistic, t = −2.81, is less than the critical value of t = −2.262, and it falls in the rejection region. Consequently, we reject the null hypothesis and conclude that the coefficient of x1 in regression model (3) is negative. That is, an increase in driving experience decreases the auto insurance premium. Also, the p-value for the test is .010, which is less than the significance level of α = .025. Hence, based on this p-value also, we reject the null hypothesis and conclude that B1 is negative.

Note that the observed value of t in Step 4 of Example 14-4 is obtained from the MINITAB solution only if the null hypothesis is H0: B1 = 0. However, if the null hypothesis is that B1 is equal to a number other than zero, then the t value obtained from the MINITAB solution is no longer valid. For example, suppose the null hypothesis in Example 14-4 is
H0: B1 = −2
and the alternative hypothesis is
H1: B1 < −2
In this case the observed value of t will be calculated as
t = (b1 − B1) / sb1 = [−2.7473 − (−2)] / .9770 = −.765

To calculate this value of t, the values of b1 and sb1 are obtained from the MINITAB solution of Screen 14.3. The value of B1 is substituted from H0.
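A brief Python sketch of this calculation, using the b1 and sb1 values from Screen 14.3, covers both the H0: B1 = 0 test of Example 14-4 and the H0: B1 = −2 variant above; the p-values are one-tailed.

```python
from scipy import stats

b1, se_b1 = -2.7473, 0.9770
df = 12 - 2 - 1                      # n - k - 1 = 9

def left_tailed_test(b1_null):
    """t statistic and one-tailed p-value for H0: B1 = b1_null versus H1: B1 < b1_null."""
    t = (b1 - b1_null) / se_b1
    return t, stats.t.cdf(t, df)

print(left_tailed_test(0.0))     # about t = -2.81, p = .010
print(left_tailed_test(-2.0))    # about t = -0.76, p = .23
```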

EXERCISES
CONCEPTS AND PROCEDURES
14.1 How are the coefficients of independent variables in a multiple regression model interpreted? Explain.
14.2 What are the degrees of freedom for a multiple regression model to make inferences about individual parameters?


14.3 What kinds of relationships among independent variables are permissible and which ones are not permissible in a linear multiple regression model?
14.4 Explain the meaning of the coefficient of multiple determination and the adjusted coefficient of multiple determination for a multiple regression model. What is the difference between the two?
14.5 What are the assumptions of a multiple regression model?
14.6 The following table gives data on variables y, x1, x2, and x3.
y    x1   x2   x3
8    18   38   74
11   26   25   64
19   34   24   47
21   38   44   31
7    13   12   79
23   49   48   35
16   28   38   42
27   59   52   18
9    14   17   71
13   21   39   57

Using MINITAB, estimate the regression model
y = A + B1x1 + B2x2 + B3x3 + ε
Using the solution obtained, answer the following questions.
a. Write the estimated regression equation.
b. Explain the meaning of a, b1, b2, and b3 obtained by estimating the given regression model.
c. What are the values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination?
d. What is the predicted value of y for x1 = 35, x2 = 40, and x3 = 65?
e. What is the point estimate of the expected (mean) value of y for all elements given that x1 = 40, x2 = 30, and x3 = 55?
f. Construct a 95% confidence interval for the coefficient of x3.
g. Using the 2.5% significance level, test whether or not the coefficient of x1 is positive.

14.7 The following table gives data on variables y, x1, and x2.
y    x1    x2
24   98    52
14   51    69
18   74    63
31   108   35
10   33    88
29   119   54
26   99    51
33   141   31
13   47    67
27   103   41
26   111   46

Using MINITAB, find the regression of y on x1 and x2. Using the solution obtained, answer the following questions.
a. Write the estimated regression equation.
b. Explain the meaning of the estimated regression coefficients of the independent variables.


c. What are the values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination?
d. What is the predicted value of y for x1 = 87 and x2 = 54?
e. What is the point estimate of the expected (mean) value of y for all elements given that x1 = 95 and x2 = 49?
f. Construct a 99% confidence interval for the coefficient of x1.
g. Using the 1% significance level, test if the coefficient of x2 in the population regression model is negative.

APPLICATIONS
14.8 The salaries of workers are expected to be dependent, among other factors, on the number of years they have spent in school and their work experience. The following table gives information on the annual salaries (in thousands of dollars) for 12 persons, the number of years each of them spent in school, and the total number of years of their work experience.
Salary   Schooling   Experience
52       16          6
44       12          10
48       13          15
77       20          8
68       18          11
48       16          2
59       14          12
83       18          4
28       12          6
61       16          9
27       12          2
69       16          18

Using MINITAB, find the regression of salary on schooling and experience. Using the solution obtained, answer the following questions.
a. Write the estimated regression equation.
b. Explain the meaning of the estimates of the constant term and the regression coefficients of the independent variables.
c. What are the values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination?
d. How much salary is a person with 18 years of schooling and 7 years of work experience expected to earn?
e. What is the point estimate of the expected (mean) salary for all people with 16 years of schooling and 10 years of work experience?
f. Determine a 99% confidence interval for the coefficient of schooling.
g. Using the 1% significance level, test whether or not the coefficient of experience is positive.

14.9 The CTO Corporation has a large number of chain restaurants throughout the United States. The research department at the company wanted to find if the sales of restaurants depend on the size of the population within a certain area surrounding the restaurants and the mean income of households in those areas. The company collected information on these variables for 11 restaurants. The following table gives information on the weekly sales (in thousands of dollars) of these restaurants, the population (in thousands) within five miles of the restaurants, and the mean annual income (in thousands of dollars) of the households in those areas.
Sales   Population   Income
19      21           58
29      15           69
17      32           49
21      18           52
14      47           67
30      69           76
33      29           81
22      43           46
18      75           39
27      39           64
24      53           28

Using MINITAB, find the regression of sales on population and income. Using the solution obtained, answer the following questions.
a. Write the estimated regression equation.
b. Explain the meaning of the estimates of the constant term and the regression coefficients of population and income.
c. What are the values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination?
d. What are the predicted sales for a restaurant with 50 thousand people living within a five-mile area surrounding it and a $55 thousand mean annual income of households in that area?
e. What is the point estimate of the expected (mean) sales for all restaurants with 45 thousand people living within a five-mile area surrounding them and a $46 thousand mean annual income of households living in those areas?
f. Determine a 95% confidence interval for the coefficient of income.
g. Using the 1% significance level, test whether or not the coefficient of population is different from zero.


USES AND MISUSES... ADDITIVE VERSUS MULTIPLICATIVE EFFECT


A first-order multiple regression model with (quantitative) independent variables is one of the simpler types of multiple regression models. However, there are many limitations of this model. A major limitation is that the independent variables have an additive effect on the dependent variable. What does additive mean here? Suppose we have the following estimated regression equation:
ŷ = 4 + 6x1 + 3x2
From this estimated regression equation, if x1 increases by 1 unit (with x2 held constant), our predicted value of y increases by 6 units. If x2 increases by 1 unit (with x1 held constant), our predicted value of y increases by 3 units. But what happens if x1 and x2 both increase by 1 unit each? From this equation, our predicted value of y will increase by 6 + 3 = 9 units. The total increase in ŷ is simply the sum of the two increases. This change in ŷ does not depend on the values of x1 and x2 prior to the increase. Since the total increase in the dependent variable is equal to the sum of the increases from the two individual parts (independent variables), we say that the effect is additive.

Now suppose we have the following equation:
ŷ = 4 + 6x1 + 3x2 + 5x1²x2
The important difference in this case is that the increase in the value of ŷ is no longer constant when x1 and x2 both increase by 1 unit each. Instead, it depends on the original values of x1 and x2. For example, consider the values of x1 and x2, and the changes in the value of ŷ, shown in the following table.

x1   x2   ŷ     Change in ŷ (versus x1 = 2 and x2 = 3)
2    3    85    (base)
3    3    166   81
2    4    108   23
3    4    214   129

Unlike the previous example, here the total increase in ŷ is not equal to the sum of the increases from the individual parts. In this case, the effect is said to be multiplicative. It is important to recognize that the effect is multiplicative when the total increase does not equal the sum of the increases of the independent variables.

Pharmaceutical companies are always looking for multiplicative effects when creating new drugs. In many cases, a combination of two drugs might have a multiplicative effect on a certain condition. Simply stated, the two drugs provide greater relief when you take them together than if you take them separately so that only one drug is in your system at any time. Of course, the companies also have to look for multiplicative effects when it comes to side effects. Individual drugs may not have major side effects when taken separately, but could cause greater harm when taken together. One of the most noteworthy examples of this was the drug Fen-Phen, which was a combination of two drugs, Fenfluramine and Phentermine. Each of these two drugs had been approved for short-term (individual) control of obesity. However, the drugs used in combination became popular for long-term weight loss. Unfortunately, the combination, when associated with long-time use, resulted in severe side effects that were detailed in the following statement from the Food and Drug Administration in 1997:

"Thanks to the reporting of health care professionals, as of August 22, FDA has received reports of 82 cases (including Mayo's 24 cases) of cardiac valvular disease in patients, two of whom were men, on combination fenfluramine and phentermine. These reports have been from 23 different states. Severity of the cardiac valvular disease was graded as moderate or severe in over three-fourths of the cases, and two of the reports described deterioration from no detectable heart murmur to need for a valve replacement within one-and-a-half years. Sixteen of these 82 patients required surgery to repair their heart valves. At least one of these patients died following surgery to repair the valves. (The agency's findings, as of July 31, are described in more detail in the current issue of The New England Journal of Medicine, which also carries the Mayo study.)"

Source: http://www.fda.gov/cder/news/phen/fenphenupdate.htm
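The numbers in the table above can be checked directly. The short Python sketch below evaluates both equations, taking the second equation to be ŷ = 4 + 6x1 + 3x2 + 5x1²x2 (which matches the tabled values), and prints the changes relative to x1 = 2, x2 = 3.

```python
def additive(x1, x2):
    # Additive model: the effect of a (+1, +1) move is always 6 + 3 = 9.
    return 4 + 6 * x1 + 3 * x2

def multiplicative(x1, x2):
    # Model with the higher-order term 5 * x1**2 * x2, matching the table above.
    return 4 + 6 * x1 + 3 * x2 + 5 * x1 ** 2 * x2

base = multiplicative(2, 3)                        # 85
for x1, x2 in [(3, 3), (2, 4), (3, 4)]:
    change = multiplicative(x1, x2) - base
    print(x1, x2, multiplicative(x1, x2), change)  # 166/81, 108/23, 214/129

print(additive(3, 4) - additive(2, 3))             # always 9 for a (+1, +1) move
```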

Glossary
Adjusted coefficient of multiple determination  Denoted by R̄², it gives the proportion of SST that is explained by the multiple regression model and is adjusted for the degrees of freedom.
Coefficient of multiple determination  Denoted by R², it gives the proportion of SST that is explained by the multiple regression model.
First-order multiple regression model  When each term in a regression model contains a single independent variable raised to the first power.
Least squares regression model  The estimated regression model obtained by minimizing the sum of squared errors.
Multicollinearity  When two or more independent variables in a regression model are highly correlated.
Multiple regression model  A regression model that contains two or more independent variables.
Partial regression coefficients  The coefficients of independent variables in a multiple regression model are called the partial regression coefficients because each of them gives the effect of the corresponding independent variable on the dependent variable when all other independent variables are held constant.


Standard deviation of errors  Also called the standard deviation of estimate, it is a measure of the variation among errors.
SSE (error sum of squares)  The sum of the squared differences between the actual and predicted values of y. It is the portion of SST that is not explained by the regression model.

SSR (regression sum of squares)  The portion of SST that is explained by the regression model.
SST (total sum of squares)  The sum of the squared differences between the actual y values and ȳ.

Self-Review Test
1. When using the t distribution to make inferences about a single parameter, the degrees of freedom for a multiple regression model with k independent variables and a sample size of n are equal to
a. n − k − 1    b. n + k − 1    c. n − k + 1
2. The value of R² is always in the range
a. zero to 1    b. −1 to 1    c. −1 to zero
3. The value of R̄² is
a. always positive    b. always nonnegative    c. can be positive, zero, or negative
4. What is the difference between the population multiple regression model and the estimated multiple regression model?
5. Why are the regression coefficients in a multiple regression model called the partial regression coefficients?
6. What is the difference between R² and R̄²? Explain.
7. A real estate expert wanted to find the relationship between the sale price of houses and various characteristics of the houses. She collected data on four variables, recorded in the table, for 13 houses that were sold recently. The four variables are
Price = sale price of a house in thousands of dollars
Lot size = size of the lot in acres
Living area = living area in square feet
Age = age of a house in years

Price   Lot Size   Living Area   Age
455     1.4        2500          8
278     .9         2250          12
463     1.8        2900          5
327     .7         1800          9
505     2.6        3200          4
264     1.2        2400          28
445     2.1        2700          9
346     1.1        2050          13
487     2.8        2850          7
289     1.6        2400          16
434     3.2        2600          5
411     1.7        2300          8
223     .5         1700          19

Using MINITAB, find the regression of price on lot size, living area, and age. Using the solution obtained, answer the following questions.
a. Indicate whether you expect a positive or a negative relationship between the dependent variable and each of the independent variables.
b. Write the estimated regression equation. Are the signs of the coefficients of independent variables obtained in the solution consistent with your expectations of part a?
c. Explain the meaning of the estimated regression coefficients of all independent variables.
d. What are the values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination?


e. What is the predicted sale price of a house that has a lot size of 2.5 acres, a living area of 3000 square feet, and is 14 years old?
f. What is the point estimate of the mean sale price of all houses that have a lot size of 2.2 acres, a living area of 2500 square feet, and are 7 years old?
g. Determine a 99% confidence interval for each of the coefficients of the independent variables.
h. Construct a 98% confidence interval for the constant term in the population regression model.
i. Using the 1% significance level, test whether or not the coefficient of lot size is positive.
j. At the 2.5% significance level, test if the coefficient of living area is positive.
k. At the 5% significance level, test if the coefficient of age is negative.

Mini-Project 14-1
Refer to the McDonald's data set explained in Appendix B and given on the Web site of this text. Use MINITAB to estimate the following regression model for that data set:
y = A + B1x1 + B2x2 + B3x3 + ε
where
y = calories
x1 = fat (measured in grams)
x2 = carbohydrate (measured in grams)
and
x3 = protein (measured in grams)

Now research on the Internet or in a book to find the number of calories in one gram of fat, one gram of carbohydrate, and one gram of protein.
a. Based on the information you obtain, write what the estimated regression equation should be.
b. Are the differences between your expectation in part a and the regression equation that you obtained from MINITAB small or large?
c. Since each gram of fat is worth a specific number of calories, and the same is true for a gram of carbohydrate and for a gram of protein, one would expect that the predicted and observed values of y would be the same for each food item, but that is not the case. The quantities of fat, carbohydrate, and protein are reported in whole numbers. Explain why this causes the differences discussed in part b.

DECIDE FOR YOURSELF Dummy Variables


In Sanford & Son, a very popular TV show of the 1970s, Fred Sanford would often refer to other people as big dummies. So, if a statistics professor questions your work and mentions a dummy in the process, should you be offended? Obviously, context will help you to answer that question, but if the professor is referring to a dummy variable, then do not take it personally. A dummy variable is the name given to a categorical independent variable used in a multiple regression model. The simplest version occurs when there are only two categories. In this case, we assign a value of 0 to one category and 1 to the other category of the variable. Suppose you have the following first-order regression equation to predict the amount of tar inhaled (y) by smoking a cigarette based on the amount of tar in the cigarette (x1) and the presence of a filter (x2). Note that here x2 = 0 implies that a cigarette does not have a filter and x2 = 1 means that a filter exists.

ŷ = .94x1 − .45x2
Answer the following questions. 1. Does the presence of a filter increase or decrease the tar consumption? What part of the regression equation tells you this?

2. On average, what percentage of the tar in a cigarette is consumed if the cigarette is unfiltered? What if the cigarette is filtered?
3. Draw a graph of the above regression equation. (Hint: The graph consists of two different regression lines with two variables, not a plane.)
