\[ y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + e \]
For the two-variable case we can find the multiple regression equation as follows:
\[ y = a + b_1 x_1 + b_2 x_2 + e \]
The normal equations for this are as follows:
\[ \sum y = na + b_1 \sum x_1 + b_2 \sum x_2 \]
\[ \sum x_1 y = a \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \]
\[ \sum x_2 y = a \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2 \]
These can be solved to obtain the values of the parameters a, b1 and b2.
So far we have referred to a as the y intercept and to b1 and b2 as the slopes of the multiple regression. However, they are the estimated regression coefficients. The constant a is the value of y if both x1 and x2 are zero. The coefficients b1 and b2 describe how changes in x1 and x2 affect the value of y. Thus b1 measures the effect on y of changes in x1, holding x2 constant. Similarly, b2 measures the effect on y of changes in x2, holding x1 constant.
Simple linear regression estimates a regression line between two variables. In multiple regression with two independent variables there is a regression plane among y, x1 and x2. This regression plane is determined in the same way as the regression line, by minimizing the sum of squared deviations of the data points from the regression plane. Each independent variable accounts for some of the variation in the dependent variable. This is shown in figure 1.
With k independent variables, the estimating equation becomes:
\[ \hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_k x_k \]
This equation is estimated by the computer. We now look at how a statistical package such as SPSS or Minitab handles the data. An example will help make the process clearer. Suppose the IRS in the US wishes to model the discovery of unpaid taxes. It uses the following variables, of which the first three are independent and the fourth is the dependent variable:
1. Number of hours of field audit (00s)
2. Number of computer hours (00s)
3. Reward to informants ($000s)
4. Actual unpaid taxes discovered ($1,000,000s)
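The normal equations for the two-variable case form a 3x3 linear system in a, b1 and b2, which is also what a statistical package solves internally. The following is a minimal sketch in Python, assuming a small made-up data set: the values of x1, x2 and y are illustrative only and are not the IRS data, and the function names `det3` and `solve_normal_equations` are hypothetical. It builds the sums that appear in the normal equations and solves them by Cramer's rule.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_normal_equations(x1, x2, y):
    """Return (a, b1, b2) for y = a + b1*x1 + b2*x2 by least squares."""
    n = len(y)
    s1, s2, sy = sum(x1), sum(x2), sum(y)
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * w for u, w in zip(x1, y))
    s2y = sum(v * w for v, w in zip(x2, y))

    # Coefficient matrix and right-hand side of the three normal equations.
    A = [[n, s1, s2],
         [s1, s11, s12],
         [s2, s12, s22]]
    rhs = [sy, s1y, s2y]

    d = det3(A)
    if d == 0:
        raise ValueError("normal equations are singular")

    solution = []
    for col in range(3):
        # Cramer's rule: replace one column of A by the right-hand side.
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return tuple(solution)


# Illustrative data generated exactly from y = 1 + 2*x1 + 3*x2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b1, b2 = solve_normal_equations(x1, x2, y)
print(a, b1, b2)
```

Because the illustrative y values lie exactly on a plane, the solution recovers a = 1, b1 = 2 and b2 = 3. With real data the same system gives the least-squares estimates, and the residuals are no longer zero.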
11.556
Month Jan Feb March Apr May Jun July Aug Sept Oct
Field audit 45 42 44 43 46 44 45 44 43 42
Comp hours 16 14 15 13 13 14 16 16 15
Reward to informers 71 70 72 71 75 74 76 69 74 73
RESEARCH METHODOLOGY
15
27
Now a regression is run on Minitab and the sample output is presented below in table 2. How do we interpret this output?
1. The regression equation: from the coefficient column we can read the estimating equation:
\[ \hat{y} = -45.8 + 0.597\,\text{Audit} + 1.18\,\text{Comp} + 0.405\,\text{Rewards} \]
How do we interpret this equation? The interpretation is similar to that of the one-variable simple linear regression case.
If we hold the number of field audit labour hours and the number of computer hours constant and change the rewards to informants, then y will increase by $405,000 for each additional $1,000 paid to informants.
Similarly, holding x1 and x3 constant, an additional 100 hours of computer time increases y by $1,180,000.
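The "holding the other variables constant" reading of the coefficients can be checked numerically. The sketch below uses the estimating equation read from the Minitab output. The function name `predict_unpaid_taxes` and the input levels are illustrative assumptions, and y is taken to be in millions of dollars, which is what the $405,000 figure in the text implies.

```python
def predict_unpaid_taxes(audit, comp, rewards):
    """Estimated unpaid taxes discovered, in $ millions (assumed unit).

    audit   -- field audit labour hours, in hundreds
    comp    -- computer hours, in hundreds
    rewards -- rewards to informants, in $ thousands
    """
    return -45.8 + 0.597 * audit + 1.18 * comp + 0.405 * rewards


# Hypothetical input levels, chosen only for illustration.
base = predict_unpaid_taxes(audit=44, comp=15, rewards=72)

# Raise one variable by one unit while holding the other two constant.
more_rewards = predict_unpaid_taxes(audit=44, comp=15, rewards=73)
more_comp = predict_unpaid_taxes(audit=44, comp=16, rewards=72)

effect_rewards = more_rewards - base   # equals the Rewards coefficient
effect_comp = more_comp - base         # equals the Comp coefficient

print(f"+$1,000 in rewards changes the estimate by ${effect_rewards * 1e6:,.0f}")
print(f"+100 computer hours changes the estimate by ${effect_comp * 1e6:,.0f}")
```

The change in the estimate does not depend on the base levels chosen, because the equation is linear in each variable.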
We can also use this equation to solve problems such as the following. Suppose that in November the IRS plans to leave field audit hours and computer hours at their October levels but increase rewards to informants to $75,000. How much in recoveries can it expect to make in November? We can get a forecast value by substituting these values into the estimating equation.
The standard error of estimate is calculated as:
\[ s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - k - 1}} \]
where y = sample values of the dependent variable, ŷ = the corresponding estimated values from the regression equation, n = number of data points and k = number of independent variables.
For example, if we want to construct a 95% confidence interval around the estimate of $27,905,000, we can do it as follows:
$27,905,000 ± t(s_e)
= $27,905,000 + 2.447($286,000) = $28,604,800 (upper limit)
= $27,905,000 - 2.447($286,000) = $27,205,200 (lower limit)
The standard error of the estimate measures the dispersion of the data points around the regression plane. Smaller values of s_e indicate a better regression. If the addition of another variable reduces s_e, then we say that the inclusion of that variable improves the fit of the regression.
The Coefficient of Multiple Determination
In a multiple regression we measure the strength of the relationship between the three independent variables and the dependent variable by the coefficient of multiple determination, or R2. R2 is the proportion of the total variation in y that is explained by the regression plane. In our example R2 = 98.3%. This tells us that 98.3% of the variation in unpaid taxes is explained by the three independent variables. As we add more variables to a regression, the explanatory power of the equation improves if R2 increases.
Example 2
Pam Schneider owns and operates an accounting firm in Ithaca, New York. Pam feels that it would be useful to be able to predict in advance the number of rush income-tax returns during the busy March 1 to April 15 period so that she can better plan her personnel needs during this time. She has hypothesized that several factors may be useful in her prediction. Data for these factors and the number of rush returns for past years are as follows:
X1 = Economic index; X2 = Population within 1 mile of office; X3 = Average income in Ithaca; Y = Number of rush returns, March 1 to April 15
X1     X2       X3       Y
99     10188    21465    2306
106    8566     22228    1266
100    10557    27665    1422
129    10219    25200    1721
179    9662     26300    2544
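A computer package is the practical way to fit these data, but the calculation itself is short. The following is a rough sketch in plain Python, not Minitab's own procedure: it centres the variables, solves the normal equations for the slopes by Gaussian elimination, and then computes the standard error of estimate and R2. The helper names `solve_linear_system` and `fit_multiple_regression` are hypothetical, and R2 is computed with the standard definition 1 - SSE/SST.

```python
from math import sqrt


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_multiple_regression(columns, y):
    """Least-squares fit of y on the given predictor columns.

    Returns (a, slopes, s_e, r_squared).
    """
    n, k = len(y), len(columns)
    y_bar = sum(y) / n
    means = [sum(col) / n for col in columns]
    # Centring removes the intercept from the normal equations.
    xc = [[v - m for v in col] for col, m in zip(columns, means)]
    yc = [v - y_bar for v in y]
    A = [[sum(p * q for p, q in zip(xc[i], xc[j])) for j in range(k)]
         for i in range(k)]
    rhs = [sum(p * q for p, q in zip(xc[i], yc)) for i in range(k)]
    slopes = solve_linear_system(A, rhs)
    a = y_bar - sum(b * m for b, m in zip(slopes, means))

    fitted = [a + sum(b * col[t] for b, col in zip(slopes, columns))
              for t in range(n)]
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    sst = sum(v * v for v in yc)
    s_e = sqrt(sse / (n - k - 1))   # standard error of estimate
    r_squared = 1 - sse / sst       # coefficient of multiple determination
    return a, slopes, s_e, r_squared


# Data from the table above.
x1 = [99, 106, 100, 129, 179]              # economic index
x2 = [10188, 8566, 10557, 10219, 9662]     # population within 1 mile
x3 = [21465, 22228, 27665, 25200, 26300]   # average income in Ithaca
y = [2306, 1266, 1422, 1721, 2544]         # rush returns

a, slopes, s_e, r_squared = fit_multiple_regression([x1, x2, x3], y)
print(f"a = {a:.1f}, b = {[round(b, 4) for b in slopes]}")
print(f"s_e = {s_e:.1f}, R2 = {r_squared:.3f}")
```

With five observations and three independent variables, n - k - 1 = 1, so the standard error of estimate here rests on a single degree of freedom and any interval built from it will be very wide.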
a. Use the following Minitab output to determine the best-fitting regression equation for these data:
The regression equation is
Predictor   Coef
Const       -1275
X1
X2
X3
R-sq = 87.2%
b. What percentage of the total variation in the number of rush returns is explained by this equation?
c. For this year, the economic index is 169, the population within 1 mile of the office is 10212, and the average income in Ithaca is $26925. How many rush returns should Pam expect to process between March 1 and April 15?
Given the following set of data, use whatever computer package is available to find the best-fitting regression equation and answer the following:
a. What is the regression equation?
b. What is the standard error of estimate?
c. What is R2 for this regression?
e. Give an approximate 95 percent confidence interval for the value of Y when the values of X1, X2, X3, and X4 are 52.4, 41.6, 35.8, and 3, respectively.
Q3. We are trying to predict the annual demand for widgets (Demand) using the following independent variables:
Price = price of widgets (in $)
Income = consumer income (in $)
Sub = price of a substitute commodity (in $)
(Note: A substitute commodity is one that can be substituted for another commodity. For example, margarine is a substitute commodity for butter.)
Year   Demand   Price ($)   Income   Sub ($)
1982   40       9           400      10
1983   45       8           500      14
1984   50       9           600      12
1985   55       8           700      13
1986   60       7           800      11
1987   70       6           900      15
1988   65       6           1000     26
1989   65       8           1100     27
1990   75       5           1200     22
1991   75       5           1300     19
1992   80       5           1400     20
a. Using whatever computer package is available, determine the best-fitting regression equation for these data.
b. Are the signs (+ or -) of the regression coefficients of the independent variables as one would expect? Explain briefly.
c. State and interpret the coefficient of multiple determination for this problem.
d. State and interpret the standard error of estimate for this problem.
e. Using the equation, what would you predict for Demand if the price of widgets was $6, consumer income was $1200 and the price of the substitute commodity was $17?