Regression

REGRESSION AND CORRELATION
This chapter considers the problems of analysing the relationships between variables. Different types of scatter diagrams are depicted. Straight-line equations are introduced and the method of calculating the least squares regression line is described. The uses and calculation of the coefficient of determination and the coefficient of correlation are described, and the development of confidence limits for the regression line is explained in detail. The chapter concludes with an explanation of rank correlation coefficients as non-parametric measures of statistical association.
7.1 Introduction
Quite often in business, changes in one or more variables appear to be related in some way to movements in one or several other variables. For example, a sales manager may observe that sales value changes when there has been a change in advertising expenditure, or a logistics manager may notice that as cars and trucks are used more and the number of clients increases, maintenance expenses become larger.
Statistics for Business Administration
Certain questions may occur to the manager or analyst, such as the following:
1. Do the variables change in the same or in opposite directions?
2. Could changes in one variable be influencing, or be influenced by, movements in the other variable?
3. Is this an important relationship, or could apparently related movements come about purely by chance?
4. Could movements in two variables be related, not directly, but through movements in a third variable?
5. What is the importance of this knowledge for the business decision system?
On many occasions the manager or analyst is interested in predicting the value of one variable from other variables that are considered to influence it. For example, the quality control manager may want to know what the effect on the number of failures might be if the amount of expenditure on inspection were increased. The Marketing Manager may wish to predict market share if advertising costs were cut by 20%.
Suppose that a manager has sensed that two variables behave in some related way. How should the manager proceed to investigate the relation? A possible methodology is as follows:
a) Observe and note what is happening in a systematic way.
b) Draw a scatter diagram of the data being observed.
c) Measure statistically the intensity of the relation and its significance, and describe the relation.
d) Use the result to improve decisions.
The managerial process requires statistical information that is as varied and comprehensive as possible, so that relations of independence or dependence between variables can be identified and measured. Relationships between statistical variables or indicators can be observed throughout economic activity: in production, between production indicators and those of efficiency and productivity, between resources and the results of their use, and between the results obtained and the investment plan.
Bivariate data implies two distinct categories of variables: independent and dependent variables. The independent variable is that variable occurring randomly or chosen freely and it is usually denoted by x. The dependent variable occurs as a result of the variation of the independent variable and it is usually denoted by y.
7.2 Categories of Relations between Variables
The relations that can be found between the x and y variables, modelled as $y = f(x) + \varepsilon$, can be characterized by the direction of change, the intensity of change and the shape of the relation. The relations are classified as follows:
a. according to the direction of change:
- direct relations, also called positive relations, meaning that a change in the independent variable induces a change in the dependent variable in the same direction: if x increases then y also increases, and if x decreases then y decreases
- opposite relations, also called negative relations, meaning that a change in the independent variable induces a change in the dependent variable in the opposite direction: if x increases then y decreases, and if x decreases then y increases
b. according to the intensity of the relation:
- high intensity, strong, or tight relations, expressed by a high level of correlation between the variables
- medium intensity relations
- low intensity relations
c. according to the shape of the relation:
- linear relations
- non-linear relations, such as exponential growth, logarithmic decrease, etc.
d. according to the randomness involved: deterministic and probabilistic models
A deterministic model allows us to determine the value of a dependent variable exactly from the values of the independent variables. Such models represent relationships in the natural sciences. An example of a deterministic model is $E = mc^2$, where:
E: energy
m: mass
c: speed of light
Practical models have to represent the randomness that is part of a real-life process. Such models are called probabilistic models. In a probabilistic model we add a random term (also called the error variable), which accounts for all the variables, measurable and immeasurable, that are not part of the model. The probabilistic first-order model is:
$$Y = \beta_0 + \beta_1 X + \varepsilon \tag{7.1}$$
where:
Y: dependent or explained variable
X: independent or explanatory variable
$\varepsilon$: random variable
$\beta_0$, $\beta_1$: parameters
Example 1: For the situation in Table 7-1 we are asked to characterize the relation between expenditure on inspection and defective parts delivered to the customer, for a company with ten operating plants of similar size producing small components:
Table 7-1 Costs and defective products recording
Observation number   Control costs per batch   Defective parts per batch of units
1                    25                        50
2                    30                        35
3                    15                        60
4                    75                        15
5                    40                        46
6                    65                        20
7                    45                        28
8                    24                        45
9                    35                        42
10                   70                        22
We can expect an opposite relationship between the control cost and the number of defective parts delivered to the customer: the higher the cost, the fewer defective units are delivered. Based on this assumption, which is a form of hypothesis, the data can be graphed using a scatter diagram. The scatter diagram is a graphical display of the data constructed as follows:
- the horizontal or x axis is used for the values or classes of the independent variable, in this case expenditure;
- the vertical or y axis is used for the values or classes of the dependent variable, in this case defective parts delivered.
[Figure: scatter plot of defective parts per delivered batch (vertical axis, 0 to 70) against cost per controlled batch (horizontal axis, 0 to 110).]
Figure 7.1 Scatter diagram based on the data in Example 1.
Figure 7.1 shows a clear downward drift in defectives delivered as the cost per batch increases. The scatter diagram shows:
- an opposite or negative relation, indicated by the negative slope
- a linear relation, indicated by the roughly linear pattern of the points (a linear equation can adequately model the relation because the points lie close to a straight line)
- a medium intensity of the relation, because the points are not very tightly clustered
- an inelastic relation, because the slope is close to 45°
Sometimes other possibilities exist, ranging from a perfect negative or perfect positive relationship to no discernible relationship. A perfect relationship is one where a single straight line can be drawn through all the points, as in Figure 7.2 and Figure 7.3.
Figure 7.2 Perfect positive relationship
Figure 7.3 Perfect negative relationship
7.3 Simple Regression Model
Regression is a statistical method providing a mathematical description of the statistical relations between variables. The purposes of regression techniques are:
- to describe a statistical relation in mathematical terms,
- to estimate the value of the dependent variable given the value of the independent variable, and
- to compare statistical relations between variables, for example for two companies, countries or regions.
Regression is concerned with obtaining a mathematical function describing the statistical relation between variables. If the relation is between one dependent and one independent variable we have simple regression; if the statistical relation is between one dependent and two or more independent variables we have multiple regression. This section deals only with simple regression techniques. According to the mathematical function modelling the relation between the variables, we can distinguish linear and non-linear equations.
7.3.1 Simple Linear Regression Model
The simple linear regression model is:
$$Y = \beta_0 + \beta_1 X + \varepsilon \tag{7.2}$$
The main attributes of linear regression, which models the relation between two variables using a first-degree equation, are:
a. It is a useful means of forecasting when the data have a generally linear relationship. Over operational ranges, linearity (or near linearity) is often assumed for such items as costs, contributions and sales.
b. A measure of the accuracy of fit ($R^2$, the ratio of correlation, or r, the coefficient of correlation) can easily be calculated for any linear regression line.
c. To have confidence in the calculated regression relationship it is preferable to have a large number of observations.
d. With further analysis, confidence limits can be calculated for forecasts produced by the regression formula.
e. Any form of extrapolation, including that based on regression analysis, must be done with great caution and with other forecasting techniques also taken into account. Outside the range of observed values, relationships and conditions may change drastically.
f. Regression is not an adaptive forecasting system, i.e. it is not suitable for incorporation in, say, a stock control system where the requirement would be for a forecasting system that automatically produces forecasts adapting to current market conditions.
g. In many circumstances it is not sufficiently accurate to assume that y depends on only one independent variable, as in simple linear regression. Frequently, a particular value depends on two or more factors, in which case multiple regression analysis is employed. For example, an analysis of a firm might produce the following multiple regression equation:
Overheads (EUROS) = 10800 + 6.9x + 7.2y + 3.7z
where,
x: labour hours worked
y: machine hours
z: production volume (tonnage)
7.3.2 Least Squares Method
To define the relationship between Y and X we need to know the values of the coefficients of the linear model, $\beta_0$ and $\beta_1$ (the population parameters). We have to estimate these parameters by using a sample of observations of size n. The estimated linear regression is:
$$\hat{y}_i = b_0 + b_1 x_i, \quad i = 1, \dots, n \tag{7.3}$$
Usually the estimators for the parameters of the regression line are obtained by using the least squares method.
To find the line of best fit mathematically it is necessary to calculate a line that minimizes the total of the squared deviations of the actual observations from the calculated line. This is known as the method of least squares or the least squares method of linear regression.
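$$S = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \to \min$$
$$\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0$$
$$\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0$$
so,
$$n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{7.4}$$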
By solving the system of equations we obtain the values of b0 and b1, and we can calculate the value of the regression equation for each value of the x variable. These values are also called the theoretical values of the y variable given x, and the operation of replacing the observed values with the values of the regression equation (theoretical values) is called adjustment computation.
The parameter b0 represents the fixed element and b1 is the slope of the line, i.e. the change in the mean value of y per unit change in x. The two parameters b0 and b1 have the character of averages, and they have to be representative of most of the values from which they were calculated.
The b0 parameter, called the intercept, has the character of a mean in the sense that it shows the level the y variable would reach if all factors exerted a constant action on it. In that case the individual values of the dependent variable would all be equal, and therefore equal to their mean.
The b1 parameter, called the regression coefficient, is geometrically the slope of the straight line. It measures the average variation of the y variable when the x variable increases by one unit. Moreover, the regression coefficient shows the direction of the relation:
- if b1 > 0, the relationship is positive;
- if b1 < 0, the relationship is negative;
- if b1 = 0, the two variables are unrelated and $\bar{y}_x = b_0$, so the mean value of the regression equation equals the mean value of the dependent variable ($\bar{y}_x = \bar{y}$).
The use of these equations will be demonstrated using the Example 1 data contained in Table 7-1. The equations become:
$$10 b_0 + 424 b_1 = 363$$
$$424 b_0 + 21{,}926 b_1 = 12{,}815$$
Solving gives b0 = 63.97 and b1 = -0.65 to 2 decimal places. Therefore, the regression line for Example 1 is:
$$y = 63.97 - 0.65x$$
Note: the Normal equations automatically produce the sign (+ or -) of the regression coefficient b1; in this case, minus.
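The sums that enter the Normal equations, and their solution, can be checked with a few lines of code. The following is a minimal sketch using only the Table 7-1 data; the variable names are illustrative, and Cramer's rule is simply one convenient way of solving a 2x2 system.

```python
# Example 1 data from Table 7-1 (control cost x, defective parts y).
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Normal equations (7.4) written as a 2x2 linear system:
#   n     * b0 + sum_x  * b1 = sum_y
#   sum_x * b0 + sum_xx * b1 = sum_xy
# Solved here with Cramer's rule.
det = n * sum_xx - sum_x * sum_x
b0 = (sum_y * sum_xx - sum_x * sum_xy) / det
b1 = (n * sum_xy - sum_x * sum_y) / det

print(f"sum x = {sum_x}, sum y = {sum_y}, sum x^2 = {sum_xx}, sum xy = {sum_xy}")
print(f"b0 = {b0:.4f}, b1 = {b1:.6f}")
```

Carried at full precision, the slope comes out close to -0.6525 and the intercept close to 63.96; the small difference from the 63.97 quoted in the text is a rounding effect of the hand calculation.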
The calculated values can be used to draw the mathematically correct line of best fit on a graph. This is usually done by plotting based on three values of x: the lowest, highest and mean. Based on Example 1 the three values of x are: 15, 42.4 and 75. Each of these values is substituted into the calculated regression line and the result values plotted on the graph.
Note: The values of b0 and b1 have been calculated in the example above by substituting in the Normal Equations. An alternative is to transpose the Normal Equations so as to be able to find b0 and b1 directly. The formulae are as follows:
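$$b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n} = \bar{y} - b_1 \bar{x} \tag{7.5}$$
$$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \tag{7.6}$$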
It is often more convenient to use this alternative form, especially when using a calculator. Values for b0 and b1 are re-calculated here using the transposed formulae and the Table 7-1 data.
$$b_1 = \frac{10 \times 12{,}815 - 424 \times 363}{10 \times 21{,}926 - (424)^2} = -0.652467 \approx -0.65$$
$$b_0 = \frac{363}{10} - (-0.652467) \times \frac{424}{10} \approx 63.97$$
[Figure: defective parts per delivered batch (vertical axis, 0 to 70) against inspection costs per batch (horizontal axis, 0 to 110), with the fitted line.]
Figure 7.4 Calculated lines of best fit.
For any set of bivariate data a least squares regression line always passes through the mean point $(\bar{x}, \bar{y})$ of the data.
7.3.3 Using the Results of the Simple Regression Analysis
When the values of b0 and b1 have been calculated, predictions or forecasts can be made for values of x that have not yet occurred. The predictions can be read from the graph on which the line of best fit has been plotted, Figure 7.4, or obtained by inserting the values into the straight-line formula. Reverting to Example 1, it will be recalled that the manager wished to know the likely number of defects if 50 per 1000 parts were spent on inspection.
From Figure 7.4 it will be seen that the number of defects would be about 31 per 1000. The formula can also be used: y = 63.97 - 0.65x, so when x is 50:
$$y = 63.97 - 0.65(50) = 31.47$$
Thus the manager would conclude that, on average, 31.47 defects per 1000 would be found if 50 per 1000 parts were spent on inspection. Predictions should be given only if the result has an economic meaning. The fact that an x value can be inserted into the regression line does not necessarily mean that the result is a practical forecast. The predicted value is just a single point that needs to be qualified by a confidence interval.
7.3.4 Quality of the Regression Line. Regression Line Standard Error

The accuracy of the regression line can be measured with the standard error of the regression. This measure is also used to assess the estimates b0 and b1 of the regression parameters and to construct their confidence intervals. Inference concerning these estimates can be made using the t significance test or by constructing confidence intervals. In both cases we need to calculate the standard error, denoted by Se:
$$S_e = \sqrt{\frac{\sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i}{n - 2}} \tag{7.7}$$
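The above formula provides only an estimate of the standard error, because it uses the regression line values b0 and b1, which are themselves estimates. This is why it is also called the residual standard deviation. For the example concerning the defective parts the standard error is computed as follows:
$$S_e = \sqrt{\frac{15{,}123 - b_0 \times 363 - b_1 \times 12{,}815}{10 - 2}} = 5.76 \text{ defective parts}$$
The figure of 5.76 is obtained with the unrounded coefficients (b1 = -0.652467 and b0 = 363/10 - b1 × 424/10). Because the numerator is a small difference between large numbers, substituting the two-decimal values 63.97 and -0.65 gives a noticeably different result.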
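The sensitivity to rounding can be seen by computing the standard error directly from the residuals. The sketch below assumes only the Table 7-1 data and compares the residual sum of squares with the shortcut numerator of formula (7.7); the names `sse` and `sse_shortcut` are illustrative.

```python
import math

# Example 1 data from Table 7-1.
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Residual sum of squares computed from the fitted values.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Shortcut numerator of (7.7): sum(y^2) - b0 * sum(y) - b1 * sum(xy).
sum_yy = sum(yi * yi for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sse_shortcut = sum_yy - b0 * sum(y) - b1 * sum_xy

se = math.sqrt(sse / (n - 2))
print(f"SSE = {sse:.3f}, shortcut = {sse_shortcut:.3f}, Se = {se:.2f}")
```

Both routes give the same sum of squares when the unrounded coefficients are used, which is why the code keeps b0 and b1 at full precision.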
This value is used to set the limits of the confidence intervals for an individual value prediction or for the whole regression line.
The line of best fit y = b0 + b1x is an average line which passes through $(\bar{x}, \bar{y})$, and any estimate based on it is a point estimate of a mean value. The confidence limits for the whole of the regression line are calculated by using a quantity known as the standard error of the average forecast, given by:
$$S_{ef} = S_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}} \tag{7.8}$$
7.3.5 Constructing the Confidence Interval
The actual confidence interval is constructed in exactly the same way as that for a mean or for a proportion. In this case, since the number of observations is 10, the t distribution is used with 10 - 2 = 8 degrees of freedom. The interval is calculated by estimating the fitted value of y for each value of x in the original data using the equation y = b0 + b1x. The interval then takes the form:
$$\hat{y} \pm t \cdot S_{ef} \tag{7.9}$$
Given that (based on Example 1):
b0 = 63.97
b1 = -0.65
Se = 5.76
$$S_{ef} = 5.76 \sqrt{\frac{1}{10} + \frac{(x - 42.4)^2}{21{,}926 - \frac{424^2}{10}}}$$
t = 2.306 (tabulated value) for 8 degrees of freedom and a 95% confidence interval,
the confidence interval can now be calculated as follows:
- when x = 15, y = 63.97 - 0.65(15) = 54.2, and the limits round this estimate are 54.2 ± 7.15, giving an upper limit of 61.35 and a lower limit of 47.05;
- when x = 24, y = 48.37 ± 5.72, giving an upper limit of 54.09 and a lower limit of 42.65.
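The interval calculation for the mean forecast can be sketched as follows. The t value 2.306 is taken from the text rather than computed, and the function name `mean_forecast_interval` is illustrative.

```python
import math

# Example 1 data from Table 7-1.
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]
T_95_8DF = 2.306  # tabulated t value quoted in the text (8 d.f., 95%)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)  # equals sum(x^2) - (sum x)^2 / n
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x
se = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))


def mean_forecast_interval(x0):
    """Return (fitted value, half-width) of interval (7.9) at x0."""
    fitted = b0 + b1 * x0
    s_ef = se * math.sqrt(1 / n + (x0 - mean_x) ** 2 / s_xx)  # formula (7.8)
    return fitted, T_95_8DF * s_ef


for x0 in (15, 24):
    fitted, half = mean_forecast_interval(x0)
    print(f"x = {x0}: {fitted:.2f} +/- {half:.2f}")
```

The half-widths agree with the 7.15 and 5.72 obtained by hand; the fitted values differ slightly in the second decimal because the code does not round b0 and b1.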
When making a prediction of an individual value of y rather than of a mean value, the additional variability of a single observation around the line must be allowed for. The previous formula is therefore amended to give the standard error of the individual forecast:
$$S_{ef} = S_e \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}} \tag{7.10}$$
Using the data in our example and an x value of, for instance, 45, we obtain:
$$S_{ef} = 5.76 \sqrt{1 + \frac{1}{10} + \frac{(45 - 42.4)^2}{21{,}926 - \frac{424^2}{10}}} = 6.04$$
When x = 45, y = 34.72 and the individual confidence interval is 34.72 ± 2.306 × 6.04, with a lower limit of 20.79 and an upper limit of 48.65. These limits are wider than those of the intervals computed previously, because the confidence interval for an individual prediction of y is much wider than that for a mean value.
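To make the comparison concrete, the sketch below computes both half-widths at x = 45 from the Table 7-1 data. It is an illustration under the same assumptions as the text (t = 2.306 taken from tables); the variable names are illustrative.

```python
import math

# Example 1 data from Table 7-1.
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]
T_95_8DF = 2.306  # tabulated t value quoted in the text (8 d.f., 95%)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x
se = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 45
fitted = b0 + b1 * x0
leverage = 1 / n + (x0 - mean_x) ** 2 / s_xx

s_mean = se * math.sqrt(leverage)            # formula (7.8), mean forecast
s_individual = se * math.sqrt(1 + leverage)  # formula (7.10), individual forecast

half_mean = T_95_8DF * s_mean
half_individual = T_95_8DF * s_individual
lower, upper = fitted - half_individual, fitted + half_individual

print(f"standard error of individual forecast: {s_individual:.2f}")
print(f"individual interval: {lower:.2f} to {upper:.2f}")
print(f"mean half-width {half_mean:.2f} vs individual half-width {half_individual:.2f}")
```

The limits printed by the code differ from the hand-computed 20.79 and 48.65 by roughly 0.1, because the hand calculation uses the rounded coefficients.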
7.3.6 Standard Errors for the Parameters b0 and b1
If b0 and b1 are computed from sample data they can be considered as estimates of the population intercept, denoted by $\beta_0$, and of the population slope, denoted by $\beta_1$. Under repeated sampling, the mean value of b0 is $\beta_0$, the population intercept, and the mean value of b1 is $\beta_1$, the population slope. The standard deviations are:
$$S_a = S_e \sqrt{\frac{\sum x_i^2}{n \sum x_i^2 - \left(\sum x_i\right)^2}} \tag{7.11}$$
$$S_b = \frac{S_e}{\sqrt{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}}$$
where Se is the standard error of regression. The confidence intervals for $\beta_0$ and $\beta_1$ are obtained as follows:
- for the intercept: b0 ± t × Sa
- for the slope: b1 ± t × Sb
where t is the value of the t statistic corresponding to n - 2 degrees of freedom at the chosen probability, which gives the confidence level.
In addition we construct a significance test for $\beta_0$ and $\beta_1$:
- For the intercept:
H0: $\beta_0$ = chosen value
H1: $\beta_0 \neq$ chosen value
The test statistic is the t test:
$$t = \frac{b_0 - \beta_0}{S_a} \tag{7.12}$$
- For the slope:
H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$
The test statistic is the t test:
$$t = \frac{b_1 - \beta_1}{S_b} \tag{7.13}$$
For Example 1:
$$t = \frac{-0.65 - 0}{0.092} = -7.07$$
Since |t| = 7.07 > 2.306, H0 can be rejected. On the basis of this evidence the regression equation y = 63.97 - 0.65x can be used as a basis of prediction for Example 1.
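The standard errors of the two parameters and the slope test can be reproduced with the short sketch below. It uses the Table 7-1 data and the tabulated t value 2.306 from the text; the names `s_a`, `s_b` and `t_slope` are illustrative.

```python
import math

# Example 1 data from Table 7-1.
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]
T_95_8DF = 2.306  # tabulated t value quoted in the text (8 d.f., 95%)

n = len(x)
sum_x = sum(x)
sum_xx = sum(xi * xi for xi in x)
mean_x = sum_x / n
mean_y = sum(y) / n
s_xx = sum_xx - sum_x ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum(y) / n
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x
se = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))

# Standard errors of the intercept (7.11) and of the slope.
s_a = se * math.sqrt(sum_xx / (n * sum_xx - sum_x ** 2))
s_b = se / math.sqrt(s_xx)

# Test of H0: beta1 = 0 using (7.13).
t_slope = (b1 - 0) / s_b
reject_h0 = abs(t_slope) > T_95_8DF

# 95% confidence interval for the slope.
slope_interval = (b1 - T_95_8DF * s_b, b1 + T_95_8DF * s_b)

print(f"Sa = {s_a:.3f}, Sb = {s_b:.4f}, t = {t_slope:.2f}, reject H0: {reject_h0}")
print(f"slope interval: {slope_interval[0]:.3f} to {slope_interval[1]:.3f}")
```

With unrounded inputs the t statistic is slightly larger in absolute value than the hand-computed 7.07, but the conclusion is the same: the confidence interval for the slope excludes zero and H0 is rejected.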
7.4 Non-linear Regression Models
There are many occasions when the relationship between variables cannot be adequately described by linear functions, whether they use a single independent variable or several. In such circumstances some form of non-linear or curvilinear model is likely to be more suitable, and the following paragraphs describe some commonly encountered non-linear models.
The exponential function
The exponential function takes the form:
$$y = ab^x$$
where y is the dependent variable, a and b are constants and x denotes the independent variable.
Linear form of the exponential function
The exponential function can be reduced to linear form by taking the logarithm of the function, thus:
$$\log y = \log a + x \log b$$
or
$$\log y = A + Bx$$
where A = log a and B = log b.
The similarity between this expression and the linear regression line previously discussed will be apparent. An interesting feature of the log form of the exponential function is that it is equivalent to fitting a straight line to a graph drawn on semi-logarithmic graph paper (i.e. a logarithmic scale on the vertical axis and an ordinary arithmetic scale on the horizontal axis).
Logarithmic functions
An alternative non-linear function is known as a logarithmic function, which has the form:
$$y = ax^b$$
where y denotes the variable to be predicted, a and b are constants and x denotes the time periods. As with the exponential function, this function can be expressed in a linear form using logarithms, thus:
$$\log y = \log a + b \log x$$
In this function y is said to be a logarithmic function of x; the same form is often called a power function. This function is equivalent to fitting a straight line to a graph drawn on log-log paper (i.e. both horizontal and vertical scales being logarithmic).
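Both log transformations described above can be sketched with the same straight-line formulae (7.5) and (7.6). The data in this example are hypothetical: they are generated exactly from y = 3 × 2^x and y = 5 × x^1.5 so that the recovered constants can be checked; `fit_line` is an illustrative helper, not a library function.

```python
import math


def fit_line(u, v):
    """Least squares intercept and slope of v on u, formulae (7.5) and (7.6)."""
    n = len(u)
    sum_u, sum_v = sum(u), sum(v)
    sum_uu = sum(a * a for a in u)
    sum_uv = sum(a * b for a, b in zip(u, v))
    slope = (n * sum_uv - sum_u * sum_v) / (n * sum_uu - sum_u ** 2)
    intercept = sum_v / n - slope * sum_u / n
    return intercept, slope


xs = [1, 2, 3, 4, 5]

# Exponential model y = a * b**x: regress log y on x (semi-log form).
ys_exp = [3 * 2 ** x for x in xs]  # hypothetical data, a = 3, b = 2
A, B = fit_line(xs, [math.log10(v) for v in ys_exp])
a_exp, b_exp = 10 ** A, 10 ** B

# Model y = a * x**b: regress log y on log x (log-log form).
ys_pow = [5 * x ** 1.5 for x in xs]  # hypothetical data, a = 5, b = 1.5
log_a, b_pow = fit_line([math.log10(x) for x in xs],
                        [math.log10(v) for v in ys_pow])
a_pow = 10 ** log_a

print(f"exponential fit: a = {a_exp:.3f}, b = {b_exp:.3f}")
print(f"log-log fit:     a = {a_pow:.3f}, b = {b_pow:.3f}")
```

With real data the points would not lie exactly on the curve, and the fit minimizes squared deviations in log y rather than in y itself.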
The hyperbolic curve
This is another type of non-linear curve and takes the form:
$$y = a + \frac{b}{x}$$
The values of a and b are calculated by reference to amended formulas:
$$b = \frac{n \sum \frac{y}{x} - \sum \frac{1}{x} \sum y}{n \sum \frac{1}{x^2} - \left(\sum \frac{1}{x}\right)^2}$$
$$a = \frac{\sum y}{n} - b \frac{\sum \frac{1}{x}}{n}$$
Data have been kept for 10 orders showing the variation in unit cost against order volume for 10 clients, as shown in Table 7-2:
Table 7-2 Relation between order size and unit costs
Client number   Order volume x   Unit cost y
1               10               150
2               11               127
3               12               123
4               13               117
5               14               110
6               15               107
7               17               104
8               18               101
9               19               97
10              20               95
These data have been graphed, Figure 7.5, and the graph suggests that the hyperbolic curve might be appropriate for predicting the unit cost of an order of 22 units. What is the predicted cost?
[Figure: scatter plot of unit cost (vertical axis, 80 to 150) against order volume (horizontal axis, 8 to 24).]
Figure 7.5 Unit cost and order size relation
Solution: The calculations for the least squares line of best fit are shown in Table 7-3.
Table 7-3
Observation   1/x     y       y/x      1/x^2
1             0.100   150     15.000   0.0100
2             0.090   127     11.545   0.0083
3             0.083   123     10.250   0.0069
4             0.077   117     9.000    0.0059
5             0.071   110     7.857    0.0051
6             0.067   107     7.133    0.0044
7             0.059   104     6.118    0.0035
8             0.056   101     5.611    0.0031
9             0.053   97      5.105    0.0027
10            0.050   95      4.750    0.0025
Total         0.706   1,131   82.369   0.0524
$$b = \frac{10 \times 82.369 - 0.706 \times 1{,}131}{10 \times 0.0524 - (0.706)^2} = 985.92$$
$$a = \frac{1{,}131}{10} - 985.92 \times \frac{0.706}{10} = 43.49$$
Thus the hyperbolic function is:
$$y = 43.49 + \frac{985.92}{x}$$
The calculated least squares line can now be fitted on the graph using the values calculated according to the hyperbolic function, shown in Table 7-4.
Table 7-4
x    a + b/x               Value of y
10   43.49 + 985.92 / 10   142.08
11   43.49 + 985.92 / 11   133.12
12   43.49 + 985.92 / 12   125.65
13   43.49 + 985.92 / 13   119.33
14   43.49 + 985.92 / 14   113.91
15   43.49 + 985.92 / 15   109.22
17   43.49 + 985.92 / 17   101.49
18   43.49 + 985.92 / 18   98.26
19   43.49 + 985.92 / 19   95.38
20   43.49 + 985.92 / 20   92.79
These values are plotted on Figure 7.6.

[Figure: unit cost (vertical axis, 80 to 150) against order volume (horizontal axis, 8 to 24), with the fitted curve.]
Figure 7.6 Scatter diagram with fitted hyperbolic curve
The same information can be reproduced in a linear form where the x axis is defined as 1/x. The question posed in the problem, namely what is the unit cost for an order size of 22 units, can be answered from one of the graphs or by direct calculation. On the assumption that the known relationship between x and y continues beyond the observed range, the unit cost for an order size of 22 is:
$$y = 43.49 + \frac{985.92}{22} = 88.30$$
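The hyperbolic fit can also be carried out without intermediate rounding. The sketch below applies the formulae for a and b to the Table 7-2 data; the function name `predict_unit_cost` is illustrative. Because the denominator of b is a small difference between two similar numbers, the full-precision coefficients differ noticeably from the hand-computed 985.92 and 43.49, which rest on sums rounded to three or four decimals, although the predicted cost changes by less than one unit.

```python
# Table 7-2 data: order volume x and unit cost y.
order_volume = [10, 11, 12, 13, 14, 15, 17, 18, 19, 20]
unit_cost = [150, 127, 123, 117, 110, 107, 104, 101, 97, 95]

n = len(order_volume)
inv_x = [1 / v for v in order_volume]

sum_inv = sum(inv_x)                                         # sum of 1/x
sum_inv_sq = sum(u * u for u in inv_x)                       # sum of 1/x^2
sum_y = sum(unit_cost)                                       # sum of y
sum_y_over_x = sum(c * u for c, u in zip(unit_cost, inv_x))  # sum of y/x

# Least squares coefficients of y = a + b / x.
b = (n * sum_y_over_x - sum_inv * sum_y) / (n * sum_inv_sq - sum_inv ** 2)
a = sum_y / n - b * sum_inv / n


def predict_unit_cost(volume):
    """Unit cost predicted by the fitted hyperbolic curve."""
    return a + b / volume


print(f"a = {a:.2f}, b = {b:.2f}")
print(f"predicted unit cost for 22 units: {predict_unit_cost(22):.2f}")
```

The prediction for 22 units remains an extrapolation beyond the observed range of 10 to 20 units, so the caution expressed in the text applies equally to the code.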
Learning curves
Forecasting is concerned with what we anticipate will happen in the future. Unthinking extrapolation of past conditions is unlikely to produce good forecasts. If we are aware of an expected change in conditions in the future this must be taken into account when preparing the finalised forecast. A particular example of this relates to what are known as learning curves that are a practical application of a non-linear function. The learning curve depicts the way people learn by doing a task and are therefore able to complete the task more quickly the next time they attempt it. Learning is rapid in the early stages and the rate gradually declines until a sufficient number of units or tasks have been completed, when the time taken will become constant. The main practical application is concerned with direct labour times and costs. Cost predictions especially those relating to direct labour costs should allow for the effects of the learning process. During the early stages of producing a new part or carrying out a new process, experience and skill is gained, productivity increases and there is a reduction of time taken per unit. Studies have shown that there is a tendency for the time per unit to reduce at some constant rate as production mounts. For example, an 80% learning curve means that as cumulative production quantities double the average time per unit falls by 20%.
This is shown in Table 7-5:
Table 7-5 Illustration of an 80% Learning Curve
Cumulative number of clients   Cumulative time taken (mins)   Average time per client
20                             400                            20
40                             640                            16     (20 × 80%)
80                             1,024                          12.8   (20 × 80% × 80%)
160                            1,638.4                        10.24  (20 × 80% × 80% × 80%)
The learning curve is a non-linear function with the general form:
$$y = ax^b$$
where,
y: average labour hours for a client
a: number of labour hours for the first client
x: cumulative number of clients
b: the learning coefficient
The learning coefficient is calculated as follows:
$$b = \frac{\log(1 - \text{proportionate decrease})}{\log 2}$$
thus for a 20% decrease (i.e. an 80% learning curve)
$$b = \frac{\log(1 - 0.2)}{\log 2} = \frac{\bar{1}.90309}{0.30103} = -0.322$$
Note: It will be remembered from mathematics that the log of 0.8 is conventionally written as $\bar{1}.90309$ but is actually -1 + 0.90309, i.e. -0.09691, which, divided by 0.30103, gives -0.322.
Having established the values for the function, it can be used to find the expected labour time per unit. For example, with an 80% learning curve and a time of 10 minutes for the first client, what is the expected time per client when the cumulative number of clients is 20? Using the function we obtain:
$$y = ax^b = 10 \times 20^{-0.322} = 3.812 \text{ mins}$$
Note: Whilst it is clear that learning does take place and that average times are likely to reduce, in practice it is highly unlikely that there will be a regular, consistent rate of decrease as exemplified above. Accordingly, any cost predictions based on conventional learning curves should be used cautiously, as with any other forecasting method.
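The learning coefficient and the figures of Table 7-5 can be reproduced in a few lines. This is a minimal sketch; the function name `learning_coefficient` is illustrative, and the 10-minute first-client time is the one assumed in the example above.

```python
import math


def learning_coefficient(learning_rate):
    """Exponent b for a learning curve, e.g. learning_rate = 0.8 for 80%."""
    return math.log10(learning_rate) / math.log10(2)


b = learning_coefficient(0.8)

# Expected average time per client after 20 clients, first client = 10 minutes.
first_client_minutes = 10
time_at_20 = first_client_minutes * 20 ** b

# Table 7-5: each doubling of cumulative clients multiplies the average by 0.8.
clients = [20, 40, 80, 160]
average = [20 * 0.8 ** k for k in range(len(clients))]
cumulative = [c * t for c, t in zip(clients, average)]

print(f"b = {b:.4f}, time per client at 20 clients = {time_at_20:.3f} mins")
for c, total, avg in zip(clients, cumulative, average):
    print(f"{c:>4} clients: cumulative {total:.1f} mins, average {avg:.2f} mins")
```

Doubling the cumulative quantity multiplies x^b by 2^b = 0.8, which is exactly the 20% fall in average time that defines an 80% curve.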
Linear transformation of learning curve
An alternative method of calculating the learning curve coefficient uses the linear transformation formed by taking the logarithm of the function, thus:
$$\log y = \log(ax^b)$$
$$\log y = \log a + b \log x$$
This will be recognised as a transformation to the general linear form y = a + bx. If X stands for log x and Y stands for log y, then the standard formulae for a and b become:
$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left(\sum X\right)^2}$$
$$\log a = \frac{\sum Y}{n} - b \frac{\sum X}{n}$$
The above formulae are illustrated using the data of Table 7-5, reproduced in Table 7-6.
Table 7-6
Cumulative number of clients x   Cumulative time   Average time per client y
20                               400               20
40                               640               16
80                               1,024             12.8
160                              1,638.4           10.24
The logarithms of the cumulative number of clients, x, and the average serving time, y, are used to find the values for the formulae above and are shown in Table 7-7.
Table 7-7
X (log x)   Y (log y)   X^2 ((log x)^2)   XY (log x × log y)
1.30103     1.30103     1.69268           1.69268
1.60206     1.20412     2.56659           1.92907
1.90309     1.10721     3.62175           2.10712
2.20412     1.01030     4.85815           2.22682
Totals:
ΣX = 7.01030   ΣY = 4.62266   ΣX^2 = 12.73917   ΣXY = 7.95569
These values are inserted into the formula:
$$B = \frac{4 \times 7.95569 - 7.01030 \times 4.62266}{4 \times 12.73917 - 7.01030^2} = -0.3223$$
This will be seen to be the same value as calculated above. The learning curve thus has the form:
$$y = ax^{-0.3223}$$
For completeness the value of a is calculated. This represents the labour time for the first client. This was not one of the observed values, which started at 20 clients, so the value represents the theoretical time for the first client, given the relationship found for the observed range of 20 to 160 clients. Using the formula given above:
$$\log a = \frac{4.62266}{4} - (-0.3223) \times \frac{7.01030}{4}$$
log a = 1.72052, and finding the antilog gives a = 52.546. The full learning curve formula is thus:
$$y = 52.546 x^{-0.3223}$$
The value of 52.546 for the first client can be checked by inserting one of the observed values, say 20 clients, and confirming that the calculated time agrees with the observed time of 20 minutes. To find the value of $y = 52.546 \times 20^{-0.3223}$ we can compute:
Number                  log
20                      1.30103
20^-0.3223              1.30103 × (-0.3223) = -0.41932, written as $\bar{1}.58068$
52.546                  1.72052
52.546 × 20^-0.3223     1.72052 - 0.41932 = 1.30120, the antilog of which is almost 20
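The log-log regression can be repeated at full precision as a cross-check. The sketch below uses only the Table 7-6 data; the variable names are illustrative. Because the four observations lie exactly on an 80% learning curve, the fitted exponent agrees with the -0.322 found earlier, and the constant a differs from the table-based 52.546 only through rounding in the hand calculation.

```python
import math

# Table 7-6 data: cumulative number of clients and average time per client.
clients = [20, 40, 80, 160]
avg_minutes = [20, 16, 12.8, 10.24]

X = [math.log10(v) for v in clients]
Y = [math.log10(v) for v in avg_minutes]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xx = sum(v * v for v in X)
sum_xy = sum(u * v for u, v in zip(X, Y))

# Straight-line formulae applied to the logged data.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
log_a = sum_y / n - b * sum_x / n
a = 10 ** log_a

# Check: the fitted curve should give back 20 minutes at 20 clients.
check_at_20 = a * 20 ** b

print(f"b = {b:.4f}, log a = {log_a:.5f}, a = {a:.3f}")
print(f"fitted average time at 20 clients: {check_at_20:.3f} mins")
```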
7.5 Multiple Regression Models
This section shows the development of a multiple regression model and how the closeness of fit is measured by the coefficient of multiple determination. Non-linear models such as the exponential, logarithmic and hyperbolic functions, together with learning curves, were explained and exemplified in Section 7.4.
There will be occasions when the simple model, y = b0 + b1x, will not be considered satisfactory. This means that the simple linear model will not be a good enough predictor. In such circumstances there are two possible courses of action:
a. To investigate the possibility that movements in y, the dependent variable, depend on several independent variables and not just one as in the basic model. For example, changes in demand for a product may depend on:
- the price of the product
- the price of substitutes
- the level of incomes
- consumer tastes, and so on
If linearity can be assumed then a linear multiple regression model can be used. These models are dealt with in this section.
b. Alternatively, a non-linear model may be considered more appropriate; several of the more important non-linear functions are dealt with in Section 7.4.
A model which incorporates several independent variables is known as a multiple regression model. Because of the lengthy nature of the calculations it is unlikely that a detailed question on multiple regression would appear in the examinations for which this manual is intended. Familiarity with the processes involved and the structure of the model is, however, necessary. The development of this model is shown below.
The basic two variable model (one dependent and one independent variable) is:
$$y = b_0 + b_1 x$$
which can be solved using the Normal equations, thus:
$$\sum y = b_0 n + b_1 \sum x$$
$$\sum xy = b_0 \sum x + b_1 \sum x^2$$
From this can be developed models with more than 2 variables, and this is illustrated below using a 3 variable model (one dependent and two independent variables: y, x1 and x2).
$$y = b_0 + b_1 x_1 + b_2 x_2 \tag{7.14}$$
In the case of mass phenomena the resulting variable is considered a function of many variables:
$$y = f(x_1, x_2, \dots, x_n) + e$$
where the variables $x_1, x_2, \dots, x_n$ are the factorial variables, which determine to a certain extent the variation of the resulting variable (y). If the relation between every factor and the resulting variable is linear, then the estimation equation will be:
$$Y(x_1, x_2, \dots, x_n) = b_0 + b_1 x_1 + \dots + b_n x_n + e \tag{7.15}$$
where:
b0: the parameter that expresses the unregistered factors, considered as having a constant action, that is, all the factors other than those included as factorial variables
b1, ..., bn: the coefficients of regression, which show by how much the resulting variable changes, on average, when the corresponding factorial variable changes by one unit
x1, x2, ..., xn: the independent variables included in the relation of interdependence.
The parameters are determined by the method of least squares, under the condition that the sum of the squared deviations of the empirical values from the regression line is a minimum:
$$\sum \left( y - Y_{x_1, x_2, \ldots, x_n} \right)^2 = \min$$
To find the values of these parameters, the system of Normal equations is established by minimising, with respect to each parameter, the expression:
$$\sum \left[ y - (b_0 + b_1 x_1 + \ldots + b_n x_n) \right]^2 = \min$$
Solving the system gives the estimated parameters of the regression function. As in the case of multiple correlation, the ratio of correlation is used to measure the intensity of the correlation. For a three variable model, the multiple linear model can be solved by the following Normal equations:
$$\sum y = b_0 n + b_1 \sum x_1 + b_2 \sum x_2$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \tag{7.16}$$
$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2$$
The line of best fit gives way to a plane of best fit. The parameter b1 is the slope of the plane along the x1 axis, b2 is the slope along the x2 axis, and the plane cuts the y axis at b0 (written a in the worked example). The aim of adding variables to the simple two variable model is to improve the fit to the data. The above models are illustrated by the following example.
Example of multiple regression
The X consultancy company is investigating the relationship between performance in Statistics Methods and hours studied per week and the general level of intelligence of candidates. The company has data on ten students as follows:

Student   Hours   I.Q.   Examination level (%)
1         6       100    45
2         6       117    55
3         12      119    80
4         14      95     73
5         11      110    71
6         9       99     56
7         19      98     95
8         16      101    86
9         3       100    34
10        9       115    66

It is required: to calculate the simple separate regressions, the multiple regression and the coefficients of determination.
Solution: Part A Calculation of separate regressions
Table 7-8

         y      x1     x2       x1 y     x2 y      x1 x2     y^2      x1^2    x2^2
         56     9      99       504      5,544     891       3,136    81      9,801
         45     6      100      270      4,500     600       2,025    36      10,000
         80     12     119      960      9,520     1,428     6,400    144     14,161
         73     14     95       1,022    6,935     1,330     5,329    196     9,025
         71     11     110      781      7,810     1,210     5,041    121     12,100
         55     6      117      330      6,435     702       3,025    36      13,689
         95     19     98       1,805    9,310     1,862     9,025    361     9,604
         86     16     101      1,376    8,686     1,616     7,396    256     10,201
         34     3      100      102      3,400     300       1,156    9       10,000
         66     9      115      594      7,590     1,035     4,356    81      13,225
Total    661    105    1,054    7,744    69,730    10,974    46,889   1,321   111,806
For the regression of y on x1 (examination scores on hours studied) the parameters are:
$$b_{x_1} = \frac{n\sum x_1 y - \sum x_1 \sum y}{n\sum x_1^2 - \left(\sum x_1\right)^2} = \frac{10 \times 7{,}744 - 105 \times 661}{10 \times 1{,}321 - 105^2} = \frac{8{,}035}{2{,}185} = 3.68$$
$$a_{x_1} = \frac{\sum y}{n} - b_{x_1}\frac{\sum x_1}{n} = \frac{661}{10} - 3.67734 \times \frac{105}{10} = 27.49$$
The regression equation for the relationship of hours studied and examination result is:
$$y_{x_1} = a_{x_1} + b_{x_1} x_1 = 27.49 + 3.68 x_1$$
The coefficient of correlation for this relationship is:
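$$r_{x_1} = \frac{n\sum x_1 y - \sum x_1 \sum y}{\sqrt{\left[n\sum x_1^2 - \left(\sum x_1\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Note: This formula is a direct equivalent of that given previously but is easier to work with, since every term except $n\sum y^2 - (\sum y)^2$ is already known.
$$n\sum y^2 - \left(\sum y\right)^2 = 10 \times 46{,}889 - 661^2 = 468{,}890 - 436{,}921 = 31{,}969$$
$$r_{x_1} = \frac{8{,}035}{\sqrt{2{,}185 \times 31{,}969}} = 0.9613$$
$r_{x_1}^2 = 0.9243$, i.e. the coefficient of determination for y: x1.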
In a similar manner the regression of y on x2 (examination scores on IQ scores) is calculated, resulting in:
$$y_{x_2} = a_{x_2} + b_{x_2} x_2 = 57.16 + 0.085 x_2$$
$$r_{x_2}^2 = 0.001608$$
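The two separate regressions can be checked with a short script. The following is a minimal sketch, not part of the original manual: the helper name `simple_regression` is an assumption introduced for illustration, and the data are the ten rows of Table 7-8.

```python
import math

# Data from the example, in the row order of Table 7-8.
y = [56, 45, 80, 73, 71, 55, 95, 86, 34, 66]           # examination score (%)
x1 = [9, 6, 12, 14, 11, 6, 19, 16, 3, 9]               # hours studied per week
x2 = [99, 100, 119, 95, 110, 117, 98, 101, 100, 115]   # IQ score


def simple_regression(x, y):
    """Return (intercept, slope, r) for the least squares line of y on x.

    Hypothetical helper: it applies the reduced computation formulas
    based on the sums n, sum(x), sum(y), sum(xy), sum(x^2), sum(y^2).
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = sy / n - slope * sx / n
    r = (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )
    return intercept, slope, r


a1, b1, r1 = simple_regression(x1, y)
a2, b2, r2 = simple_regression(x2, y)
print(f"y on x1: a = {a1:.2f}, b = {b1:.2f}, r^2 = {r1 ** 2:.4f}")
print(f"y on x2: a = {a2:.2f}, b = {b2:.2f}, r^2 = {r2 ** 2:.6f}")
```

Working from the raw sums mirrors the hand calculation, so each intermediate quantity can be compared with the totals row of the table.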
Figure 7-6 Scatter diagram of examination scores and hours studied (y: x1), with the fitted regression line. Vertical axis: examination score; horizontal axis: hours studied per week.
Figure 7-7 Scatter diagram of examination scores and I.Q. scores (y: x2), with the fitted regression line y = 0.085x + 57.16. Vertical axis: examination score; horizontal axis: IQ score.
Solution: Part B The multiple regression (y: x1 and x2)
The multiple regression calculations are carried out using the three variable Normal equations (7.16) and the totals in Table 7-8, thus:
661 = 10a + 105b1 + 1,054b2
7,744 = 105a + 1,321b1 + 10,974b2
69,730 = 1,054a + 10,974b1 + 111,806b2
Using standard simultaneous equation procedures results in the following values for the coefficients in the equation y = a + b1 x1 + b2 x2:
y = -38.06 + 3.93x1 + 0.6x2
This result could be used to predict the examination score for a candidate, given the number of hours worked and the IQ. For example, what is the expected score of a candidate who has worked for 13 hours per week and who has an IQ of 102?
y = -38.06 + 3.93 × 13 + 0.6 × 102 = 74.23% expected examination score
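The "standard simultaneous equation procedures" can be sketched in code. The block below is an illustrative sketch only: `solve_linear_system` is a hypothetical helper that applies Gaussian elimination with partial pivoting to the three Normal equations, using the totals of Table 7-8.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest pivot into position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the current column from the rows below.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


# Coefficient matrix and right-hand side of the three Normal equations.
A = [[10, 105, 1054],
     [105, 1321, 10974],
     [1054, 10974, 111806]]
rhs = [661, 7744, 69730]

a, b1, b2 = solve_linear_system(A, rhs)
predicted = a + b1 * 13 + b2 * 102  # 13 hours per week, IQ of 102
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}, prediction = {predicted:.2f}")
```

Because the script keeps the coefficients at full precision, its prediction is slightly lower than the 74.23% obtained by hand from the rounded coefficients.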
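Solution: Part C Coefficient of multiple determination, R^2
Using the computational formula (7.25) and the values calculated above, R^2 can be calculated thus:
$$R^2 = \frac{a\sum y + b_1\sum x_1 y + b_2\sum x_2 y - \frac{\left(\sum y\right)^2}{n}}{\sum y^2 - \frac{\left(\sum y\right)^2}{n}} = \frac{(-38.06 \times 661) + (3.93 \times 7{,}744) + (0.6 \times 69{,}730) - \frac{661^2}{10}}{46{,}889 - \frac{661^2}{10}} = 0.9995$$
The coefficients are shown rounded. The calculation itself must carry them at full precision, because the numerator is a small difference between large terms and is very sensitive to rounding.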
The various coefficients of determination can now be summarised and interpreted:
$r_{x_1}^2 = 0.9243$
$r_{x_2}^2 = 0.0016$
$R^2 = 0.9995$
$r_{x_1}^2$ - This indicates that about 92% of the variation in examination scores is explained by variation in hours of study, which is obviously a major influence.
$r_{x_2}^2$ - This indicates that only 0.16% of the variation in examination score is explained by variation in IQ score, which is a very small influence indeed.
$R^2$ - This shows the combined effect of the two independent variables and indicates that 99.95% of the movement in examination score is accounted for by movements in hours studied and IQ score. This, however, assumes that it is a reasonable hypothesis that examination results are influenced by the intelligence of candidates and how hard they work!
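As a cross-check on Part C, R^2 can also be obtained from its definition as explained variation over total variation. The sketch below is an assumption-laden illustration: the helper names `mean` and `cross` are hypothetical, and the two slopes are found from the centred form of the Normal equations rather than from the computational formula.

```python
y = [56, 45, 80, 73, 71, 55, 95, 86, 34, 66]           # examination score (%)
x1 = [9, 6, 12, 14, 11, 6, 19, 16, 3, 9]               # hours studied per week
x2 = [99, 100, 119, 95, 110, 117, 98, 101, 100, 115]   # IQ score


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Centred Normal equations: two equations in b1 and b2 (Cramer's rule).
s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y, syy = cross(x1, y), cross(x2, y), cross(y, y)
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(y) - b1 * mean(x1) - b2 * mean(x2)

# Explained variation over total variation.
fitted = [a + b1 * u + b2 * v for u, v in zip(x1, x2)]
explained = sum((f - mean(y)) ** 2 for f in fitted)
R2 = explained / syy
print(f"R^2 = {R2:.4f}")
```

Working with deviations from the means avoids subtracting large, nearly equal totals, so the result is far less sensitive to rounding than the computational formula; it agrees with the reported 0.9995 to about three decimal places.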
7.6 Correlation between Variables

The degree of correlation between two variables can be measured by using the following indicators:
a) Covariance represents an absolute measure of the intensity of the relation and is computed as the arithmetic mean of the products $(x_i - \bar{x})(y_i - \bar{y})$:
$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} \tag{7.17}$$
If the result tends to zero then there is no linear relation between the variables. If the result is positive, then we have a positive correlation, and if the result is negative, we have a negative correlation. In the case of perfect correlation the covariance reaches its maximum absolute value, which equals the product of the standard deviations of the two variables.
b) Coefficient of correlation, denoted by r, used only for linear relations. This provides a measure of the strength of association between two variables; r can range from -1, i.e. perfect negative correlation, to +1, i.e. perfect positive correlation. The formula for the coefficient of correlation is:
$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \sigma_x \sigma_y} \tag{7.18}$$
c) Ratio of determination, denoted by R^2, which expresses how much of the total variation of the Y variable is explained by the independent variable.
d) The rank correlation coefficients, which provide a measure of the association between two sets of ranked or ordered data.
Whichever type of coefficient is being used, a coefficient of zero or near zero generally indicates no correlation.
7.6.1 Parametric Measures of Simple Correlation. Coefficient and Ratio of Correlation
Coefficient of correlation
This coefficient is computed differently depending on how the data are presented, as related pairs or as classified pairs of figures:
a. simple bivariate numerical data which are not grouped
b. bivariate numerical data grouped by classes or variants, with common frequencies for the x and y variation
c. bivariate numerical data grouped by variants or classes into a cross table
This coefficient gives an indication of the strength of the linear relationship between two variables.
a. In the case of simple bivariate numerical data which are not grouped and are presented as related pairs of figures:

X values   Y values
X1         Y1
...        ...
Xi         Yi
...        ...
Xn         Yn
ΣXi        ΣYi
The general formula is:
$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \sigma_x \sigma_y} \tag{7.19}$$
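There are several possible formulae but a practical one is the reduced computation formula (7.20):
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \tag{7.20}$$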
This formula is used to find r from the data of Example 1, given in Table 7-9.
Table 7-9

         X      Y      X^2      Y^2      XY
         15     60     225      3,600    900
         24     45     576      2,025    1,080
         25     50     625      2,500    1,250
         30     35     900      1,225    1,050
         35     42     1,225    1,764    1,470
         40     46     1,600    2,116    1,840
         45     28     2,025    784      1,260
         65     20     4,225    400      1,300
         70     22     4,900    484      1,540
         75     15     5,625    225      1,125
Total    424    363    21,926   15,123   12,815
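Using the formula above:
$$r = \frac{10 \times 12{,}815 - 424 \times 363}{\sqrt{(10 \times 21{,}926 - 424^2)(10 \times 15{,}123 - 363^2)}} = \frac{128{,}150 - 153{,}912}{\sqrt{(219{,}260 - 179{,}776)(151{,}230 - 131{,}769)}} = \frac{-25{,}762}{\sqrt{39{,}484 \times 19{,}461}} = -0.93$$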
Thus the correlation coefficient is -0.93 which indicates a strong negative linear association between expenditure on inspection and defective parts delivered. It will be seen that the formula automatically produces the correct sign for the coefficient.
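The same calculation can be reproduced with a few lines of code. This is a minimal sketch of the reduced computation formula (7.20) applied to the X and Y columns of Table 7-9; the variable names are chosen for illustration.

```python
import math

# X: inspection expenditure, Y: defective parts (Table 7-9).
X = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
Y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

# Reduced computation formula (7.20).
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator
print(f"r = {r:.2f}")
```

The sign of r comes entirely from the numerator, which is why the formula produces the correct sign without any extra step.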
b. In the case of bivariate numerical data grouped by classes or variants with common frequencies for the x and y variation
In this case the input data are arranged as follows:

X values   Y values   Frequencies fi
X1         Y1         f1
...        ...        ...
Xi         Yi         fi
...        ...        ...
Xn         Yn         fn
ΣXi        ΣYi        Σfi

For the above table the practical formula for the coefficient of correlation is:
$$r = \frac{\sum f_i \sum x_i y_i f_i - \sum x_i f_i \sum y_i f_i}{\sqrt{\left[\sum f_i \sum x_i^2 f_i - \left(\sum x_i f_i\right)^2\right]\left[\sum f_i \sum y_i^2 f_i - \left(\sum y_i f_i\right)^2\right]}}$$
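A small sketch may help to show that the frequency-weighted formula is simply the ungrouped formula applied to repeated pairs. The function name `weighted_r` and the three (x, y, f) rows are hypothetical and were made up purely for illustration; they do not come from the manual.

```python
import math


def weighted_r(xs, ys, fs):
    """Coefficient of correlation for pairs (x_i, y_i) with frequencies f_i."""
    sf = sum(fs)
    sfx = sum(f * x for x, f in zip(xs, fs))
    sfy = sum(f * y for y, f in zip(ys, fs))
    sfxy = sum(f * x * y for x, y, f in zip(xs, ys, fs))
    sfx2 = sum(f * x * x for x, f in zip(xs, fs))
    sfy2 = sum(f * y * y for y, f in zip(ys, fs))
    numerator = sf * sfxy - sfx * sfy
    denominator = math.sqrt(
        (sf * sfx2 - sfx ** 2) * (sf * sfy2 - sfy ** 2)
    )
    return numerator / denominator


# Hypothetical grouped data: each pair occurs f times.
xs, ys, fs = [1, 2, 3], [2, 4, 5], [3, 2, 5]
r_grouped = weighted_r(xs, ys, fs)

# The same data written out pair by pair, each with frequency 1.
flat_x = [x for x, f in zip(xs, fs) for _ in range(f)]
flat_y = [y for y, f in zip(ys, fs) for _ in range(f)]
r_flat = weighted_r(flat_x, flat_y, [1] * len(flat_x))
print(f"grouped r = {r_grouped:.4f}, ungrouped r = {r_flat:.4f}")
```

Setting every frequency to 1 reduces the grouped formula to the reduced computation formula (7.20), which is why the two results coincide.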
c. In the case of bivariate numerical data grouped into a cross table, as in Table 7-10:
Table 7-10
Rows: class middles or variants of the independent variable (xi). Columns: class middles or variants of the dependent variable (yj).

                 y1      ...    yj      ...    ym      Total (fi.)
x1               f11     ...    f1j     ...    f1m     f1.
...              ...     ...    ...     ...    ...     ...
xi               fi1     ...    fij     ...    fim     fi.
...              ...     ...    ...     ...    ...     ...
xn               fn1     ...    fnj     ...    fnm     fn.
Total (f.j)      f.1     ...    f.j     ...    f.m     Σfi. = Σf.j = ΣΣfij
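For a cross table the practical formula of r is:
$$r = \frac{\sum_i \sum_j f_{ij} \sum_i \sum_j x_i y_j f_{ij} - \sum_i x_i f_{i\cdot} \sum_j y_j f_{\cdot j}}{\sqrt{\left[\sum_i \sum_j f_{ij} \sum_i x_i^2 f_{i\cdot} - \left(\sum_i x_i f_{i\cdot}\right)^2\right]\left[\sum_i \sum_j f_{ij} \sum_j y_j^2 f_{\cdot j} - \left(\sum_j y_j f_{\cdot j}\right)^2\right]}} \tag{7.21}$$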
No matter the data presentation and classification the coefficient of correlation is interpreted compared to zero and its limits, -1 and +1.
Interpretation of the value of r

Caution is needed in the interpretation of the coefficient of correlation, r. A high value (above +0.9 or below -0.9) only shows a strong association between the two variables and does not show a causal relationship. It is possible to find two variables which produce a high calculated r value yet which have no causal relationship. This is known as spurious or nonsense correlation. An example might be the wheat harvest in America and the number of deaths by drowning in Britain. There might be a high apparent correlation between these two variables but there clearly is no causal relationship. The coefficient of correlation can take values between -1 and +1 and its size is interpreted as follows:

r = 0 : no
relationship, independent variables is a low intensity relation between the variables
r (0,0.2 ) : there
r (0.2,0.5) :
there is a week correlation, case needing a significance test to be applied as for instance the Student test r (0.5,0.75) : medium intensity relation

r (0.75,0.95) : r (0.95,1.00 ) :
tight, high intensity relation
we have an extremely strong relationship between the variables, almost a deterministic relation (functional relation)
If we compare r with zero, then:
r > 0 shows a positive relationship and should correspond to a positive slope of the regression line;
r < 0 shows a negative relationship and should correspond to a negative slope.
A low correlation coefficient, somewhere near zero, does not always mean that there is no relationship between the variables. All it says is that there is no linear relationship between the variables; there may be a strong relationship, but a non-linear one.
A further problem in interpretation arises from the fact that the coefficient of correlation measures the relationship between a single independent variable and dependent variable, whereas a particular variable may be dependent on several independent variables in which case multiple correlation should have been calculated rather than the simple two-variable coefficient.
The significance of r
Frequently the set of X and Y observations is based upon a sample. Had a different sample been drawn, the value of r would have been different, although the degree of correlation in the reference population would remain the same. In the same way that knowledge of the sample mean and standard deviation enables an estimate to be made of the population mean, knowledge of r enables the analyst to make an estimate of ρ, the population coefficient of correlation. Generally in examination questions the sample size is limited to some figure that can be dealt with in the time allowed. It is questionable whether such a sample gives enough data for a credible judgment to be formed about a possible relationship between the X and Y values, or whether the particular sample merely gives this impression. Conversely, if r is low, does it really imply the lack of a relationship? There may indeed be a close relationship which the data have not revealed. Further, the relationship may exist, but it may not be linear or it may not be direct. It is possible to test whether the value of r is sufficiently different from zero for the analyst to decide whether the X and Y values are correlated. The test is stated in terms of the null hypothesis and its alternative:
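H0: ρ = 0
H1: ρ ≠ 0
It is a t test for which the test statistic is given by:
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \tag{7.22}$$
Using the values from Example 1, i.e. r = -0.93 and n = 10, we obtain:
$$t = \frac{-0.93 - 0}{\sqrt{1 - 0.93^2}} \times \sqrt{10 - 2} = -2.53 \times 2.83 = -7.16$$
Since |t| = 7.16 exceeds the critical value of 2.306 (two-tailed test, 5% significance level, n - 2 = 8 degrees of freedom), H0 is rejected and the correlation is judged significant.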
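The test statistic (7.22) is easy to compute directly. The sketch below is illustrative: the function name `t_statistic` is an assumption, and the critical value 2.306 is taken from standard Student t tables (two-tailed, 5% level, 8 degrees of freedom) rather than computed.

```python
import math


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, formula (7.22)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t = t_statistic(-0.93, 10)

# Assumed tabulated value: two-tailed, 5% level, 8 degrees of freedom.
CRITICAL_T = 2.306
reject_h0 = abs(t) > CRITICAL_T
print(f"t = {t:.2f}, reject H0: {reject_h0}")
```

The statistic has n - 2 degrees of freedom because two parameters, the intercept and the slope, are estimated from the sample.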
Ratio of correlation
The ratio of correlation can be used to characterise any category of relation, linear or not linear relationship. It can be also used to measure the intensity of the relation no matter how many independent variables we take into account. The ratio of correlation shows only the intensity of the relation and it does not show the direction. It is computed with the formula:
$$R = \sqrt{1 - \frac{\sum (y_i - Y_x)^2}{\sum (y_i - \bar{y})^2}} \tag{7.23}$$
where:
$y_i$: the array of dependent data
$Y_x$: the array of adjusted values, calculated according to the regression function
$\bar{y}$: the arithmetic mean of the dependent values
The ratio of correlation is interpreted similarly with the coefficient of correlation and it can take values between 0 and +1.
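To illustrate formula (7.23), the sketch below fits a least squares line to the data of Table 7-9 and computes the ratio of correlation from the fitted values. It is an illustration under the assumption of a linear regression function; the variable names are hypothetical.

```python
import math

# Data of Table 7-9.
X = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
Y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]

mean_x = sum(X) / len(X)
mean_y = sum(Y) / len(Y)

# Least squares line used as the regression function.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
    / sum((x - mean_x) ** 2 for x in X)
)
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in X]

# Ratio of correlation, formula (7.23).
unexplained = sum((y - f) ** 2 for y, f in zip(Y, fitted))
total = sum((y - mean_y) ** 2 for y in Y)
R = math.sqrt(1 - unexplained / total)
print(f"slope = {slope:.4f}, R = {R:.2f}")
```

For a linear regression function the ratio equals the absolute value of the coefficient of correlation, so here R is about 0.93 even though r is negative. This shows in practice that the ratio measures intensity but not direction.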
7.6.2 Parametric Measures of Multiple Correlation
In the case of multiple correlation the closeness of fit is measured by the coefficient of multiple determination, R^2, for which the general formula and the useful computational formula are given below:
$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum (Y_{\text{estimate}} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} \tag{7.24}$$
where Y estimate now equals the estimate of Y for each value of x1 and x2.
$$R^2 = \frac{a\sum y + b_1\sum x_1 y + b_2\sum x_2 y - \frac{\left(\sum y\right)^2}{n}}{\sum y^2 - \frac{\left(\sum y\right)^2}{n}} \tag{7.25}$$
It is not necessarily the case that the value of the coefficient of determination will improve with the addition of extra variables.
The ratio of multiple correlation is calculated as in the case of simple correlation, from the share of the dispersion produced by the registered factors ($\sigma^2_{y/x_1, x_2, \ldots, x_n}$) in the total dispersion of the resulting variable ($\sigma^2_y$). Using the relation between the three dispersions ($\sigma^2_y = \sigma^2_{y/x_1, x_2, \ldots, x_n} + \sigma^2_{y/r}$), the ratio of correlation is computed with the formula:
$$R_{y/x_1, x_2, \ldots, x_n} = \sqrt{1 - \frac{\sum \left(y_i - Y_{x_1, x_2, \ldots, x_n}\right)^2}{\sum (y_i - \bar{y})^2}} \tag{7.26}$$
The ratio of multiple correlation can take values between 0 and +1. It is at least as high as any of the simple correlation indicators, because it combines the influence of each factor and of the interaction between them. So, the more factors are considered, the higher the value of the ratio. Theoretically, if all the factors could be expressed numerically, then the ratio of multiple correlation would be 1, showing a functional dependence between the resulting variable and all its determining factors. In that case the value given by the regression equation would equal the empirical value of the resulting variable, calculated from the size of all the determining factors, and the free term would be 0:
$$Y(x_1, x_2, \ldots, x_n) = a + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n \tag{7.27}$$
In practice, however, not all the influencing factors can be identified, and some of them cannot be quantified. For this reason the values of the multiple regression will deviate more or less from the real values of the series terms, because of the influence of the unregistered factors, which is included in the value of the free term a. When the relation with each of the considered factors is verified to be linear, the ratio of multiple correlation becomes a coefficient of multiple correlation, and the two are equal. In the case of multiple correlation, the ratio of linear correlation synthesizes all the simple linear relations. If the factors are independent of each other, then the ratio of multiple determination equals the sum of the ratios of simple determination. For instance, for two factors:
$$R^2_{y/x_1, x_2} = R^2_{y/x_1} + R^2_{y/x_2} \tag{7.28}$$
If the relation is linear, then R is substituted by r:
$$R^2_{y/x_1, x_2} = r^2_{y/x_1} + r^2_{y/x_2} \tag{7.29}$$
From this, the ratio of correlation is:
$$R_{y/x_1, x_2} = \sqrt{r^2_{y/x_1} + r^2_{y/x_2}} \tag{7.30}$$
Usually, among socio-economic phenomena, the factors of influence are not independent of each other, and it therefore becomes necessary to take their reciprocal influence into account. If the factors are interdependent, $r_{x_1 x_2} \neq 0$.
This inter-influence has to be eliminated, because otherwise it is included in the value of the multiple correlation coefficient. The ratio of multiple linear correlation is then calculated from the coefficients of simple correlation:
$$R_{y/x_1, x_2} = \sqrt{\frac{r^2_{y x_1} + r^2_{y x_2} - 2 r_{y x_1} r_{y x_2} r_{x_1 x_2}}{1 - r^2_{x_1 x_2}}} \tag{7.31}$$
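Formula (7.31) can be tried on the examination data of Table 7-8, where hours studied and IQ score are themselves correlated. The sketch below is an illustration only; `pearson` is a hypothetical helper implementing the reduced computation formula (7.20).

```python
import math

y = [56, 45, 80, 73, 71, 55, 95, 86, 34, 66]           # examination score (%)
x1 = [9, 6, 12, 14, 11, 6, 19, 16, 3, 9]               # hours studied per week
x2 = [99, 100, 119, 95, 110, 117, 98, 101, 100, 115]   # IQ score


def pearson(u, v):
    """Coefficient of simple linear correlation, formula (7.20)."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(a * b for a, b in zip(u, v))
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    return (n * suv - su * sv) / math.sqrt(
        (n * suu - su ** 2) * (n * svv - sv ** 2)
    )


r_y1 = pearson(y, x1)
r_y2 = pearson(y, x2)
r_12 = pearson(x1, x2)

# Formula (7.31), written for R squared.
R2 = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
R = math.sqrt(R2)
print(f"r_x1x2 = {r_12:.3f}, R^2 = {R2:.4f}, R = {R:.4f}")
```

In this data set the two factors are negatively correlated with each other, so the correction raises R^2 well above the simple sum of formula (7.29) and brings it into line with the coefficient of multiple determination found in the worked example.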
7.6.3 Nonparametric Measures of Correlation
Sometimes in practice none of the known functions can be used to interpret the relation, because there are not enough elements to identify the distribution of the errors for the series used. In this case nonparametric methods are used, such as the coefficient of association proposed by Yule and the coefficients of rank correlation proposed by Kendall and Spearman. These coefficients have the advantage that they can be used in the case of a skewed distribution or a small number of units. This is possible because in these situations the terms are arranged according to the rank of each variable rather than its value.
Yule coefficient of association

This coefficient is used when the statistical units can be separated into two groups according to the x and y variation, or when the data have the form of binary variables:
Table 7-11

X groups or variants   Y1      Y2      Total
X1                     A       B       A+B
X2                     C       D       C+D
Total                  A+C     B+D     A+B+C+D

In order to express the intensity of the relation we use the formula:
$$K_{Yule} = \frac{A \cdot D - B \cdot C}{A \cdot D + B \cdot C} \tag{7.32}$$
with the same interpretation as for the coefficient of correlation, taking values between -1 and +1.
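Formula (7.32) translates directly into code. The sketch below is illustrative: the function name `yule_coefficient` and the four cell counts are hypothetical, chosen only to show the calculation.

```python
def yule_coefficient(a, b, c, d):
    """Yule coefficient of association for a 2 x 2 table, formula (7.32).

    a, b are the counts in row X1 (columns Y1, Y2);
    c, d are the counts in row X2 (columns Y1, Y2).
    """
    return (a * d - b * c) / (a * d + b * c)


# Hypothetical counts, laid out as in Table 7-11.
k = yule_coefficient(30, 10, 15, 45)
print(f"K = {k:.2f}")
```

The coefficient reaches +1 or -1 as soon as one of the four cells is empty, so a perfect value should be read with care when some counts are very small.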
Ranks coefficients
This nonparametric method also has the advantage that the analysis can include relations of dependence involving qualitative variables, which cannot be expressed numerically but can be ordered by rank. The data are arranged according to the variation of the independent variable and each variant is replaced with its order number, called its rank. The ranks can be assigned either in increasing order, when the best value of the indicator is the one with the minimum value, or in decreasing order, when the maximum value has rank one. For the value of the coefficients of correlation the direction in which the ranks are assigned is of little importance, provided the same direction is kept for all the variables. The direction matters only if the correlation analysis is combined with the establishment of a hierarchic typology. The starting hypothesis is that there is concordance between the two series of ranks: when a relation exists between the two variables, a unit should have the same number of units ranked above it, and below it, in both series. The most frequently used formulas for the coefficient of rank correlation are those of Spearman and Kendall. The rank coefficient proposed by Spearman is:
$$r_s = 1 - \frac{6\sum d_i^2}{n^3 - n} \tag{7.33}$$
where $d_i$ is the rank difference between the correlated variables and n is the number of correlated units.
The coefficient of rank correlation proposed by Kendall has the formula:
$$r_K = \frac{2S}{n(n - 1)} \tag{7.34}$$
where:
S = P + Q: the score of the two different positions of the ranks of the correlated variables;
P: the number of superior ranks that follow the rank of the effect variable for which the calculation is made;
Q: the number of inferior ranks of the effect variable that follow the same rank, taken with a negative sign.
P always has a positive value and Q a negative one, so S can be positive or negative. The coefficients of rank correlation can therefore take values between -1 and +1. They are interpreted in the same way as the parametric correlation. Their advantages and ease of calculation make these coefficients very suitable for studying the relation between specific phenomena, including qualitative variables measured on the ordinal scale. Because it is easier to calculate, Spearman's coefficient is the most frequently used. It is deduced from the coefficient of simple linear correlation, where the mean and the dispersion are based on the properties of the arithmetic progression. Kendall's coefficient is generally found to be smaller than Spearman's.
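Both rank coefficients can be sketched in a few lines. The block below is an illustrative sketch that assumes there are no tied ranks; the function names and the two five-unit rankings are hypothetical.

```python
def spearman(rank_x, rank_y):
    """Spearman rank coefficient, formula (7.33); assumes no tied ranks."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


def kendall(rank_x, rank_y):
    """Kendall rank coefficient, formula (7.34); assumes no tied ranks."""
    n = len(rank_x)
    # Arrange the units by the rank of the first variable.
    ordered_y = [ry for _, ry in sorted(zip(rank_x, rank_y))]
    p = q = 0
    for i in range(n):
        for later in ordered_y[i + 1:]:
            if later > ordered_y[i]:
                p += 1   # a superior rank follows: counts towards P
            elif later < ordered_y[i]:
                q -= 1   # an inferior rank follows: counts towards Q
    s = p + q
    return 2 * s / (n * (n - 1))


# Hypothetical rankings of five units on two criteria.
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
r_s = spearman(rank_x, rank_y)
r_k = kendall(rank_x, rank_y)
print(f"Spearman = {r_s:.2f}, Kendall = {r_k:.2f}")
```

For these rankings the Kendall value is the smaller of the two, in line with the remark above.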
Tied rankings. Adjusted rankings

A slight adjustment to the formula is necessary if, for example, in a study recording students' marks some students obtain the same mark in a test and are therefore given the same ranking. The adjustment is:
$$\frac{t^3 - t}{12} \tag{7.35}$$
where t is the number of tied rankings.
The adjusted formula for the Spearman coefficient is:
$$R = 1 - \frac{6\left(\sum d^2 + \frac{t^3 - t}{12}\right)}{n(n^2 - 1)} \tag{7.36}$$
For example, assume that students E and F achieved equal marks in QT and were given joint third place, i.e. a ranking of 3.5 each. The revised data are given in Table 7-12:
Table 7-12

Student   Q.T. ranking   M.A. ranking   d       d^2
A         2              3              -1      1
B         7              6              +1      1
C         6              4              +2      4
D         1              2              -1      1
E         3.5            5              -1.5    2.25
F         3.5            1              +2.5    6.25
G         5              8              -3      9
H         8              7              +1      1
Total                                           25.5

$$R = 1 - \frac{6\left(\sum d^2 + \frac{t^3 - t}{12}\right)}{n(n^2 - 1)} = 1 - \frac{6\left(25.5 + \frac{2^3 - 2}{12}\right)}{8(8^2 - 1)} = +0.69$$
As will be seen, the Spearman value has moved from +0.74 to +0.69.
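The adjusted calculation can be reproduced as follows. This is a sketch of the adjusted formula for a single group of t tied rankings, as in the example; the function name is hypothetical and the rankings are those of Table 7-12, with the tied students E and F ranked 3.5.

```python
def spearman_tied(rank_x, rank_y, t):
    """Spearman coefficient adjusted for one group of t tied rankings."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    adjustment = (t ** 3 - t) / 12
    return 1 - 6 * (sum_d2 + adjustment) / (n * (n ** 2 - 1))


# Students A to H; E and F share third and fourth place in Q.T.
qt = [2, 7, 6, 1, 3.5, 3.5, 5, 8]
ma = [3, 6, 4, 2, 5, 1, 8, 7]

sum_d2 = sum((a - b) ** 2 for a, b in zip(qt, ma))
R = spearman_tied(qt, ma, t=2)
print(f"sum of d^2 = {sum_d2}, R = {R:.2f}")
```

Each tied unit receives the average of the places it shares, which keeps the sum of the ranks unchanged.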
7.7 Exercises
Multiple choice exercises with answers
1. Which of the following techniques is used to predict the value of one variable on the basis of other variables?
a. Correlation analysis
b. Coefficient of correlation
c. Covariance
d. Regression analysis
ANSWER: d

2. If cov(X, Y) = 1260, $s_x^2 = 1600$ and $s_y^2 = 1225$, then the coefficient of determination is:
a. 0.7875
b. 1.0286
c. 0.8100
d. 0.7656
ANSWER: c

3. The coefficient of determination R^2 measures the amount of:
a. variation in y that is explained by variation in x
b. variation in x that is explained by variation in y
c. variation in y that is unexplained by variation in x
d. variation in x that is unexplained by variation in y
ANSWER: a

4. In the simple linear regression model, the y-intercept represents the:
a. change in y per unit change in x
b. change in x per unit change in y
c. value of y when x = 0
d. value of x when y = 0
ANSWER: c
Multiple choice exercises without answers

5. In a regression problem, if the coefficient of determination is 0.95, this means that:
a. 95% of the y values are positive
b. 95% of the variation in y can be explained by the variation in x
c. 95% of the x values are equal
d. 95% of the variation in x can be explained by the variation in y

6. In a regression problem, if all the values of the independent variable are equal, then the coefficient of determination must be:
a. 1 b. .5 c. 0 d. 1

7. The following sums of squares are produced:
$$\sum (y_i - \bar{y})^2 = 200, \quad \sum (y_i - \hat{y}_i)^2 = 50, \quad \sum (\hat{y}_i - \bar{y})^2 = 150$$
The proportion of the variation in y that is explained by the variation in x is:
a. 25%
b. 75%
c. 33%
d. 50%
Open ended exercises with answers
8. Consider the following data values of variables x and y.

x    2    4    6    8    10   13
y    7    11   17   21   27   36

a. Determine the least squares regression line.
b. Find the predicted value of y for x = 9.
c. What does the value of the slope of the regression line tell you?
d. Calculate the coefficient of determination, and describe what this statistic tells you about the relationship between the two variables.
e. Calculate the Pearson coefficient of correlation. What sign does it have? Why?
f. What does the coefficient of correlation calculated in part (e) tell you about the direction and strength of the relationship between the two variables?
ANSWERS:
a. y = 0.934 + 2.637x
b. 24.667
c. If x increases by one unit, y on average will increase by 2.637.
d. R^2 = 0.995. This means that 99.5% of the variation in the dependent variable y is explained by the variation in the independent variable x.
e. r = 0.9975. It is positive since the slope of the regression line is positive.
f. There is a very strong (almost perfect) positive linear relationship between the two variables.
9. A professor of economics wants to study the relationship between income (y in $1000s) and education (x in years). A random sample of eight individuals is taken and the results are shown below.

Education   16   11   15   8    12   10   13   14
Income      58   40   55   35   43   41   52   49

a. Draw a scatter diagram of the data to determine whether a linear model appears to be appropriate.
b. Determine the least squares regression line.
c. Interpret the value of the slope of the regression line.
d. Determine the standard error of estimate and describe what this statistic tells you about the regression line.
ANSWERS:
a. Scatter diagram of income (vertical axis) against years of education (horizontal axis). It appears that a linear model is appropriate.
b. y = 10.6165 + 2.9098x
c. For each additional year of education, the income on average increases by $2,909.80.
d. s = 2.436; the model's fit is good.

10. A scatter diagram includes the following data points:

x    3    2    5    4    5
y    8    6    12   10   14

Two regression models are proposed:
Model 1: y = 1.2 + 2.5x
Model 2: y = 5.5 + 4.0x
Using the least squares method, which of these regression models provides the better fit to the data? Why?
ANSWER: The sum of squared errors is 4.95 for model 1 and 593.25 for model 2. Therefore, model 1 is better than model 2.

11. Consider the following data values of variables x and y:

x    2    4    6    8    10   13
y    7    11   17   21   27   36

a. Determine the least squares regression line.
b. Find the predicted value of y for x = 9.
c. What does the value of the slope of the regression line tell you?
d. Calculate the coefficient of determination, and describe what this statistic tells you about the relationship between the two variables.
e. Calculate the Pearson coefficient of correlation. What sign does it have? Why?
f. What does the coefficient of correlation calculated in part (e) tell you about the direction and strength of the relationship between the two variables?
ANSWERS:
a. y = 0.934 + 2.637x
b. 24.667
c. If x increases by one unit, y on average will increase by 2.637.
d. R^2 = 0.995. This means that 99.5% of the variation in the dependent variable y is explained by the variation in the independent variable x.
e. r = 0.9975. It is positive since the slope of the regression line is positive.
f. There is a very strong (almost perfect) positive linear relationship between the two variables.
Open ended exercises without answers
12. Refer to Exercise 9.
a. Determine the coefficient of determination and discuss what its value tells you about the two variables.
b. Calculate the Pearson correlation coefficient. What sign does it have? Why?
c. Conduct a test of the population coefficient of correlation to determine at the 5% significance level whether a linear relationship exists between years of education and income.

13. In a simple linear regression problem, the following statistics are calculated from a sample of 10 observations:
$$\sum (x - \bar{x})(y - \bar{y}) = 2250, \quad s_x = 10, \quad \sum x = 50, \quad \sum y = 75$$
Compute the regression equation.

14. Given the least squares regression line y = -2.48 + 1.63x, and a coefficient of determination of 0.81, compute and interpret the coefficient of correlation.

15. Refer to Exercise 10.
a. Use the regression equation to determine the predicted values of y.
b. Use the predicted and actual values of y to calculate the residuals.
c. Plot the residuals against the predicted values of y. Does the variance appear to be constant?
d. Identify possible outliers.

16. For a company we know the following information regarding the evolution of turnover and profit:

Year   Turnover: mobile relative change (%)   Profit: chain base absolute change (mill. m.u.)
1991   +3%                                    6
1992   +4%                                    4
1993   +2%                                    -1
1994   +4%                                    9
1995   -2%                                    2
1996   +8%                                    9

The turnover in 1990 was 80 m.u. and the average rate of profit per year was +6.4%. Considering that the profit evolution is influenced by the turnover evolution, you are asked to:
a. reconstruct and graph the historical evolutions;
b. forecast the turnover evolution for the next year, choosing the most appropriate method between the simple and the analytic methods;
c. forecast the profit in 1997, taking into account its dependency upon the turnover (according to the regression line);
d. measure the intensity of the relation using parametric and nonparametric measures.
