
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES

COLLEGE OF ENGINEERING
DEPARTMENT OF INDUSTRIAL ENGINEERING
MODULE 5
SIMPLE LINEAR REGRESSION AND CORRELATION

I. Empirical Models

Many problems in Industrial Engineering involve exploring the relationships between two or more variables. IEs use regression analysis as a statistical technique to help understand such relationships. The following are some uses of regression analysis:

1. To help build a model to predict an output given certain inputs
2. To help in process optimization of an output given varying levels of input
3. To help in process control purposes

Consider the data in Table 1-1 below.

In Table 1-1, y is the purity of oxygen produced in a chemical distillation process, while x is the percentage of hydrocarbons that are present in the main condenser of the distillation unit. The figure immediately below Table 1-1 is called a scatter diagram; it represents a graph on which each $(x_i, y_i)$ pair is represented as a single point.

Looking at the scatter diagram, we can see that although no straight line will pass exactly through all the points, there is a strong indication that the points lie scattered randomly around a straight line.

It is therefore reasonable to assume that the mean of the random variable Y is related to x by the following straight-line relationship:

$$E(Y \mid x) = \mu_{Y|x} = \beta_0 + \beta_1 x$$

Table 1-1: Oxygen and Hydrocarbon Levels
Observation Hydrocarbon Level Purity
Number x (%) y (%)
1 0.99 90.01
2 1.02 89.05
3 1.15 91.43
4 1.29 93.74
5 1.46 96.73
6 1.36 94.45
7 0.87 87.59
8 1.23 91.77
9 1.55 99.42
10 1.40 93.65
11 1.19 93.54
12 1.15 92.52
13 0.98 90.56
14 1.01 89.54
15 1.11 89.85
16 1.20 90.39
17 1.26 93.25
18 1.32 93.41
19 1.43 94.98
20 0.95 87.33
Figure 1. Scatter Diagram of Oxygen Purity vs Hydrocarbon
Level
where $\beta_0$ and $\beta_1$ are referred to as the intercept and slope of the line, respectively; collectively, they are called the regression coefficients.

While the mean of Y is a linear function of x, the actual observed value y does not fall exactly on a straight line. To generalize this, we assume that the expected value of Y is a linear function of x, but that for a fixed value of x, the actual value of Y is determined by the mean value function (the linear model) plus a random error term:

$$Y = \beta_0 + \beta_1 x + \epsilon \qquad \text{(Equation 1-1)}$$

where $\epsilon$ is the random error term. Equation 1-1 is what is commonly known as the simple linear regression model; this is because the model above has only one independent variable (x), or regressor.

The simple linear regression model is an example of an empirical model, i.e. a simplified representation of a system or phenomenon that is based on experience or experimentation.

Y, being a dependent variable, has certain properties: it can be described via its mean and variance. Let us assume that the mean and variance of $\epsilon$ are 0 and $\sigma^2$, respectively, and that the value of x is fixed, i.e. constant. Using the theorems from your previous probability and statistics course:

$$E(Y \mid x) = E(\beta_0 + \beta_1 x + \epsilon) = \beta_0 + \beta_1 x + E(\epsilon) = \beta_0 + \beta_1 x$$

$$V(Y \mid x) = V(\beta_0 + \beta_1 x + \epsilon) = V(\beta_0 + \beta_1 x) + V(\epsilon) = 0 + \sigma^2 = \sigma^2$$

Thus, in the true regression model, the height of the regression line at any value of x is just the expected value of Y for that x, while the slope $\beta_1$ can be interpreted as the change in the mean of Y for a unit change in x. Also, the variability of Y at a particular value of x is determined by the error variance $\sigma^2$.

If the variability of Y is determined by the error variance $\sigma^2$, what does this imply? It implies that when $\sigma^2$ is small, the observed values of Y will fall close to the regression line; and when $\sigma^2$ is large, the observed values of Y may be significantly far away from the regression line. And since $\sigma^2$ is constant, the variability of Y at any value of x is the same.
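
To see the effect of the error variance concretely, here is a minimal Python sketch (assuming NumPy is available; the coefficient values are hypothetical, chosen only for illustration) that simulates observations from the model $Y = \beta_0 + \beta_1 x + \epsilon$ for a small and a large $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = 75.0, 15.0
x = np.linspace(0.9, 1.6, 200)

for sigma2 in (0.1, 5.0):                        # small vs. large error variance
    eps = rng.normal(0.0, np.sqrt(sigma2), x.size)
    y = beta0 + beta1 * x + eps                  # Y = beta0 + beta1*x + eps
    scatter = np.std(y - (beta0 + beta1 * x))    # spread of Y around the true line
    print(f"sigma^2 = {sigma2}: spread of Y around the line is about {scatter:.2f}")
```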

II. Simple Linear Regression

As mentioned earlier, the simple linear regression model considers only a single regressor or predictor x and a dependent or response variable Y.

We assume that each observation Y can be described by the model

$$Y = \beta_0 + \beta_1 x + \epsilon$$

where $\epsilon$ is the random error term with mean zero and (unknown) variance $\sigma^2$. The random errors corresponding to different observations are also assumed to be uncorrelated random variables.

The estimates of $\beta_0$ and $\beta_1$ should result in a line that is a best fit to the data. The German scientist Carl Friedrich Gauss described the best-fitting line as the one that minimizes the sum of squares of the vertical deviations of each actual observed data point from the expected value. This methodology is known as the method of least squares. Figure 2 shows the scatter plot from Figure 1 with the vertical deviations of each actual data point from a potential regression line.

Figure 2. Deviations of the data from the estimated regression
model

To estimate the values of $\beta_0$ and $\beta_1$ for a set of data values (x, y), we start with Equation 1-1:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n$$

The sum of the squares of the deviations of the observations from the true regression line is:

$$L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

The least squares estimators of $\beta_0$ and $\beta_1$, denoted $\hat{\beta}_0$ and $\hat{\beta}_1$, must satisfy:

$$\left.\frac{\partial L}{\partial \beta_0}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0$$

$$\left.\frac{\partial L}{\partial \beta_1}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) x_i = 0$$

Simplifying the two preceding partial derivatives, we get:

$$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i \qquad \text{(Equations 1-2 and 1-3)}$$

Equations 1-2 and 1-3 are referred to as the least squares normal equations. Solving these equations gives the estimates of our regression coefficients:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$

where $\bar{y}$ and $\bar{x}$ are the average values of your y and x values, respectively.

The fitted or estimated regression line is therefore:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

with each pair of observations satisfying the relationship:

$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i, \qquad i = 1, 2, \ldots, n$$

where $e_i = y_i - \hat{y}_i$ is called the residual. The residual describes the error in the fit of the model for each actual observation.

Notationally, it is convenient to give special symbols to the numerator and denominator of the equation for $\hat{\beta}_1$:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n} y_i (x_i - \bar{x}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

Therefore the equation for $\hat{\beta}_1$ can be rewritten as:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$
EXAMPLE: Fit a simple linear regression model to the oxygen
purity data in Table 1-1.

The following quantities may be computed from the given data in Table 1-1:

$$n = 20 \qquad \sum_{i=1}^{20} x_i = 23.92 \qquad \sum_{i=1}^{20} y_i = 1{,}843.21 \qquad \bar{x} = 1.196 \qquad \bar{y} = 92.1605$$

$$\sum_{i=1}^{20} y_i^2 = 170{,}044.5321 \qquad \sum_{i=1}^{20} x_i^2 = 29.2892 \qquad \sum_{i=1}^{20} x_i y_i = 2{,}214.6566$$

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 29.2892 - \frac{(23.92)^2}{20} = 0.68088$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 2{,}214.6566 - \frac{(23.92)(1{,}843.21)}{20} = 10.17744$$

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{10.17744}{0.68088} = 14.94748$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 92.1605 - (14.94748)(1.196) = 74.28331$$

The fitted simple linear regression model is therefore:

$$\hat{y} = 74.283 + 14.947x$$

Using the regression model above, we would predict an oxygen purity of $\hat{y} = 89.23\%$ when the hydrocarbon level is $x = 1.00\%$. The purity $\hat{y} = 89.23\%$ may be interpreted as an estimate of the true population mean purity when $x = 1.00\%$, or as an estimate of a new observation when $x = 1.00\%$. Such an estimate is, of course, subject to error; i.e., you cannot expect a future observation of purity at $x = 1.00\%$ to be exactly equal to 89.23%.
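
The computations above can be checked with a minimal Python sketch (assuming NumPy is available; the variable names are chosen here only for illustration), using the data from Table 1-1:

```python
import numpy as np

# Oxygen purity data from Table 1-1
x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

n = x.size
Sxx = np.sum(x**2) - np.sum(x)**2 / n              # S_xx
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n    # S_xy

b1 = Sxy / Sxx                                     # slope estimate, ~14.947
b0 = y.mean() - b1 * x.mean()                      # intercept estimate, ~74.283

print(f"Sxx = {Sxx:.5f}, Sxy = {Sxy:.5f}")
print(f"fitted line: y_hat = {b0:.3f} + {b1:.3f} x")
```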

Estimating $\sigma^2$

To estimate $\sigma^2$, we use the residuals $e_i = y_i - \hat{y}_i$. The sum of the squares of the residuals, which is often called the error sum of squares, is equal to

$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

which is used to estimate $\sigma^2$:

$$\hat{\sigma}^2 = \frac{SS_E}{n - 2}$$

However, computing $SS_E$ using the equation just presented can be tedious. Another way to compute $SS_E$ is through:

$$SS_E = SS_T - \hat{\beta}_1 S_{xy}, \qquad \text{where} \qquad SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$$
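
As a quick check of the shortcut formula, here is a short Python sketch (plain arithmetic, using the summary quantities from the worked example above):

```python
# Summary quantities from the oxygen purity example
n = 20
ybar = 92.1605
sum_y2 = 170_044.5321
b1 = 14.94748
Sxy = 10.17744

SST = sum_y2 - n * ybar**2     # total sum of squares
SSE = SST - b1 * Sxy           # error sum of squares via the shortcut formula
sigma2_hat = SSE / (n - 2)     # estimate of the error variance, roughly 1.18

print(f"SST = {SST:.3f}, SSE = {SSE:.3f}, sigma^2_hat = {sigma2_hat:.3f}")
```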

EXERCISES

1. An article in Concrete Research ("Near Surface Characteristics of Concrete: Intrinsic Permeability," Vol. 41, 1989) presented data on compressive strength x and intrinsic permeability y of various concrete mixes and cures. Summary quantities are n = 14, $\sum y_i = 572$, $\sum y_i^2 = 23{,}530$, $\sum x_i = 43$, $\sum x_i^2 = 157.42$, and $\sum x_i y_i = 1{,}697.80$. Assume that the two variables are related according to the simple linear regression model.

a. Calculate the least squares estimates of the slope and intercept.
b. Use the equation of the fitted line to predict what permeability would be observed when the compressive strength is x = 4.3.
c. Give a point estimate of the mean permeability when the compressive strength is x = 3.7.
d. Suppose that the observed value of permeability at x = 3.7 is y = 46.1. Calculate the value of the corresponding residual.

2. An article in Technometrics by S.C. Narula and J.F. Wellington ("Prediction, Linear Regression, and a Minimum Sum of Relative Errors," Vol. 19, 1977) presents data on the selling price and annual taxes for 24 houses. The data are given in the table immediately below.

Sale Price (in Taxes (in Sale Price (in Taxes (in
US$k) US$k) US$k) US$k)
25.9 4.9176 30.0 5.0500
29.5 5.0208 36.9 8.2464
27.9 4.5429 41.9 6.6969
25.9 4.5573 40.5 7.7841
29.9 5.0597 43.9 9.0384
29.9 3.8910 37.5 5.9894
30.9 5.8980 37.9 5.7422
28.9 5.6039 44.5 8.7951
35.9 5.8282 37.9 6.0831
31.5 5.3003 38.9 8.3607
31.0 6.2712 36.9 8.1400
30.9 5.9592 45.8 9.1416

a. Assuming that a simple linear regression model is appropriate, obtain the least squares fit relating selling price to taxes paid. What is the estimate of $\sigma^2$?
b. Find the mean selling price given that the taxes paid are x = 7.50.
c. Calculate the fitted value of y corresponding to x = 5.8980. Find the corresponding residual.

3. Suppose we wish to fit a regression model for which the true regression line passes through the point (0, 0). The appropriate model is $Y = \beta x + \epsilon$. Assume that we have n pairs of data. Find the least squares estimate of $\beta$.

III. Properties of the Least Squares Estimators

Recall that we have assumed that the error term $\epsilon$ in the model $Y = \beta_0 + \beta_1 x + \epsilon$ is a random variable with mean zero and variance $\sigma^2$. Because the values of x are fixed and Y is a random variable with mean $\mu_{Y|x} = \beta_0 + \beta_1 x$ and variance $\sigma^2$, the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ depend on the observed y's. Because of these properties, the estimate of $\sigma^2$ could be used to provide estimates of the variance of the slope and the intercept. The square roots of such variance estimators are known as the estimated standard errors of the slope and intercept, respectively.

In simple linear regression, the estimated standard error of the slope and the estimated standard error of the intercept are

$$se\left(\hat{\beta}_1\right) = \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \qquad \text{and} \qquad se\left(\hat{\beta}_0\right) = \sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}$$

respectively, where $\hat{\sigma}^2 = SS_E/(n-2)$ as mentioned previously.
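
For the oxygen purity example, these standard errors can be computed with a minimal Python sketch (plain arithmetic, using the quantities already obtained above):

```python
import math

# Quantities from the oxygen purity example
sigma2_hat, Sxx, n, xbar = 1.18, 0.68088, 20, 1.196

se_b1 = math.sqrt(sigma2_hat / Sxx)                      # estimated standard error of the slope
se_b0 = math.sqrt(sigma2_hat * (1 / n + xbar**2 / Sxx))  # estimated standard error of the intercept

print(f"se(beta1_hat) = {se_b1:.3f}, se(beta0_hat) = {se_b0:.3f}")
```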

IV. Hypothesis Tests in Simple Linear Regression

One of the important parts of assessing the adequacy of a linear regression model is testing statistical hypotheses about the model parameters and constructing confidence intervals. To test hypotheses about the slope and intercept of the regression model, the following assumptions need to be made:

1. The errors $\epsilon_i$ are normally and independently distributed
2. The errors $\epsilon_i$ have a mean equal to zero
3. The errors $\epsilon_i$ have a variance equal to $\sigma^2$

The three assumptions above are notationally abbreviated as NID(0, $\sigma^2$).
Use of t-Tests

Suppose we wish to test the hypothesis that the slope equals a constant, say $\beta_{1,0}$. The appropriate hypotheses are:

$$H_0: \beta_1 = \beta_{1,0}$$
$$H_1: \beta_1 \neq \beta_{1,0}$$

Review Question: Given the hypotheses above, do we have a one-sided test of hypothesis or a two-sided test of hypothesis?

The test statistic to be used for this case is:

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{\hat{\beta}_1 - \beta_{1,0}}{se\left(\hat{\beta}_1\right)}$$

which follows the t-distribution with $n - 2$ degrees of freedom. The critical region is given as:

$$|t_0| > t_{\alpha/2,\, n-2}$$

This means that if the absolute value of the computed test statistic is greater than the critical value $t_{\alpha/2,\, n-2}$, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

NOTE: We NEVER SAY we accept the null hypothesis or we


accept the alternative hypothesis. Saying these statements is a
major sin in Statistics. You will be laughed at and made fun of
by people who know and study Statistics if you say these things.

A similar method can be used to test hypotheses about the intercept:

$$H_0: \beta_0 = \beta_{0,0}$$
$$H_1: \beta_0 \neq \beta_{0,0}$$

For this case, we use the test statistic:

$$T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}} = \frac{\hat{\beta}_0 - \beta_{0,0}}{se\left(\hat{\beta}_0\right)}$$

The critical region is given as:

$$|t_0| > t_{\alpha/2,\, n-2}$$

As with the slope, if the absolute value of the computed test statistic is greater than the critical value $t_{\alpha/2,\, n-2}$, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

One special case of the tests of hypotheses that we have just discussed is the case wherein:

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \neq 0$$

The test of hypotheses above relates to what we term as significance of regression. If we fail to reject the null hypothesis for this case, this means that there is no linear relationship between x and Y. This conclusion implies that either:

1. x is of little value in explaining the variation in Y, or
2. the true relationship between x and Y is not linear

EXAMPLE: Let us test the significance of regression using the model for the oxygen purity data. Use $\alpha = 0.01$.

Step 1: Declare the null and alternative hypotheses:

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \neq 0$$

Step 2: Declare the significance level:

$$\alpha = 0.01$$

Step 3: Declare the test statistic:

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{\hat{\beta}_1 - \beta_{1,0}}{se\left(\hat{\beta}_1\right)}$$

Step 4: Compute the test statistic:

From the given data, the following estimates were computed:

$$\hat{\beta}_1 = 14.97 \qquad n = 20 \qquad S_{xx} = 0.68088 \qquad \hat{\sigma}^2 = 1.18$$

Therefore, computing the test statistic:

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{14.97 - 0}{\sqrt{1.18 / 0.68088}} = 11.35$$

Step 5: Declare the critical region:

$$|t_0| > t_{0.01/2,\, 20-2}$$
$$|t_0| > t_{0.005,\, 18}$$
$$|t_0| > 2.88$$

Step 6: State the result of your test:

Since the computed test statistic is within the critical region, i.e. 11.35 > 2.88, we reject the null hypothesis.

Step 7: State your conclusion:

The regression model is significant, i.e. the values of x can significantly explain the variability in Y.
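
For reference, the same test can be carried out with a short Python sketch (assuming SciPy is available; the variable names below are illustrative only):

```python
import math
from scipy import stats

# Estimates from the oxygen purity example
beta1_hat, sigma2_hat, Sxx, n = 14.97, 1.18, 0.68088, 20
alpha = 0.01

t0 = (beta1_hat - 0) / math.sqrt(sigma2_hat / Sxx)    # test statistic for H0: beta1 = 0
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)         # two-sided critical value t_{0.005, 18}
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)           # two-sided p-value

print(f"t0 = {t0:.2f}, critical value = {t_crit:.3f}, p-value = {p_value:.2e}")
print("Reject H0" if abs(t0) > t_crit else "Fail to reject H0")
```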

EXERCISES

4. Consider the data from Exercise #1.

a. Test for significance of regression using $\alpha = 0.05$.
b. Estimate $\sigma^2$ and the standard deviation of $\hat{\beta}_1$.
c. What is the standard error of the intercept in this model?

5. Regression methods were used to analyze the data from a study investigating the relationship between roadway surface temperature (x) and pavement deflection (y). Summary quantities were n = 20, $\sum y_i = 12.75$, $\sum y_i^2 = 8.86$, $\sum x_i = 1{,}478$, $\sum x_i^2 = 143{,}215.8$, $\sum x_i y_i = 1{,}083.67$.

a. Test for significance of regression using $\alpha = 0.05$.
b. Estimate the standard errors of the slope and intercept.

6. A rocket motor is manufactured by bonding together two types of propellants, an igniter and a sustainer. The shear strength of the bond y is thought to be a linear function of the age of the propellant x when the motor is cast. Data from twenty observations are shown below:

Observation  Strength   Age x     Observation  Strength   Age x
Number       y (psi)    (weeks)   Number       y (psi)    (weeks)
1 2158.70 15.50 11 2165.20 13.00
2 1678.15 23.75 12 2399.55 3.75
3 2316.00 8.00 13 1779.80 25.00
4 2061.30 17.00 14 2336.75 9.75
5 2207.50 5.00 15 1765.30 22.00
6 1708.30 19.00 16 2053.50 18.00
7 1784.70 24.00 17 2414.40 6.00
8 2575.00 2.50 18 2200.50 12.50
9 2357.90 7.50 19 2654.20 2.00
10 2277.70 11.00 20 1753.70 21.50

a. Test for the significance of regression with $\alpha = 0.01$.
b. Estimate the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$.
c. Test $H_0: \beta_1 = 30$ versus $H_1: \beta_1 \neq 30$ using $\alpha = 0.01$.
d. Test $H_0: \beta_0 = 0$ versus $H_1: \beta_0 \neq 0$ using $\alpha = 0.01$.
e. Test $H_0: \beta_0 = 2500$ versus $H_1: \beta_0 > 2500$ using $\alpha = 0.01$.

V. Confidence Intervals

Confidence Intervals on the Slope and Intercept

We already learned how to compute the point estimates of the slope and intercept, denoted by $\hat{\beta}_1$ and $\hat{\beta}_0$ respectively. However, we can also obtain confidence interval estimates of these parameters. The width of the confidence interval is a measure of the overall quality of the regression line.

Given the assumption that the observations are NID, a $100(1-\alpha)\%$ confidence interval on the slope $\beta_1$ in simple linear regression is

$$\hat{\beta}_1 - t_{\alpha/2,\, n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \;\leq\; \beta_1 \;\leq\; \hat{\beta}_1 + t_{\alpha/2,\, n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$$

Consequently, a $100(1-\alpha)\%$ confidence interval on the intercept $\beta_0$ is

$$\hat{\beta}_0 - t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]} \;\leq\; \beta_0 \;\leq\; \hat{\beta}_0 + t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}$$

EXAMPLE: Using the oxygen purity data, let us construct a 95% confidence interval on the slope of the regression line. Recalling that:

$$\hat{\beta}_1 = 14.97 \qquad S_{xx} = 0.68088 \qquad \hat{\sigma}^2 = 1.18$$

and using the equation for the confidence interval on the slope above, we get:

$$14.97 - 2.101\sqrt{\frac{1.18}{0.68088}} \;\leq\; \beta_1 \;\leq\; 14.97 + 2.101\sqrt{\frac{1.18}{0.68088}}$$

$$12.197 \leq \beta_1 \leq 17.697$$
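
The same intervals can be computed with a short Python sketch (assuming SciPy is available); the endpoints may differ slightly from the values above because of how $\hat{\beta}_1$ and $\hat{\sigma}^2$ are rounded:

```python
import math
from scipy import stats

# Quantities from the oxygen purity example
b1, b0 = 14.947, 74.283
sigma2_hat, Sxx, n, xbar = 1.18, 0.68088, 20, 1.196
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{0.025, 18}, about 2.101

half_b1 = t_crit * math.sqrt(sigma2_hat / Sxx)
half_b0 = t_crit * math.sqrt(sigma2_hat * (1 / n + xbar**2 / Sxx))

print(f"95% CI for beta1: ({b1 - half_b1:.3f}, {b1 + half_b1:.3f})")
print(f"95% CI for beta0: ({b0 - half_b0:.3f}, {b0 + half_b0:.3f})")
```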

Confidence Interval on the Mean Response

A confidence interval may be constructed on the mean response at a specified value of x, which can be denoted as $x_0$. This is often called a confidence interval about the regression line.

A $100(1-\alpha)\%$ confidence interval about the mean response at the value $x = x_0$, say $\mu_{Y|x_0}$, is given by

$$\hat{\mu}_{Y|x_0} - t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \;\leq\; \mu_{Y|x_0} \;\leq\; \hat{\mu}_{Y|x_0} + t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

where $\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$ is computed from the given regression model.

EXAMPLE: Let us construct a 95% confidence interval about the mean response for the oxygen purity data when x = 1.00%. The fitted model is $\hat{\mu}_{Y|x_0} = 74.283 + 14.947 x_0$. Using this model, we compute $\hat{\mu}_{Y|x_0}$:

$$\hat{\mu}_{Y|x_0} = 74.283 + 14.947(1.00) = 89.23$$

Using the computed value of $\hat{\mu}_{Y|x_0}$, we now construct the 95% confidence interval:

$$89.23 \pm 2.101\sqrt{1.18\left[\frac{1}{20} + \frac{(1.00 - 1.196)^2}{0.68088}\right]}$$

$$89.23 \pm 0.75$$

$$88.48 \leq \mu_{Y|1.00} \leq 89.98$$

VI. Prediction of New Observations

One important application of a regression model is predicting new or future observations of Y. Linear regression is one of the forecasting techniques that IEs usually employ. You will learn more about other forecasting techniques in your higher IE subjects.

If $x_0$ is the value of the regressor variable of interest,

$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$$

is the point estimator of the new or future value of the response $Y_0$.

Consider obtaining an interval estimate for this future observation $Y_0$. This new observation is independent of the observations used to develop the regression model. Thus, the confidence interval for $\mu_{Y|x_0}$ is inappropriate, since it is based only on the data used to fit the regression model. We therefore use a different equation to construct the interval for predicting new values.

A $100(1-\alpha)\%$ prediction interval on a future observation $Y_0$ at the value $x_0$ is given by:

$$\hat{y}_0 - t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \;\leq\; Y_0 \;\leq\; \hat{y}_0 + t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

EXAMPLE: Let us find a 95% prediction interval on the next observation of oxygen purity at $x_0 = 1.00\%$. Recall from the previous example that $\hat{y}_0 = 89.23$. Constructing the prediction interval gives us:

$$89.23 \pm 2.101\sqrt{1.18\left[1 + \frac{1}{20} + \frac{(1.00 - 1.196)^2}{0.68088}\right]}$$

$$86.83 \leq y_0 \leq 91.63$$
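
Both intervals at $x_0 = 1.00$ can be reproduced with a minimal Python sketch (assuming SciPy is available; small differences from the hand computation are due to rounding):

```python
import math
from scipy import stats

# Quantities from the oxygen purity example, evaluated at x0 = 1.00
y0_hat, sigma2_hat, Sxx, n, xbar, x0 = 89.23, 1.18, 0.68088, 20, 1.196, 1.00
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
core = 1 / n + (x0 - xbar) ** 2 / Sxx

ci_half = t_crit * math.sqrt(sigma2_hat * core)        # half-width of the CI on the mean response
pi_half = t_crit * math.sqrt(sigma2_hat * (1 + core))  # half-width of the prediction interval

print(f"95% CI for mean purity at x0 = 1.00: ({y0_hat - ci_half:.2f}, {y0_hat + ci_half:.2f})")
print(f"95% PI for a new observation:        ({y0_hat - pi_half:.2f}, {y0_hat + pi_half:.2f})")
```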

EXERCISES

7. Refer to the data in Exercise #1. Find a 95% confidence interval on each of the following:
a. Slope
b. Intercept
c. Mean permeability when x = 2.5
d. Find a 95% prediction interval on permeability when x = 2.5

8. The number of pounds of steam used per month by a chemical plant is thought to be related to the average ambient temperature (in °F) for that month. The past year's usage and temperature are shown in the table below:

Month      Temp   Usage/1000   Month       Temp   Usage/1000


January 21 185.79 July 68 621.55
February 24 214.47 August 74 675.06
March 32 288.03 September 62 562.03
April 47 424.84 October 50 452.93
May 50 454.84 November 41 369.95
June 59 539.03 December 30 273.98

a. Find a 99% confidence interval for $\beta_1$.
b. Find a 99% confidence interval for $\beta_0$.
c. Find a 95% confidence interval on mean steam usage when the average temperature is 55°F.
d. Find a 95% prediction interval on steam usage when the temperature is 55°F.

VII. Adequacy of the Regression Model

In using regression models, IEs should always consider the validity of the assumptions used and conduct analyses to examine the adequacy of the regression model.

There are two ways to assess the adequacy of the regression model:

1. Residual Analysis
2. Coefficient of Determination (R²)

Residual Analysis

The residuals from a regression model are $e_i = y_i - \hat{y}_i$, where $y_i$ is an actual observation and $\hat{y}_i$ is the corresponding fitted value from the regression model. To check whether the regression model is adequate through residual analysis, the residuals should approximate a normal distribution with constant variance.

To check for normality, the experimenter can construct a frequency histogram of the residuals or a normal probability plot of residuals.

As can be seen from the two immediately preceding figures, we can infer that the residuals approximate a normal distribution. The first graph plots the residuals against the predicted values; the random pattern evident in the figure implies that the residuals approximate a normal distribution. The second figure, on the other hand, is a normal probability plot of the residuals; the residuals fall approximately along a straight line, which implies that there is no severe departure from normality.

To summarize:

In residual analysis, the residuals approximate the normal distribution if:

1. in a residual plot against predicted values, a random pattern is evident
2. in a normal probability plot of residuals, the residuals fall approximately along a straight line

If the residuals are shown to approximate the normal distribution, then the regression model is adequate in relation to its error terms.
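
A minimal Python sketch of this residual check for the oxygen purity data (assuming NumPy and SciPy are available) might look as follows; the Shapiro-Wilk test is an extra numerical check beyond the graphical tools discussed above:

```python
import numpy as np
from scipy import stats

# Oxygen purity data from Table 1-1
x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

b1, b0 = np.polyfit(x, y, 1)           # least squares slope and intercept
residuals = y - (b0 + b1 * x)          # e_i = y_i - y_hat_i

# Normal probability plot data: r close to 1 suggests no severe departure from normality
(osm, osr), (slope, intercept, r) = stats.probplot(residuals)
print(f"correlation of ordered residuals with normal quantiles: r = {r:.3f}")

# Shapiro-Wilk test as an additional normality check
w, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W = {w:.3f}, p-value = {p:.3f}")
```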

Coefficient of Determination (R²)

The quantity

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$

is called the coefficient of determination, which is often used to judge the adequacy of a regression model. The range of values of the coefficient of determination is:

$$0 \leq R^2 \leq 1$$

R² is often referred to loosely as the amount of variability in the data explained or accounted for by the regression model. For the oxygen purity model, we have $R^2 = 0.877$; this means that the regression model accounts for 87.7% of the variability in the data.
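
Using the summary quantities from the earlier example, a short Python sketch (plain arithmetic) reproduces this value:

```python
# Summary quantities from the oxygen purity example
n, ybar, sum_y2 = 20, 92.1605, 170_044.5321
b1, Sxy = 14.94748, 10.17744

SST = sum_y2 - n * ybar**2   # total variability in y
SSE = SST - b1 * Sxy         # unexplained (error) variability
R2 = 1 - SSE / SST           # coefficient of determination, roughly 0.877

print(f"R^2 = {R2:.3f}")
```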

There are some misconceptions about R², though:

1. it does not measure the magnitude of the slope of the regression line
2. a large value of R² does not imply a steep slope
3. R² does not measure the appropriateness of the model, since it can be artificially inflated
4. even if R² is large, it is not an assurance that the regression model will provide accurate predictions of future observations

EXERCISE: 9. Refer to the data in Exercise #2. Find the residuals for the least squares model and determine the proportion of total variability explained by the regression model.

VIII. Transformations to a Straight Line

There are situations wherein the true regression function is nonlinear. However, there are times when we can express such nonlinear functions in linear form. Nonlinear regression functions that can be transformed to linear regression functions are called intrinsically linear.

Consider the exponential function:

$$Y = \beta_0 e^{\beta_1 x}\,\epsilon$$

This function is intrinsically linear since it can be transformed to a straight line by a logarithmic transformation:

$$\ln Y = \ln \beta_0 + \beta_1 x + \ln \epsilon$$

The transformed linear regression model above works under the assumption that the transformed error terms $\ln \epsilon$ are NID(0, $\sigma^2$).

Another intrinsically linear function is:

$$Y = \beta_0 + \beta_1\left(\frac{1}{x}\right) + \epsilon$$

If we let $z = \frac{1}{x}$, the model is linearized to:

$$Y = \beta_0 + \beta_1 z + \epsilon$$

Sometimes several transformations can be employed jointly to linearize a function. Consider the function:

$$Y = \frac{1}{\exp(\beta_0 + \beta_1 x + \epsilon)}$$

If we let $Y^* = \frac{1}{Y}$, we have the linearized form:

$$\ln Y^* = \beta_0 + \beta_1 x + \epsilon$$
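
To illustrate the first transformation, here is a minimal Python sketch (assuming NumPy; the parameter values and simulated data are hypothetical, for illustration only) that fits the exponential model by regressing $\ln Y$ on x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only
beta0_true, beta1_true = 2.0, 0.8

x = np.linspace(0.5, 3.0, 40)
eps = np.exp(rng.normal(0.0, 0.1, size=x.size))     # multiplicative error: ln(eps) ~ N(0, 0.01)
y = beta0_true * np.exp(beta1_true * x) * eps       # intrinsically linear exponential model

# Fit the transformed (linear) model: ln Y = ln(beta0) + beta1 * x + ln(eps)
b1_hat, ln_b0_hat = np.polyfit(x, np.log(y), 1)
print(f"beta1_hat = {b1_hat:.3f}, beta0_hat = {np.exp(ln_b0_hat):.3f}")
```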

IX. Correlation

The discussion in the previous sections treated x as a mathematical variable and Y as a random variable. However, most regression analyses involve situations in which both X and Y are random variables!

For example, suppose we wish to develop a regression model relating the shear strength of spot welds (Y) to the weld diameter (X). In this example, weld diameter cannot be controlled, making X a random variable. Each pair of observations $(X_i, Y_i)$ is then treated as jointly distributed random variables.

The estimators of the intercept and slope when both X and Y are random variables are identical to what was already discussed. An additional estimator, though, can be computed for the case when both X and Y are random variables, and that is the estimator of the correlation coefficient, $\rho$. Recall that the correlation coefficient $\rho$ is defined as:

$$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

The estimator of $\rho$ is the sample correlation coefficient, R:

$$R = \frac{\displaystyle\sum_{i=1}^{n} Y_i\left(X_i - \bar{X}\right)}{\left[\displaystyle\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2\right]^{1/2}} = \frac{S_{XY}}{\sqrt{S_{XX}\, SS_T}}$$

so that

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$

Note that the sample correlation coefficient R is just the square root of the coefficient of determination, R².

It is often useful to test the hypotheses

$$H_0: \rho = 0$$
$$H_1: \rho \neq 0$$

wherein the appropriate test statistic for such is

$$T_0 = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}$$

which has the t-distribution with $n - 2$ degrees of freedom. The null hypothesis is to be rejected if $|t_0| > t_{\alpha/2,\, n-2}$.

Another test of hypotheses that we can do is

$$H_0: \rho = \rho_0$$
$$H_1: \rho \neq \rho_0$$

For a sample size greater than or equal to 25 ($n \geq 25$), the test statistic to be used is

$$Z_0 = \left(\operatorname{arctanh} R - \operatorname{arctanh} \rho_0\right)\sqrt{n-3}$$

We reject the null hypothesis if the value of the test statistic falls within the critical region $|z_0| > z_{\alpha/2}$.

It is also possible to construct an approximate $100(1-\alpha)\%$ confidence interval for $\rho$:

$$\tanh\left(\operatorname{arctanh} r - \frac{z_{\alpha/2}}{\sqrt{n-3}}\right) \;\leq\; \rho \;\leq\; \tanh\left(\operatorname{arctanh} r + \frac{z_{\alpha/2}}{\sqrt{n-3}}\right)$$
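
These correlation procedures can be sketched in a few lines of Python (assuming SciPy is available; the values of r and n below are hypothetical, chosen only for illustration):

```python
import math
from scipy import stats

# Hypothetical values, for illustration only: sample correlation r from n paired observations
r, n, alpha = 0.8, 30, 0.05

# t-test of H0: rho = 0
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)

# Approximate CI for rho via the arctanh (Fisher z) transformation
z_crit = stats.norm.ppf(1 - alpha / 2)
lo = math.tanh(math.atanh(r) - z_crit / math.sqrt(n - 3))
hi = math.tanh(math.atanh(r) + z_crit / math.sqrt(n - 3))

print(f"t0 = {t0:.2f}, p-value = {p_value:.4f}")
print(f"{100 * (1 - alpha):.0f}% CI for rho: ({lo:.3f}, {hi:.3f})")
```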

EXERCISES

10. The final test and exam averages for 20 randomly selected
students taking a course in engineering statistics and a course in
operations research follow. Assume that the final averages are
jointly normally distributed.

Statistics 86 75 69 75 90
OR 80 81 75 81 82

Statistics 94 83 86 71 65
OR 95 80 81 76 72

Statistics 84 71 62 90 83
OR 85 72 65 93 81

Statistics 75 71 76 84 97
OR 70 73 72 80 98

a. Find the regression line relating the statistics final average to the OR final average.
b. Test for significance of regression using $\alpha = 0.05$.
c. Estimate the correlation coefficient.
d. Test the hypothesis that $\rho = 0$, using $\alpha = 0.05$.
e. Test the hypothesis that $\rho = 0.5$, using $\alpha = 0.05$.
f. Construct a 95% confidence interval for the correlation coefficient.

11. A random sample of 50 observations was made on the diameter of spot welds and the corresponding weld shear strength.

a. Given that r = 0.62, test the hypothesis that $\rho = 0$, using $\alpha = 0.01$.
b. Find a 99% confidence interval for $\rho$.
c. Based on the confidence interval in letter (b), can you conclude that $\rho = 0.5$ at the 0.01 level of significance?
