
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES

COLLEGE OF ENGINEERING
DEPARTMENT OF INDUSTRIAL ENGINEERING
MODULE 5
SIMPLE LINEAR REGRESSION AND CORRELATION

I. Empirical Models

Many problems in Industrial Engineering involve exploring the relationships between two or more variables. IEs use regression analysis as a statistical technique to help understand such relationships. The following are some uses of regression analysis:

1. To help build a model to predict an output given certain inputs
2. To help in process optimization of an output given varying levels of input
3. To help in process control purposes

Consider the data in Table 1-1 below.

In Table 1-1, y is the purity of oxygen produced in a chemical distillation process, while x is the percentage of hydrocarbons that are present in the main condenser of the distillation unit. The figure immediately below Table 1-1 is called a scatter diagram; it represents a graph on which each $(x_i, y_i)$ pair is represented as a single point.

Looking at the scatter diagram, we can see that although no straight line will pass exactly through all the points, there is a strong indication that the points lie scattered randomly around a straight line.

It is therefore reasonable to assume that the mean of the random variable Y is related to x by the following straight-line relationship:

$$E(Y \mid x) = \mu_{Y|x} = \beta_0 + \beta_1 x$$

Table 1-1: Oxygen and Hydrocarbon Levels
Observation Hydrocarbon Level Purity
Number x (%) y (%)
1 0.99 90.01
2 1.02 89.05
3 1.15 91.43
4 1.29 93.74
5 1.46 96.73
6 1.36 94.45
7 0.87 87.59
8 1.23 91.77
9 1.55 99.42
10 1.40 93.65
11 1.19 93.54
12 1.15 92.52
13 0.98 90.56
14 1.01 89.54
15 1.11 89.85
16 1.20 90.39
17 1.26 93.25
18 1.32 93.41
19 1.43 94.98
20 0.95 87.33
Figure 1. Scatter Diagram of Oxygen Purity vs Hydrocarbon
Level
where $\beta_0$ and $\beta_1$ are referred to as the intercept and slope of the line, respectively; collectively, they are called the regression coefficients.

While the mean of Y is a linear function of x, the actual observed value y does not fall exactly on a straight line. To generalize this, we assume that the expected value of Y is a linear function of x, but that for a fixed value of x, the actual value of Y is determined by the mean value function (the linear model) plus a random error term:

$$Y = \beta_0 + \beta_1 x + \epsilon \qquad \text{(Equation 1-1)}$$

where $\epsilon$ is the random error term. Equation 1-1 is what is commonly known as the simple linear regression model; this is because the model above has only one independent variable (x), or regressor.

The simple linear regression model is an example of an empirical model, i.e. a simplified representation of a system or phenomenon that is based on experience or experimentation.

Y, being a dependent variable, has certain properties: it can be described via its mean and variance. Let us assume that the mean and variance of $\epsilon$ are 0 and $\sigma^2$, respectively, and that the value of x is fixed, i.e. constant. Using the theorems from your previous probability and statistics course:

$$E(Y \mid x) = E(\beta_0 + \beta_1 x + \epsilon) = \beta_0 + \beta_1 x + E(\epsilon) = \beta_0 + \beta_1 x$$

$$V(Y \mid x) = V(\beta_0 + \beta_1 x + \epsilon) = V(\beta_0 + \beta_1 x) + V(\epsilon) = 0 + \sigma^2 = \sigma^2$$

Thus, in the true regression model, the height of the regression line at any value of x is just the expected value of Y for that x, while the slope $\beta_1$ can be interpreted as the change in the mean of Y for a unit change in x. Also, the variability of Y at a particular value of x is determined by the error variance $\sigma^2$.

If the variability of Y is determined by the error variance $\sigma^2$, what does this imply? It implies that when $\sigma^2$ is small, the observed values of Y will fall close to the regression line; and when $\sigma^2$ is large, the observed values of Y may be significantly far away from the regression line. And since $\sigma^2$ is constant, the variability of Y at any value of x is the same.
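
To see the effect of the error variance concretely, here is a minimal Python sketch (assuming NumPy is available; the coefficient values are hypothetical, chosen only for illustration) that simulates observations from the model $Y = \beta_0 + \beta_1 x + \epsilon$ for a small and a large $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = 75.0, 15.0
x = np.linspace(0.9, 1.6, 200)

for sigma2 in (0.1, 5.0):                        # small vs. large error variance
    eps = rng.normal(0.0, np.sqrt(sigma2), x.size)
    y = beta0 + beta1 * x + eps                  # Y = beta0 + beta1*x + eps
    scatter = np.std(y - (beta0 + beta1 * x))    # spread of Y around the true line
    print(f"sigma^2 = {sigma2}: spread of Y around the line is about {scatter:.2f}")
```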

II. Simple Linear Regression

As mentioned earlier, the simple linear regression model considers only a single regressor or predictor x and a dependent or response variable Y.

We assume that each observation Y can be described by the model

$$Y = \beta_0 + \beta_1 x + \epsilon$$

where $\epsilon$ is the random error term with mean zero and (unknown) variance $\sigma^2$. The random errors corresponding to different observations are also assumed to be uncorrelated random variables.

The estimates of $\beta_0$ and $\beta_1$ should result in a line that is a best fit to the data. The German scientist Carl Friedrich Gauss described the best-fitting line as the one that minimizes the sum of squares of the vertical deviations of each actual observed data point from the expected value. This methodology is known as the method of least squares. Figure 2 shows the scatter plot from Figure 1 with the vertical deviations of each actual data point from a potential regression line.

Figure 2. Deviations of the data from the estimated regression
model

To estimate the values of $\beta_0$ and $\beta_1$ for a set of data values (x, y), we start with Equation 1-1:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n$$

The sum of the squares of the deviations of the observations from the true regression line is:

$$L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

The least squares estimators of $\beta_0$ and $\beta_1$, denoted $\hat{\beta}_0$ and $\hat{\beta}_1$, must satisfy:

$$\left.\frac{\partial L}{\partial \beta_0}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0$$

$$\left.\frac{\partial L}{\partial \beta_1}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) x_i = 0$$

Simplifying the two preceding partial derivatives, we get:

$$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i \qquad \text{(Equations 1-2 and 1-3)}$$

Equations 1-2 and 1-3 are referred to as the least squares normal equations. Solving these equations gives the estimates of our regression coefficients:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$

where $\bar{y}$ and $\bar{x}$ are the average values of your y and x values, respectively.

The fitted or estimated regression line is therefore:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

with each pair of observations satisfying the relationship:

$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i, \qquad i = 1, 2, \ldots, n$$

where $e_i = y_i - \hat{y}_i$ is called the residual. The residual describes the error in the fit of the model for each actual observation.

Notationally, it is convenient to give special symbols to the numerator and denominator of the equation for $\hat{\beta}_1$:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n} y_i (x_i - \bar{x}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

Therefore the equation for $\hat{\beta}_1$ can be rewritten as:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$
EXAMPLE: Fit a simple linear regression model to the oxygen
purity data in Table 1-1.

The following quantities may be computed from the given data in Table 1-1:

$$n = 20 \qquad \sum_{i=1}^{20} x_i = 23.92 \qquad \sum_{i=1}^{20} y_i = 1{,}843.21 \qquad \bar{x} = 1.196 \qquad \bar{y} = 92.1605$$

$$\sum_{i=1}^{20} y_i^2 = 170{,}044.5321 \qquad \sum_{i=1}^{20} x_i^2 = 29.2892 \qquad \sum_{i=1}^{20} x_i y_i = 2{,}214.6566$$

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 29.2892 - \frac{(23.92)^2}{20} = 0.68088$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 2{,}214.6566 - \frac{(23.92)(1{,}843.21)}{20} = 10.17744$$

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{10.17744}{0.68088} = 14.94748$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 92.1605 - (14.94748)(1.196) = 74.28331$$

The fitted simple linear regression model is therefore:

$$\hat{y} = 74.283 + 14.947x$$

Using the regression model above, we would predict an oxygen purity of $\hat{y} = 89.23\%$ when the hydrocarbon level is $x = 1.00\%$. The purity $\hat{y} = 89.23\%$ may be interpreted as an estimate of the true population mean purity when $x = 1.00\%$, or as an estimate of a new observation when $x = 1.00\%$. Such an estimate is, of course, subject to error; i.e., you cannot expect a future observation of purity at $x = 1.00\%$ to be exactly equal to 89.23%.
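
The computations above can be checked with a minimal Python sketch (assuming NumPy is available; the variable names are chosen here only for illustration), using the data from Table 1-1:

```python
import numpy as np

# Oxygen purity data from Table 1-1
x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

n = x.size
Sxx = np.sum(x**2) - np.sum(x)**2 / n              # S_xx
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n    # S_xy

b1 = Sxy / Sxx                                     # slope estimate, ~14.947
b0 = y.mean() - b1 * x.mean()                      # intercept estimate, ~74.283

print(f"Sxx = {Sxx:.5f}, Sxy = {Sxy:.5f}")
print(f"fitted line: y_hat = {b0:.3f} + {b1:.3f} x")
```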

Estimating $\sigma^2$

To estimate $\sigma^2$, we use the residuals $e_i = y_i - \hat{y}_i$. The sum of the squares of the residuals, which is often called the error sum of squares, is equal to

$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

which is used to estimate $\sigma^2$:

$$\hat{\sigma}^2 = \frac{SS_E}{n - 2}$$

However, computing $SS_E$ using the equation just presented can be tedious. Another way to compute $SS_E$ is through:

$$SS_E = SS_T - \hat{\beta}_1 S_{xy}, \qquad \text{where} \qquad SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$$
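
As a quick check of the shortcut formula, here is a short Python sketch (plain arithmetic, using the summary quantities from the worked example above):

```python
# Summary quantities from the oxygen purity example
n = 20
ybar = 92.1605
sum_y2 = 170_044.5321
b1 = 14.94748
Sxy = 10.17744

SST = sum_y2 - n * ybar**2     # total sum of squares
SSE = SST - b1 * Sxy           # error sum of squares via the shortcut formula
sigma2_hat = SSE / (n - 2)     # estimate of the error variance, roughly 1.18

print(f"SST = {SST:.3f}, SSE = {SSE:.3f}, sigma^2_hat = {sigma2_hat:.3f}")
```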

EXERCISES

1. An article in Concrete Research ("Near Surface Characteristics of Concrete: Intrinsic Permeability," Vol. 41, 1989) presented data on compressive strength x and intrinsic permeability y of various concrete mixes and cures. Summary quantities are n = 14, $\sum y_i = 572$, $\sum y_i^2 = 23{,}530$, $\sum x_i = 43$, $\sum x_i^2 = 157.42$, and $\sum x_i y_i = 1{,}697.80$. Assume that the two variables are related according to the simple linear regression model.

a. Calculate the least squares estimates of the slope and intercept.
b. Use the equation of the fitted line to predict what permeability would be observed when the compressive strength is x = 4.3.
c. Give a point estimate of the mean permeability when the compressive strength is x = 3.7.
d. Suppose that the observed value of permeability at x = 3.7 is y = 46.1. Calculate the value of the corresponding residual.

2. An article in Technometrics by S.C. Narula and J.F. Wellington ("Prediction, Linear Regression, and a Minimum Sum of Relative Errors," Vol. 19, 1977) presents data on the selling price and annual taxes for 24 houses. The data are given in the table immediately below.

Sale Price (in Taxes (in Sale Price (in Taxes (in
US$k) US$k) US$k) US$k)
25.9 4.9176 30.0 5.0500
29.5 5.0208 36.9 8.2464
27.9 4.5429 41.9 6.6969
25.9 4.5573 40.5 7.7841
29.9 5.0597 43.9 9.0384
29.9 3.8910 37.5 5.9894
30.9 5.8980 37.9 5.7422
28.9 5.6039 44.5 8.7951
35.9 5.8282 37.9 6.0831
31.5 5.3003 38.9 8.3607
31.0 6.2712 36.9 8.1400
30.9 5.9592 45.8 9.1416

a. Assuming that a simple linear regression model is appropriate, obtain the least squares fit relating selling price to taxes paid. What is the estimate of $\sigma^2$?
b. Find the mean selling price given that the taxes paid are x = 7.50.
c. Calculate the fitted value of y corresponding to x = 5.8980. Find the corresponding residual.

3. Suppose we wish to fit a regression model for which the true regression line passes through the point (0, 0). The appropriate model is $Y = \beta x + \epsilon$. Assume that we have n pairs of data. Find the least squares estimate of $\beta$.

III. Properties of the Least Squares Estimators

Recall that we have assumed that the error term $\epsilon$ in the model $Y = \beta_0 + \beta_1 x + \epsilon$ is a random variable with mean zero and variance $\sigma^2$. Because the values of x are fixed and Y is a random variable with mean $\mu_{Y|x} = \beta_0 + \beta_1 x$ and variance $\sigma^2$, the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ depend on the observed y's. Because of these properties, the estimate of $\sigma^2$ could be used to provide estimates of the variance of the slope and the intercept. The square roots of such variance estimators are known as the estimated standard errors of the slope and intercept, respectively.

In simple linear regression, the estimated standard error of the slope and the estimated standard error of the intercept are

$$se\left(\hat{\beta}_1\right) = \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \qquad \text{and} \qquad se\left(\hat{\beta}_0\right) = \sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}$$

respectively, where $\hat{\sigma}^2 = SS_E/(n-2)$ as mentioned previously.
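
For the oxygen purity example, these standard errors can be computed with a minimal Python sketch (plain arithmetic, using the quantities already obtained above):

```python
import math

# Quantities from the oxygen purity example
sigma2_hat, Sxx, n, xbar = 1.18, 0.68088, 20, 1.196

se_b1 = math.sqrt(sigma2_hat / Sxx)                      # estimated standard error of the slope
se_b0 = math.sqrt(sigma2_hat * (1 / n + xbar**2 / Sxx))  # estimated standard error of the intercept

print(f"se(beta1_hat) = {se_b1:.3f}, se(beta0_hat) = {se_b0:.3f}")
```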

IV. Hypothesis Tests in Simple Linear Regression

One of the important parts of assessing the adequacy of a linear regression model is testing statistical hypotheses about the model parameters and constructing confidence intervals. To test hypotheses about the slope and intercept of the regression model, the following assumptions need to be made:

1. The errors $\epsilon_i$ are normally and independently distributed
2. The errors $\epsilon_i$ have a mean equal to zero
3. The errors $\epsilon_i$ have a variance equal to $\sigma^2$

The three assumptions above are notationally abbreviated as NID(0, $\sigma^2$).
Use of t-Tests

Suppose we wish to test the hypothesis that the slope equals a constant, say $\beta_{1,0}$. The appropriate hypotheses are:

$$H_0: \beta_1 = \beta_{1,0}$$
$$H_1: \beta_1 \neq \beta_{1,0}$$

Review Question: Given the hypotheses above, do we have a one-sided test of hypothesis or a two-sided test of hypothesis?

The test statistic to be used for this case is:

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{\hat{\beta}_1 - \beta_{1,0}}{se\left(\hat{\beta}_1\right)}$$

which follows the t-distribution with $n - 2$ degrees of freedom. The critical region is given as:

$$|t_0| > t_{\alpha/2,\, n-2}$$

This means that if the absolute value of the computed test statistic is greater than the critical value $t_{\alpha/2,\, n-2}$, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

NOTE: We NEVER SAY we accept the null hypothesis or we


accept the alternative hypothesis. Saying these statements is a
major sin in Statistics. You will be laughed at and made fun of
by people who know and study Statistics if you say these things.

A similar method can be used to test hypotheses about the intercept:

$$H_0: \beta_0 = \beta_{0,0}$$
$$H_1: \beta_0 \neq \beta_{0,0}$$

For this case, we use the test statistic:

$$T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}} = \frac{\hat{\beta}_0 - \beta_{0,0}}{se\left(\hat{\beta}_0\right)}$$

The critical region is given as:

$$|t_0| > t_{\alpha/2,\, n-2}$$

As with the slope, if the absolute value of the computed test statistic is greater than the critical value $t_{\alpha/2,\, n-2}$, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

One special case of the tests of hypotheses that we have just discussed is the case wherein:

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \neq 0$$

The test of hypotheses above relates to what we term as significance of regression. If we fail to reject the null hypothesis for this case, this means that there is no linear relationship between x and Y. This conclusion implies that either:

1. x is of little value in explaining the variation in Y, or
2. the true relationship between x and Y is not linear

EXAMPLE: Let us test the significance of regression using the model for the oxygen purity data. Use $\alpha = 0.01$.

Step 1: Declare the null and alternative hypotheses:

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \neq 0$$

Step 2: Declare the significance level:

$$\alpha = 0.01$$

Step 3: Declare the test statistic:

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{\hat{\beta}_1 - \beta_{1,0}}{se\left(\hat{\beta}_1\right)}$$

Step 4: Compute the test statistic:

From the given data, the following estimates were computed:

$$\hat{\beta}_1 = 14.97 \qquad n = 20 \qquad S_{xx} = 0.68088 \qquad \hat{\sigma}^2 = 1.18$$

Therefore, computing the test statistic:

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{14.97 - 0}{\sqrt{1.18 / 0.68088}} = 11.35$$

Step 5: Declare the critical region:

$$|t_0| > t_{0.01/2,\, 20-2}$$
$$|t_0| > t_{0.005,\, 18}$$
$$|t_0| > 2.88$$

Step 6: State the result of your test:

Since the computed test statistic is within the critical region, i.e. 11.35 > 2.88, we reject the null hypothesis.

Step 7: State your conclusion:

The regression model is significant, i.e. the values of x can significantly explain the variability in Y.
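
For reference, the same test can be carried out with a short Python sketch (assuming SciPy is available; the variable names below are illustrative only):

```python
import math
from scipy import stats

# Estimates from the oxygen purity example
beta1_hat, sigma2_hat, Sxx, n = 14.97, 1.18, 0.68088, 20
alpha = 0.01

t0 = (beta1_hat - 0) / math.sqrt(sigma2_hat / Sxx)    # test statistic for H0: beta1 = 0
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)         # two-sided critical value t_{0.005, 18}
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)           # two-sided p-value

print(f"t0 = {t0:.2f}, critical value = {t_crit:.3f}, p-value = {p_value:.2e}")
print("Reject H0" if abs(t0) > t_crit else "Fail to reject H0")
```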

EXERCISES

4. Consider the data from Exercise #1.

a. Test for significance of regression using $\alpha = 0.05$.
b. Estimate $\sigma^2$ and the standard deviation of $\hat{\beta}_1$.
c. What is the standard error of the intercept in this model?

5. Regression methods were used to analyze the data from a study investigating the relationship between roadway surface temperature (x) and pavement deflection (y). Summary quantities were n = 20, $\sum y_i = 12.75$, $\sum y_i^2 = 8.86$, $\sum x_i = 1{,}478$, $\sum x_i^2 = 143{,}215.8$, $\sum x_i y_i = 1{,}083.67$.

a. Test for significance of regression using $\alpha = 0.05$.
b. Estimate the standard errors of the slope and intercept.

6. A rocket motor is manufactured by bonding together two types of propellants, an igniter and a sustainer. The shear strength of the bond y is thought to be a linear function of the age of the propellant x when the motor is cast. Data from twenty observations are shown below:

Observation  Strength   Age x     Observation  Strength   Age x
Number       y (psi)    (weeks)   Number       y (psi)    (weeks)
1 2158.70 15.50 11 2165.20 13.00
2 1678.15 23.75 12 2399.55 3.75
3 2316.00 8.00 13 1779.80 25.00
4 2061.30 17.00 14 2336.75 9.75
5 2207.50 5.00 15 1765.30 22.00
6 1708.30 19.00 16 2053.50 18.00
7 1784.70 24.00 17 2414.40 6.00
8 2575.00 2.50 18 2200.50 12.50
9 2357.90 7.50 19 2654.20 2.00
10 2277.70 11.00 20 1753.70 21.50

a. Test for the significance of regression with $\alpha = 0.01$.
b. Estimate the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$.
c. Test $H_0: \beta_1 = 30$ versus $H_1: \beta_1 \neq 30$ using $\alpha = 0.01$.
d. Test $H_0: \beta_0 = 0$ versus $H_1: \beta_0 \neq 0$ using $\alpha = 0.01$.
e. Test $H_0: \beta_0 = 2500$ versus $H_1: \beta_0 > 2500$ using $\alpha = 0.01$.

V. Confidence Intervals

Confidence Intervals on the Slope and Intercept

We already learned how to compute the point estimates of the slope and intercept, denoted by $\hat{\beta}_1$ and $\hat{\beta}_0$ respectively. However, we can also obtain confidence interval estimates of these parameters. The width of the confidence interval is a measure of the overall quality of the regression line.

Given the assumption that the observations are NID, a $100(1-\alpha)\%$ confidence interval on the slope $\beta_1$ in simple linear regression is

$$\hat{\beta}_1 - t_{\alpha/2,\, n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \;\leq\; \beta_1 \;\leq\; \hat{\beta}_1 + t_{\alpha/2,\, n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$$

Consequently, a $100(1-\alpha)\%$ confidence interval on the intercept $\beta_0$ is

$$\hat{\beta}_0 - t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]} \;\leq\; \beta_0 \;\leq\; \hat{\beta}_0 + t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}$$

EXAMPLE: Using the oxygen purity data, let us construct a 95% confidence interval on the slope of the regression line. Recalling that:

$$\hat{\beta}_1 = 14.97 \qquad S_{xx} = 0.68088 \qquad \hat{\sigma}^2 = 1.18$$

and using the equation for the confidence interval on the slope above, we get:

$$14.97 - 2.101\sqrt{\frac{1.18}{0.68088}} \;\leq\; \beta_1 \;\leq\; 14.97 + 2.101\sqrt{\frac{1.18}{0.68088}}$$

$$12.197 \leq \beta_1 \leq 17.697$$
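
The same intervals can be computed with a short Python sketch (assuming SciPy is available); the endpoints may differ slightly from the values above because of how $\hat{\beta}_1$ and $\hat{\sigma}^2$ are rounded:

```python
import math
from scipy import stats

# Quantities from the oxygen purity example
b1, b0 = 14.947, 74.283
sigma2_hat, Sxx, n, xbar = 1.18, 0.68088, 20, 1.196
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{0.025, 18}, about 2.101

half_b1 = t_crit * math.sqrt(sigma2_hat / Sxx)
half_b0 = t_crit * math.sqrt(sigma2_hat * (1 / n + xbar**2 / Sxx))

print(f"95% CI for beta1: ({b1 - half_b1:.3f}, {b1 + half_b1:.3f})")
print(f"95% CI for beta0: ({b0 - half_b0:.3f}, {b0 + half_b0:.3f})")
```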

Confidence Interval on the Mean Response

A confidence interval may be constructed on the mean response at a specified value of x, which can be denoted as $x_0$. This is often called a confidence interval about the regression line.

A $100(1-\alpha)\%$ confidence interval about the mean response at the value $x = x_0$, say $\mu_{Y|x_0}$, is given by

$$\hat{\mu}_{Y|x_0} - t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \;\leq\; \mu_{Y|x_0} \;\leq\; \hat{\mu}_{Y|x_0} + t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

where $\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$ is computed from the given regression model.

EXAMPLE: Let us construct a 95% confidence interval about the mean response for the oxygen purity data when x = 1.00%. The fitted model is $\hat{\mu}_{Y|x_0} = 74.283 + 14.947 x_0$. Using this model, we compute $\hat{\mu}_{Y|x_0}$:

$$\hat{\mu}_{Y|x_0} = 74.283 + 14.947(1.00) = 89.23$$

Using the computed value of $\hat{\mu}_{Y|x_0}$, we now construct the 95% confidence interval:

$$89.23 \pm 2.101\sqrt{1.18\left[\frac{1}{20} + \frac{(1.00 - 1.196)^2}{0.68088}\right]}$$

$$89.23 \pm 0.75$$

$$88.48 \leq \mu_{Y|1.00} \leq 89.98$$

VI. Prediction of New Observations

One important application of a regression model is predicting new or future observations of Y. Linear regression is one of the forecasting techniques that IEs usually employ. You will learn more about other forecasting techniques in your higher IE subjects.

If $x_0$ is the value of the regressor variable of interest,

$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$$

is the point estimator of the new or future value of the response $Y_0$.

Consider obtaining an interval estimate for this future observation $Y_0$. This new observation is independent of the observations used to develop the regression model. Thus, the confidence interval for $\mu_{Y|x_0}$ is inappropriate, since it is based only on the data used to fit the regression model. We therefore use a different equation to construct the interval for predicting new values.

A $100(1-\alpha)\%$ prediction interval on a future observation $Y_0$ at the value $x_0$ is given by:

$$\hat{y}_0 - t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \;\leq\; Y_0 \;\leq\; \hat{y}_0 + t_{\alpha/2,\, n-2}\sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

EXAMPLE: Let us find a 95% prediction interval on the next observation of oxygen purity at $x_0 = 1.00\%$. Recall from the previous example that $\hat{y}_0 = 89.23$. Constructing the prediction interval gives us:

$$89.23 \pm 2.101\sqrt{1.18\left[1 + \frac{1}{20} + \frac{(1.00 - 1.196)^2}{0.68088}\right]}$$

$$86.83 \leq y_0 \leq 91.63$$
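
Both intervals at $x_0 = 1.00$ can be reproduced with a minimal Python sketch (assuming SciPy is available; small differences from the hand computation are due to rounding):

```python
import math
from scipy import stats

# Quantities from the oxygen purity example, evaluated at x0 = 1.00
y0_hat, sigma2_hat, Sxx, n, xbar, x0 = 89.23, 1.18, 0.68088, 20, 1.196, 1.00
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
core = 1 / n + (x0 - xbar) ** 2 / Sxx

ci_half = t_crit * math.sqrt(sigma2_hat * core)        # half-width of the CI on the mean response
pi_half = t_crit * math.sqrt(sigma2_hat * (1 + core))  # half-width of the prediction interval

print(f"95% CI for mean purity at x0 = 1.00: ({y0_hat - ci_half:.2f}, {y0_hat + ci_half:.2f})")
print(f"95% PI for a new observation:        ({y0_hat - pi_half:.2f}, {y0_hat + pi_half:.2f})")
```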

EXERCISES

7. Refer to the data in Exercise #1. Find a 95% confidence interval on each of the following:
a. Slope
b. Intercept
c. Mean permeability when x = 2.5
d. Find a 95% prediction interval on permeability when x = 2.5

8. The number of pounds of steam used per month by a chemical plant is thought to be related to the average ambient temperature (in °F) for that month. The past year's usage and temperature are shown in the table below:

Month      Temp   Usage/1000   Month       Temp   Usage/1000


January 21 185.79 July 68 621.55
February 24 214.47 August 74 675.06
March 32 288.03 September 62 562.03
April 47 424.84 October 50 452.93
May 50 454.84 November 41 369.95
June 59 539.03 December 30 273.98

a. Find a 99% confidence interval for $\beta_1$.
b. Find a 99% confidence interval for $\beta_0$.
c. Find a 95% confidence interval on mean steam usage when the average temperature is 55°F.
d. Find a 95% prediction interval on steam usage when the temperature is 55°F.

VII. Adequacy of the Regression Model

In using regression models, IEs should always consider the validity of the assumptions used and conduct analyses to examine the adequacy of the regression model.

There are two ways to assess the adequacy of the regression model:

1. Residual Analysis
2. Coefficient of Determination (R²)

Residual Analysis

The residuals from a regression model are $e_i = y_i - \hat{y}_i$, where $y_i$ is an actual observation and $\hat{y}_i$ is the corresponding fitted value from the regression model. To check whether the regression model is adequate through residual analysis, the residuals should approximate a normal distribution with constant variance.

To check for normality, the experimenter can construct a frequency histogram of the residuals or a normal probability plot of residuals.

As can be seen from the two immediately preceding figures, we can infer that the residuals approximate a normal distribution. The first graph plots the residuals against the predicted values; the random pattern evident in the figure implies that the residuals approximate a normal distribution. The second figure, on the other hand, is a normal probability plot of the residuals; the residuals fall approximately along a straight line, which implies that there is no severe departure from normality.

To summarize:

In residual analysis, the residuals approximate the normal distribution if:

1. in a residual plot against predicted values, a random pattern is evident
2. in a normal probability plot of residuals, the residuals fall approximately along a straight line

If the residuals are shown to approximate the normal distribution, then the regression model is adequate in relation to its error terms.
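
A minimal Python sketch of this residual check for the oxygen purity data (assuming NumPy and SciPy are available) might look as follows; the Shapiro-Wilk test is an extra numerical check beyond the graphical tools discussed above:

```python
import numpy as np
from scipy import stats

# Oxygen purity data from Table 1-1
x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

b1, b0 = np.polyfit(x, y, 1)           # least squares slope and intercept
residuals = y - (b0 + b1 * x)          # e_i = y_i - y_hat_i

# Normal probability plot data: r close to 1 suggests no severe departure from normality
(osm, osr), (slope, intercept, r) = stats.probplot(residuals)
print(f"correlation of ordered residuals with normal quantiles: r = {r:.3f}")

# Shapiro-Wilk test as an additional normality check
w, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W = {w:.3f}, p-value = {p:.3f}")
```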

Coefficient of Determination (R²)

The quantity

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$

is called the coefficient of determination, which is often used to judge the adequacy of a regression model. The range of values of the coefficient of determination is:

$$0 \leq R^2 \leq 1$$

R² is often referred to loosely as the amount of variability in the data explained or accounted for by the regression model. For the oxygen purity model, we have $R^2 = 0.877$; this means that the regression model accounts for 87.7% of the variability in the data.
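
Using the summary quantities from the earlier example, a short Python sketch (plain arithmetic) reproduces this value:

```python
# Summary quantities from the oxygen purity example
n, ybar, sum_y2 = 20, 92.1605, 170_044.5321
b1, Sxy = 14.94748, 10.17744

SST = sum_y2 - n * ybar**2   # total variability in y
SSE = SST - b1 * Sxy         # unexplained (error) variability
R2 = 1 - SSE / SST           # coefficient of determination, roughly 0.877

print(f"R^2 = {R2:.3f}")
```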

There are some misconceptions about R², though:

1. it does not measure the magnitude of the slope of the regression line
2. a large value of R² does not imply a steep slope
3. R² does not measure the appropriateness of the model, since it can be artificially inflated
4. even if R² is large, it is not an assurance that the regression model will provide accurate predictions of future observations

EXERCISE: 9. Refer to the data in Exercise #2. Find the residuals for the least squares model and determine the proportion of total variability explained by the regression model.

VIII. Transformations to a Straight Line

There are situations wherein the true regression function is nonlinear. However, there are times when we can express such nonlinear functions in linear form. Nonlinear regression functions that can be transformed to linear regression functions are called intrinsically linear.

Consider the exponential function:

$$Y = \beta_0 e^{\beta_1 x}\,\epsilon$$

This function is intrinsically linear since it can be transformed to a straight line by a logarithmic transformation:

$$\ln Y = \ln \beta_0 + \beta_1 x + \ln \epsilon$$

The transformed linear regression model above works under the assumption that the transformed error terms $\ln \epsilon$ are NID(0, $\sigma^2$).

Another intrinsically linear function is:

$$Y = \beta_0 + \beta_1\left(\frac{1}{x}\right) + \epsilon$$

If we let $z = \frac{1}{x}$, the model is linearized to:

$$Y = \beta_0 + \beta_1 z + \epsilon$$

Sometimes several transformations can be employed jointly to linearize a function. Consider the function:

$$Y = \frac{1}{\exp(\beta_0 + \beta_1 x + \epsilon)}$$

If we let $Y^* = \frac{1}{Y}$, we have the linearized form:

$$\ln Y^* = \beta_0 + \beta_1 x + \epsilon$$
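
To illustrate the first transformation, here is a minimal Python sketch (assuming NumPy; the parameter values and simulated data are hypothetical, for illustration only) that fits the exponential model by regressing $\ln Y$ on x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only
beta0_true, beta1_true = 2.0, 0.8

x = np.linspace(0.5, 3.0, 40)
eps = np.exp(rng.normal(0.0, 0.1, size=x.size))     # multiplicative error: ln(eps) ~ N(0, 0.01)
y = beta0_true * np.exp(beta1_true * x) * eps       # intrinsically linear exponential model

# Fit the transformed (linear) model: ln Y = ln(beta0) + beta1 * x + ln(eps)
b1_hat, ln_b0_hat = np.polyfit(x, np.log(y), 1)
print(f"beta1_hat = {b1_hat:.3f}, beta0_hat = {np.exp(ln_b0_hat):.3f}")
```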

IX. Correlation

The discussion in the previous sections treated x as a mathematical variable and Y as a random variable. However, most regression analyses involve situations in which both X and Y are random variables!

For example, suppose we wish to develop a regression model relating the shear strength of spot welds (Y) to the weld diameter (X). In this example, weld diameter cannot be controlled, making X a random variable. Each pair of observations $(X_i, Y_i)$ is then treated as jointly distributed random variables.

The estimators of the intercept and slope when both X and Y are random variables are identical to what was already discussed. An additional estimator, though, can be computed for the case when both X and Y are random variables, and that is the estimator of the correlation coefficient, $\rho$. Recall that the correlation coefficient $\rho$ is defined as:

$$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

The estimator of $\rho$ is the sample correlation coefficient, R:

$$R = \frac{\displaystyle\sum_{i=1}^{n} Y_i\left(X_i - \bar{X}\right)}{\left[\displaystyle\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2\right]^{1/2}} = \frac{S_{XY}}{\sqrt{S_{XX}\, SS_T}}$$

so that

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$

Note that the sample correlation coefficient R is just the square root of the coefficient of determination, R².

It is often useful to test the hypotheses

$$H_0: \rho = 0$$
$$H_1: \rho \neq 0$$

wherein the appropriate test statistic for such is

$$T_0 = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}$$

which has the t-distribution with $n - 2$ degrees of freedom. The null hypothesis is to be rejected if $|t_0| > t_{\alpha/2,\, n-2}$.

Another test of hypotheses that we can do is

$$H_0: \rho = \rho_0$$
$$H_1: \rho \neq \rho_0$$

For a sample size greater than or equal to 25 ($n \geq 25$), the test statistic to be used is

$$Z_0 = \left(\operatorname{arctanh} R - \operatorname{arctanh} \rho_0\right)\sqrt{n-3}$$

We reject the null hypothesis if the value of the test statistic falls within the critical region $|z_0| > z_{\alpha/2}$.

It is also possible to construct an approximate $100(1-\alpha)\%$ confidence interval for $\rho$:

$$\tanh\left(\operatorname{arctanh} r - \frac{z_{\alpha/2}}{\sqrt{n-3}}\right) \;\leq\; \rho \;\leq\; \tanh\left(\operatorname{arctanh} r + \frac{z_{\alpha/2}}{\sqrt{n-3}}\right)$$
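
These correlation procedures can be sketched in a few lines of Python (assuming SciPy is available; the values of r and n below are hypothetical, chosen only for illustration):

```python
import math
from scipy import stats

# Hypothetical values, for illustration only: sample correlation r from n paired observations
r, n, alpha = 0.8, 30, 0.05

# t-test of H0: rho = 0
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)

# Approximate CI for rho via the arctanh (Fisher z) transformation
z_crit = stats.norm.ppf(1 - alpha / 2)
lo = math.tanh(math.atanh(r) - z_crit / math.sqrt(n - 3))
hi = math.tanh(math.atanh(r) + z_crit / math.sqrt(n - 3))

print(f"t0 = {t0:.2f}, p-value = {p_value:.4f}")
print(f"{100 * (1 - alpha):.0f}% CI for rho: ({lo:.3f}, {hi:.3f})")
```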

EXERCISES

10. The final test and exam averages for 20 randomly selected
students taking a course in engineering statistics and a course in
operations research follow. Assume that the final averages are
jointly normally distributed.

Statistics 86 75 69 75 90
OR 80 81 75 81 82

Statistics 94 83 86 71 65
OR 95 80 81 76 72

Statistics 84 71 62 90 83
OR 85 72 65 93 81

Statistics 75 71 76 84 97
OR 70 73 72 80 98

a. Find the regression line relating the statistics final average to the OR final average.
b. Test for significance of regression using $\alpha = 0.05$.
c. Estimate the correlation coefficient.
d. Test the hypothesis that $\rho = 0$, using $\alpha = 0.05$.
e. Test the hypothesis that $\rho = 0.5$, using $\alpha = 0.05$.
f. Construct a 95% confidence interval for the correlation coefficient.

11. A random sample of 50 observations was made on the diameter of spot welds and the corresponding weld shear strength.

a. Given that r = 0.62, test the hypothesis that $\rho = 0$, using $\alpha = 0.01$.
b. Find a 99% confidence interval for $\rho$.
c. Based on the confidence interval in letter (b), can you conclude that $\rho = 0.5$ at the 0.01 level of significance?
