1. Introduction
Along with the Analysis of Variance, this is likely the most commonly used statistical
methodology in chemical and engineering research. In virtually every issue of a chemical
engineering journal, you will find papers that use a regression analysis. There are HUNDREDS of
books written on regression analysis. Some of the better ones (in my opinion) are:
Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.
Consequently, this set of notes is VERY brief and makes no pretense to be a thorough review of
regression analysis. Please consult the above references for all the gory details.
It turns out that both Analysis of Variance and Regression are special cases of a more general
statistical methodology called General Linear Models which in turn are special cases of
Generalized Linear Models which in turn are special cases of Generalized Additive Models, which
in turn are special cases of .....
The key difference between a regression analysis and an ANOVA is that the X variable is
nominally scaled in ANOVA, while in regression analysis the X variable is continuously scaled. This
implies that in ANOVA the shape of the response profile is unspecified (the null hypothesis is
that all means are equal, while the alternative is that at least one mean differs), whereas in
regression, the response profile must be a straight line.
Because ANOVA and regression belong to the same class of statistical models, many of the
assumptions are similar, the fitting methods are similar, and hypothesis testing and inference are
similar as well.
The population regression line is written as Y = b0 + b1X (the distinction between the population
parameters and the sample estimates b0 and b1 will be made clearer in a few minutes). The use of
the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for
the X variable then readily extends to more complex cases.
Review the definition of the intercept as the value of Y when X = 0, and the slope as the change
in Y per unit change in X.
This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X
from the extremes and then, only at those X values, randomly select from the relevant subset of
the population, rather than having to select at random from the population as a whole.
Once the data points are selected, the estimation process can proceed, but not before assessing
the assumptions!
4. Assumptions
4.1 Linearity
Regression analysis assumes that the relationship between Y and X is linear. Make a scatterplot
of Y against X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs
log(X)). Some caution is required when transforming because the error structure is also affected,
as you will see in later examples.
First, plot the residuals against the X values. If the scatter is not random around 0 but shows
some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X
is not linear. Second, fit a model that includes both X and X² and test whether the coefficient
associated with X² is zero. Unfortunately, this test can fail to detect a higher-order
relationship. Third, if there are multiple readings at some X values, then a test of
goodness-of-fit can be performed where the variation of the responses at the same X value is
compared to the variation around the regression line.
The response and predictor variables must both be interval or ratio scaled. In particular, using a
numerical value to represent a category and then using this numerical value in a regression is not
valid. For example, suppose that you code hair color as (1=red, 2=brown, and 3=black). Then using
these values in a regression, either as a predictor variable or as a response variable, is not
sensible.
4.2 Correct sampling scheme
The Y must be a random sample from the population of Y values for every X value in the sample.
Fortunately, it is not necessary to have a completely random sample from the population as the
regression line is valid even if the X values are deliberately chosen. However, for a given X, the
values from the population must be a simple random sample.
4.3 No outliers or influential points
All the points must belong to the relationship there should be no unusual points. The
scatterplot of Y vs X should be examined. If in doubt, fit the model with the points in and out of
the fit and see if this makes a difference in the fit.
Outliers can have a dramatic effect on the fitted line. For example, a single point far from the
rest of the data can be both an outlier and an influential point that drags the fitted line
toward it.
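A minimal sketch of the fit-with-and-without check, using made-up numbers where the last point breaks an otherwise clean trend:

```python
import numpy as np

# Hypothetical data: five points on a clear trend plus one influential outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 2.0, 2.9, 4.1, 5.0, 2.0])  # last point breaks the trend

def slope_intercept(x, y):
    # np.polyfit returns coefficients from highest degree down
    b1, b0 = np.polyfit(x, y, 1)
    return b0, b1

b0_all, b1_all = slope_intercept(x, y)          # all points included
b0_sub, b1_sub = slope_intercept(x[:-1], y[:-1])  # outlier removed

print(f"slope with all points:     {b1_all:.3f}")
print(f"slope without the outlier: {b1_sub:.3f}")
```

Here the single extreme point is enough to flatten a slope of about 1 down to nearly zero, which is exactly the kind of difference the refit comparison is meant to expose.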
This general problem is called the error in variables problem and has a long history in
statistics.
It turns out that there are two important cases. If the value reported for X is a nominal value
and the actual value of X varies randomly around this nominal value, then there is no bias in the
estimates. This is called the Berkson case after Berkson who first examined this situation. The
most common cases are where the recorded X is a target value (e.g. temperature as set by a
thermostat) while the actual X that occurs would vary randomly around this target value.
However, if the value used for X is an actual measurement of the true underlying X then there is
uncertainty in both the X and Y direction. In this case, estimates of the slope are attenuated
towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More
alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates
no longer tend to the true population values!
This latter case of error in variables is very difficult to analyze properly, and there are no
universally accepted solutions. Refer to the reference books listed at the start of this chapter
for more details.
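The attenuation effect described above is easy to see in a small simulation. This is a hedged illustration with arbitrary parameter choices (true slope 2, X and its measurement error both with standard deviation 1), not a recipe for analyzing real error-in-variables data:

```python
import numpy as np

# Simulation of slope attenuation when X is measured with error
# (all parameter values are hypothetical, purely illustrative)
rng = np.random.default_rng(42)
n = 5000
x_true = rng.normal(0.0, 1.0, n)              # true X, sd = 1
y = 1.0 + 2.0 * x_true + rng.normal(0.0, 0.5, n)

x_meas = x_true + rng.normal(0.0, 1.0, n)     # measured X, error sd = 1

b1_true, _ = np.polyfit(x_true, y, 1)   # slope using the true X
b1_meas, _ = np.polyfit(x_meas, y, 1)   # slope using the error-laden X

# With equal variances for X and its measurement error, the expected
# attenuation factor is Var(X) / (Var(X) + Var(error)) = 0.5
print(f"slope with true X:     {b1_true:.2f}")
print(f"slope with measured X: {b1_meas:.2f}")
```

No matter how large n becomes, the slope fitted to the measured X stays near half the true value, which is the inconsistency mentioned above.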
5. Obtaining Estimates
To distinguish between population parameters and sample estimates, we denote the sample
intercept by b0 and the sample slope by b1. The equation fitted to a particular sample of points
is expressed as:

Ŷ = b0 + b1X

where b0 is the estimated intercept, and b1 is the estimated slope. The symbol Ŷ indicates that
we are referring to the estimated line and not to a line in the entire population.
How is the best fitting line found when the points are scattered? Many methods have been
proposed (and used) for curve fitting. Some of these methods are:
- least squares
- least absolute deviations
- least median-of-squares
- least maximum absolute deviation
- Bayesian regression
- fuzzy regression
We typically use the principle of least squares. The least-squares line is the line that makes the
sum of the squares of the deviations of the data points from the line in the vertical direction as
small as possible.
Mathematically, the least squares line is the line that minimizes:

Σ(Y − Ŷ)²

where Ŷ is the point on the line corresponding to each X value. This is also known as the
predicted value of Y for a given value of X. This formal definition of least squares is not that
important; the concept as expressed in the previous paragraph is more important, in particular
that it is the SQUARED deviation in the VERTICAL direction that is used.
It is possible to write out a formula for the estimated intercept and slope, but who cares - let
the computer do the dirty work.
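For the record, letting the computer do the dirty work might look like this. The data are made up for illustration; the hand computation (b1 = Sxy/Sxx, b0 = ȳ − b1·x̄) is shown only to demystify what the package does:

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares "by hand": b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# The same fit via numpy's built-in polynomial least squares
b1_np, b0_np = np.polyfit(x, y, 1)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Both routes give the same line, which is the point: the formulae exist, but the computer applies them for you.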
The estimated intercept (b0) is the estimated value of Y when X = 0. In some cases, it is
meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a
plot of income vs year, it seems kind of silly to investigate income in year 0. In these cases,
there is no clear interpretation of the intercept, and it merely serves as a placeholder for the
line.
The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit
change in the horizontal direction, the fitted line increases by b1 units. If b1 is negative, the
fitted line points downwards, and the increase in the line is negative, i.e., actually a decrease. As
with all estimates, a measure of precision can be obtained. As before, this is the standard error
of each of the estimates. Again, there are computational formulae, but in this age of computers,
these are not important. As before, approximate 95% confidence intervals for the
corresponding population parameters are found as estimate ± 2(se).
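A sketch of the estimate ± 2(se) interval for the slope, continuing with made-up data; the standard error formula se(b1) = s/√Sxx is standard, but a real analysis would simply read these numbers off the package output:

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = x.size

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Residual standard deviation, with n - 2 degrees of freedom
# (two parameters, intercept and slope, were estimated)
s = np.sqrt(np.sum(resid**2) / (n - 2))

# Standard error of the slope and the approximate 95% interval
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))
lower, upper = b1 - 2 * se_b1, b1 + 2 * se_b1
print(f"b1 = {b1:.3f}, se = {se_b1:.4f}, approx 95% CI: ({lower:.3f}, {upper:.3f})")
```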
Formal tests of hypotheses can also be done. Usually, these are only done on the slope
parameter as this is typically of most interest. The null hypothesis is that population slope is 0,
i.e. there is no relationship between Y and X (can you draw a scatterplot showing such a
relationship?). More formally the null hypothesis is:

H0: β1 = 0

notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms
of a sample statistic.
The alternate hypothesis is typically chosen as:

H1: β1 ≠ 0

The test statistic is computed as:

T = b1 / se(b1)

and is compared to a t-distribution with appropriate degrees of freedom to obtain the p-value.
This is usually automatically done by most computer packages.
The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability
of observing these data if the hypothesis of no relationship were true.
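Continuing the same made-up example, the t-statistic and p-value can be sketched as follows (scipy is assumed available here for the t-distribution tail probability; any regression package would report the same numbers automatically):

```python
import numpy as np
from scipy import stats  # assumed available, used only for the t-distribution

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = x.size

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# Test H0: slope = 0 against the two-sided alternative,
# using a t-distribution with n - 2 degrees of freedom
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"T = {t_stat:.2f}, p = {p_value:.5f}")
```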
The p-value does not tell the whole story, i.e. statistical vs. engineering (non)significance must be
determined and assessed.
6. Obtaining Predictions
Once the best-fitting line is found, it can be used to make predictions for new values of X.
There are two types of predictions that are commonly made. It is important to distinguish
between them as these two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a SINGLE future individual value for a
particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL
future responses at a particular X. The prediction interval for an individual response is
sometimes called a confidence interval for an individual response, but this is an unfortunate (and
incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are
computed for fixed unknown parameter values; prediction intervals are computed for future
random variables.
Both of the above intervals should be distinguished from the confidence interval for the slope.
In both cases, the estimate is found in the same manner: substitute the new value of X into the
equation and compute the predicted value Ŷ. In most computer packages this is accomplished
by inserting a new dummy observation in the dataset with the value of Y missing, but the value
of X present. The missing Y value prevents this new observation from being used in the fitting
process, but the X value allows the package to compute an estimate for this observation.
What differs between the two predictions are the estimates of uncertainty.
In the first case, there are two sources of uncertainty involved in the prediction. First, there is
the uncertainty caused by the fact that this estimated line is based upon a sample. Then there is
the additional uncertainty that the value could be above or below the predicted line. This
interval is often called a prediction interval at a new X.
In the second case, only the uncertainty caused by estimating the line based on a sample is
relevant. This interval is often called a confidence interval for the mean at a new X.
The prediction interval for an individual response is typically MUCH wider than the confidence
interval for the mean of all future responses because it must account for the uncertainty from
the fitted line plus individual variation around the fitted line. Many textbooks have the formulae
for the se for the two types of predictions, but again, there is little to be gained by examining
them. What is important is that you read the documentation of your software carefully to
ensure that you understand exactly what interval is being given to you.
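The two intervals above can be sketched directly from the textbook formulae, again on made-up data. The "extra 1" under the square root in the prediction-interval se is the individual variation around the line, and it is why the prediction interval is so much wider:

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = x.size

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))
Sxx = np.sum((x - x.mean()) ** 2)

x_new = 3.5
y_hat = b0 + b1 * x_new

# se for the MEAN response at x_new
se_mean = s * np.sqrt(1.0 / n + (x_new - x.mean()) ** 2 / Sxx)
# se for a SINGLE future response at x_new: note the extra "1 +" term
# capturing individual variation around the fitted line
se_pred = s * np.sqrt(1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / Sxx)

print(f"prediction at X = {x_new}: {y_hat:.2f}")
print(f"approx 95% CI for the mean:     {y_hat:.2f} +/- {2 * se_mean:.2f}")
print(f"approx 95% prediction interval: {y_hat:.2f} +/- {2 * se_pred:.2f}")
```

Whatever software you use, the labels on these two intervals differ between packages, so check the documentation to know which one you are being given.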