
Single-variable regression

1. Introduction
Along with the Analysis of Variance, this is likely the most commonly used statistical
methodology in chemical and engineering research. In virtually every issue of a chemical
engineering journal, you will find papers that use a regression analysis. There are HUNDREDS of
books written on regression analysis. Some of the better ones (in my opinion) are:
Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, and Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistical Analysis. Prentice Hall.
Consequently, this set of notes is VERY brief and makes no pretense to be a thorough review of
regression analysis. Please consult the above references for all the gory details.
It turns out that both Analysis of Variance and Regression are special cases of a more general
statistical methodology called General Linear Models which in turn are special cases of
Generalized Linear Models which in turn are special cases of Generalized Additive Models, which
in turn are special cases of .....
The key difference between a regression analysis and an ANOVA is that the X variable is
nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled. This
implies that in ANOVA the shape of the response profile was unspecified (the null hypothesis
was that all means were equal, while the alternate was that at least one mean differs), whereas in
regression the response profile must be a straight line.
Because both ANOVA and regression come from the same class of statistical models, many of the
assumptions are similar, the fitting methods are similar, and hypothesis testing and inference are
similar as well.

2. Equation for a line - getting notation straight


In order to use regression analysis effectively, it is important that you understand the concepts
of slopes and intercepts and how to determine these from data values.
This will be QUICKLY reviewed here in class.
In previous courses at high school or in linear algebra, the equation of a straight line was often
written y = mx + b where m is the slope and b is the intercept. In some popular spreadsheet
programs, the authors decided to write the equation of a line as y = a + bx. Now a is the
intercept, and b is the slope. Statisticians, for good reasons, have rationalized this notation and
usually write the equation of a line as y = β0 + β1x or as Y = b0 + b1X (the distinction between β0
and b0 will be made clearer in a few minutes). The use of the subscript 0 to represent the
intercept and the subscript 1 to represent the coefficient for the X variable then readily
extends to more complex cases.
Review definition of intercept as the value of Y when X=0, and slope as the change in Y per unit
change in X.
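For example, a line passing through the points (1, 3) and (3, 7) has slope (7 - 3)/(3 - 1) = 2 and
intercept 3 - 2(1) = 1, giving the line Y = 1 + 2X.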

3. Populations and samples


All of statistics is about detecting signals in the face of noise and in estimating population
parameters from samples. Regression is no different.
First consider the population. The correct definition of the population is an important part of
any study. Conceptually, we can think of the large set of all units of interest; on each unit, both
an X and a Y variable are present. We wish to summarize the relationship
between Y and X, and furthermore wish to make predictions of the Y value for future X values
that may be observed from this population. [This is analogous to having different treatment
groups corresponding to different values of X in ANOVA.]
If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT.
However, in chemical engineering, the relationship between Y and X is often much more tenuous.
If you could draw a scatterplot of Y against X for ALL elements of the population, the points
would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a
straight line at any given X value.
We denote this relationship as

Y = β0 + β1X + ε

where now β0 and β1 are the POPULATION intercept and slope respectively. We say that

E[Y] = β0 + β1X

is the expected or average value of Y at X.
The term ε represents the random variation of individual units in the population above and below the
expected value. It is assumed to have constant standard deviation over the entire regression line
(i.e. the spread of data points in the population is constant over the entire regression line).
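To make the model concrete, here is a minimal Python sketch (not from the original notes) that
simulates data from this population model; the values β0 = 1, β1 = 2, and σ = 0.5 are hypothetical:

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical population parameters, chosen purely for illustration
    beta0, beta1, sigma = 1.0, 2.0, 0.5

    x = rng.uniform(0, 10, size=100)        # X values for 100 sampled units
    eps = rng.normal(0, sigma, size=100)    # random scatter around the line
    y = beta0 + beta1 * x + eps             # Y = beta0 + beta1*X + epsilon

Note that the scatter (eps) has the same standard deviation at every X, matching the
constant-spread assumption above.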
Of course, we can never measure all units of the population, so a sample must be taken in order
to estimate the population slope, population intercept, and population standard deviation. Unlike a
correlation analysis, it is NOT necessary to select a simple random sample from the entire
population, and more elaborate schemes can be used. The bare minimum that must be achieved is
that for any individual X value found in the sample, the units in the population that share this X
value must have been selected at random.


This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X
from the extremes and then, only at those X values, randomly select from the relevant subset of
the population, rather than having to select at random from the population as a whole.
Once the data points are selected, the estimation process can proceed, but not before assessing
the assumptions!

4. Assumptions
4.1 Linearity
Regression analysis assumes that the relationship between Y and X is linear. Make a scatterplot
of Y against X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs
log(X)). Some caution is required with transformations because of their effect on the error
structure, as you will see in later examples.
First, plot the residuals against the X values. If the scatter is not random around 0 but shows
some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X
is not linear. Second, fit a model that includes both X and X² and test if the coefficient
associated with X² is zero; unfortunately, this test could fail to detect a higher-order
relationship. Third, if there are multiple readings at some X values, a goodness-of-fit test can
be performed in which the variation of the responses at the same X value is compared to the
variation around the regression line.
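As an illustrative sketch of the second diagnostic (assuming the statsmodels package and the
simulated x and y from the earlier sketch), the quadratic-term test might look like:

    import numpy as np
    import statsmodels.api as sm

    # Design matrix with columns for the intercept, X, and X^2
    design = sm.add_constant(np.column_stack([x, x**2]))
    fit = sm.OLS(y, design).fit()

    # p-value for the X^2 coefficient: a small value is evidence that
    # the straight-line model is not adequate
    print(fit.pvalues[2])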
The response and predictor variables must both be interval or ratio scaled. In particular, using a
numerical value to represent a category and then using this numerical value in a regression is not
valid. For example, suppose that you code hair color as (1=red, 2=brown, and 3=black); using these
values in a regression, either as a predictor or as a response variable, is not sensible.
4.2 Correct sampling scheme
For every X value in the sample, the Y values must be a random sample from the population of Y
values at that X. Fortunately, it is not necessary to have a completely random sample from the
population, as the regression line is valid even if the X values are deliberately chosen. However,
for a given X, the Y values taken from the population must be a simple random sample.
4.3 No outliers or influential points
All the points must belong to the relationship; there should be no unusual points. The
scatterplot of Y vs X should be examined. If in doubt, fit the model with the suspect points
included and excluded, and see whether this makes a difference to the fit.


Outliers can have a dramatic effect on the fitted line. For example, a single point lying far from
the rest of the data can be both an outlier and an influential point, dragging the fitted line
toward itself.
[Figure: scatterplot with a single outlying, influential point distorting the fitted line.]

4.4 Equal variation along the line (homoscedasticity)


The variability about the regression line should be similar for all values of X, i.e. the scatter of
the points above and below the fitted line should be roughly constant over the entire line. This is
assessed by plotting the residuals against X and checking that the scatter is roughly uniform
around zero, with no increase or decrease in spread over the entire line.
4.5 Independence
Each value of Y is independent of any other value of Y. The most common cases where this fails
are time-series data, where X is a time measurement. In those cases, time-series analysis should
be used instead. This assumption can be assessed by again looking at residual plots against time or
other variables.
4.6 Normality of errors
The difference between the value of Y and the expected value of Y is assumed to be normally
distributed. This is one of the most misunderstood assumptions. Many people erroneously assume
that the distribution of Y over all X values must be normally distributed, i.e. they look simply at
the distribution of the Y values while ignoring the X values. The assumption only states that the
residuals, the differences between the values of Y and the corresponding points on the line, must
be normally distributed.
This can be assessed by looking at normal probability plots of the residuals.
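As a sketch (reusing the simulated x and y from earlier), such a plot can be produced with scipy:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Fit the line and compute the residuals;
    # np.polyfit returns coefficients highest degree first
    b1, b0 = np.polyfit(x, y, deg=1)
    residuals = y - (b0 + b1 * x)

    # Normal probability plot: points falling close to a straight line
    # are consistent with normally distributed residuals
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()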
4.7 X measured without error
This is a new assumption for regression as compared to ANOVA. In ANOVA, the group
membership was always exact, i.e. the treatment applied to an experimental unit was known
without ambiguity. However, in regression, it can turn out that the X value may not be known
exactly.


This general problem is called the "error in variables" problem and has a long history in
statistics.
It turns out that there are two important cases. If the value reported for X is a nominal value
and the actual value of X varies randomly around this nominal value, then there is no bias in the
estimates. This is called the Berkson case after Berkson who first examined this situation. The
most common cases are where the recorded X is a target value (e.g. temperature as set by a
thermostat) while the actual X that occurs would vary randomly around this target value.
However, if the value used for X is an actual measurement of the true underlying X, then there is
uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated
towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More
alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates
no longer tend to the true population values!
This latter case of errors in variables is very difficult to analyze properly, and there are no
universally accepted solutions. Refer to the reference books listed at the start of this chapter
for more details.
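A small simulation makes the attenuation effect easy to see; this is an illustrative sketch with
hypothetical parameter values, not a method from the notes:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000

    true_x = rng.uniform(0, 10, n)
    y = 1.0 + 2.0 * true_x + rng.normal(0, 0.5, n)   # true slope is 2

    # X is recorded with measurement error (sd = 2) instead of exactly
    observed_x = true_x + rng.normal(0, 2.0, n)

    slope_true = np.polyfit(true_x, y, 1)[0]      # close to 2
    slope_obs = np.polyfit(observed_x, y, 1)[0]   # attenuated toward 0
    print(slope_true, slope_obs)

Increasing n does not rescue the second estimate; that is the lack of consistency mentioned
above.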

5. Obtaining Estimates
To distinguish between population parameters and sample estimates, we denote the sample
intercept by b0 and the sample slope by b1. The equation for a particular sample of points is
expressed as:

Ŷ = b0 + b1X

where b0 is the estimated intercept and b1 is the estimated slope. The symbol Ŷ indicates that
we are referring to the estimated line and not to a line in the entire population.

How is the best fitting line found when the points are scattered? Many methods have been
proposed (and used) for curve fitting. Some of these methods are:
- least squares
- least absolute deviations
- least median-of-squares
- least maximum absolute deviation
- Bayesian regression
- fuzzy regression
We typically use the principle of least squares. The least-squares line is the line that makes the
sum of the squares of the deviations of the data points from the line in the vertical direction as
small as possible.
Mathematically, the least-squares line is the line that minimizes:

Σ (Y - Ŷ)²

where Ŷ is the point on the line corresponding to each X value; this is also known as the
predicted value of Y for a given value of X. This formal definition of least squares is not that
important - the concept as expressed in the previous paragraph is more important; in particular,
it is the SQUARED deviation in the VERTICAL direction that is used.
It is possible to write out a formula for the estimated intercept and slope, but who cares - let
the computer do the dirty work.
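For instance, a minimal sketch (x and y as in the earlier sketches) showing the closed-form
estimates and the computer shortcut side by side:

    import numpy as np

    # Closed-form least-squares estimates (the "formula" version)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar

    # ...and the "let the computer do it" version; the two agree
    b1_np, b0_np = np.polyfit(x, y, deg=1)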
The estimated intercept (b0) is the estimated value of Y when X = 0. In some cases, it is
meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a
plot of income vs year, it seems kind of silly to investigate income in year 0. In these cases,
there is no clear interpretation of the intercept, and it merely serves as a placeholder for the
line.
The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit
change in the horizontal direction, the fitted line rises by b1 units. If b1 is negative, the
fitted line points downwards, and the increase in the line is negative, i.e. actually a decrease. As
with all estimates, a measure of precision can be obtained. As before, this is the standard error
of each of the estimates. Again, there are computational formulae, but in this age of computers,
these are not important. As before, approximate 95% confidence intervals for the
corresponding population parameters are found as estimate ± 2 se.
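As a sketch (assuming scipy and the x and y simulated earlier), these quantities are reported
directly by linregress:

    from scipy import stats

    res = stats.linregress(x, y)

    # Approximate 95% confidence interval for the population slope:
    # estimate plus or minus 2 standard errors
    lower = res.slope - 2 * res.stderr
    upper = res.slope + 2 * res.stderr
    print(res.slope, res.stderr, (lower, upper))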
Formal tests of hypotheses can also be done. Usually these are done only on the slope
parameter, as this is typically of most interest. The null hypothesis is that the population slope is
0, i.e. there is no relationship between Y and X (can you draw a scatterplot showing such a
relationship?). More formally, the null hypothesis is:

H0: β1 = 0

Notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms
of a sample statistic.
The alternate hypothesis is typically chosen as:

HA: β1 ≠ 0

The test statistic is found as:

T = b1 / se(b1)

and is compared to a t-distribution with the appropriate degrees of freedom (n - 2 in simple
regression) to obtain the p-value.
This is usually automatically done by most computer packages.
The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability
of observing data such as this (or more extreme) if the hypothesis of no relationship were true.
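To make the computation concrete, here is a sketch that forms the test statistic by hand (res is
the linregress result from the previous sketch):

    from scipy import stats

    n = len(x)
    t_stat = res.slope / res.stderr                    # T = b1 / se(b1)
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value
    print(p_value)   # matches res.pvalue reported by linregress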


The p-value does not tell the whole story: statistical significance vs. engineering
(non)significance must still be determined and assessed.

6. Obtaining Predictions
Once the best-fitting line is found, it can be used to make predictions for new values of X.
There are two types of predictions that are commonly made. It is important to distinguish
between them as these two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a SINGLE future individual value for a
particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL
future responses at a particular X. The prediction interval for an individual response is
sometimes called a confidence interval for an individual response, but this is an unfortunate (and
incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are
computed for fixed unknown parameter values; prediction intervals are computed for future
random variables.
Both of the above intervals should be distinguished from the confidence interval for the slope.
In both cases, the estimate is found in the same manner: substitute the new value of X into the
equation and compute the predicted value Ŷ. In most computer packages this is accomplished
by inserting a new dummy observation in the dataset with the value of Y missing but the value
of X present. The missing Y value prevents this new observation from being used in the fitting
process, but the X value allows the package to compute an estimate for this observation.
What differs between the two predictions are the estimates of uncertainty.
In the first case, there are two sources of uncertainty involved in the prediction. First, there is
the uncertainty caused by the fact that this estimated line is based upon a sample. Then there is
the additional uncertainty that the value could be above or below the predicted line. This
interval is often called a prediction interval at a new X.
In the second case, only the uncertainty caused by estimating the line based on a sample is
relevant. This interval is often called a confidence interval for the mean at a new X.
The prediction interval for an individual response is typically MUCH wider than the confidence
interval for the mean of all future responses because it must account for the uncertainty from
the fitted line plus the individual variation around the fitted line. Many textbooks give the
formulae for the standard errors of the two types of predictions, but again, there is little to be
gained by examining them. What is important is that you read the documentation of your
software carefully to ensure that you understand exactly what interval is being given to you.
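For concreteness, here is a sketch of the standard textbook formulae behind the two intervals (a
hypothetical helper, reusing x and y from the earlier sketches):

    import numpy as np
    from scipy import stats

    def intervals_at(x0, x, y, level=0.95):
        """CI for the mean response and prediction interval for an
        individual response at a new value x0 (textbook formulae)."""
        n = len(x)
        b1, b0 = np.polyfit(x, y, deg=1)
        y_hat = b0 + b1 * x0

        resid = y - (b0 + b1 * x)
        s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual std deviation
        sxx = np.sum((x - x.mean()) ** 2)
        t = stats.t.ppf(0.5 + level / 2, df=n - 2)

        # The extra "1 +" in se_pred is the individual variation around
        # the line; it is what makes the prediction interval much wider
        se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
        se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)

        return ((y_hat - t * se_mean, y_hat + t * se_mean),   # CI for mean
                (y_hat - t * se_pred, y_hat + t * se_pred))   # prediction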
