Introduction to Regression

The central purpose of regression is to create a linear equation relating the independent
variable, X, to the dependent variable, Y. It permits us to answer questions of the
following kind:
How much additional income does each additional year of education provide?

Regression assumes that both variables are measured on at least an interval level, and it
should only be used if we think that this assumption is close to being met.

The prediction equation in the population

Yi^ = α + βXi
where Y^ is being used as Yhat.

Y^ is known as the predicted value of Y for a given value of X.


It is considered the best estimate of Y for a given X.

If the model (equation) is correct for the population, Y^ equals μY|X. This is known as the
conditional mean of Y given X, and is the population mean of Y for the particular value
of X.

Thus the regression line can be considered the path of mean values of Y as X changes.
The line is produced by plugging values of X into the linear regression formula and
solving for Y^.

β is the regression slope: the amount of change in Y for each unit change in X. (Note
that one must specify the units.)

α is the Y intercept. It is the value of Y^ when X = 0.
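As a minimal sketch of how the prediction equation is used, the Python snippet below plugs values of X into Y^ = α + βX. The values of `ALPHA` and `BETA` are hypothetical, chosen only for illustration (income in dollars, education in years); they are not estimates from any data.

```python
# Hypothetical population parameters, for illustration only.
ALPHA = 10000.0  # Y intercept: predicted income (dollars) when X = 0
BETA = 2500.0    # slope: change in predicted income (dollars) per extra year of education


def predict(x, alpha=ALPHA, beta=BETA):
    """Return Y^ = alpha + beta * x, the predicted value of Y for a given X."""
    return alpha + beta * x


# Plugging in several values of X traces out the regression line.
for years in (0, 12, 16):
    print(years, predict(years))

# A one-unit increase in X changes Y^ by exactly beta.
slope_check = predict(13) - predict(12)
```

Because the slope is stated per unit of X, changing the units (months instead of years, say) would change the numerical value of the slope, which is why the units must be specified.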


These symbols (α and β) have no relationship to α as in the α level of a significance test.

The population equation with error

Yi = α + βXi + εi

where εi is the error and

εi = Yi - Yi^

and where Yi is the actual observed value of Y for a case.
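The error term can be illustrated with a short sketch. The parameters and observations below are made up for illustration; each error is simply the observed Y minus the predicted Y^.

```python
# Hypothetical parameters and observations, for illustration only.
alpha, beta = 1.0, 2.0
xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.0, 7.5]  # actual observed values of Y

# Predicted value for each case: Yi^ = alpha + beta * Xi
y_hats = [alpha + beta * x for x in xs]

# Error for each case: error_i = Yi - Yi^
errors = [y - y_hat for y, y_hat in zip(ys, y_hats)]

# Each observed Y is recovered as prediction plus error.
reconstructed = [y_hat + e for y_hat, e in zip(y_hats, errors)]
print(errors)
```

A positive error means the case lies above the line; a negative error means it lies below.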

Computing the sample statistics

Yi^ = a + bXi

is the sample prediction equation, and

Yi = a + bXi + ei

is the sample equation with error,

where a, b, and e are sample estimates of α, β, and ε respectively.

Equations for b and a


These equations are designed to produce the best fitting line for the scatterplot.

A scatterplot is a two-dimensional plot of the observations as X and Y coordinates, in
which the location of each point indicates the values of both X and Y for that case.

The best fitting line is defined as the one which minimizes the sum of the squared
vertical distances of all points from the regression line.
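In symbols, this least-squares criterion and the standard closed-form solutions it yields can be written as follows. Here n is the number of cases and \bar{X}, \bar{Y} are the sample means; these are the usual textbook formulas and may differ in form from the ones intended under "Equations for b and a".

```latex
\begin{aligned}
\text{minimize over } a, b:\quad & \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - a - bX_i \right)^2 \\
b &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \\
a &= \bar{Y} - b\bar{X}
\end{aligned}
```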

The procedure for improving best estimates for a dependent variable (Y) by accounting
for its relationship with an independent variable (X) is called simple linear correlation
and regression analysis.

Simple Linear Correlation and Regression Analysis

Simple linear correlation and regression analysis is the use of the formula for a straight
line to improve best estimates of an interval/ratio dependent variable (Y) for all values of
an interval/ratio independent variable (X).

Linear means “straight line”.

Scatterplots

A linear regression formula is the formula for a straight line.

Simple linear correlation and regression statistics apply only to scatterplots with
coordinates in a linear, cigar-shaped pattern.

The formula for a straight line to estimate Y is: Y^ = a + bX

The Regression Line on the Scatterplot


The regression line is the best-fitting straight line plotted through the X,Y-coordinates of
a scatterplot.

Positive and Negative Correlations

A positive correlation is an upward sloping pattern in a scatterplot, where an increase in X
is related to an increase in Y.

A negative correlation is a downward sloping pattern, where an increase in X is related to
a decrease in Y.

When the pattern lacks an elongated, sloped cigar shape, there is no correlation: an
increase in X is unrelated to the scores of Y.

Computing Correlation and Regression Statistics

To calculate correlation and regression statistics, set up a spreadsheet to obtain the
following sums: ΣX, ΣY, ΣX², ΣY², and ΣXY.
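The same column totals can be sketched in Python instead of a spreadsheet. The data below are hypothetical, and the slope and intercept use the standard computational least-squares formulas, which are an assumption here since the notes do not spell them out.

```python
# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

# The five sums, one per spreadsheet column total.
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Computational least-squares formulas for slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)
print(sum_x, sum_y, sum_x2, sum_y2, sum_xy, a, b)
```

ΣY² is not needed for a or b; it enters only when computing the correlation coefficient.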

Pearson’s r Correlation Coefficient

Pearson’s r is a widely used correlation coefficient that measures the tightness of fit of
X,Y-coordinates around the regression line of a scatterplot.

Computed values of Pearson’s r can range from -1 to +1.

The larger the absolute value of r, the tighter the fit of X,Y-coordinates around the
regression line.

The Sign of Pearson’s r

When the regression line slopes upward, we have a positive correlation. Pearson’s r will
be positive, up to a maximum of +1.

When the regression line slopes downward, we have a negative correlation. Pearson’s r
will be negative, down to a minimum of -1.

When the regression line is flat, we have no correlation and Pearson’s r = 0.
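The three cases can be checked with a small sketch. The function below uses the standard computational formula for Pearson's r built from the five sums (an assumption, since the notes do not give the formula), and the data sets are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the sums of X, Y, X^2, Y^2, and XY."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator  # raises ZeroDivisionError if X or Y is constant


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # points rise along a straight line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # points fall along a straight line
r_flat = pearson_r(xs, [2, 1, 3, 1, 2])   # no upward or downward trend
print(r_up, r_down, r_flat)
```

The sign of r always matches the sign of the slope b, because both share the same numerator.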

Regression Statistics

The coefficients and symbols of the regression line formula, Y^ = a + bX:

Y^ = the predicted Y (an estimate of the dependent variable Y computed for a given value
of the independent variable X).

Recall that the objective of correlation and regression is to use the regression line to make
better estimates of Y.

Regression Statistics: The Slope, b

b = slope of the regression line (called the regression coefficient).

b conveys slope in the sense of going up or down a hill. It answers the question: How
far does the line rise (or fall) for every one-unit run of X?

Regression Statistics: The Y-intercept, a

a = Y-intercept, the value of Y^ at the point where the regression line crosses the Y-axis,
that is, where X = 0.

To compute a, first compute b. Then calculate the means of X and Y, substitute them into
the regression equation along with b, and solve for a.
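Substituting the means works because the least-squares line always passes through the point (\bar{X}, \bar{Y}). As a worked sketch, with made-up values \bar{X} = 3, \bar{Y} = 4, and b = 0.6 (not taken from any data in these notes):

```latex
\begin{aligned}
\bar{Y} &= a + b\bar{X} \\
a &= \bar{Y} - b\bar{X} \\
  &= 4 - 0.6(3) = 2.2
\end{aligned}
```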

Plotting the Regression Line

To plot the regression line on the scatterplot, use the regression equation to calculate a
few values of Y^.

Do this by inserting chosen values of X and solving for Y^ in the regression equation
Y^ = a + bX
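A minimal sketch of this step in Python, assuming hypothetical estimates a = 2.2 and b = 0.6 and arbitrarily chosen values of X:

```python
# Hypothetical sample estimates, for illustration only.
a = 2.2  # Y-intercept
b = 0.6  # slope

# Chosen values of X; two points are enough to draw a straight line,
# and a third serves as a check.
chosen_xs = [0, 5, 10]

# Insert each chosen X into Y^ = a + bX and solve for Y^.
line_points = [(x, a + b * x) for x in chosen_xs]
for x, y_hat in line_points:
    print(f"X = {x}, Y^ = {y_hat:.1f}")
```

The resulting (X, Y^) pairs can then be marked on the scatterplot and joined with a straight edge, or passed to any plotting tool.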

The Importance of Observing the Scatterplot

The linear regression equation applies only when the pattern of coordinates is linear.

The presence of outlier coordinates can cause the attenuation (weakening) or inflation of
the Pearson’s r correlation coefficient.
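Both effects can be demonstrated with a small sketch. The data sets are hypothetical and deliberately extreme, and `pearson_r` is an illustrative helper using the standard deviation-score formula for r.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Attenuation: a perfectly linear pattern plus one outlier far below the line.
linear_x, linear_y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
r_linear = pearson_r(linear_x, linear_y)
r_attenuated = pearson_r(linear_x + [6], linear_y + [0])

# Inflation: a pattern with no trend plus one outlier far from the cluster.
flat_x, flat_y = [1, 2, 3, 4, 5], [2, 1, 3, 1, 2]
r_flat = pearson_r(flat_x, flat_y)
r_inflated = pearson_r(flat_x + [20], flat_y + [20])

print(r_linear, r_attenuated, r_flat, r_inflated)
```

In each pair a single added case moves r substantially, which is why the scatterplot should be inspected before the coefficient is interpreted.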