
Introduction to Regression Analysis

Instructor: Dr. Tamanna Howlader


Associate Professor
ISRT, Dhaka University

Correlation
Measures the relative strength of the linear relationship between two variables.
[Three scatterplots of Y vs. X, illustrating r = -1, r = 0, and r = +0.3]
Correlation vs Regression
In correlation, the two variables are treated as equals.

In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable (Y).

Simple Linear Regression
A statistical technique that uses a single independent variable (X) to estimate a single dependent variable (Y).

Based on the equation for a line:

Y = mX + B
What is Linear?
The line Y = mX + B has slope m and Y-intercept B.
[Plot of the line Y = mX + B, with slope m and intercept B labeled]

What is slope?
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
Applications of linear regression
Applications: engineering, the physical and chemical sciences, economics, management, the life and biological sciences, and the social sciences.
Use: prediction and estimation.
Motivating example
The distribution of baby weights at a certain hospital is ~ N(3400, 360000): mean 3400 grams, standard deviation 600 grams.

Your best guess at a random baby's weight, given no information about the baby, is what?
3400 grams

But what if you have relevant information? Can you make a better guess?
X=gestation time
Assume that babies that gestate for
longer are born heavier, all other things
being equal.
Assume that this relationship is linear.
Example: suppose a one-week increase
in gestation, on average, leads to a
100-gram increase in birth-weight
[Scatterplot: Y = birth-weight (g), the outcome, vs. X = gestation time (weeks), the predictor variable]
Y depends on X: positive correlation between gestation and birth weight.
Prediction
A new baby is born that had gestated for just 30 weeks. What's your best guess at the birth-weight?

Are you guessing 3400 g? NO.
[Scatterplot with best-fit line: Y = birth-weight (g) vs. X = gestation time (weeks), marking the point (x, y) = (30, 3000)]

At 30 weeks: the babies that gestate for 30 weeks appear to center around a weight of 3000 grams and vary around 3000 with some variance σ².

In statistics we say E(Y | X = 30 weeks) = 3000 grams.
If X = 20, 30, or 40
[Scatterplot: Y = birth-weight (g) vs. X = gestation time (weeks), showing the conditional distributions of Y at X = 20, 30, and 40]

Y | X = 20 weeks ~ N(2000, σ²)
Y | X = 30 weeks ~ N(3000, σ²)
Y | X = 40 weeks ~ N(4000, σ²)

E(Y | X = 20 weeks) = 2000
E(Y | X = 30 weeks) = 3000
E(Y | X = 40 weeks) = 4000

E(Y | X) = 100 grams/week × X weeks

The mean values fall on the line.
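To make the conditional-distribution idea concrete, here is a minimal simulation sketch in Python (illustrative only, not from the slides; the error SD of 400 g is an assumed value):

import numpy as np

rng = np.random.default_rng(0)
sigma = 400  # assumed error SD in grams (not given in the slides)

# gestation times: many babies at 20, 30, and 40 weeks
x = rng.choice([20, 30, 40], size=30000)

# Y | X ~ N(100*X, sigma^2): mean birth weight rises 100 g per week
y = 100 * x + rng.normal(0, sigma, size=x.size)

for week in (20, 30, 40):
    print(week, round(y[x == week].mean()))  # ~2000, ~3000, ~4000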
Note that not every Y-value (Yi) sits on the line. There's variability.

Yi = 3000 + random error_i

Linear regression model:

Yi = β0 + β1Xi + εi

where β0 + β1Xi is fixed exactly on the line, and εi (the error or residual) is a random error that follows a normal distribution. Together they give the actual value of Yi.

For the gestation example, the Ys are modeled as: Yi = 100·Xi + random error_i
Regression Coefficients for a . . .

Population:  Yi = β0 + β1Xi + εi   (population regression line)
Sample:      Ŷi = b0 + b1Xi       (fitted regression line)
Assumptions
Linear regression assumes that:
1. The relationship between X and Y is linear.
2. X is fixed.
3. Errors (εi) are distributed normally with mean 0. Thus, for each value of X, Y should be distributed as normal.
4. Errors are uncorrelated.
5. The variance of Y at every value of X is the same (homogeneity of variance).
Non-homogeneous variance
[Scatterplot illustrating non-homogeneous variance: Y = birth-weight (100 g) vs. X = gestation time (weeks)]
Estimating the regression line
Method of least squares: fits the regression line by minimizing the sum of squared errors from the line.

[Plot: an observed point (Xi, Yi), the fitted line Ŷ = b0 + b1X, and the vertical distance from the point to its fitted value Ŷi]

Min Σ εi² = Σ (Yi − Ŷi)²
Estimating the regression line
Least Squares Estimators

b1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²

b0 = Ȳ − b1 X̄

Expected value of Y at level x = xi:  ŷi = b0 + b1 xi
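As an illustrative sketch (not part of the original slides), these estimators can be computed directly in Python; the gestation and weight values below are made up:

import numpy as np

def least_squares(x, y):
    # b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2); b0 = ybar - b1*xbar
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = np.array([20, 25, 30, 35, 40])            # gestation (weeks), made-up
y = np.array([2100, 2450, 3050, 3400, 4050])  # birth weight (g), made-up
b0, b1 = least_squares(x, y)
print(b0, b1)  # for these data the slope comes out near 100 g/week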
Linear regression - Variation
[Diagram decomposing the deviation of an observed Yi from Ȳ into an explained part and a residual part]

SST = Σ (Yi − Ȳ)²    (total variation)
SSR = Σ (Ŷi − Ȳ)²    (variation explained by the regression)
SSE = Σ (Yi − Ŷi)²   (residual variation)

SST = SSR + SSE
Explanatory power of a linear regression model
Coefficient of determination:

R² = SSR / SST = explained variability / total sample variability

Proportion of sample variability of the dependent variable explained by its linear relationship with the independent variable.

EXAMPLE
R² = 0.92 for the gestation data implies that 92% of the sample variability in birth weight is explained by its linear dependence on gestation length.
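A small self-contained sketch (illustrative, reusing the made-up data from above) of how R² follows from the sums of squares:

import numpy as np

x = np.array([20, 25, 30, 35, 40])
y = np.array([2100, 2450, 3050, 3400, 4050])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)         # residual variability

print(round(ssr / sst, 3))  # R^2; note SST = SSR + SSE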
Tests for parameter β1

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (a linear relationship does exist)

ANOVA Table

Source of variation   SS    d.f.   MS                V.R.
Linear regression     SSR   1      MSR = SSR/1       Fc = MSR/MSE
Residual              SSE   n-2    MSE = SSE/(n-2)
Total                 SST   n-1

Perform the F-test: reject H0 if level of significance > p-value = P(F > Fc).
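An illustrative sketch of the F-test arithmetic (assumes scipy is available; same made-up data as above):

import numpy as np
from scipy import stats

x = np.array([20, 25, 30, 35, 40])
y = np.array([2100, 2450, 3050, 3400, 4050])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

msr = np.sum((y_hat - y.mean()) ** 2) / 1     # MSR = SSR/1
mse = np.sum((y - y_hat) ** 2) / (n - 2)      # MSE = SSE/(n-2)
f_c = msr / mse

p_value = stats.f.sf(f_c, 1, n - 2)  # P(F > Fc) under H0: beta1 = 0
print(f_c, p_value)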

[F distribution showing the rejection region for the F-test]

Tests for parameter β1: interpretation

If H0: β1 = 0 is not rejected:
X is not a strong enough predictor of Y, or
the relationship between X and Y is not linear.

If H0: β1 = 0 is rejected:
the relationship between X and Y is linear, and
strong enough to use the estimated regression line for predicting Y.
Example
SPSS printout for advertising-sales regression:

Ŷ = 0.1 + 0.7X   (intercept = 0.1, slope = 0.7)
Residual
Residual = observed value − predicted value

ei = yi − ŷi = yi − (b0 + b1xi)

Example: at 33.5 weeks gestation, the predicted baby weight is 3350 grams. This baby was actually 3380 grams, so his residual is +30 grams.

We fit the regression coefficients such that the sum of squared residuals is minimized (least squares regression).
Residual Analysis: check assumptions
The residual for observation i, ei = Yi − Ŷi, is the difference between its observed and predicted value.

Check the assumptions of regression by examining the residuals:
Examine for the linearity assumption
Examine for constant variance at all levels of X (homoscedasticity)
Evaluate the normal distribution assumption
Evaluate the independence assumption

Graphical analysis of residuals: plot the residuals ei vs. X.
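A minimal plotting sketch for the residuals-vs-X diagnostic just described (illustrative; assumes matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([20, 25, 30, 35, 40])            # made-up data again
y = np.array([2100, 2450, 3050, 3400, 4050])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")  # residuals should scatter randomly about zero
plt.xlabel("X (gestation, weeks)")
plt.ylabel("residual (g)")
plt.show()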
Residual Analysis for Linearity
[Panels: Y vs. x and residuals vs. x, for a not-linear relationship and a linear one]
Residual Analysis for Homoscedasticity
[Panels: Y vs. x and residuals vs. x, for non-constant variance and constant variance]
Residual Analysis for Independence
[Panels: residuals vs. X, for not-independent errors and independent errors]
Multiple linear regression
More than one predictor:

E(Y | X, W, Z) = α + β1X + β2W + β3Z

The intercept is the average value of the dependent variable when every independent variable takes the value zero.
Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change in the predictor, if all other variables in the model were held constant.

Example

Y = α + β1X1 + β2X2 + β3X3 + ε

where
Y = current market value of home
X1 = square feet of living area
X2 = appraised value last year
X3 = quality of construction (price per square foot)

Fitted model: Ŷ = 3 + 2X1 + 0.25X2 + 0.75X3

Interpretation of β1: controlling for all other variables in the model, every 1-unit increase in living area is associated with an increase of 2 units in the average market value of the home.
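For illustration (not from the slides), a multiple regression of this form can be fitted by ordinary least squares in numpy; the home-value numbers below are made up:

import numpy as np

# columns: living area, last year's appraised value, construction quality
X = np.array([[14.0, 180.0,  95.0],
              [20.0, 240.0, 110.0],
              [16.0, 200.0, 100.0],
              [24.0, 300.0, 120.0],
              [18.0, 220.0, 105.0]])
y = np.array([250.0, 330.0, 280.0, 400.0, 305.0])  # market value, made-up

# prepend a column of ones so the first coefficient is the intercept
X1 = np.column_stack([np.ones(len(y)), X])
coef, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # [intercept, b1, b2, b3]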
Explanatory power
R² / Adjusted R²

Adjusted R² should be used when the number of independent variables is large relative to the number of data values n.

R² = 0.87 means that 87% of the variability in Y can be explained by the regression model.
Multiple correlation R
Measures strength of relationship between
observed and predicted values of Y


Testing global utility of the model
Analysis of variance F-test:

H0: β1 = β2 = ... = βk = 0
H1: not all βj = 0

Source of variation   SS    d.f.    MS                  V.R.
Linear regression     SSR   k       MSR = SSR/k         Fc = MSR/MSE
Residual              SSE   n-k-1   MSE = SSE/(n-k-1)
Total                 SST   n-1

Reject H0 if level of significance > p-value = P(F > Fc).

Test of individual coefficient
H
0
: |
j
= 0
H
1
: |
j
0

t- test

Reject H
0
if level of significance > p-value
of test

SPSS printout for multiple
regression
2 2 . 198 1 3 . 80 6 . 279

X X Y + + =
Qualitative predictors
Qualitative variables: variables whose values are categories and convey the concept of an attribute rather than amount or quantity.

Examples: marital status, gender, race, occupation, smoking status, etc.

To incorporate a qualitative independent variable in the multiple regression model, it must be quantified in some manner.
Dummy variables
Dummy variable / indicator variable: takes on the values 0 or 1.

Purpose: to identify the different categories of a qualitative variable.

A qualitative variable with k classes will be represented by k-1 dummy variables in the regression model.
How to use dummy variables

Gender (male, female):
x1 = 1 for male, 0 for female

Place of residence (urban, rural, suburban):
x1 = 1 for urban, 0 for rural and suburban
x2 = 1 for rural, 0 for suburban and urban

Smoking status [current smoker, ex-smoker (has not smoked for 5 years or less), ex-smoker (has not smoked for more than 5 years), never smoked]:
x1 = 1 for current smoker, 0 otherwise
x2 = 1 for ex-smoker (≤ 5 years), 0 otherwise
x3 = 1 for ex-smoker (> 5 years), 0 otherwise
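A sketch of creating such 0/1 dummies in code (illustrative; assumes pandas), keeping "never smoked" as the baseline so the 4-class variable yields 4−1 = 3 columns:

import pandas as pd

df = pd.DataFrame({"smoking": ["current", "ex<=5y", "ex>5y", "never", "current"]})

dummies = pd.get_dummies(df["smoking"], prefix="x")  # one 0/1 column per class
dummies = dummies.drop(columns="x_never")            # drop baseline: k-1 dummies
print(dummies.astype(int))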
Example
In a study of factors thought to be associated with birth weight (Y), two independent variables were considered: age of mother (X1), which is quantitative, and smoking status of mother (smoker or nonsmoker).

Smoking status (smoker, nonsmoker):
x2 = 1 for smoker, 0 otherwise
Fitted regression model:

ŷi = 2390 + 143x1i + 245x2i

Smoking mothers (x2 = 1):
ŷi = 2390 + 143x1i + 245(1) = 2635 + 143x1i

Nonsmoking mothers (x2 = 0):
ŷi = 2390 + 143x1i + 245(0) = 2390 + 143x1i
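A tiny sketch (illustrative) of how the dummy coefficient shifts the intercept between the two groups while the slope stays the same:

def predicted_weight(age, smoker):
    # fitted model from the slides: y-hat = 2390 + 143*age + 245*smoker
    return 2390 + 143 * age + 245 * (1 if smoker else 0)

print(predicted_weight(25, smoker=True))   # 2635 + 143*25 = 6210
print(predicted_weight(25, smoker=False))  # 2390 + 143*25 = 5965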


Example
[Plot: birth weight (Y) vs. age of mother (X1), showing the two parallel fitted lines for smoking and nonsmoking mothers]
Example
A team of mental health researchers wishes to compare three methods (A, B, and C) of treating severe depression. The dependent variable Y is treatment effectiveness, the quantitative independent variable X1 is the patient's age, and the independent variable type of treatment is a qualitative variable that occurs at three levels.

Treatment (treatment A, treatment B, treatment C):
x2 = 1 for treatment A, 0 otherwise
x3 = 1 for treatment B, 0 otherwise
Example
Fitted regression model:

ŷi = 6.21 + 1.03x1i + 41.3x2i + 22.7x3i

The regression equations for the three treatments are as follows:

Treatment A: ŷi = (6.21 + 41.3) + 1.03x1i = 47.51 + 1.03x1i
Treatment B: ŷi = (6.21 + 22.7) + 1.03x1i = 28.91 + 1.03x1i
Treatment C: ŷi = 6.21 + 1.03x1i
LAB SESSION
Multiple regression using SPSS

Let's look at an example dataset called crime. This dataset appears in Statistical Methods for the Social Sciences, Third Edition, by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are:

state id (sid)
state name (state)
violent crimes per 100,000 people (crime)
murders per 1,000,000 (murder)
percent of the population living in metropolitan areas (pctmetro)
percent of the population that is white (pctwhite)
percent of the population with a high school education or above (pcths)
percent of the population living under the poverty line (poverty)
percent of the population that are single parents (single)
Using SPSS commands for
correlation
CORRELATIONS
/VARIABLES= CRIME PCTMETRO
/PRINT=TWOTAIL NOSIG
/MISSING=PAIRWISE.
GRAPH
/SCATTERPLOT(BIVAR)= PCTMETRO WITH CRIME .
Using SPSS commands for correlation
[Scatterplot output: PCTMETRO with CRIME]
Using SPSS commands for
regression
REGRESSION VARIABLES = CRIME MURDER PCTMETRO PCTWHITE
PCTHS POVERTY SINGLE
/DEPENDENT = CRIME
/METHOD=ENTER.
Using SPSS commands for
regression
REGRESSION VARIABLES = CRIME MURDER PCTMETRO PCTWHITE
PCTHS POVERTY SINGLE
/STATISTICS= R ANOVA COEFF
/DEPENDENT = CRIME
/METHOD= ENTER PCTMETRO POVERTY SINGLE.
Using SPSS commands for
regression
REGRESSION VARIABLES = CRIME MURDER PCTMETRO PCTWHITE
PCTHS POVERTY SINGLE
/STATISTICS= R ANOVA COEFF
/DEPENDENT = CRIME
/METHOD=FORWARD.
Using SPSS commands for
regression
REGRESSION VARIABLES = CRIME MURDER PCTMETRO PCTWHITE
PCTHS POVERTY SINGLE
/DEPENDENT = CRIME
/METHOD=ENTER PCTMETRO POVERTY SINGLE
/RESIDUALS.
Using SPSS commands for
regression
Residuals appear to be normally distributed
Using SPSS commands for
regression

REGRESSION VARIABLES = CRIME MURDER PCTMETRO PCTWHITE PCTHS
POVERTY SINGLE
/DEPENDENT = CRIME
/METHOD=ENTER PCTMETRO POVERTY SINGLE
/SCATTERPLOT=(*ZRESID, *PRED).
Distribution of residuals appears okay, indicating homogeneous variance.
Using SPSS commands for
regression
REGRESSION VARIABLES = CRIME MURDER PCTMETRO PCTWHITE PCTHS
POVERTY SINGLE
/DEPENDENT = CRIME
/METHOD=ENTER PCTMETRO POVERTY SINGLE
/RESIDUALS
/SAVE=PRED(PREDVAL) RESID(RESIDUAL) ICIN.


The End
