
Simple linear regression: Introduction
AS202 Applied Statistical Models

What is Linear Regression?


It is a statistical technique that attempts to explore and model the relationship between two or more variables using a straight line.

Regression describes a numerical relationship between variables; a numerical relationship is not necessarily a causal relationship.

Functional relation: a perfect fit, where all values fall exactly on the straight line.

    $Y = f(X) = a + bX$
    $Y = a + bX + cZ$

Regression model: not a perfect fit; values do not fall exactly on the line, so errors exist.

    $Y = a + bX + \text{error}$
    $Y = a + b_1 X_1 + b_2 X_2 + \text{error}$

Type of relationship:
1. Linear relationship ("linear" refers to the parameters/slope, not the regressor):
   Simple: one regressor
       $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
   Multiple: more than one regressor
       $y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon_i$
   Polynomial: one or more regressors with higher degrees, e.g. quadratic, cubic, etc.
       $y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon_i$
   With interaction terms:
       $y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \varepsilon_i$

[Figures: scatter plots of Price (RM1000) against Size (feet squared) and against Age of home (years).]

2. Non-linear relationship in one or more variables: exponential, logarithm, logistic, etc. E.g.:

    $y_i = \beta_0 e^{\beta_1 x_i} \varepsilon_i$  (exponential)
    $y_i = \dfrac{1}{1 + e^{\beta_0 + \beta_1 x_i}} + \varepsilon_i$  (logistic)

We use regression for:

1. Data description (modeling)
2. Parameter estimation
3. Prediction/forecasting
4. System control
5. Data reduction

Some applications:
1. An economist wants to investigate the relationship between the petrol price and the inflation rate.
2. A sales manager wants to predict next year's total sales based on the number of staff and the floor space (in square feet) of the store.
3. A policy maker wants to identify the main factors (e.g. speed limit, road condition, weather) that contribute to the number of road accidents.
4. A scientist wants to know at what level noise pollution begins to affect human health.
5. A computer scientist wants to compress an image for minimum storage.


Simple linear regression: The simple linear regression model
AS202 Applied Statistical Models

Example:
You want to know if there is a relationship between the monthly personal income and the age of a worker, and then to forecast your monthly income when you are 50 years old. There are five workers in your study.
The two variables are:
1. Age of worker (years): independent variable (X)
2. Monthly personal income (RM): dependent variable (Y)
Sample size (n): 5 workers

Data collected from 5 working adults:

Respondent | Age | Personal Income (Monthly, RM)
1          | 34  | 2950
2          | 45  | 4000
3          | 29  | 2430
4          | 32  | 3000
5          | 23  | 1790

As age increases, income increases.

[Scatter plot of income against age; extending a straight line through the points suggests an income of RM4565.87 at age 50.]

What is your income when you reach 50 years old?

Mathematical equation for the straight line: $Y = \beta_0 + \beta_1 X$.

The gaps between the points and the line are the errors of the model.

The simple linear regression model represents the straight line with errors:

    $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,   i = 1, 2, ..., n    (Eq 1.1)

where $y_i$ is the dependent (response) variable, $x_i$ is the independent (regressor) variable, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the random error term.

Assumptions:
1) The error term $\varepsilon_i$ is normally distributed with mean $E(\varepsilon_i) = 0$ and constant variance $Var(\varepsilon_i) = \sigma^2$;
2) The errors are uncorrelated: $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

In short, $\varepsilon_i \sim NID(0, \sigma^2)$.

This implies that the dependent variable y follows a normal distribution with

    $E(y \mid x) = \beta_0 + \beta_1 x$;   $Var(y \mid x) = \sigma^2$.

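To make these assumptions concrete, here is a minimal simulation sketch (not from the slides; the parameter values are illustrative assumptions) that generates data satisfying the SLR model:

```python
# Simulate data that satisfies the SLR assumptions (illustrative values only).
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = -410.77, 99.53, 148.4   # hypothetical "true" parameters
x = rng.uniform(23, 45, size=100)             # regressor values
eps = rng.normal(0.0, sigma, size=100)        # errors: NID(0, sigma^2), uncorrelated
y = beta0 + beta1 * x + eps                   # so y | x ~ N(beta0 + beta1*x, sigma^2)
```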

Some Properties:


[Figure: the conditional distribution of y at two values of x. At x = 23 the distribution of y is normal with mean $E(y) = \beta_0 + \beta_1(23)$ and standard deviation $\sigma$; at x = 45 it is normal with mean $E(y) = \beta_0 + \beta_1(45)$ and the same standard deviation $\sigma$. The means lie on the line $E(y \mid x) = \beta_0 + \beta_1 x$.]

Simple linear regression: Least squares estimation
AS202 Applied Statistical Models

Objective: find the line that best fits the empirical data.

Finding the line that fits the data best = estimating the values of $\beta_0$ and $\beta_1$ that minimize the errors.

Parameter estimation of $\beta_0$ and $\beta_1$: the least squares method.

    $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,   i = 1, 2, ..., n

Rewrite the model as

    $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$

To eliminate the negative signs of the error terms, consider the squared errors. Summing over the n pairs of sample data gives the error sum of squares:

    $S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

Label the error sum of squares as $S(\beta_0, \beta_1)$ and call it the LS criterion:

    $S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

Tutorial Question 2: how to find the estimators?

Result: the LS estimators of $\beta_0$ and $\beta_1$ are

    $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Thus, the fitted simple linear regression model is

    $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Re-visit the example:

Respondent | Age (x) | Income (y) | xy     | x²
1          | 34      | 2950       | 100300 | 1156
2          | 45      | 4000       | 180000 | 2025
3          | 29      | 2430       | 70470  | 841
4          | 32      | 3000       | 96000  | 1024
5          | 23      | 1790       | 41170  | 529
Sum        | 163     | 14170      | 487940 | 5575

    $\sum x_i = 163$,  $\sum y_i = 14170$,  $\bar{x} = 32.6$,  $\bar{y} = 2834$,  $\sum x_i y_i = 487940$,  $\sum x_i^2 = 5575$

Thus, $S_{xx} = \sum x_i^2 - (\sum x_i)^2/n = 5575 - 163^2/5 = 261.2$ and $S_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i)/n = 487940 - (163)(14170)/5 = 25998$, so

    $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{25998}{261.2} = 99.5329$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2834 - 99.5329(32.6) = -410.7725$

and the fitted SLR model is $\hat{y} = -410.7725 + 99.5329\,x$.
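As a check, the estimates can be reproduced with a short Python sketch (assuming NumPy; this mirrors the hand computation above rather than a library fit):

```python
import numpy as np

x = np.array([34, 45, 29, 32, 23], dtype=float)             # age
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)   # income (RM)

Sxx = np.sum((x - x.mean())**2)                  # 261.2
Sxy = np.sum((x - x.mean()) * (y - y.mean()))    # 25998.0
b1 = Sxy / Sxx                                   # ~ 99.5329
b0 = y.mean() - b1 * x.mean()                    # ~ -410.7725
print(b0, b1)
```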

Properties of the LSE:
1. The LSEs are linear combinations of the observations $y_i$:

    $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{S_{xx}} = \sum_{i=1}^{n} c_i y_i$,   where $c_i = \dfrac{x_i - \bar{x}}{S_{xx}}$

    $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \sum_{i=1}^{n} \left( \dfrac{1}{n} - c_i \bar{x} \right) y_i$

2. The LSEs are unbiased estimators of the parameters $\beta_0$ and $\beta_1$:

    $E(\hat{\beta}_1) = \beta_1$,   $E(\hat{\beta}_0) = \beta_0$

3. The variances of the LSEs are

    $Var(\hat{\beta}_1) = \dfrac{\sigma^2}{S_{xx}}$,   $Var(\hat{\beta}_0) = Var(\bar{y} - \hat{\beta}_1 \bar{x}) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right)$

Gauss-Markov Theorem (the best linear unbiased estimators): under the conditions of the regression model, the least squares estimators are unbiased and have minimum variance among all unbiased linear estimators.

Summary
1. The SLR model:
    $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,   i = 1, 2, ..., n
2. The LS estimators of $\beta_0$ and $\beta_1$:
    $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
3. The fitted SLR model:
    $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Simple linear regression: Forecasting using SLR
AS202 Applied Statistical Models

Suppose you constructed the SLR model using n pairs of sample data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and the range of x is [a, b].

Two types of forecasting:
Extrapolation: predict the value of y using an x outside the range [a, b].
Interpolation: predict the value of y using an x inside the range [a, b].

Re-visit the example: the data table is the same as before, with $\sum x_i = 163$, $\sum y_i = 14170$, $\sum x_i y_i = 487940$, $\sum x_i^2 = 5575$.

Age range used: [23, 45]

At age 50 (x = 50), your predicted monthly income is

    $\hat{y} = -410.7725 + 99.5329(50) = 4565.87$

i.e. RM4565.87. This is extrapolation!

At age 30 (x = 30), your predicted monthly income is

    $\hat{y} = -410.7725 + 99.5329(30) = 2575.21$

i.e. RM2575.21. This is interpolation!
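A small sketch of both predictions (coefficients hard-coded from the fitted model above):

```python
b0, b1 = -410.7725, 99.5329  # fitted coefficients from the example

def predict(x0):
    """Point prediction from the fitted SLR model."""
    return b0 + b1 * x0

print(predict(50))  # ~ 4565.87: extrapolation, since 50 lies outside [23, 45]
print(predict(30))  # ~ 2575.21: interpolation, since 30 lies inside [23, 45]
```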

Note: the fitted equation is intended as an interpolation model over the range of the regressor variable. We must be careful if we extrapolate outside of this range.

Properties of the Fitted Regression Model:

1. The difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$ is a residual:

    $e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$

and the sum of the residuals is always zero:

    $\sum_{i=1}^{n} e_i = 0$

2. The LS regression line always passes through the centroid $(\bar{x}, \bar{y})$ of the data.

3. The sum of the observed values $y_i$ equals the sum of the fitted values $\hat{y}_i$:

    $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i$

4. The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero:

    $\sum_{i=1}^{n} x_i e_i = 0$

5. The sum of the residuals weighted by the corresponding fitted value always equals zero:

    $\sum_{i=1}^{n} \hat{y}_i e_i = 0$

Simple linear regression: Interval estimation
AS202 Applied Statistical Models

Estimation of the variance $\sigma^2$:
Method 1: based on several observations (replication) on y for at least one value of x.
Method 2: when prior information concerning $\sigma^2$ is available.
Method 3: estimate based on the residual (error) sum of squares:

    $SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    $\hat{\sigma}^2 = MS_{Res} = \dfrac{SS_{Res}}{n - 2}$

This unbiased estimator of $\sigma^2$ is called the residual mean square, and its square root is called the standard error of regression.

For the example, $\hat{y}_i = -410.77 + 99.53\,x_i$, so

    $SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 66063.02$

    $MS_{Res} = \dfrac{SS_{Res}}{n - 2} = \dfrac{66063.02}{5 - 2} = 22021.01$
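The same numbers in a Python sketch (assuming NumPy; coefficients taken from the fitted model):

```python
import numpy as np

x = np.array([34, 45, 29, 32, 23], dtype=float)
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)

y_hat = -410.7725 + 99.5329 * x      # fitted values
e = y - y_hat                        # residuals
SS_res = np.sum(e**2)                # ~ 66063.02
MS_res = SS_res / (len(x) - 2)       # ~ 22021.01, unbiased estimate of sigma^2
print(SS_res, MS_res)
```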

Interval Estimation in Simple Linear Regression:

If the errors are NID, then the 100(1 − α)% confidence intervals for $\beta_1$, $\beta_0$ and $\sigma^2$ are

    $\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \leq \beta_1 \leq \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1)$

    $\hat{\beta}_0 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0) \leq \beta_0 \leq \hat{\beta}_0 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0)$

    $\dfrac{(n-2)\, MS_{Res}}{\chi^2_{\alpha/2,\,n-2}} \leq \sigma^2 \leq \dfrac{(n-2)\, MS_{Res}}{\chi^2_{1-\alpha/2,\,n-2}}$
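A sketch of these three intervals at the 95% level for the example data (assuming SciPy for the t and chi-square quantiles):

```python
import numpy as np
from scipy import stats

n, xbar, Sxx = 5, 32.6, 261.2
b0, b1, MS_res = -410.7725, 99.5329, 22021.01
t = stats.t.ppf(0.975, n - 2)                      # t_{0.025,3} ~ 3.182

se_b1 = np.sqrt(MS_res / Sxx)                      # ~ 9.18
se_b0 = np.sqrt(MS_res * (1/n + xbar**2 / Sxx))    # ~ 306.60
print(b1 - t*se_b1, b1 + t*se_b1)                  # ~ (70.31, 128.75)
print(b0 - t*se_b0, b0 + t*se_b0)                  # ~ (-1386.5, 564.96)

# chi-square interval for sigma^2
lo = (n - 2) * MS_res / stats.chi2.ppf(0.975, n - 2)
hi = (n - 2) * MS_res / stats.chi2.ppf(0.025, n - 2)
print(lo, hi)
```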

Interval Estimation of the Mean Response:

Let $x_0$ be any value of the regressor variable within the range of the original data on x used to fit the model. Then the mean response $E(y \mid x_0)$ can be estimated by

    $\widehat{E(y \mid x_0)} = \hat{\mu}_{y \mid x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$

where

    $Var(\hat{\mu}_{y \mid x_0}) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)$

Then a 100(1 − α)% confidence interval on the mean response at the point $x = x_0$ is

    $\hat{\mu}_{y \mid x_0} - t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)} \;\leq\; E(y \mid x_0) \;\leq\; \hat{\mu}_{y \mid x_0} + t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)}$
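For the example, a sketch of the 95% confidence interval for the mean income at age 30 (assuming SciPy):

```python
import numpy as np
from scipy import stats

n, xbar, Sxx, MS_res = 5, 32.6, 261.2, 22021.01
b0, b1 = -410.7725, 99.5329

x0 = 30                                            # inside the data range [23, 45]
mu_hat = b0 + b1 * x0                              # ~ 2575.21
half = stats.t.ppf(0.975, n - 2) * np.sqrt(MS_res * (1/n + (x0 - xbar)**2 / Sxx))
print(mu_hat - half, mu_hat + half)                # 95% CI for E(y | x0 = 30)
```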

Simple linear regression: Prediction interval
AS202 Applied Statistical Models


Prediction of New Observations:

If $x_0$ is the value of the regressor of interest, the point estimate of the new value of the response $y_0$ is

    $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$

Note that the random variable

    $y_0 - \hat{y}_0 \sim N\!\left( 0,\; \sigma^2 \left( 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right) \right)$

The 100(1 − α)% prediction interval on the future observation at $x_0$ is

    $\hat{y}_0 - t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)} \;\leq\; y_0 \;\leq\; \hat{y}_0 + t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)}$
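The prediction interval differs from the mean-response interval only by the extra "1 +" inside the square root, which accounts for the variability of a single new observation. A sketch at x0 = 50 (assuming SciPy; note this point is also an extrapolation):

```python
import numpy as np
from scipy import stats

n, xbar, Sxx, MS_res = 5, 32.6, 261.2, 22021.01
b0, b1 = -410.7725, 99.5329

x0 = 50
y0_hat = b0 + b1 * x0                              # ~ 4565.87
half = stats.t.ppf(0.975, n - 2) * np.sqrt(MS_res * (1 + 1/n + (x0 - xbar)**2 / Sxx))
print(y0_hat - half, y0_hat + half)                # 95% prediction interval for y0
```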

Simple linear regression: Hypothesis testing
AS202 Applied Statistical Models

Hypothesis Testing on the Parameters:

Assumption: the errors $\varepsilon_i$ are normally distributed, i.e. $\varepsilon_i \sim NID(0, \sigma^2)$. This implies that $y_i \sim NID(\beta_0 + \beta_1 x_i, \sigma^2)$, and hence

    $\hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \sim N\!\left( \beta_1,\; \dfrac{\sigma^2}{S_{xx}} \right)$

To test the hypothesis that the slope equals a constant, we have

    $H_0: \beta_1 = \beta_{10}$
    $H_1: \beta_1 \neq \beta_{10}$

and the test statistic is

    $Z_0 = \dfrac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1)$

Typically $\sigma^2$ is unknown and the unbiased estimator $MS_{Res}$ is used. Then the test statistic becomes

    $t_0 = \dfrac{\hat{\beta}_1 - \beta_{10}}{\sqrt{MS_{Res} / S_{xx}}} \sim t_{n-2}$

We reject the null hypothesis if $|t_0| > t_{\alpha/2,\,n-2}$.
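For the example data, a sketch of the significance test of the slope (with hypothesized value $\beta_{10} = 0$; assuming SciPy):

```python
import numpy as np
from scipy import stats

n, Sxx, MS_res = 5, 261.2, 22021.01
b1, b10 = 99.5329, 0.0                    # estimate and hypothesized slope

t0 = (b1 - b10) / np.sqrt(MS_res / Sxx)   # ~ 10.84
p = 2 * stats.t.sf(abs(t0), n - 2)        # two-sided p-value, ~ 0.0017
print(t0, p)
```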


Here

    $se(\hat{\beta}_1) = \sqrt{\dfrac{MS_{Res}}{S_{xx}}}$

is called the (estimated) standard error of the slope.

To test the hypothesis that the intercept equals a constant, we have

    $H_0: \beta_0 = \beta_{00}$
    $H_1: \beta_0 \neq \beta_{00}$

and the test statistic is

    $t_0 = \dfrac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MS_{Res} \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right)}} = \dfrac{\hat{\beta}_0 - \beta_{00}}{se(\hat{\beta}_0)} \sim t_{n-2}$

We reject the null hypothesis if $|t_0| > t_{\alpha/2,\,n-2}$.

Special Case: Testing Significance of Regression

    $H_0: \beta_1 = 0$
    $H_1: \beta_1 \neq 0$

Testing: (i) t-statistic, or (ii) analysis of variance (ANOVA).

1) Failing to reject the null hypothesis suggests there is no linear relationship between x and y: either (i) x is of little value in explaining the variation of y, or (ii) the true relationship between x and y is not linear.

2) Rejecting the null hypothesis suggests that x is of value in explaining the variability of y: either (i) the straight-line model is adequate, or (ii) better results could be obtained with the addition of higher-order polynomial terms in x.

Measures of Variation:
Total variation is made up of two parts: SST = SSR + SSE.

    $SST = \sum (Y_i - \bar{Y})^2$  (Total Sum of Squares)
    $SSR = \sum (\hat{Y}_i - \bar{Y})^2$  (Regression Sum of Squares)
    $SSE = \sum (Y_i - \hat{Y}_i)^2$  (Error Sum of Squares)

where:
    $\bar{Y}$ = mean value of the dependent variable
    $Y_i$ = observed value of the dependent variable
    $\hat{Y}_i$ = predicted value of Y for the given $X_i$ value

SST = total sum of squares (total variation): measures the variation of the $Y_i$ values around their mean $\bar{Y}$.
SSR = regression sum of squares (explained variation): variation attributable to the relationship between X and Y.
SSE = error sum of squares (unexplained variation): variation in Y attributable to factors other than X.

[Figure: for a single observation $(X_i, Y_i)$, the vertical deviations from the fitted line and from $\bar{Y}$ illustrate the decomposition: $SSE = \sum (Y_i - \hat{Y}_i)^2$, $SSR = \sum (\hat{Y}_i - \bar{Y})^2$, $SST = \sum (Y_i - \bar{Y})^2$.]

F Test for Significance:

    $F_{STAT} = \dfrac{MSR}{MSE}$,   where   $MSR = \dfrac{SSR}{k}$,   $MSE = \dfrac{SSE}{n - k - 1}$

$F_{STAT}$ follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom (k = the number of independent variables in the regression model; k = 1 for SLR).
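A sketch of the ANOVA decomposition and F statistic for the example (k = 1; assuming NumPy):

```python
import numpy as np

x = np.array([34, 45, 29, 32, 23], dtype=float)
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)
y_hat = -410.7725 + 99.5329 * x            # fitted values

n, k = len(y), 1
SST = np.sum((y - y.mean())**2)            # ~ 2653720
SSR = np.sum((y_hat - y.mean())**2)        # ~ 2587657
SSE = np.sum((y - y_hat)**2)               # ~ 66063
F = (SSR / k) / (SSE / (n - k - 1))        # ~ 117.5
print(SST, SSR, SSE, F)
```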

Simple linear regression: Linear association between X & Y
AS202 Applied Statistical Models

Coefficient of Determination:

    $R^2 = \dfrac{SS_R}{SS_T} = 1 - \dfrac{SS_{Res}}{SS_T}$

where $SS_{Res}$ is the residual (error) sum of squares, $SS_R$ is the regression (model) sum of squares, and $SS_T$ is a measure of the variability in y without considering the effect of the regressor variable x:

    $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$,   $SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$,   $SS_T = SS_R + SS_{Res}$

$R^2$ measures the proportion of variation explained by the regressor x.

SUMMARY OUTPUT (Excel)

Regression Statistics
Multiple R          0.98747
R Square            0.97511
Adjusted R Square   0.96681
Standard Error      148.395
Observations        5

ANOVA
            df   SS          MS           F        Significance F
Regression  1    2587656.98  2587656.98   117.509  0.00167965
Residual    3    66063.02    22021.0056
Total       4    2653720

            Coefficients  Standard Error  t Stat    P-value  Lower 95%   Upper 95%
Intercept   -410.77       306.5981        -1.3398   0.2727   -1386.5053  564.959
Age         99.5329       9.182           10.8401   0.0017   70.3121     128.754

[The Age row's coefficient, standard error, and upper limit are recovered from $\hat{\beta}_1 = 99.5329$ and $se(\hat{\beta}_1) = \sqrt{MS_{Res}/S_{xx}} \approx 9.182$.]
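Most of this output can be reproduced with scipy.stats.linregress (a sketch; the remaining columns follow from the formulas in the earlier sections):

```python
import numpy as np
from scipy import stats

x = np.array([34, 45, 29, 32, 23], dtype=float)
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)

res = stats.linregress(x, y)
print(res.intercept, res.slope)    # ~ -410.77, 99.5329 (Coefficients)
print(res.rvalue**2)               # ~ 0.97511 (R Square)
print(res.stderr)                  # ~ 9.18 (standard error of the slope)
print(res.pvalue)                  # ~ 0.0017 (slope P-value = Significance F here)
```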

Simple linear regression: Some remarks (optional)
AS202 Applied Statistical Models

Some Considerations in the Use of Regression:

1) The disposition of the x values plays an important role in the least squares fit.

2) Outliers can seriously disturb the least squares fit.
3) When a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in any causal sense (cause and effect).
4) In some applications, the value of the regressor variable x required to predict y is unknown. Thus, to predict y, we must first predict x. The accuracy of the prediction of y then depends on the accuracy of the prediction of x.

Parameter Estimation: Maximum Likelihood Estimation (MLE)

If the form of the distribution of the errors is known, e.g. normal, MLE is an alternative way of estimating the parameters. The likelihood function from the joint distribution of the observations is

    $L(y_i, x_i;\, \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \dfrac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\dfrac{1}{2\sigma^2} (y_i - \beta_0 - \beta_1 x_i)^2 \right)$

The maximum likelihood estimators can be obtained by solving

    $\left. \dfrac{\partial \ln L}{\partial \beta_0} \right|_{\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2} = 0, \quad \left. \dfrac{\partial \ln L}{\partial \beta_1} \right|_{\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2} = 0, \quad \left. \dfrac{\partial \ln L}{\partial \sigma^2} \right|_{\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2} = 0$
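A numerical sketch of the MLE for the example data (assuming SciPy's optimizer; for normal errors the MLEs of $\beta_0$ and $\beta_1$ coincide with the LSEs, while the MLE of $\sigma^2$ is $SS_{Res}/n$ rather than $SS_{Res}/(n-2)$):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([34, 45, 29, 32, 23], dtype=float)
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)

def neg_log_lik(theta):
    b0, b1, log_s2 = theta                 # parametrize log(sigma^2) to keep it positive
    s2 = np.exp(log_s2)
    r = y - b0 - b1 * x
    return 0.5 * len(y) * np.log(2 * np.pi * s2) + np.sum(r**2) / (2 * s2)

fit = minimize(neg_log_lik, x0=[0.0, 1.0, 10.0],
               method="Nelder-Mead", options={"maxiter": 5000, "fatol": 1e-10})
b0_hat, b1_hat = fit.x[0], fit.x[1]        # ~ -410.77, 99.53 (same as the LSEs)
s2_hat = np.exp(fit.x[2])                  # ~ 66063.02 / 5 = 13212.6
print(b0_hat, b1_hat, s2_hat)
```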
