
Simple linear regression: Introduction
AS202 Applied Statistical Models

What is Linear Regression?


It is a statistical technique that attempts to explore and model the relationship between two or more variables using a straight line.

Regression describes a numerical relationship between variables; a numerical relationship is not necessarily a causal relationship.

Functional relation: a perfect fit, where all values fall exactly on the straight line.

    $Y = f(X) = a + bX$
    $Y = a + bX + cZ$

Regression model: not a perfect fit; values do not fall exactly on the line, so errors exist.

    $Y = a + bX + \text{error}$
    $Y = a + b_1 X_1 + b_2 X_2 + \text{error}$

Type of relationship:
1. Linear relationship ("linear" refers to the parameters/slope, not the regressor):
   Simple: one regressor
       $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
   Multiple: more than one regressor
       $y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon_i$
   Polynomial: one or more regressors with higher degrees, e.g. quadratic, cubic, etc.
       $y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon_i$
   With interaction terms:
       $y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \varepsilon_i$

[Figures: scatter plots of Price (RM1000) against Size (feet squared) and against Age of home (years).]

2. Non-linear relationship in one or more variables: exponential, logarithm, logistic, etc. E.g.:

    $y_i = \beta_0 e^{\beta_1 x_i} \varepsilon_i$  (exponential)
    $y_i = \dfrac{1}{1 + e^{\beta_0 + \beta_1 x_i}} + \varepsilon_i$  (logistic)

We use regression for:

1. Data description (modeling)
2. Parameter estimation
3. Prediction/forecasting
4. System control
5. Data reduction

Some applications:
1. An economist wants to investigate the relationship between the petrol price and the inflation rate.
2. A sales manager wants to predict next year's total sales based on the number of staff and the floor space (in square feet) of the store.
3. A policy maker wants to identify the main factors (e.g. speed limit, road condition, weather) that contribute to the number of road accidents.
4. A scientist wants to know at what level noise pollution begins to affect human health.
5. A computer scientist wants to compress an image for minimum storage.


Simple linear regression: The simple linear regression model
AS202 Applied Statistical Models

Example:
You want to know if there is a relationship between the monthly personal income and the age of a worker, and then to forecast your monthly income when you are 50 years old. There are five workers in your study.
The two variables are:
1. Age of worker (years): independent variable (X)
2. Monthly personal income (RM): dependent variable (Y)
Sample size (n): 5 workers

Data collected from 5 working adults:

Respondent | Age | Personal Income (Monthly, RM)
1          | 34  | 2950
2          | 45  | 4000
3          | 29  | 2430
4          | 32  | 3000
5          | 23  | 1790

As age increases, income increases.

[Scatter plot of income against age; extending a straight line through the points suggests an income of RM4565.87 at age 50.]

What is your income when you reach 50 years old?

Mathematical equation for the straight line: $Y = \beta_0 + \beta_1 X$.

The gaps between the points and the line are the errors of the model.

The simple linear regression model represents the straight line with errors:

    $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,   i = 1, 2, ..., n    (Eq 1.1)

where $y_i$ is the dependent (response) variable, $x_i$ is the independent (regressor) variable, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the random error term.

Assumptions:
1) The error term $\varepsilon_i$ is normally distributed with mean $E(\varepsilon_i) = 0$ and constant variance $Var(\varepsilon_i) = \sigma^2$;
2) The errors are uncorrelated: $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

In short, $\varepsilon_i \sim NID(0, \sigma^2)$.

This implies that the dependent variable y follows a normal distribution with

    $E(y \mid x) = \beta_0 + \beta_1 x$;   $Var(y \mid x) = \sigma^2$.

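To make these assumptions concrete, here is a minimal simulation sketch (not from the slides; the parameter values are illustrative assumptions) that generates data satisfying the SLR model:

```python
# Simulate data that satisfies the SLR assumptions (illustrative values only).
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = -410.77, 99.53, 148.4   # hypothetical "true" parameters
x = rng.uniform(23, 45, size=100)             # regressor values
eps = rng.normal(0.0, sigma, size=100)        # errors: NID(0, sigma^2), uncorrelated
y = beta0 + beta1 * x + eps                   # so y | x ~ N(beta0 + beta1*x, sigma^2)
```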

Some Properties:


[Figure: the conditional distribution of y at two values of x. At x = 23 the distribution of y is normal with mean $E(y) = \beta_0 + \beta_1(23)$ and standard deviation $\sigma$; at x = 45 it is normal with mean $E(y) = \beta_0 + \beta_1(45)$ and the same standard deviation $\sigma$. The means lie on the line $E(y \mid x) = \beta_0 + \beta_1 x$.]

Simple linear regression: Least squares estimation
AS202 Applied Statistical Models

Objective: find the line that best fits the empirical data.

Finding the line that fits the data best = estimating the values of $\beta_0$ and $\beta_1$ that minimize the errors.

Parameter estimation of $\beta_0$ and $\beta_1$: the least squares method.

    $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,   i = 1, 2, ..., n

Rewrite the model as

    $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$

To eliminate the negative signs of the error terms, consider the squared errors. Summing over the n pairs of sample data gives the error sum of squares:

    $S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

Label the error sum of squares as $S(\beta_0, \beta_1)$ and call it the LS criterion:

    $S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

Tutorial Question 2: how to find the estimators?

Result: the LS estimators of $\beta_0$ and $\beta_1$ are

    $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Thus, the fitted simple linear regression model is

    $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Re-visit the example:

Respondent | Age (x) | Income (y) | xy     | x²
1          | 34      | 2950       | 100300 | 1156
2          | 45      | 4000       | 180000 | 2025
3          | 29      | 2430       | 70470  | 841
4          | 32      | 3000       | 96000  | 1024
5          | 23      | 1790       | 41170  | 529
Sum        | 163     | 14170      | 487940 | 5575

    $\sum x_i = 163$,  $\sum y_i = 14170$,  $\bar{x} = 32.6$,  $\bar{y} = 2834$,  $\sum x_i y_i = 487940$,  $\sum x_i^2 = 5575$

Thus, $S_{xx} = \sum x_i^2 - (\sum x_i)^2/n = 5575 - 163^2/5 = 261.2$ and $S_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i)/n = 487940 - (163)(14170)/5 = 25998$, so

    $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{25998}{261.2} = 99.5329$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2834 - 99.5329(32.6) = -410.7725$

and the fitted SLR model is $\hat{y} = -410.7725 + 99.5329\,x$.
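As a check, the estimates can be reproduced with a short Python sketch (assuming NumPy; this mirrors the hand computation above rather than a library fit):

```python
import numpy as np

x = np.array([34, 45, 29, 32, 23], dtype=float)             # age
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)   # income (RM)

Sxx = np.sum((x - x.mean())**2)                  # 261.2
Sxy = np.sum((x - x.mean()) * (y - y.mean()))    # 25998.0
b1 = Sxy / Sxx                                   # ~ 99.5329
b0 = y.mean() - b1 * x.mean()                    # ~ -410.7725
print(b0, b1)
```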

Properties of the LSE:
1. The LSEs are linear combinations of the observations $y_i$:

    $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{S_{xx}} = \sum_{i=1}^{n} c_i y_i$,   where $c_i = \dfrac{x_i - \bar{x}}{S_{xx}}$

    $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \sum_{i=1}^{n} \left( \dfrac{1}{n} - c_i \bar{x} \right) y_i$

2. The LSEs are unbiased estimators of the parameters $\beta_0$ and $\beta_1$:

    $E(\hat{\beta}_1) = \beta_1$,   $E(\hat{\beta}_0) = \beta_0$

3. The variances of the LSEs are

    $Var(\hat{\beta}_1) = \dfrac{\sigma^2}{S_{xx}}$,   $Var(\hat{\beta}_0) = Var(\bar{y} - \hat{\beta}_1 \bar{x}) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right)$

Gauss-Markov Theorem (the best linear unbiased estimators): under the conditions of the regression model, the least squares estimators are unbiased and have minimum variance among all unbiased linear estimators.

Summary
1. The SLR model:
    $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,   i = 1, 2, ..., n
2. The LS estimators of $\beta_0$ and $\beta_1$:
    $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
3. The fitted SLR model:
    $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Simple linear regression: Forecasting using SLR
AS202 Applied Statistical Models

Suppose you constructed the SLR model using n pairs of sample data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and the range of x is [a, b].

Two types of forecasting:
Extrapolation: predict the value of y using an x outside the range [a, b].
Interpolation: predict the value of y using an x inside the range [a, b].

Re-visit the example: the data table is the same as before, with $\sum x_i = 163$, $\sum y_i = 14170$, $\sum x_i y_i = 487940$, $\sum x_i^2 = 5575$.

Age range used: [23, 45]

At age 50 (x = 50), your predicted monthly income is

    $\hat{y} = -410.7725 + 99.5329(50) = 4565.87$

i.e. RM4565.87. This is extrapolation!

At age 30 (x = 30), your predicted monthly income is

    $\hat{y} = -410.7725 + 99.5329(30) = 2575.21$

i.e. RM2575.21. This is interpolation!
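A small sketch of both predictions (coefficients hard-coded from the fitted model above):

```python
b0, b1 = -410.7725, 99.5329  # fitted coefficients from the example

def predict(x0):
    """Point prediction from the fitted SLR model."""
    return b0 + b1 * x0

print(predict(50))  # ~ 4565.87: extrapolation, since 50 lies outside [23, 45]
print(predict(30))  # ~ 2575.21: interpolation, since 30 lies inside [23, 45]
```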

Note: the fitted equation is intended as an interpolation model over the range of the regressor variable. We must be careful if we extrapolate outside of this range.

Properties of the Fitted Regression Model:

1. The difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$ is a residual:

    $e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$

and the sum of the residuals is always zero:

    $\sum_{i=1}^{n} e_i = 0$

2. The LS regression line always passes through the centroid $(\bar{x}, \bar{y})$ of the data.

3. The sum of the observed values $y_i$ equals the sum of the fitted values $\hat{y}_i$:

    $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i$

4. The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero:

    $\sum_{i=1}^{n} x_i e_i = 0$

5. The sum of the residuals weighted by the corresponding fitted value always equals zero:

    $\sum_{i=1}^{n} \hat{y}_i e_i = 0$

Simple linear regression: Interval estimation
AS202 Applied Statistical Models

Estimation of the variance $\sigma^2$:
Method 1: based on several observations (replication) on y for at least one value of x.
Method 2: when prior information concerning $\sigma^2$ is available.
Method 3: estimate based on the residual (error) sum of squares:

    $SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    $\hat{\sigma}^2 = MS_{Res} = \dfrac{SS_{Res}}{n - 2}$

This unbiased estimator of $\sigma^2$ is called the residual mean square, and its square root is called the standard error of regression.

For the example, $\hat{y}_i = -410.77 + 99.53\,x_i$, so

    $SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 66063.02$

    $MS_{Res} = \dfrac{SS_{Res}}{n - 2} = \dfrac{66063.02}{5 - 2} = 22021.01$
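The same numbers in a Python sketch (assuming NumPy; coefficients taken from the fitted model):

```python
import numpy as np

x = np.array([34, 45, 29, 32, 23], dtype=float)
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)

y_hat = -410.7725 + 99.5329 * x      # fitted values
e = y - y_hat                        # residuals
SS_res = np.sum(e**2)                # ~ 66063.02
MS_res = SS_res / (len(x) - 2)       # ~ 22021.01, unbiased estimate of sigma^2
print(SS_res, MS_res)
```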

Interval Estimation in Simple Linear Regression:

If the errors are NID, then the 100(1 − α)% confidence intervals for $\beta_1$, $\beta_0$ and $\sigma^2$ are

    $\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \leq \beta_1 \leq \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1)$

    $\hat{\beta}_0 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0) \leq \beta_0 \leq \hat{\beta}_0 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0)$

    $\dfrac{(n-2)\, MS_{Res}}{\chi^2_{\alpha/2,\,n-2}} \leq \sigma^2 \leq \dfrac{(n-2)\, MS_{Res}}{\chi^2_{1-\alpha/2,\,n-2}}$
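A sketch of these three intervals at the 95% level for the example data (assuming SciPy for the t and chi-square quantiles):

```python
import numpy as np
from scipy import stats

n, xbar, Sxx = 5, 32.6, 261.2
b0, b1, MS_res = -410.7725, 99.5329, 22021.01
t = stats.t.ppf(0.975, n - 2)                      # t_{0.025,3} ~ 3.182

se_b1 = np.sqrt(MS_res / Sxx)                      # ~ 9.18
se_b0 = np.sqrt(MS_res * (1/n + xbar**2 / Sxx))    # ~ 306.60
print(b1 - t*se_b1, b1 + t*se_b1)                  # ~ (70.31, 128.75)
print(b0 - t*se_b0, b0 + t*se_b0)                  # ~ (-1386.5, 564.96)

# chi-square interval for sigma^2
lo = (n - 2) * MS_res / stats.chi2.ppf(0.975, n - 2)
hi = (n - 2) * MS_res / stats.chi2.ppf(0.025, n - 2)
print(lo, hi)
```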

Interval Estimation of the Mean Response:

Let $x_0$ be any value of the regressor variable within the range of the original data on x used to fit the model. Then the mean response $E(y \mid x_0)$ can be estimated by

    $\widehat{E(y \mid x_0)} = \hat{\mu}_{y \mid x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$

where

    $Var(\hat{\mu}_{y \mid x_0}) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)$

Then a 100(1 − α)% confidence interval on the mean response at the point $x = x_0$ is

    $\hat{\mu}_{y \mid x_0} - t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)} \;\leq\; E(y \mid x_0) \;\leq\; \hat{\mu}_{y \mid x_0} + t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)}$
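For the example, a sketch of the 95% confidence interval for the mean income at age 30 (assuming SciPy):

```python
import numpy as np
from scipy import stats

n, xbar, Sxx, MS_res = 5, 32.6, 261.2, 22021.01
b0, b1 = -410.7725, 99.5329

x0 = 30                                            # inside the data range [23, 45]
mu_hat = b0 + b1 * x0                              # ~ 2575.21
half = stats.t.ppf(0.975, n - 2) * np.sqrt(MS_res * (1/n + (x0 - xbar)**2 / Sxx))
print(mu_hat - half, mu_hat + half)                # 95% CI for E(y | x0 = 30)
```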

Simple linear regression: Prediction interval
AS202 Applied Statistical Models


Prediction of New Observations:

If $x_0$ is the value of the regressor of interest, the point estimate of the new value of the response $y_0$ is

    $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$

Note that the random variable

    $y_0 - \hat{y}_0 \sim N\!\left( 0,\; \sigma^2 \left( 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right) \right)$

The 100(1 − α)% prediction interval on the future observation at $x_0$ is

    $\hat{y}_0 - t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)} \;\leq\; y_0 \;\leq\; \hat{y}_0 + t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)}$
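The prediction interval differs from the mean-response interval only by the extra "1 +" inside the square root, which accounts for the variability of a single new observation. A sketch at x0 = 50 (assuming SciPy; note this point is also an extrapolation):

```python
import numpy as np
from scipy import stats

n, xbar, Sxx, MS_res = 5, 32.6, 261.2, 22021.01
b0, b1 = -410.7725, 99.5329

x0 = 50
y0_hat = b0 + b1 * x0                              # ~ 4565.87
half = stats.t.ppf(0.975, n - 2) * np.sqrt(MS_res * (1 + 1/n + (x0 - xbar)**2 / Sxx))
print(y0_hat - half, y0_hat + half)                # 95% prediction interval for y0
```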

Simple linear regression: Hypothesis testing
AS202 Applied Statistical Models

Hypothesis Testing on the Parameters:

Assumption: the errors $\varepsilon_i$ are normally distributed, i.e. $\varepsilon_i \sim NID(0, \sigma^2)$. This implies that $y_i \sim NID(\beta_0 + \beta_1 x_i, \sigma^2)$, and hence

    $\hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \sim N\!\left( \beta_1,\; \dfrac{\sigma^2}{S_{xx}} \right)$

To test the hypothesis that the slope equals a constant, we have

    $H_0: \beta_1 = \beta_{10}$
    $H_1: \beta_1 \neq \beta_{10}$

and the test statistic is

    $Z_0 = \dfrac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1)$

Typically $\sigma^2$ is unknown and the unbiased estimator $MS_{Res}$ is used. Then the test statistic becomes

    $t_0 = \dfrac{\hat{\beta}_1 - \beta_{10}}{\sqrt{MS_{Res} / S_{xx}}} \sim t_{n-2}$

We reject the null hypothesis if $|t_0| > t_{\alpha/2,\,n-2}$.
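For the example data, a sketch of the significance test of the slope (with hypothesized value $\beta_{10} = 0$; assuming SciPy):

```python
import numpy as np
from scipy import stats

n, Sxx, MS_res = 5, 261.2, 22021.01
b1, b10 = 99.5329, 0.0                    # estimate and hypothesized slope

t0 = (b1 - b10) / np.sqrt(MS_res / Sxx)   # ~ 10.84
p = 2 * stats.t.sf(abs(t0), n - 2)        # two-sided p-value, ~ 0.0017
print(t0, p)
```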


Here

    $se(\hat{\beta}_1) = \sqrt{\dfrac{MS_{Res}}{S_{xx}}}$

is called the (estimated) standard error of the slope.

To test the hypothesis that the intercept equals a constant, we have

    $H_0: \beta_0 = \beta_{00}$
    $H_1: \beta_0 \neq \beta_{00}$

and the test statistic is

    $t_0 = \dfrac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MS_{Res} \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right)}} = \dfrac{\hat{\beta}_0 - \beta_{00}}{se(\hat{\beta}_0)} \sim t_{n-2}$

We reject the null hypothesis if $|t_0| > t_{\alpha/2,\,n-2}$.

Special Case: Testing Significance of Regression

    $H_0: \beta_1 = 0$
    $H_1: \beta_1 \neq 0$

Testing: (i) t-statistic, or (ii) analysis of variance (ANOVA).

1) Failing to reject the null hypothesis suggests there is no linear relationship between x and y: either (i) x is of little value in explaining the variation of y, or (ii) the true relationship between x and y is not linear.

2) Rejecting the null hypothesis suggests that x is of value in explaining the variability of y: either (i) the straight-line model is adequate, or (ii) better results could be obtained with the addition of higher-order polynomial terms in x.

Measures of Variation:
Total variation is made up of two parts: SST = SSR + SSE.

    $SST = \sum (Y_i - \bar{Y})^2$  (Total Sum of Squares)
    $SSR = \sum (\hat{Y}_i - \bar{Y})^2$  (Regression Sum of Squares)
    $SSE = \sum (Y_i - \hat{Y}_i)^2$  (Error Sum of Squares)

where:
    $\bar{Y}$ = mean value of the dependent variable
    $Y_i$ = observed value of the dependent variable
    $\hat{Y}_i$ = predicted value of Y for the given $X_i$ value

SST = total sum of squares (total variation): measures the variation of the $Y_i$ values around their mean $\bar{Y}$.
SSR = regression sum of squares (explained variation): variation attributable to the relationship between X and Y.
SSE = error sum of squares (unexplained variation): variation in Y attributable to factors other than X.

[Figure: for a single observation $(X_i, Y_i)$, the vertical deviations from the fitted line and from $\bar{Y}$ illustrate the decomposition: $SSE = \sum (Y_i - \hat{Y}_i)^2$, $SSR = \sum (\hat{Y}_i - \bar{Y})^2$, $SST = \sum (Y_i - \bar{Y})^2$.]

F Test for Significance:

    $F_{STAT} = \dfrac{MSR}{MSE}$,   where   $MSR = \dfrac{SSR}{k}$,   $MSE = \dfrac{SSE}{n - k - 1}$

$F_{STAT}$ follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom (k = the number of independent variables in the regression model; k = 1 for SLR).
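A sketch of the ANOVA decomposition and F statistic for the example (k = 1; assuming NumPy):

```python
import numpy as np

x = np.array([34, 45, 29, 32, 23], dtype=float)
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)
y_hat = -410.7725 + 99.5329 * x            # fitted values

n, k = len(y), 1
SST = np.sum((y - y.mean())**2)            # ~ 2653720
SSR = np.sum((y_hat - y.mean())**2)        # ~ 2587657
SSE = np.sum((y - y_hat)**2)               # ~ 66063
F = (SSR / k) / (SSE / (n - k - 1))        # ~ 117.5
print(SST, SSR, SSE, F)
```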

Simple linear regression: Linear association between X & Y
AS202 Applied Statistical Models

Coefficient of Determination:

    $R^2 = \dfrac{SS_R}{SS_T} = 1 - \dfrac{SS_{Res}}{SS_T}$

where $SS_{Res}$ is the residual (error) sum of squares, $SS_R$ is the regression (model) sum of squares, and $SS_T$ is a measure of the variability in y without considering the effect of the regressor variable x:

    $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$,   $SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$,   $SS_T = SS_R + SS_{Res}$

$R^2$ measures the proportion of variation explained by the regressor x.

SUMMARY OUTPUT (Excel)

Regression Statistics
Multiple R          0.98747
R Square            0.97511
Adjusted R Square   0.96681
Standard Error      148.395
Observations        5

ANOVA
            df   SS          MS           F        Significance F
Regression  1    2587656.98  2587656.98   117.509  0.00167965
Residual    3    66063.02    22021.0056
Total       4    2653720

            Coefficients  Standard Error  t Stat    P-value  Lower 95%   Upper 95%
Intercept   -410.77       306.5981        -1.3398   0.2727   -1386.5053  564.959
Age         99.5329       9.182           10.8401   0.0017   70.3121     128.754

[The Age row's coefficient, standard error, and upper limit are recovered from $\hat{\beta}_1 = 99.5329$ and $se(\hat{\beta}_1) = \sqrt{MS_{Res}/S_{xx}} \approx 9.182$.]
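Most of this output can be reproduced with scipy.stats.linregress (a sketch; the remaining columns follow from the formulas in the earlier sections):

```python
import numpy as np
from scipy import stats

x = np.array([34, 45, 29, 32, 23], dtype=float)
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)

res = stats.linregress(x, y)
print(res.intercept, res.slope)    # ~ -410.77, 99.5329 (Coefficients)
print(res.rvalue**2)               # ~ 0.97511 (R Square)
print(res.stderr)                  # ~ 9.18 (standard error of the slope)
print(res.pvalue)                  # ~ 0.0017 (slope P-value = Significance F here)
```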

Simple linear regression: Some remarks (optional)
AS202 Applied Statistical Models

Some Considerations in the Use of Regression:

1) The disposition of the x values plays an important role in the least squares fit.

2) Outliers can seriously disturb the least squares fit.
3) When a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in any causal sense (cause and effect).
4) In some applications, the value of the regressor variable x required to predict y is unknown. Thus, to predict y, we must first predict x. The accuracy of the prediction of y then depends on the accuracy of the prediction of x.

Parameter Estimation: Maximum Likelihood Estimation (MLE)

If the form of the distribution of the errors is known, e.g. normal, MLE is an alternative way of estimating the parameters. The likelihood function from the joint distribution of the observations is

    $L(y_i, x_i;\, \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \dfrac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\dfrac{1}{2\sigma^2} (y_i - \beta_0 - \beta_1 x_i)^2 \right)$

The maximum likelihood estimators can be obtained by solving

    $\left. \dfrac{\partial \ln L}{\partial \beta_0} \right|_{\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2} = 0, \quad \left. \dfrac{\partial \ln L}{\partial \beta_1} \right|_{\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2} = 0, \quad \left. \dfrac{\partial \ln L}{\partial \sigma^2} \right|_{\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2} = 0$
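A numerical sketch of the MLE for the example data (assuming SciPy's optimizer; for normal errors the MLEs of $\beta_0$ and $\beta_1$ coincide with the LSEs, while the MLE of $\sigma^2$ is $SS_{Res}/n$ rather than $SS_{Res}/(n-2)$):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([34, 45, 29, 32, 23], dtype=float)
y = np.array([2950, 4000, 2430, 3000, 1790], dtype=float)

def neg_log_lik(theta):
    b0, b1, log_s2 = theta                 # parametrize log(sigma^2) to keep it positive
    s2 = np.exp(log_s2)
    r = y - b0 - b1 * x
    return 0.5 * len(y) * np.log(2 * np.pi * s2) + np.sum(r**2) / (2 * s2)

fit = minimize(neg_log_lik, x0=[0.0, 1.0, 10.0],
               method="Nelder-Mead", options={"maxiter": 5000, "fatol": 1e-10})
b0_hat, b1_hat = fit.x[0], fit.x[1]        # ~ -410.77, 99.53 (same as the LSEs)
s2_hat = np.exp(fit.x[2])                  # ~ 66063.02 / 5 = 13212.6
print(b0_hat, b1_hat, s2_hat)
```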
