This week
Simple linear regression:
- Idea;
- Estimating using LSE (& BLUE estimator & relation to MLE);
- Partition of the variability of the variable;
- Testing:
  i) slope;
  ii) intercept;
  iii) regression line;
  iv) correlation coefficient.
3103/3173
Basic idea
Suppose we observe data $y = [y_1, \ldots, y_n]^\top$;
Assume that $Y$ is affected by $X$ with $x = [x_1, \ldots, x_n]^\top$;
What can we say about the relationship between $X$ and $Y$?
To do so we fit
$$y = \beta_0 + \beta_1 x + \varepsilon$$
to the data $(x_i, y_i)$ for $i = 1, 2, \ldots, n$.
Basic idea
Regression, with $\mathbb{E}[\varepsilon_i] = 0$:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$$
We determine $\beta_0$ and $\beta_1$ by minimizing
$$S(\beta_0, \beta_1) = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2.$$
Setting the partial derivatives to zero gives the normal equations, e.g.
$$\frac{\partial S(\beta_0, \beta_1)}{\partial \beta_1} = -2 \sum_{i=1}^n x_i \left(y_i - (\beta_0 + \beta_1 x_i)\right) = 0
\quad\Rightarrow\quad
\sum_{i=1}^n x_i y_i = b_0 \sum_{i=1}^n x_i + b_1 \sum_{i=1}^n x_i^2.$$
Next step: express $b_0$ and $b_1$ as functions of $\sum_{i=1}^n y_i$, $\sum_{i=1}^n x_i$, $\sum_{i=1}^n x_i y_i$, and $\sum_{i=1}^n x_i^2$:
$$b_1 = \frac{\sum_{i=1}^n x_i y_i - n\,\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\,\bar{x}^2},
\qquad b_0 = \bar{y} - b_1 \bar{x}.$$
3106/3173
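The closed-form solution above can be checked numerically. A minimal sketch, with illustrative data (not from the lecture):

```python
# Closed-form least-squares fit from the derivation above:
# b1 = (sum x_i y_i - n xbar ybar) / (sum x_i^2 - n xbar^2), b0 = ybar - b1 xbar.
def ls_fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum(a * b for a, b in zip(x, y)) - n * xbar * ybar) / \
         (sum(a * a for a in x) - n * xbar ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

# Illustrative data with an exact linear relation y = 2 + 3x,
# so the fit recovers b0 = 2 and b1 = 3.
b0, b1 = ls_fit([1.0, 2.0, 3.0, 4.0, 5.0],
                [5.0, 8.0, 11.0, 14.0, 17.0])
print(b0, b1)  # -> 2.0 3.0 (up to floating point)
```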
Correlation Coefficient
Regression: find the dependency of $Y$ and $X$, i.e., they have a joint distribution.
$\beta_1$ gives the marginal effect of a change in $X$; $\rho_{XY}$ measures the strength of the dependence:
$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sqrt{\mathbb{E}\left[(X - \mu_X)^2\right]\,\mathbb{E}\left[(Y - \mu_Y)^2\right]}}.$$
Correlation Coefficient
We know that the correlation coefficient has the following interpretations:
- A correlation of $-1$ indicates a perfect negative linear relationship;
- A correlation of $+1$ indicates a perfect positive linear relationship;
- A correlation of $0$ implies no linear relationship;
- The larger the correlation in absolute value, the stronger the (positive/negative) linear relationship.
3108/3173
Correlation Coefficient
Correlations are indications of linear relationships - it is possible that two variables have zero correlation, but are strongly dependent (non-linear).
The correlation coefficient is a population parameter that can be estimated from data.
Suppose we have $n$ pairs of observations denoted by:
$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).$$
Estimate the population correlation $\rho_{XY}$ using (week 3):
$$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}
\qquad\text{and}\qquad
s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2}.$$
3109/3173
Correlation Coefficient
Similarly, the sample covariance is given by:
$$s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$$
The sample correlation coefficient is then:
$$r = \frac{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}
= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}
= \frac{\sum_{i=1}^n x_i y_i - n\,\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^n y_i^2 - n\bar{y}^2\right)}}.$$
3110/3173
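The sample correlation can be sketched directly from these formulas; the data below are illustrative, not from the lecture:

```python
import math

# Sample correlation r = s_XY / (s_x * s_y), following the slide's formulas.
def sample_corr(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

r = sample_corr([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r)  # -> 0.8 (up to floating point)
```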
Effect of correlation
[Figure (3111/3173): scatter plots illustrating the effect of the correlation, for ρ = 0, ρ = 0.8, and ρ = 0.3; x on the horizontal axis.]
Effect of variance
[Figure (3112/3173): scatter plots for (σx = 4, σy = 1), (σx = 1, σy = 1), and (σx = 1, σy = 4); x on the horizontal axis.]
Effect of mean
[Figure (3113/3173): scatter plots for (μx = 3, μy = 0), (μx = 0, μy = 0), and (μx = 0, μy = 3); x on the horizontal axis.]
[Figure (3115/3173): comparison of a linear and a quadratic relationship; x on the horizontal axis.]
The residuals are $\hat{\varepsilon}_i = y_i - (b_0 + b_1 x_i)$ for $i = 1, \ldots, n$, and the variance is estimated by:
$$s^2 = \frac{\sum_{i=1}^n \hat{\varepsilon}_i^2}{n-2}.$$
Proof: we use that $b_0$ and $b_1$ are unbiased and $\mathbb{E}[\varepsilon_i] = 0$:
$$y_i = b_0 + b_1 x_i + \hat{\varepsilon}_i
\quad\Rightarrow\quad
\mathbb{E}[\hat{y}_i] = \mathbb{E}[b_0] + \mathbb{E}[b_1]\, x_i.$$
3118/3173
[Figure (3119/3173): scatter plots for (σx = 1, σy = 1) and (σx = 4, σy = 4); x on the horizontal axis.]
[Figure (3120/3173): scatter plots for (σx = 1, σy = 1) and (σx = 4, σy = 4); x on the horizontal axis.]
[Figure (3121/3173): scatter plots for (μx = 1, μy = 0) and (μx = 3, μy = 0); x on the horizontal axis.]
Relation to MLE: under normal errors the likelihood is
$$L\left(y; \beta_0, \beta_1, \sigma^2\right) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2\right),$$
with log-likelihood
$$\ell\left(y; \beta_0, \beta_1, \sigma^2\right) = -n \log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2.$$
3122/3173
Maximizing over $\beta_0, \beta_1$ gives the least-squares estimates; the MLE of the variance is:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left(y_i - (b_0 + b_1 x_i)\right)^2 = \frac{n-2}{n}\, s^2.$$
BLUE estimator
A point estimator $\hat{\theta}(\cdot)$ is called linear if $\hat{\theta}(\cdot) = Ax + b$.
A point estimator $\hat{\theta}(\cdot)$ is called a Best Linear Unbiased Estimator (BLUE) if it is:
- unbiased: $\mathbb{E}[\hat{\theta}(\cdot)] = \theta(\cdot)$;
- of minimum variance: $\mathrm{Var}(\hat{\theta}(\cdot)) \le \mathrm{Var}(\theta^*(\cdot))$ for any linear unbiased estimator $\theta^*$.
One can show that the LS-estimator $b_0 + X b_1$ is BLUE for $\mu = \beta_0 + X\beta_1$ under the weak assumptions (proof not required in this course);
One can show that the LS-estimator $b_0 + X b_1$ is UMVUE for $\mu = \beta_0 + X\beta_1$ under the strong assumptions (proof not required in this course).
3124/3173
We decompose:
$$\underbrace{(y_i - \bar{y})}_{\text{total deviation}} = \underbrace{(y_i - \hat{y}_i)}_{\text{unexplained deviation}} + \underbrace{(\hat{y}_i - \bar{y})}_{\text{explained deviation}}.$$
We then obtain:
$$\underbrace{\sum_{i=1}^n (y_i - \bar{y})^2}_{\text{SST}} = \underbrace{\sum_{i=1}^n (y_i - \hat{y}_i)^2}_{\text{SSE}} + \underbrace{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}_{\text{SSM}},$$
where
- SSE: sum of squares error (sometimes called residual);
- SSM: sum of squares model (sometimes called regression).
Sum of squares

| Source | Sum of squares | Degrees of freedom | Mean square | F |
|--------|----------------|--------------------|-------------|---|
| Model | SSM = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² | DFM = 1 | MSM = SSM/DFM | MSM/MSE |
| Error | SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² | DFE = n − 2 | MSE = SSE/DFE | |
| Total | SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)² | DFT = n − 1 | MST = SST/DFT | |

3127/3173
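The partition SST = SSE + SSM holds for any least-squares fit; a small numeric check, with illustrative data:

```python
# Compute the three sums of squares for a least-squares fit and
# verify the partition SST = SSE + SSM.
def anova(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
         sum((a - xbar) ** 2 for a in x)
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * a for a in x]
    sst = sum((b - ybar) ** 2 for b in y)
    sse = sum((b - h) ** 2 for b, h in zip(y, yhat))
    ssm = sum((h - ybar) ** 2 for h in yhat)
    return sst, sse, ssm

sst, sse, ssm = anova([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(abs(sst - (sse + ssm)) < 1e-9)  # -> True
```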
Coefficient of Determination
Notice that the square of the correlation coefficient occurs in the denominator of the t statistic used to test hypotheses concerning the population correlation coefficient. The statistic $R^2$ is called the coefficient of determination and provides useful information.
Noting (proof: slide 3173, notation: slide 3163):
$$\text{SSE} = \underbrace{S_{yy}}_{=\text{SST}} - \underbrace{b_1 S_{xy}}_{=\text{SSM}},$$
so that $R^2 = \text{SSM}/\text{SST} = 1 - \text{SSE}/\text{SST}$.
Exercise
A car insurance company is interested in how large the adverse selection effect is in its sample, i.e., how large the difference in claim size relative to the premium is for different groups.
The insurance premium depends on the coverage (Gold Comprehensive Car Insurance, Standard Comprehensive Car Insurance and Third Party Property Car Insurance) and the price of the insured vehicle (five categories).
a. Explain why there might be differences in the claim sizes for the different groups.
Solution: high coverage → reckless behavior (example: airbags). Expensive car → more wealthy drivers → better drivers? Other explanations are also possible.
3129/3173
Exercise
Each of the 15 categories has a different premium and number of contracts for the insurance contract.
The insurance company has the total claim sizes in the groups.
b. Give the linear regression model.
Solution: Let:
- $y_i$ be the average claim size for group $i = 1, \ldots, 15$;
- $x_i$ be the average MVI premium for group $i = 1, \ldots, 15$.
Then the regression is:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$$
3130/3173
Exercise
c. Are the weak assumptions and strong assumptions reasonable in this regression model?
Solution: Weak assumptions:
- Residual has a mean of zero: yes, the mean is captured in $\beta_0$ and $\beta_1$. Note: assumed linear relation!
- Variance independent of the explanatory variable: debatable (increasing?), have to check using the data.
- Residuals are independent: yes.
3131/3173
Exercise data
[Figure (3132/3173): average claim size (y) versus Premium (x) for the 15 groups; both axes range roughly from 100 to 1000.]
Exercise
The observed values for the 15 groups are:

| i | xi  | yi  | i  | xi  | yi  |
|---|-----|-----|----|-----|-----|
| 1 | 210 | 189 | 9  | 380 | 323 |
| 2 | 230 | 267 | 10 | 410 | 313 |
| 3 | 235 | 234 | 11 | 460 | 456 |
| 4 | 250 | 142 | 12 | 540 | 528 |
| 5 | 260 | 302 | 13 | 720 | 768 |
| 6 | 280 | 149 | 14 | 880 | 963 |
| 7 | 320 | 308 | 15 | 910 | 954 |
| 8 | 360 | 392 |    |     |     |

Summary statistics: $\sum_{i=1}^{15} x_i = 6445$, $\sum_{i=1}^{15} y_i = 6288$, $\sum_{i=1}^{15} x_i^2 = 3{,}529{,}325$, $\sum_{i=1}^{15} y_i^2 = 3{,}660{,}190$, and $\sum_{i=1}^{15} x_i y_i = 3{,}566{,}000$.
d. Find the LS estimates of the regression model.
3133/3173
Solution:
$$b_1 = \frac{\sum_{i=1}^{15} x_i y_i - n\,\bar{x}\bar{y}}{\sum_{i=1}^{15} x_i^2 - n\,\bar{x}^2}
= \frac{3{,}566{,}000 - \frac{6445 \cdot 6288}{15}}{3{,}529{,}325 - \frac{6445^2}{15}} = 1.137.$$
$$b_0 = \bar{y} - b_1 \bar{x} = \frac{6288}{15} - 1.137 \cdot \frac{6445}{15} = -69.329.$$
$$\hat{\sigma}^2 = \frac{1}{n-2}\left(\sum_{i=1}^{15} y_i^2 - n\bar{y}^2 - \frac{\left(\sum_{i=1}^{15} x_i y_i - n\bar{x}\bar{y}\right)^2}{\sum_{i=1}^{15} x_i^2 - n\bar{x}^2}\right)
= \frac{1}{13}\left(3{,}660{,}190 - \frac{6288^2}{15} - \frac{\left(3{,}566{,}000 - \frac{6445 \cdot 6288}{15}\right)^2}{3{,}529{,}325 - \frac{6445^2}{15}}\right) = 3200,$$
so $\hat{\sigma} = \sqrt{3200} = 56.57$.
3134/3173
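The estimates can be reproduced from the summary statistics on slide 3133:

```python
# Reproducing exercise d from the summary statistics on slide 3133.
n = 15
sum_x, sum_y = 6445, 6288
sum_x2, sum_y2 = 3_529_325, 3_660_190
sum_xy = 3_566_000

xbar, ybar = sum_x / n, sum_y / n
b1 = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar ** 2)
b0 = ybar - b1 * xbar
s2 = (sum_y2 - n * ybar ** 2 - b1 ** 2 * (sum_x2 - n * xbar ** 2)) / (n - 2)
print(round(b1, 3), round(b0, 2), round(s2))  # -> 1.137 -69.33 3200
```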
Exercise
e. Find the correlation coefficient. Relate the sign of the correlation coefficient to the estimates.
Solution:
$$r = \frac{\sum_{i=1}^n x_i y_i - n\,\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^n y_i^2 - n\bar{y}^2\right)}}
= \frac{3{,}566{,}000 - \frac{6445 \cdot 6288}{15}}{\sqrt{\left(3{,}529{,}325 - \frac{6445^2}{15}\right)\left(3{,}660{,}190 - \frac{6288^2}{15}\right)}} = 0.9795.$$
Positive sample correlation ($r > 0$) $\Leftrightarrow$ $b_1$ is positive.
3135/3173
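The same summary statistics reproduce the sample correlation:

```python
import math

# Reproducing exercise e from the summary statistics on slide 3133.
n = 15
sum_x, sum_y = 6445, 6288
sum_x2, sum_y2 = 3_529_325, 3_660_190
sum_xy = 3_566_000

num = sum_xy - sum_x * sum_y / n
den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = num / den
print(round(r, 4))  # -> 0.9795
```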
Exercise
f. Partition the variability.
Solution:
$$\text{SST} = \sum_{i=1}^n y_i^2 - n\bar{y}^2 = 3{,}660{,}190 - \frac{6288^2}{15} = 1{,}024{,}260$$
$$\text{SSE} = (n-2)\,\hat{\sigma}^2 = 13 \cdot 3200 \approx 41{,}606$$
$$\text{SSM} = \text{SST} - \text{SSE} = 1{,}024{,}260 - 41{,}606 = 982{,}654.$$
3136/3173
Residual
[Figure (3137/3173): residuals versus Premium (x); residuals range roughly from −120 to 20.]
Distribution of the variance estimator:
$$\frac{(n-2)\,s^2}{\sigma^2} = \sum_{i=1}^n \underbrace{(\hat{\varepsilon}_i/\sigma)^2}_{N(0,1)^2} = \frac{\sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2}{\sigma^2} \sim \chi^2(n-2).$$
3138/3173
Notation: $\tilde{x} = x - \bar{x}$, so $\tilde{x}^\top \tilde{x} = \sum_{i=1}^n (x_i - \bar{x})^2$, and
$\mathrm{Var}(b_1) = \sigma^2 \big/ \sum_{i=1}^n (x_i - \bar{x})^2$ (see slide 3168).
$\sigma$ is usually unknown, and estimated by $s$, so:
$$\frac{b_1 - \beta_1}{s / \sqrt{\tilde{x}^\top \tilde{x}}}
= \underbrace{\frac{b_1 - \beta_1}{\sigma / \sqrt{\tilde{x}^\top \tilde{x}}}}_{N(0,1)}
\Bigg/ \sqrt{\underbrace{\frac{(n-2)s^2}{\sigma^2}}_{\chi^2(n-2)} \Big/ (n-2)}
\;\sim\; t(n-2),$$
with standard error
$$\mathrm{se}\left(b_1\right) = \frac{s}{\sqrt{\tilde{x}^\top \tilde{x}}}.$$
To test $H_0: \beta_1 = \beta_1^0$ against $H_1: \beta_1 > \beta_1^0$, reject if $t(b_1) > t_{1-\alpha,n-2}$;
against $H_1: \beta_1 < \beta_1^0$, reject if $t(b_1) < -t_{1-\alpha,n-2}$.
3141/3173
Note (see slide 3169): $\mathrm{Var}(b_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right)$, so:
$$\frac{b_0 - \beta_0}{s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\tilde{x}^\top \tilde{x}}}}
= \underbrace{\frac{b_0 - \beta_0}{\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\tilde{x}^\top \tilde{x}}}}}_{N(0,1)}
\Bigg/ \sqrt{\underbrace{\frac{(n-2)s^2}{\sigma^2}}_{\chi^2(n-2)} \Big/ (n-2)}
\;\sim\; t(n-2).$$
3143/3173
For $\hat{y}_0 = b_0 + b_1 x_0$ we have*:
$$\mathrm{Var}\left(\hat{y}_0\right) = \mathrm{Var}(b_0) + x_0^2\,\mathrm{Var}(b_1) + 2x_0\,\mathrm{Cov}(b_0, b_1)
= \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{(n-1)s_x^2} + \frac{x_0^2 \sigma^2}{(n-1)s_x^2} - \frac{2x_0\bar{x}\sigma^2}{(n-1)s_x^2}$$
$$= \left(\frac{1}{n} + \frac{\bar{x}^2 - 2x_0\bar{x} + x_0^2}{(n-1)s_x^2}\right)\sigma^2
= \left(\frac{1}{n} + \frac{(\bar{x} - x_0)^2}{(n-1)s_x^2}\right)\sigma^2.$$
* see slide 3168 for $\mathrm{Var}(b_1)$, slide 3169 for $\mathrm{Var}(b_0)$ and slide 3170 for $\mathrm{Cov}(b_0, b_1)$.
3145/3173
and we have:
$$\frac{\hat{y}_0 - (\beta_0 + \beta_1 x_0)}{s\sqrt{\frac{1}{n} + \frac{(\bar{x} - x_0)^2}{(n-1)s_x^2}}} \;\sim\; t(n-2).$$
This pivot can therefore be used to construct the $100(1-\alpha)\%$ confidence interval for $y_0$:
$$b_0 + b_1 x_0 \;\pm\; t_{1-\alpha/2,n-2}\; s\sqrt{\frac{1}{n} + \frac{(\bar{x} - x_0)^2}{(n-1)s_x^2}}.$$
3146/3173
Example
[Figure (3147/3173): example regression fits for (σx = 4, σy = 1), (σx = 1, σy = 1), and (σx = 4, σy = 4); x on the horizontal axis.]
Prediction Intervals
For a new observation at $x_i$ the regression line gives $\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_i$, with $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$. It then follows that:
$$\left(Y_i - \hat{y}_i \mid X = x, X_i = x_i\right) \sim N\left(0,\; \sigma^2\left(1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right)\right),$$
so that:
$$T = \frac{Y_i - \hat{y}_i}{s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}}} \;\sim\; t_{n-2}.$$
3151/3173
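The prediction-interval pivot can be sketched end to end; the data below are illustrative, and the quantile 3.182 (the 97.5% point of $t$ with 3 degrees of freedom) is hard-coded rather than looked up:

```python
import math

# Prediction interval at a new point x0 under the normal-error model:
# yhat(x0) +/- t * s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx).
def predict_interval(x, y, x0, t_quant):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    s2 = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y)) / (n - 2)
    se = math.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / sxx))
    yhat = b0 + b1 * x0
    return yhat - t_quant * se, yhat + t_quant * se

# Illustrative data; 3.182 is t_{0.975} with n - 2 = 3 degrees of freedom.
lo, hi = predict_interval([1, 2, 3, 4, 5], [1.2, 1.9, 3.1, 4.2, 4.9],
                          3.5, 3.182)
print(lo, hi)
```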
Example
[Figure (3152/3173): example prediction intervals for (σx = 4, σy = 1), (σx = 1, σy = 1), and (σx = 4, σy = 4); x on the horizontal axis.]
Testing the correlation coefficient:
$$T = \frac{R}{\sqrt{(1 - R^2)/(n-2)}} = \frac{R\sqrt{n-2}}{\sqrt{1 - R^2}},$$
where $R$ is the random variable denoting the correlation coefficient, but with the $x$ and $y$ replaced by $X$ and $Y$.
1. To test $H_0: \rho_{XY} = 0$ against $H_1: \rho_{XY} > 0$:
Reject $H_0$ if the observed $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} > t_{1-\alpha,n-2}$.
2. To test $H_0: \rho_{XY} = 0$ against $H_1: \rho_{XY} < 0$:
Reject $H_0$ if the observed $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} < -t_{1-\alpha,n-2}$.
3. To test $H_0: \rho_{XY} = 0$ against the alternative $H_1: \rho_{XY} \neq 0$:
Reject $H_0$ if the observed $|t| = \left|\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\right| > t_{1-\alpha/2,n-2}$.
To test $H_0: \rho = \rho_0$ against $H_1: \rho \neq \rho_0$, use the Fisher $z$-transformation.
3156/3173
Exercise
Consider the previous exercise on slides 3129-3137 with the data on slide 3133.
i. Test whether the correlation coefficient is positive.
Solution: $r = 0.9795$ (see slide 3135).
Method 1: $T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.9795\sqrt{13}}{\sqrt{1-0.9795^2}} = 17.55$. Using F&T page 163: $t_{1-p}(13) = 17.55$ for $p \approx 0$, i.e., the p-value is almost 0, so reject the null hypothesis.
Method 2: $z = \frac{1}{2}\log\frac{1+r}{1-r} = \frac{1}{2}\log\frac{1.9795}{0.0205} = 2.2845$.
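Both methods can be reproduced in a few lines; the values agree with the slide up to rounding of $r$:

```python
import math

# Reproducing exercise i: t statistic and Fisher z-transform for r = 0.9795.
r, n = 0.9795, 15
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # Method 1
z = 0.5 * math.log((1 + r) / (1 - r))              # Method 2 (Fisher z)
print(round(t, 2), round(z, 4))
```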
Exercise
ii. Test whether the slope parameter is larger than one.
Solution: test $H_0: \beta_1 = 1$ against $H_1: \beta_1 > 1$.
Test statistic:
$$T = \frac{b_1 - 1}{s \Big/ \sqrt{\sum_{i=1}^n x_i^2 - n\bar{x}^2}}
= \frac{1.137 - 1}{56.57 \Big/ \sqrt{3{,}529{,}325 - \frac{6445^2}{15}}}
= \frac{0.137}{0.06488809} = 2.11.$$
* using $\hat{\sigma} = 56.57$ and $b_1 = 1.137$, see slide 3134.
3158/3173
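The test statistic follows directly from the summary statistics:

```python
import math

# Reproducing exercise ii: t statistic for H0: beta1 = 1.
b1, s, n = 1.137, 56.57, 15
sum_x, sum_x2 = 6445, 3_529_325
sxx = sum_x2 - sum_x ** 2 / n          # sum x_i^2 - n*xbar^2
t = (b1 - 1) / (s / math.sqrt(sxx))
print(round(t, 2))  # -> 2.11
```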
Exercise
iii. Test whether the intercept parameter is non-negative.
Solution: test $H_0: \beta_0 \geq 0$ against $H_1: \beta_0 < 0$.
Test statistic:
$$T = \frac{b_0 - 0}{s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}}}
= \frac{-69.329}{56.57\sqrt{\frac{1}{15} + \frac{\left(\frac{6445}{15}\right)^2}{3{,}529{,}325 - \frac{6445^2}{15}}}}
= \frac{-69.329}{31.475} = -2.20.$$
* using $\hat{\sigma} = 56.57$ and $b_0 = -69.329$, see slide 3134.
3159/3173
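The standard error and test statistic can likewise be reproduced:

```python
import math

# Reproducing exercise iii: t statistic for the intercept.
b0, s, n = -69.329, 56.57, 15
sum_x, sum_x2 = 6445, 3_529_325
xbar = sum_x / n
sxx = sum_x2 - n * xbar ** 2
se_b0 = s * math.sqrt(1 / n + xbar ** 2 / sxx)
t = b0 / se_b0
print(round(se_b0, 3), round(t, 2))  # -> 31.474 -2.2
```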
Exercise
iv. Calculate the 95% confidence interval of Y given that X = 350.
Solution: We have:
$$\hat{y}\,\big|\,x_0 = 350 = -69.329 + 1.137 \cdot 350 = 328.619, \qquad s = 56.57,$$
and:
$$s\sqrt{1 + \frac{1}{n} + \frac{(\bar{x} - x_0)^2}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}}
= 56.57\sqrt{1 + \frac{1}{15} + \frac{\left(\frac{6445}{15} - 350\right)^2}{3{,}529{,}325 - \frac{6445^2}{15}}}
= 60.449.$$
Parameter estimates I
From the normal equations:
$$b_0 = \frac{\sum_{i=1}^n y_i - b_1 \sum_{i=1}^n x_i}{n},$$
and substituting this into $\sum_{i=1}^n x_i y_i = b_0 \sum_{i=1}^n x_i + b_1 \sum_{i=1}^n x_i^2$ gives:
$$b_0 = \frac{\sum_{i=1}^n y_i \sum_{i=1}^n x_i^2 - \sum_{i=1}^n x_i y_i \sum_{i=1}^n x_i}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}.$$
3161/3173
Similarly*:
$$b_1 = \frac{n\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}.$$
*: $(1 - a/b)\,c = d/b \;\Leftrightarrow\; \left((b - a)/b\right)c = d/b \;\Leftrightarrow\; c = d/(b-a)$.
3162/3173
Equivalent forms, with $S_x = \sum_{i=1}^n x_i$, $S_y = \sum_{i=1}^n y_i$, $S_{xx} = \sum_{i=1}^n x_i^2$, $S_{xy} = \sum_{i=1}^n x_i y_i$:
$$b_1 = \frac{n\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}
= \frac{nS_{xy} - S_x S_y}{nS_{xx} - S_x^2}
= \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}$$
$$= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
= \frac{s_{xy}}{s_{xx}}
= \frac{s_{xy}}{\sqrt{s_{xx} s_{yy}}} \sqrt{\frac{s_{yy}}{s_{xx}}}
= r\,\frac{s_y}{s_x}.$$
3164/3173
Parameter estimates II
Thus we have that $b_1$ is the sample correlation coefficient times the quotient of the sample standard deviations of $Y$ and $X$.
We refer to $b_1$ as the slope of the regression line.
3165/3173
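The equivalent forms of $b_1$ can be checked numerically; the data below are illustrative:

```python
import math

# Numeric check that three equivalent forms of b1 agree:
# raw-sums form, centered form, and r * (sy / sx).
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 5.0, 11.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

form1 = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / \
        (n * sum(a * a for a in x) - sum(x) ** 2)
form2 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
        sum((a - xbar) ** 2 for a in x)
r = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / math.sqrt(
    sum((a - xbar) ** 2 for a in x) * sum((b - ybar) ** 2 for b in y))
sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
form3 = r * sy / sx
print(form1, form2, form3)  # all three agree
```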
Since $\sum_{i=1}^n (x_i - \bar{x})\,\bar{y} = 0$, we can write $b_1 = \sum_{i=1}^n (x_i - \bar{x})\,y_i \big/ \sum_{i=1}^n (x_i - \bar{x})^2$.
Then we have that:
$$\mathrm{Var}\left(b_1\right) = \mathrm{Var}\left(\frac{\sum_{i=1}^n (x_i - \bar{x})\,y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)
= \frac{\sum_{i=1}^n (x_i - \bar{x})^2\,\mathrm{Var}(y_i)}{\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)^2}
= \frac{\sigma^2 \sum_{i=1}^n (x_i - \bar{x})^2}{\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)^2}
= \sigma^2 \Big/ \sum_{i=1}^n (x_i - \bar{x})^2.$$
3168/3173
$$\mathrm{Var}\left(b_0\right) = \mathrm{Var}\left(\bar{y} - b_1 \bar{x}\right)
= \sum_{i=1}^n \mathrm{Var}(y_i)/n^2 + \bar{x}^2 \sigma^2 \Big/ \sum_{i=1}^n (x_i - \bar{x})^2$$
$$= \sigma^2\,\frac{n\left(\sum_{i=1}^n (x_i - \bar{x})^2 + n\bar{x}^2\right)}{n\left(n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2\right)}
= \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}.$$
3169/3173
Proof that SST = SSE + SSM (used on slide 3126):
$$\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n \left(y_i^2 + \bar{y}^2 - 2\bar{y}y_i\right)$$
$$\text{SSE} + \text{SSM} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2
= \sum_{i=1}^n \left(y_i^2 + 2\hat{y}_i^2 - 2(\hat{y}_i + \hat{\varepsilon}_i)\hat{y}_i + \bar{y}^2 - 2\bar{y}(y_i - \hat{\varepsilon}_i)\right)$$
$$= \sum_{i=1}^n \left(y_i^2 - 2\hat{y}_i\hat{\varepsilon}_i + \bar{y}^2 - 2\bar{y}y_i + 2\bar{y}\hat{\varepsilon}_i\right)
\stackrel{**}{=} \sum_{i=1}^n \left(y_i^2 + \bar{y}^2 - 2\bar{y}y_i\right) = \text{SST}.$$
** using $\sum_{i=1}^n 2\bar{y}\hat{\varepsilon}_i = 2\bar{y}\sum_{i=1}^n \hat{\varepsilon}_i = 0$ and $\sum_{i=1}^n 2\hat{y}_i\hat{\varepsilon}_i = 2n\mathbb{E}[\hat{y}_i \varepsilon_i] = 2n\mathbb{E}[\hat{y}_i]\,\mathbb{E}[\varepsilon_i] = 0$, using independence.
3172/3173
$$\text{SSM} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2
= \sum_{i=1}^n \left(b_0 + b_1 x_i - \bar{y}\right)^2
= \sum_{i=1}^n \left((\bar{y} - b_1\bar{x}) + b_1 x_i - \bar{y}\right)^2
= b_1^2 \sum_{i=1}^n (x_i - \bar{x})^2
= b_1^2 s_{xx} = b_1 s_{xy},$$
using $b_1 = \frac{s_{xy}}{s_{xx}}$.
3173/3173