
Stat 431 Assignment 1 Winter 2017

Solution Key

Question 1 [10 marks]


An article in Technometrics (1974, Vol. 16, pp. 523–531) considered the following stack-loss data from a
plant oxidizing ammonia to nitric acid. Twenty-one daily responses of stack loss y (the amount of ammonia
escaping) were measured with air flow x1 , temperature x2 , and acid concentration x3 . R code to input the
data is given below.

y = c(42, 37, 37, 28, 18, 18, 19, 20, 15, 14, 14, 13, 11, 12, 8, 7, 8, 8, 9, 15, 15)
x1 = c(80, 80, 75, 62, 62, 62, 62, 62, 58, 58, 58, 58, 58, 58, 50, 50, 50, 50, 50, 56, 70)
x2 = c(27, 27, 25, 24, 22, 23, 24, 24, 23, 18, 18, 17, 18, 19, 18, 18, 19, 19, 20, 20, 20)
x3 = c(89, 88, 90, 87, 87, 87, 93, 93, 87, 80, 89, 88, 82, 93, 89, 86, 72, 79, 80, 82, 91)

(a) [2 marks] Fit a linear regression model relating stack loss to the three regressor
variables. Provide a summary output of your fitted model. Use the model to predict stack loss when
x1 = 60, x2 = 26, and x3 = 85.

m1 = lm(y~x1+x2+x3)
summary(m1)

##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## x1 0.7156 0.1349 5.307 5.8e-05 ***
## x2 1.2953 0.3680 3.520 0.00263 **
## x3 -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09

sum( m1$coeff * c(1,60,26,85) )

## [1] 23.76576

Above is the code and output for the main-effects linear regression model. The predicted stack
loss at the specified levels of x is 23.77.
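The same prediction can also be obtained with R's built-in predict function, which additionally provides interval estimates; this is a sketch assuming the fitted object m1 from above.

```r
# Prediction via predict(); column names in newdata must match the
# regressor names used in the lm() call
new.obs = data.frame(x1 = 60, x2 = 26, x3 = 85)
predict(m1, newdata = new.obs)                          # point prediction, 23.77
predict(m1, newdata = new.obs, interval = "prediction") # 95% prediction interval
```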

(b) [3 marks] Conduct a t-test for the null hypothesis H0 : β3 = 0 at the α = 0.05 level. Show the
calculation of the test statistic and p-value. What conclusion do you draw regarding the relationship
between stack loss and acid concentration?

The test statistic for H0 : β3 = 0 vs HA : β3 ≠ 0 is

t∗ = (β̂3 − 0) / se(β̂3) = −0.1521225 / 0.1562940 = −0.9733098

and the corresponding p-value is

p = P[ |t21−4| > |t∗| ] = P[ |t17| > 0.9733098 ] = 0.3440461

Therefore we do not reject the null hypothesis that stack loss is unrelated to acid concentration.

summary(m1)$coeff[4,] # This is the relevant line from the summary output

## Estimate Std. Error t value Pr(>|t|)
## -0.1521225 0.1562940 -0.9733098 0.3440461

summary(m1)$coeff[4,1]/summary(m1)$coeff[4,2] # Calculation of test statistic

## [1] -0.9733098

2*(1-pt(abs(summary(m1)$coeff[4,1]/summary(m1)$coeff[4,2]),21-4)) # Calculation of p-value

## [1] 0.3440461

(c) [2 marks] Calculate a 90% confidence interval for β2 (show your work) and provide a written interpretation of this regression coefficient.

To calculate a 90% (α = 0.10) confidence interval we use

β̂2 ± t21−4,α/2 · se(β̂2) = 1.2952861 ± 1.7396067 × 0.3680243

For a one degree increase in temperature we expect that the daily stack-loss will increase by
1.2953 units. A 90% confidence interval for this estimate is (0.655, 1.936).

summary(m1)$coeff[3,] # This is the relevant line from the summary output

## Estimate Std. Error t value Pr(>|t|)
## 1.295286124 0.368024265 3.519567177 0.002630054

summary(m1)$coeff[3,1] + c(-1,1)*qt(.95,21-4)*summary(m1)$coeff[3,2] # 90% CI calculation

## [1] 0.6550686 1.9355036

(d) [3 marks] Conduct a residual analysis of the fitted model using various residual plots. What conclusions
do you draw about the overall fit of the model?

Based on the residual analysis I conclude that the fit of this model is fairly good. Recall that for a
well-fitting model we'd expect the standardized residuals to be independent N (0, 1). From the scatterplots
we see that the residuals appear to be approximately centred at zero and all but one of them
are within ±1.96. The scatterplots in x2 and ŷ show a potential quadratic relationship (large
residuals at the extremes, small residuals for moderate values), which may be of concern. Finally,
the quantile plot shows good agreement between the standardized residuals and the standard
normal distribution.

# Scatterplots (with loess curves) of residuals vs explanatory variables & fitted values
# Normal quantile plot

par(mfrow=c(2,3))
scatter.smooth(x1,rstandard(m1),main="Residuals vs X1"); abline(h=c(-1.96,0,1.96), lty=2)
scatter.smooth(x2,rstandard(m1),main="Residuals vs X2"); abline(h=c(-1.96,0,1.96), lty=2)
scatter.smooth(x3,rstandard(m1),main="Residuals vs X3"); abline(h=c(-1.96,0,1.96), lty=2)
qqnorm(rstandard(m1)); abline(0,1)
scatter.smooth(m1$fitted.value,rstandard(m1),main="Residuals vs Fitted Y")
abline(h=c(-1.96,0,1.96), lty=2)

[Figure: standardized residual plots (Residuals vs X1, Residuals vs X2, Residuals vs X3, Normal Q-Q Plot, Residuals vs Fitted Y) with dashed reference lines at 0 and ±1.96.]

Note: Students may have slightly different residual plots if they chose to include interaction terms or remove
terms from their model. At minimum two plots should be considered (preferably a Normal quantile plot
and a scatterplot of residuals versus fitted values). Conclusions drawn must be consistent with the student's
plots. Note that the residuals supplied directly from the fitted model are not standardized, which makes
interpretation of the plots more difficult.
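To illustrate the point about standardization, the raw and standardized residuals can be compared directly; this sketch assumes the fitted object m1 from part (a).

```r
# Raw residuals have unequal variances, Var(e_i) = sigma^2 * (1 - h_ii),
# so rstandard() rescales each one by its estimated standard deviation
raw = residuals(m1)
std = rstandard(m1)
h   = hatvalues(m1)                      # leverages h_ii
cbind(raw, standardized = std, leverage = h)[1:5, ]
# check: std equals raw / (sigma.hat * sqrt(1 - h))
all.equal(std, raw / (summary(m1)$sigma * sqrt(1 - h)))
```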

Question 2 [10 marks]
The angle θ at which electrons are emitted in muon decay has a distribution with the density:
f(x|α) = (1 + αx)/2,   −1 ≤ x ≤ 1,   −1 ≤ α ≤ 1
where x = cos θ.

(a) [3 marks] Find the likelihood, log-likelihood, score, and information functions for a sample of n
independent observations from this distribution.

Likelihood:       L(α) = ∏_{i=1}^{n} (1 + αxi)/2
Log-likelihood:   ℓ(α) = log L(α) = Σ_{i=1}^{n} log(1 + αxi) − n log 2
Score:            S(α) = ∂ℓ/∂α = Σ_{i=1}^{n} xi/(1 + αxi)
Information:      I(α) = −∂²ℓ/∂α² = Σ_{i=1}^{n} xi²/(1 + αxi)²

(b) [2 marks] Use the Newton Raphson algorithm to find the maximum likelihood estimate of α for the
data given below. Note: you must code the algorithm yourself instead of using any built-in optimization
or root finding functions.

x = c(0.164747403, 0.106092128, 0.855715027, 0.221426789, 0.177047372, -0.684621760,
0.194327486, 0.745426807, 0.375342389, -0.176311307, 0.604868366, 0.291522420,
0.145012995, -0.682037664, -0.004203192, 0.998613873, 0.334344244, -0.463665374,
0.255391879, -0.308331904, 0.549739806, 0.143395894, 0.660216568, 0.260438615,
0.365576435, -0.988310236, 0.317882172, -0.710406476, -0.805007831, 0.643207268,
-0.256027985, 0.256180027, 0.325371336, 0.072878236, -0.428863335, 0.184964353,
-0.701840279, 0.729145080, -0.191107998, 0.286108217, -0.309805516, -0.451841456,
-0.463702736, 0.045797852, 0.982804115, -0.957954171, 0.985425250, 0.479191423)

# Score and information functions for this distribution
Score = function(a,x){ sum( x/(1+a*x) ) }
Info = function(a,x){ sum( (x^2)/((1+a*x)^2) ) }

# set up initial alpha estimate and tolerance for convergence
alpha.old = alpha.new = 0
delta = 1
epsilon = 10^{-5}
trace = c(alpha.new,Score(alpha.new,x))

# run Newton-Raphson, save alpha estimates as we go
while( delta>epsilon ){
alpha.new = alpha.old + Score(alpha.old,x)/Info(alpha.old,x)
trace = rbind(trace,c(alpha.new,Score(alpha.new,x)))
delta = abs(alpha.new-alpha.old)
alpha.old = alpha.new
}

alpha.hat = alpha.new
print(trace)

## [,1] [,2]
## trace 0.0000000 4.174163e+00
## 0.3134817 -1.019505e-01
## 0.3066253 -3.409367e-04
## 0.3066022 -3.765120e-09
## 0.3066022 -2.870967e-16

Newton-Raphson is implemented in the code above. The maximum likelihood estimate of α is 0.3066.
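As a sanity check only (the assignment requires hand-coding the algorithm), R's built-in one-dimensional optimizer recovers the same estimate; this sketch assumes the data vector x from above.

```r
# Maximize the log-likelihood directly over the admissible range of alpha
loglik = function(a, x) sum(log(1 + a*x)) - length(x)*log(2)
opt = optimize(loglik, interval = c(-1, 1), x = x, maximum = TRUE)
opt$maximum  # close to the Newton-Raphson estimate 0.3066
```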

(c) [3 marks] Calculate 95% confidence intervals for α using the likelihood ratio, score, and Wald asymptotic
results. Which of the three intervals do you prefer, and why?

# Define the log relative likelihood function
# Find the likelihood ratio based 95% CI {alpha: -2r(alpha) < chisq(0.95)}
lr = function(a,x,a.hat) { sum( log(1+a*x) ) - sum( log(1+a.hat*x)) }
lr.int = function(a,x,a.hat) {-2*lr(a,x,a.hat) - qchisq(0.95,1)}

lr.lower = uniroot(lr.int,interval=c(-1,alpha.hat),x=x,a.hat=alpha.hat)$root
lr.upper = uniroot(lr.int,interval=c(alpha.hat,1),x=x,a.hat=alpha.hat)$root
print(c(lr.lower,lr.upper))

## [1] -0.2225538 0.7541042

# Find the score based 95% CI {alpha: S(alpha)^2/I(alpha) < chisq(0.95)}
sr = function(a,x) {Score(a,x)^2/Info(a,x)}
sr.int = function(a,x) {sr(a,x) - qchisq(0.95,1)}

sr.lower = uniroot(sr.int,interval=c(-.5,alpha.hat),x=x)$root
sr.upper = uniroot(sr.int,interval=c(alpha.hat,1),x=x)$root

## Error in uniroot(sr.int, interval = c(alpha.hat, 1), x = x): f() values at end points not of opposite sign

print(c(sr.lower,1)) # couldn't find upper limit, see plot below

## [1] -0.2554168 1.0000000

# Another root for the score result?
sr.lower2 = uniroot(sr.int,interval=c(-1,-0.5),x=x)$root
print(c(-1,sr.lower2))

## [1] -1.0000000 -0.9803179

# Find the Wald based 95% CI {alpha: (alpha.hat - alpha)^2 * I(alpha.hat) < chisq(0.95)}
wr = function(a,x,a.hat){ (a.hat-a)^2*Info(a.hat,x)}
wr.int = function(a,x,a.hat) {wr(a,x,a.hat) - qchisq(0.95,1)}

wr.lower = uniroot(wr.int,interval=c(-0.99,alpha.hat),x=x,a.hat=alpha.hat)$root
wr.upper = uniroot(wr.int,interval=c(alpha.hat,1.3),x=x,a.hat=alpha.hat)$root
print(c(wr.lower,wr.upper))

## [1] -0.2033615 0.8166075

alpha.hat + c(-1,1)*qnorm(0.975)/sqrt(Info(alpha.hat,x)) #alternative method for wald CI

## [1] -0.2033799 0.8165843

The three 95% confidence intervals are:

Likelihood Ratio: (-0.223, 0.754)
Score: (-0.255, 1)
Wald: (-0.203, 0.817)

In this case I prefer either the likelihood ratio based or the Wald based interval. The score statistic
does not seem particularly well behaved in this case (see plot below). Perhaps we do not have a
large enough sample size to use that asymptotic result. In this case calculation of the MLE
and information is relatively easy, so there is no problem using the Wald result; however, it does
impose a symmetric interval, which we may not prefer.

Note: As seen in the plot below there is another part of the domain of α for which the score
statistic is less than the χ²(1) 0.95 quantile. So one could argue that we should have a composite confidence
interval of (-1, -0.98) ∪ (-0.255, 1). In my mind, given our MLE, I would not include this region
and would instead not rely on a score statistic based interval in this setting.

# Plot the log relative likelihood statistic and score statistic and wald statistic
# over -1<alpha<1 (CI are alphas with curve below chisq(0.95)=3.84)

alphas=seq(-0.99,.99,.01)

par(mfrow=c(1,3))
lr.out = rep(0,length(alphas))
for(i in 1:length(alphas)){lr.out[i]=lr(alphas[i],x,alpha.hat)}
plot(alphas,-2*lr.out,type='l',main="LR Statistic")
abline(h=qchisq(0.95,1), lty=2); abline(v=alpha.hat,lty=3)

sr.out = rep(0,length(alphas))
for(i in 1:length(alphas)){sr.out[i]=sr(alphas[i],x)}
plot(alphas,sr.out,type='l', main="Score Statistic")
abline(h=qchisq(0.95,1), lty=2); abline(v=alpha.hat,lty=3)

wr.out = rep(0,length(alphas))
for(i in 1:length(alphas)){wr.out[i]=wr(alphas[i],x,alpha.hat)}
plot(alphas,wr.out,type='l', main="Wald Statistic")
abline(h=qchisq(0.95,1), lty=2); abline(v=alpha.hat,lty=3)

[Figure: the LR Statistic, Score Statistic, and Wald Statistic plotted against α over (-1, 1), each with a dashed horizontal line at the χ²(1) 0.95 quantile (3.84) and a dotted vertical line at α̂.]


(d) [2 marks] Use each of the likelihood ratio, score, and Wald results to test the null hypothesis that
α = 0.25.

We wish to test H0 : α = 0.25 versus Ha : α ≠ 0.25. The three tests are conducted below using R.
In all three cases we do not reject the null hypothesis that α = 0.25.

# Likelihood ratio test statistic and p-value
c( -2*lr(0.25,x,alpha.hat), 1-pchisq(-2*lr(0.25,x,alpha.hat),1) )

## [1] 0.04653247 0.82921079

# Score test statistic and p-value
c( sr(0.25,x), 1-pchisq(sr(0.25,x),1) )

## [1] 0.04722016 0.82797299

# Wald test statistic and p-value
c( wr(0.25,x,alpha.hat), 1-pchisq(wr(0.25,x,alpha.hat),1) )

## [1] 0.04732089 0.82779249

Question 3 [10 marks]
Suppose that Y is a random variable from the exponential distribution with rate parameter λ > 0 and
probability density function:
f(y; λ) = λe^{−λy}

(a) [2 marks] Show that the distribution of Y is a member of the exponential family by identifying the
canonical parameter, the dispersion parameter, and the functions a(φ), b(θ), c(y; φ).

Recall that a distribution is a member of the exponential family if its pdf/pmf can be written in the form:

f(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y; φ) }

For the exponential distribution we can write

f(y; λ) = λe^{−λy} = exp{−(yλ − log λ)}

Therefore the exponential distribution is a member of the exponential family with

θ = λ,   b(θ) = log λ = log θ,   φ = 1,   a(φ) = −1,   c(y; φ) = 0

Note: The above is not unique. Students could also set θ = −λ and proceed from there.
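The identification can be verified numerically: plugging θ = λ, a(φ) = −1, b(θ) = log θ, and c(y; φ) = 0 into the exponential-family form reproduces dexp. A minimal sketch with arbitrary values:

```r
# f(y; theta, phi) = exp{ [y*theta - b(theta)]/a(phi) + c(y; phi) }
# with theta = lambda, a(phi) = -1, b(theta) = log(theta), c = 0
ef.dens = function(y, lambda) exp((y*lambda - log(lambda)) / (-1))
yv = c(0.5, 1.2, 3.0)  # arbitrary evaluation points
ef.dens(yv, lambda = 2)
dexp(yv, rate = 2)     # identical values
```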

(b) [2 marks] Obtain an expression for the mean and variance of Y and identify the canonical link.

The mean and variance of Y are given by:

E[Y] = b′(θ) = 1/θ = λ⁻¹
Var[Y] = b″(θ)a(φ) = (−1/θ²)(−1) = λ⁻²

To find the canonical link we set g(µ) = θ = η = xᵀβ. Here:

µ = 1/θ, therefore g(µ) = 1/µ

This is the reciprocal or inverse link.
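A quick simulation confirms these moment expressions; the rate λ = 2 and the seed are arbitrary choices.

```r
set.seed(431)    # arbitrary seed for reproducibility
lambda = 2
ysim = rexp(1e6, rate = lambda)
mean(ysim)  # close to 1/lambda = 0.5
var(ysim)   # close to 1/lambda^2 = 0.25
```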

(c) [3 marks] Suppose Yi , i = 1, . . . , n are iid and for each Yi there is a vector of explanatory variables
xi = (1, xi1 , . . . , xi,p−1 )′. Consider the linear predictor ηi = xi′β and the canonical link found in (b).
Find the specific form of the score vector and information matrix for β.

To find the score and information we can either substitute into the likelihood using the parameter
relationship defined by the canonical link, x′β = η = θ = 1/µ = λ, i.e. λ = x′β, or we can use the
general result discussed in class (course notes pages 6-8). Since we have a random sample we can
omit the subscript i and find the contributions from a single observation.

Using direct substitution λ = x′β:

ℓ(λ) = −yλ + log λ
ℓ(β) = −y x′β + log x′β
Sj(β) = ∂ℓ/∂βj = −y xj + xj/(x′β)
Ijk(β) = −∂²ℓ/(∂βj ∂βk) = xj xk/(x′β)²

Using the general result for the exponential family, note that in this case we have:

∂η/∂µ = −1/µ²,   W⁻¹ = (∂η/∂µ)² Var(Y) = (1/µ⁴) µ² = 1/µ²

This implies:

Sj(β) = ∂ℓ/∂βj = (y − µ) · W · (∂η/∂µ) · xj
      = (y − µ) µ² (−1/µ²) xj
      = −(y − 1/(x′β)) xj
      = −y xj + xj/(x′β)

Since we are using the canonical link, the observed and expected information are equal and we have:

Ijk(β) = xj W xk = µ² xj xk = xj xk/(x′β)²

Therefore, using either method, the jth element of the score vector and the (j, k)th element of the
information matrix are given by:

Sj(β) = Σ_{i=1}^{n} [ −yi xij + xij/(xi′β) ],   Ijk(β) = Σ_{i=1}^{n} xij xik/(xi′β)²
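These expressions can be sanity-checked against central differences of ℓ(β) = Σi(−yi xi′β + log xi′β) on a tiny hypothetical dataset; the design matrix, responses, and evaluation point below are made up, chosen so that every xi′β is positive.

```r
# Exponential GLM with canonical (inverse) link: lambda_i = x_i' beta
loglik = function(beta, y, X) sum(-y * (X %*% beta) + log(X %*% beta))
score  = function(beta, y, X) as.vector(t(X) %*% (-y + 1/(X %*% beta)))

Xh = cbind(1, c(0.5, 1.0, 2.0))  # hypothetical n x p design (intercept + one covariate)
yh = c(1.1, 0.4, 0.3)            # hypothetical responses
bh = c(0.2, 0.3); h = 1e-6       # evaluation point keeps all x_i' beta positive
num = c((loglik(bh + c(h,0), yh, Xh) - loglik(bh - c(h,0), yh, Xh)) / (2*h),
        (loglik(bh + c(0,h), yh, Xh) - loglik(bh - c(0,h), yh, Xh)) / (2*h))
cbind(numeric = num, analytic = score(bh, yh, Xh))  # columns agree
```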

(d) [3 marks] The R code below gives data on y the time in years until a first claim for 25 insurance
policies and x a proprietary measure of risk. Use Newton Raphson to estimate β = (β0 , β1 ) from an
exponential generalized linear model with the canonical link. Again, you must code your own Newton
Raphson algorithm rather than relying on any built-in functions in R.

y = c(0.9683, 0.4515, 17.4488, 0.6287, 2.2330, 2.6467, 3.9589, 0.0782, 5.4717, 4.1161,
0.6715, 1.6350, 0.1640, 0.3331, 0.7501, 3.0846, 0.6889, 6.3826, 7.0869, 0.7967,
3.2684, 0.1373, 2.8698, 1.5126, 0.9055)
x = c(0.1036, 2.1824, 0.1745, 2.0089, 1.2317, 0.6166, 0.4675, 3.2074, 0.0277, 1.2962,
0.6812, 0.1946, 1.3291, 0.4381, 0.2984, 0.3018, 0.7928, 0.2021, 1.0280, 0.0121,
1.2043, 2.9322, 1.4526, 0.6444, 0.1849)

# Define the Score vector and Information matrix
Score = function(Beta,Y,X)
{
xTB = t(X)%*%Beta
c( sum(X[1,]/xTB - Y*X[1,]), sum(X[2,]/xTB - Y*X[2,]) )
}
Info = function(Beta,Y,X)
{
xTB = t(X)%*%Beta
I11 = sum(1/xTB^2)
I12 = I21 = sum(X[2,]/xTB^2)

I22 = sum(X[2,]^2/xTB^2)
rbind( c(I11,I12), c(I21,I22))
}

# Put data in matrix form, set up initial beta estimate and tolerance for convergence
Y = y
X = rbind(rep(1,length(x)),x)
Beta.old = Beta.new = c(0,1)
delta = 1
epsilon = 10^{-5}
trace = c(Beta.new,Score(Beta.new,Y,X))

# run Newton-Raphson, save beta estimates as we go
while( delta>epsilon ){
Beta.new = Beta.old + solve(Info(Beta.old,Y,X))%*%t(t(Score(Beta.old,Y,X)))
trace = rbind(trace,c(Beta.new,Score(Beta.new,Y,X)))
delta = sum((Beta.new-Beta.old)^2)
Beta.old = Beta.new
}
Beta.hat = Beta.new
print(trace)

## [,1] [,2] [,3] [,4]
## trace 0.00000000 1.0000000 1.047605e+02 -1.343914e+01
## 0.02751952 0.2719448 1.339567e+02 3.302498e+01
## 0.05678020 0.3663916 5.521278e+01 1.065465e+01
## 0.10722527 0.3828644 1.928898e+01 2.331006e+00
## 0.16184545 0.3381554 4.787001e+00 5.163189e-01
## 0.18624249 0.3167595 4.575398e-01 6.480697e-02
## 0.18894507 0.3146805 4.655981e-03 7.057240e-04
## 0.18897277 0.3146599 4.831006e-07 7.370007e-08

After seven iterations of Newton-Raphson (equivalently Fisher scoring here, since we're using the canonical
link) we find the maximum likelihood estimate β̂ = (0.189, 0.315).
Note: It appears that the convergence of Newton-Raphson is sensitive to the starting values used. It is
possible that the algorithm may converge to some local maximum instead of the true MLE.

# Note that in the future we will use the GLM function to estimate beta
# Here's the appropriate GLM call for an exponential regression

fit = glm(y~x,family=Gamma)
summary(fit,dispersion=1)

##
## Call:
## glm(formula = y ~ x, family = Gamma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7096 -1.2278 -0.5957 0.2998 1.9012
##
## Coefficients:

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.18897 0.08441 2.239 0.0252 *
## x 0.31466 0.14812 2.124 0.0336 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1)
##
## Null deviance: 36.062 on 24 degrees of freedom
## Residual deviance: 29.788 on 23 degrees of freedom
## AIC: 100.29
##
## Number of Fisher Scoring iterations: 7

