
Inference in regression

Brian Caffo, Jeff Leek and Roger Peng


Johns Hopkins Bloomberg School of Public Health

Recall our model and fitted values


Consider the model

$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.

We assume that the true model is known.
We assume that you've seen confidence intervals and hypothesis tests before.

$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$
$\hat \beta_1 = \mathrm{Cor}(Y, X) \, \mathrm{Sd}(Y) / \mathrm{Sd}(X)$
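
As a quick sanity check (not part of the original slides), these two formulas can be verified against lm() on simulated data; the intercept 0.5, slope 2, and error standard deviation 0.3 below are made-up values.

# Sketch: verify the closed-form estimates against lm() on simulated data.
# True intercept (0.5), slope (2), and error sd (0.3) are illustrative choices.
set.seed(1234)
nsim <- 100
xsim <- rnorm(nsim)
ysim <- 0.5 + 2 * xsim + rnorm(nsim, sd = 0.3)
b1 <- cor(ysim, xsim) * sd(ysim) / sd(xsim)
b0 <- mean(ysim) - b1 * mean(xsim)
rbind(manual = c(b0, b1), lm = coef(lm(ysim ~ xsim)))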


Review

Statistics like $\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}$ often have the following properties.

1. They are normally distributed and have a finite-sample Student's t distribution if the estimated variance is replaced with a sample estimate (under normality assumptions).
2. They can be used to test $H_0 : \theta = \theta_0$ versus $H_a : \theta >, <, \neq \theta_0$.
3. They can be used to create a confidence interval for $\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$, where $Q_{1-\alpha/2}$ is the relevant quantile from either a normal or t distribution (see the short R illustration below).

In the case of regression with iid sampling assumptions and normal errors, our inferences will follow very similarly to what you saw in your inference class.

We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions on the ways in which the X values are collected, the iid sampling model, and the mean model, the normal results hold and can be used to create confidence intervals and perform hypothesis tests.
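
For reference, the quantile $Q_{1-\alpha/2}$ is obtained in R with qt() or qnorm(); the degrees of freedom below (46) are just an illustrative choice.

# CI multipliers for a 95% interval; df = 46 is an arbitrary illustrative value
qt(.975, df = 46)   # about 2.01
qnorm(.975)         # about 1.96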


Standard errors (conditioned on X)


$$
\begin{aligned}
\mathrm{Var}(\hat \beta_1) & = \mathrm{Var}\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
& = \frac{\mathrm{Var}\left(\sum_{i=1}^n Y_i (X_i - \bar X)\right)}{\left(\sum_{i=1}^n (X_i - \bar X)^2\right)^2} \\
& = \frac{\sigma^2 \sum_{i=1}^n (X_i - \bar X)^2}{\left(\sum_{i=1}^n (X_i - \bar X)^2\right)^2} \\
& = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}
\end{aligned}
$$
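
A small Monte Carlo sketch (with an arbitrary fixed design, true slope, and $\sigma = 1$) can confirm the last line: the empirical variance of $\hat \beta_1$ over repeated samples matches $\sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$.

# Monte Carlo check of Var(beta1hat) = sigma^2 / sum((x - xbar)^2),
# holding the x values fixed across simulations (all values are illustrative)
set.seed(42)
xfix <- seq(0, 1, length.out = 30)
sigmaTrue <- 1
b1hat <- replicate(10000, {
  ysim <- 1 + 2 * xfix + rnorm(length(xfix), sd = sigmaTrue)
  cor(ysim, xfix) * sd(ysim) / sd(xfix)
})
c(simulated = var(b1hat),
  theoretical = sigmaTrue^2 / sum((xfix - mean(xfix))^2))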


Results
$\sigma_{\hat \beta_1}^2 = \mathrm{Var}(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$

$\sigma_{\hat \beta_0}^2 = \mathrm{Var}(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \sigma^2$

In practice, $\sigma$ is replaced by its estimate.

It's probably not surprising that under iid Gaussian errors

$$\frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}}$$

follows a $t$ distribution with $n - 2$ degrees of freedom and a normal distribution for large $n$.

This can be used to create confidence intervals and perform hypothesis tests.


Example diamond data set


library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
# Least squares estimates
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
# Residuals and residual standard deviation
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n - 2))
ssx <- sum((x - mean(x))^2)
# Standard errors, t statistics, and two-sided p-values
seBeta0 <- (1 / n + mean(x)^2 / ssx)^.5 * sigma
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")


Example continued
coefTable

            Estimate Std. Error t value   P(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40

fit <- lm(y ~ x)
summary(fit)$coefficients

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40


Getting a confidence interval


sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]

[1] -294.5 -224.8

sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]

[1] 3556 3886

With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.
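
The same intervals are available directly from confint(), which uses the same t quantiles; its rows are the intercept and slope.

# Built-in equivalent of the manual interval computation above
confint(fit)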


Prediction of outcomes
Consider predicting $Y$ at a value of $X$.

Predicting the price of a diamond given the carat.
Predicting the height of a child given the height of the parents.

The obvious estimate for prediction at point $x_0$ is

$$\hat \beta_0 + \hat \beta_1 x_0$$

A standard error is needed to create a prediction interval.

There's a distinction between intervals for the regression line at point $x_0$ and the prediction of what a $y$ would be at point $x_0$.

Line at $x_0$ se:

$$\hat \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$$

Prediction interval se at $x_0$:

$$\hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$$
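
As a sketch of how these formulas are used, the two standard errors and the corresponding 95% intervals can be computed at a single point from the quantities defined in the diamond example; the carat value 0.25 is an arbitrary illustrative choice within the observed range.

# Interval for the line and prediction interval at one (hypothetical) x0
x0 <- 0.25
yhat0 <- beta0 + beta1 * x0
seLine <- sigma * sqrt(1 / n + (x0 - mean(x))^2 / ssx)
sePred <- sigma * sqrt(1 + 1 / n + (x0 - mean(x))^2 / ssx)
yhat0 + c(-1, 1) * qt(.975, df = n - 2) * seLine   # interval for the line
yhat0 + c(-1, 1) * qt(.975, df = n - 2) * sePred   # prediction interval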


Plotting the prediction intervals


plot(x, y, frame = FALSE, xlab = "Carat", ylab = "Dollars", pch = 21, col = "black", bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
# Standard errors for the line and for prediction at each xVal
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2 / ssx)
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2 / ssx)
lines(xVals, yVals + 2 * se1)
lines(xVals, yVals - 2 * se1)
lines(xVals, yVals + 2 * se2)
lines(xVals, yVals - 2 * se2)


Plotting the prediction intervals

[Figure: the diamond data with the fitted line; the inner pair of lines is the interval for the regression line, the outer pair the prediction interval.]

Discussion
Both intervals have varying widths, with the least width at the mean of the Xs.
We are quite confident in the regression line, so that interval is very narrow.
If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
The prediction interval must incorporate the variability in the data around the line.
Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.
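
A minimal numerical illustration of the last two points, evaluated at $x_0 = \bar X$ with an assumed $\sigma = 1$: the standard error for the line, $\sigma/\sqrt{n}$, shrinks to zero as $n$ grows, while the prediction standard error, $\sigma\sqrt{1 + 1/n}$, approaches $\sigma$.

# At x0 = xbar the line se is sigma / sqrt(n); the prediction se is
# sigma * sqrt(1 + 1/n). Illustration with sigma = 1:
nVals <- c(10, 100, 10000)
rbind(lineSE = 1 / sqrt(nVals),
      predictionSE = sqrt(1 + 1 / nVals))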


In R
newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = "confidence")
p2 <- predict(fit, newdata, interval = "prediction")
plot(x, y, frame = FALSE, xlab = "Carat", ylab = "Dollars", pch = 21, col = "black", bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
# Columns 2 and 3 of the predict() output are the lower and upper limits
lines(xVals, p1[, 2]); lines(xVals, p1[, 3])
lines(xVals, p2[, 2]); lines(xVals, p2[, 3])
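
predict() works the same way for a single new observation; the carat value 0.2 here is an arbitrary example.

# Fitted value and prediction interval for one hypothetical diamond
predict(fit, data.frame(x = 0.2), interval = "prediction")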


In R

[Figure: the same scatterplot with the fitted line and the confidence and prediction bands produced by predict().]
