2 www.globsynfinishingschool.com
Linear correlation and
linear regression
Continuous outcome (means)
Are the observations independent or correlated?

Outcome variable: Continuous (e.g., pain scale, cognitive function)

If independent:
- T-test: compares means between two independent groups
- ANOVA: compares means between more than two independent groups
- Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
- Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

If correlated:
- Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
- Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
- Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
- Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
- Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
- Kruskal-Wallis test: non-parametric alternative to ANOVA
- Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
Recall: Covariance

cov(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}
Interpreting Covariance
Correlation coefficient

Pearson's correlation coefficient is standardized covariance (unitless):

r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}
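As a quick sketch of this definition, the covariance and its standardization can be computed directly in plain Python; the short x and y vectors below are made-up illustration data, not anything from the deck:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson's r: sample covariance divided by the product of the
    sample standard deviations (i.e., standardized covariance)."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    return cov / (var_x * var_y) ** 0.5

x = [1, 2, 3, 4, 5]               # hypothetical data
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 3))  # 0.775
```

Note the (n - 1) factors cancel between numerator and denominator, which is what the "simpler calculation formula" slide below exploits.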
Correlation
Measures the relative strength of the linear relationship between two variables
- Unit-less
- Ranges between -1 and +1
- The closer to -1, the stronger the negative linear relationship
- The closer to +1, the stronger the positive linear relationship
- The closer to 0, the weaker the linear relationship
Scatter Plots of Data with Various Correlation Coefficients
[Six scatter plots of Y vs. X: r = -1, r = -0.6, r = 0 (top row); r = +1, r = +0.3, r = 0 (bottom row).]
Linear Correlation
[Scatter plots of Y vs. X illustrating linear relationships of varying strength, ending with a panel showing no relationship.]
Calculating by hand

r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}
  = \frac{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \cdot \dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}}
Simpler calculation formula

r = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}}

where SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) is the numerator of the covariance, and SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2 and SS_y = \sum_{i=1}^{n}(y_i - \bar{y})^2 are the numerators of the variances; the (n - 1) terms cancel.
Distribution of the correlation coefficient:

SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}

Under the null hypothesis, the sample correlation coefficient divided by its standard error follows a t-distribution with n-2 degrees of freedom (since you have to estimate the standard error).
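As an illustrative sketch, the test statistic implied by this standard error can be computed directly; here it is plugged with the r = 0.49, n = 100 values that appear in the dataset-4 example later in the deck:

```python
import math

def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0,
    using SE(r) = sqrt((1 - r^2) / (n - 2)); df = n - 2."""
    se = math.sqrt((1 - r ** 2) / (n - 2))
    return r / se

print(round(corr_t_stat(0.49, 100), 2))  # 5.56
```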
Linear regression
What is Linear?
Remember this:
Y = mX + B?
What's Slope?
Prediction
If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? Sounds like conditional probabilities?)
Regression equation

E(y_i \mid x_i) = \alpha + \beta x_i
Predicted value for an individual
Assumptions (or the fine print)
The standard error of Y given X, s_{y/x}, is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
Regression Picture

[Scatter plot with fitted line: for each observation, A is the distance from y_i to the naive mean \bar{y}, B is the distance from the fitted value \hat{y}_i to \bar{y}, and C is the residual y_i - \hat{y}_i.]

\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2

SS_{total} = SS_{reg} + SS_{residual} \quad (A^2 = B^2 + C^2)

R^2 = SS_{reg} / SS_{total}

- SS_total: total squared distance of observations from the naive mean of y (total variation)
- SS_reg: distance from regression line to the naive mean of y; variability due to x (regression)
- SS_residual: variance around the regression line; additional variability not explained by x, which is what the least squares method aims to minimize
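A minimal numeric check of this decomposition, on hypothetical data; the in-code assert verifies SS_total = SS_reg + SS_residual before R^2 is returned:

```python
from statistics import mean

def r_squared(x, y):
    """Decompose SS_total into SS_reg + SS_residual for a simple
    least-squares line and return R^2 = SS_reg / SS_total."""
    xbar, ybar = mean(x), mean(y)
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    beta = ss_xy / ss_x           # slope = cov(x, y) / var(x)
    alpha = ybar - beta * xbar    # intercept
    yhat = [alpha + beta * xi for xi in x]
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    ss_reg = sum((yh - ybar) ** 2 for yh in yhat)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    assert abs(ss_total - (ss_reg + ss_res)) < 1e-9   # the decomposition holds
    return ss_reg / ss_total

x = [1, 2, 3, 4, 5]               # hypothetical data
y = [2, 4, 5, 4, 5]
print(round(r_squared(x, y), 3))  # 0.6
```

For these data r is about 0.775, and 0.775^2 = 0.6, matching R^2 = SS_reg / SS_total.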
Recall example: cognitive
function and vitamin D
Hypothetical data loosely based on [1];
cross-sectional study of 100 middle-
aged and older European men.
Cognitive function is measured by the Digit
Symbol Substitution Test (DSST).
1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged
and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D
Mean = 63 nmol/L
Standard deviation = 33 nmol/L
Distribution of DSST
Normally distributed
Mean = 28 points
Standard deviation = 10 points
Four hypothetical datasets
I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):
- 0 points per 10 nmol/L
- 0.5 points per 10 nmol/L
- 1.0 points per 10 nmol/L
- 1.5 points per 10 nmol/L
Dataset 1: no relationship
Dataset 2: weak relationship
Dataset 3: weak to moderate
relationship
Dataset 4: moderate
relationship
The Best fit line
Regression equation: E(Y_i) = 28 + 0*vit D_i (in 10 nmol/L)

The Best fit line
Regression equation: E(Y_i) = 26 + 0.5*vit D_i (in 10 nmol/L)

The Best fit line
Regression equation: E(Y_i) = 22 + 1.0*vit D_i (in 10 nmol/L)

The Best fit line
Regression equation: E(Y_i) = 20 + 1.5*vit D_i (in 10 nmol/L)
Estimating the intercept and slope: least squares estimation

** Least Squares Estimation
A little calculus.
What are we trying to estimate? \beta, the slope.
What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values \hat{y}_i; the differences y_i - \hat{y}_i are also called the residuals, or left-over unexplained variability.
Find the \beta that gives the minimum sum of the squared differences. How do you find the minimum of a function? Take the derivative; set it equal to zero; and solve. Typical max/min problem from calculus.

\frac{d}{d\beta}\sum_{i=1}^{n}\bigl(y_i - (\alpha + \beta x_i)\bigr)^2 = \sum_{i=1}^{n} 2\bigl(y_i - (\alpha + \beta x_i)\bigr)(-x_i)

\sum_{i=1}^{n} 2(-x_i)\bigl(y_i - \alpha - \beta x_i\bigr) = 0 \;\ldots

Slope (beta coefficient): \hat{\beta} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}

Intercept: \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
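A sketch of these closed-form estimates in plain Python; the short vectors are illustrative only:

```python
from statistics import mean

def least_squares(x, y):
    """Closed-form least-squares estimates:
    slope = cov(x, y) / var(x); intercept = ybar - slope * xbar."""
    xbar, ybar = mean(x), mean(y)
    cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    var_x = sum((xi - xbar) ** 2 for xi in x)
    beta = cov_xy / var_x        # the (n - 1) denominators cancel
    alpha = ybar - beta * xbar
    return alpha, beta

alpha, beta = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(alpha, 2), round(beta, 2))  # 2.2 0.6
```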
Relationship with correlation
SDx
r
SDy
In correlation, the two variables are treated as equals. In regression, one variable is considered
independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.
Example: dataset 4
- SD_x = 33 nmol/L
- SD_y = 10 points
- Cov(X, Y) = 163 points*nmol/L
- Beta = 163/33^2 = 0.15 points per nmol/L = 1.5 points per 10 nmol/L
- r = 163/(10*33) = 0.49
- Or: r = 0.15 * (33/10) = 0.49
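These slide numbers can be double-checked in a few lines; the summary statistics below are taken straight from the bullets above:

```python
sd_x, sd_y, cov_xy = 33.0, 10.0, 163.0   # from the slide

beta = cov_xy / sd_x ** 2        # slope = cov(x, y) / var(x)
r = cov_xy / (sd_x * sd_y)       # r = cov(x, y) / (SDx * SDy)

print(round(beta, 2))                  # 0.15 points per nmol/L
print(round(r, 2))                     # 0.49
print(round(beta * sd_x / sd_y, 2))    # 0.49 again, via r = beta * SDx / SDy
```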
Significance testing
Slope

Distribution of the slope: \hat{\beta} \sim T_{n-2}\bigl(\beta,\ \mathrm{s.e.}(\hat{\beta})\bigr)

T_{n-2} = \frac{\hat{\beta} - 0}{\mathrm{s.e.}(\hat{\beta})}
Formula for the standard error of beta (you will not have to calculate by hand!):

s_{\hat{\beta}} = \sqrt{\frac{s_{y/x}^2}{SS_x}}, \quad \text{where } s_{y/x}^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}

and SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2, \quad \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i
Example: dataset 4
Standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p<.0001
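As a sketch, here is the whole chain (fit, s.e.(beta), t statistic with df = n - 2) on a tiny made-up dataset, not the vitamin D data:

```python
import math
from statistics import mean

def slope_test(x, y):
    """Fit the least-squares line, then compute s.e.(beta) and the
    t statistic (df = n - 2) from the slide's formulas."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ss_x
    alpha = ybar - beta * xbar
    resid_ss = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    s2_yx = resid_ss / (n - 2)           # variance around the line
    se_beta = math.sqrt(s2_yx / ss_x)    # s.e.(beta)
    return beta, se_beta, beta / se_beta

beta, se, t = slope_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(beta, 2), round(se, 3), round(t, 2))  # 0.6 0.283 2.12
```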
Residual Analysis: check
assumptions
e_i = Y_i - \hat{Y}_i
The residual for observation i, ei, is the difference
between its observed and predicted value
Check the assumptions of regression by examining the
residuals
Examine for linearity assumption
Examine for constant variance for all levels of X
(homoscedasticity)
Evaluate normal distribution assumption
Evaluate independence assumption
\hat{y}_i = 20 + 1.5\,x_i

For Vitamin D = 95 nmol/L (or 9.5 in 10 nmol/L):

\hat{y}_i = 20 + 1.5(9.5) = 34.25 \approx 34

Residual = observed - predicted
At X = 95 nmol/L: y_i = 48, \hat{y}_i = 34, so the residual is y_i - \hat{y}_i = 14.
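The arithmetic of this residual, spelled out with the values quoted on the slide:

```python
def predict(x10):
    """Fitted dataset-4 line from the slide; x10 is vitamin D in 10-nmol/L units."""
    return 20 + 1.5 * x10

observed = 48                        # DSST at vitamin D = 95 nmol/L (slide value)
predicted = predict(9.5)             # 95 nmol/L = 9.5 in 10-nmol/L units
print(predicted)                     # 34.25
print(observed - round(predicted))   # residual: 48 - 34 = 14
```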
Residual Analysis for Linearity
[Paired plots of Y vs. x and residuals vs. x: a curved residual pattern ("Not Linear") versus a patternless band ("Linear").]
Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.
Residual Analysis for Homoscedasticity
[Paired plots of Y vs. x and residuals vs. x: a fan-shaped residual spread versus a constant band; panels labeled "Not Independent" and "Independent" show residuals vs. X with and without a systematic pattern.]
Multiple linear regression
What if age is a confounder here?
- Older men have lower vitamin D
- Older men have poorer cognition
Adjust for age by putting age in the model:

DSST score = intercept + slope_1*vitamin D + slope_2*age
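A hedged sketch of fitting such a two-predictor model by solving the normal equations directly. The tiny (x1, x2) -> y rows are fabricated so that y = 10 + 2*x1 + 0.5*x2 holds exactly (think of x1 as a vitamin-D-like predictor and x2 as an age-like predictor), purely to show the mechanics:

```python
def fit_ols(rows, y):
    """Solve the normal equations (X'X) b = X'y via Gauss-Jordan
    elimination with partial pivoting; returns [intercept, b1, b2, ...]."""
    X = [[1.0] + list(r) for r in rows]    # prepend the intercept column
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    M = [xtx_row[:] + [v] for xtx_row, v in zip(XtX, Xty)]  # augmented matrix
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(M[i][c]))    # pivot row
        M[c], M[p] = M[p], M[c]
        for i in range(k):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

# fabricated data: y = 10 + 2*x1 + 0.5*x2 exactly
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [13, 14.5, 18.5, 19.5, 24]
b0, b1, b2 = fit_ols(rows, y)
print(round(b0, 2), round(b1, 2), round(b2, 2))  # 10.0 2.0 0.5
```

Because the outcome is an exact linear function of the predictors, the fit recovers the true coefficients; with real, noisy data the same code returns the least-squares plane.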
2 predictors: age and vit D
Different 3D view
Fit a plane rather than a line
Equation of the Best fit plane

DSST score = 53 + 0.0039*vitamin D (in 10 nmol/L) - 0.46*age (in years)

E(y) = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z
A t-test is linear regression!
Divide vitamin D into two groups:
- Insufficient vitamin D (<50 nmol/L)
- Sufficient vitamin D (>=50 nmol/L), reference group
We can evaluate these data with a t-test or a linear regression:

T_{98} = \frac{40 - 32.5}{\sqrt{\frac{10.8^2}{54} + \frac{10.8^2}{46}}} = 3.46; \quad p = .0008
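A sketch of the equivalence: code the group as a 0/1 dummy, and the regression slope is exactly the difference in group means. The toy scores below are made up, not the slide's data:

```python
from statistics import mean

group0 = [30, 34, 28, 36]   # e.g., "sufficient" (reference) group scores
group1 = [22, 26, 24, 20]   # e.g., "insufficient" group scores

x = [0] * len(group0) + [1] * len(group1)   # 0/1 dummy predictor
y = group0 + group1
xbar, ybar = mean(x), mean(y)
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
       sum((xi - xbar) ** 2 for xi in x)

print(beta)                                 # -9.0: the regression slope...
print(float(mean(group1) - mean(group0)))   # -9.0: ...equals the mean difference
```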
As a linear regression
[Parameter-estimate table for "Sufficient vs. Deficient" (estimate, standard error, t value, Pr > |t|); the numeric values were not preserved in this copy.]
Results
[Parameter Estimates table (variable, DF, estimate, standard error, t value, Pr > |t|); numeric values not preserved.]
Interpretation:
- The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
- The insufficient group has a mean DSST 6.87 points lower than the reference (sufficient) group.
Other types of multivariate
regression
Multiple linear regression is for normally
distributed outcomes
Common multivariate regression models.

Continuous outcome (e.g., blood pressure):
- Model: linear regression
- Example equation: blood pressure (mmHg) = \alpha + \beta_{salt}*salt consumption (tsp/day) + \beta_{age}*age (years) + \beta_{smoker}*ever smoker (yes=1/no=0)
- Coefficients give you: slopes; tells you how much the outcome variable increases for every 1-unit increase in each predictor.

Binary outcome (e.g., high blood pressure, yes/no):
- Model: logistic regression
- Example equation: ln(odds of high blood pressure) = \alpha + \beta_{salt}*salt consumption (tsp/day) + \beta_{age}*age (years) + \beta_{smoker}*ever smoker (yes=1/no=0)
- Coefficients give you: odds ratios; tells you how much the odds of the outcome increase for every 1-unit increase in each predictor.

Time-to-event outcome (e.g., time-to-death):
- Model: Cox regression
- Example equation: ln(rate of death) = \alpha + \beta_{salt}*salt consumption (tsp/day) + \beta_{age}*age (years) + \beta_{smoker}*ever smoker (yes=1/no=0)
- Coefficients give you: hazard ratios; tells you how much the rate of the outcome increases for every 1-unit increase in each predictor.
Multivariate regression pitfalls
Multi-collinearity
Residual confounding
Overfitting
Multicollinearity
Multicollinearity arises when two variables that
measure the same thing or similar things (e.g.,
weight and BMI) are both included in a multiple
regression model; they will, in effect, cancel each
other out and generally destroy your model.
Residual confounding
You cannot completely wipe out
confounding simply by adjusting for
variables in multiple regression unless
variables are measured with zero error
(which is usually impossible).
Example: meat eating and mortality
Men who eat a lot of meat are unhealthier for many reasons!
Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med 2009;169:562-71.
Mortality risks
Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med 2009;169:562-71.
Overfitting
In multivariate modeling, you can get
highly significant but meaningless
results if you put too many predictors in
the model.
The model is fit perfectly to the quirks
of your particular sample, but has no
predictive ability in a new sample.
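A small sketch of the trap, using fabricated noise with a fixed seed: screen many pure-noise predictors against a pure-noise outcome and keep the best-correlated one. The "winner" looks impressive even though nothing real is there:

```python
import random
from statistics import mean

def pearson(x, y):
    """Sample Pearson correlation (SS_xy / sqrt(SS_x * SS_y))."""
    xb, yb = mean(x), mean(y)
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sx = sum((a - xb) ** 2 for a in x)
    sy = sum((b - yb) ** 2 for b in y)
    return sxy / (sx * sy) ** 0.5

random.seed(42)                              # reproducible noise
n = 25
y = [random.gauss(0, 1) for _ in range(n)]   # outcome: pure noise
best = max(
    abs(pearson([random.gauss(0, 1) for _ in range(n)], y))
    for _ in range(50)                       # 50 unrelated "predictors"
)
print(round(best, 2))  # the best of 50 noise predictors still correlates noticeably
```

Held-out data (a new sample) would expose the winner as noise, which is exactly the point of the slide.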
Overfitting: class data example
I asked SAS to automatically find predictors of optimism in our class dataset. Here's the resulting linear regression model:
[Parameter-estimate table (estimate, standard error, Type II SS, F value, Pr > F); numeric values not preserved.]
Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly significant!), and high ratings for Obama and high love of math are positively related to optimism (highly significant!).
If something seems too good to be true...

Clinton, univariate:
[Parameter-estimate table; numeric values not preserved.]

Obama, univariate (compare with the multivariate result, p < .0001):

Variable    Label      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept  1    0.82107              2.43137          0.34      0.7389
obama       obama      1    0.87276              0.31973          2.73      0.0126
Time-to-event outcome (survival data); HRP 262
Are the observations independent or correlated?

Outcome variable: Time-to-event (e.g., time to fracture)

If independent:
- Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
- Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

If correlated: n/a (already over time)

Modifications to Cox regression if the proportional-hazards assumption is violated:
- Time-dependent predictors or time-dependent hazard ratios (tricky!)