Outline
Basics
Visualization
Covariance
Significance testing and interval estimation
Effect size
Bias
Factors affecting correlation
Issues with correlational studies
Correlation
Research question: What is the relationship
between two variables?
Correlation is a measure of the direction and
degree of linear association between two variables.
Correlation is the standardized covariance
between two variables
Questions to be asked
Is there a linear relationship between x and y?
What is the strength of this relationship?
Pearson Product Moment Correlation Coefficient r
Can we describe this relationship and use this to predict
y from x?
y=bx+a
Is the relationship we have described statistically
significant?
Not a very interesting one if tested against a null of r = 0
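The questions above can be sketched numerically. This is a minimal illustration with made-up data (the values below are assumptions for illustration, not from the slides):

```python
import numpy as np

# Hypothetical example data (illustrative values only)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([3.0, 5.0, 4.5, 8.0, 9.5])

# Direction and strength of the linear relationship: Pearson r
r = np.corrcoef(x, y)[0, 1]

# Prediction line y = b*x + a via least squares
b, a = np.polyfit(x, y, 1)

print(r, b, a)
```

A positive r here says the two variables rise together; the fitted slope b and intercept a then let us predict y from x.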
Other stuff
Check scatterplots to see whether a Pearson r makes
sense
Use both r and R2 to understand the situation
If data is non-metric or non-normal, use non-parametric correlations
Correlation does not prove causation
True relationship may be in the opposite direction, co-causal, or due to other variables
However, correlation is the primary statistic used in
making an assessment of causality
Potential Causation
Possible outcomes
r ranges from -1 to +1
As one variable increases/decreases, the other
variable increases/decreases
Positive covariance
As one variable increases/decreases, another
decreases/increases
Negative covariance
No relationship (independence)
r=0
Non-linear relationship?
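The three sign outcomes above can be checked directly on toy data (a minimal sketch; the variable names and values are assumptions):

```python
import numpy as np

x = np.arange(10, dtype=float)
together = 2 * x + 1      # y moves with x    -> positive covariance
opposite = -2 * x + 1     # y moves against x -> negative covariance

# Off-diagonal entry of the covariance matrix is cov(x, y)
print(np.cov(x, together)[0, 1], np.cov(x, opposite)[0, 1])
```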
Scatterplots
As we discussed previously, scatterplots provide a pictorial
examination of the relationship between two quantitative variables
[Figure: eight scatterplot panels on common axes (roughly 40-140), illustrating r = 1, 0.95, 0.7, 0.4, -0.4, -0.7, -0.95, and -1]
Linear Correlation / Covariance
How do we obtain a quantitative measure of the
linear association between X and Y?
The Pearson Product-Moment Correlation
Coefficient, r, comes from the covariance
statistic; it reflects the degree to which the two
variables vary together
Covariance
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
The variance shared by two variables
When X and Y move in the same direction (i.e. their
deviations from the mean share the same sign),
cov(x,y) is positive.
When X and Y move in opposite directions,
cov(x,y) is negative.
When there is no consistent relationship,
cov(x,y) = 0
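The definitional formula can be computed by hand and checked against numpy's built-in (a minimal sketch; the data values are made up for illustration):

```python
import numpy as np

def covariance(x, y):
    # Sum of cross-products of deviations from the means,
    # divided by n - 1 (the sample covariance)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(covariance(x, y))   # agrees with np.cov(x, y)[0, 1]
```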
Covariance
Covariance is not easily interpreted on its own
and cannot be compared across different scales
of measurement
Solution: standardize this measure
Pearson's r:
$$r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y}$$
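Dividing the covariance by the two standard deviations gives exactly the value numpy's correlation routine reports (a minimal sketch with made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]                    # sample covariance (n - 1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # standardize by the two SDs

print(r)   # identical to np.corrcoef(x, y)[0, 1]
```

Because the scale factors cancel, r is unit-free and comparable across variables measured on different scales.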
Significance test for correlation
All correlations in a practical setting will be non-
zero
A significance test can be conducted in an effort
to infer to a population
Key Question: Is the r large enough that it is
unlikely to have come from a population in
which the two variables are unrelated?
Testing the null hypothesis $H_0\colon \rho = 0$ vs. the
alternative hypothesis $H_1\colon \rho \neq 0$
$\rho$ = population product-moment correlation
coefficient
Significance test for correlation
However, with larger N, small, possibly
non-meaningful correlations can be deemed
significant.
So the better question is: is a test against
zero useful?
Tests of significance for r typically have
limited utility if testing against a zero value.

Critical values of r ($\alpha$ = .05), by df = N - 2:
df      critical r
5       .67
10      .50
15      .41
20      .36
25      .32
30      .30
50      .23
200     .11
500     .07
1000    .05

Adjusted r:
$$r_{adj} = \sqrt{1 - \frac{(1 - r^2)(N - 1)}{N - 2}}$$
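The N-dependence of significance can be seen with the standard t transformation of r, $t = r\sqrt{(N-2)/(1-r^2)}$ with df = N - 2 (a textbook result, sketched here rather than taken from the slide):

```python
import numpy as np

def t_for_r(r, n):
    # t statistic for testing H0: rho = 0, with df = n - 2
    return r * np.sqrt((n - 2) / (1 - r**2))

# The same r = .30 that is significant at N = 200 is not at N = 10:
t_small = t_for_r(0.30, 10)    # well under the ~1.96 two-tailed cutoff
t_large = t_for_r(0.30, 200)   # comfortably over it

print(t_small, t_large)
```

This is why large-sample studies routinely report "significant" correlations that explain almost no variance.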
Factors affecting correlation
Linearity
Heterogeneous subsamples
Range restrictions
Outliers
Linearity
Nonlinear relationships will have an adverse
effect on a measure designed to find a linear
relationship
Heterogeneous subsamples
Sub-samples may artificially increase or decrease
the overall r or, in a corollary to Simpson's paradox,
produce an opposite-sign relationship for the aggregated
data compared to the groups
Solution: calculate r separately for sub-samples and
overall, and look for differences
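The aggregation reversal can be built deliberately: two subsamples each with a perfect negative relationship, where one group sits higher on both variables, yield a positive pooled r (a minimal sketch; the group values are invented for illustration):

```python
import numpy as np

# Within each group, x and y are perfectly negatively related,
# but group B is shifted up on both variables.
xa = np.array([1.0, 2.0, 3.0, 4.0]); ya = np.array([4.0, 3.0, 2.0, 1.0])
xb = xa + 5;                          yb = ya + 10

r_a = np.corrcoef(xa, ya)[0, 1]     # within group A
r_b = np.corrcoef(xb, yb)[0, 1]     # within group B
r_all = np.corrcoef(np.concatenate([xa, xb]),
                    np.concatenate([ya, yb]))[0, 1]   # pooled

print(r_a, r_b, r_all)
```

Computing r per subsample and overall, as the slide recommends, is exactly what exposes this.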
Range restriction
Limiting the variability of your data can in turn
limit the possibility for covariability between two
variables, thus attenuating r.
Common example occurs with Likert scales
E.g. 1 - 4 vs. 1 - 9
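Attenuation from range restriction can be simulated directly (a minimal sketch with synthetic data; the cutoff and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=2000)
y = x + rng.normal(size=2000)          # true linear relationship plus noise

r_full = np.corrcoef(x, y)[0, 1]       # full-range correlation

keep = np.abs(x) < 0.5                 # restrict x to a narrow band
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(r_full, r_restricted)            # restricted r is much smaller
```

The relationship has not changed; only the variability available to detect it has.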
Options
Compute r with and without outliers
Compute a robustified r
For example, recode outliers as having more
conservative scores (winsorize)
Transform variables (last resort)
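Winsorizing can be sketched with a hand-rolled helper (the percentile cutoffs and the single-outlier dataset below are assumptions for illustration):

```python
import numpy as np

def winsorize(a, lo=10, hi=90):
    # Recode extreme scores to the values at the chosen percentiles
    low, high = np.percentile(a, [lo, hi])
    return np.clip(a, low, high)

# Nine well-behaved pairs plus one extreme outlier (x = 40, y = 0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 40.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 9.0, 0.0])

r_raw = np.corrcoef(x, y)[0, 1]                        # dragged negative
r_win = np.corrcoef(winsorize(x), winsorize(y))[0, 1]  # sign restored

print(r_raw, r_win)
```

Comparing r with and without the adjustment, as the slide suggests, shows how much a single point was driving the result.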
Advantages of correlational studies
Show the amount (strength) of relationship
present
Can be used to make predictions about the
variables studied
Often easier to collect correlational data, and
interpretation is fairly straightforward.
Disadvantages of correlational studies
Can't assume that a cause-effect relationship
exists
Little or no control (experimental manipulation)
of the variables is usually seen
Relationships may be accidental or due to a third,
unmeasured variable
Common causes
Spurious correlations and Mediators