
Correlation

Oh yeah!
Outline
Basics
Visualization
Covariance
Significance testing and interval estimation
Effect size
Bias
Factors affecting correlation
Issues with correlational studies
Correlation
Research question: What is the relationship
between two variables?
Correlation is a measure of the direction and
degree of linear association between 2 variables.
Correlation is the standardized covariance
between two variables
Questions to be asked
Is there a linear relationship between x and y?
What is the strength of this relationship?
Pearson Product Moment Correlation Coefficient r
Can we describe this relationship and use this to predict
y from x?
y = bx + a
Is the relationship we have described statistically
significant?
Not a very interesting question if tested against a null of r = 0
Other stuff
Check scatterplots to see whether a Pearson r makes
sense
Use both r and R² to understand the situation
If the data are non-metric or non-normal, use non-parametric correlations
Correlation does not prove causation
True relationship may be in the opposite direction, co-causal, or due to other variables
However, correlation is the primary statistic used in
making an assessment of causality
Potential Causation
Possible outcomes
r ranges from -1 to +1
As one variable increases/decreases, the other
variable increases/decreases
Positive covariance
As one variable increases/decreases, another
decreases/increases
Negative covariance
No relationship (independence)
r = 0
Note: r ≈ 0 can also arise from a strong non-linear relationship
Scatterplots
As we discussed previously, scatterplots provide a pictorial
examination of the relationship between two quantitative variables

Predictor variable on the X-axis (abscissa); criterion variable on the Y-axis (ordinate)
Each subject is located in the scatterplot by means of a pair of scores
(score on the X variable and score on the Y variable)
Plot each pair of observations (X, Y)
X = predictor variable (independent)
Y = criterion variable (dependent)
Check for linear relationship
Line of best fit
y = a + bx
Check for outliers (a plotting sketch follows below)
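As a quick illustration (my own sketch, not part of the original slides; the data and variable names are hypothetical), a minimal Python/matplotlib example that plots the (X, Y) pairs and overlays a line of best fit:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical data: X = predictor, Y = criterion
x = rng.normal(100, 15, size=50)
y = 0.8 * x + rng.normal(0, 10, size=50)       # linear relation plus noise

# Line of best fit y = a + b*x (np.polyfit returns [slope, intercept] for degree 1)
b, a = np.polyfit(x, y, 1)

plt.scatter(x, y)                               # each (X, Y) pair is one subject
plt.plot(np.sort(x), a + b * np.sort(x), 'r-')  # regression line
plt.xlabel('Predictor (X)')
plt.ylabel('Criterion (Y)')
plt.show()
```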
Example of a Scatterplot
The relationship between scores on a test of quantitative skills taken by students on the first day of a stats course (X-axis) and their combined scores on two semester exams (Y-axis)
Example of a Scatterplot
The two variables are positively related
As quantitative skill increases, so does performance on the
two midterm exams
Linear relationship between the variables
Line of best fit drawn on the graph - the regression line
The strength or degree of the linear relationship is measured by a correlation coefficient, i.e. how tightly the data points cluster around the regression line
We can use this information to determine whether the
linear relationship represents a true relationship in the
population or is due entirely to chance factors
What do we look for in a Scatterplot?
Overall pattern: Ellipse
Any striking deviations (outliers)
Form: is it linear? (curved? clustered?)
Direction: is it positive
(high values of the two variables tend to occur together)
or negative
(high values of one variable tend to occur with low values of the other variable)?
Strength: how close the points lie to the line of
best fit (if a linear relationship)
[Figure: eight scatterplot panels illustrating correlations of r = 1, 0.95, 0.7, 0.4, -0.4, -0.7, -0.95, and -1]
Linear Correlation / Covariance
How do we obtain a quantitative measure of the
linear association between X and Y?
The Pearson Product-Moment Correlation Coefficient, r, comes from the covariance statistic; it reflects the degree to which the two variables vary together
Covariance

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
The variance shared by two variables
When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative)
cov(x, y) = positive
When X and Y move in opposite directions
cov(x, y) = negative
When there is no consistent relationship
cov(x, y) = 0
Covariance
Covariance is not easily interpreted on its own and cannot be compared across different scales of measurement
Solution: standardize this measure
Pearson's r (a computational sketch follows below):

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x\, s_y}$$
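A minimal numpy sketch (my own illustration; the data are hypothetical) showing that dividing the covariance by the two standard deviations reproduces Pearson's r:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=100)        # hypothetical data
y = 0.5 * x + rng.normal(0, 8, size=100)

# Covariance with the n - 1 denominator (ddof=1)
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# Standardize by the sample standard deviations
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Matches numpy's built-in correlation
r_builtin = np.corrcoef(x, y)[0, 1]
print(r_manual, r_builtin)              # the two values agree
```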
Significance test for correlation
All correlations in a practical setting will be non-zero
A significance test can be conducted in an effort
to infer to a population
Key Question: Is the r large enough that it is
unlikely to have come from a population in
which the two variables are unrelated?
Testing the null hypothesis H0: ρ = 0 vs. the alternative hypothesis H1: ρ ≠ 0
ρ = the population product-moment correlation coefficient
Significance test for correlation
However, with larger N, small, possibly non-meaningful correlations can be deemed significant:

df (N - 2)    critical r (alpha = .05)
     5             .67
    10             .50
    15             .41
    20             .36
    25             .32
    30             .30
    50             .23
   200             .11
   500             .07
  1000             .05

So the better question is: is a test against zero useful?
Tests of significance for r typically have limited utility if testing against a zero value
Go by the size of r and judge its worth by what is seen in the relevant literature
(A sketch of where these critical values come from follows below.)
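As an aside (my own sketch, not from the slides), the table's values appear to match a one-tailed test at alpha = .05, obtained by inverting the t-test for r; scipy's t quantiles reproduce them to rounding:

```python
from scipy import stats

def critical_r(df, alpha=0.05, two_tailed=False):
    # Smallest |r| deemed significant at the given alpha with df = N - 2
    p = 1 - alpha / 2 if two_tailed else 1 - alpha
    t_cv = stats.t.ppf(p, df)
    # Invert t = r * sqrt(df) / sqrt(1 - r^2) to solve for r
    return t_cv / (t_cv**2 + df) ** 0.5

for df in (5, 10, 15, 20, 25, 30, 50, 200, 500, 1000):
    print(df, round(critical_r(df), 2))   # reproduces the table above (to rounding)
```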
Significance test for correlation
Furthermore, using the approaches outlined in Howell, while standard, is really not necessary
Using the t-distribution as described, we would only really be able to test a null hypothesis of zero:

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}, \qquad df = N - 2$$

If we want to test against some specific value, we have to convert r in some odd fashion (the Fisher transformation) and test using these new values:

$$r' = (0.5)\log_e\frac{1 + r}{1 - r}, \qquad se = \frac{1}{\sqrt{N - 3}}$$

$$z = \frac{r' - \rho'}{1/\sqrt{N - 3}}, \qquad CI(r') = r' \pm z_{cv}\,\frac{1}{\sqrt{N - 3}}$$

(A sketch implementing these formulas follows below.)
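A minimal numpy/scipy sketch of these formulas (my own illustration; the sample r, N, and the comparison value rho0 are hypothetical):

```python
import numpy as np
from scipy import stats

def fisher_test(r, n, rho0=0.0, alpha=0.05):
    """Test H0: rho = rho0 and build a CI for r via the Fisher transformation."""
    r_prime = np.arctanh(r)           # 0.5 * log((1 + r) / (1 - r))
    rho_prime = np.arctanh(rho0)
    se = 1 / np.sqrt(n - 3)
    z = (r_prime - rho_prime) / se
    p = 2 * stats.norm.sf(abs(z))     # two-tailed p-value
    z_cv = stats.norm.ppf(1 - alpha / 2)
    lo, hi = r_prime - z_cv * se, r_prime + z_cv * se
    # Back-transform the CI limits to the r scale
    return z, p, np.tanh(lo), np.tanh(hi)

print(fisher_test(r=0.45, n=50, rho0=0.30))
```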
Test of the difference between two rs
While those new values create a statistic whose sampling distribution is approximately normal, why do we have to do it?
The reason for this transformation is that, since r has limits of ±1, the larger the absolute value of r, the more skewed its sampling distribution becomes about the population value ρ (rho)
Sampling distribution of a correlation
Via the bootstrap, we can see for ourselves that the sampling distribution becomes more and more skewed as we deviate from a null value of zero
The better approach
Nowadays, we can bootstrap the r or difference
between two rs and do hypothesis tests without
unnecessary (and most likely problematic)
transformations and assumptions
Even for small samples of about 30 it performs as
well as the transformation in ideal situations
(Efron, 1988)
Furthermore, it can be applied to other correlation metrics (a bootstrap sketch follows below)
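A minimal bootstrap sketch in numpy (my own illustration; the data are hypothetical): resample (x, y) pairs with replacement and take a percentile interval for r:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=40)                    # hypothetical sample
y = 0.4 * x + rng.normal(size=40)

def boot_r_ci(x, y, n_boot=10_000, alpha=0.05):
    n = len(x)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample pairs with replacement
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    # Percentile confidence interval for r
    return np.quantile(rs, [alpha / 2, 1 - alpha / 2])

print(boot_r_ci(x, y))
```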
Correlation
Typically though, for a single sample, correlations among the variables should be considered descriptive statistics, and often the correlation matrix is the data set that forms the basis of an analysis
A correlation can also be thought of as an effect size in and of itself
Standardized measure of the amount of covariation
The strength and degree of a linear relationship between variables
The amount one variable moves, in standard deviation units, with a 1 standard deviation change in another variable
R² is also an effect size
Amount of variability seen in y that can be explained by the variability seen in x
The amount of variance they share
Biased estimate: adjusted r
r turns out to be upwardly biased, and the smaller the sample size, the greater the bias
With large samples the difference will be negligible
With smaller samples one should report adjusted r or R² (a sketch follows below):

$$r_{adj} = \sqrt{1 - \frac{(1 - r^2)(N - 1)}{N - 2}}$$
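A one-line implementation of the adjustment (my sketch, reading the slide's formula as Howell's adjusted correlation):

```python
import numpy as np

def r_adjusted(r, n):
    # Shrinks r toward zero; the shrinkage matters most for small N
    return np.sqrt(max(0.0, 1 - (1 - r**2) * (n - 1) / (n - 2)))

print(r_adjusted(0.40, 15), r_adjusted(0.40, 500))   # much more shrinkage at N = 15
```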
Factors affecting correlation
Linearity
Heterogeneous subsamples
Range restrictions
Outliers
Linearity
Nonlinear relationships will have an adverse
effect on a measure designed to find a linear
relationship
Heterogeneous subsamples
Sub-samples may artificially increase or decrease the overall r, or, in a corollary to Simpson's paradox, produce an opposite-sign relation for the aggregated data compared to the groups
Solution: calculate r separately for each sub-sample and overall, and look for differences (see the sketch below)
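A pandas sketch of that check (my own illustration; the groups and data are hypothetical and engineered so the sign flips on aggregation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical data: within each group x and y are negatively related,
# but the group means line up so the aggregate correlation comes out positive
frames = []
for shift in (0, 5, 10):
    x = rng.normal(shift, 1, size=50)
    y = -0.5 * x + shift + rng.normal(0, 1, size=50)
    frames.append(pd.DataFrame({'group': shift, 'x': x, 'y': y}))
df = pd.concat(frames)

print('overall r:', df['x'].corr(df['y']))
for name, g in df.groupby('group'):      # r within each sub-sample
    print('group', name, g['x'].corr(g['y']))
```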
Heterogeneous subsamples
Range restriction
Limiting the variability of your data can in turn
limit the possibility for covariability between two
variables, thus attenuating r.
A common example occurs with Likert scales
E.g. 1 - 4 vs. 1 - 9
However, it is also the case that restricting the range can actually increase r if, by doing so, highly influential data points would be kept out (Wilcox, 2001)
(A simulation of the attenuation follows below.)
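A quick numpy simulation (my own illustration): truncating x shrinks its variance and, with it, the observed r:

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=5000)
y = 0.6 * x + rng.normal(0, 0.8, size=5000)   # population r around .6

full_r = np.corrcoef(x, y)[0, 1]

mask = np.abs(x) < 0.5                        # keep only a narrow band of x
restricted_r = np.corrcoef(x[mask], y[mask])[0, 1]

print(full_r, restricted_r)                   # restricted r is noticeably smaller
```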
Effect of Outliers
Outliers can artificially increase or decrease r

Options
Compute r with and without the outliers
Compute a robustified correlation
For example, recode outliers as having more conservative scores (winsorize; see the sketch below)
Transform variables (last resort)
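A scipy sketch of the winsorizing option (my own illustration; the 5% limits are an arbitrary choice):

```python
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(5)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(0, 1, size=30)
x[0], y[0] = 6.0, -6.0                 # plant one extreme, discordant outlier

r_raw = np.corrcoef(x, y)[0, 1]

# Recode the most extreme 5% in each tail to the nearest retained value
xw = np.asarray(winsorize(x, limits=(0.05, 0.05)))
yw = np.asarray(winsorize(y, limits=(0.05, 0.05)))
r_wins = np.corrcoef(xw, yw)[0, 1]

print(r_raw, r_wins)                   # the single outlier drags r_raw down
```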
Advantages of correlational studies
Show the amount (strength) of relationship
present
Can be used to make predictions about the
variables studied
Often easier to collect correlational data, and
interpretation is fairly straightforward.
Disadvantages of correlational studies
Can't assume that a cause-effect relationship exists
Little or no control (experimental manipulation) of the variables is usually seen
Relationships may be accidental or due to a third, unmeasured variable
Common causes
Spurious correlations and mediators
