Professional Documents
Culture Documents
Correlation
Page
3 3
2 2
1 1
0 0
-1 -1
-2 -2
-3 -3
X1
X1
-4 -4
-4 -3 -2 -1 0 1 2 3 -4 -3 -2 -1 0 1 2 3 4
X2 X2
3 3
2 2
1 1
0 0
-1 -1
-2 -2
-3 -3
X1
X1
-4 -4
-4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4
X3 X3
3 3
2 2
1 1
0 0
-1 -1
-2 -2
-3 -3
X4 -4 X4 -4
-4 -3 -2 -1 0 1 2 3 4 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0
X5 X5
3 3
2 2
1 1
0 0
-1 -1
-2 -2
-3 -3
X4 -4
X4
-4
-4 -3 -2 -1 0 1 2 3 4 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0
X6 X6
N = 1000 n = 20
4 4
3 3
2 2
1 1
0 0
-1 -1
-2 -2
-3 -3
X7 -4
X7
-4
-4 -3 -2 -1 0 1 2 3 4 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0
X8 X8
3 3
2 2
1 1
0 0
-1 -1
-2 -2
-3 -3
X7
X7
-4 -4
-4 -3 -2 -1 0 1 2 3 4 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0
X9 X9
• A match occurs when both variables are on the same side of the mean.
• A mismatch occurs when the score on one variable is above the mean
and the score on the other variable is below the mean (or vice versa).
Cov( X , Y )
rXY =
Var ( X ) Var (Y ))
SS XY
N −1 SS XY
rXY = =
SS XX SS YY SS XX SS YY
N −1 N −1
• Effect size
o rXY is a measure of effect size. It measures the strength of the linear
association between two variables.
-1
Y
-2 -1 0 1 2
X1
-1
-2
-3
-4
Y
-3 -2 -1 0 1 2
r = -.35 r = .27
2 4
1 3
0 2
-1 1
-2 0
-3 -1
-4 -2
Y
Y
-3 -2 -1 0 1 2 3 4 5 -3 -2 -1 0 1 2 3 4 5
X X
estimate
t~
standard error of the estimate
Without going into all the details, we can compute the standard
error of rXY with the following formula:
1 − rXY
2
N −2
rXY
t ( N − 2) ~
1 − rXY
2
N −2
• Look-up p-value.
Notice that the t-test for the correlation only depends on the size of the
correlation and the sample size. Thus, we can work backwards and
determine critical r-values for significance at α = .05 . However, it
is usually preferable to report exact p-value.
• SPSS outputs a correlation matrix. You only need to examine the top
half or the bottom half.
Correlations
X1 X2 X3
X1 Pearson Correlation 1 .954 .903
Sig. (2-tailed) . .000 .000
N 50 50 50
X2 Pearson Correlation .954 1 .958
Sig. (2-tailed) .000 . .000
N 50 50 50
X3 Pearson Correlation .903 .958 1
Sig. (2-tailed) .000 .000 .
N 50 50 50
For the correlation between X1 and X2: r (48) = .95, p < .001
For the correlation between X1 and X3: r (48) = .90, p < .001
For the correlation between X2 and X3: r (48) = .96, p < .001
A A
A
A A A A
A A A
1 A A
A 1 A
A A
AA
AA
A AA
A A
A A A A
A AA A A A
A
A AA AA
A
AA AA
AA AA
0 A A
AA
0
A A AAA A
A A
A A
A A
A
x2
x3
A
A A
AA
-1 A A A A
AA
A A A -1
A A
A A
A A A
A A
AA A
-2
-2 A
A
A
AA
-3
-3
A A
-3 -2 -1 0 1 -3 -2 -1 0 1
x1 x1
Z f − Z null
Z=
1
N −3
If Z obs > 1.96 then robs differs from rnull with a two-tailed α = .05
Or look up exact p-value on z-table
H0 : ρ12 = .80
H 0 : ρ12 ≠ .80
1 1 + .954 1 1 + .8
Zf = ln = 1.8745 Z null = ln = 1.0986
2 1 − .954 2 1 − .8
H 0 : ρ1 = ρ 2 H 1 : ρ1 ≠ ρ 2
H 0 : ρ1 − ρ 2 = 0 H 1 : ρ1 − ρ 2 > 0
Or more generally
H 0 : ρ1 − ρ 2 = k H 1 : ρ1 − ρ 2 > k
If Z obs > 1.96 then reject the two-tailed null hypothesis at α = .05
H0 : ρ1 = ρ 2
H 0 : ρ1 − ρ 2 = 0
1 1 + .2 1 1 + .8
Z f1 = ln = 0.2027 Zf2 = ln = 1.0986
2 1 − .2 2 1 − .8
3. Correlational assumptions
• The assumptions of this of the test for a correlation are that both X and Y are
normally distributed (Actually X and Y must jointly follow a bivariate
normal distribution).
For the correlation between X1 and X2: ρ (48) = .94, p < .001
For the correlation between X1 and X3: ρ (48) = .89, p < .001
For the correlation between X1 and X3: ρ (48) = .95, p < .001
X1 X2 X3
Spearman's rho X1 Correlation Coefficient 1.000 .935 .885
Sig. (2-tailed) . .000 .000
N 50 50 50
X2 Correlation Coefficient .935 1.000 .954
Sig. (2-tailed) .000 . .000
N 50 50 50
X3 Correlation Coefficient .885 .954 1.000
Sig. (2-tailed) .000 .000 .
N 50 50 50
• If both of X and Y are dichotomous, then you can conduct a compute phi, φ,
as a measure of the strength of association between dichotomous variables
(see p. 2-65).
χ2
φ=
N
o An example:
Candidate Preference
Candidate U Candidate V
Homeowners 19 54
Non-Homeowners 60 52
CROSSTABS
/TABLES = prefer by homeown
/STAT = CHISQ.
Chi-Square Tests
Asymp. Sig.
Value df (2-sided)
Pearson Chi-Square 13.704 1 .000
N of Valid Cases 185 χ2 13.704
φ= = = .272
N 185
• We would like to look for relationships between each of the variables and
the average length of stay in hospitals.
• For a correlational analysis to be valid, we need:
o The relationships between all the variables to be linear.
o Each variable to be normally (or symmetrically distributed).
o No outlying observations influencing the analysis.
• Let’s start by checking our key variable, average length of stay, for
normality and outliers:
EXAMINE VARIABLES=length
/PLOT BOXPLOT HISTOGRAM NPPLOT.
Descriptives
Histogram
20
22
20
47
18 112
16
10
14 104
12
10
8
0
6
4
N= 113
Histogram
30 10
8 54
13
53
20
10
4
2
0 107
40
93
1.50 2.50 3.50 4.50 5.50 6.50 7.50
2.00 3.00 4.00 5.00 6.00 7.00 8.00 0
N= 113
• The next step is to check for non-linearity in the relationship between length
of stay and infection risk
o The relationship looks linear, but those two outliers are menacing!
INFRISK
LENGTH Pearson Correlation .533
Sig. (2-tailed) .000
N 113
• But we want to make sure that those outliers are not influencing our
conclusions. We have two methods to examine their influence:
o Conduct a sensitivity analysis.
o Conduct a rank regression (Spearman’s rho).
INFRISK
LENGTH Pearson Correlation .549
Sig. (2-tailed) .000
N 111
• There are two ways to compute Spearman’s Rho. First, we can ask for
it directly in SPSS.
NONPAR CORR /VARIABLES=length WITH infrisk age
/PRINT=SPEARMAN .
Correlations
INFRISK
Spearman's rho LENGTH Correlation Coefficient .549
Sig. (2-tailed) .000
N 113
• The advantage of this method is that we can look at the ranked data
GRAPH /SCATTERPLOT(BIVAR)=rinfrisk WITH rlength.
RANK of
INFRISK
RANK of LENGTH Pearson Correlation .549
Sig. (2-tailed) .000
N 113
• In this case we find nearly identical results for the two methods (Pearson vs.
Spearman) of computing the correlation:
ρ(111) = .549, p < .001
r(111) = .533, p < .001
Descriptives
60
20
50
10
21
40
75
0
38.0 42.0 46.0 50.0 54.0 58.0 62.0 66.0
40.0 44.0 48.0 52.0 56.0 60.0 64.0 30
N= 113
o Let’s look at the correlation between length of stay and average age of
patients.
CORRELATIONS /VARIABLES=length WITH age.
Correlations
AGE
LENGTH Pearson Correlation .189
Sig. (2-tailed) .045
N 113
• A sensitivity analysis:
TEMPORARY.
SELECT IF length < 15.
CORRELATIONS /VARIABLES=length WITH age.
Correlations
AGE
LENGTH Pearson Correlation .122
Sig. (2-tailed) .201
N 111
• This time, we find a very different result (in terms of both magnitude
and significance) when we omit the outliers.
• Rank-based correlation:
NONPAR CORR /VARIABLES=length WITH age
/PRINT=SPEARMAN .
Correlations
AGE
Spearman's rho LENGTH Correlation Coefficient .113
Sig. (2-tailed) .232
N 113
• Again, we find a very different result (in terms of both magnitude and
significance) when we conduct Spearman’s Rho
o In this case, the two outliers exert a large influence on the conclusions we
draw. We should be very cautious about reporting and interpreting the
Pearson correlation on the complete data.
Histogram
30 1000
112
20
46
800
11
78
20 8
600
400
10
200
0 0
RANK VARIABLES=beds.
GRAPH /SCATTERPLOT(BIVAR)=rbeds WITH rlength.
RANK of
BEDS
RANK of LENGTH Pearson Correlation .503
Sig. (2-tailed) .000
N 113
TEMPORARY. TEMPORARY.
SELECT IF medsch=1. SELECT IF medsch=2.
CORRELATIONS CORRELATIONS
/VARIABLES=length WITH infrisk. /VARIABLES=length WITH infrisk.
Correlations Correlations
INFRISK INFRISK
LENGTH Pearson Correlation .463 LENGTH Pearson Correlation .510
Sig. (2-tailed) .061 Sig. (2-tailed) .000
N 17 N 96
H0 : ρ No = ρ Yes
H1 : ρ No ≠ ρYes
1 1+ r 1 1+ .463 1 1+ r 1 1+ .510
ZYes = ln = ln = .501 Z No = ln = ln = .563
2 1− r 2 1− .463 2 1− r 2 1− .510
CORRELATIONS
/VARIABLES=group WITH memory.
Correlations
GROUP
MEMORY Pearson Correlation -.157
Sig. (2-tailed) .462
N 24
o What is reliability?
• The correlation between two equivalent measures of the same variable
• The extent to which all items on a scale assess the same construct
(internal consistency)
X = X t + error
• Image you find a correlation of rXY = .44 , but each variable is measured
with error: α X = .70 and α Y = .80 . What correlation should you have
observed between X and Y?
rXY .44
rX t Yt = = = .78
α X * α Y .70 * .80
• Image you know the true correlation between X and Y is ρ XY = .45 , but
each variable is measured with error: α X = .73 and α Y = .66 . What
correlation are you likely to observe between X and Y?
rXY rXY
rX t Yt = .45 = rXY = .22
α X * αY .73* .66
• Restriction of range
o When the range of either X or Y is restricted by the sampling procedure,
the correlation between X and Y may be underestimated.
A A
4 0.0 0 4 0.0 0
A
A
A A
3 0.0 0 3 0.0 0
A A
pub
pub
A A
A A
2 0.0 0 2 0.0 0
A
A
A A
1 0.0 0 A 1 0.0 0 A
A
A A
A
A
A A
tim e tim e
Correlations Correlations
pub pub
time Pearson Correlation .635 time Pearson Correlation .345
Sig. (2-tailed) .011 Sig. (2-tailed) .328
N 15 N 10