
Regression and Correlation Analysis


Violeta Bartolome
Senior Associate Scientist
PBGB-CRIL
v.bartolome@cgiar.org
Correlation Analysis
• A measure of association between
two numerical variables.
• Example (positive correlation)
o As soil fertility increases, rice grain yield
also increases

IRRI-PBGB-CRIL 2
Example

For seven randomly selected plots, nitrogen content in the soil and the grain yield were recorded.

Nitrogen Content (%)    Grain Yield (kg/ha)
0.12                    1652
0.14                    2056
0.15                    2598
0.16                    2734
0.19                    3238
0.22                    4824
0.23                    4858

How would you describe the graph?

[Figure: scatter plot, "Grain Yield of Rice at Different Levels of Soil Nitrogen Content"; x-axis: Nitrogen Content (%), y-axis: Grain Yield (kg/ha)]

How “strong” is the linear relationship?

Measuring the Relationship

Pearson’s Sample Correlation Coefficient, r, measures the direction and the strength of the linear association between two paired numerical variables.

Direction of Association

[Figure: two scatter plots, "Positive Correlation" and "Negative Correlation"]

Strength of Linear Association

r value    Interpretation
 1         perfect positive linear relationship
 0         no linear relationship
-1         perfect negative linear relationship

Strength of Linear Association

[Figure: two scatter plots, "Perfect Linear Positive Correlation" and "No Linear Correlation"]

Other Strengths of Association

r value    Interpretation
0.9        strong association
0.5        moderate association
0.25       weak association

Other Strengths of Association

[Figure: two scatter plots, "Strong Positive Linear Correlation" and "Moderate Negative Linear Correlation"]

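Formula

\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
\]

where
\sum = the sum
n = number of paired items
x_i = input variable                 y_i = output variable
\bar{x} (x-bar) = mean of x’s        \bar{y} (y-bar) = mean of y’s
s_x = standard deviation of x’s      s_y = standard deviation of y’s
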
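As a minimal sketch of the formula, the snippet below computes r for the seven-plot nitrogen and grain yield data from the Example slide. The function names `sample_sd` and `pearson_r` are illustrative choices, not part of the original slides, and the standard deviations are assumed to use the n - 1 divisor.

```python
import math

# Data from the seven-plot example: soil nitrogen content (%) and grain yield (kg/ha).
nitrogen = [0.12, 0.14, 0.15, 0.16, 0.19, 0.22, 0.23]
grain_yield = [1652, 2056, 2598, 2734, 3238, 4824, 4858]


def sample_sd(values):
    """Sample standard deviation (divisor n - 1, an assumption of this sketch)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(x, y):
    """Pearson's sample correlation coefficient for paired lists x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * sample_sd(x) * sample_sd(y))


r = pearson_r(nitrogen, grain_yield)
print(f"r = {r:.3f}")
```

The result is close to 1, which indicates a strong positive linear association between nitrogen content and grain yield in these plots.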
Correlation Coefficient (r)

r = 0 does not necessarily mean no relationship. The relationship may be nonlinear.

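A small illustration of this point, using made-up values that are not from the slides: y is an exact function of x (y = x squared), yet r comes out as zero because the relationship is symmetric rather than linear.

```python
import math

# Hypothetical illustrative data (not from the slides): y depends exactly on x.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products of deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(x, y)
print(f"r = {r:.3f}")  # r is zero although y is fully determined by x
```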
Correlation Coefficient

Correlation Coefficient (r)

A significant r does not necessarily mean a strong linear relationship.

Correlation Coefficient

[Figure: scatter plot of Yield/plot against Tiller/plant; r = .25**, n = 234]

When the number of observations is large, a low r value may still be significant.

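One way to see why, sketched under an assumption: the slides do not state the test statistic, so the snippet uses the standard t form of the test of r, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. For r = .25 and n = 234, t is far above the two-sided 1% critical value (roughly 2.6 for this many degrees of freedom), even though r squared shows that only about 6% of the variation is shared.

```python
import math


def t_statistic(r, n):
    """t statistic for testing whether the population correlation is zero (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Values read from the figure: r = .25 with n = 234 observations.
r, n = 0.25, 234
t = t_statistic(r, n)

print(f"t = {t:.2f} on {n - 2} df")
print(f"r squared = {r ** 2:.4f}")
```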
Correlation Coefficient (r)
To be able to conclude that two variables have a strong linear relationship, r should be both high and significant.

Correlation Coefficient

[Figure: scatter plot of Yield (t/ha) against No. of spikelet/panicle; r = .90**, n = 60]

Test of significance for r

r is significant if its absolute value is greater than the tabular value. For a correlation computed from n pairs, the degrees of freedom are n - 2.

Degrees of     Probability, p
Freedom        0.05     0.01     0.001
 1             0.997    1.000    1.000
 2             0.950    0.990    0.999
 3             0.878    0.959    0.991
 4             0.811    0.917    0.974
 5             0.755    0.875    0.951
 6             0.707    0.834    0.925
 7             0.666    0.798    0.898
 8             0.632    0.765    0.872
 9             0.602    0.735    0.847
10             0.576    0.708    0.823
11             0.553    0.684    0.801
12             0.532    0.661    0.780
13             0.514    0.641    0.760
14             0.497    0.623    0.742
15             0.482    0.606    0.725
16             0.468    0.590    0.708
17             0.456    0.575    0.693
18             0.444    0.561    0.679
19             0.433    0.549    0.665
20             0.423    0.537    0.652

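The table lookup can be sketched as follows. Only the rows for 1 to 10 degrees of freedom are copied in, the function name `smallest_significant_level` is hypothetical, and the degrees of freedom are assumed to be n - 2, the usual convention for a correlation from n pairs. The example uses the nitrogen data (n = 7, reported r = 0.99).

```python
# Critical values of |r| for df = 1..10, copied from the table above.
# Columns correspond to p = 0.05, 0.01 and 0.001.
LEVELS = (0.05, 0.01, 0.001)
CRITICAL_R = {
    1: (0.997, 1.000, 1.000),
    2: (0.950, 0.990, 0.999),
    3: (0.878, 0.959, 0.991),
    4: (0.811, 0.917, 0.974),
    5: (0.755, 0.875, 0.951),
    6: (0.707, 0.834, 0.925),
    7: (0.666, 0.798, 0.898),
    8: (0.632, 0.765, 0.872),
    9: (0.602, 0.735, 0.847),
    10: (0.576, 0.708, 0.823),
}


def smallest_significant_level(r, n):
    """Return the smallest tabulated p at which |r| exceeds the critical value.

    Returns None when r is not significant even at p = 0.05.
    Assumes df = n - 2 and that df is one of the rows copied above.
    """
    df = n - 2
    result = None
    for level, critical in zip(LEVELS, CRITICAL_R[df]):
        if abs(r) > critical:
            result = level
    return result


# Nitrogen example: seven plots, so df = 5.
level = smallest_significant_level(0.99, 7)
print(level)
```

With 5 degrees of freedom, 0.99 exceeds all three tabular values, so the correlation is significant at the 0.001 level.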
CORRELATION ANALYSIS

PEARSON CORRELATION ANALYSIS

                            Nitrogen.Content   Grain.Yield
Nitrogen.Content  Coef      1                  0.99
                  P-value   1                  1e-04
Grain.Yield       Coef      0.99               1
                  P-value   1e-04              1

Regression Analysis

Scientific Question
What is the growth rate of a rice plant?

Growth rate can be defined as the change in height per unit of time.

Data Collection

DAS (days after seeding)    Height (cm)
 0                            0
10                           12
30                           55
60                           80
90                          110

Statistical Questions
• What is the relationship between age and height?
Linear
• How do I describe or quantify the relationship?
Regression
• Is the association significant?
Statistical Test

[Figure: scatter plot of Plant Height (cm) against Days after Seeding]

Linear Regression

• A general method for estimating or describing the association between a continuous outcome (dependent) variable and one or more predictors in one equation.
o One predictor: Simple linear regression
o Multiple predictors: Multiple linear regression

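Statistical Model

Data = Model Fit + Residual

\[ Y_i = \hat{Y}_i + \varepsilon_i \]

\[ \hat{Y}_i = \beta_0 + \beta_1 X_i \]

where \beta_0 is the intercept and \beta_1 is the slope.

[Figure: scatter plot of Y against X]

\[ Y_i = \mu + \alpha_i + \varepsilon_i \]
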
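Least Squares Estimates

\[ Y_i = \hat{Y}_i + \varepsilon_i, \qquad \hat{Y}_i = \beta_0 + \beta_1 X_i \]

To estimate the intercept and slope, minimize the residual sum of squares (RSS):

\[ RSS = \sum \varepsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2 \]

\[ \frac{\partial RSS}{\partial \beta_0} = \frac{\partial \sum (Y_i - \beta_0 - \beta_1 X_i)^2}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Longrightarrow\; \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]

\[ \frac{\partial RSS}{\partial \beta_1} = \frac{\partial \sum (Y_i - \bar{Y} + \beta_1 \bar{X} - \beta_1 X_i)^2}{\partial \beta_1} = -2 \sum (X_i - \bar{X})(Y_i - \bar{Y} + \beta_1 \bar{X} - \beta_1 X_i) = 0 \;\Longrightarrow\; \hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \]

We don’t have to do the estimation by hand. R/CropStat or other statistical packages can do the work for us.
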
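For readers who want to check the closed-form estimates, here is a minimal sketch applied to the growth data from the Data Collection slide. The function name `least_squares` is an illustrative choice; in practice a statistical package would be used, as the slide notes.

```python
# Growth data from the Data Collection slide.
das = [0, 10, 30, 60, 90]          # days after seeding
height = [0, 12, 55, 80, 110]      # plant height (cm)


def least_squares(x, y):
    """Return (intercept, slope) that minimize the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-products over sum of squares of x deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = least_squares(das, height)
print(f"intercept = {b0:.6f}, slope = {b1:.6f}")
```

The two values agree with the (Intercept) and DAS rows of the Parameter Estimates table.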
LINEAR REGRESSION ANALYSIS
Dependent Variable: Height

Analysis of Variance
SV          Df   Sum Square     Mean Square    F value     Pr (>F)
DAS          1   8201.389781    8201.389781    95.435198   0.002279
Residuals    3    257.810219      85.93674

Model Summary
R Squared        0.969523
Adj. R Squared   0.959364

Parameter Estimates
Parameter     Estimate   Std. Error   t value    Pr (> |t|)
(Intercept)   4.912409   6.311259     0.778356   0.493109
DAS           1.223358   0.125227     9.769094   0.002279

Example: Growth Rate Data
Parameter Estimates
Parameter     Estimate   Std. Error   t value    Pr (> |t|)
(Intercept)   4.912409   6.311259     0.778356   0.493109
DAS           1.223358   0.125227     9.769094   0.002279

[Figure: scatter plot of Plant Height (cm) against Days after Seeding with the fitted line Height = 4.9 + 1.223 DAS, r = 0.98]

Intercept: The height at age 0 is 4.9 cm.
Slope: The height increase per day after seeding is 1.223 cm.

Prediction

[Figure: scatter plot of Plant Height (cm) against Days after Seeding with the fitted line Height = 4.9 + 1.223 DAS, r = 0.98]

Given the regression line, it can be predicted that the height at 40 days after seeding will be 53.8 cm.

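The prediction is a direct substitution into the fitted line. The sketch below uses the coefficients from the Parameter Estimates table; the helper `predict_height` and its refusal to predict outside the observed range of 0 to 90 days are design choices of this example, not something the slide prescribes.

```python
# Coefficients taken from the Parameter Estimates table.
INTERCEPT = 4.912409
SLOPE = 1.223358

# Range of days after seeding actually observed in the data.
OBSERVED_DAS_RANGE = (0, 90)


def predict_height(das):
    """Predicted plant height (cm) at `das` days after seeding.

    Raises ValueError outside the observed range, because the fitted line
    gives no evidence about the relationship there.
    """
    low, high = OBSERVED_DAS_RANGE
    if not low <= das <= high:
        raise ValueError(f"{das} DAS is outside the observed range {low}-{high}")
    return INTERCEPT + SLOPE * das


print(f"{predict_height(40):.1f} cm")
```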
Example: Growth Rate Data
Analysis of Variance
SV          Df   Sum Square     Mean Square    F value     Pr (>F)
DAS          1   8201.389781    8201.389781    95.435198   0.002279
Residuals    3    257.810219      85.93674

Model Summary
R Squared        0.969523
Adj. R Squared   0.959364

Sums of Squares

\[ \underbrace{\sum (Y_i - \bar{Y})^2}_{SST} = \sum (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \underbrace{\sum (\hat{Y}_i - \bar{Y})^2}_{SSM} + \underbrace{\sum (Y_i - \hat{Y}_i)^2}_{SSE} \]

Degrees of freedom: n-1 (SST), 1 (SSM), n-2 (SSE)

\[ R^2 = \frac{SSM}{SST} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} \]

R² is the fraction of variation in Y explained by X.

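The decomposition can be verified numerically on the growth data. This is a sketch with illustrative variable names; it refits the line with the closed-form least squares estimates and then forms the three sums of squares, R², and the F value (SSM on 1 degree of freedom over SSE on n - 2).

```python
# Growth data from the Data Collection slide.
das = [0, 10, 30, 60, 90]
height = [0, 12, 55, 80, 110]

n = len(das)
x_bar = sum(das) / n
y_bar = sum(height) / n

# Closed-form least squares estimates of slope and intercept.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(das, height))
         / sum((x - x_bar) ** 2 for x in das))
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in das]

sst = sum((y - y_bar) ** 2 for y in height)               # total
ssm = sum((f - y_bar) ** 2 for f in fitted)               # explained by the model
sse = sum((y - f) ** 2 for y, f in zip(height, fitted))   # residual

r_squared = ssm / sst
f_value = (ssm / 1) / (sse / (n - 2))

print(f"SST = {sst:.6f}, SSM = {ssm:.6f}, SSE = {sse:.6f}")
print(f"R squared = {r_squared:.6f}, F = {f_value:.4f}")
```

SSM and SSE correspond to the DAS and Residuals rows of the Analysis of Variance table, and they add up to SST.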
Linear Regression vs. ANOVA

               ANOVA          Linear regression
Dependent:     Continuous     Continuous
Independent:   Categorical    Continuous

Linear models
ANOVA and regression are the same thing!!!

Misuse of Regression and Correlation Analysis
• Performing regression and correlation on spurious data could give significant results. But this is not a valid indication of a linear relationship.

Misuse of Regression and Correlation Analysis
• Extrapolation of results
o The scope of the data is extended. Examples:
- The relationship between the yield of IR8 and stemborer incidence is extended to cover all rice varieties
- The relationship between grain yield and protein content from varietal trials is assumed to be applicable to other types of experiments, such as fertilizer trials
o The functional relationship is assumed to hold beyond the range of X values tested

Misuse of Regression and Correlation Analysis

[Figure: scatter plot of Grain Yield (kg/ha) against N-rate (kg/ha) with the fitted line y = 23.751x + 4307.2, r = 0.987**]

There is no evidence that a linear relationship still holds above N = 180 kg/ha.

Coefficient of Determination (R²)

• Percentage of the total variation that is explained by the linear function.

For example, with an R² value of 0.64, the implication is that 64% [(0.64)(100) = 64] of the variation in the variable Y can be explained by the linear function of the variable X.

Problems with R²
• R² tends to increase as additional variables are added to a regression equation, regardless of their true importance in determining the values of the dependent variable.
The adjusted R² (R_a²) compensates for this effect:

\[ R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - (p + 1)} \]

where n = no. of observations
p = no. of independent variables

• R² gives no information on the appropriateness of the model.

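As a quick check of the adjustment, the sketch below applies the formula to the growth rate example (R² = 0.969523, n = 5 observations, p = 1 independent variable). The function name `adjusted_r_squared` is an illustrative choice.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R squared for n observations and p independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - (p + 1))


# Growth rate example: R squared from the Model Summary, 5 observations, 1 predictor.
adj = adjusted_r_squared(0.969523, n=5, p=1)
print(f"Adjusted R squared = {adj:.6f}")
```

The result matches the Adj. R Squared line of the Model Summary for the growth data.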
Problems with R²

[Figure: two scatter plots, "Curvilinear data fitted by a straight line with high R²" and "Segregated data fitted by a straight line with high R²"]

For detecting these kinds of departures from the regression model, there is no substitute for plotting the data.

Thank you!
