
PROBABILITY & STATISTICAL INFERENCE
LECTURE 9
MSc in Computing (Data Analytics)
Lecture Outline
• ANOVA versus Regression
• Correlations
• Simple Linear Regression
• Multiple Regression
• Section Takeaways
ANOVA vs Simple Linear Regression
Factor       Response     Type of Analysis
Categorical  Continuous   T-test/ANOVA
Continuous   Continuous   Simple Linear Regression
Scatter Plot
• A scatter plot is a type of chart using Cartesian coordinates to display values for two continuous variables for a set of data
[Figure: scatter plot of Y against x]
Describing Linear Relationships
• Correlation: we can quantify the relationship between two variables with correlation statistics
• Two variables are correlated if there is a linear relationship between them
• We can further classify correlated variables according to the type of correlation:
  • Positive: One variable tends to increase in value as the other increases in value
  • Negative: One variable tends to decrease in value as the other increases in value
  • Zero: No linear relationship between the two variables (uncorrelated)
Pearson Correlation Coefficient
How to Calculate Correlation?
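• The correlation coefficient between two samples $x_1, x_2, x_3, \ldots, x_n$ and $y_1, y_2, y_3, \ldots, y_n$ is calculated with the following formula:
$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
where $\bar{x}$ and $\bar{y}$ are the sample means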
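As a minimal sketch, the Pearson coefficient can be computed directly from the deviations of each sample about its mean. The function name `pearson_r` and the toy data are illustrative assumptions, not part of the lecture material.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Sums of cross-products and squares of the deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Toy data, made up for illustration only
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # perfectly increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # perfectly decreasing line
print(r_up, r_down)
```

A perfectly increasing straight line gives a coefficient of +1 and a perfectly decreasing one gives -1; real data fall somewhere in between.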
Caution Using Correlation
Four sets of data with the same correlation coefficient of 0.816
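A related caution, sketched here with made-up data rather than the four datasets on the slide: the correlation coefficient only measures linear association, so a strong non-linear relationship can still produce a coefficient of zero. The helper `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length samples."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# y is completely determined by x, but the relationship is a parabola
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(r)  # zero: no linear relationship, despite a perfect quadratic one
```

This is why a scatter plot should always be inspected alongside the coefficient.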
Example: The Fast Mile Test
• You have been tasked by Team Ireland to analyse data from a study conducted to investigate how fast athletes' bodies can absorb and use up oxygen
• The results of this study will be used to help trainers devise custom regimes for their athletes
• A dataset has been gathered from 31 athletes, each of whom performed a fast-mile-test for which their maximum pulse rate, rest pulse rate, run pulse rate, run time and oxygen consumption were measured
Example: The Fast Mile Test
Oxygen Consumption  Gender  Age  Weight  Runtime  Rest Pulse  Run Pulse  Max Pulse
44.609              Male    44   89.47   11.37    62          178        182
45.313              Male    40   75.07   10.07    62          185        185
54.297              Female  44   85.84   8.65     45          156        168
59.571              Male    42   68.15   8.17     40          166        172
49.874              Female  38   89.02   9.22     55          178        180
44.811              Female  47   77.45   11.63    58          176        176
45.681              Male    40   75.98   11.95    70          176        180
49.091              Male    43   81.19   10.85    64          162        170
39.442              Female  44   81.42   13.08    63          174        176
Example: Runtime vs Oxygen Consumption
Demo
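A rough sketch of this check in Python, using only the nine athletes listed in the table (the full study has 31, so the value will differ from the one obtained in the demo):

```python
import math

# Runtime and oxygen consumption for the nine athletes shown in the table
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

n = len(runtime)
x_bar = sum(runtime) / n
y_bar = sum(oxygen) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(runtime, oxygen))
s_xx = sum((x - x_bar) ** 2 for x in runtime)
s_yy = sum((y - y_bar) ** 2 for y in oxygen)

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"Correlation between runtime and oxygen consumption: {r:.3f}")
```

The coefficient is strongly negative: athletes with longer run times tend to have lower oxygen consumption.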
Regression Model
• Can we capture the relationship between two variables in the scatter plot?
[Figure: scatter plot of Y against x]
Regression Model
• Based on the scatter plot, it is probably reasonable to assume that the random variable Y is related to x by a straight-line relationship
• We use the equation of a line to model the relationship
• The simple linear regression model is given by:
$Y = \beta_0 + \beta_1 x + \epsilon$
where the slope and intercept of the line are called regression coefficients and where $\epsilon$ is the random error term
Regression Model
[Figure: regression line of Y against x; the slope $\beta_1$ corresponds to a one unit change in x]
Simple Linear Regression
• The case of simple linear regression considers a single regressor (or predictor), x, and a dependent (or response) variable, Y
• The observation Y at each level of x is a random variable, with expected value:
$E(Y \mid x) = \beta_0 + \beta_1 x$
• We assume that each observation, Y, can be described by the model:
$Y = \beta_0 + \beta_1 x + \epsilon$
Simple Linear Regression
• Suppose that we have n pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
• Deviations of the data from the estimated regression model
[Figure: observed values (y) and the estimated regression line]
• The method of least squares is used to estimate the parameters, $\beta_0$ and $\beta_1$, by minimizing the sum of the squares of the vertical deviations between the observed values and the estimated regression line
[Figure: observed values (y) and the estimated regression line]
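For a single regressor the least squares estimates have a standard closed form: the slope is the sum of cross-products of the deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies this to the nine athletes listed in the Fast Mile Test table; `fit_line` is an illustrative name, and the estimates will differ from those for the full 31-athlete dataset.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: line passes through (x_bar, y_bar)
    return b0, b1

def sum_squared_deviations(b0, b1, xs, ys):
    """Sum of squared vertical deviations of the data from a given line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Nine athletes from the table (subset of the 31 in the study)
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

b0, b1 = fit_line(runtime, oxygen)
best = sum_squared_deviations(b0, b1, runtime, oxygen)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, sum of squares = {best:.3f}")

# Any other line gives a larger sum of squared deviations
nudged = sum_squared_deviations(b0 + 0.5, b1 - 0.1, runtime, oxygen)
```

The negative slope says that each extra minute of run time is associated with a drop in estimated oxygen consumption.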
Example: Oxygen Consumption vs Runtime for Team Ireland
• Can we capture the relationship between Oxygen Consumption and Runtime in the Team Ireland fitness study?
Example: Oxygen Consumption vs Runtime for Team Ireland - Regression Model
• Yes, using the regression model:
$Y = \beta_0 + \beta_1 x + \epsilon$
where Y is the Oxygen Consumption and x is the Runtime for an athlete
Model Assumptions
• Fitting a regression model requires several assumptions:
  • Errors are uncorrelated random variables with zero mean
  • Errors have constant variance
  • Errors are normally distributed
• The analyst should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model
Testing Assumptions - Residual Analysis
• The residuals from a regression model are:
$e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n$
where $y_i$ is an actual observation and $\hat{y}_i$ is the corresponding fitted value from the regression model
• Analysis of the residuals is frequently helpful in checking the assumption that the errors are approximately normally distributed with constant variance, and in determining whether additional terms in the model would be useful
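A minimal sketch of computing residuals, assuming a simple linear fit of oxygen consumption on runtime for the nine athletes listed in the Fast Mile Test table. In practice the residuals would then be plotted against the fitted values or the regressor.

```python
# Nine athletes from the table (subset of the 31 in the study)
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

# Least squares fit of the line y = b0 + b1 * x
n = len(runtime)
x_bar = sum(runtime) / n
y_bar = sum(oxygen) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(runtime, oxygen))
      / sum((x - x_bar) ** 2 for x in runtime))
b0 = y_bar - b1 * x_bar

# Residual = actual observation minus fitted value
fitted = [b0 + b1 * x for x in runtime]
residuals = [y - y_hat for y, y_hat in zip(oxygen, fitted)]

for x, e in zip(runtime, residuals):
    print(f"runtime = {x:5.2f}  residual = {e:7.3f}")
```

With an intercept in the model, least squares residuals always sum to zero, so it is their pattern (funnel, bow, curvature) rather than their average that carries the diagnostic information.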
Interpreting Residual Plots
[Figure: four residual plots of $e_i$ against a zero reference line, labelled Satisfactory, Funnel, Double Bow and Non-linear]
Example: Oxygen Consumption vs Runtime for Team Ireland - Residual Plot
• What do we think?
Adequacy of the Regression Model
• The quantity:
$R^2 = \frac{SS_M}{SS_T}$
where $SS_M$ is the model sum of squares and $SS_T$ is the total sum of squares, is called the coefficient of determination and is often used to judge the adequacy of a regression model ($0 \le R^2 \le 1$)
• We often refer (loosely) to $R^2$ as the amount of variability in the data explained or accounted for by the regression model
Example: Oxygen Consumption vs Runtime for Team Ireland - $R^2$
• For the oxygen consumption regression model:
$R^2 = SS_M / SS_T = 632.9 / 851.38 = 0.7434$
• Thus, the model accounts for 74.34% of the variability in the data
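The arithmetic can be checked in a couple of lines. The error sum of squares below is not quoted on the slide; it is derived here from the standard decomposition of the total sum of squares into model and error parts.

```python
ss_model = 632.9    # SS_M, from the slide
ss_total = 851.38   # SS_T, from the slide

r_squared = ss_model / ss_total
ss_error = ss_total - ss_model  # derived: variability left unexplained

print(f"R^2 = {r_squared:.4f}")  # R^2 = 0.7434
print(f"Unexplained share = {ss_error / ss_total:.4f}")
```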
Adjusted R-squared Value
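• The Adjusted R-squared Value is calculated as follows:
$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$
where n is the number of observations and p is the number of regressors in the model
• The figure is adjusted to take into consideration the number of factors in the model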
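As a hedged sketch, the commonly used form of the adjustment scales the unexplained share by (n - 1) / (n - p - 1), where n is the number of observations and p the number of regressors. Applying it to the oxygen consumption example assumes n = 31 athletes and p = 1 regressor (runtime); the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for n observations and p regressors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than model parameters")
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

r_squared = 632.9 / 851.38  # from the oxygen consumption example
adj = adjusted_r_squared(r_squared, n=31, p=1)
print(f"R^2 = {r_squared:.4f}, adjusted R^2 = {adj:.4f}")
```

The adjusted value is always a little lower than the unadjusted one, and the gap widens as more regressors are added without a matching gain in fit.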
Demo
Multiple Regression Models
• Many applications of regression analysis involve situations in which there is more than one regressor variable
• A regression model that contains more than one regressor variable is called a multiple regression model
• The multiple linear regression model is given by:
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$
where $x_1, \ldots, x_k$ are the regressor variables
Example: Oxygen Consumption vs Runtime for Team Ireland - Regression Model
• For example, suppose that we want to test the effect of both age and runtime on oxygen consumption in the Team Ireland example:
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$
where:
Y : Oxygen Consumption
$x_1$ : Runtime
$x_2$ : Age
Example: Oxygen Consumption vs Runtime for Team Ireland - Regression Model
• This is a 3D scatter plot of Oxygen Consumption versus Runtime and Age
Example: Oxygen Consumption vs Runtime for Team Ireland - Regression Model
[Figure: the regression plane for the model $E(Y) = 30 + 10x_1 + 7x_2$]
• The multi-variable regression model is:
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$
where Y is the Oxygen Consumption, $x_1$ is the Runtime and $x_2$ is Age for an athlete
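One way to fit such a model without a statistics package is to solve the normal equations of least squares directly. The sketch below does this for the nine athletes listed in the Fast Mile Test table; `solve`, `fit_ols` and `residuals_for` are illustrative helper names, and the estimates will differ from those for the full 31-athlete dataset.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(columns, ys):
    """Least squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(ys))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * y for row, y in zip(design, ys)) for a in range(k)]
    return solve(xtx, xty)

def residuals_for(columns, ys, coefs):
    """Observed minus fitted values for a fitted model."""
    return [y - (coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], columns)))
            for i, y in enumerate(ys)]

# Nine athletes from the table (subset of the 31 in the study)
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
age = [44, 40, 44, 42, 38, 47, 40, 43, 44]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

coefs = fit_ols([runtime, age], oxygen)
resid = residuals_for([runtime, age], oxygen, coefs)
sse_multi = sum(e * e for e in resid)

simple = fit_ols([runtime], oxygen)
sse_simple = sum(e * e for e in residuals_for([runtime], oxygen, simple))

print("b0, b1 (runtime), b2 (age):", [round(c, 3) for c in coefs])
print(f"error sum of squares: runtime only = {sse_simple:.3f}, "
      f"runtime and age = {sse_multi:.3f}")
```

Adding a regressor can never increase the error sum of squares, which is one reason the adjusted R-squared value is preferred when comparing models of different sizes.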
Demo
Regression & Variable Selection
• How do we select the best variables for use in a regression model?
• Perform a search to see which variables are the most effective
• Three search schemes:
  • Forward sequential selection
  • Backward sequential selection
  • Stepwise sequential selection
Sequential Selection - Forward
[Figure: input p-values compared against the entry cutoff]
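A rough sketch of forward selection, under a stated substitution: the slides compare each input's p-value with an entry cutoff, whereas this example compares the partial F statistic with a fixed "F-to-enter" threshold, because computing p-values needs a statistics library. The threshold of 4.0, the helper names and the choice of candidate inputs are all assumptions made for illustration; the data are the nine athletes listed in the Fast Mile Test table.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def sse(columns, ys):
    """Error sum of squares of the least squares fit of ys on the columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(ys))]
    k = len(design[0])
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(design, ys)) for a in range(k)]
    coefs = solve(xtx, xty)
    return sum((y - sum(c * v for c, v in zip(coefs, row))) ** 2
               for row, y in zip(design, ys))

def forward_select(candidates, ys, f_to_enter=4.0):
    """Add one input at a time while the best partial F exceeds the threshold."""
    n = len(ys)
    y_bar = sum(ys) / n
    current_sse = sum((y - y_bar) ** 2 for y in ys)  # intercept-only model
    selected = []
    while len(selected) < len(candidates):
        best = None
        for name in candidates:
            if name in selected:
                continue
            trial = selected + [name]
            df_error = n - len(trial) - 1
            if df_error <= 0:
                continue
            trial_sse = sse([candidates[v] for v in trial], ys)
            f_stat = (current_sse - trial_sse) / (trial_sse / df_error)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, trial_sse)
        if best is None or best[1] < f_to_enter:
            break  # no remaining input clears the entry threshold
        selected.append(best[0])
        current_sse = best[2]
    return selected

# Nine athletes from the table (subset of the 31 in the study)
candidates = {
    "Runtime": [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08],
    "Age": [44, 40, 44, 42, 38, 47, 40, 43, 44],
    "Weight": [89.47, 75.07, 85.84, 68.15, 89.02, 77.45, 75.98, 81.19, 81.42],
}
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

selected = forward_select(candidates, oxygen)
print("Inputs in order of entry:", selected)
```

Backward selection runs the same idea in reverse, starting from the full model and removing the weakest input while it fails the stay cutoff; stepwise selection alternates entry and removal checks.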
Sequential Selection - Backward
[Figure: input p-values compared against the stay cutoff]
Sequential Selection - Stepwise
[Figure: input p-values compared against the entry cutoff and the stay cutoff]
Demo
Multi-Collinearity
• Multi-Collinearity exists when two or more independent variables used in a regression are correlated

Demo
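A simple first check for multi-collinearity is the correlation between pairs of candidate regressors. The sketch below uses the Run Pulse and Max Pulse columns for the nine athletes listed in the Fast Mile Test table; choosing this pair is an assumption made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length samples."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Nine athletes from the table (subset of the 31 in the study)
run_pulse = [178, 185, 156, 166, 178, 176, 176, 162, 174]
max_pulse = [182, 185, 168, 172, 180, 176, 180, 170, 176]

r = pearson_r(run_pulse, max_pulse)
print(f"Correlation between Run Pulse and Max Pulse: {r:.3f}")
```

Two regressors this strongly correlated carry much the same information, so including both tends to make the individual coefficient estimates unstable.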
Regression Bits and Pieces
• Polynomial Regression
• Logistic Regression
• Categorical Factors in Regression
Polynomial Regression
• Polynomial regression models are widely used when the response is curvilinear
• The general principles of multiple regression will apply
• The second degree polynomial in one variable is:
$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$
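Because the model is still linear in its coefficients, a second degree polynomial can be fitted as a multiple regression with x and x squared as the two regressors. A minimal sketch with noise-free, made-up data follows; the helper names `solve` and `fit_ols` are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(columns, ys):
    """Least squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(ys))]
    k = len(design[0])
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(design, ys)) for a in range(k)]
    return solve(xtx, xty)

# Made-up data generated exactly from y = 1 + 2x + 0.5x^2 (no noise)
x = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * v + 0.5 * v ** 2 for v in x]

# Treat x and x^2 as two separate regressors
x_squared = [v ** 2 for v in x]
b0, b1, b2 = fit_ols([x, x_squared], y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

With noise-free data the fit recovers the generating coefficients; with real data the same code returns the least squares estimates.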
Logistic Regression
• Logistic regression is used if the response variable (target) is a discrete binary variable
• Logistic regression uses the idea of odds in its calculations
• The odds of an event occurring are:
$\text{odds} = \frac{p}{1 - p}$
where p is the probability of the event occurring
Logistic Regression
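• The equation for a logistic regression model is:
$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$
• Choose intercept and parameter estimates to maximize:
$\sum_{i : y_i = 1} \log(p_i) + \sum_{i : y_i = 0} \log(1 - p_i)$
where the first sum runs over the observations in which the event occurred and the second over those in which it did not
• This function is known as the log-likelihood function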
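A small sketch of these quantities for a model with a single regressor. The data and the candidate coefficient values are made up for illustration, and no optimizer is included: the example only evaluates the log-likelihood at two candidate settings to show which one the fitting procedure would prefer.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def predicted_probability(b0, b1, x):
    """Invert the logit: probability implied by log-odds b0 + b1 * x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log(p_i) over events plus log(1 - p_i) over non-events."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(b0, b1, x)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

# Made-up binary outcomes that become more likely as x grows
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

ll_null = log_likelihood(0.0, 0.0, xs, ys)   # every p_i = 0.5
ll_alt = log_likelihood(-3.5, 1.0, xs, ys)   # probability rising with x

print(f"odds at p = 0.8: {odds(0.8):.1f}")
print(f"log-likelihood, flat model:   {ll_null:.4f}")
print(f"log-likelihood, rising model: {ll_alt:.4f}")
```

The rising model has the higher (less negative) log-likelihood, so it is the better of the two candidates; a fitting routine searches for the coefficients that make this value as large as possible.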
Categorical Factors in Regression
• Many problems may involve categorical variables
• The usual method of accounting for the different levels of a qualitative variable is to use indicator variables
• For example, to introduce the variable gender into the model, we could define an indicator variable as follows:
$x = 0$ if the athlete is male, $x = 1$ if the athlete is female
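As a sketch, one possible coding is 1 for female athletes and 0 for male athletes; the direction of the coding is an arbitrary choice made here for illustration. Using the Gender and Oxygen Consumption columns for the nine athletes listed in the Fast Mile Test table, a regression on the indicator alone reproduces the two group means.

```python
def indicator(value, level="Female"):
    """1 if the observation is at the chosen level, otherwise 0."""
    return 1 if value == level else 0

# Nine athletes from the table (subset of the 31 in the study)
genders = ["Male", "Male", "Female", "Male", "Female",
           "Female", "Male", "Male", "Female"]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

x_gender = [indicator(g) for g in genders]
print(x_gender)  # [0, 0, 1, 0, 1, 1, 0, 0, 1]

# Simple linear regression of oxygen consumption on the indicator
n = len(oxygen)
x_bar = sum(x_gender) / n
y_bar = sum(oxygen) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(x_gender, oxygen))
      / sum((x - x_bar) ** 2 for x in x_gender))
b0 = y_bar - b1 * x_bar

print(f"b0 (mean for males) = {b0:.3f}")
print(f"b0 + b1 (mean for females) = {b0 + b1:.3f}")
```

The intercept is the mean response for the level coded 0, and the slope is the difference between the two group means, which is how an indicator variable lets a categorical factor enter a regression model.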
Section Takeaways
• Regression models allow us to model relationships between variables
• Regression models can be used to evaluate the relationships between variables and are also well suited to use as prediction models
