
Chaohui Guo & Friederike Meyer

PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 3: LINEAR MODELS FOR REGRESSION
Purposes of linear regression
Specifying the relationship between independent variables (input vector x) and dependent variables (target values t)
Assumption: targets are noisy realizations of an underlying functional relationship
Goal: model the predictive distribution p(t|x)
A general linear model (GLM) explains this relationship in terms of a linear combination of the IVs plus an error term:
t_n = y(x_n, w) + ε_n
GLM: Illustration of matrix form
y = Xβ + e,   e ~ N(0, σ²I)
y: N×1 data vector
X: N×p design matrix
β: p×1 parameter vector
e: N×1 error vector
N: number of scans
p: number of regressors
Parameter estimation
Model: y = Xβ + e
Objective: estimate β so as to minimize the sum of squared errors Σ_{t=1}^{N} e_t²
Ordinary least squares (OLS) estimate (assuming i.i.d. error):
β̂ = (X^T X)^{-1} X^T y
[Figure: the data vector y, its fit Xβ̂, and the residual e in the design space defined by X (spanned by the columns x_1 and x_2)]
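A minimal NumPy sketch of the OLS estimate above; the design matrix, true parameters, and noise level are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

N, p = 100, 3                                  # number of scans and regressors (illustrative)
X = np.column_stack([np.ones(N),               # intercept column
                     rng.normal(size=(N, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)   # y = X beta + e

# OLS estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```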
GLM: A geometric perspective
Fitted values: ŷ = Xβ̂ = Py, where P = X(X^T X)^{-1} X^T is the projection matrix
Residuals: e = Ry, where R = I − P is the residual-forming matrix
OLS estimate: β̂ = (X^T X)^{-1} X^T y

OLS estimates
The LS estimate of the data = the projection of the data vector onto the space spanned by the design matrix
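A short sketch, under the same kind of illustrative setup, checking this geometric picture numerically: the fitted values are the projection Py of the data onto the design space and the residuals Ry are orthogonal to it:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=N)

# Projection matrix P and residual-forming matrix R = I - P
P = X @ np.linalg.solve(X.T @ X, X.T)
R = np.eye(N) - P

y_hat = P @ y                      # fitted values: projection of y onto the column space of X
e = R @ y                          # residuals
print(np.allclose(X.T @ e, 0))     # residuals orthogonal to the design space -> True
print(np.allclose(y_hat + e, y))   # data decomposes into fit + residual -> True
```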
Linear Basis Function Models: Polynomial Curve Fitting
y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x),   φ_j(x) = basis functions
Types of nonlinear basis functions: polynomial, Gaussian, sigmoidal
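A small sketch of the three basis function families just listed; the centres, scale s, and the number of functions are arbitrary illustrative choices:

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0..M-1."""
    return np.vstack([x**j for j in range(M)]).T

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))

def sigmoidal_basis(x, centres, s):
    """phi_j(x) = logistic sigmoid of (x - mu_j) / s."""
    a = (x[:, None] - centres[None, :]) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1, 1, 5)
centres = np.linspace(-1, 1, 4)
print(polynomial_basis(x, 3).shape)            # (5, 3)
print(gaussian_basis(x, centres, 0.3).shape)   # (5, 4)
print(sigmoidal_basis(x, centres, 0.3).shape)  # (5, 4)
```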
Estimation of w: Classical (frequentist) techniques
Using an estimator to determine a specific value for the parameter vector w
e.g. the sum-of-squares error function (SSQ): E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}²
Minimizing this function with respect to w yields w*
Prediction: t = y(x, w*)
Reducing overfitting: Regularized Least Squares
Control of overfitting: regularization of the error function
E(w) = E_D(w) + λ E_W(w)   (data-dependent error + regularization term)
λ: regularization coefficient
Quadratic regularization term: (λ/2) w^T w
The total error is then minimized by the OLS solution plus an extension:
w = (λI + Φ^T Φ)^{-1} Φ^T t
Regularized Least Squares
Applying a more general regularizer:
E(w) = (1/2) Σ_n {t_n − w^T φ(x_n)}² + (λ/2) Σ_j |w_j|^q   (q = 2 recovers the quadratic case)
How do we choose an appropriate value for λ?
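For the quadratic case (q = 2) the solution has the closed form shown above; a minimal sketch with an illustrative sinusoidal toy data set and an illustrative λ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: noisy sinusoid, polynomial design matrix
x = rng.uniform(0, 1, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
M = 9
Phi = np.vstack([x**j for j in range(M)]).T

lam = 1e-3   # regularization coefficient (illustrative)

# Regularized least squares: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)
```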
Classical techniques: Maximum Likelihood and Least Squares
Assume the observations come from a deterministic function with added Gaussian noise:
t = y(x, w) + ε,   with p(ε|β) = N(ε|0, β^{-1})
which implies a Gaussian conditional distribution:
p(t|x, w, β) = N(t | y(x, w), β^{-1})
Given observed inputs X = {x_1, ..., x_N} (independently drawn) and targets t = (t_1, ..., t_N)^T, we obtain the likelihood function
p(t|X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})
where y(x, w) = w^T φ(x)
Classical techniques: Maximum Likelihood and Least Squares
Taking the logarithm, we get
ln p(t|w, β) = (N/2) ln β − (N/2) ln 2π − β E_D(w),   with E_D(w) = (1/2) Σ_n {t_n − w^T φ(x_n)}²
Computing the gradient with respect to w and setting it to zero yields
Σ_n {t_n − w^T φ(x_n)} φ(x_n)^T = 0
Solving for w, we get
w_ML = (Φ^T Φ)^{-1} Φ^T t   (the OLS estimate), where Φ is the design matrix with Φ_nj = φ_j(x_n)
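A sketch of the maximum likelihood solution on an illustrative sinusoidal toy data set; it also computes the ML estimate of the noise precision β as the inverse mean squared residual (all specific values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix Phi with Phi[n, j] = phi_j(x_n); polynomial basis here
M = 4
Phi = np.vstack([x**j for j in range(M)]).T

# w_ML = (Phi^T Phi)^{-1} Phi^T t  (identical to the OLS estimate)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the noise precision: 1/beta = mean squared residual
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)
print(w_ml, beta_ml)
```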
Conclusion: Frequentist approach & outlook on Bayesian methods
Frequentist approach:
Seeks a point estimate of the unknown parameter w by maximizing the likelihood
Risk of inappropriately complex models and overfitting (solution: regularization, but the number/type of basis functions is still important)
Bayesian approach:
Characterizes the uncertainty in w through a probability distribution p(w)
Averages over parameter values, weighted by the posterior distribution p(w|t)
Bayesian linear regression: Parameter distribution
β (noise precision): assumed to be a known constant
Likelihood function p(t|w) with Gaussian noise: the exponential of a quadratic function of w
Gaussian prior: p(w) = N(w|m_0, S_0)
Gaussian posterior: p(w|t) = [p(D|w) p(w)] / p(D)
p(w|t) = N(w|m_N, S_N)
m_N = S_N (S_0^{-1} m_0 + β Φ^T t)
S_N^{-1} = S_0^{-1} + β Φ^T Φ
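A sketch of this posterior update for a general Gaussian prior N(m_0, S_0); the data, basis functions, prior parameters, and β are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

M = 4
Phi = np.vstack([x**j for j in range(M)]).T
beta = 25.0                      # noise precision, assumed known

m0 = np.zeros(M)                 # prior mean (illustrative)
S0 = np.eye(M) / 2.0             # prior covariance (illustrative)

# Posterior: S_N^{-1} = S_0^{-1} + beta Phi^T Phi,  m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)
S0_inv = np.linalg.inv(S0)
SN_inv = S0_inv + beta * Phi.T @ Phi
SN = np.linalg.inv(SN_inv)
mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
print(mN)
```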
Bayesian Linear Regression: Common cases
A common choice for the prior is a zero-mean isotropic Gaussian:
p(w|α) = N(w|0, α^{-1} I)
for which
m_N = β S_N Φ^T t
S_N^{-1} = α I + β Φ^T Φ
Maximization of the posterior distribution = minimization of the SSQ with the addition of a quadratic regularization term (λ = α/β)
Bayesian Linear Regression: An Example of sequential learning
Linear model: y(x, w) = w_0 + w_1 x   (straight-line fitting)
Generation of synthetic data: f(x, a) = a_0 + a_1 x with a_0 = −0.3, a_1 = 0.5
Inputs drawn from U(x|−1, 1)
Gaussian noise with standard deviation 0.2, so β = (1/0.2)² = 25
Prior precision α = 2.0
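A minimal sketch of the sequential update for this straight-line example, processing one data point at a time so that each posterior becomes the prior for the next point (generating values follow the setup above; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

a0, a1 = -0.3, 0.5               # generating parameters
beta = 25.0                      # noise precision (std 0.2)
alpha = 2.0                      # prior precision

# Prior over w = (w0, w1): zero-mean isotropic Gaussian
mN = np.zeros(2)
SN = np.eye(2) / alpha

for _ in range(20):
    x = rng.uniform(-1, 1)
    t = a0 + a1 * x + rng.normal(scale=0.2)
    phi = np.array([1.0, x])                  # phi(x) = (1, x) for straight-line fitting

    # Current posterior acts as the prior for this new observation
    S_old_inv = np.linalg.inv(SN)
    SN_inv = S_old_inv + beta * np.outer(phi, phi)
    mN = np.linalg.solve(SN_inv, S_old_inv @ mN + beta * phi * t)
    SN = np.linalg.inv(SN_inv)

print(mN)   # should approach (a0, a1) as more points are seen
```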
Bayesian Linear Regression: An Example of sequential learning
[Figure, one row per observed data point: likelihood p(t|x, w); prior/posterior p(w|t), obtained by combining the prior with the likelihood; data space showing 6 lines y(x, w) drawn from the current posterior over w]
After each data point, the posterior becomes the prior for the next observation.
Posterior distribution
Prior: p(w|α) = N(w|0, α^{-1} I)
Likelihood function: p(t|w, β) = Π_n N(t_n | w^T φ(x_n), β^{-1})
Posterior: p(w|t) = N(w|m_N, S_N)
where m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ
Predictive Distribution (1)
Predict t for new values of x by marginalizing over w (sum rule and product rule):
p(t|t, α, β) = ∫ p(t|w, β) p(w|t, α, β) dw = N(t | m_N^T φ(x), σ_N²(x))
where σ_N²(x) = 1/β + φ(x)^T S_N φ(x)
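A sketch of this predictive distribution with the zero-mean isotropic prior and 9 Gaussian basis functions, as in the example that follows; the data set, basis centres, and scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

alpha, beta = 2.0, 25.0
centres = np.linspace(0, 1, 9)                 # 9 Gaussian basis functions
s = 0.1                                        # basis width (illustrative)

def phi(x):
    return np.exp(-(np.atleast_1d(x)[:, None] - centres[None, :])**2 / (2 * s**2))

Phi = phi(x)
SN = np.linalg.inv(alpha * np.eye(len(centres)) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x) at new inputs
x_new = np.linspace(0, 1, 5)
Phi_new = phi(x_new)
pred_mean = Phi_new @ mN
pred_var = 1.0 / beta + np.sum(Phi_new @ SN * Phi_new, axis=1)
print(pred_mean, pred_var)
```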
Predictive Distribution (2)
Example: sinusoidal data, 9 Gaussian basis functions
Predictive Distribution (3)
Example: sinusoidal data, 9 Gaussian basis functions, 25 data points
Bayesian Model Comparison (1)
How do we choose the right model from L candidate models?
Compare their posterior probabilities: p(M_i|D) ∝ p(M_i) p(D|M_i)   (posterior ∝ prior × model evidence)
Bayes factor: the ratio of the evidence for two models, p(D|M_i) / p(D|M_j)
Bayesian Model Comparison (2)
Simple situation: for a given model with a single parameter w, consider the approximation
p(D) = ∫ p(D|w) p(w) dw ≈ p(D|w_MAP) (Δw_posterior / Δw_prior)
where the posterior is assumed to be sharply peaked (width Δw_posterior) and the prior is flat (width Δw_prior).
Bayesian Model Comparison (3)
Taking logarithms, we obtain
ln p(D) ≈ ln p(D|w_MAP) + ln(Δw_posterior / Δw_prior)
With M parameters, all assumed to have the same ratio Δw_posterior / Δw_prior, we get
ln p(D) ≈ ln p(D|w_MAP) + M ln(Δw_posterior / Δw_prior)
The complexity penalty is negative and linear in M.
Bayesian Model Comparison (4)
The evidence trades off data fit against model complexity and favours models of intermediate complexity.
[Figure: distributions of possible data sets for models of different complexity, with example data sets D_1 and D_2 marked]
Practical use
How good are these assumptions? What other functions could be used?
The Bayesian framework avoids the problem of overfitting and allows models to be compared on the basis of the training data alone. However, because of the dependence on the priors, in practical applications it is wise to keep an independent test set of data.
Predictive distribution (averaging over models): p(t|x, D) = Σ_i p(t|x, M_i, D) p(M_i|D)
A simpler approximation, known as model selection, is to use only the model with the highest evidence.
The Evidence Approximation (1)
The fully Bayesian predictive distribution is given by
p(t|t) = ∫∫∫ p(t|w, β) p(w|t, α, β) p(α, β|t) dw dα dβ
but this integral is intractable. Approximate it with
p(t|t) ≈ p(t|t, α̂, β̂) = ∫ p(t|w, β̂) p(w|t, α̂, β̂) dw
where (α̂, β̂) is the mode of p(α, β|t), which is assumed to be sharply peaked; this is known as the evidence approximation.
The Evidence Approximation (2)
From Bayes' theorem we have
p(α, β|t) ∝ p(t|α, β) p(α, β)
and if we assume the prior over α and β to be flat, we see that (α̂, β̂) is obtained by maximizing the marginal likelihood p(t|α, β).
Final evidence function: p(t|α, β) = ∫ p(t|w, β) p(w|α) dw
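The evidence function above has a closed form for this model; a sketch of its logarithm, using the standard expression from PRML §3.5.1 (the example data at the end are illustrative):

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the linear basis function model
    (standard closed form from PRML Section 3.5.1)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    mN = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ mN)**2) + 0.5 * alpha * mN @ mN
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Illustrative use with a tiny polynomial design matrix
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x)
Phi = np.vstack([x**j for j in range(4)]).T
print(log_evidence(Phi, t, alpha=2.0, beta=25.0))
```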
The Evidence Approximation (3)
To maximize the evidence, set its derivative with respect to α to zero:
α = γ / (m_N^T m_N),   where γ = Σ_i λ_i / (α + λ_i) and the λ_i are the eigenvalues of β Φ^T Φ
Derivative with respect to β:
1/β = (1/(N − γ)) Σ_n {t_n − m_N^T φ(x_n)}²
Iterative procedure: make initial choices for α and β, compute m_N and γ, re-estimate α and β from these values, and repeat until convergence.
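A sketch of that iterative re-estimation procedure; the toy data, basis functions, and starting values for α and β are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vstack([x**j for j in range(6)]).T
N, M = Phi.shape

alpha, beta = 1.0, 1.0           # initial choices (illustrative)
for _ in range(100):
    # Eigenvalues lambda_i of beta * Phi^T Phi, and gamma = sum_i lambda_i / (alpha + lambda_i)
    eig = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    gamma = np.sum(eig / (alpha + eig))

    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    mN = beta * np.linalg.solve(A, Phi.T @ t)

    # Re-estimate alpha and beta
    alpha = gamma / (mN @ mN)
    beta = (N - gamma) / np.sum((t - Phi @ mN)**2)

print(alpha, beta, gamma)
```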
The Evidence Approximation (4)
In the limit N ≫ M we have γ → M, and we can consider using the easy-to-compute approximations
α = M / (m_N^T m_N)   and   1/β = (1/N) Σ_n {t_n − m_N^T φ(x_n)}²
The Evidence Approximation (4)
Example: sinusoidal data
Discussion
Are they always consistent? Bayesian vs. frequentist
Sequential learning
Model selection
How do they differ under a flat prior?
Model Selection (1.3)
Cross-validation
Akaike information criterion (AIC): choose the model that maximizes ln p(D|w_ML) − M, where M is the number of adjustable parameters
Sequential Learning
Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent:
w^(τ+1) = w^(τ) − η ∇E_n = w^(τ) + η (t_n − w^(τ)T φ(x_n)) φ(x_n)
This is known as the least-mean-squares (LMS) algorithm. Issue: how should the learning rate η be chosen?
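A minimal sketch of the LMS update for the straight-line model from the earlier sequential-learning example; the learning rate η is an illustrative choice (too large a value makes the updates diverge):

```python
import numpy as np

rng = np.random.default_rng(7)

w = np.zeros(2)                    # w^(0)
eta = 0.1                          # learning rate (illustrative)

for _ in range(1000):
    x = rng.uniform(-1, 1)
    t = -0.3 + 0.5 * x + rng.normal(scale=0.2)
    phi = np.array([1.0, x])

    # LMS / stochastic gradient step: w <- w + eta * (t_n - w^T phi_n) * phi_n
    w += eta * (t - w @ phi) * phi

print(w)   # approaches the generating parameters for a suitable eta
```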