
Chaohui Guo & Friederike Meyer

PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 3: LINEAR MODELS FOR REGRESSION
Purposes of linear regression
Specifying the relationship between independent variables (input vector x) and dependent variables (target values t)
Assumption: targets are noisy realizations of an underlying functional relationship
Goal: model the predictive distribution p(t|x)
A general linear model (GLM) explains this relationship in terms of a linear combination of the IVs plus an error term:
t_n = y(x_n, w) + ε_n
GLM: Illustration of matrix form
y = Xβ + e,   e ~ N(0, σ²I)
y: N×1 data vector
X: N×p design matrix
β: p×1 parameter vector
e: N×1 error vector
N: number of scans
p: number of regressors
Parameter estimation
Model: y = Xβ + e
Objective: estimate β so as to minimize the sum of squared errors Σ_{t=1}^{N} e_t²
Ordinary least squares (OLS) estimate (assuming i.i.d. error):
β̂ = (X^T X)^{-1} X^T y
[Figure: the data vector y, its fit Xβ̂, and the residual e in the design space defined by X (spanned by the columns x_1 and x_2)]
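A minimal NumPy sketch of the OLS estimate above; the design matrix, true parameters, and noise level are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

N, p = 100, 3                                  # number of scans and regressors (illustrative)
X = np.column_stack([np.ones(N),               # intercept column
                     rng.normal(size=(N, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)   # y = X beta + e

# OLS estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```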
GLM: A geometric perspective
Fitted values: ŷ = Xβ̂ = Py, where P = X(X^T X)^{-1} X^T is the projection matrix
Residuals: e = Ry, where R = I − P is the residual-forming matrix
OLS estimate: β̂ = (X^T X)^{-1} X^T y

OLS estimates
The LS estimate of the data = the projection of the data vector onto the space spanned by the design matrix
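A short sketch, under the same kind of illustrative setup, checking this geometric picture numerically: the fitted values are the projection Py of the data onto the design space and the residuals Ry are orthogonal to it:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=N)

# Projection matrix P and residual-forming matrix R = I - P
P = X @ np.linalg.solve(X.T @ X, X.T)
R = np.eye(N) - P

y_hat = P @ y                      # fitted values: projection of y onto the column space of X
e = R @ y                          # residuals
print(np.allclose(X.T @ e, 0))     # residuals orthogonal to the design space -> True
print(np.allclose(y_hat + e, y))   # data decomposes into fit + residual -> True
```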
Linear Basis Function Models: Polynomial Curve Fitting
y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x),   φ_j(x) = basis functions
Types of nonlinear basis functions: polynomial, Gaussian, sigmoidal
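A small sketch of the three basis function families just listed; the centres, scale s, and the number of functions are arbitrary illustrative choices:

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0..M-1."""
    return np.vstack([x**j for j in range(M)]).T

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))

def sigmoidal_basis(x, centres, s):
    """phi_j(x) = logistic sigmoid of (x - mu_j) / s."""
    a = (x[:, None] - centres[None, :]) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1, 1, 5)
centres = np.linspace(-1, 1, 4)
print(polynomial_basis(x, 3).shape)            # (5, 3)
print(gaussian_basis(x, centres, 0.3).shape)   # (5, 4)
print(sigmoidal_basis(x, centres, 0.3).shape)  # (5, 4)
```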
Estimation of w: Classical (frequentist) techniques
Using an estimator to determine a specific value for the parameter vector w
e.g. the sum-of-squares error function (SSQ): E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}²
Minimizing this function with respect to w yields w*
Prediction: t = y(x, w*)
Reducing overfitting: Regularized Least Squares
Control of overfitting: regularization of the error function
E(w) = E_D(w) + λ E_W(w)   (data-dependent error + regularization term)
λ: regularization coefficient
Quadratic regularization term: (λ/2) w^T w
The total error is then minimized by the OLS solution plus an extension:
w = (λI + Φ^T Φ)^{-1} Φ^T t
Regularized Least Squares
Applying a more general regularizer:
E(w) = (1/2) Σ_n {t_n − w^T φ(x_n)}² + (λ/2) Σ_j |w_j|^q   (q = 2 recovers the quadratic case)
How do we choose an appropriate value for λ?
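For the quadratic case (q = 2) the solution has the closed form shown above; a minimal sketch with an illustrative sinusoidal toy data set and an illustrative λ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: noisy sinusoid, polynomial design matrix
x = rng.uniform(0, 1, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
M = 9
Phi = np.vstack([x**j for j in range(M)]).T

lam = 1e-3   # regularization coefficient (illustrative)

# Regularized least squares: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)
```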
Classical techniques: Maximum Likelihood and Least Squares
Assume the observations come from a deterministic function with added Gaussian noise:
t = y(x, w) + ε,   with p(ε|β) = N(ε|0, β^{-1})
which implies a Gaussian conditional distribution:
p(t|x, w, β) = N(t | y(x, w), β^{-1})
Given observed inputs X = {x_1, ..., x_N} (independently drawn) and targets t = (t_1, ..., t_N)^T, we obtain the likelihood function
p(t|X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})
where y(x, w) = w^T φ(x)
Classical techniques: Maximum Likelihood and Least Squares
Taking the logarithm, we get
ln p(t|w, β) = (N/2) ln β − (N/2) ln 2π − β E_D(w),   with E_D(w) = (1/2) Σ_n {t_n − w^T φ(x_n)}²
Computing the gradient with respect to w and setting it to zero yields
Σ_n {t_n − w^T φ(x_n)} φ(x_n)^T = 0
Solving for w, we get
w_ML = (Φ^T Φ)^{-1} Φ^T t   (the OLS estimate), where Φ is the design matrix with Φ_nj = φ_j(x_n)
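A sketch of the maximum likelihood solution on an illustrative sinusoidal toy data set; it also computes the ML estimate of the noise precision β as the inverse mean squared residual (all specific values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix Phi with Phi[n, j] = phi_j(x_n); polynomial basis here
M = 4
Phi = np.vstack([x**j for j in range(M)]).T

# w_ML = (Phi^T Phi)^{-1} Phi^T t  (identical to the OLS estimate)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the noise precision: 1/beta = mean squared residual
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)
print(w_ml, beta_ml)
```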
Conclusion: Frequentist approach & outlook on Bayesian methods
Frequentist approach:
Seeks a point estimate of the unknown parameter w by maximizing the likelihood
Risk of inappropriately complex models and overfitting (solution: regularization, but the number/type of basis functions is still important)
Bayesian approach:
Characterizes the uncertainty in w through a probability distribution p(w)
Averages over parameter values, weighted by the posterior distribution p(w|t)
Bayesian linear regression: Parameter distribution
β (noise precision): assumed to be a known constant
Likelihood function p(t|w) with Gaussian noise: the exponential of a quadratic function of w
Gaussian prior: p(w) = N(w|m_0, S_0)
Gaussian posterior: p(w|t) = [p(D|w) p(w)] / p(D)
p(w|t) = N(w|m_N, S_N)
m_N = S_N (S_0^{-1} m_0 + β Φ^T t)
S_N^{-1} = S_0^{-1} + β Φ^T Φ
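A sketch of this posterior update for a general Gaussian prior N(m_0, S_0); the data, basis functions, prior parameters, and β are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

M = 4
Phi = np.vstack([x**j for j in range(M)]).T
beta = 25.0                      # noise precision, assumed known

m0 = np.zeros(M)                 # prior mean (illustrative)
S0 = np.eye(M) / 2.0             # prior covariance (illustrative)

# Posterior: S_N^{-1} = S_0^{-1} + beta Phi^T Phi,  m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)
S0_inv = np.linalg.inv(S0)
SN_inv = S0_inv + beta * Phi.T @ Phi
SN = np.linalg.inv(SN_inv)
mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
print(mN)
```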
Bayesian Linear Regression: Common cases
A common choice for the prior is a zero-mean isotropic Gaussian:
p(w|α) = N(w|0, α^{-1} I)
for which
m_N = β S_N Φ^T t
S_N^{-1} = α I + β Φ^T Φ
Maximization of the posterior distribution = minimization of the SSQ with the addition of a quadratic regularization term (λ = α/β)
Bayesian Linear Regression: An Example of sequential learning
Linear model: y(x, w) = w_0 + w_1 x   (straight-line fitting)
Generation of synthetic data: f(x, a) = a_0 + a_1 x with a_0 = −0.3, a_1 = 0.5
Inputs drawn from U(x|−1, 1)
Gaussian noise with standard deviation 0.2, so β = (1/0.2)² = 25
Prior precision α = 2.0
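A minimal sketch of the sequential update for this straight-line example, processing one data point at a time so that each posterior becomes the prior for the next point (generating values follow the setup above; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

a0, a1 = -0.3, 0.5               # generating parameters
beta = 25.0                      # noise precision (std 0.2)
alpha = 2.0                      # prior precision

# Prior over w = (w0, w1): zero-mean isotropic Gaussian
mN = np.zeros(2)
SN = np.eye(2) / alpha

for _ in range(20):
    x = rng.uniform(-1, 1)
    t = a0 + a1 * x + rng.normal(scale=0.2)
    phi = np.array([1.0, x])                  # phi(x) = (1, x) for straight-line fitting

    # Current posterior acts as the prior for this new observation
    S_old_inv = np.linalg.inv(SN)
    SN_inv = S_old_inv + beta * np.outer(phi, phi)
    mN = np.linalg.solve(SN_inv, S_old_inv @ mN + beta * phi * t)
    SN = np.linalg.inv(SN_inv)

print(mN)   # should approach (a0, a1) as more points are seen
```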
Bayesian Linear Regression: An Example of sequential learning
[Figure, one row per observed data point: likelihood p(t|x, w); prior/posterior p(w|t), obtained by combining the prior with the likelihood; data space showing 6 lines y(x, w) drawn from the current posterior over w]
After each data point, the posterior becomes the prior for the next observation.
Posterior distribution
Prior: p(w|α) = N(w|0, α^{-1} I)
Likelihood function: p(t|w, β) = Π_n N(t_n | w^T φ(x_n), β^{-1})
Posterior: p(w|t) = N(w|m_N, S_N)
where m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ
Predictive Distribution (1)
Predict t for new values of x by marginalizing over w (sum rule and product rule):
p(t|t, α, β) = ∫ p(t|w, β) p(w|t, α, β) dw = N(t | m_N^T φ(x), σ_N²(x))
where σ_N²(x) = 1/β + φ(x)^T S_N φ(x)
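A sketch of this predictive distribution with the zero-mean isotropic prior and 9 Gaussian basis functions, as in the example that follows; the data set, basis centres, and scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

alpha, beta = 2.0, 25.0
centres = np.linspace(0, 1, 9)                 # 9 Gaussian basis functions
s = 0.1                                        # basis width (illustrative)

def phi(x):
    return np.exp(-(np.atleast_1d(x)[:, None] - centres[None, :])**2 / (2 * s**2))

Phi = phi(x)
SN = np.linalg.inv(alpha * np.eye(len(centres)) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x) at new inputs
x_new = np.linspace(0, 1, 5)
Phi_new = phi(x_new)
pred_mean = Phi_new @ mN
pred_var = 1.0 / beta + np.sum(Phi_new @ SN * Phi_new, axis=1)
print(pred_mean, pred_var)
```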
Predictive Distribution (2)
Example: sinusoidal data, 9 Gaussian basis functions
Predictive Distribution (3)
Example: sinusoidal data, 9 Gaussian basis functions, 25 data points
Bayesian Model Comparison (1)
How do we choose the right model from L candidate models?
Compare their posterior probabilities: p(M_i|D) ∝ p(M_i) p(D|M_i)   (posterior ∝ prior × model evidence)
Bayes factor: the ratio of the evidence for two models, p(D|M_i) / p(D|M_j)
Bayesian Model Comparison (2)
Simple situation: for a given model with a single parameter w, consider the approximation
p(D) = ∫ p(D|w) p(w) dw ≈ p(D|w_MAP) (Δw_posterior / Δw_prior)
where the posterior is assumed to be sharply peaked (width Δw_posterior) and the prior is flat (width Δw_prior).
Bayesian Model Comparison (3)
Taking logarithms, we obtain
ln p(D) ≈ ln p(D|w_MAP) + ln(Δw_posterior / Δw_prior)
With M parameters, all assumed to have the same ratio Δw_posterior / Δw_prior, we get
ln p(D) ≈ ln p(D|w_MAP) + M ln(Δw_posterior / Δw_prior)
The complexity penalty is negative and linear in M.
Bayesian Model Comparison (4)
The evidence trades off data fit against model complexity and favours models of intermediate complexity.
[Figure: distributions of possible data sets for models of different complexity, with example data sets D_1 and D_2 marked]
Practical use
How good are these assumptions? What other functions could be used?
The Bayesian framework avoids the problem of overfitting and allows models to be compared on the basis of the training data alone. However, because of the dependence on the priors, in practical applications it is wise to keep an independent test set of data.
Predictive distribution (averaging over models): p(t|x, D) = Σ_i p(t|x, M_i, D) p(M_i|D)
A simpler approximation, known as model selection, is to use only the model with the highest evidence.
The Evidence Approximation (1)
The fully Bayesian predictive distribution is given by
p(t|t) = ∫∫∫ p(t|w, β) p(w|t, α, β) p(α, β|t) dw dα dβ
but this integral is intractable. Approximate it with
p(t|t) ≈ p(t|t, α̂, β̂) = ∫ p(t|w, β̂) p(w|t, α̂, β̂) dw
where (α̂, β̂) is the mode of p(α, β|t), which is assumed to be sharply peaked; this is known as the evidence approximation.
The Evidence Approximation (2)
From Bayes' theorem we have
p(α, β|t) ∝ p(t|α, β) p(α, β)
and if we assume the prior over α and β to be flat, we see that (α̂, β̂) is obtained by maximizing the marginal likelihood p(t|α, β).
Final evidence function: p(t|α, β) = ∫ p(t|w, β) p(w|α) dw
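The evidence function above has a closed form for this model; a sketch of its logarithm, using the standard expression from PRML §3.5.1 (the example data at the end are illustrative):

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the linear basis function model
    (standard closed form from PRML Section 3.5.1)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    mN = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ mN)**2) + 0.5 * alpha * mN @ mN
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Illustrative use with a tiny polynomial design matrix
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x)
Phi = np.vstack([x**j for j in range(4)]).T
print(log_evidence(Phi, t, alpha=2.0, beta=25.0))
```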
The Evidence Approximation (3)
To maximize the evidence, set its derivative with respect to α to zero:
α = γ / (m_N^T m_N),   where γ = Σ_i λ_i / (α + λ_i) and the λ_i are the eigenvalues of β Φ^T Φ
Derivative with respect to β:
1/β = (1/(N − γ)) Σ_n {t_n − m_N^T φ(x_n)}²
Iterative procedure: make initial choices for α and β, compute m_N and γ, re-estimate α and β from these values, and repeat until convergence.
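A sketch of that iterative re-estimation procedure; the toy data, basis functions, and starting values for α and β are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vstack([x**j for j in range(6)]).T
N, M = Phi.shape

alpha, beta = 1.0, 1.0           # initial choices (illustrative)
for _ in range(100):
    # Eigenvalues lambda_i of beta * Phi^T Phi, and gamma = sum_i lambda_i / (alpha + lambda_i)
    eig = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    gamma = np.sum(eig / (alpha + eig))

    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    mN = beta * np.linalg.solve(A, Phi.T @ t)

    # Re-estimate alpha and beta
    alpha = gamma / (mN @ mN)
    beta = (N - gamma) / np.sum((t - Phi @ mN)**2)

print(alpha, beta, gamma)
```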
The Evidence Approximation (4)
In the limit N ≫ M we have γ → M, and we can consider using the easy-to-compute approximations
α = M / (m_N^T m_N)   and   1/β = (1/N) Σ_n {t_n − m_N^T φ(x_n)}²
The Evidence Approximation (4)
Example: sinusoidal data
Discussion
Are they always consistent? Bayesian vs. frequentist
Sequential learning
Model selection
How do they differ under a flat prior?
Model Selection (1.3)
Cross-validation
Akaike information criterion (AIC): choose the model that maximizes ln p(D|w_ML) − M, where M is the number of adjustable parameters
Sequential Learning
Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent:
w^(τ+1) = w^(τ) − η ∇E_n = w^(τ) + η (t_n − w^(τ)T φ(x_n)) φ(x_n)
This is known as the least-mean-squares (LMS) algorithm. Issue: how should the learning rate η be chosen?
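A minimal sketch of the LMS update for the straight-line model from the earlier sequential-learning example; the learning rate η is an illustrative choice (too large a value makes the updates diverge):

```python
import numpy as np

rng = np.random.default_rng(7)

w = np.zeros(2)                    # w^(0)
eta = 0.1                          # learning rate (illustrative)

for _ in range(1000):
    x = rng.uniform(-1, 1)
    t = -0.3 + 0.5 * x + rng.normal(scale=0.2)
    phi = np.array([1.0, x])

    # LMS / stochastic gradient step: w <- w + eta * (t_n - w^T phi_n) * phi_n
    w += eta * (t - w @ phi) * phi

print(w)   # approaches the generating parameters for a suitable eta
```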