Understanding gradient descent - Eli Bendersky's website
Understanding gradient descent
(http://eli.thegreenplace.net/2016/understandinggradientdescent/)
August 05, 2016 at 05:38. Tags: Math (http://eli.thegreenplace.net/tag/math), Machine Learning (http://eli.thegreenplace.net/tag/machinelearning)
Gradient descent is a standard tool for optimizing complex functions iteratively within a computer program. Its goal is: given some arbitrary function, find a minimum. For some small subset of functions, those that are convex, there's just a single minimum, which also happens to be global. For most realistic functions, there may be many minima, so most minima are local. Making sure the optimization finds the "best" minimum and doesn't get stuck in suboptimal minima is out of the scope of this article. Here we'll just be dealing with the core gradient descent algorithm for finding some minimum from a given starting point.
The main premise of gradient descent is: given some current location x in the search space (the domain of the optimized function), we ought to update x for the next step in the direction opposite to the gradient of the function computed at x. But why is this the case? The aim of this article is to explain why, mathematically.
This is also the place for a disclaimer: the examples used throughout the article are trivial, low-dimensional, convex functions. We don't really need an algorithmic procedure to find their global minimum; a quick computation would do, or really just eyeballing the function's plot. In reality we will be dealing with nonlinear, 1000-dimensional functions where it's utterly impossible to visualize anything, or solve anything analytically. The approach works just the same there, however.
Building intuition with single-variable functions
The gradient is formally defined for multivariate functions. However, to start building intuition, it's useful to begin with the two-dimensional case, a single-variable function $f(x)$.
In single-variable functions, the simple derivative plays the role of a gradient. So "gradient descent" would really be "derivative descent"; let's see what that means.
As an example, let's take the function $f(x) = (x-1)^2$. Here's its plot, in red:
I marked a couple of points on the plot, in blue, and drew the tangents to the function at these points. Remember, our goal is to find the minimum of the function. To do that, we'll start with a guess for an x, and continuously update it to improve our guess based on some computation. How do we know how to update x? The update has only two possible directions: increase x or decrease x. We have to decide which of the two directions to take.
We do that based on the derivative of $f(x)$. The derivative at some point $x_0$ is defined as the limit [1]:
$$\frac{d}{dx}f(x_0) = \lim_{h \to 0}\frac{f(x_0+h)-f(x_0)}{h}$$
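Intuitively, this tells us what happens to $f(x)$ when we add a very small value to x. For example in the plot above, at $x_0 = 3$ we have:
$$\frac{d}{dx}f(3) = \lim_{h \to 0}\frac{f(3+h)-f(3)}{h} = \lim_{h \to 0}\frac{(3+h-1)^2-(3-1)^2}{h} = \lim_{h \to 0}\frac{h^2+4h}{h} = \lim_{h \to 0}(h+4) = 4$$
This means that the slope of $f(x)$ at $x_0 = 3$ is 4; for a very small positive change h to x at that point, the value of $f(x)$ will increase by 4h. Therefore, to get closer to the minimum of $f(x)$ we should rather decrease $x_0$ a bit.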
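To make the limit a bit more tangible, here is a minimal Python sketch that evaluates the difference quotient for shrinking values of h. It assumes the example function is $f(x) = (x-1)^2$ and the point is $x_0 = 3$; the helper name `difference_quotient` is made up for this illustration.

```python
def f(x):
    """Assumed example function, f(x) = (x - 1)^2."""
    return (x - 1) ** 2


def difference_quotient(func, x0, h):
    """The expression inside the limit: (func(x0 + h) - func(x0)) / h."""
    return (func(x0 + h) - func(x0)) / h


# Shrinking h moves the quotient toward the derivative at x0 = 3.
step_sizes = (1e-1, 1e-2, 1e-3, 1e-4)
quotients = [difference_quotient(f, 3.0, h) for h in step_sizes]
for h, q in zip(step_sizes, quotients):
    print(f"h = {h:g}: quotient = {q:.6f}")
```

For this particular function the quotient works out to 4 + h, so the printed values close in on 4 from above as h shrinks.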
Let's take another example point, $x_0 = -1$. At that point, if we add a little bit to $x_0$, $f(x)$ will decrease by 4 times that little bit. So that's exactly what we should do to get closer to the minimum.
In both cases, we should nudge $x_0$ in the direction opposite to the derivative at $x_0$. Given some initial guess $x_0$, the next guess will be:
$$x_1 = x_0 - \eta\frac{d}{dx}f(x_0)$$
Where $\eta$ is what we call a "learning rate", and is constant for each given update. It's the reason why we don't care much about the magnitude of the derivative at $x_0$, only its direction. In general, it makes sense to keep the learning rate fairly small so we only make a tiny step at a time. This makes sense mathematically, because the derivative at a point is defined as the rate of change of $f(x)$ assuming an infinitesimal change in x. For some large change in x, who knows where we will get. It's easy to imagine cases where we'll entirely overshoot the minimum by making too large a step [2].
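The single-variable update can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function is taken to be $f(x) = (x-1)^2$ with derivative $2(x-1)$, and the starting point, the learning rates and the step count are arbitrary choices, not values from the article.

```python
def df(x):
    """Derivative of the assumed example f(x) = (x - 1)^2, namely 2(x - 1)."""
    return 2 * (x - 1)


def derivative_descent(x0, learning_rate, steps):
    """Repeat x <- x - learning_rate * df(x) and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * df(x)
    return x


small_step = derivative_descent(x0=3.0, learning_rate=0.1, steps=100)
large_step = derivative_descent(x0=3.0, learning_rate=1.1, steps=100)
print(small_step)  # settles very close to the minimum at x = 1
print(large_step)  # overshoots further on every step and diverges
```

With the small learning rate, each step shrinks the distance to the minimum by a constant factor. With the large one, each step jumps to the other side of the minimum and lands further away than before, which is the overshooting described above.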
Multivariate functions and directional derivatives
With functions of multiple variables, derivatives become more interesting. We can't just say "the derivative points to where the function is increasing", because... which derivative?
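Recall the formal definition of the derivative as the limit for a small step h. When our function has many variables, which one should have the step added? One at a time? All at once? In multivariate calculus, we use partial derivatives as building blocks. Let's use a function of two variables $f(x,y)$ as an example throughout this section, and define the partial derivatives w.r.t. x and y at some point $(x_0,y_0)$:
$$\frac{\partial}{\partial x}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+h,y_0)-f(x_0,y_0)}{h}$$
$$\frac{\partial}{\partial y}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0,y_0+h)-f(x_0,y_0)}{h}$$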
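When we have a single-variable function $f(x)$, there are really only two directions in which we can move from a given point $x_0$: left (decrease x) or right (increase x). With two variables, the number of possible directions is infinite, because we pick a direction to move on a 2D plane. Hopefully this immediately brings "vectors" to mind, since vectors are the perfect tool to deal with such problems. We can represent the change from the point $(x_0,y_0)$ as the vector $\vec{v} = \langle a,b \rangle$ [3]. The directional derivative of $f(x,y)$ along $\vec{v}$ at $(x_0,y_0)$ is defined as its rate of change in the direction of the vector at that point. Mathematically, it's defined as:
$$D_{\vec{v}}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h} \tag{1}$$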
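The partial derivatives w.r.t. x and y can be seen as special cases of this definition. The partial derivative $\frac{\partial f}{\partial x}$ is just the directional derivative in the direction of the x axis. In vector-speak, this is the directional derivative for $\vec{v} = \langle a,b \rangle = \hat{e}_x = \langle 1,0 \rangle$, the standard basis vector for x. Just plug $a=1, b=0$ into (1) to see why. Similarly, the partial derivative $\frac{\partial f}{\partial y}$ is the directional derivative in the direction of the standard basis vector $\hat{e}_y = \langle 0,1 \rangle$.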
A visual interlude
Functions of two variables $f(x,y)$ are the last frontier for meaningful visualizations, for which we need 3D to plot the value of $f$ for each given x and y. Even in 3D, visualizing gradients is significantly harder than in 2D, and yet we have to try since for anything above two variables all hopes of visualization are lost.
Here's the function $f(x,y) = x^2+y^2$ plotted in a small range around zero. I drew the standard basis vectors $\hat{e}_x$ and $\hat{e}_y$ [4] and some combination of them $\vec{v}$.
I also marked the point on $f(x,y)$ where the vectors are based. The goal is to help us keep in mind how the independent variables x and y change, and how that affects $f(x,y)$. We change x and y by adding some small vector $\vec{v}$ to their current value. The result is "nudging" $f(x,y)$ in the direction of $\vec{v}$. Remember our goal for this article: find $\vec{v}$ such that this "nudge" gets us closer to a minimum.
Finding directional derivatives using the gradient
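As we've seen, the derivative of $f(x,y)$ in the direction of $\vec{v} = \langle a,b \rangle$ is defined as:
$$D_{\vec{v}}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h}$$
Looking at the 3D plot above, this is how much the value of $f(x,y)$ changes when we add $\vec{v}$ to the vector $\langle x_0,y_0 \rangle$. But how do we do that? This limit definition doesn't look like something friendly for analysis of arbitrary functions. Sure, we could plug $(x_0,y_0)$ and $\vec{v}$ in there and do the computation, but it would be nice to have an easier-to-use formula. Luckily, with the help of the gradient of $f(x,y)$ it becomes much easier.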
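The gradient is a vector value we compute from a scalar function. It's defined as:
$$\nabla f = \left\langle \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right\rangle$$
It turns out that given a vector $\vec{v}$, the directional derivative $D_{\vec{v}}f$ can be expressed as the following dot product:
$$D_{\vec{v}}f = (\nabla f) \cdot \vec{v}$$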
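If this looks like a mental leap too big to trust, please read the Appendix section at the bottom. Otherwise, feel free to verify that the two are equivalent with a couple of examples. For instance, pick a unit vector $\vec{v}$ and a point $(x_0,y_0)$ and find the derivative in the direction of $\vec{v}$ at that point. You should get the same value using both methods.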
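One way to run such a check numerically is sketched below. It assumes the example function is $f(x,y) = x^2+y^2$; the test point (1, 2) and the unit direction (3/5, 4/5) are hypothetical values picked only for this illustration, and the limit in (1) is approximated with a small, finite h.

```python
def f(x, y):
    """Assumed two-variable example, f(x, y) = x^2 + y^2."""
    return x ** 2 + y ** 2


def grad_f(x, y):
    """Gradient of f: the vector of partial derivatives <2x, 2y>."""
    return (2 * x, 2 * y)


def directional_limit(x0, y0, a, b, h=1e-6):
    """Approximate the limit definition (1) with a small, finite h."""
    return (f(x0 + a * h, y0 + b * h) - f(x0, y0)) / h


def directional_dot(x0, y0, a, b):
    """Directional derivative as the dot product of the gradient with <a, b>."""
    gx, gy = grad_f(x0, y0)
    return gx * a + gy * b


# Hypothetical point and unit direction, chosen only for illustration.
x0, y0 = 1.0, 2.0
a, b = 3 / 5, 4 / 5
via_limit = directional_limit(x0, y0, a, b)
via_dot = directional_dot(x0, y0, a, b)
print(via_limit, via_dot)
```

The two numbers agree up to the error introduced by using a finite h instead of a true limit.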
Direction of maximal change
We're almost there! Now that we have a relatively simple way of computing any directional derivative from the partial derivatives of a function, we can figure out which direction to take to get the maximal change in the value of $f(x,y)$.
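We can rewrite:
$$D_{\vec{v}}f = (\nabla f) \cdot \vec{v}$$
As:
$$D_{\vec{v}}f = \|\nabla f\| \|\vec{v}\| \cos(\theta)$$
Where $\theta$ is the angle between the two vectors. Since $\vec{v}$ is normalized, its magnitude is 1 and only its direction relative to the gradient matters. The expression is largest when $\theta = 0$, because then $\cos(\theta) = 1$; that is, when $\vec{v}$ points in the same direction as the gradient. Similarly, for $\theta = 180^\circ$ we get $\cos(\theta) = -1$ and therefore the largest negative change. So if we want to decrease $f(x,y)$ the most, $\vec{v}$ has to point in the direction opposite to the gradient.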
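The claim about the angle can also be checked by brute force: sweep unit vectors around the circle and see which one gives the largest and the smallest directional derivative. The sketch below assumes $f(x,y) = x^2+y^2$ and uses a hypothetical point (1, 2); the 1-degree sweep resolution is an arbitrary choice.

```python
import math


def grad_f(x, y):
    """Gradient of the assumed example f(x, y) = x^2 + y^2."""
    return (2 * x, 2 * y)


def directional_derivative(x0, y0, angle):
    """Dot product of the gradient with the unit vector at the given angle."""
    gx, gy = grad_f(x0, y0)
    return gx * math.cos(angle) + gy * math.sin(angle)


# Hypothetical point; sweep unit vectors in 1-degree increments.
x0, y0 = 1.0, 2.0
angles = [math.radians(d) for d in range(360)]
steepest_ascent = max(angles, key=lambda t: directional_derivative(x0, y0, t))
steepest_descent = min(angles, key=lambda t: directional_derivative(x0, y0, t))

gx, gy = grad_f(x0, y0)
gradient_angle = math.atan2(gy, gx)
print(math.degrees(steepest_ascent), math.degrees(gradient_angle))
print(math.degrees(steepest_descent))
```

Up to the resolution of the sweep, the best ascent direction lines up with the gradient, and the best descent direction is half a turn away from it.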
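Gradient descent update for multivariate functions
To sum up, given some starting point $(x_0,y_0)$, to nudge it in the direction of the minimum of $f(x,y)$, we first compute the gradient of $f(x,y)$ at $(x_0,y_0)$. Then, we update (using vector notation):
$$\langle x_1,y_1 \rangle = \langle x_0,y_0 \rangle - \eta\nabla f(x_0,y_0)$$
Generalizing to multiple dimensions, let's say we have the function $f:\mathbb{R}^n \to \mathbb{R}$ taking the n-dimensional vector $\vec{x} = (x_1,x_2,\dots,x_n)$. We define the gradient update at step k to be:
$$\vec{x}_{(k)} = \vec{x}_{(k-1)} - \eta\nabla f(\vec{x}_{(k-1)})$$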
Previously, for the single-variable case we said that the derivative points us the way to the minimum. Now we can say that while there are many ways to get to the minimum (eventually), the direction opposite to the gradient is the locally fastest way down from any given point.
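The n-dimensional update can be sketched in plain Python as follows. The function being minimized is an assumption made for the example (the sum of squares of the coordinates, whose gradient is twice the input vector), and the starting point, learning rate and step count are arbitrary.

```python
def grad_f(point):
    """Gradient of the assumed example f(x) = sum of x_i^2, namely 2 * x."""
    return [2 * xi for xi in point]


def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Apply x_k = x_(k-1) - learning_rate * grad(x_(k-1)) for a fixed number of steps."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [xi - learning_rate * gi for xi, gi in zip(point, g)]
    return point


minimum = gradient_descent(grad_f, start=[-1.5, 0.25, 3.0])
print(minimum)  # every coordinate ends up very close to 0
```

Nothing in `gradient_descent` depends on the number of dimensions; it only needs a function that returns the gradient at a point. A fixed step count keeps the sketch short; a practical version would more likely stop once the gradient becomes small.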
Appendix: directional derivative definition and gradient
This is an optional section for those who don't like taking mathematical statements for granted. Now it's time to prove the equation shown earlier in the article, and on which its main result is based:
$$D_{\vec{v}}f = (\nabla f) \cdot \vec{v}$$
As usual with proofs, it really helps to start by working through an example or two to build up some intuition into why the equation works. Feel free to do that if you'd like, using any function, starting point and direction vector $\vec{v}$.
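Suppose we define a function $w(t)$ as follows:
$$w(t) = f(x,y)$$
Where $x = x(t)$ and $y = y(t)$ are defined as:
$$x(t) = x_0 + at$$
$$y(t) = y_0 + bt$$
In these definitions, $x_0$, $y_0$, a and b are constants, so both $x(t)$ and $y(t)$ are truly functions of a single variable. Using the chain rule, we know that:
$$\frac{dw}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$
Substituting the derivatives of $x(t)$ and $y(t)$, we get:
$$\frac{dw}{dt} = a\frac{\partial f}{\partial x} + b\frac{\partial f}{\partial y}$$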
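One more step, the significance of which will become clear shortly. Specifically, the derivative of $w(t)$ at $t = 0$ is:
$$\frac{d}{dt}w(0) = a\frac{\partial}{\partial x}f(x_0,y_0) + b\frac{\partial}{\partial y}f(x_0,y_0) \tag{2}$$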
Now let's see how to compute the derivative of $w(t)$ at $t = 0$ using the formal limit definition:
$$\frac{d}{dt}w(0) = \lim_{h \to 0}\frac{w(h)-w(0)}{h} = \lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h}$$
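But the latter is precisely the definition of the directional derivative in equation (1). Therefore, we can say that:
$$\frac{d}{dt}w(0) = D_{\vec{v}}f(x_0,y_0)$$
From this and (2), we get:
$$D_{\vec{v}}f(x_0,y_0) = a\frac{\partial}{\partial x}f(x_0,y_0) + b\frac{\partial}{\partial y}f(x_0,y_0)$$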
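This derivation is not special to the point $(x_0,y_0)$: it works just as well for any point where $f(x,y)$ has partial derivatives w.r.t. x and y; therefore, for any point $(x,y)$ where $f(x,y)$ is differentiable:
$$D_{\vec{v}}f(x,y) = a\frac{\partial}{\partial x}f(x,y) + b\frac{\partial}{\partial y}f(x,y) = \left\langle \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right\rangle \cdot \langle a,b \rangle = (\nabla f) \cdot \vec{v}$$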
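As a concrete instance of the argument, the steps can be traced by hand for one function. The worked example below assumes $f(x,y) = x^2+y^2$, so that $\nabla f = \langle 2x, 2y \rangle$; it is an illustration of the derivation, not an additional part of the proof.

```latex
\begin{align*}
w(t) &= f(x_0 + at,\; y_0 + bt) = (x_0 + at)^2 + (y_0 + bt)^2 \\
\frac{dw}{dt} &= 2a(x_0 + at) + 2b(y_0 + bt) \\
\frac{d}{dt}w(0) &= 2a x_0 + 2b y_0
  = \langle 2x_0, 2y_0 \rangle \cdot \langle a, b \rangle
  = \nabla f(x_0, y_0) \cdot \vec{v} \\[4pt]
\lim_{h \to 0} \frac{f(x_0 + ah,\; y_0 + bh) - f(x_0, y_0)}{h}
  &= \lim_{h \to 0} \frac{2a x_0 h + 2b y_0 h + (a^2 + b^2)h^2}{h}
  = 2a x_0 + 2b y_0
\end{align*}
```

Both routes give $2ax_0 + 2by_0$: the chain rule on $w(t)$ and the limit definition of the directional derivative.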
[1] The notation $\frac{d}{dx}f(x_0)$ means: the value of the derivative of $f$ w.r.t. x, evaluated at $x_0$. Another way to say the same would be $f'(x_0)$.
[2] That said, in some advanced variations of gradient descent we actually want to probe different areas of the function early on in the process, so a larger step makes sense (remember, realistic functions have many local minima and we want to find the best one). Further along in the optimization process, when we've settled on a general area of the function, we want the learning rate to be small so we actually get to the minimum. This approach is called annealing and I'll leave it for some future article.
[3] To avoid tracking vector magnitudes, from now on in the article we'll be dealing with normalized direction vectors. That is, we always assume that $\|\vec{v}\| = 1$.
[4] Yes, $\hat{e}_y$ is actually going in the opposite direction so it's $-\hat{e}_y$, but that really doesn't change anything. It was easier to draw :)
Comments
Charles Moyes, 2 months ago
Awesome article that explains the concepts quite well visually.
MathJax is your friend. All of the notation on this page looks super pixelated and blurry upscaled many times on my 4k hidpi display.
Eli Bendersky
Why is it upscaled? Because the images would be small otherwise?
I considered MathJax, but it has its own issues like slowness and jankiness.
Charles Moyes > Eli Bendersky, 2 months ago
It is upscaled to be displayed at a comfortably readable size on such a high density screen:
http://imgur.com/a/FSW04
Miro Zoricak, 2 months ago
Great article! One small suggestion, the singular form of "minima" is "minimum", not "minima".
Eli Bendersky
Thanks, I'll get it fixed.
© 2003-2016 Eli Bendersky