Understanding Gradient Descent - Eli Bendersky's Website

10/6/2016
Understanding gradient descent
(http://eli.thegreenplace.net/2016/understanding-gradient-descent/)
August 05, 2016 at 05:38  Tags: Math (http://eli.thegreenplace.net/tag/math), Machine Learning (http://eli.thegreenplace.net/tag/machine-learning)
Gradient descent is a standard tool for optimizing complex functions iteratively within a computer program. Its goal is: given some arbitrary function, find a minimum. For some small subset of functions (those that are convex) there's just a single minimum, which also happens to be global. For most realistic functions there may be many minima, so most minima are local. Making sure the optimization finds the "best" minimum and doesn't get stuck in sub-optimal minima is out of the scope of this article. Here we'll just be dealing with the core gradient descent algorithm for finding some minimum from a given starting point.
The main premise of gradient descent is: given some current location x in the search space (the domain of the optimized function), we ought to update x for the next step in the direction opposite to the gradient of the function computed at x. But why is this the case? The aim of this article is to explain why, mathematically.
This is also the place for a disclaimer: the examples used throughout the article are trivial, low-dimensional, convex functions. We don't really need an algorithmic procedure to find their global minimum; a quick computation would do, or really just eyeballing the function's plot. In reality we will be dealing with non-linear, 1000-dimensional functions where it's utterly impossible to visualize anything, or solve anything analytically. The approach works just the same there, however.
Building intuition with single-variable functions

The gradient is formally defined for multivariate functions. However, to start building intuition, it's useful to begin with the two-dimensional case, a single-variable function $f(x)$.

In single-variable functions, the simple derivative plays the role of a gradient. So "gradient descent" would really be "derivative descent"; let's see what that means.
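As an example, let's take the function $f(x) = (x-1)^2$. Here's its plot, in red:

[Figure: plot of $f(x)$ in red, with two points marked in blue and the tangents drawn at them]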
I marked a couple of points on the plot, in blue, and drew the tangents to the function at these points. Remember, our goal is to find the minimum of the function. To do that, we'll start with a guess for an x, and continuously update it to improve our guess based on some computation. How do we know how to update x? The update has only two possible directions: increase x or decrease x. We have to decide which of the two directions to take.

We do that based on the derivative of $f(x)$. The derivative at some point $x_0$ is defined as the limit [1]:

$$\frac{d}{dx}f(x_0) = \lim_{h \to 0}\frac{f(x_0+h)-f(x_0)}{h}$$
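Intuitively, this tells us what happens to $f(x)$ when we add a very small value to x. For example, in the plot above, at $x_0 = 3$ we have:

$$\frac{d}{dx}f(3) = \lim_{h \to 0}\frac{f(3+h)-f(3)}{h} = \lim_{h \to 0}\frac{(3+h-1)^2-(3-1)^2}{h} = \lim_{h \to 0}\frac{h^2+4h}{h} = \lim_{h \to 0}(h+4) = 4$$

This means that the slope of $f(x)$ at $x_0 = 3$ is 4; for a very small positive change h to x at that point, the value of $f(x)$ will increase by 4h. Therefore, to get closer to the minimum of $f(x)$ we should rather decrease $x_0$ a bit.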
Let's take another example point, $x_0 = -1$. At that point, if we add a little bit to $x_0$, $f(x)$ will decrease by 4 times that little bit. So that's exactly what we should do to get closer to the minimum.
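
The slopes at the two example points can also be checked numerically. Here's a minimal Python sketch, assuming the example function is $f(x) = (x-1)^2$ and the two points are 3 and -1; the helper name `difference_quotient` is made up for this illustration. It evaluates the expression inside the limit for shrinking values of h.

```python
def f(x):
    # Example function assumed for this sketch: f(x) = (x - 1)^2
    return (x - 1) ** 2


def difference_quotient(func, x0, h):
    # The expression inside the limit definition: (f(x0 + h) - f(x0)) / h
    return (func(x0 + h) - func(x0)) / h


if __name__ == "__main__":
    for h in (1.0, 0.1, 0.001, 1e-6):
        # At x0 = 3 the quotient approaches 4; at x0 = -1 it approaches -4.
        print(h, difference_quotient(f, 3.0, h), difference_quotient(f, -1.0, h))
```

A positive quotient says that increasing x increases the function, so we should move left; a negative one says the opposite.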
It turns out that in both cases, we should nudge $x_0$ in the direction opposite to the derivative at $x_0$. That's the most basic idea behind gradient descent: the derivative shows us the way to the minimum; or rather, it shows us the way to the maximum and we then go in the opposite direction. Given some initial guess $x_0$, the next guess will be:

$$x_1 = x_0 - \eta\frac{d}{dx}f(x_0)$$

Where $\eta$ is what we call a "learning rate", and is constant for each given update. It's the reason why we don't care much about the magnitude of the derivative at $x_0$, only its direction. In general, it makes sense to keep the learning rate fairly small so we only make a tiny step at a time. This makes sense mathematically, because the derivative at a point is defined as the rate of change of $f(x)$ assuming an infinitesimal change in x. For some large change in x, who knows where we will get. It's easy to imagine cases where we'll entirely overshoot the minimum by making too large a step [2].
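
To make the update rule concrete, here's a small sketch in Python. It assumes the example function is $f(x) = (x-1)^2$, whose derivative is $2(x-1)$; the names `gradient_descent_1d` and `learning_rate` are mine, not part of any library. It runs the same update with a small and with a too-large learning rate.

```python
def df(x):
    # Derivative of the assumed example function f(x) = (x - 1)^2
    return 2 * (x - 1)


def gradient_descent_1d(x0, learning_rate, steps):
    # Repeatedly apply x <- x - learning_rate * f'(x), recording every guess.
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - learning_rate * df(xs[-1]))
    return xs


if __name__ == "__main__":
    # Small step: the guesses move steadily toward the minimum at x = 1.
    print(gradient_descent_1d(3.0, 0.1, 20)[-1])
    # Too-large step: every update overshoots and lands farther from x = 1.
    print(gradient_descent_1d(3.0, 1.1, 5))
```

For this particular function each update multiplies the distance to the minimum by $(1 - 2\eta)$, which is why a rate of 0.1 converges while 1.1 diverges.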
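Multivariate functions and directional derivatives

With functions of multiple variables, derivatives become more interesting. We can't just say "the derivative points to where the function is increasing", because... which derivative?

Recall the formal definition of the derivative as the limit for a small step h. When our function has many variables, which one should have the step added? One at a time? All at once? In multivariate calculus, we use partial derivatives as building blocks. Let's use a function of two variables, $f(x,y)$, as an example throughout this section, and define the partial derivatives w.r.t. x and y at some point $(x_0,y_0)$:

$$\frac{\partial}{\partial x}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+h,y_0)-f(x_0,y_0)}{h}$$

$$\frac{\partial}{\partial y}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0,y_0+h)-f(x_0,y_0)}{h}$$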
When we have a single-variable function $f(x)$, there are really only two directions in which we can move from a given point $x_0$: left (decrease x) or right (increase x). With two variables, the number of possible directions is infinite, because we pick a direction to move on a 2D plane. Hopefully this immediately brings "vectors" to mind, since vectors are the perfect tool to deal with such problems. We can represent the change from the point $(x_0,y_0)$ as the vector $\vec{v} = \langle a,b \rangle$ [3]. The directional derivative of $f(x,y)$ along $\vec{v}$ at $(x_0,y_0)$ is defined as its rate of change in the direction of the vector at that point. Mathematically, it's defined as:

$$D_{\vec{v}}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h} \qquad (1)$$
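The partial derivatives w.r.t. x and y can be seen as special cases of this definition. The partial derivative $\frac{\partial f}{\partial x}$ is just the directional derivative in the direction of the x axis. In vector-speak, this is the directional derivative for $\vec{v} = \langle a,b \rangle = \hat{e_x} = \langle 1,0 \rangle$, the standard basis vector for x. Just plug $a=1, b=0$ into (1) to see why. Similarly, the partial derivative $\frac{\partial f}{\partial y}$ is the directional derivative in the direction of the standard basis vector $\hat{e_y} = \langle 0,1 \rangle$.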
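
The special-case relationship is easy to try out numerically. The sketch below is only an illustration: the function $x^2+y^2$, the point (1, 2) and the helper name `directional_quotient` are all choices made for this example. It evaluates the quotient from the directional derivative definition with a small but finite h.

```python
def f(x, y):
    # Function chosen for illustration: f(x, y) = x^2 + y^2
    return x ** 2 + y ** 2


def directional_quotient(func, x0, y0, a, b, h=1e-6):
    # Finite-h version of the directional derivative definition:
    # step a distance h from (x0, y0) along the direction <a, b>.
    return (func(x0 + a * h, y0 + b * h) - func(x0, y0)) / h


if __name__ == "__main__":
    x0, y0 = 1.0, 2.0  # arbitrary point chosen for the illustration
    # Direction <1, 0> approximates the partial derivative w.r.t. x, i.e. 2 * x0.
    print(directional_quotient(f, x0, y0, 1.0, 0.0))
    # Direction <0, 1> approximates the partial derivative w.r.t. y, i.e. 2 * y0.
    print(directional_quotient(f, x0, y0, 0.0, 1.0))
```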
A visual interlude

Functions of two variables $f(x,y)$ are the last frontier for meaningful visualizations, for which we need 3D to plot the value of $f$ for each given x and y. Even in 3D, visualizing gradients is significantly harder than in 2D, and yet we have to try, since for anything above two variables all hopes of visualization are lost.
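Here's the function $f(x,y) = x^2+y^2$ plotted in a small range around zero. I drew the standard basis vectors $\hat{x} = \hat{e_x}$ and $\hat{y} = \hat{e_y}$ [4] and some combination of them, $\vec{v}$.

[Figure: 3D plot of $f(x,y)$ with the vectors $\hat{x}$, $\hat{y}$ and $\vec{v}$ drawn from a marked point]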
I also marked the point on $f(x,y)$ where the vectors are based. The goal is to help us keep in mind how the independent variables x and y change, and how that affects $f(x,y)$. We change x and y by adding some small vector $\vec{v}$ to their current value. The result is "nudging" $f(x,y)$ in the direction of $\vec{v}$. Remember our goal for this article: find $\vec{v}$ such that this "nudge" gets us closer to a minimum.
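Finding directional derivatives using the gradient

As we've seen, the derivative of $f(x,y)$ in the direction of $\vec{v} = \langle a,b \rangle$ is defined as:

$$D_{\vec{v}}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h}$$

Looking at the 3D plot above, this is how much the value of $f(x,y)$ changes when we add $\vec{v}$ to the vector $\langle x_0,y_0 \rangle$. But how do we do that? This limit definition doesn't look friendly to analytical work on arbitrary functions. Sure, we could plug $(x_0,y_0)$ and $\vec{v}$ in there and do the computation, but it would be nice to have an easier-to-use formula. Luckily, with the help of the gradient of $f(x,y)$ it becomes much easier.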
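The gradient is a vector value we compute from a scalar function. It's defined as:

$$\nabla f = \left\langle \frac{\partial f}{\partial x},\frac{\partial f}{\partial y} \right\rangle$$

It turns out that given a vector $\vec{v}$, the directional derivative $D_{\vec{v}}f$ can be expressed as the following dot product:

$$D_{\vec{v}}f = (\nabla f)\cdot\vec{v}$$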
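If this looks like a mental leap too big to trust, please read the Appendix section at the bottom. Otherwise, feel free to verify that the two are equivalent with a couple of examples. For instance, for $f(x,y) = x^2+y^2$, try to find the derivative in the direction of $\vec{v} = \langle \frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}} \rangle$ at $(x_0,y_0) = (-1.5,0.25)$. You should get $\frac{-2.5}{\sqrt{2}}$ using both methods.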
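
Here's one way to run such a check in code rather than by hand. This is a sketch under stated assumptions: the function is taken to be $f(x,y) = x^2+y^2$, the direction $\langle \frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}} \rangle$ and the point $(-1.5, 0.25)$; the helper names are made up for the example. One helper approximates the limit definition with a small h, the other uses the dot product with the analytic gradient $\langle 2x, 2y \rangle$.

```python
import math


def f(x, y):
    # Function assumed for this check: f(x, y) = x^2 + y^2
    return x ** 2 + y ** 2


def grad_f(x, y):
    # Analytic gradient of f: <df/dx, df/dy> = <2x, 2y>
    return (2 * x, 2 * y)


def directional_by_limit(x0, y0, a, b, h=1e-6):
    # Approximates the limit definition with a small but finite h.
    return (f(x0 + a * h, y0 + b * h) - f(x0, y0)) / h


def directional_by_gradient(x0, y0, a, b):
    # Dot product of the gradient at (x0, y0) with the direction <a, b>.
    gx, gy = grad_f(x0, y0)
    return gx * a + gy * b


if __name__ == "__main__":
    a = b = 1 / math.sqrt(2)  # unit-length direction vector
    print(directional_by_limit(-1.5, 0.25, a, b))
    print(directional_by_gradient(-1.5, 0.25, a, b))
    print(-2.5 / math.sqrt(2))
```

The first value differs from the other two only by the finite step h; the gradient-based value involves no limit at all, which is the whole appeal of the formula.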
Direction of maximal change

We're almost there! Now that we have a relatively simple way of computing any directional derivative from the partial derivatives of a function, we can figure out which direction to take to get the maximal change in the value of $f(x,y)$.
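We can rewrite:

$$D_{\vec{v}}f = (\nabla f)\cdot\vec{v}$$

As:

$$D_{\vec{v}}f = \|\nabla f\|\,\|\vec{v}\|\cos(\theta)$$

Where $\theta$ is the angle between the two vectors. Now, recall that $\vec{v}$ is normalized so its magnitude is 1. Therefore, we only care about the direction of $\vec{v}$ w.r.t. the gradient. When is this equation maximized? When $\theta = 0$, because then $\cos(\theta) = 1$. Since a cosine can never be larger than 1, that's the best we can have.

So $\theta = 0$ gives us the largest positive change in $f(x,y)$. To get $\theta = 0$, $\vec{v}$ has to point in the same direction as the gradient. Similarly, for $\theta = 180^\circ$ we get $\cos(\theta) = -1$ and therefore the largest negative change in $f(x,y)$. So if we want to decrease $f(x,y)$ the most, $\vec{v}$ has to point in the opposite direction of the gradient.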
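
The cosine argument can be sanity-checked by brute force: sample many unit vectors around the circle and see which ones give the largest and smallest directional derivative. The sketch below does that for a hypothetical gradient value of $\langle 3,4 \rangle$ (magnitude 5); the function names and the sample count are arbitrary choices for the illustration.

```python
import math


def directional_derivative(grad, theta):
    # Unit direction vector at angle theta (radians) from the positive x axis,
    # dotted with the gradient.
    v = (math.cos(theta), math.sin(theta))
    return grad[0] * v[0] + grad[1] * v[1]


def extreme_directions(grad, samples=3600):
    # Sample unit vectors around the circle and return the angles that give
    # the largest and the smallest directional derivative.
    angles = [2 * math.pi * i / samples for i in range(samples)]
    best = max(angles, key=lambda t: directional_derivative(grad, t))
    worst = min(angles, key=lambda t: directional_derivative(grad, t))
    return best, worst


if __name__ == "__main__":
    grad = (3.0, 4.0)  # hypothetical gradient value; its magnitude is 5
    best, worst = extreme_directions(grad)
    print(math.atan2(grad[1], grad[0]), best)   # nearly equal angles
    print(directional_derivative(grad, best))   # close to 5
    print(directional_derivative(grad, worst))  # close to -5
```

The best sampled direction lines up with the gradient and the worst one is half a turn away from it, as the $\cos(\theta)$ form predicts.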
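Gradient descent update for multivariate functions

To sum up, given some starting point $(x_0,y_0)$, to nudge it in the direction of the minimum of $f(x,y)$, we first compute the gradient of $f(x,y)$ at $(x_0,y_0)$. Then, we update (using vector notation, with $\eta$ being the learning rate):

$$\langle x_1,y_1 \rangle = \langle x_0,y_0 \rangle - \eta\nabla f(x_0,y_0)$$

Generalizing to multiple dimensions, let's say we have the function $f:\mathbb{R}^n \to \mathbb{R}$ taking the n-dimensional vector $\vec{x} = (x_1,x_2,\dots,x_n)$. We define the gradient update at step k to be:

$$\vec{x}_{(k)} = \vec{x}_{(k-1)} - \eta\nabla f(\vec{x}_{(k-1)})$$

Previously, for the single-variable case we said that the derivative points us to the way to the minimum. Now we can say that while there are many ways to get to the minimum (eventually), stepping against the gradient is the locally steepest way down from any given point.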
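
The n-dimensional update translates almost directly into code. What follows is a minimal sketch rather than a production optimizer: points are plain Python lists, the caller supplies the gradient as a function, and the names `gradient_descent` and `grad_sum_of_squares` as well as the default learning rate and step count are assumptions made for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    # x0 is a list of n coordinates; grad maps such a list to the gradient,
    # which is also a list of n numbers.
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        # Vector update: new x = old x - learning_rate * gradient at old x
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x


def grad_sum_of_squares(x):
    # Gradient of the test function f(x) = x_1^2 + ... + x_n^2,
    # chosen here because its minimum (the origin) is known.
    return [2 * xi for xi in x]


if __name__ == "__main__":
    # Starting point is arbitrary; the result should be very close to the origin.
    print(gradient_descent(grad_sum_of_squares, [-1.5, 0.25, 4.0]))
```

A fixed number of steps keeps the sketch short; a real implementation would more likely stop when the gradient or the change in x becomes small enough.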
Appendix: directional derivative definition and gradient

This is an optional section for those who don't like taking mathematical statements for granted. Now it's time to prove the equation shown earlier in the article, and on which its main result is based:

$$D_{\vec{v}}f = (\nabla f)\cdot\vec{v}$$

As usual with proofs, it really helps to start by working through an example or two to build up some intuition into why the equation works. Feel free to do that if you'd like, using any function, starting point and direction vector $\vec{v}$.
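Suppose we define a function $w(t)$ as follows:

$$w(t) = f(x,y)$$

Where $x = x(t)$ and $y = y(t)$ are defined as:

$$x(t) = x_0 + at$$

$$y(t) = y_0 + bt$$

In these definitions, $x_0$, $y_0$, a and b are constants, so both $x(t)$ and $y(t)$ are truly functions of a single variable. Using the chain rule, we know that:

$$\frac{dw}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$

Substituting the derivatives of $x(t)$ and $y(t)$, we get:

$$\frac{dw}{dt} = a\frac{\partial f}{\partial x} + b\frac{\partial f}{\partial y}$$

One more step, the significance of which will become clear shortly. Specifically, the derivative of $w(t)$ at $t=0$ is:

$$\frac{d}{dt}w(0) = a\frac{\partial}{\partial x}f(x_0,y_0) + b\frac{\partial}{\partial y}f(x_0,y_0) \qquad (2)$$

Now let's see how to compute the derivative of $w(t)$ at $t=0$ using the formal limit definition:

$$\frac{d}{dt}w(0) = \lim_{h \to 0}\frac{w(h)-w(0)}{h} = \lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h}$$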
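But the latter is precisely the definition of the directional derivative in equation (1). Therefore, we can say that:

$$\frac{d}{dt}w(0) = D_{\vec{v}}f(x_0,y_0)$$

From this and (2), we get:

$$D_{\vec{v}}f(x_0,y_0) = a\frac{\partial}{\partial x}f(x_0,y_0) + b\frac{\partial}{\partial y}f(x_0,y_0)$$

This derivation is not special to the point $(x_0,y_0)$; it works just as well for any point where $f(x,y)$ has partial derivatives w.r.t. x and y. Therefore, for any point $(x,y)$ where $f(x,y)$ is differentiable:

$$D_{\vec{v}}f(x,y) = a\frac{\partial}{\partial x}f(x,y) + b\frac{\partial}{\partial y}f(x,y) = \left\langle \frac{\partial f}{\partial x},\frac{\partial f}{\partial y} \right\rangle \cdot \langle a,b \rangle = (\nabla f)\cdot\vec{v}$$

[1] The notation $\frac{d}{dx}f(x_0)$ means: the value of the derivative of $f$ w.r.t. x, evaluated at $x_0$. Another way to say the same would be $f'(x_0)$.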
[2] That said, in some advanced variations of gradient descent we actually want to probe different areas of the function early on in the process, so a larger step makes sense (remember, realistic functions have many local minima and we want to find the best one). Further along in the optimization process, when we've settled on a general area of the function, we want the learning rate to be small so we actually get to the minimum. This approach is called annealing and I'll leave it for some future article.
[3] To avoid tracking vector magnitudes, from now on in the article we'll be dealing with normalized direction vectors. That is, we always assume that $\|\vec{v}\| = 1$.
[4] Yes, $\hat{y}$ is actually going in the opposite direction so it's $-\hat{e_y}$, but that really doesn't change anything. It was easier to draw :)
Comments

5 Comments

Charles Moyes, 2 months ago
Awesome article that explains the concepts quite well visually. MathJax is your friend. All of the notation on this page looks super pixelated and blurry upscaled many times on my 4k hidpi display.

Eli Bendersky (Mod) > Charles Moyes, 2 months ago
Why is it upscaled? Because the images would be small otherwise? I considered MathJax, but it has its own issues like slowness and jankiness

Charles Moyes > Eli Bendersky, 2 months ago
It is upscaled to be displayed at a comfortably readable size on such a high density screen: http://imgur.com/a/FSW04

Miro Zoricak, 2 months ago
Great article! One small suggestion, the singular form of "minima" is "minimum", not "minima".

Eli Bendersky (Mod) > Miro Zoricak, 2 months ago
Thanks, I'll get it fixed

© 2003-2016 Eli Bendersky
