
10/6/2016

Understanding gradient descent - Eli Bendersky's website

Understanding gradient descent
(http://eli.thegreenplace.net/2016/understanding-gradient-descent/)
August 05, 2016 at 05:38 Tags Math (http://eli.thegreenplace.net/tag/math), Machine Learning (http://eli.thegreenplace.net/tag/machine-learning)

Gradient descent is a standard tool for optimizing complex functions iteratively within a computer program. Its goal is: given some arbitrary function, find a minimum. For some small subset of functions - those that are convex - there's just a single minimum, which also happens to be global. For most realistic functions, there may be many minima, so most minima are local. Making sure the optimization finds the "best" minimum and doesn't get stuck in sub-optimal minima is out of the scope of this article. Here we'll just be dealing with the core gradient descent algorithm for finding some minimum from a given starting point.
The main premise of gradient descent is: given some current location x in the search space (the domain of the optimized function), we ought to update x for the next step in the direction opposite to the gradient of the function computed at x. But why is this the case? The aim of this article is to explain why, mathematically.

This is also the place for a disclaimer: the examples used throughout the article are trivial, low-dimensional, convex functions. We don't really need an algorithmic procedure to find their global minimum - a quick computation would do, or really just eyeballing the function's plot. In reality we will be dealing with non-linear, 1000-dimensional functions where it's utterly impossible to visualize anything, or solve anything analytically. The approach works just the same there, however.

Building intuition with single-variable functions
The gradient is formally defined for multivariate functions. However, to start building intuition, it's useful to begin with the two-dimensional case, a single-variable function $f(x)$.
In single-variable functions, the simple derivative plays the role of a gradient. So "gradient descent" would really be "derivative descent"; let's see what that means.
As an example, let's take the function $f(x) = (x-1)^2$. Here's its plot, in red:

[Figure: plot of $f(x)$ in red]

I marked a couple of points on the plot, in blue, and drew the tangents to the function at these points. Remember, our goal is to find the minimum of the function. To do that, we'll start with a guess for an x, and continuously update it to improve our guess based on some computation. How do we know how to update x? The update has only two possible directions: increase x or decrease x. We have to decide which of the two directions to take.
We do that based on the derivative of $f(x)$. The derivative at some point $x_0$ is defined as the limit [1]:

$$\frac{d}{dx}f(x_0) = \lim_{h \to 0}\frac{f(x_0+h) - f(x_0)}{h}$$

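Intuitively, this tells us what happens to $f(x)$ when we add a very small value to x. For example in the plot above, at $x_0 = 3$ we have:

$$\frac{d}{dx}f(3) = \lim_{h \to 0}\frac{(3+h-1)^2 - (3-1)^2}{h} = \lim_{h \to 0}\frac{4h + h^2}{h} = 4$$

This means that the slope of $f(x)$ at $x_0 = 3$ is 4; for a very small positive change h to x at that point, the value of $f(x)$ will increase by 4h. Therefore, to get closer to the minimum of $f(x)$ we should rather decrease $x_0$ a bit.
Let's take another example point, $x_0 = -1$. At that point, if we add a little bit to $x_0$, $f(x)$ will decrease by 4 times that little bit. So that's exactly what we should do to get closer to the minimum.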
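
The two slopes can be checked numerically. Here's a minimal sketch, assuming the example function is $f(x) = (x-1)^2$; the helper name `difference_quotient` and the step sizes are illustrative choices made for this sketch.

```python
def f(x):
    """Example function assumed for this sketch: f(x) = (x - 1)^2."""
    return (x - 1) ** 2


def difference_quotient(func, x0, h):
    """Return (func(x0 + h) - func(x0)) / h.

    The limit of this quantity as h -> 0 is the derivative of func at x0.
    """
    return (func(x0 + h) - func(x0)) / h


# Evaluate the quotient at both example points for shrinking step sizes.
for x0 in (3.0, -1.0):
    for h in (1e-1, 1e-3, 1e-5):
        q = difference_quotient(f, x0, h)
        print(f"x0={x0:+.1f}  h={h:.0e}  quotient={q:+.6f}")
```

For this particular function the quotient works out to exactly $2(x_0 - 1) + h$, so as h shrinks it approaches 4 at $x_0 = 3$ and -4 at $x_0 = -1$.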

It turns out that in both cases, we should nudge $x_0$ in the direction opposite to the derivative at $x_0$. That's the most basic idea behind gradient descent - the derivative shows us the way to the minimum; or rather, it shows us the way to the maximum and we then go in the opposite direction. Given some initial guess $x_0$, the next guess will be:

$$x_1 = x_0 - \eta\frac{d}{dx}f(x_0)$$

Where $\eta$ is what we call a "learning rate", and is constant for each given update. It's the reason why we don't care much about the magnitude of the derivative at $x_0$, only its direction. In general, it makes sense to keep the learning rate fairly small so we only make a tiny step at a time. This makes sense mathematically, because the derivative at a point is defined as the rate of change of $f(x)$ assuming an infinitesimal change in x. For some large change in x, who knows where we will get. It's easy to imagine cases where we'll entirely overshoot the minimum by making too large a step [2].
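
The update rule and the effect of the learning rate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the example function is taken to be $f(x) = (x-1)^2$, and the function name `gradient_descent_1d`, the two learning rates and the step count are arbitrary choices for this sketch.

```python
def df(x):
    """Derivative of the assumed example f(x) = (x - 1)^2."""
    return 2 * (x - 1)


def gradient_descent_1d(deriv, x0, learning_rate, steps):
    """Repeat x <- x - learning_rate * deriv(x); return the list of all guesses."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - learning_rate * deriv(xs[-1]))
    return xs


# A small learning rate creeps toward the minimum at x = 1 ...
small = gradient_descent_1d(df, x0=3.0, learning_rate=0.1, steps=50)
# ... while a learning rate that is too large overshoots further on every step.
large = gradient_descent_1d(df, x0=3.0, learning_rate=1.1, steps=50)

print("learning rate 0.1, final guess:", small[-1])
print("learning rate 1.1, final guess:", large[-1])
```

For this function each update multiplies the distance from the minimum by $(1 - 2\eta)$. With $\eta = 0.1$ the distance shrinks by a factor of 0.8 per step; with $\eta = 1.1$ it flips sign and grows by a factor of 1.2, which is the overshoot described above.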

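Multivariate functions and directional derivatives
With functions of multiple variables, derivatives become more interesting. We can't just say "the derivative points to where the function is increasing", because... which derivative?
Recall the formal definition of the derivative as the limit for a small step h. When our function has many variables, which one should have the step added? One at a time? All at once? In multivariate calculus, we use partial derivatives as building blocks. Let's use a function of two variables, $f(x,y)$, as an example throughout this section, and define the partial derivatives w.r.t. x and y at some point $(x_0, y_0)$:

$$\frac{\partial}{\partial x}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+h,\,y_0) - f(x_0,y_0)}{h}$$
$$\frac{\partial}{\partial y}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0,\,y_0+h) - f(x_0,y_0)}{h}$$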

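When we have a single-variable function $f(x)$, there are really only two directions in which we can move from a given point $x_0$ - left (decrease x) or right (increase x). With two variables, the number of possible directions is infinite, because we pick a direction to move on a 2D plane. Hopefully this immediately brings "vectors" to mind, since vectors are the perfect tool to deal with such problems. We can represent the change from the point $(x_0, y_0)$ as the vector $\vec{v} = \langle a, b \rangle$ [3]. The directional derivative of $f(x,y)$ along $\vec{v}$ at $(x_0, y_0)$ is defined as its rate of change in the direction of the vector at that point. Mathematically, it's defined as:

$$D_{\vec{v}}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+ah,\,y_0+bh) - f(x_0,y_0)}{h} \qquad (1)$$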
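
To get a feel for this limit, here's a small sketch that evaluates the difference quotient for shrinking values of h. The function `g`, the point and the direction are hypothetical choices made only for this illustration, as is the helper name `directional_quotient`.

```python
import math


def directional_quotient(f, point, v, h):
    """Difference quotient (f(x0 + a*h, y0 + b*h) - f(x0, y0)) / h for v = <a, b>."""
    x0, y0 = point
    a, b = v
    return (f(x0 + a * h, y0 + b * h) - f(x0, y0)) / h


def g(x, y):
    """Hypothetical two-variable function used only in this sketch."""
    return x ** 2 + 3 * x * y


point = (1.0, 2.0)                        # hypothetical point (x0, y0)
v = (1 / math.sqrt(2), 1 / math.sqrt(2))  # normalized direction, magnitude 1

for h in (1e-1, 1e-3, 1e-5):
    print(f"h={h:.0e}  quotient={directional_quotient(g, point, v, h):.6f}")
```

For these particular choices the quotient equals $\frac{11}{\sqrt{2}} + 2h$, so the printed values settle near 7.778 as h shrinks.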

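The partial derivatives w.r.t. x and y can be seen as special cases of this definition. The partial derivative $\frac{\partial f}{\partial x}$ is just the directional derivative in the direction of the x axis. In vector-speak, this is the directional derivative for $\vec{v} = \langle a, b \rangle = \vec{e_x} = \langle 1, 0 \rangle$, the standard basis vector for x. Just plug $a = 1, b = 0$ into (1) to see why. Similarly, the partial derivative $\frac{\partial f}{\partial y}$ is the directional derivative in the direction of the standard basis vector $\vec{e_y} = \langle 0, 1 \rangle$.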
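
Written out, the substitution for the x axis looks as follows. This is a short worked step, starting from the limit form of the directional derivative along $\vec{v} = \langle a, b \rangle$ and setting $a = 1$, $b = 0$:

```latex
D_{\vec{e_x}} f(x_0, y_0)
  = \lim_{h \to 0} \frac{f(x_0 + 1 \cdot h,\; y_0 + 0 \cdot h) - f(x_0, y_0)}{h}
  = \lim_{h \to 0} \frac{f(x_0 + h,\; y_0) - f(x_0, y_0)}{h}
  = \frac{\partial}{\partial x} f(x_0, y_0)
```

The same step with $a = 0$, $b = 1$ gives the partial derivative w.r.t. y.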

A visual interlude
Functions of two variables $f(x,y)$ are the last frontier for meaningful visualizations, for which we need 3D to plot the value of $f$ for each given x and y. Even in 3D, visualizing gradients is significantly harder than in 2D, and yet we have to try, since for anything above two variables all hopes of visualization are lost.
Here's the function $f(x,y) = x^2 + y^2$ plotted in a small range around zero. I drew the standard basis vectors $\vec{e_x}$ and $\vec{e_y}$ [4] and some combination of them $\vec{v}$.

[Figure: 3D plot of $f(x,y)$ with the vectors $\vec{e_x}$, $\vec{e_y}$ and $\vec{v}$]

I also marked the point on $f(x,y)$ where the vectors are based. The goal is to help us keep in mind how the independent variables x and y change, and how that affects $f(x,y)$. We change x and y by adding some small vector $\vec{v}$ to their current value. The result is "nudging" $f(x,y)$ in the direction of $\vec{v}$. Remember our goal for this article - find $\vec{v}$ such that this "nudge" gets us closer to a minimum.

Finding directional derivatives using the gradient
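As we've seen, the derivative of $f(x,y)$ in the direction of $\vec{v}$ is defined as:

$$D_{\vec{v}}f(x_0,y_0) = \lim_{h \to 0}\frac{f(x_0+ah,\,y_0+bh) - f(x_0,y_0)}{h}$$

Looking at the 3D plot above, this is the rate at which the value of $f(x,y)$ changes when we add $\vec{v}$ to the vector $\langle x_0, y_0 \rangle$. But how do we compute it? This limit definition doesn't look friendly to analytical treatment of arbitrary functions. Sure, we could plug $(x_0, y_0)$ and $\vec{v}$ in there and do the computation, but it would be nice to have an easier-to-use formula. Luckily, with the help of the gradient of $f(x,y)$ it becomes much easier.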
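The gradient is a vector value we compute from a scalar function. It's defined as:

$$\nabla f = \left\langle \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right\rangle$$

It turns out that given a vector $\vec{v}$, the directional derivative $D_{\vec{v}}f$ can be expressed as the following dot product:

$$D_{\vec{v}}f = (\nabla f) \cdot \vec{v}$$

If this looks like a mental leap too big to trust, please read the Appendix section at the bottom. Otherwise, feel free to verify that the two are equivalent with a couple of examples. For instance, with $f(x,y) = x^2 + y^2$, try to find the derivative in the direction of $\vec{v} = \langle \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \rangle$ at $(x_0, y_0) = (-1.5, 0.25)$. You should get $\frac{-2.5}{\sqrt{2}}$ using both methods.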
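
Here's one way to run such a check in Python. It's a sketch under stated assumptions: the function is taken to be $f(x,y) = x^2 + y^2$, the point is $(-1.5, 0.25)$ and the direction is $\langle \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \rangle$; the helper names `via_gradient` and `via_limit` are made up for this example.

```python
import math


def f(x, y):
    """Assumed example function: f(x, y) = x^2 + y^2."""
    return x ** 2 + y ** 2


def grad_f(x, y):
    """Analytic gradient of f: <df/dx, df/dy> = <2x, 2y>."""
    return (2 * x, 2 * y)


def via_gradient(point, v):
    """Directional derivative as the dot product of the gradient with v."""
    gx, gy = grad_f(*point)
    return gx * v[0] + gy * v[1]


def via_limit(point, v, h=1e-6):
    """Directional derivative approximated by the difference quotient with a small h."""
    x0, y0 = point
    return (f(x0 + v[0] * h, y0 + v[1] * h) - f(x0, y0)) / h


point = (-1.5, 0.25)                      # assumed sample point
v = (1 / math.sqrt(2), 1 / math.sqrt(2))  # assumed unit direction

print("gradient dot v:", via_gradient(point, v))
print("limit quotient:", via_limit(point, v))
print("-2.5 / sqrt(2):", -2.5 / math.sqrt(2))
```

The first and third values agree up to floating-point rounding; the second is only an approximation of the limit, so it differs from them by roughly the size of h.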

Direction of maximal change
We're almost there! Now that we have a relatively simple way of computing any directional derivative from the partial derivatives of a function, we can figure out which direction to take to get the maximal change in the value of $f(x,y)$.
We can rewrite:

$$D_{\vec{v}}f = (\nabla f) \cdot \vec{v}$$

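As:

$$D_{\vec{v}}f = \|\nabla f\| \, \|\vec{v}\| \cos(\theta)$$

Where $\theta$ is the angle between the two vectors. Now, recall that $\vec{v}$ is normalized so its magnitude is 1. Therefore, we only care about the direction of $\vec{v}$ w.r.t. the gradient. When is this expression maximized? When $\theta = 0$, because then $\cos(\theta) = 1$. Since a cosine can never be larger than 1, that's the best we can have.
So $\theta = 0$ gives us the largest positive change in $f(x,y)$. To get $\theta = 0$, $\vec{v}$ has to point in the same direction as the gradient. Similarly, for $\theta = 180^\circ$ we get $\cos(\theta) = -1$ and therefore the largest negative change in $f(x,y)$. So if we want to decrease $f(x,y)$ the most, $\vec{v}$ has to point in the opposite direction of the gradient.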
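
This conclusion can also be checked by brute force: try unit vectors in many directions and see which one gives the largest and the smallest directional derivative. The sketch below assumes the function $f(x,y) = x^2 + y^2$ and an arbitrary sample point; `direction_angle` is measured from the x axis (it is not the angle $\theta$ between $\vec{v}$ and the gradient), and the one-degree grid is an arbitrary choice.

```python
import math


def grad_f(x, y):
    """Gradient of the assumed example f(x, y) = x^2 + y^2."""
    return (2 * x, 2 * y)


def directional_derivative(point, direction_angle):
    """Dot product of the gradient at `point` with the unit vector at the given
    angle (in radians, measured from the x axis)."""
    gx, gy = grad_f(*point)
    return gx * math.cos(direction_angle) + gy * math.sin(direction_angle)


point = (-1.5, 0.25)  # assumed sample point
angles = [math.radians(d) for d in range(360)]  # unit directions, 1 degree apart

best = max(angles, key=lambda t: directional_derivative(point, t))
worst = min(angles, key=lambda t: directional_derivative(point, t))

gx, gy = grad_f(*point)
grad_angle = math.degrees(math.atan2(gy, gx)) % 360

print("gradient direction (deg):  ", round(grad_angle, 1))
print("largest increase at (deg): ", round(math.degrees(best)))
print("largest decrease at (deg): ", round(math.degrees(worst)))
```

The direction of largest increase lands on the grid angle closest to the gradient's own direction, and the direction of largest decrease is 180 degrees away from it.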

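Gradient descent update for multivariate functions
To sum up, given some starting point $(x_0, y_0)$, to nudge it in the direction of the minimum of $f(x,y)$, we first compute the gradient of $f(x,y)$ at $(x_0, y_0)$. Then, we update (using vector notation):

$$\langle x_1, y_1 \rangle = \langle x_0, y_0 \rangle - \eta\nabla f(x_0, y_0)$$

Generalizing to multiple dimensions, let's say we have the function $f: \mathbb{R}^n \to \mathbb{R}$ taking the n-dimensional vector $\vec{x} = (x_1, x_2, \dots, x_n)$. We define the gradient update at step k to be:

$$\vec{x}_{(k)} = \vec{x}_{(k-1)} - \eta\nabla f(\vec{x}_{(k-1)})$$

Previously, for the single-variate case we said that the derivative points us the way to the minimum. Now we can say that while there are many ways to get to the minimum (eventually), the direction opposite to the gradient is the locally fastest way down from any given point.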
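
The n-dimensional update translates almost directly into code. The following is a minimal sketch rather than a production optimizer: it uses plain Python lists, a fixed learning rate and a fixed number of steps, and the test function (a sum of squares) and the starting point are arbitrary choices for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Apply x_(k) = x_(k-1) - learning_rate * grad(x_(k-1)) for `steps` iterations.

    `grad` maps a list of n floats to the gradient, also a list of n floats.
    """
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x


def grad_sum_of_squares(x):
    """Gradient of the illustrative function f(x) = x_1^2 + ... + x_n^2."""
    return [2 * xi for xi in x]


start = [-1.5, 0.25, 4.0]  # arbitrary starting point, n = 3
print(gradient_descent(grad_sum_of_squares, start))
```

For this function the minimum is at the origin, and every component of the result ends up very close to zero.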

Appendix: directional derivative definition and gradient
This is an optional section for those who don't like taking mathematical statements for granted. Now it's time to prove the equation shown earlier in the article, and on which its main result is based:

$$D_{\vec{v}}f = (\nabla f) \cdot \vec{v}$$

As usual with proofs, it really helps to start by working through an example or two to build up some intuition into why the equation works. Feel free to do that if you'd like, using any function, starting point and direction vector $\vec{v}$.
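Suppose we define a function $w(t)$ as follows:

$$w(t) = f(x, y)$$

Where $x = x(t)$ and $y = y(t)$ are defined as:

$$x(t) = x_0 + at$$
$$y(t) = y_0 + bt$$

In these definitions, $x_0$, $y_0$, a and b are constants, so both $x(t)$ and $y(t)$ are truly functions of a single variable. Using the chain rule, we know that:

$$\frac{dw}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$

Substituting the derivatives of $x(t)$ and $y(t)$, we get:

$$\frac{dw}{dt} = a\frac{\partial f}{\partial x} + b\frac{\partial f}{\partial y}$$

One more step, the significance of which will become clear shortly. Specifically, the derivative of $w(t)$ at $t = 0$ is:

$$\frac{d}{dt}w(0) = a\frac{\partial}{\partial x}f(x_0,y_0) + b\frac{\partial}{\partial y}f(x_0,y_0) \qquad (2)$$

Now let's see how to compute the derivative of $w(t)$ at $t = 0$ using the formal limit definition:

$$\frac{d}{dt}w(0) = \lim_{h \to 0}\frac{w(h) - w(0)}{h} = \lim_{h \to 0}\frac{f(x_0+ah,\,y_0+bh) - f(x_0,y_0)}{h}$$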

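But the latter is precisely the definition of the directional derivative in equation (1). Therefore, we can say that:

$$\frac{d}{dt}w(0) = D_{\vec{v}}f(x_0,y_0)$$

From this and (2), we get:

$$D_{\vec{v}}f(x_0,y_0) = a\frac{\partial}{\partial x}f(x_0,y_0) + b\frac{\partial}{\partial y}f(x_0,y_0)$$

This derivation is not special to the point $(x_0, y_0)$ - it works just as well for any point where $f(x,y)$ has partial derivatives w.r.t. x and y; therefore, for any point $(x, y)$ where $f(x,y)$ is differentiable:

$$D_{\vec{v}}f(x,y) = a\frac{\partial f}{\partial x} + b\frac{\partial f}{\partial y} = \left\langle \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right\rangle \cdot \langle a, b \rangle = (\nabla f) \cdot \vec{v}$$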
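
As a concrete instance of the derivation, take $f(x,y) = x^2 + y^2$ (an illustrative choice) and follow the same steps with $w(t) = f(x_0 + at, y_0 + bt)$:

```latex
w(t) = (x_0 + at)^2 + (y_0 + bt)^2

\frac{dw}{dt} = 2a(x_0 + at) + 2b(y_0 + bt)

\frac{d}{dt} w(0) = 2a x_0 + 2b y_0
                  = \langle 2x_0,\, 2y_0 \rangle \cdot \langle a,\, b \rangle
                  = \nabla f(x_0, y_0) \cdot \vec{v}
```

The last line uses the fact that the partial derivatives of this particular $f$ are $2x$ and $2y$.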

[1] The notation $\frac{d}{dx}f(x_0)$ means: the value of the derivative of $f$ w.r.t. x, evaluated at $x_0$. Another way to say the same would be $f'(x_0)$.

[2] That said, in some advanced variations of gradient descent we actually want to probe different areas of the function early on in the process, so a larger step makes sense (remember, realistic functions have many local minima and we want to find the best one). Further along in the optimization process, when we've settled on a general area of the function, we want the learning rate to be small so we actually get to the minimum. This approach is called annealing and I'll leave it for some future article.
[3] To avoid tracking vector magnitudes, from now on in the article we'll be dealing with normalized direction vectors. That is, we always assume that $\|\vec{v}\| = 1$.
[4] Yes, $\vec{e_y}$ is actually going in the opposite direction so it's $-\vec{e_y}$, but that really doesn't change anything. It was easier to draw :)

Comments
5 Comments

Charles Moyes, 2 months ago
Awesome article that explains the concepts quite well visually.
MathJax is your friend. All of the notation on this page looks super pixelated and blurry, upscaled many times on my 4k hidpi display.

Eli Bendersky (Mod) > Charles Moyes, 2 months ago
Why is it upscaled? Because the images would be small otherwise?
I considered MathJax, but it has its own issues like slowness and jankiness

Charles Moyes > Eli Bendersky, 2 months ago
It is upscaled to be displayed at a comfortably readable size on such a high density screen: http://imgur.com/a/FSW04

Miro Zoricak, 2 months ago
Great article! One small suggestion, the singular form of "minima" is "minimum", not "minima".

Eli Bendersky (Mod) > Miro Zoricak, 2 months ago
Thanks, I'll get it fixed

© 2003-2016 Eli Bendersky