
Hacker's guide to Neural Networks
Hi there, I'm a CS PhD student at Stanford. I've worked on Deep Learning for a few years as part of my research and among several of my related pet projects is ConvNetJS, a Javascript library for training Neural Networks. Javascript allows one to nicely visualize what's going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic. This article (which I plan to slowly expand out to lengths of a few book chapters) is my humble attempt. It's on web instead of PDF because all books should be, and eventually it will hopefully include animations/demos etc.

My personal experience with Neural Networks is that everything became much clearer when I started ignoring full-page, dense derivations of backpropagation equations and just started writing code. Thus, this tutorial will contain very little math (I don't believe it is necessary and it can sometimes even obfuscate simple concepts). Since my background is in Computer Science and Physics, I will instead develop the topic from what I refer to as a hacker's perspective. My exposition will center around code and physical intuitions instead of mathematical derivations. Basically, I will strive to present the algorithms in a way that I wish I had come across when I was starting out.

everything became much clearer when I started writing code.

You might be eager to jump right in and learn about Neural Networks, backpropagation, how they can be applied to datasets in practice, etc. But before we get there, I'd like us to first forget about all that. Let's take a step back and understand what is really going on at the core. Let's first talk about real-valued circuits.

Update note: I suspended my work on this guide a while ago and redirected a lot of my energy to teaching the CS231n (Convolutional Neural Networks) class at Stanford. The notes are on cs231n.github.io and the course slides can be found here. These materials are highly related to material here, but more comprehensive and sometimes more polished.

Chapter 1: Real-valued Circuits


In my opinion, the best way to think of Neural Networks is as real-valued circuits, where real values (instead of boolean values {0,1}) flow along edges and interact in gates. However, instead of gates such as AND, OR, NOT, etc, we have binary gates such as * (multiply), + (add), max or unary gates such as exp, etc. Unlike ordinary boolean circuits, however, we will eventually also have gradients flowing on the same edges of the circuit, but in the opposite direction. But we're getting ahead of ourselves. Let's focus and start out simple.

Base Case: Single Gate in the Circuit
Let's first consider a single, simple circuit with one gate. Here's an example:

[Diagram: the inputs x and y feed into a * gate, which produces the output.]

The circuit takes two real-valued inputs x and y and computes x * y with the * gate.


Javascript version of this would very simply look something like this:

var forwardMultiplyGate = function(x, y) {
  return x * y;
};
forwardMultiplyGate(-2, 3); // returns -6. Exciting.

And in math form we can think of this gate as implementing the real-valued function:

f(x, y) = x y

As with this example, all of our gates will take one or two inputs and produce a single output value.

The Goal
The problem we are interested in studying looks as follows:

1. We provide a given circuit some specific input values (e.g. x = -2, y = 3)
2. The circuit computes an output value (e.g. -6)
3. The core question then becomes: How should one tweak the input slightly to increase the output?

In this case, in what direction should we change x,y to get a number larger than -6? Note that, for example, x = -1.99 and y = 2.99 gives x * y = -5.95, which is higher than -6.0. Don't get confused by this: -5.95 is better (higher) than -6.0. It's an improvement of 0.05, even though the magnitude of -5.95 (the distance from zero) happens to be lower.
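
As a quick check of that claim (this snippet is just an illustration, not part of the original code):

forwardMultiplyGate(-1.99, 2.99); // returns -5.9501, which is indeed higher than -6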

Strategy #1: Random Local Search

Okay. So wait, we have a circuit, we have some inputs and we just want to tweak them slightly to increase the output value? Why is this hard? We can easily "forward" the circuit to compute the output for any given x and y. So isn't this trivial? Why don't we tweak x and y randomly and keep track of the tweak that works best:

// circuit with single gate for now
var forwardMultiplyGate = function(x, y) { return x * y; };
var x = -2, y = 3; // some input values

// try changing x,y randomly small amounts and keep track of what works best
var tweak_amount = 0.01;
var best_out = -Infinity;
var best_x = x, best_y = y;
for(var k = 0; k < 100; k++) {
  var x_try = x + tweak_amount * (Math.random() * 2 - 1); // tweak x a bit
  var y_try = y + tweak_amount * (Math.random() * 2 - 1); // tweak y a bit
  var out = forwardMultiplyGate(x_try, y_try);
  if(out > best_out) {
    // best improvement yet! Keep track of the x and y
    best_out = out;
    best_x = x_try, best_y = y_try;
  }
}

When I run this, I get best_x = -1.9928, best_y = 2.9901, and best_out = -5.9588. Again, -5.9588 is higher than -6.0. So, we're done, right? Not quite: This is a perfectly fine strategy for tiny problems with a few gates if you can afford the compute time, but it won't do if we want to eventually consider huge circuits with millions of inputs. It turns out that we can do much better.

Strategy #2: Numerical Gradient

Here's a better way. Remember again that in our setup we are given a circuit (e.g. our circuit with a single * gate) and some particular input (e.g. x = -2, y = 3). The gate computes the output (-6) and now we'd like to tweak x and y to make the output higher.

A nice intuition for what we're about to do is as follows: Imagine taking the output value that comes out from the circuit and tugging on it in the positive direction. This positive tension will in turn translate through the gate and induce forces on the inputs x and y. Forces that tell us how x and y should change to increase the output value.

What might those forces look like in our specific example? Thinking through it, we can intuit that the force on x should also be positive, because making x slightly larger improves the circuit's output. For example, increasing x from x = -2 to x = -1 would give us output -3, much larger than -6. On the other hand, we'd expect a negative force induced on y that pushes it to become lower (since a lower y, such as y = 2, down from the original y = 3, would make output higher: -2 x 2 = -4, again, larger than -6). That's the intuition to keep in mind, anyway. As we go through this, it will turn out that the forces I'm describing will in fact turn out to be the derivative of the output value with respect to its inputs (x and y). You may have heard this term before.

The derivative can be thought of as a force on each input as we pull on the output to become higher.

So how do we exactly evaluate this force (derivative)? It turns out that there is a very simple procedure for this. We will work backwards: Instead of pulling on the circuit's output, we'll iterate over every input one by one, increase it very slightly and look at what happens to the output value. The amount the output changes in response is the derivative. Enough intuitions for now. Let's look at the mathematical definition. We can write down the derivative for our function with respect to the inputs. For example, the derivative with respect to x can be computed as:

\frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h}

Where h is small; it's the tweak amount. Also, if you're not very familiar with calculus it is important to note that in the left-hand side of the equation above, the horizontal line does not indicate division. The entire symbol \frac{\partial f(x,y)}{\partial x} is a single thing: the derivative of the function f(x,y) with respect to x. The horizontal line on the right is division. I know it's confusing but it's standard notation. Anyway, I hope it doesn't look too scary because it isn't: The circuit was giving some initial output f(x,y), and then we changed one of the inputs by a tiny amount h and read the new output f(x+h,y). Subtracting those two quantities tells us the change, and the division by h just normalizes this change by the (arbitrary) tweak amount we used. In other words it's expressing exactly what I described above and translates directly to this code:

var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // -6
var h = 0.0001;

// compute derivative with respect to x
var xph = x + h; // -1.9999
var out2 = forwardMultiplyGate(xph, y); // -5.9997
var x_derivative = (out2 - out) / h; // 3.0

// compute derivative with respect to y
var yph = y + h; // 3.0001
var out3 = forwardMultiplyGate(x, yph); // -6.0002
var y_derivative = (out3 - out) / h; // -2.0

Let's walk through x for example. We turned the knob from x to x + h and the circuit responded by giving a higher value (note again that yes, -5.9997 is higher than -6: -5.9997 > -6). The division by h is there to normalize the circuit's response by the (arbitrary) value of h we chose to use here. Technically, you want the value of h to be infinitesimal (the precise mathematical definition of the gradient is defined as the limit of the expression as h goes to zero), but in practice h = 0.00001 or so works fine in most cases to get a good approximation. Now, we see that the derivative w.r.t. x is +3. I'm making the positive sign explicit, because it indicates that the circuit is tugging on x to become higher. The actual value, 3, can be interpreted as the force of that tug.

The derivative with respect to some input can be computed by tweaking that input by a small amount and observing the change on the output value.

By the way, we usually talk about the derivative with respect to a single input, or about a gradient with respect to all the inputs. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (i.e. a list). Crucially, notice that if we let the inputs respond to the tug by following the gradient a tiny amount (i.e. we just add the derivative on top of every input), we can see that the value increases, as expected:

var step_size = 0.01;
var out = forwardMultiplyGate(x, y); // before: -6
x = x + step_size * x_derivative; // x becomes -1.97
y = y + step_size * y_derivative; // y becomes 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87! exciting.

As expected, we changed the inputs by the gradient and the circuit now gives a slightly higher value (-5.87 > -6.0). That was much simpler than trying random changes to x and y, right? A fact to appreciate here is that if you take calculus you can prove that the gradient is, in fact, the direction of the steepest increase of the function. There is no need to monkey around trying out random perturbations as done in Strategy #1. Evaluating the gradient requires just three evaluations of the forward pass of our circuit instead of hundreds, and gives the best tug you can hope for (locally) if you are interested in increasing the value of the output.
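
To make the procedure concrete, here is a small generic helper (my own sketch, not code from the guide) that packages Strategy #2: it numerically estimates the gradient of any forward function by tweaking one input at a time:

var numericalGradient = function(f, inputs) {
  var h = 0.0001;
  var base = f.apply(null, inputs);      // unperturbed output
  var grad = [];
  for(var i = 0; i < inputs.length; i++) {
    var tweaked = inputs.slice();        // copy the inputs
    tweaked[i] += h;                     // tweak input i a tiny bit
    grad.push((f.apply(null, tweaked) - base) / h);
  }
  return grad;
};
numericalGradient(forwardMultiplyGate, [-2, 3]); // approximately [3, -2]

Note that this needs one extra forward pass per input, which is exactly the linear cost we will complain about in Strategy #3.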

Bigger step is not always better. Let me clarify on this point a bit. It is important to note that in this very simple example, using a bigger step_size than 0.01 will always work better. For example, step_size = 1.0 gives output 1 (higher, better!), and indeed infinite step size would give infinitely good results. The crucial thing to realize is that once our circuits get much more complex (e.g. entire neural networks), the function from inputs to the output value will be more chaotic and wiggly. The gradient guarantees that if you have a very small (indeed, infinitesimally small) step size, then you will definitely get a higher number when you follow its direction, and for that infinitesimally small step size there is no other direction that would have worked better. But if you use a bigger step size (e.g. step_size = 0.01) all bets are off. The reason we can get away with a larger step size than infinitesimally small is that our functions are usually relatively smooth. But really, we're crossing our fingers and hoping for the best.
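
You can check the step_size = 1.0 claim directly with the derivatives we computed (x_derivative = 3, y_derivative = -2); this one-liner is just an illustration:

forwardMultiplyGate(-2 + 1.0 * 3, 3 + 1.0 * (-2)); // 1 * 1 = 1, much higher than -6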

Hill climbing analogy. One analogy I've heard before is that the output value of our circuit is like the height of a hill, and we are blindfolded and trying to climb upwards. We can sense the steepness of the hill at our feet (the gradient), so when we shuffle our feet a bit we will go upwards. But if we took a big, overconfident step, we could have stepped right into a hole.

Great, I hope I've convinced you that the numerical gradient is indeed a very useful thing to evaluate, and that it is cheap. But. It turns out that we can do even better.

Strategy #3: Analytic Gradient
In the previous section we evaluated the gradient by probing the circuit's output value, independently for every input. This procedure gives you what we call a numerical gradient. This approach, however, is still expensive because we need to compute the circuit's output as we tweak every input value independently a small amount. So the complexity of evaluating the gradient is linear in number of inputs. But in practice we will have hundreds, thousands or (for neural networks) even tens to hundreds of millions of inputs, and the circuits aren't just one multiply gate but huge expressions that can be expensive to compute. We need something better.

Luckily, there is an easier and much faster way to compute the gradient: we can use calculus to derive a direct expression for it that will be as simple to evaluate as the circuit's output value. We call this an analytic gradient and there will be no need for tweaking anything. You may have seen other people who teach Neural Networks derive the gradient in huge and, frankly, scary and confusing mathematical equations (if you're not well versed in maths). But it's unnecessary. I've written plenty of Neural Nets code and I rarely have to do mathematical derivation longer than two lines, and 95% of the time it can be done without writing anything at all. That is because we will only ever derive the gradient for very small and simple expressions (think of it as the base case) and then I will show you how we can compose these very simply with chain rule to evaluate the full gradient (think inductive/recursive case).

The analytic derivative requires no tweaking of the inputs. It can be derived using mathematics (calculus).

If you remember your product rules, power rules, quotient rules, etc. (see e.g. derivative rules or wiki page), it's very easy to write down the derivative with respect to both x and y for a small expression such as x * y. But suppose you don't remember your calculus rules. We can go back to the definition. For example, here's the expression for the derivative w.r.t x:

\frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h}

(Technically I'm not writing the limit as h goes to zero, forgive me math people). Okay and let's plug in our function (f(x,y) = xy) into the expression. Ready for the hardest piece of math of this entire article? Here we go:

\frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} = \frac{(x+h)y - xy}{h} = \frac{xy + hy - xy}{h} = \frac{hy}{h} = y

That's interesting. The derivative with respect to x is just equal to y. Did you notice the coincidence in the previous section? We tweaked x to x + h and calculated x_derivative = 3.0, which exactly happens to be the value of y in that example. It turns out that wasn't a coincidence at all because that's just what the analytic gradient tells us the x derivative should be for f(x,y) = x * y. The derivative with respect to y, by the way, turns out to be x, unsurprisingly by symmetry. So there is no need for any tweaking! We invoked powerful mathematics and can now transform our derivative calculation into the following code:


var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // before: -6
var x_gradient = y; // by our complex mathematical derivation above
var y_gradient = x;

var step_size = 0.01;
x += step_size * x_gradient; // x becomes -1.97
y += step_size * y_gradient; // y becomes 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87. Higher output! Nice.

To compute the gradient we went from forwarding the circuit hundreds of times (Strategy #1) to forwarding it only on the order of twice the number of inputs (Strategy #2), to forwarding it a single time! And it gets EVEN better, since the more expensive strategies (#1 and #2) only give an approximation of the gradient, while #3 (the fastest one by far) gives you the exact gradient. No approximations. The only downside is that you should be comfortable with some calculus 101.

Let's recap what we have learned:

- INPUT: We are given a circuit, some inputs and compute an output value.
- OUTPUT: We are then interested in finding small changes to each input (independently) that would make the output higher.
- Strategy #1: One silly way is to randomly search for small perturbations of the inputs and keep track of what gives the highest increase in output.
- Strategy #2: We saw we can do much better by computing the gradient. Regardless of how complicated the circuit is, the numerical gradient is very simple (but relatively expensive) to compute. We compute it by probing the circuit's output value as we tweak the inputs one at a time.
- Strategy #3: In the end, we saw that we can be even more clever and analytically derive a direct expression to get the analytic gradient. It is identical to the numerical gradient, it is fastest by far, and there is no need for any tweaking.

In practice by the way (and we will get to this once again later), all Neural Network libraries always compute the analytic gradient, but the correctness of the implementation is verified by comparing it to the numerical gradient. That's because the numerical gradient is very easy to evaluate (but can be a bit expensive to compute), while the analytic gradient can contain bugs at times, but is usually extremely efficient to compute. As we will see, evaluating the gradient (i.e. while doing backprop, or backward pass) will turn out to cost about as much as evaluating the forward pass.
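
For our single * gate, such a gradient check is a one-liner. The snippet below is a minimal sketch of the idea, not code from the original text:

var x = -2, y = 3, h = 0.0001;
var analytic_dx = y; // Strategy #3: the derivative w.r.t. x is just y
var numeric_dx = (forwardMultiplyGate(x + h, y) - forwardMultiplyGate(x, y)) / h; // Strategy #2
Math.abs(analytic_dx - numeric_dx) < 1e-6; // true: the two agree up to tiny numerical error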

Recursive Case: Circuits with Multiple Gates
But hold on, you say: "The analytic gradient was trivial to derive for your super-simple expression. This is useless. What do I do when the expressions are much larger? Don't the equations get huge and complex very fast?". Good question. Yes the expressions get much more complex. No, this doesn't make it much harder. As we will see, every gate will be hanging out by itself, completely unaware of any details of the huge and complex circuit that it could be part of. It will only worry about its inputs and it will compute its local derivatives as seen in the previous section, except now there will be a single extra multiplication it will have to do.

A single extra multiplication will turn a single (useless gate) into a cog in the complex machine that is an entire neural network.

I should stop hyping it up now. I hope I've piqued your interest! Let's drill down into details and get two gates involved with this next example:

[Diagram: inputs x and y feed into a + gate producing q; q and z feed into a * gate producing the output f.]

The expression we are computing now is f(x, y, z) = (x + y) z. Let's structure the code as follows to make the gates explicit as functions:

var forwardMultiplyGate = function(a, b) {
  return a * b;
};
var forwardAddGate = function(a, b) {
  return a + b;
};
var forwardCircuit = function(x, y, z) {
  var q = forwardAddGate(x, y);
  var f = forwardMultiplyGate(q, z);
  return f;
};

var x = -2, y = 5, z = -4;
var f = forwardCircuit(x, y, z); // output is -12


In the above, I am using a and b as the local variables in the gate functions so that we don't get these confused with our circuit inputs x,y,z. As before, we are interested in finding the derivatives with respect to the three inputs x,y,z. But how do we compute it now that there are multiple gates involved? First, let's pretend that the + gate is not there and that we only have two variables in the circuit: q,z and a single * gate. Note that q is the output of the + gate. If we don't worry about x and y but only about q and z, then we are back to having only a single gate, and as far as that single * gate is concerned, we know what the (analytic) derivatives are from the previous section. We can write them down (except here we're replacing x,y with q,z):

f(q, z) = q z \quad \implies \quad \frac{\partial f(q,z)}{\partial q} = z, \quad \frac{\partial f(q,z)}{\partial z} = q

Simple enough: these are the expressions for the gradient with respect to q and z. But wait, we don't want the gradient with respect to q, but with respect to the inputs: x and y. Luckily, q is computed as a function of x and y (by addition in our example). We can write down the gradient for the addition gate as well, it's even simpler:

q(x, y) = x + y \quad \implies \quad \frac{\partial q(x,y)}{\partial x} = 1, \quad \frac{\partial q(x,y)}{\partial y} = 1

That's right, the derivatives are just 1, regardless of the actual values of x and y. If you think about it, this makes sense because to make the output of a single addition gate higher, we expect a positive tug on both x and y, regardless of their values.
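
You can also confirm this numerically, in the same way we probed the * gate (a tiny check, not part of the original code):

var h = 0.0001;
(forwardAddGate(-2 + h, 5) - forwardAddGate(-2, 5)) / h; // 1: the derivative w.r.t. x
(forwardAddGate(-2, 5 + h) - forwardAddGate(-2, 5)) / h; // 1: the derivative w.r.t. y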

Backpropagation

We are finally ready to invoke the Chain Rule: We know how to compute the gradient of q with respect to x and y (that's a single gate case with + as the gate). And we know how to compute the gradient of our final output with respect to q. The chain rule tells us how to combine these to get the gradient of the final output with respect to x and y, which is what we're ultimately interested in. Best of all, the chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them. For example, the final derivative for x will be:

\frac{\partial f(q,z)}{\partial x} = \frac{\partial q(x,y)}{\partial x} \frac{\partial f(q,z)}{\partial q}

There are many symbols there so maybe this is confusing again, but it's really just two numbers being multiplied together. Here is the code:


// initial conditions
var x = -2, y = 5, z = -4;
var q = forwardAddGate(x, y); // q is 3
var f = forwardMultiplyGate(q, z); // output is -12

// gradient of the MULTIPLY gate with respect to its inputs
// wrt is short for "with respect to"
var derivative_f_wrt_z = q; // 3
var derivative_f_wrt_q = z; // -4

// derivative of the ADD gate with respect to its inputs
var derivative_q_wrt_x = 1.0;
var derivative_q_wrt_y = 1.0;

// chain rule
var derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q; // -4
var derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q; // -4

That's it. We computed the gradient (the forces) and now we can let our inputs respond to it by a bit. Let's add the gradients on top of the inputs. The output value of the circuit better increase, up from -12!

// final gradient, from above: [-4, -4, 3]
var gradient_f_wrt_xyz = [derivative_f_wrt_x, derivative_f_wrt_y, derivative_f_wrt_z];

// let the inputs respond to the force/tug:
var step_size = 0.01;
x = x + step_size * derivative_f_wrt_x; // -2.04
y = y + step_size * derivative_f_wrt_y; // 4.96
z = z + step_size * derivative_f_wrt_z; // -3.97

// Our circuit now better give higher output:
var q = forwardAddGate(x, y); // q becomes 2.92
var f = forwardMultiplyGate(q, z); // output is -11.59, up from -12! Nice!

Looks like that worked! Let's now try to interpret intuitively what just happened. The circuit wants to output higher values. The last gate saw inputs q = 3, z = -4 and computed output -12. Pulling upwards on this output value induced a force on both q and z: To increase the output value, the circuit wants z to increase, as can be seen by the positive value of the derivative (derivative_f_wrt_z = +3). Again, the size of this derivative can be interpreted as the magnitude of the force. On the other hand, q felt a stronger and downward force, since derivative_f_wrt_q = -4. In other words the circuit wants q to decrease, with a force of 4.

Now we get to the second, + gate which outputs q. By default, the + gate computes its derivatives which tell us how to change x and y to make q higher. BUT! Here is the crucial point: the gradient on q was computed as negative (derivative_f_wrt_q = -4), so the circuit wants q to decrease, and with a force of 4! So if the + gate wants to contribute to making the final output value larger, it needs to listen to the gradient signal coming from the top. In this particular case, it needs to apply tugs on x,y opposite of what it would normally apply, and with a force of 4, so to speak. The multiplication by -4 seen in the chain rule achieves exactly this: instead of applying a positive force of +1 on both x and y (the local derivative), the full circuit's gradient on both x and y becomes 1 x -4 = -4. This makes sense: the circuit wants both x and y to get smaller because this will make q smaller, which in turn will make f larger.

If this makes sense, you understand backpropagation.

Let's recap once again what we learned:

- In the previous chapter we saw that in the case of a single gate (or a single expression), we can derive the analytic gradient using simple calculus. We interpreted the gradient as a force, or a tug on the inputs that pulls them in a direction which would make this gate's output higher.
- In case of multiple gates everything stays pretty much the same way: every gate is hanging out by itself completely unaware of the circuit it is embedded in. Some inputs come in and the gate computes its output and the derivative with respect to the inputs. The only difference now is that suddenly, something can pull on this gate from above. That's the gradient of the final circuit output value with respect to the output this gate computed. It is the circuit asking the gate to output higher or lower numbers, and with some force. The gate simply takes this force and multiplies it to all the forces it computed for its inputs before (chain rule). This has the desired effect:

1. If a gate experiences a strong positive pull from above, it will also pull harder on its own inputs, scaled by the force it is experiencing from above
2. And if it experiences a negative tug, this means that the circuit wants its value to decrease not increase, so it will flip the force of the pull on its inputs to make its own output value smaller.


A nice picture to have in mind is that as we pull on the circuit's output value at the end, this induces pulls downward through the entire circuit, all the way down to the inputs.

Isn't it beautiful? The only difference between the case of a single gate and multiple interacting gates that compute arbitrarily complex expressions is this additional multiply operation that now happens in each gate.

Patterns in the backward flow

Let's look again at our example circuit with the numbers filled in. The first circuit shows the raw values, and the second circuit shows the gradients that flow back to the inputs as discussed. Notice that the gradient always starts off with +1 at the end to start off the chain. This is the (default) pull on the circuit to have its value increased.

[Diagram: the same circuit drawn twice. The first copy shows the forward values (x = -2, y = 5, q = 3, z = -4, output -12); the second copy shows the gradients flowing backward (+1 at the output, -4 on q, 3 on z, and -4 on both x and y).]

After a while you start to notice patterns in how the gradients flow backward in the circuits. For example, the + gate always takes the gradient on top and simply passes it on to all of its inputs (notice the example with -4 simply passed on to both of the inputs of the + gate). This is because its own derivative for the inputs is just +1, regardless of what the actual values of the inputs are, so in the chain rule, the gradient from above is just multiplied by 1 and stays the same. Similar intuitions apply to, for example, a max(x,y) gate. Since the gradient of max(x,y) with respect to its input is +1 for whichever one of x, y is larger and 0 for the other, this gate is during backprop effectively just a gradient switch: it will take the gradient from above and route it to the input that had a higher value during the forward pass.
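
You can see the max "gradient switch" numerically with the same probing trick as before (a small illustration, not from the original text); with inputs 3 and 1, only the larger input feels the tug:

var h = 0.0001;
(Math.max(3 + h, 1) - Math.max(3, 1)) / h; // 1: the larger input receives the full gradient
(Math.max(3, 1 + h) - Math.max(3, 1)) / h; // 0: the smaller input receives none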


Numerical Gradient Check. Before we finish with this section, let's just make sure that the (analytic) gradient we computed by backprop above is correct as a sanity check. Remember that we can do this simply by computing the numerical gradient and making sure that we get [-4, -4, 3] for x,y,z. Here's the code:

// initial conditions
var x = -2, y = 5, z = -4;

// numerical gradient check
var h = 0.0001;
var x_derivative = (forwardCircuit(x+h, y, z) - forwardCircuit(x, y, z)) / h; // -4
var y_derivative = (forwardCircuit(x, y+h, z) - forwardCircuit(x, y, z)) / h; // -4
var z_derivative = (forwardCircuit(x, y, z+h) - forwardCircuit(x, y, z)) / h; // 3

and we get [-4, -4, 3], as computed with backprop. phew! :)

Example: Single Neuron
In the previous section you hopefully got the basic intuition behind backpropagation. Let's now look at an even more complicated and borderline practical example. We will consider a 2-dimensional neuron that computes the following function:

f(x, y, a, b, c) = \sigma(ax + by + c)

In this expression, \sigma is the sigmoid function. It's best thought of as a "squashing function", because it takes the input and squashes it to be between zero and one: Very negative values are squashed towards zero and positive values get squashed towards one. For example, we have sig(-5) = 0.006, sig(0) = 0.5, sig(5) = 0.993. The sigmoid function is defined as:

\sigma(x) = \frac{1}{1 + e^{-x}}

The gradient with respect to its single input, as you can check on Wikipedia or derive yourself if you know some calculus, is given by this expression:

\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))

For example, if the input to the sigmoid gate is x = 3, the gate will compute output f = 1.0 / (1.0 + Math.exp(-x)) = 0.95, and then the (local) gradient on its input will simply be dx = (0.95) * (1 - 0.95) = 0.0475.
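
Quickly verifying these numbers (a one-off check, not part of the original code):

var sig = function(x) { return 1 / (1 + Math.exp(-x)); };
sig(3);                // 0.9526, the 0.95 quoted above
sig(3) * (1 - sig(3)); // 0.0452, the local gradient (0.0475 if you plug in the rounded 0.95)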


That's all we need to use this gate: we know how to take an input and forward it through the sigmoid gate, and we also have the expression for the gradient with respect to its input, so we can also backprop through it. Another thing to note is that technically, the sigmoid function is made up of an entire series of gates in a line that compute more atomic functions: an exponentiation gate, an addition gate and a division gate. Treating it so would work perfectly fine but for this example I chose to collapse all of these gates into a single gate that just computes sigmoid in one shot, because the gradient expression turns out to be simple.

Let's take this opportunity to carefully structure the associated code in a nice and modular way. First, I'd like you to note that every wire in our diagrams has two numbers associated with it:

1. the value it carries during the forward pass
2. the gradient (i.e the pull) that flows back through it in the backward pass

Let's create a simple Unit structure that will store these two values on every wire. Our gates will now operate over Units: they will take them as inputs and create them as outputs.

// every Unit corresponds to a wire in the diagrams
var Unit = function(value, grad) {
  // value computed in the forward pass
  this.value = value;
  // the derivative of circuit output w.r.t this unit, computed in backward pass
  this.grad = grad;
}

In addition to Units we also need 3 gates: +, * and sig (sigmoid). Let's start out by implementing a multiply gate. I'm using Javascript here which has a funny way of simulating classes using functions. If you're not a Javascript-familiar person, all that's going on here is that I'm defining a class that has certain properties (accessed with use of this keyword), and some methods (which in Javascript are placed into the function's prototype). Just think about these as class methods. Also keep in mind that the way we will use these eventually is that we will first forward all the gates one by one, and then backward all the gates in reverse order. Here is the implementation:

var multiplyGate = function(){ };
multiplyGate.prototype = {
  forward: function(u0, u1) {
    // store pointers to input Units u0 and u1 and output unit utop
    this.u0 = u0;
    this.u1 = u1;
    this.utop = new Unit(u0.value * u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // take the gradient in output unit and chain it with the
    // local gradients, which we derived for multiply gate before
    // then write those gradients to those Units.
    this.u0.grad += this.u1.value * this.utop.grad;
    this.u1.grad += this.u0.value * this.utop.grad;
  }
}

The multiply gate takes two units that each hold a value and creates a unit that stores its output. The gradient is initialized to zero. Then notice that in the backward function call we get the gradient from the output unit we produced during the forward pass (which will by now hopefully have its gradient filled in) and multiply it with the local gradient for this gate (chain rule!). This gate computes multiplication (u0.value * u1.value) during forward pass, so recall that the gradient w.r.t u0 is u1.value and w.r.t u1 is u0.value. Also note that we are using += to add onto the gradient in the backward function. This will allow us to possibly use the output of one gate multiple times (think of it as a wire branching out), since it turns out that the gradients from these different branches just add up when computing the final gradient with respect to the circuit output. The other two gates are defined analogously:

var addGate = function(){ };
addGate.prototype = {
  forward: function(u0, u1) {
    this.u0 = u0;
    this.u1 = u1; // store pointers to input units
    this.utop = new Unit(u0.value + u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // add gate. derivative wrt both inputs is 1
    this.u0.grad += 1 * this.utop.grad;
    this.u1.grad += 1 * this.utop.grad;
  }
}

var sigmoidGate = function() {
  // helper function
  this.sig = function(x) { return 1 / (1 + Math.exp(-x)); };
};
sigmoidGate.prototype = {
  forward: function(u0) {
    this.u0 = u0;
    this.utop = new Unit(this.sig(this.u0.value), 0.0);
    return this.utop;
  },
  backward: function() {
    var s = this.sig(this.u0.value);
    this.u0.grad += (s * (1 - s)) * this.utop.grad;
  }
}

Note that, again, the backward function in all cases just computes the local derivative with respect to its input and then multiplies on the gradient from the unit above (i.e. chain rule). To fully specify everything let's finally write out the forward and backward flow for our 2-dimensional neuron with some example values:

// create input units
var a = new Unit(1.0, 0.0);
var b = new Unit(2.0, 0.0);
var c = new Unit(-3.0, 0.0);
var x = new Unit(-1.0, 0.0);
var y = new Unit(3.0, 0.0);

// create the gates
var mulg0 = new multiplyGate();
var mulg1 = new multiplyGate();
var addg0 = new addGate();
var addg1 = new addGate();
var sg0 = new sigmoidGate();

// do the forward pass
var forwardNeuron = function() {
  ax = mulg0.forward(a, x); // a*x = -1
  by = mulg1.forward(b, y); // b*y = 6
  axpby = addg0.forward(ax, by); // a*x + b*y = 5
  axpbypc = addg1.forward(axpby, c); // a*x + b*y + c = 2
  s = sg0.forward(axpbypc); // sig(a*x + b*y + c) = 0.8808
};
forwardNeuron();

console.log('circuit output: ' + s.value); // prints 0.8808


And now let's compute the gradient: Simply iterate in reverse order and call the backward function! Remember that we stored the pointers to the units when we did the forward pass, so every gate has access to its inputs and also the output unit it previously produced.

s.grad = 1.0;
sg0.backward(); // writes gradient into axpbypc
addg1.backward(); // writes gradients into axpby and c
addg0.backward(); // writes gradients into ax and by
mulg1.backward(); // writes gradients into b and y
mulg0.backward(); // writes gradients into a and x

Note that the first line sets the gradient at the output (very last unit) to be 1.0 to start off the gradient chain. This can be interpreted as tugging on the last gate with a force of +1. In other words, we are pulling on the entire circuit to induce the forces that will increase the output value. If we did not set this to 1, all gradients would be computed as zero due to the multiplications in the chain rule. Finally, let's make the inputs respond to the computed gradients and check that the function increased:

var step_size = 0.01;
a.value += step_size * a.grad; // a.grad is -0.105
b.value += step_size * b.grad; // b.grad is 0.315
c.value += step_size * c.grad; // c.grad is 0.105
x.value += step_size * x.grad; // x.grad is 0.105
y.value += step_size * y.grad; // y.grad is 0.210

forwardNeuron();
console.log('circuit output after one backprop: ' + s.value); // prints 0.8825

Success! 0.8825 is higher than the previous value, 0.8808. Finally, let's verify that we implemented the backpropagation correctly by checking the numerical gradient:

var forwardCircuitFast = function(a, b, c, x, y) {
  return 1 / (1 + Math.exp(-(a*x + b*y + c)));
};
var a = 1, b = 2, c = -3, x = -1, y = 3;
var h = 0.0001;
var a_grad = (forwardCircuitFast(a+h, b, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var b_grad = (forwardCircuitFast(a, b+h, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var c_grad = (forwardCircuitFast(a, b, c+h, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var x_grad = (forwardCircuitFast(a, b, c, x+h, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var y_grad = (forwardCircuitFast(a, b, c, x, y+h) - forwardCircuitFast(a, b, c, x, y)) / h;


Indeed, these all give the same values as the backpropagated gradients [-0.105, 0.315, 0.105, 0.105, 0.210]. Nice!

I hope it is clear that even though we only looked at an example of a single neuron, the code I gave above generalizes in a very straightforward way to compute gradients of arbitrary expressions (including very deep expressions #foreshadowing). All you have to do is write small gates that compute local, simple derivatives w.r.t their inputs, wire it up in a graph, do a forward pass to compute the output value and then a backward pass that chains the gradients all the way to the input.

Becoming a Backprop Ninja
Over time you will become much more efficient in writing the backward pass, even for complicated circuits and all at once. Let's practice backprop a bit with a few examples. In what follows, let's not worry about Unit, Circuit classes because they obfuscate things a bit, and let's just use variables such as a,b,c,x, and refer to their gradients as da,db,dc,dx respectively. Again, we think of the variables as the "forward flow" and their gradients as "backward flow" along every wire. Our first example was the * gate:

var x = a * b;
// and given gradient on x (dx), we saw that in backprop we would compute:
var da = b * dx;
var db = a * dx;

In the code above, I'm assuming that the variable dx is given, coming from somewhere above us in the circuit while we're doing backprop (or it is +1 by default otherwise). I'm writing it out because I want to explicitly show how the gradients get chained together. Note from the equations that the * gate acts as a switcher during backward pass, for lack of better word. It remembers what its inputs were, and the gradients on each one will be the value of the other during the forward pass. And then of course we have to multiply with the gradient from above, which is the chain rule. Here's the + gate in this condensed form:

var x = a + b;
// =>
var da = 1.0 * dx;
var db = 1.0 * dx;

Where 1.0 is the local gradient, and the multiplication is our chain rule. What about adding three numbers?:

// lets compute x = a + b + c in two steps:
var q = a + b; // gate 1
var x = q + c; // gate 2

// backward pass:
dc = 1.0 * dx; // backprop gate 2
dq = 1.0 * dx;
da = 1.0 * dq; // backprop gate 1
db = 1.0 * dq;

You can see what's happening, right? If you remember the backward flow diagram, the + gate simply takes the gradient on top and routes it equally to all of its inputs (because its local gradient is always simply 1.0 for all its inputs, regardless of their actual values). So we can do it much faster:

var x = a + b + c;
var da = 1.0 * dx; var db = 1.0 * dx; var dc = 1.0 * dx;

Okay, how about combining gates?:

var x = a * b + c;
// given dx, backprop in one sweep would be =>
da = b * dx;
db = a * dx;
dc = 1.0 * dx;

If you don't see how the above happened, introduce a temporary variable q = a * b and then compute x = q + c to convince yourself. And here is our neuron, let's do it in two steps:

// let's do our neuron in two steps:
var q = a*x + b*y + c;
var f = sig(q); // sig is the sigmoid function
// and now backward pass, we are given df, and:
var df = 1;
var dq = (f * (1 - f)) * df;
// and now we chain it to the inputs
var da = x * dq;
var dx = a * dq;
var dy = b * dq;
var db = y * dq;
var dc = 1.0 * dq;


I hope this is starting to make a little more sense. Now how about this:

var x = a * a;
var da = //???

You can think of this as value a flowing to the * gate, but the wire gets split and becomes both inputs. This is actually simple because the backward flow of gradients always adds up. In other words nothing changes:

var da = a * dx; // gradient into a from first branch
da += a * dx; // and add on the gradient from the second branch

// short form instead is:
var da = 2 * a * dx;

In fact, if you know your power rule from calculus you would also know that if you have f(a) = a^2, then \frac{\partial f(a)}{\partial a} = 2a, which is exactly what we get if we think of it as a wire splitting up and being two inputs to a gate.

Let's do another one:

var x = a*a + b*b + c*c;
// we get:
var da = 2*a*dx;
var db = 2*b*dx;
var dc = 2*c*dx;

Okay now let's start to get more complex:

var x = Math.pow(((a * b + c) * d), 2); // pow(x,2) squares the input JS

When more complex cases like this come up in practice, I like to split the expression into manageable chunks which are almost always composed of simpler expressions and then I chain them together with chain rule:

var x1 = a * b + c;
var x2 = x1 * d;
var x = x2 * x2; // this is identical to the above expression for x
// and now in backprop we go backwards:
var dx2 = 2 * x2 * dx; // backprop into x2
var dd = x1 * dx2; // backprop into d
var dx1 = d * dx2; // backprop into x1
var da = b * dx1;
var db = a * dx1;
var dc = 1.0 * dx1; // done!

That wasn't too difficult! Those are the backprop equations for the entire expression, and we've done them piece by piece and backpropped to all the variables. Notice again how for every variable during forward pass we have an equivalent variable during backward pass that contains its gradient with respect to the circuit's final output. Here are a few more useful functions and their local gradients that are useful in practice:

var x = 1.0 / a; // division
var da = -1.0 / (a * a);

Here's what division might look like in practice then:

var x = (a + b) / (c + d);
// let's decompose it in steps:
var x1 = a + b;
var x2 = c + d;
var x3 = 1.0 / x2;
var x = x1 * x3; // equivalent to above
// and now backprop, again in reverse order:
var dx1 = x3 * dx;
var dx3 = x1 * dx;
var dx2 = (-1.0 / (x2 * x2)) * dx3; // local gradient as shown above, and chain rule
var da = 1.0 * dx1; // and finally into the original variables
var db = 1.0 * dx1;
var dc = 1.0 * dx2;
var dd = 1.0 * dx2;

Hopefully you see that we are breaking down expressions, doing the forward pass, and then for every variable (such as a) we derive its gradient da as we go backwards, one by one, applying the simple local gradients and chaining them with gradients from above. Here's another one:

var x = Math.max(a, b);
var da = a === x ? 1.0 * dx : 0.0;
var db = b === x ? 1.0 * dx : 0.0;


Okay this is making a very simple thing hard to read. The max function passes on the value of the input that was largest and ignores the other ones. In the backward pass then, the max gate will simply take the gradient on top and route it to the input that actually flowed through it during the forward pass. The gate acts as a simple switch based on which input had the highest value during forward pass. The other inputs will have zero gradient. That's what the === is about, since we are testing for which input was the actual max and only routing the gradient to it.

Finally, let's look at the Rectified Linear Unit non-linearity (or ReLU), which you may have heard of. It is used in Neural Networks in place of the sigmoid function. It is simply thresholding at zero:

var x = Math.max(a, 0);
// backprop through this gate will then be:
var da = a > 0 ? 1.0 * dx : 0.0;

In other words this gate simply passes the value through if it's larger than 0, or it stops the flow and sets it to zero. In the backward pass, the gate will pass on the gradient from the top if it was activated during the forward pass, or if the original input was below zero, it will stop the gradient flow.

I will stop at this point. I hope you got some intuition about how you can compute entire expressions (which are made up of many gates along the way) and how you can compute backprop for every one of them.

Everything we've done in this chapter comes down to this: We saw that we can feed some input through arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs. If the inputs respond slightly along the final direction of their tug, the circuit will "give" a bit along the original pull direction. Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Let's now put this machinery to good use.

Chapter 2: Machine Learning


In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass). In this chapter we will see how useful this extremely simple mechanism is in Machine Learning.

Binary Classification
As we did before, let's start out simple. The simplest, common and yet very practical problem in Machine Learning is binary classification. A lot of very interesting and important problems can be reduced to it. The setup is as follows: We are given a dataset of N vectors and every one of them is labeled with a +1 or a -1. For example, in two dimensions our dataset could look as simple as:

vector -> label

[1.2, 0.7] -> +1
[-0.3, 0.5] -> -1
[3, 1] -> +1
[-0.1, -1.0] -> -1
[-3.0, 1.1] -> -1
[2.1, -3] -> +1

Here, we have N = 6 datapoints, where every datapoint has two features (D = 2). Three of the datapoints have label +1 and the other three label -1. This is a silly toy example, but in practice a +1/-1 dataset could be very useful things indeed: For example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.

Goal. Our goal in binary classification is to learn a function that takes a 2-dimensional vector and predicts the label. This function is usually parameterized by a certain set of parameters, and we will want to tune the parameters of the function so that its outputs are consistent with the labeling in the provided dataset. In the end we can discard the dataset and use the learned parameters to predict labels for previously unseen vectors.

Training protocol
We will eventually build up to entire neural networks and complex expressions, but let's start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1. The only difference is that we'll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities). Anyway, let's use a simple linear function:

f(x, y) = ax + by + c

In this expression we think of x and y as the inputs (the 2D vectors) and a,b,c as the parameters of the function that we will want to learn. For example, if a = 1, b = -2, c = -1, then the function will take the first data point ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2. Here is how the training will work:

1. We select a random data point and feed it through the circuit
2. We will interpret the output of the circuit as a confidence that the datapoint has class +1. (i.e. very high values = circuit is very certain datapoint has class +1 and very low values = circuit is certain this datapoint has class -1.)
3. We will measure how well the prediction aligns with the provided labels. Intuitively, for example, if a positive example scores very low, we will want to tug in the positive direction on the circuit, demanding that it should output a higher value for this data point. Note that this is the case for the first data point: it is labeled as +1 but our predictor function only assigns it value -1.2. We will therefore tug on the circuit in positive direction; we want the value to be higher.
4. The circuit will take the tug and backpropagate it to compute tugs on the inputs a,b,c,x,y
5. Since we think of x,y as (fixed) datapoints, we will ignore the pull on x,y. If you're a fan of my physical analogies, think of these inputs as pegs, fixed in the ground.
6. On the other hand, we will take the parameters a,b,c and make them respond to their tug (i.e. we'll perform what we call a parameter update). This, of course, will make it so that the circuit will output a slightly higher score on this particular datapoint in the future.
7. Iterate! Go back to step 1.

The training scheme I described above is commonly referred to as Stochastic Gradient Descent. The interesting part I'd like to reiterate is that a,b,c,x,y are all made up of the same stuff as far as the circuit is concerned: They are inputs to the circuit and the circuit will tug on all of them in some direction. It doesn't know the difference between parameters and datapoints. However, after the backward pass is complete we ignore all tugs on the datapoints (x,y) and keep swapping them in and out as we iterate over examples in the dataset. On the other hand, we keep the parameters (a,b,c) around and keep tugging on them every time we sample a datapoint. Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.
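
In code, the whole protocol is just a loop. The sketch below is mine, not the guide's: data, labels and measureAlignment are placeholders (the SVM section below fills in step 3 concretely, and also adds regularization), and the parameter updates use the gradients of a*x + b*y + c that we already derived:

// a minimal sketch of steps 1-7 for the linear function f(x,y) = a*x + b*y + c
var step_size = 0.01;
for(var iter = 0; iter < 400; iter++) {
  var i = Math.floor(Math.random() * data.length); // 1. pick a random data point
  var x = data[i][0], y = data[i][1];
  var score = a*x + b*y + c;                       // 2. forward the circuit
  var pull = measureAlignment(score, labels[i]);   // 3. placeholder: which way to tug, and how hard
  // 4./5. backprop the tug; the tugs on x,y are computed but ignored
  // 6. only the parameters respond (gradient of score w.r.t. a is x, w.r.t. b is y, w.r.t. c is 1)
  a += step_size * x * pull;
  b += step_size * y * pull;
  c += step_size * 1 * pull;
}                                                  // 7. iterate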

Learning a Support Vector Machine
As a concrete example, let's learn a Support Vector Machine. The SVM is a very popular linear classifier. Its functional form is exactly as I've described in the previous section, f(x, y) = ax + by + c. At this point, if you've seen an explanation of SVMs you're probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc. But here, I'd like to take a different approach. Instead of defining loss functions, I would like to base the explanation on the force specification (I just made this term up by the way) of a Support Vector Machine, which I personally find much more intuitive. As we will see, talking about the force specification and the loss function are identical ways of seeing the same problem. Anyway, here it is:

Support Vector Machine "Force Specification":

- If we feed a positive datapoint through the SVM circuit and the output value is less than 1, pull on the circuit with force +1. This is a positive example so we want the score to be higher for it.
- Conversely, if we feed a negative datapoint through the SVM and the output is greater than -1, then the circuit is giving this datapoint dangerously high score: Pull on the circuit downwards with force -1.
- In addition to the pulls above, always add a small amount of pull on the parameters a,b (notice, not on c!) that pulls them towards zero. You can think of both a,b as being attached to a physical spring that is attached at zero. Just as with a physical spring, this will make the pull proportional to the value of each of a,b (Hooke's law in physics, anyone?). For example, if a becomes very high it will experience a strong pull of magnitude |a| back towards zero. This pull is something we call regularization, and it ensures that neither of our parameters a or b gets disproportionally large. This would be undesirable because both a,b get multiplied to the input features x,y (remember the equation is a*x + b*y + c), so if either of them is too high, our classifier would be overly sensitive to these features. This isn't a nice property because features can often be noisy in practice, so we want our classifier to change relatively smoothly if they wiggle around.

Let's quickly go through a small but concrete example. Suppose we start out with a random parameter setting, say, a = 1, b = -2, c = -1. Then:

- If we feed the point [1.2, 0.7], the SVM will compute score 1 * 1.2 + (-2) * 0.7 - 1 = -1.2. This point is labeled as +1 in the training data, so we want the score to be higher than 1. The gradient on top of the circuit will thus be positive: +1, which will backpropagate to a,b,c. Additionally, there will also be a regularization pull on a of -1 (to make it smaller) and a regularization pull on b of +2 to make it larger, toward zero.
- Suppose instead that we fed the datapoint [-0.3, 0.5] to the SVM. It computes 1 * (-0.3) + (-2) * 0.5 - 1 = -2.3. The label for this point is -1, and since -2.3 is smaller than -1, we see that according to our force specification the SVM should be happy: The computed score is very negative, consistent with the negative label of this example. There will be no pull at the end of the circuit (i.e. it's zero), since no changes are necessary. However, there will still be the regularization pull on a of -1 and on b of +2.

Okay there's been too much text. Let's write the SVM code and take advantage of the circuit machinery we have from Chapter 1:

// A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit
// It can also compute the gradient w.r.t. its inputs
var Circuit = function() {
  // create some gates
  this.mulg0 = new multiplyGate();
  this.mulg1 = new multiplyGate();
  this.addg0 = new addGate();
  this.addg1 = new addGate();
};
Circuit.prototype = {
  forward: function(x, y, a, b, c) {
    this.ax = this.mulg0.forward(a, x); // a*x
    this.by = this.mulg1.forward(b, y); // b*y
    this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y
    this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c
    return this.axpbypc;
  },
  backward: function(gradient_top) { // takes pull from above
    this.axpbypc.grad = gradient_top;
    this.addg1.backward(); // sets gradient in axpby and c
    this.addg0.backward(); // sets gradient in ax and by
    this.mulg1.backward(); // sets gradient in b and y
    this.mulg0.backward(); // sets gradient in a and x
  }
}

That's a circuit that simply computes a*x + b*y + c and can also compute the gradient. It uses the gates code we developed in Chapter 1. Now let's write the SVM, which doesn't care about the actual circuit. It is only concerned with the values that come out of it, and it pulls on the circuit.

// SVM class
var SVM = function() {

  // random initial parameter values
  this.a = new Unit(1.0, 0.0);
  this.b = new Unit(-2.0, 0.0);
  this.c = new Unit(-1.0, 0.0);

  this.circuit = new Circuit();
};
SVM.prototype = {
  forward: function(x, y) { // assume x and y are Units
    this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
    return this.unit_out;
  },
  backward: function(label) { // label is +1 or -1

    // reset pulls on a,b,c
    this.a.grad = 0.0;
    this.b.grad = 0.0;
    this.c.grad = 0.0;

    // compute the pull based on what the circuit output was
    var pull = 0.0;
    if(label === 1 && this.unit_out.value < 1) {
      pull = 1; // the score was too low: pull up
    }
    if(label === -1 && this.unit_out.value > -1) {
      pull = -1; // the score was too high for a negative example, pull down
    }
    this.circuit.backward(pull); // writes gradient into x,y,a,b,c

    // add regularization pull for parameters: towards zero and proportional to value
    this.a.grad += -this.a.value;
    this.b.grad += -this.b.value;
  },
  learnFrom: function(x, y, label) {
    this.forward(x, y); // forward pass (set .value in all Units)
    this.backward(label); // backward pass (set .grad in all Units)
    this.parameterUpdate(); // parameters respond to tug
  },
  parameterUpdate: function() {
    var step_size = 0.01;
    this.a.value += step_size * this.a.grad;
    this.b.value += step_size * this.b.grad;
    this.c.value += step_size * this.c.grad;
  }
};

Now let's train the SVM with Stochastic Gradient Descent:

var data = []; var labels = [];
data.push([1.2, 0.7]); labels.push(1);
data.push([-0.3, 0.5]); labels.push(-1);
data.push([3.0, 0.1]); labels.push(1);
data.push([-0.1, -1.0]); labels.push(-1);
data.push([-1.0, 1.1]); labels.push(-1);
data.push([2.1, -3]); labels.push(1);
var svm = new SVM();

// a function that computes the classification accuracy
var evalTrainingAccuracy = function() {
  var num_correct = 0;
  for(var i = 0; i < data.length; i++) {
    var x = new Unit(data[i][0], 0.0);
    var y = new Unit(data[i][1], 0.0);
    var true_label = labels[i];

    // see if the prediction matches the provided label
    var predicted_label = svm.forward(x, y).value > 0 ? 1 : -1;
    if(predicted_label === true_label) {
      num_correct++;
    }
  }
  return num_correct / data.length;
};

// the learning loop
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = new Unit(data[i][0], 0.0);
  var y = new Unit(data[i][1], 0.0);
  var label = labels[i];
  svm.learnFrom(x, y, label);

  if(iter % 25 == 0) { // every 25 iterations...
    console.log('training accuracy at iter ' + iter + ': ' + evalTrainingAccuracy());
  }
}

This code prints the following output:

training accuracy at iteration 0: 0.3333333333333333
training accuracy at iteration 25: 0.3333333333333333
training accuracy at iteration 50: 0.5
training accuracy at iteration 75: 0.5
training accuracy at iteration 100: 0.3333333333333333
training accuracy at iteration 125: 0.5
training accuracy at iteration 150: 0.5
training accuracy at iteration 175: 0.5
training accuracy at iteration 200: 0.5
training accuracy at iteration 225: 0.6666666666666666
training accuracy at iteration 250: 0.6666666666666666
training accuracy at iteration 275: 0.8333333333333334
training accuracy at iteration 300: 1
training accuracy at iteration 325: 1
training accuracy at iteration 350: 1
training accuracy at iteration 375: 1

We see that initially our classifier only had 33% training accuracy, but by the end all training examples are correctly classified as the parameters a,b,c adjusted their values according to the pulls we exerted. We just trained an SVM! But please don't use this code anywhere in production :) We will see how we can make things much more efficient once we understand what is going on at the core.

Number of iterations needed. With this example data, with this example initialization, and with the setting of step size we used, it took about 300 iterations to train the SVM. In practice, this could be many more or many less depending on how hard or large the problem is, how you're initializing, normalizing your data, what step size you're using, and so on. This is just a toy demonstration, but later we will go over all the best practices for actually training these classifiers in practice. For example, it will turn out that the setting of the step size is very important and tricky. Small step size will make your model slow to train. Large step size will train faster, but if it is too large, it will make your classifier chaotically jump around and not converge to a good final result. We will eventually use withheld validation data to properly tune it to be just in the sweet spot for your particular data.

One thing I'd like you to appreciate is that the circuit can be an arbitrary expression, not just the linear prediction function we used in this example. For example, it can be an entire neural network.

By the way, I intentionally structured the code in a modular way, but we could have trained an SVM with a much simpler code. Here is really what all of these classes and computations boil down to:

var a = 1, b = -2, c = -1; // initial parameters
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute pull
  var score = a*x + b*y + c;
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;

  // compute gradient and update parameters
  var step_size = 0.01;
  a += step_size * (x * pull - a); // -a is from the regularization
  b += step_size * (y * pull - b); // -b is from the regularization
  c += step_size * (1 * pull);
}

this code gives an identical result. Perhaps by now you can glance at the code and see how these equations came about.

Variable pull? A quick note to make at this point: You may have noticed that the pull is always 1, 0, or -1. You could imagine doing other things, for example making this pull proportional to how bad the mistake was. This leads to a variation on the SVM that some people refer to as squared hinge loss SVM, for reasons that will later become clear. Depending on various features of your dataset, that may work better or worse. For example, if you have very bad outliers in your data, e.g. a negative data point that gets a score +100, its influence will be relatively minor on our classifier because we will only pull with force of -1 regardless of how bad the mistake was. In practice we refer to this property of a classifier as robustness to outliers.
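
For concreteness, here is one way the proportional pull could look inside the simple training loop above (my sketch of the squared-hinge idea, not code from the guide; a constant factor is omitted):

var margin = 1 - label * score;        // how far we are from the desired margin
var pull = 0.0;
if(margin > 0) pull = label * margin;  // a worse mistake now produces a stronger (signed) tug

With this pull, the negative outlier scoring +100 would be tugged with a force of about -101 instead of just -1, which is why the constant-force version described above is considered more robust to outliers.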

Let's recap. We introduced the binary classification problem, where we are given N D-dimensional vectors and a label +1/-1 for each. We saw that we can combine these features with a set of parameters inside a real-valued circuit (such as a Support Vector Machine circuit in our example). Then, we can repeatedly pass our data through the circuit and each time tweak the parameters so that the circuit's output value is consistent with the provided labels. The tweaking relied, crucially, on our ability to backpropagate gradients through the circuit. In the end, the final circuit can be used to predict values for unseen instances!

Generalizing the SVM into a Neural Network
Of interest is the fact that an SVM is just a particular type of a very simple circuit (circuit that computes score = a*x + b*y + c where a,b,c are weights and x,y are data points). This can be easily extended to more complicated functions. For example, let's write a 2-layer Neural Network that does the binary classification. The forward pass will look like this:

// assume inputs x,y
var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

The specification above is a 2-layer Neural Network with 3 hidden neurons (n1, n2, n3) that uses a Rectified Linear Unit (ReLU) non-linearity on each hidden neuron. As you can see, there are now several parameters involved, which means that our classifier is more complex and can represent more intricate decision boundaries than just a simple linear decision rule such as an SVM. Another way to think about it is that every one of the three hidden neurons is a linear classifier and now we're putting an extra linear classifier on top of that. Now we're starting to go deeper :). Okay, let's train this 2-layer Neural Network. The code looks very similar to the SVM example code above, we just have to change the forward pass and the backward pass:

//randominitialparameters
vara1=Math.random()0.5;//arandomnumberbetween0.5and0.5
//...similarlyinitializeallotherparameterstorandoms
for(variter=0;iter<400;iter++){
//pickarandomdatapoint
vari=Math.floor(Math.random()*data.length);

https://karpathy.github.io/neuralnets/ 32/41
3/17/2017 Hacker'sguidetoNeuralNetworks

  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute forward pass
  var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
  var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
  var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
  var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

  // compute the pull on top
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1; // we want higher output! Pull up.
  if(label === -1 && score > -1) pull = -1; // we want lower output! Pull down.

  // now compute backward pass to all parameters of the model

  // backprop through the last "score" neuron
  var dscore = pull;
  var da4 = n1 * dscore;
  var dn1 = a4 * dscore;
  var db4 = n2 * dscore;
  var dn2 = b4 * dscore;
  var dc4 = n3 * dscore;
  var dn3 = c4 * dscore;
  var dd4 = 1.0 * dscore; // phew

  // backprop the ReLU non-linearities, in place
  // i.e. just set gradients to zero if the neurons did not "fire"
  var dn3 = n3 === 0 ? 0 : dn3;
  var dn2 = n2 === 0 ? 0 : dn2;
  var dn1 = n1 === 0 ? 0 : dn1;

  // backprop to parameters of neuron 1
  var da1 = x * dn1;
  var db1 = y * dn1;
  var dc1 = 1.0 * dn1;

  // backprop to parameters of neuron 2
  var da2 = x * dn2;
  var db2 = y * dn2;
  var dc2 = 1.0 * dn2;

  // backprop to parameters of neuron 3
  var da3 = x * dn3;
  var db3 = y * dn3;
  var dc3 = 1.0 * dn3;

  // phew! End of backprop!
  // note we could have also backpropped into x,y
  // but we do not need these gradients. We only use the gradients
  // on our parameters in the parameter update, and we discard x,y

  // add the pulls from the regularization, tugging all multiplicative
  // parameters (i.e. not the biases) downward, proportional to their value
  da1 += -a1; da2 += -a2; da3 += -a3;
  db1 += -b1; db2 += -b2; db3 += -b3;
  da4 += -a4; db4 += -b4; dc4 += -c4;

  // finally, do the parameter update
  var step_size = 0.01;
  a1 += step_size * da1;
  b1 += step_size * db1;
  c1 += step_size * dc1;
  a2 += step_size * da2;
  b2 += step_size * db2;
  c2 += step_size * dc2;
  a3 += step_size * da3;
  b3 += step_size * db3;
  c3 += step_size * dc3;
  a4 += step_size * da4;
  b4 += step_size * db4;
  c4 += step_size * dc4;
  d4 += step_size * dd4;
  // wow this is tedious, please use for loops in prod.
  // we're done!
}

And that's how you train a neural network. Obviously, you want to modularize your code nicely but I expanded this example for you in the hope that it makes things much more concrete and simpler to understand. Later, we will look at best practices when implementing these networks and we will structure the code much more neatly in a modular and more sensible way.
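As a small teaser of what that might look like, here is a minimal sketch that stores each hidden neuron's parameters in an array so the forward pass becomes a loop (the layout and the names here are just illustrative assumptions, not the structure we will actually adopt):

var params = [];
for(var k = 0; k < 3; k++) {
  // one object of parameters per hidden neuron
  params.push({a: Math.random() - 0.5, b: Math.random() - 0.5, c: Math.random() - 0.5});
}
var forwardHidden = function(x, y) {
  var ns = [];
  for(var k = 0; k < params.length; k++) {
    var p = params[k];
    ns.push(Math.max(0, p.a*x + p.b*y + p.c)); // ReLU neuron, exactly as before
  }
  return ns; // activations of all hidden neurons
};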

But for now, I hope your takeaway is that a 2-layer Neural Net is really not such a scary thing: we write a forward pass expression, interpret the value at the end as a score, and then we pull on that value in a positive or negative direction depending on what we want that value to be for our current particular example. The parameter update after backprop will ensure that when we see this particular example in the future, the network will be more likely to give us a value we desire, not the one it gave just before the update.

A more Conventional Approach: Loss Functions

Now that we understand the basics of how these circuits function with data, let's adopt a more conventional approach that you might see elsewhere on the internet and in other tutorials and books. You won't see people talking too much about force specifications. Instead, Machine Learning algorithms are specified in terms of loss functions (or cost functions, or objectives).

As I develop this formalism I would also like to start to be a little more careful with how we name our variables and parameters. I'd like these equations to look similar to what you might see in a book or some other tutorial, so let me use more standard naming conventions.

Example: 2-D Support Vector Machine

Let's start with an example of a 2-dimensional SVM. We are given a dataset of N examples (x_{i0}, x_{i1}) and their corresponding labels y_i, which are allowed to be either +1/-1 for positive or negative examples respectively. Most importantly, as you recall we have three parameters (w_0, w_1, w_2). The SVM loss function is then defined as follows:

L = \left[\sum_{i=1}^{N} \max\big(0, -y_i(w_0 x_{i0} + w_1 x_{i1} + w_2) + 1\big)\right] + \alpha\left[w_0^2 + w_1^2\right]

Notice that this expression is always positive, due to the thresholding at zero in the first expression and the squaring in the regularization. The idea is that we will want this expression to be as small as possible. Before we dive into some of its subtleties let me first translate it to code:

var X = [ [1.2, 0.7], [-0.3, 0.5], [3, 2.5] ]; // array of 2-dimensional data
var y = [1, -1, 1]; // array of labels
var w = [0.1, 0.2, 0.3]; // example: random numbers
var alpha = 0.1; // regularization strength

function cost(X, y, w) {

  var total_cost = 0.0; // L, in SVM loss function above
  var N = X.length;
  for(var i = 0; i < N; i++) {
    // loop over all data points and compute their score
    var xi = X[i];
    var score = w[0] * xi[0] + w[1] * xi[1] + w[2];

    // accumulate cost based on how compatible the score is with the label
    var yi = y[i]; // label
    var costi = Math.max(0, -yi * score + 1);
    console.log('example ' + i + ': xi = (' + xi + ') and label = ' + yi);
    console.log('  score computed to be ' + score.toFixed(3));
    console.log('  => cost computed to be ' + costi.toFixed(3));
    total_cost += costi;
  }

  // regularization cost: we want small weights
  var reg_cost = alpha * (w[0]*w[0] + w[1]*w[1]);
  console.log('regularization cost for current model is ' + reg_cost.toFixed(3));
  total_cost += reg_cost;

  console.log('total cost is ' + total_cost.toFixed(3));
  return total_cost;
}
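The output shown next comes from simply invoking the function on the data defined above, along these lines (a usage sketch):

cost(X, y, w); // logs the per-example costs and returns the total cost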

And here is the output:

example 0: xi = (1.2,0.7) and label = 1
  score computed to be 0.560
  => cost computed to be 0.440
example 1: xi = (-0.3,0.5) and label = -1
  score computed to be 0.370
  => cost computed to be 1.370
example 2: xi = (3,2.5) and label = 1
  score computed to be 1.100
  => cost computed to be 0.000
regularization cost for current model is 0.005
total cost is 1.815

Notice how this expression works: It measures how bad our SVM classifier is. Let's step through this explicitly:

The first data point xi = [1.2, 0.7] with label yi = 1 will give score 0.1*1.2 + 0.2*0.7 + 0.3, which is 0.56. Notice, this is a positive example so we want the score to be greater than +1. 0.56 is not enough. And indeed, the expression for cost for this data point will compute: costi = Math.max(0, -1*0.56 + 1), which is 0.44. You can think of the cost as quantifying the SVM's unhappiness.

The second data point xi = [-0.3, 0.5] with label yi = -1 will give score 0.1*(-0.3) + 0.2*0.5 + 0.3, which is 0.37. This isn't looking very good: This score is very high for a negative example. It should be less than -1. Indeed, when we compute the cost: costi = Math.max(0, 1*0.37 + 1), we get 1.37. That's a very high cost from this example, as it is being misclassified.

The last example xi = [3, 2.5] with label yi = 1 gives score 0.1*3 + 0.2*2.5 + 0.3, and that is 1.1. In this case, the SVM will compute costi = Math.max(0, -1*1.1 + 1), which is in fact zero. This data point is being classified correctly and there is no cost associated with it.

A cost function is an expression that measures how bad your classifier is. When the training set is perfectly classified, the cost (ignoring the regularization) will be zero.

Notice that the last term in the loss is the regularization cost, which says that our model parameters should be small values. Due to this term the cost will never actually become zero (because this would mean all parameters of the model except the bias are exactly zero), but the closer we get, the better our classifier will become.

The majority of cost functions in Machine Learning consist of two parts: 1. A part that measures how well a model fits the data, and 2. Regularization, which measures some notion of how complex or likely a model is.
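Schematically (this is just my shorthand to summarize the two parts, not notation we will use later):

L = \underbrace{\sum_{i=1}^{N} L_i}_{\text{data fit}} + \underbrace{\alpha \, R(w)}_{\text{regularization}}

where L_i is the cost contributed by the i-th example and R(w) penalizes large (or otherwise unlikely) parameter settings.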

I hope I convinced you then, that to get a very good SVM we really want to make the cost as small as possible. Sounds familiar? We know exactly what to do: The cost function written above is our circuit. We will forward all examples through the circuit, compute the backward pass and update all parameters such that the circuit will output a smaller cost in the future. Specifically, we will compute the gradient and then update the parameters in the opposite direction of the gradient (since we want to make the cost small, not large).

We know exactly what to do: The cost function written above is our circuit.
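To make that procedure concrete, here is a minimal sketch that does it with a numerical gradient of the cost function above (the step size, the nudge h and the number of iterations are illustrative assumptions; a backprop version would compute the same gradient analytically and much more cheaply):

// assumes X, y, w, alpha and the cost() function defined above are in scope
var step_size = 0.01;
var h = 0.0001; // small nudge used to estimate the gradient numerically
for(var iter = 0; iter < 100; iter++) {
  var base_cost = cost(X, y, w); // note: cost() also logs; you may want to silence that
  var dw = [0, 0, 0];
  for(var j = 0; j < w.length; j++) {
    w[j] += h; // nudge one parameter
    dw[j] = (cost(X, y, w) - base_cost) / h; // estimate dL/dw_j
    w[j] -= h; // undo the nudge
  }
  for(var j = 0; j < w.length; j++) {
    w[j] += -step_size * dw[j]; // step in the opposite direction of the gradient
  }
}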

todo: clean up this section and flesh it out a bit

Chapter 3: Backprop in Practice

Building up a library

Example: Practical Neural Network Classifier
Multiclass: Structured SVM
Multiclass: Logistic Regression, Softmax

Example: Regression
Tiny changes needed to cost function. L2 regularization.

Example: Structured Prediction
Basic idea is to train an (unnormalized) energy model

Vectorized Implementations
Writing a Neural Net classifier in Python with numpy.

Backprop in practice: Tips/Tricks
Monitoring of Cost function
Monitoring training/validation performance
Tweaking initial learning rates, learning rate schedules
Optimization: Using Momentum
Optimization: LBFGS, Nesterov accelerated gradient
Importance of Initialization: weights and biases
Regularization: L2, L1, Group sparsity, Dropout
Hyperparameter search, cross-validations
Common pitfalls: (e.g. dying ReLUs)
Handling unbalanced datasets
Approaches to debugging nets when something doesn't work

Chapter 4: Networks in the Wild
Case studies of models that work well in practice and have been deployed in the wild.

Case Study: Convolutional Neural Networks for images
Convolutional layers, pooling, AlexNet, etc.

Case Study: Recurrent Neural Networks for Speech and Text
Vanilla Recurrent nets, bidirectional recurrent nets. Maybe overview of LSTM

Case Study: Word2Vec
Training word vector representations in NLP

Case Study: t-SNE
Training embeddings for visualizing data

Acknowledgements
Thanks a lot to the following people who made this guide better: wodenokoto (HN), zackmorris (HN).

Comments
This guide is a work in progress and I appreciate feedback, especially regarding parts that were unclear or only made half sense. Thank you!

Some of the Javascript code in this tutorial has been translated to Python by Ajit, find it over on Github.
