
Hacker's guide to Neural Networks
Hi there, I'm a CS PhD student at Stanford. I've worked on Deep Learning for a few years as part of my research and among several of my related pet projects is ConvNetJS, a Javascript library for training Neural Networks. Javascript allows one to nicely visualize what's going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic. This article (which I plan to slowly expand out to lengths of a few book chapters) is my humble attempt. It's on web instead of PDF because all books should be, and eventually it will hopefully include animations/demos etc.

My personal experience with Neural Networks is that everything became much clearer when I started ignoring full-page, dense derivations of backpropagation equations and just started writing code. Thus, this tutorial will contain very little math (I don't believe it is necessary and it can sometimes even obfuscate simple concepts). Since my background is in Computer Science and Physics, I will instead develop the topic from what I refer to as a hacker's perspective. My exposition will center around code and physical intuitions instead of mathematical derivations. Basically, I will strive to present the algorithms in a way that I wish I had come across when I was starting out.

everything became much clearer when I started writing code.

You might be eager to jump right in and learn about Neural Networks, backpropagation, how they can be applied to datasets in practice, etc. But before we get there, I'd like us to first forget about all that. Let's take a step back and understand what is really going on at the core. Let's first talk about real-valued circuits.

Update note: I suspended my work on this guide a while ago and redirected a lot of my energy to teaching the CS231n (Convolutional Neural Networks) class at Stanford. The notes are on cs231n.github.io and the course slides can be found here. These materials are highly related to material here, but more comprehensive and sometimes more polished.

Chapter 1: Real-valued Circuits


In my opinion, the best way to think of Neural Networks is as real-valued circuits, where real values (instead of boolean values {0,1}) flow along edges and interact in gates. However, instead of gates such as AND, OR, NOT, etc, we have binary gates such as * (multiply), + (add), max or unary gates such as exp, etc. Unlike ordinary boolean circuits, however, we will eventually also have gradients flowing on the same edges of the circuit, but in the opposite direction. But we're getting ahead of ourselves. Let's focus and start out simple.

Base Case: Single Gate in the Circuit
Let's first consider a single, simple circuit with one gate. Here's an example:

[Diagram: the inputs x and y feed into a * gate, which produces the output.]

The circuit takes two real-valued inputs x and y and computes x * y with the * gate.


Javascript version of this would very simply look something like this:

var forwardMultiplyGate = function(x, y) {
  return x * y;
};
forwardMultiplyGate(-2, 3); // returns -6. Exciting.

And in math form we can think of this gate as implementing the real-valued function:

f(x, y) = x y

As with this example, all of our gates will take one or two inputs and produce a single output value.

The Goal
The problem we are interested in studying looks as follows:

1. We provide a given circuit some specific input values (e.g. x = -2, y = 3)
2. The circuit computes an output value (e.g. -6)
3. The core question then becomes: How should one tweak the input slightly to increase the output?

In this case, in what direction should we change x,y to get a number larger than -6? Note that, for example, x = -1.99 and y = 2.99 gives x * y = -5.95, which is higher than -6.0. Don't get confused by this: -5.95 is better (higher) than -6.0. It's an improvement of 0.05, even though the magnitude of -5.95 (the distance from zero) happens to be lower.
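
As a quick check of that claim (this snippet is just an illustration, not part of the original code):

forwardMultiplyGate(-1.99, 2.99); // returns -5.9501, which is indeed higher than -6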

Strategy #1: Random Local Search

Okay. So wait, we have a circuit, we have some inputs and we just want to tweak them slightly to increase the output value? Why is this hard? We can easily "forward" the circuit to compute the output for any given x and y. So isn't this trivial? Why don't we tweak x and y randomly and keep track of the tweak that works best:

// circuit with single gate for now
var forwardMultiplyGate = function(x, y) { return x * y; };
var x = -2, y = 3; // some input values

// try changing x,y randomly small amounts and keep track of what works best
var tweak_amount = 0.01;
var best_out = -Infinity;
var best_x = x, best_y = y;
for(var k = 0; k < 100; k++) {
  var x_try = x + tweak_amount * (Math.random() * 2 - 1); // tweak x a bit
  var y_try = y + tweak_amount * (Math.random() * 2 - 1); // tweak y a bit
  var out = forwardMultiplyGate(x_try, y_try);
  if(out > best_out) {
    // best improvement yet! Keep track of the x and y
    best_out = out;
    best_x = x_try, best_y = y_try;
  }
}

When I run this, I get best_x = -1.9928, best_y = 2.9901, and best_out = -5.9588. Again, -5.9588 is higher than -6.0. So, we're done, right? Not quite: This is a perfectly fine strategy for tiny problems with a few gates if you can afford the compute time, but it won't do if we want to eventually consider huge circuits with millions of inputs. It turns out that we can do much better.

Strategy #2: Numerical Gradient

Here's a better way. Remember again that in our setup we are given a circuit (e.g. our circuit with a single * gate) and some particular input (e.g. x = -2, y = 3). The gate computes the output (-6) and now we'd like to tweak x and y to make the output higher.

A nice intuition for what we're about to do is as follows: Imagine taking the output value that comes out from the circuit and tugging on it in the positive direction. This positive tension will in turn translate through the gate and induce forces on the inputs x and y. Forces that tell us how x and y should change to increase the output value.

What might those forces look like in our specific example? Thinking through it, we can intuit that the force on x should also be positive, because making x slightly larger improves the circuit's output. For example, increasing x from x = -2 to x = -1 would give us output -3, much larger than -6. On the other hand, we'd expect a negative force induced on y that pushes it to become lower (since a lower y, such as y = 2, down from the original y = 3, would make output higher: -2 x 2 = -4, again, larger than -6). That's the intuition to keep in mind, anyway. As we go through this, it will turn out that the forces I'm describing will in fact turn out to be the derivative of the output value with respect to its inputs (x and y). You may have heard this term before.

The derivative can be thought of as a force on each input as we pull on the output to become higher.

So how do we exactly evaluate this force (derivative)? It turns out that there is a very simple procedure for this. We will work backwards: Instead of pulling on the circuit's output, we'll iterate over every input one by one, increase it very slightly and look at what happens to the output value. The amount the output changes in response is the derivative. Enough intuitions for now. Let's look at the mathematical definition. We can write down the derivative for our function with respect to the inputs. For example, the derivative with respect to x can be computed as:

\frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h}

Where h is small; it's the tweak amount. Also, if you're not very familiar with calculus it is important to note that in the left-hand side of the equation above, the horizontal line does not indicate division. The entire symbol \frac{\partial f(x,y)}{\partial x} is a single thing: the derivative of the function f(x,y) with respect to x. The horizontal line on the right is division. I know it's confusing but it's standard notation. Anyway, I hope it doesn't look too scary because it isn't: The circuit was giving some initial output f(x,y), and then we changed one of the inputs by a tiny amount h and read the new output f(x+h,y). Subtracting those two quantities tells us the change, and the division by h just normalizes this change by the (arbitrary) tweak amount we used. In other words it's expressing exactly what I described above and translates directly to this code:

var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // -6
var h = 0.0001;

// compute derivative with respect to x
var xph = x + h; // -1.9999
var out2 = forwardMultiplyGate(xph, y); // -5.9997
var x_derivative = (out2 - out) / h; // 3.0

// compute derivative with respect to y
var yph = y + h; // 3.0001
var out3 = forwardMultiplyGate(x, yph); // -6.0002
var y_derivative = (out3 - out) / h; // -2.0

Let's walk through x for example. We turned the knob from x to x + h and the circuit responded by giving a higher value (note again that yes, -5.9997 is higher than -6: -5.9997 > -6). The division by h is there to normalize the circuit's response by the (arbitrary) value of h we chose to use here. Technically, you want the value of h to be infinitesimal (the precise mathematical definition of the gradient is defined as the limit of the expression as h goes to zero), but in practice h = 0.00001 or so works fine in most cases to get a good approximation. Now, we see that the derivative w.r.t. x is +3. I'm making the positive sign explicit, because it indicates that the circuit is tugging on x to become higher. The actual value, 3, can be interpreted as the force of that tug.

The derivative with respect to some input can be computed by tweaking that input by a small amount and observing the change on the output value.

By the way, we usually talk about the derivative with respect to a single input, or about a gradient with respect to all the inputs. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (i.e. a list). Crucially, notice that if we let the inputs respond to the tug by following the gradient a tiny amount (i.e. we just add the derivative on top of every input), we can see that the value increases, as expected:

var step_size = 0.01;
var out = forwardMultiplyGate(x, y); // before: -6
x = x + step_size * x_derivative; // x becomes -1.97
y = y + step_size * y_derivative; // y becomes 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87! exciting.

As expected, we changed the inputs by the gradient and the circuit now gives a slightly higher value (-5.87 > -6.0). That was much simpler than trying random changes to x and y, right? A fact to appreciate here is that if you take calculus you can prove that the gradient is, in fact, the direction of the steepest increase of the function. There is no need to monkey around trying out random perturbations as done in Strategy #1. Evaluating the gradient requires just three evaluations of the forward pass of our circuit instead of hundreds, and gives the best tug you can hope for (locally) if you are interested in increasing the value of the output.
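
To make the procedure concrete, here is a small generic helper (my own sketch, not code from the guide) that packages Strategy #2: it numerically estimates the gradient of any forward function by tweaking one input at a time:

var numericalGradient = function(f, inputs) {
  var h = 0.0001;
  var base = f.apply(null, inputs);      // unperturbed output
  var grad = [];
  for(var i = 0; i < inputs.length; i++) {
    var tweaked = inputs.slice();        // copy the inputs
    tweaked[i] += h;                     // tweak input i a tiny bit
    grad.push((f.apply(null, tweaked) - base) / h);
  }
  return grad;
};
numericalGradient(forwardMultiplyGate, [-2, 3]); // approximately [3, -2]

Note that this needs one extra forward pass per input, which is exactly the linear cost we will complain about in Strategy #3.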

Bigger step is not always better. Let me clarify on this point a bit. It is important to note that in this very simple example, using a bigger step_size than 0.01 will always work better. For example, step_size = 1.0 gives output 1 (higher, better!), and indeed infinite step size would give infinitely good results. The crucial thing to realize is that once our circuits get much more complex (e.g. entire neural networks), the function from inputs to the output value will be more chaotic and wiggly. The gradient guarantees that if you have a very small (indeed, infinitesimally small) step size, then you will definitely get a higher number when you follow its direction, and for that infinitesimally small step size there is no other direction that would have worked better. But if you use a bigger step size (e.g. step_size = 0.01) all bets are off. The reason we can get away with a larger step size than infinitesimally small is that our functions are usually relatively smooth. But really, we're crossing our fingers and hoping for the best.
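
You can check the step_size = 1.0 claim directly with the derivatives we computed (x_derivative = 3, y_derivative = -2); this one-liner is just an illustration:

forwardMultiplyGate(-2 + 1.0 * 3, 3 + 1.0 * (-2)); // 1 * 1 = 1, much higher than -6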

Hill climbing analogy. One analogy I've heard before is that the output value of our circuit is like the height of a hill, and we are blindfolded and trying to climb upwards. We can sense the steepness of the hill at our feet (the gradient), so when we shuffle our feet a bit we will go upwards. But if we took a big, overconfident step, we could have stepped right into a hole.

Great, I hope I've convinced you that the numerical gradient is indeed a very useful thing to evaluate, and that it is cheap. But. It turns out that we can do even better.

Strategy #3: Analytic Gradient
In the previous section we evaluated the gradient by probing the circuit's output value, independently for every input. This procedure gives you what we call a numerical gradient. This approach, however, is still expensive because we need to compute the circuit's output as we tweak every input value independently a small amount. So the complexity of evaluating the gradient is linear in number of inputs. But in practice we will have hundreds, thousands or (for neural networks) even tens to hundreds of millions of inputs, and the circuits aren't just one multiply gate but huge expressions that can be expensive to compute. We need something better.

Luckily, there is an easier and much faster way to compute the gradient: we can use calculus to derive a direct expression for it that will be as simple to evaluate as the circuit's output value. We call this an analytic gradient and there will be no need for tweaking anything. You may have seen other people who teach Neural Networks derive the gradient in huge and, frankly, scary and confusing mathematical equations (if you're not well versed in maths). But it's unnecessary. I've written plenty of Neural Nets code and I rarely have to do mathematical derivation longer than two lines, and 95% of the time it can be done without writing anything at all. That is because we will only ever derive the gradient for very small and simple expressions (think of it as the base case) and then I will show you how we can compose these very simply with chain rule to evaluate the full gradient (think inductive/recursive case).

The analytic derivative requires no tweaking of the inputs. It can be derived using mathematics (calculus).

If you remember your product rules, power rules, quotient rules, etc. (see e.g. derivative rules or wiki page), it's very easy to write down the derivative with respect to both x and y for a small expression such as x * y. But suppose you don't remember your calculus rules. We can go back to the definition. For example, here's the expression for the derivative w.r.t x:

\frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h}

(Technically I'm not writing the limit as h goes to zero, forgive me math people). Okay and let's plug in our function (f(x,y) = xy) into the expression. Ready for the hardest piece of math of this entire article? Here we go:

\frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} = \frac{(x+h)y - xy}{h} = \frac{xy + hy - xy}{h} = \frac{hy}{h} = y

That's interesting. The derivative with respect to x is just equal to y. Did you notice the coincidence in the previous section? We tweaked x to x + h and calculated x_derivative = 3.0, which exactly happens to be the value of y in that example. It turns out that wasn't a coincidence at all because that's just what the analytic gradient tells us the x derivative should be for f(x,y) = x * y. The derivative with respect to y, by the way, turns out to be x, unsurprisingly by symmetry. So there is no need for any tweaking! We invoked powerful mathematics and can now transform our derivative calculation into the following code:


var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // before: -6
var x_gradient = y; // by our complex mathematical derivation above
var y_gradient = x;

var step_size = 0.01;
x += step_size * x_gradient; // x becomes -1.97
y += step_size * y_gradient; // y becomes 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87. Higher output! Nice.

To compute the gradient we went from forwarding the circuit hundreds of times (Strategy #1) to forwarding it only on the order of twice the number of inputs (Strategy #2), to forwarding it a single time! And it gets EVEN better, since the more expensive strategies (#1 and #2) only give an approximation of the gradient, while #3 (the fastest one by far) gives you the exact gradient. No approximations. The only downside is that you should be comfortable with some calculus 101.

Let's recap what we have learned:

- INPUT: We are given a circuit, some inputs and compute an output value.
- OUTPUT: We are then interested in finding small changes to each input (independently) that would make the output higher.
- Strategy #1: One silly way is to randomly search for small perturbations of the inputs and keep track of what gives the highest increase in output.
- Strategy #2: We saw we can do much better by computing the gradient. Regardless of how complicated the circuit is, the numerical gradient is very simple (but relatively expensive) to compute. We compute it by probing the circuit's output value as we tweak the inputs one at a time.
- Strategy #3: In the end, we saw that we can be even more clever and analytically derive a direct expression to get the analytic gradient. It is identical to the numerical gradient, it is fastest by far, and there is no need for any tweaking.

In practice by the way (and we will get to this once again later), all Neural Network libraries always compute the analytic gradient, but the correctness of the implementation is verified by comparing it to the numerical gradient. That's because the numerical gradient is very easy to evaluate (but can be a bit expensive to compute), while the analytic gradient can contain bugs at times, but is usually extremely efficient to compute. As we will see, evaluating the gradient (i.e. while doing backprop, or backward pass) will turn out to cost about as much as evaluating the forward pass.
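
For our single * gate, such a gradient check is a one-liner. The snippet below is a minimal sketch of the idea, not code from the original text:

var x = -2, y = 3, h = 0.0001;
var analytic_dx = y; // Strategy #3: the derivative w.r.t. x is just y
var numeric_dx = (forwardMultiplyGate(x + h, y) - forwardMultiplyGate(x, y)) / h; // Strategy #2
Math.abs(analytic_dx - numeric_dx) < 1e-6; // true: the two agree up to tiny numerical error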

Recursive Case: Circuits with Multiple Gates
But hold on, you say: "The analytic gradient was trivial to derive for your super-simple expression. This is useless. What do I do when the expressions are much larger? Don't the equations get huge and complex very fast?". Good question. Yes the expressions get much more complex. No, this doesn't make it much harder. As we will see, every gate will be hanging out by itself, completely unaware of any details of the huge and complex circuit that it could be part of. It will only worry about its inputs and it will compute its local derivatives as seen in the previous section, except now there will be a single extra multiplication it will have to do.

A single extra multiplication will turn a single (useless gate) into a cog in the complex machine that is an entire neural network.

I should stop hyping it up now. I hope I've piqued your interest! Let's drill down into details and get two gates involved with this next example:

[Diagram: inputs x and y feed into a + gate producing q; q and z feed into a * gate producing the output f.]

The expression we are computing now is f(x, y, z) = (x + y) z. Let's structure the code as follows to make the gates explicit as functions:

var forwardMultiplyGate = function(a, b) {
  return a * b;
};
var forwardAddGate = function(a, b) {
  return a + b;
};
var forwardCircuit = function(x, y, z) {
  var q = forwardAddGate(x, y);
  var f = forwardMultiplyGate(q, z);
  return f;
};

var x = -2, y = 5, z = -4;
var f = forwardCircuit(x, y, z); // output is -12


In the above, I am using a and b as the local variables in the gate functions so that we don't get these confused with our circuit inputs x,y,z. As before, we are interested in finding the derivatives with respect to the three inputs x,y,z. But how do we compute it now that there are multiple gates involved? First, let's pretend that the + gate is not there and that we only have two variables in the circuit: q,z and a single * gate. Note that q is the output of the + gate. If we don't worry about x and y but only about q and z, then we are back to having only a single gate, and as far as that single * gate is concerned, we know what the (analytic) derivatives are from the previous section. We can write them down (except here we're replacing x,y with q,z):

f(q, z) = q z \quad \implies \quad \frac{\partial f(q,z)}{\partial q} = z, \quad \frac{\partial f(q,z)}{\partial z} = q

Simple enough: these are the expressions for the gradient with respect to q and z. But wait, we don't want the gradient with respect to q, but with respect to the inputs: x and y. Luckily, q is computed as a function of x and y (by addition in our example). We can write down the gradient for the addition gate as well, it's even simpler:

q(x, y) = x + y \quad \implies \quad \frac{\partial q(x,y)}{\partial x} = 1, \quad \frac{\partial q(x,y)}{\partial y} = 1

That's right, the derivatives are just 1, regardless of the actual values of x and y. If you think about it, this makes sense because to make the output of a single addition gate higher, we expect a positive tug on both x and y, regardless of their values.
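
You can also confirm this numerically, in the same way we probed the * gate (a tiny check, not part of the original code):

var h = 0.0001;
(forwardAddGate(-2 + h, 5) - forwardAddGate(-2, 5)) / h; // 1: the derivative w.r.t. x
(forwardAddGate(-2, 5 + h) - forwardAddGate(-2, 5)) / h; // 1: the derivative w.r.t. y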

Backpropagation

We are finally ready to invoke the Chain Rule: We know how to compute the gradient of q with respect to x and y (that's a single gate case with + as the gate). And we know how to compute the gradient of our final output with respect to q. The chain rule tells us how to combine these to get the gradient of the final output with respect to x and y, which is what we're ultimately interested in. Best of all, the chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them. For example, the final derivative for x will be:

\frac{\partial f(q,z)}{\partial x} = \frac{\partial q(x,y)}{\partial x} \frac{\partial f(q,z)}{\partial q}

There are many symbols there so maybe this is confusing again, but it's really just two numbers being multiplied together. Here is the code:


// initial conditions
var x = -2, y = 5, z = -4;
var q = forwardAddGate(x, y); // q is 3
var f = forwardMultiplyGate(q, z); // output is -12

// gradient of the MULTIPLY gate with respect to its inputs
// wrt is short for "with respect to"
var derivative_f_wrt_z = q; // 3
var derivative_f_wrt_q = z; // -4

// derivative of the ADD gate with respect to its inputs
var derivative_q_wrt_x = 1.0;
var derivative_q_wrt_y = 1.0;

// chain rule
var derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q; // -4
var derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q; // -4

That's it. We computed the gradient (the forces) and now we can let our inputs respond to it by a bit. Let's add the gradients on top of the inputs. The output value of the circuit better increase, up from -12!

// final gradient, from above: [-4, -4, 3]
var gradient_f_wrt_xyz = [derivative_f_wrt_x, derivative_f_wrt_y, derivative_f_wrt_z];

// let the inputs respond to the force/tug:
var step_size = 0.01;
x = x + step_size * derivative_f_wrt_x; // -2.04
y = y + step_size * derivative_f_wrt_y; // 4.96
z = z + step_size * derivative_f_wrt_z; // -3.97

// Our circuit now better give higher output:
var q = forwardAddGate(x, y); // q becomes 2.92
var f = forwardMultiplyGate(q, z); // output is -11.59, up from -12! Nice!

Looks like that worked! Let's now try to interpret intuitively what just happened. The circuit wants to output higher values. The last gate saw inputs q = 3, z = -4 and computed output -12. Pulling upwards on this output value induced a force on both q and z: To increase the output value, the circuit wants z to increase, as can be seen by the positive value of the derivative (derivative_f_wrt_z = +3). Again, the size of this derivative can be interpreted as the magnitude of the force. On the other hand, q felt a stronger and downward force, since derivative_f_wrt_q = -4. In other words the circuit wants q to decrease, with a force of 4.

Now we get to the second, + gate which outputs q. By default, the + gate computes its derivatives which tell us how to change x and y to make q higher. BUT! Here is the crucial point: the gradient on q was computed as negative (derivative_f_wrt_q = -4), so the circuit wants q to decrease, and with a force of 4! So if the + gate wants to contribute to making the final output value larger, it needs to listen to the gradient signal coming from the top. In this particular case, it needs to apply tugs on x,y opposite of what it would normally apply, and with a force of 4, so to speak. The multiplication by -4 seen in the chain rule achieves exactly this: instead of applying a positive force of +1 on both x and y (the local derivative), the full circuit's gradient on both x and y becomes 1 x -4 = -4. This makes sense: the circuit wants both x and y to get smaller because this will make q smaller, which in turn will make f larger.

If this makes sense, you understand backpropagation.

Let's recap once again what we learned:

- In the previous chapter we saw that in the case of a single gate (or a single expression), we can derive the analytic gradient using simple calculus. We interpreted the gradient as a force, or a tug on the inputs that pulls them in a direction which would make this gate's output higher.
- In case of multiple gates everything stays pretty much the same way: every gate is hanging out by itself completely unaware of the circuit it is embedded in. Some inputs come in and the gate computes its output and the derivative with respect to the inputs. The only difference now is that suddenly, something can pull on this gate from above. That's the gradient of the final circuit output value with respect to the output this gate computed. It is the circuit asking the gate to output higher or lower numbers, and with some force. The gate simply takes this force and multiplies it to all the forces it computed for its inputs before (chain rule). This has the desired effect:

1. If a gate experiences a strong positive pull from above, it will also pull harder on its own inputs, scaled by the force it is experiencing from above
2. And if it experiences a negative tug, this means that the circuit wants its value to decrease not increase, so it will flip the force of the pull on its inputs to make its own output value smaller.


A nice picture to have in mind is that as we pull on the circuit's output value at the end, this induces pulls downward through the entire circuit, all the way down to the inputs.

Isn't it beautiful? The only difference between the case of a single gate and multiple interacting gates that compute arbitrarily complex expressions is this additional multiply operation that now happens in each gate.

Patterns in the backward flow

Let's look again at our example circuit with the numbers filled in. The first circuit shows the raw values, and the second circuit shows the gradients that flow back to the inputs as discussed. Notice that the gradient always starts off with +1 at the end to start off the chain. This is the (default) pull on the circuit to have its value increased.

[Diagram: the same circuit drawn twice. The first copy shows the forward values (x = -2, y = 5, q = 3, z = -4, output -12); the second copy shows the gradients flowing backward (+1 at the output, -4 on q, 3 on z, and -4 on both x and y).]

After a while you start to notice patterns in how the gradients flow backward in the circuits. For example, the + gate always takes the gradient on top and simply passes it on to all of its inputs (notice the example with -4 simply passed on to both of the inputs of the + gate). This is because its own derivative for the inputs is just +1, regardless of what the actual values of the inputs are, so in the chain rule, the gradient from above is just multiplied by 1 and stays the same. Similar intuitions apply to, for example, a max(x,y) gate. Since the gradient of max(x,y) with respect to its input is +1 for whichever one of x, y is larger and 0 for the other, this gate is during backprop effectively just a gradient switch: it will take the gradient from above and route it to the input that had a higher value during the forward pass.
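
You can see the max "gradient switch" numerically with the same probing trick as before (a small illustration, not from the original text); with inputs 3 and 1, only the larger input feels the tug:

var h = 0.0001;
(Math.max(3 + h, 1) - Math.max(3, 1)) / h; // 1: the larger input receives the full gradient
(Math.max(3, 1 + h) - Math.max(3, 1)) / h; // 0: the smaller input receives none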


Numerical Gradient Check. Before we finish with this section, let's just make sure that the (analytic) gradient we computed by backprop above is correct as a sanity check. Remember that we can do this simply by computing the numerical gradient and making sure that we get [-4, -4, 3] for x,y,z. Here's the code:

// initial conditions
var x = -2, y = 5, z = -4;

// numerical gradient check
var h = 0.0001;
var x_derivative = (forwardCircuit(x+h, y, z) - forwardCircuit(x, y, z)) / h; // -4
var y_derivative = (forwardCircuit(x, y+h, z) - forwardCircuit(x, y, z)) / h; // -4
var z_derivative = (forwardCircuit(x, y, z+h) - forwardCircuit(x, y, z)) / h; // 3

and we get [-4, -4, 3], as computed with backprop. phew! :)

Example: Single Neuron
In the previous section you hopefully got the basic intuition behind backpropagation. Let's now look at an even more complicated and borderline practical example. We will consider a 2-dimensional neuron that computes the following function:

f(x, y, a, b, c) = \sigma(ax + by + c)

In this expression, \sigma is the sigmoid function. It's best thought of as a "squashing function", because it takes the input and squashes it to be between zero and one: Very negative values are squashed towards zero and positive values get squashed towards one. For example, we have sig(-5) = 0.006, sig(0) = 0.5, sig(5) = 0.993. The sigmoid function is defined as:

\sigma(x) = \frac{1}{1 + e^{-x}}

The gradient with respect to its single input, as you can check on Wikipedia or derive yourself if you know some calculus, is given by this expression:

\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))

For example, if the input to the sigmoid gate is x = 3, the gate will compute output f = 1.0 / (1.0 + Math.exp(-x)) = 0.95, and then the (local) gradient on its input will simply be dx = (0.95) * (1 - 0.95) = 0.0475.
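
Quickly verifying these numbers (a one-off check, not part of the original code):

var sig = function(x) { return 1 / (1 + Math.exp(-x)); };
sig(3);                // 0.9526, the 0.95 quoted above
sig(3) * (1 - sig(3)); // 0.0452, the local gradient (0.0475 if you plug in the rounded 0.95)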


That's all we need to use this gate: we know how to take an input and forward it through the sigmoid gate, and we also have the expression for the gradient with respect to its input, so we can also backprop through it. Another thing to note is that technically, the sigmoid function is made up of an entire series of gates in a line that compute more atomic functions: an exponentiation gate, an addition gate and a division gate. Treating it so would work perfectly fine but for this example I chose to collapse all of these gates into a single gate that just computes sigmoid in one shot, because the gradient expression turns out to be simple.

Let's take this opportunity to carefully structure the associated code in a nice and modular way. First, I'd like you to note that every wire in our diagrams has two numbers associated with it:

1. the value it carries during the forward pass
2. the gradient (i.e the pull) that flows back through it in the backward pass

Let's create a simple Unit structure that will store these two values on every wire. Our gates will now operate over Units: they will take them as inputs and create them as outputs.

// every Unit corresponds to a wire in the diagrams
var Unit = function(value, grad) {
  // value computed in the forward pass
  this.value = value;
  // the derivative of circuit output w.r.t this unit, computed in backward pass
  this.grad = grad;
}

In addition to Units we also need 3 gates: +, * and sig (sigmoid). Let's start out by implementing a multiply gate. I'm using Javascript here which has a funny way of simulating classes using functions. If you're not a Javascript-familiar person, all that's going on here is that I'm defining a class that has certain properties (accessed with use of this keyword), and some methods (which in Javascript are placed into the function's prototype). Just think about these as class methods. Also keep in mind that the way we will use these eventually is that we will first forward all the gates one by one, and then backward all the gates in reverse order. Here is the implementation:

var multiplyGate = function(){ };
multiplyGate.prototype = {
  forward: function(u0, u1) {
    // store pointers to input Units u0 and u1 and output unit utop
    this.u0 = u0;
    this.u1 = u1;
    this.utop = new Unit(u0.value * u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // take the gradient in output unit and chain it with the
    // local gradients, which we derived for multiply gate before
    // then write those gradients to those Units.
    this.u0.grad += this.u1.value * this.utop.grad;
    this.u1.grad += this.u0.value * this.utop.grad;
  }
}

The multiply gate takes two units that each hold a value and creates a unit that stores its output. The gradient is initialized to zero. Then notice that in the backward function call we get the gradient from the output unit we produced during the forward pass (which will by now hopefully have its gradient filled in) and multiply it with the local gradient for this gate (chain rule!). This gate computes multiplication (u0.value * u1.value) during forward pass, so recall that the gradient w.r.t u0 is u1.value and w.r.t u1 is u0.value. Also note that we are using += to add onto the gradient in the backward function. This will allow us to possibly use the output of one gate multiple times (think of it as a wire branching out), since it turns out that the gradients from these different branches just add up when computing the final gradient with respect to the circuit output. The other two gates are defined analogously:

var addGate = function(){ };
addGate.prototype = {
  forward: function(u0, u1) {
    this.u0 = u0;
    this.u1 = u1; // store pointers to input units
    this.utop = new Unit(u0.value + u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // add gate. derivative wrt both inputs is 1
    this.u0.grad += 1 * this.utop.grad;
    this.u1.grad += 1 * this.utop.grad;
  }
}

var sigmoidGate = function() {
  // helper function
  this.sig = function(x) { return 1 / (1 + Math.exp(-x)); };
};
sigmoidGate.prototype = {
  forward: function(u0) {
    this.u0 = u0;
    this.utop = new Unit(this.sig(this.u0.value), 0.0);
    return this.utop;
  },
  backward: function() {
    var s = this.sig(this.u0.value);
    this.u0.grad += (s * (1 - s)) * this.utop.grad;
  }
}

Note that, again, the backward function in all cases just computes the local derivative with respect to its input and then multiplies on the gradient from the unit above (i.e. chain rule). To fully specify everything let's finally write out the forward and backward flow for our 2-dimensional neuron with some example values:

// create input units
var a = new Unit(1.0, 0.0);
var b = new Unit(2.0, 0.0);
var c = new Unit(-3.0, 0.0);
var x = new Unit(-1.0, 0.0);
var y = new Unit(3.0, 0.0);

// create the gates
var mulg0 = new multiplyGate();
var mulg1 = new multiplyGate();
var addg0 = new addGate();
var addg1 = new addGate();
var sg0 = new sigmoidGate();

// do the forward pass
var forwardNeuron = function() {
  ax = mulg0.forward(a, x); // a*x = -1
  by = mulg1.forward(b, y); // b*y = 6
  axpby = addg0.forward(ax, by); // a*x + b*y = 5
  axpbypc = addg1.forward(axpby, c); // a*x + b*y + c = 2
  s = sg0.forward(axpbypc); // sig(a*x + b*y + c) = 0.8808
};
forwardNeuron();

console.log('circuit output: ' + s.value); // prints 0.8808


And now let's compute the gradient: Simply iterate in reverse order and call the backward function! Remember that we stored the pointers to the units when we did the forward pass, so every gate has access to its inputs and also the output unit it previously produced.

s.grad = 1.0;
sg0.backward(); // writes gradient into axpbypc
addg1.backward(); // writes gradients into axpby and c
addg0.backward(); // writes gradients into ax and by
mulg1.backward(); // writes gradients into b and y
mulg0.backward(); // writes gradients into a and x

Note that the first line sets the gradient at the output (very last unit) to be 1.0 to start off the gradient chain. This can be interpreted as tugging on the last gate with a force of +1. In other words, we are pulling on the entire circuit to induce the forces that will increase the output value. If we did not set this to 1, all gradients would be computed as zero due to the multiplications in the chain rule. Finally, let's make the inputs respond to the computed gradients and check that the function increased:

var step_size = 0.01;
a.value += step_size * a.grad; // a.grad is -0.105
b.value += step_size * b.grad; // b.grad is 0.315
c.value += step_size * c.grad; // c.grad is 0.105
x.value += step_size * x.grad; // x.grad is 0.105
y.value += step_size * y.grad; // y.grad is 0.210

forwardNeuron();
console.log('circuit output after one backprop: ' + s.value); // prints 0.8825

Success! 0.8825 is higher than the previous value, 0.8808. Finally, let's verify that we implemented the backpropagation correctly by checking the numerical gradient:

var forwardCircuitFast = function(a, b, c, x, y) {
  return 1 / (1 + Math.exp(-(a*x + b*y + c)));
};
var a = 1, b = 2, c = -3, x = -1, y = 3;
var h = 0.0001;
var a_grad = (forwardCircuitFast(a+h, b, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var b_grad = (forwardCircuitFast(a, b+h, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var c_grad = (forwardCircuitFast(a, b, c+h, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var x_grad = (forwardCircuitFast(a, b, c, x+h, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var y_grad = (forwardCircuitFast(a, b, c, x, y+h) - forwardCircuitFast(a, b, c, x, y)) / h;


Indeed, these all give the same values as the backpropagated gradients [-0.105, 0.315, 0.105, 0.105, 0.210]. Nice!

I hope it is clear that even though we only looked at an example of a single neuron, the code I gave above generalizes in a very straightforward way to compute gradients of arbitrary expressions (including very deep expressions #foreshadowing). All you have to do is write small gates that compute local, simple derivatives w.r.t their inputs, wire it up in a graph, do a forward pass to compute the output value and then a backward pass that chains the gradients all the way to the input.

Becoming a Backprop Ninja
Over time you will become much more efficient in writing the backward pass, even for complicated circuits and all at once. Let's practice backprop a bit with a few examples. In what follows, let's not worry about Unit, Circuit classes because they obfuscate things a bit, and let's just use variables such as a,b,c,x, and refer to their gradients as da,db,dc,dx respectively. Again, we think of the variables as the "forward flow" and their gradients as "backward flow" along every wire. Our first example was the * gate:

var x = a * b;
// and given gradient on x (dx), we saw that in backprop we would compute:
var da = b * dx;
var db = a * dx;

In the code above, I'm assuming that the variable dx is given, coming from somewhere above us in the circuit while we're doing backprop (or it is +1 by default otherwise). I'm writing it out because I want to explicitly show how the gradients get chained together. Note from the equations that the * gate acts as a switcher during backward pass, for lack of better word. It remembers what its inputs were, and the gradients on each one will be the value of the other during the forward pass. And then of course we have to multiply with the gradient from above, which is the chain rule. Here's the + gate in this condensed form:

var x = a + b;
// =>
var da = 1.0 * dx;
var db = 1.0 * dx;

Where 1.0 is the local gradient, and the multiplication is our chain rule. What about adding three numbers?:

// lets compute x = a + b + c in two steps:
var q = a + b; // gate 1
var x = q + c; // gate 2

// backward pass:
dc = 1.0 * dx; // backprop gate 2
dq = 1.0 * dx;
da = 1.0 * dq; // backprop gate 1
db = 1.0 * dq;

You can see what's happening, right? If you remember the backward flow diagram, the + gate simply takes the gradient on top and routes it equally to all of its inputs (because its local gradient is always simply 1.0 for all its inputs, regardless of their actual values). So we can do it much faster:

var x = a + b + c;
var da = 1.0 * dx; var db = 1.0 * dx; var dc = 1.0 * dx;

Okay, how about combining gates?:

var x = a * b + c;
// given dx, backprop in one sweep would be =>
da = b * dx;
db = a * dx;
dc = 1.0 * dx;

If you don't see how the above happened, introduce a temporary variable q = a * b and then compute x = q + c to convince yourself. And here is our neuron, let's do it in two steps:

// let's do our neuron in two steps:
var q = a*x + b*y + c;
var f = sig(q); // sig is the sigmoid function
// and now backward pass, we are given df, and:
var df = 1;
var dq = (f * (1 - f)) * df;
// and now we chain it to the inputs
var da = x * dq;
var dx = a * dq;
var dy = b * dq;
var db = y * dq;
var dc = 1.0 * dq;


I hope this is starting to make a little more sense. Now how about this:

var x = a * a;
var da = //???

You can think of this as value a flowing to the * gate, but the wire gets split and becomes both inputs. This is actually simple because the backward flow of gradients always adds up. In other words nothing changes:

var da = a * dx; // gradient into a from first branch
da += a * dx; // and add on the gradient from the second branch

// short form instead is:
var da = 2 * a * dx;

In fact, if you know your power rule from calculus you would also know that if you have f(a) = a^2, then \frac{\partial f(a)}{\partial a} = 2a, which is exactly what we get if we think of it as a wire splitting up and being two inputs to a gate.

Let's do another one:

var x = a*a + b*b + c*c;
// we get:
var da = 2*a*dx;
var db = 2*b*dx;
var dc = 2*c*dx;

Okay now let's start to get more complex:

var x = Math.pow(((a * b + c) * d), 2); // pow(x,2) squares the input JS

When more complex cases like this come up in practice, I like to split the expression into manageable chunks which are almost always composed of simpler expressions and then I chain them together with chain rule:

var x1 = a * b + c;
var x2 = x1 * d;
var x = x2 * x2; // this is identical to the above expression for x
// and now in backprop we go backwards:
var dx2 = 2 * x2 * dx; // backprop into x2
var dd = x1 * dx2; // backprop into d
var dx1 = d * dx2; // backprop into x1
var da = b * dx1;
var db = a * dx1;
var dc = 1.0 * dx1; // done!

That wasn't too difficult! Those are the backprop equations for the entire expression, and we've done them piece by piece and backpropped to all the variables. Notice again how for every variable during forward pass we have an equivalent variable during backward pass that contains its gradient with respect to the circuit's final output. Here are a few more useful functions and their local gradients that are useful in practice:

var x = 1.0 / a; // division
var da = -1.0 / (a * a);

Here's what division might look like in practice then:

var x = (a + b) / (c + d);
// let's decompose it in steps:
var x1 = a + b;
var x2 = c + d;
var x3 = 1.0 / x2;
var x = x1 * x3; // equivalent to above
// and now backprop, again in reverse order:
var dx1 = x3 * dx;
var dx3 = x1 * dx;
var dx2 = (-1.0 / (x2 * x2)) * dx3; // local gradient as shown above, and chain rule
var da = 1.0 * dx1; // and finally into the original variables
var db = 1.0 * dx1;
var dc = 1.0 * dx2;
var dd = 1.0 * dx2;

Hopefully you see that we are breaking down expressions, doing the forward pass, and then for every variable (such as a) we derive its gradient da as we go backwards, one by one, applying the simple local gradients and chaining them with gradients from above. Here's another one:

var x = Math.max(a, b);
var da = a === x ? 1.0 * dx : 0.0;
var db = b === x ? 1.0 * dx : 0.0;


Okay this is making a very simple thing hard to read. The max function passes on the value of the input that was largest and ignores the other ones. In the backward pass then, the max gate will simply take the gradient on top and route it to the input that actually flowed through it during the forward pass. The gate acts as a simple switch based on which input had the highest value during forward pass. The other inputs will have zero gradient. That's what the === is about, since we are testing for which input was the actual max and only routing the gradient to it.

Finally, let's look at the Rectified Linear Unit non-linearity (or ReLU), which you may have heard of. It is used in Neural Networks in place of the sigmoid function. It is simply thresholding at zero:

var x = Math.max(a, 0);
// backprop through this gate will then be:
var da = a > 0 ? 1.0 * dx : 0.0;

In other words this gate simply passes the value through if it's larger than 0, or it stops the flow and sets it to zero. In the backward pass, the gate will pass on the gradient from the top if it was activated during the forward pass, or if the original input was below zero, it will stop the gradient flow.

I will stop at this point. I hope you got some intuition about how you can compute entire expressions (which are made up of many gates along the way) and how you can compute backprop for every one of them.

Everything we've done in this chapter comes down to this: We saw that we can feed some input through arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs. If the inputs respond slightly along the final direction of their tug, the circuit will "give" a bit along the original pull direction. Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Let's now put this machinery to good use.

Chapter 2: Machine Learning


In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass). In this chapter we will see how useful this extremely simple mechanism is in Machine Learning.

Binary Classification
As we did before, let's start out simple. The simplest, common and yet very practical problem in Machine Learning is binary classification. A lot of very interesting and important problems can be reduced to it. The setup is as follows: We are given a dataset of N vectors and every one of them is labeled with a +1 or a -1. For example, in two dimensions our dataset could look as simple as:

vector -> label

[1.2, 0.7] -> +1
[-0.3, 0.5] -> -1
[3, 1] -> +1
[-0.1, -1.0] -> -1
[-3.0, 1.1] -> -1
[2.1, -3] -> +1

Here, we have N = 6 datapoints, where every datapoint has two features (D = 2). Three of the datapoints have label +1 and the other three label -1. This is a silly toy example, but in practice a +1/-1 dataset could be very useful things indeed: For example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.

Goal. Our goal in binary classification is to learn a function that takes a 2-dimensional vector and predicts the label. This function is usually parameterized by a certain set of parameters, and we will want to tune the parameters of the function so that its outputs are consistent with the labeling in the provided dataset. In the end we can discard the dataset and use the learned parameters to predict labels for previously unseen vectors.

Training protocol
We will eventually build up to entire neural networks and complex expressions, but let's start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1. The only difference is that we'll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities). Anyway, let's use a simple linear function:

f(x, y) = ax + by + c

In this expression we think of x and y as the inputs (the 2D vectors) and a,b,c as the parameters of the function that we will want to learn. For example, if a = 1, b = -2, c = -1, then the function will take the first data point ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2. Here is how the training will work:

1. We select a random data point and feed it through the circuit
2. We will interpret the output of the circuit as a confidence that the datapoint has class +1. (i.e. very high values = circuit is very certain datapoint has class +1 and very low values = circuit is certain this datapoint has class -1.)
3. We will measure how well the prediction aligns with the provided labels. Intuitively, for example, if a positive example scores very low, we will want to tug in the positive direction on the circuit, demanding that it should output a higher value for this data point. Note that this is the case for the first data point: it is labeled as +1 but our predictor function only assigns it value -1.2. We will therefore tug on the circuit in positive direction; we want the value to be higher.
4. The circuit will take the tug and backpropagate it to compute tugs on the inputs a,b,c,x,y
5. Since we think of x,y as (fixed) datapoints, we will ignore the pull on x,y. If you're a fan of my physical analogies, think of these inputs as pegs, fixed in the ground.
6. On the other hand, we will take the parameters a,b,c and make them respond to their tug (i.e. we'll perform what we call a parameter update). This, of course, will make it so that the circuit will output a slightly higher score on this particular datapoint in the future.
7. Iterate! Go back to step 1.

The training scheme I described above is commonly referred to as Stochastic Gradient Descent. The interesting part I'd like to reiterate is that a,b,c,x,y are all made up of the same stuff as far as the circuit is concerned: They are inputs to the circuit and the circuit will tug on all of them in some direction. It doesn't know the difference between parameters and datapoints. However, after the backward pass is complete we ignore all tugs on the datapoints (x,y) and keep swapping them in and out as we iterate over examples in the dataset. On the other hand, we keep the parameters (a,b,c) around and keep tugging on them every time we sample a datapoint. Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.
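
In code, the whole protocol is just a loop. The sketch below is mine, not the guide's: data, labels and measureAlignment are placeholders (the SVM section below fills in step 3 concretely, and also adds regularization), and the parameter updates use the gradients of a*x + b*y + c that we already derived:

// a minimal sketch of steps 1-7 for the linear function f(x,y) = a*x + b*y + c
var step_size = 0.01;
for(var iter = 0; iter < 400; iter++) {
  var i = Math.floor(Math.random() * data.length); // 1. pick a random data point
  var x = data[i][0], y = data[i][1];
  var score = a*x + b*y + c;                       // 2. forward the circuit
  var pull = measureAlignment(score, labels[i]);   // 3. placeholder: which way to tug, and how hard
  // 4./5. backprop the tug; the tugs on x,y are computed but ignored
  // 6. only the parameters respond (gradient of score w.r.t. a is x, w.r.t. b is y, w.r.t. c is 1)
  a += step_size * x * pull;
  b += step_size * y * pull;
  c += step_size * 1 * pull;
}                                                  // 7. iterate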

Learning a Support Vector Machine
As a concrete example, let's learn a Support Vector Machine. The SVM is a very popular linear classifier. Its functional form is exactly as I've described in the previous section, f(x, y) = ax + by + c. At this point, if you've seen an explanation of SVMs you're probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc. But here, I'd like to take a different approach. Instead of defining loss functions, I would like to base the explanation on the force specification (I just made this term up by the way) of a Support Vector Machine, which I personally find much more intuitive. As we will see, talking about the force specification and the loss function are identical ways of seeing the same problem. Anyway, here it is:

Support Vector Machine "Force Specification":

- If we feed a positive datapoint through the SVM circuit and the output value is less than 1, pull on the circuit with force +1. This is a positive example so we want the score to be higher for it.
- Conversely, if we feed a negative datapoint through the SVM and the output is greater than -1, then the circuit is giving this datapoint dangerously high score: Pull on the circuit downwards with force -1.
- In addition to the pulls above, always add a small amount of pull on the parameters a,b (notice, not on c!) that pulls them towards zero. You can think of both a,b as being attached to a physical spring that is attached at zero. Just as with a physical spring, this will make the pull proportional to the value of each of a,b (Hooke's law in physics, anyone?). For example, if a becomes very high it will experience a strong pull of magnitude |a| back towards zero. This pull is something we call regularization, and it ensures that neither of our parameters a or b gets disproportionally large. This would be undesirable because both a,b get multiplied to the input features x,y (remember the equation is a*x + b*y + c), so if either of them is too high, our classifier would be overly sensitive to these features. This isn't a nice property because features can often be noisy in practice, so we want our classifier to change relatively smoothly if they wiggle around.

Let's quickly go through a small but concrete example. Suppose we start out with a random parameter setting, say, a = 1, b = -2, c = -1. Then:

- If we feed the point [1.2, 0.7], the SVM will compute score 1 * 1.2 + (-2) * 0.7 - 1 = -1.2. This point is labeled as +1 in the training data, so we want the score to be higher than 1. The gradient on top of the circuit will thus be positive: +1, which will backpropagate to a,b,c. Additionally, there will also be a regularization pull on a of -1 (to make it smaller) and a regularization pull on b of +2 to make it larger, toward zero.
- Suppose instead that we fed the datapoint [-0.3, 0.5] to the SVM. It computes 1 * (-0.3) + (-2) * 0.5 - 1 = -2.3. The label for this point is -1, and since -2.3 is smaller than -1, we see that according to our force specification the SVM should be happy: The computed score is very negative, consistent with the negative label of this example. There will be no pull at the end of the circuit (i.e. it's zero), since no changes are necessary. However, there will still be the regularization pull on a of -1 and on b of +2.

Okay there's been too much text. Let's write the SVM code and take advantage of the circuit machinery we have from Chapter 1:

// A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit
// It can also compute the gradient w.r.t. its inputs
var Circuit = function() {
  // create some gates
  this.mulg0 = new multiplyGate();
  this.mulg1 = new multiplyGate();
  this.addg0 = new addGate();
  this.addg1 = new addGate();
};
Circuit.prototype = {
  forward: function(x, y, a, b, c) {
    this.ax = this.mulg0.forward(a, x); // a*x
    this.by = this.mulg1.forward(b, y); // b*y
    this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y
    this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c
    return this.axpbypc;
  },
  backward: function(gradient_top) { // takes pull from above
    this.axpbypc.grad = gradient_top;
    this.addg1.backward(); // sets gradient in axpby and c
    this.addg0.backward(); // sets gradient in ax and by
    this.mulg1.backward(); // sets gradient in b and y
    this.mulg0.backward(); // sets gradient in a and x
  }
}

That's a circuit that simply computes a*x + b*y + c and can also compute the gradient. It uses the gates code we developed in Chapter 1. Now let's write the SVM, which doesn't care about the actual circuit. It is only concerned with the values that come out of it, and it pulls on the circuit.

// SVM class
var SVM = function() {

  // random initial parameter values
  this.a = new Unit(1.0, 0.0);
  this.b = new Unit(-2.0, 0.0);
  this.c = new Unit(-1.0, 0.0);

  this.circuit = new Circuit();
};
SVM.prototype = {
  forward: function(x, y) { // assume x and y are Units
    this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
    return this.unit_out;
  },
  backward: function(label) { // label is +1 or -1

    // reset pulls on a,b,c
    this.a.grad = 0.0;
    this.b.grad = 0.0;
    this.c.grad = 0.0;

    // compute the pull based on what the circuit output was
    var pull = 0.0;
    if(label === 1 && this.unit_out.value < 1) {
      pull = 1; // the score was too low: pull up
    }
    if(label === -1 && this.unit_out.value > -1) {
      pull = -1; // the score was too high for a negative example, pull down
    }
    this.circuit.backward(pull); // writes gradient into x,y,a,b,c

    // add regularization pull for parameters: towards zero and proportional to value
    this.a.grad += -this.a.value;
    this.b.grad += -this.b.value;
  },
  learnFrom: function(x, y, label) {
    this.forward(x, y); // forward pass (set .value in all Units)
    this.backward(label); // backward pass (set .grad in all Units)
    this.parameterUpdate(); // parameters respond to tug
  },
  parameterUpdate: function() {
    var step_size = 0.01;
    this.a.value += step_size * this.a.grad;
    this.b.value += step_size * this.b.grad;
    this.c.value += step_size * this.c.grad;
  }
};

Now let's train the SVM with Stochastic Gradient Descent:

var data = []; var labels = [];
data.push([1.2, 0.7]); labels.push(1);
data.push([-0.3, 0.5]); labels.push(-1);
data.push([3.0, 0.1]); labels.push(1);
data.push([-0.1, -1.0]); labels.push(-1);
data.push([-1.0, 1.1]); labels.push(-1);
data.push([2.1, -3]); labels.push(1);
var svm = new SVM();

// a function that computes the classification accuracy
var evalTrainingAccuracy = function() {
  var num_correct = 0;
  for(var i = 0; i < data.length; i++) {
    var x = new Unit(data[i][0], 0.0);
    var y = new Unit(data[i][1], 0.0);
    var true_label = labels[i];

    // see if the prediction matches the provided label
    var predicted_label = svm.forward(x, y).value > 0 ? 1 : -1;
    if(predicted_label === true_label) {
      num_correct++;
    }
  }
  return num_correct / data.length;
};

// the learning loop
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = new Unit(data[i][0], 0.0);
  var y = new Unit(data[i][1], 0.0);
  var label = labels[i];
  svm.learnFrom(x, y, label);

  if(iter % 25 == 0) { // every 25 iterations...
    console.log('training accuracy at iter ' + iter + ': ' + evalTrainingAccuracy());
  }
}

This code prints the following output:

training accuracy at iteration 0: 0.3333333333333333
training accuracy at iteration 25: 0.3333333333333333
training accuracy at iteration 50: 0.5
training accuracy at iteration 75: 0.5
training accuracy at iteration 100: 0.3333333333333333
training accuracy at iteration 125: 0.5
training accuracy at iteration 150: 0.5
training accuracy at iteration 175: 0.5
training accuracy at iteration 200: 0.5
training accuracy at iteration 225: 0.6666666666666666
training accuracy at iteration 250: 0.6666666666666666
training accuracy at iteration 275: 0.8333333333333334
training accuracy at iteration 300: 1
training accuracy at iteration 325: 1
training accuracy at iteration 350: 1
training accuracy at iteration 375: 1

We see that initially our classifier only had 33% training accuracy, but by the end all training examples are correctly classified as the parameters a,b,c adjusted their values according to the pulls we exerted. We just trained an SVM! But please don't use this code anywhere in production :) We will see how we can make things much more efficient once we understand what is going on at the core.

Number of iterations needed. With this example data, with this example initialization, and with the setting of step size we used, it took about 300 iterations to train the SVM. In practice, this could be many more or many less depending on how hard or large the problem is, how you're initializing, normalizing your data, what step size you're using, and so on. This is just a toy demonstration, but later we will go over all the best practices for actually training these classifiers in practice. For example, it will turn out that the setting of the step size is very important and tricky. Small step size will make your model slow to train. Large step size will train faster, but if it is too large, it will make your classifier chaotically jump around and not converge to a good final result. We will eventually use withheld validation data to properly tune it to be just in the sweet spot for your particular data.

One thing I'd like you to appreciate is that the circuit can be an arbitrary expression, not just the linear prediction function we used in this example. For example, it can be an entire neural network.

By the way, I intentionally structured the code in a modular way, but we could have trained an SVM with a much simpler code. Here is really what all of these classes and computations boil down to:

var a = 1, b = -2, c = -1; // initial parameters
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute pull
  var score = a*x + b*y + c;
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;

  // compute gradient and update parameters
  var step_size = 0.01;
  a += step_size * (x * pull - a); // -a is from the regularization
  b += step_size * (y * pull - b); // -b is from the regularization
  c += step_size * (1 * pull);
}

this code gives an identical result. Perhaps by now you can glance at the code and see how these equations came about.

Variable pull? A quick note to make at this point: You may have noticed that the pull is always 1, 0, or -1. You could imagine doing other things, for example making this pull proportional to how bad the mistake was. This leads to a variation on the SVM that some people refer to as squared hinge loss SVM, for reasons that will later become clear. Depending on various features of your dataset, that may work better or worse. For example, if you have very bad outliers in your data, e.g. a negative data point that gets a score +100, its influence will be relatively minor on our classifier because we will only pull with force of -1 regardless of how bad the mistake was. In practice we refer to this property of a classifier as robustness to outliers.
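
For concreteness, here is one way the proportional pull could look inside the simple training loop above (my sketch of the squared-hinge idea, not code from the guide; a constant factor is omitted):

var margin = 1 - label * score;        // how far we are from the desired margin
var pull = 0.0;
if(margin > 0) pull = label * margin;  // a worse mistake now produces a stronger (signed) tug

With this pull, the negative outlier scoring +100 would be tugged with a force of about -101 instead of just -1, which is why the constant-force version described above is considered more robust to outliers.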

Let's recap. We introduced the binary classification problem, where we are given N D-dimensional vectors and a label +1/-1 for each. We saw that we can combine these features with a set of parameters inside a real-valued circuit (such as a Support Vector Machine circuit in our example). Then, we can repeatedly pass our data through the circuit and each time tweak the parameters so that the circuit's output value is consistent with the provided labels. The tweaking relied, crucially, on our ability to backpropagate gradients through the circuit. In the end, the final circuit can be used to predict values for unseen instances!

Generalizing the SVM into a Neural Network
Of interest is the fact that an SVM is just a particular type of a very simple circuit (circuit that computes score = a*x + b*y + c where a,b,c are weights and x,y are data points). This can be easily extended to more complicated functions. For example, let's write a 2-layer Neural Network that does the binary classification. The forward pass will look like this:

// assume inputs x,y
var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

The specification above is a 2-layer Neural Network with 3 hidden neurons (n1, n2, n3) that uses a Rectified Linear Unit (ReLU) non-linearity on each hidden neuron. As you can see, there are now several parameters involved, which means that our classifier is more complex and can represent more intricate decision boundaries than just a simple linear decision rule such as an SVM. Another way to think about it is that every one of the three hidden neurons is a linear classifier and now we're putting an extra linear classifier on top of that. Now we're starting to go deeper :). Okay, let's train this 2-layer Neural Network. The code looks very similar to the SVM example code above, we just have to change the forward pass and the backward pass:

//randominitialparameters
vara1=Math.random()0.5;//arandomnumberbetween0.5and0.5
//...similarlyinitializeallotherparameterstorandoms
for(variter=0;iter<400;iter++){
//pickarandomdatapoint
vari=Math.floor(Math.random()*data.length);

https://karpathy.github.io/neuralnets/ 32/41
3/17/2017 Hacker'sguidetoNeuralNetworks

  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute forward pass
  var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
  var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
  var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
  var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

  // compute the pull on top
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1; // we want higher output! Pull up.
  if(label === -1 && score > -1) pull = -1; // we want lower output! Pull down.

  // now compute backward pass to all parameters of the model

  // backprop through the last "score" neuron
  var dscore = pull;
  var da4 = n1 * dscore;
  var dn1 = a4 * dscore;
  var db4 = n2 * dscore;
  var dn2 = b4 * dscore;
  var dc4 = n3 * dscore;
  var dn3 = c4 * dscore;
  var dd4 = 1.0 * dscore; // phew

  // backprop the ReLU non-linearities, in place
  // i.e. just set gradients to zero if the neurons did not "fire"
  var dn3 = n3 === 0 ? 0 : dn3;
  var dn2 = n2 === 0 ? 0 : dn2;
  var dn1 = n1 === 0 ? 0 : dn1;

  // backprop to parameters of neuron 1
  var da1 = x * dn1;
  var db1 = y * dn1;
  var dc1 = 1.0 * dn1;

  // backprop to parameters of neuron 2
  var da2 = x * dn2;
  var db2 = y * dn2;
  var dc2 = 1.0 * dn2;

  // backprop to parameters of neuron 3
  var da3 = x * dn3;
  var db3 = y * dn3;
  var dc3 = 1.0 * dn3;

  // phew! End of backprop!
  // note we could have also backpropped into x,y
  // but we do not need these gradients. We only use the gradients
  // on our parameters in the parameter update, and we discard x,y

  // add the pulls from the regularization, tugging all multiplicative
  // parameters (i.e. not the biases) downward, proportional to their value
  da1 += -a1; da2 += -a2; da3 += -a3;
  db1 += -b1; db2 += -b2; db3 += -b3;
  da4 += -a4; db4 += -b4; dc4 += -c4;

  // finally, do the parameter update
  var step_size = 0.01;
  a1 += step_size * da1;
  b1 += step_size * db1;
  c1 += step_size * dc1;
  a2 += step_size * da2;
  b2 += step_size * db2;
  c2 += step_size * dc2;
  a3 += step_size * da3;
  b3 += step_size * db3;
  c3 += step_size * dc3;
  a4 += step_size * da4;
  b4 += step_size * db4;
  c4 += step_size * dc4;
  d4 += step_size * dd4;
  // wow this is tedious, please use for loops in prod.
  // we're done!
}

And that's how you train a neural network. Obviously, you want to modularize your code nicely but I expanded this example for you in the hope that it makes things much more concrete and simpler to understand. Later, we will look at best practices when implementing these networks and we will structure the code much more neatly in a modular and more sensible way.
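As a small teaser of what that might look like, here is a minimal sketch that stores each hidden neuron's parameters in an array so the forward pass becomes a loop (the layout and the names here are just illustrative assumptions, not the structure we will actually adopt):

var params = [];
for(var k = 0; k < 3; k++) {
  // one object of parameters per hidden neuron
  params.push({a: Math.random() - 0.5, b: Math.random() - 0.5, c: Math.random() - 0.5});
}
var forwardHidden = function(x, y) {
  var ns = [];
  for(var k = 0; k < params.length; k++) {
    var p = params[k];
    ns.push(Math.max(0, p.a*x + p.b*y + p.c)); // ReLU neuron, exactly as before
  }
  return ns; // activations of all hidden neurons
};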

But for now, I hope your takeaway is that a 2-layer Neural Net is really not such a scary thing: we write a forward pass expression, interpret the value at the end as a score, and then we pull on that value in a positive or negative direction depending on what we want that value to be for our current particular example. The parameter update after backprop will ensure that when we see this particular example in the future, the network will be more likely to give us a value we desire, not the one it gave just before the update.

A more Conventional Approach: Loss Functions

Now that we understand the basics of how these circuits function with data, let's adopt a more conventional approach that you might see elsewhere on the internet and in other tutorials and books. You won't see people talking too much about force specifications. Instead, Machine Learning algorithms are specified in terms of loss functions (or cost functions, or objectives).

As I develop this formalism I would also like to start to be a little more careful with how we name our variables and parameters. I'd like these equations to look similar to what you might see in a book or some other tutorial, so let me use more standard naming conventions.

Example: 2-D Support Vector Machine

Let's start with an example of a 2-dimensional SVM. We are given a dataset of N examples (x_{i0}, x_{i1}) and their corresponding labels y_i, which are allowed to be either +1/-1 for positive or negative examples respectively. Most importantly, as you recall we have three parameters (w_0, w_1, w_2). The SVM loss function is then defined as follows:

L = \left[\sum_{i=1}^{N} \max\big(0, -y_i(w_0 x_{i0} + w_1 x_{i1} + w_2) + 1\big)\right] + \alpha\left[w_0^2 + w_1^2\right]

Notice that this expression is always positive, due to the thresholding at zero in the first expression and the squaring in the regularization. The idea is that we will want this expression to be as small as possible. Before we dive into some of its subtleties let me first translate it to code:

var X = [ [1.2, 0.7], [-0.3, 0.5], [3, 2.5] ]; // array of 2-dimensional data
var y = [1, -1, 1]; // array of labels
var w = [0.1, 0.2, 0.3]; // example: random numbers
var alpha = 0.1; // regularization strength

function cost(X, y, w) {

  var total_cost = 0.0; // L, in SVM loss function above
  var N = X.length;
  for(var i = 0; i < N; i++) {
    // loop over all data points and compute their score
    var xi = X[i];
    var score = w[0] * xi[0] + w[1] * xi[1] + w[2];

    // accumulate cost based on how compatible the score is with the label
    var yi = y[i]; // label
    var costi = Math.max(0, -yi * score + 1);
    console.log('example ' + i + ': xi = (' + xi + ') and label = ' + yi);
    console.log('  score computed to be ' + score.toFixed(3));
    console.log('  => cost computed to be ' + costi.toFixed(3));
    total_cost += costi;
  }

  // regularization cost: we want small weights
  var reg_cost = alpha * (w[0]*w[0] + w[1]*w[1]);
  console.log('regularization cost for current model is ' + reg_cost.toFixed(3));
  total_cost += reg_cost;

  console.log('total cost is ' + total_cost.toFixed(3));
  return total_cost;
}
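The output shown next comes from simply invoking the function on the data defined above, along these lines (a usage sketch):

cost(X, y, w); // logs the per-example costs and returns the total cost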

And here is the output:

example 0: xi = (1.2,0.7) and label = 1
  score computed to be 0.560
  => cost computed to be 0.440
example 1: xi = (-0.3,0.5) and label = -1
  score computed to be 0.370
  => cost computed to be 1.370
example 2: xi = (3,2.5) and label = 1
  score computed to be 1.100
  => cost computed to be 0.000
regularization cost for current model is 0.005
total cost is 1.815

Notice how this expression works: It measures how bad our SVM classifier is. Let's step through this explicitly:

The first data point xi = [1.2, 0.7] with label yi = 1 will give score 0.1*1.2 + 0.2*0.7 + 0.3, which is 0.56. Notice, this is a positive example so we want the score to be greater than +1. 0.56 is not enough. And indeed, the expression for cost for this data point will compute: costi = Math.max(0, -1*0.56 + 1), which is 0.44. You can think of the cost as quantifying the SVM's unhappiness.

The second data point xi = [-0.3, 0.5] with label yi = -1 will give score 0.1*(-0.3) + 0.2*0.5 + 0.3, which is 0.37. This isn't looking very good: This score is very high for a negative example. It should be less than -1. Indeed, when we compute the cost: costi = Math.max(0, 1*0.37 + 1), we get 1.37. That's a very high cost from this example, as it is being misclassified.

The last example xi = [3, 2.5] with label yi = 1 gives score 0.1*3 + 0.2*2.5 + 0.3, and that is 1.1. In this case, the SVM will compute costi = Math.max(0, -1*1.1 + 1), which is in fact zero. This data point is being classified correctly and there is no cost associated with it.

A cost function is an expression that measures how bad your classifier is. When the training set is perfectly classified, the cost (ignoring the regularization) will be zero.

Notice that the last term in the loss is the regularization cost, which says that our model parameters should be small values. Due to this term the cost will never actually become zero (because this would mean all parameters of the model except the bias are exactly zero), but the closer we get, the better our classifier will become.

The majority of cost functions in Machine Learning consist of two parts: 1. A part that measures how well a model fits the data, and 2. Regularization, which measures some notion of how complex or likely a model is.
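Schematically (this is just my shorthand to summarize the two parts, not notation we will use later):

L = \underbrace{\sum_{i=1}^{N} L_i}_{\text{data fit}} + \underbrace{\alpha \, R(w)}_{\text{regularization}}

where L_i is the cost contributed by the i-th example and R(w) penalizes large (or otherwise unlikely) parameter settings.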

I hope I convinced you then, that to get a very good SVM we really want to make the cost as small as possible. Sounds familiar? We know exactly what to do: The cost function written above is our circuit. We will forward all examples through the circuit, compute the backward pass and update all parameters such that the circuit will output a smaller cost in the future. Specifically, we will compute the gradient and then update the parameters in the opposite direction of the gradient (since we want to make the cost small, not large).

We know exactly what to do: The cost function written above is our circuit.
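To make that procedure concrete, here is a minimal sketch that does it with a numerical gradient of the cost function above (the step size, the nudge h and the number of iterations are illustrative assumptions; a backprop version would compute the same gradient analytically and much more cheaply):

// assumes X, y, w, alpha and the cost() function defined above are in scope
var step_size = 0.01;
var h = 0.0001; // small nudge used to estimate the gradient numerically
for(var iter = 0; iter < 100; iter++) {
  var base_cost = cost(X, y, w); // note: cost() also logs; you may want to silence that
  var dw = [0, 0, 0];
  for(var j = 0; j < w.length; j++) {
    w[j] += h; // nudge one parameter
    dw[j] = (cost(X, y, w) - base_cost) / h; // estimate dL/dw_j
    w[j] -= h; // undo the nudge
  }
  for(var j = 0; j < w.length; j++) {
    w[j] += -step_size * dw[j]; // step in the opposite direction of the gradient
  }
}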

todo: clean up this section and flesh it out a bit

Chapter 3: Backprop in Practice

Building up a library

Example: Practical Neural Network Classifier
Multiclass: Structured SVM
Multiclass: Logistic Regression, Softmax

Example: Regression
Tiny changes needed to cost function. L2 regularization.

Example: Structured Prediction
Basic idea is to train an (unnormalized) energy model

Vectorized Implementations
Writing a Neural Net classifier in Python with numpy.

Backprop in practice: Tips/Tricks
Monitoring of Cost function
Monitoring training/validation performance
Tweaking initial learning rates, learning rate schedules
Optimization: Using Momentum
Optimization: LBFGS, Nesterov accelerated gradient
Importance of Initialization: weights and biases
Regularization: L2, L1, Group sparsity, Dropout
Hyperparameter search, cross-validations
Common pitfalls: (e.g. dying ReLUs)
Handling unbalanced datasets
Approaches to debugging nets when something doesn't work

Chapter 4: Networks in the Wild
Case studies of models that work well in practice and have been deployed in the wild.

Case Study: Convolutional Neural Networks for images
Convolutional layers, pooling, AlexNet, etc.

Case Study: Recurrent Neural Networks for Speech and Text
Vanilla Recurrent nets, bidirectional recurrent nets. Maybe overview of LSTM

Case Study: Word2Vec
Training word vector representations in NLP

Case Study: t-SNE
Training embeddings for visualizing data

Acknowledgements
Thanks a lot to the following people who made this guide better: wodenokoto (HN), zackmorris (HN).

Comments
This guide is a work in progress and I appreciate feedback, especially regarding parts that were unclear or only made half sense. Thank you!

Some of the Javascript code in this tutorial has been translated to Python by Ajit, find it over on Github.
