Dplyr-Window Functions and Grouped Mutate - Filter

4/28/2015
Window functions and grouped mutate/filter
2015-01-13
A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and returns a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions don't include functions that work element-wise, like + or round(). Window functions include variations on aggregate functions, like cumsum() and cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like lead() and lag().
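As a minimal base R sketch of this distinction (the vector x is made up purely for illustration), compare an aggregate, a window function and an element-wise function applied to the same input:

```r
# Hypothetical input vector, chosen only for illustration
x <- c(3, 1, 4, 1, 5)

# Aggregation function: n inputs, one output
total <- sum(x)

# Window function: n inputs, n outputs, each built from several inputs
running_total <- cumsum(x)

# Element-wise function: n outputs, but each depends on a single input,
# so it does not count as a window function
halved <- x / 2

print(total)          # 14
print(running_total)  # 3 4 8 9 14
print(halved)         # 1.5 0.5 2.0 0.5 2.5
```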
Window functions are used in conjunction with mutate and filter to solve a wide range of problems, some of which are shown below:

library(dplyr)
library(Lahman)
batting <- select(tbl_df(Batting), playerID, yearID, teamID, G, AB:H)
batting <- arrange(batting, playerID, yearID, teamID)
players <- group_by(batting, playerID)

# For each player, find the two years with most hits
filter(players, min_rank(desc(H)) <= 2 & H > 0)
# Within each player, rank each year by the number of games played
mutate(players, G_rank = min_rank(G))
# For each player, find every year that was better than the previous year
filter(players, G > lag(G))
# For each player, compute the average change in games played per year
mutate(players, G_change = (G - lag(G)) / (yearID - lag(yearID)))
# For each player, find all years where they played more games than average
filter(players, G > mean(G))
# For each player, compute a z score based on number of games played
mutate(players, G_z = (G - mean(G)) / sd(G))
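If you don't have the Lahman package to hand, the same pattern can be tried on a toy table. The sketch below assumes a small made-up data frame (toy, with hypothetical players "a" and "b") in place of the real batting data, and shows one grouped filter and one grouped mutate:

```r
library(dplyr)

# Hypothetical stand-in for the batting data: two players, a few seasons each
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  yearID   = c(2001, 2002, 2003, 2001, 2002),
  G        = c(10, 50, 30, 80, 20)
)
by_player <- group_by(toy, playerID)

# Grouped filter: seasons with more games than the same player's previous season
improved <- filter(by_player, G > lag(G))

# Grouped mutate: rank each season within its player by games played
ranked <- mutate(by_player, G_rank = min_rank(G))

print(improved)  # a single row: player "a" in 2002
print(ranked)    # G_rank is 1 3 2 for "a" and 2 1 for "b"
```

Because the table is grouped, lag() and min_rank() restart for each player rather than running across the whole table.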
This vignette is broken down into two sections. First you'll learn about the five families of window functions in R, and what you can use them for. If you're only working with local data sources, you can stop there. Otherwise, continue on to learn about window functions in SQL. They are relatively new, but are supported by Postgres, Amazon's Redshift and Google's BigQuery. The window functions themselves are basically the same (modulo a few name conflicts), but their specification is a little different. I'll briefly review how they work, and then show how dplyr translates their R equivalents to SQL.
Before reading this vignette, you should be familiar with mutate() and filter(). If you want to use window functions with SQL databases, you should also be familiar with the basics of dplyr's SQL translation.
http://cran.rstudio.com/web/packages/dplyr/vignettes/windowfunctions.html
Types of window functions
There are five main families of window functions. Two families are unrelated to aggregation functions:
- Ranking and ordering functions: row_number(), min_rank() (RANK in SQL), dense_rank(), cume_dist(), percent_rank(), and ntile(). These functions all take a vector to order by, and return various types of ranks.
- Offsets: lead() and lag() allow you to access the next and previous values in a vector, making it easy to compute differences and trends.
The other three families are variations on familiar aggregate functions:
- Cumulative aggregates: cumsum(), cummin(), cummax() (from base R), and cumall(), cumany(), and cummean() (from dplyr).
- Rolling aggregates operate in a fixed-width window. You won't find them in base R or in dplyr, but there are many implementations in other packages, such as RcppRoll.
- Recycled aggregates, where an aggregate is repeated to match the length of the input. These are not needed in R because vector recycling automatically recycles aggregates where needed. They are important in SQL, because the presence of an aggregation function usually tells the database to return only one row per group.
Each family is described in more detail below, focussing on the general goals and how to use them with dplyr. For more details, refer to the individual function documentation.
Ranking functions
The ranking functions are variations on a theme, differing in how they handle ties:

x <- c(1, 1, 2, 2, 2)
row_number(x)
#> [1] 1 2 3 4 5
min_rank(x)
#> [1] 1 1 3 3 3
dense_rank(x)
#> [1] 1 1 2 2 2

If you're familiar with R, you may recognise that row_number() and min_rank() can be computed with the base rank() function and various values of the ties.method argument. These functions are provided to save a little typing, and to make it easier to convert between R and SQL.
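The correspondence with base R can be checked directly. This is a small sketch using the same x as above; the variable names same_min and same_first are just illustrative:

```r
library(dplyr)

x <- c(1, 1, 2, 2, 2)

# min_rank() gives the same ranks as base rank() with ties.method = "min"
same_min <- all(min_rank(x) == rank(x, ties.method = "min"))

# row_number() gives the same ranks as base rank() with ties.method = "first"
same_first <- all(row_number(x) == rank(x, ties.method = "first"))

print(c(same_min, same_first))  # TRUE TRUE
```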
Two other ranking functions return numbers between 0 and 1. percent_rank() gives the percentage of the rank (the minimum rank rescaled to lie between 0 and 1); cume_dist() gives the proportion of values less than or equal to the current value.

cume_dist(x)
#> [1] 0.4 0.4 1.0 1.0 1.0
percent_rank(x)
#> [1] 0.0 0.0 0.5 0.5 0.5

These are useful if you want to select, say, the top 10% of records within each group:

# Selects best two years
filter(players, min_rank(desc(G)) <= 2)
# Selects best 10% of years
filter(players, cume_dist(desc(G)) < 0.1)
Finally, ntile() divides the data up into n evenly sized buckets. It's a coarse ranking, and it can be used with mutate() to divide the data into buckets for further summary. For example, we could use ntile() to divide the players within a team into four ranked groups, and calculate the average number of games within each group.

by_team_player <- group_by(batting, teamID, playerID)
by_team <- summarise(by_team_player, G = sum(G))
by_team_quartile <- group_by(by_team, quartile = ntile(G, 4))
summarise(by_team_quartile, mean(G))

quartile     mean(G)
       1    5.355776
       2   24.912267
       3   77.288933
       4  373.693195

All ranking functions rank from lowest to highest so that small input values get small ranks. Use desc() to rank from highest to lowest.
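A quick sketch of both points, using a made-up vector of games played (G here is hypothetical, not the Lahman column):

```r
library(dplyr)

# Hypothetical games-played values
G <- c(10, 20, 30, 40, 50, 60, 70, 80)

# ntile() assigns each value to one of 4 buckets, lowest values first
buckets <- ntile(G, 4)

# desc() reverses the ranking so that the largest value gets rank 1
top_two <- G[min_rank(desc(G)) <= 2]

print(buckets)  # 1 1 2 2 3 3 4 4
print(top_two)  # 70 80
```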
Lead and lag
lead() and lag() produce offset versions of an input vector that is either ahead of or behind the original vector.

x <- 1:5
lead(x)
#> [1]  2  3  4  5 NA
lag(x)
#> [1] NA  1  2  3  4

You can use them to:
- Compute differences or percent changes.

  # Compute the change in games played
  mutate(players, G_delta = G - lag(G))

  Using lag() is more convenient than diff() because for n inputs diff() returns n - 1 outputs.
- Find out when a value changes.

  # Find when a player changed teams
  filter(players, teamID != lag(teamID))
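The length difference between lag() and diff() is easy to see on a short vector. The values below are invented for illustration:

```r
library(dplyr)

# Hypothetical series of values
x <- c(5, 8, 6, 10)

# lag() pads with NA, so the result has the same length as the input
delta <- x - lag(x)

# diff() drops the first position, so the result is one element shorter
delta_diff <- diff(x)

# Percent change relative to the previous value
pct_change <- (x - lag(x)) / lag(x) * 100

print(delta)       # NA 3 -2 4
print(delta_diff)  # 3 -2 4
print(pct_change)
```

Because delta lines up with x row for row, it can be stored as a new column without any padding.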
lead() and lag() have an optional argument order_by. If set, instead of using the row order to determine which value comes before another, they will use another variable. This is important if you have not already sorted the data, or you want to sort one way and lag another.
Here's a simple example of what happens if you don't specify order_by when you need it:

df <- data.frame(year = 2000:2005, value = (0:5) ^ 2)
scrambled <- df[sample(nrow(df)), ]

wrong <- mutate(scrambled, running = cumsum(value))
arrange(wrong, year)
#>   year value running
#> 1 2000     0       5
#> 2 2001     1       5
#> 3 2002     4       4
#> 4 2003     9      55
#> 5 2004    16      46
#> 6 2005    25      30

right <- mutate(scrambled, running = order_by(year, cumsum(value)))
arrange(right, year)
#>   year value running
#> 1 2000     0       0
#> 2 2001     1       1
#> 3 2002     4       5
#> 4 2003     9      14
#> 5 2004    16      30
#> 6 2005    25      55
Cumulative aggregates
Base R provides cumulative sum (cumsum()), cumulative min (cummin()) and cumulative max (cummax()). (It also provides cumprod() but that is rarely useful.) Other common accumulating functions are cumany() and cumall(), cumulative versions of || and &&, and cummean(), a cumulative mean. These are not included in base R, but efficient versions are provided by dplyr.
cumany() and cumall() are useful for selecting all rows up to, or all rows after, a condition is true for the first (or last) time. For example, we can use cumany() to find all records for a player from the first year in which they played more than 150 games onwards:

filter(players, cumany(G > 150))
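To see what these two functions return, here is a sketch on a made-up vector of games played (the values are hypothetical):

```r
library(dplyr)

# Hypothetical games played in five consecutive seasons
G <- c(120, 140, 155, 130, 160)

# cumany(): FALSE until the condition is first TRUE, then TRUE from there on
from_first <- G[cumany(G > 150)]

# cumall(): TRUE until the condition is first FALSE, then FALSE from there on
before_first <- G[cumall(G <= 150)]

print(cumany(G > 150))  # FALSE FALSE TRUE TRUE TRUE
print(from_first)       # 155 130 160
print(before_first)     # 120 140
```

Note that cumany() keeps the season where the condition first holds, while cumall() on the negated condition keeps only the seasons before it.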
As with lead and lag, you may want to control the order in which the accumulation occurs. None of the built-in functions have an order_by argument so dplyr provides a helper: order_by(). You give it the variable you want to order by, and then the call to the window function:

x <- 1:10
y <- 10:1
order_by(y, cumsum(x))
#>  [1] 55 54 52 49 45 40 34 27 19 10

This function uses a bit of non-standard evaluation, so I wouldn't recommend using it inside another function; use the simpler but less concise with_order() instead.
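A sketch of the with_order() alternative, reusing x and y from above. The wrapper running_total() is a hypothetical helper written for this example, not part of dplyr:

```r
library(dplyr)

x <- 1:10
y <- 10:1

# order_by() takes the call cumsum(x) as written
via_order_by <- order_by(y, cumsum(x))

# with_order() takes the ordering vector, the function and its input explicitly
via_with_order <- with_order(order_by = y, fun = cumsum, x = x)

# Hypothetical wrapper: with_order() uses ordinary evaluation, so it is
# straightforward to call from inside another function
running_total <- function(values, ordering) {
  with_order(order_by = ordering, fun = cumsum, x = values)
}

print(all(via_order_by == via_with_order))       # TRUE
print(all(running_total(x, y) == via_order_by))  # TRUE
```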
Recycled aggregates
R's vector recycling makes it easy to select values that are higher or lower than a summary. I call this a recycled aggregate because the value of the aggregate is recycled to be the same length as the original vector. Recycled aggregates are useful if you want to find all records greater than the mean or less than the median:

filter(players, G > mean(G))
filter(players, G < median(G))

While most SQL databases don't have an equivalent of median() or quantile(), when filtering you can achieve much the same effect with ntile(). For example, x > median(x) is roughly equivalent to ntile(x, 2) == 2; x > quantile(x, 0.75) is roughly equivalent to ntile(x, 100) > 75 or ntile(x, 4) > 3.

filter(players, ntile(G, 2) == 2)
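The sketch below checks the median case on a made-up vector with an even number of distinct values, where the two filters agree exactly. With ties the two can differ, since ntile() still splits the data into buckets of nearly equal size.

```r
library(dplyr)

# Hypothetical values: even length, no ties
x <- c(3, 9, 1, 7, 5, 11)

# Recycled aggregate: compare every value with the overall median
above_median <- x > median(x)

# Ranking-based alternative that also translates to SQL
top_half <- ntile(x, 2) == 2

print(all(above_median == top_half))  # TRUE
```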
You can also use this idea to select the records with the highest (x == max(x)) or lowest value (x == min(x)) for a field, but the ranking functions give you more control over ties, and allow you to select any number of records.
Recycled aggregates are also useful in conjunction with mutate(). For example, with the batting data, we could compute the career year, the number of years a player has played since they entered the league:

mutate(players, career_year = yearID - min(yearID) + 1)

playerID   yearID  teamID   G  AB  R  H  career_year
aardsda01    2004  SFN     11   0  0  0            1
aardsda01    2006  CHN     45   2  0  0            3
aardsda01    2007  CHA     25   0  0  0            4
aardsda01    2008  BOS     47   1  0  0            5
..           ..
Or, as in the introductory example, we could compute a z-score:

mutate(players, G_z = (G - mean(G)) / sd(G))

playerID   yearID  teamID   G  AB  R  H         G_z
aardsda01    2004  SFN     11   0  0  0  -1.1167685
aardsda01    2006  CHN     45   2  0  0   0.3297126
aardsda01    2007  CHA     25   0  0  0  -0.5211586
aardsda01    2008  BOS     47   1  0  0   0.4147997
..           ..
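Both recycled aggregates can be reproduced on a toy table. The sketch assumes a small made-up data frame (toy, with hypothetical players "a" and "b") rather than the real batting data:

```r
library(dplyr)

# Hypothetical stand-in for the batting data
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  yearID   = c(2001, 2002, 2003, 2001, 2002),
  G        = c(10, 50, 30, 80, 20)
)
by_player <- group_by(toy, playerID)

# min(), mean() and sd() each return one value per player, which R then
# recycles against that player's rows
out <- mutate(
  by_player,
  career_year = yearID - min(yearID) + 1,
  G_z = (G - mean(G)) / sd(G)
)

print(out)
# career_year is 1 2 3 for "a" and 1 2 for "b";
# G_z is -1 1 0 for "a" (mean 30, standard deviation 20)
```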
Window functions in SQL
Window functions have a slightly different flavour in SQL. The syntax is a little different, and the cumulative, rolling and recycled aggregate functions are all based on the simple aggregate function. The goal in this section is not to tell you everything you need to know about window functions in SQL, but to remind you of the basics and show you how dplyr translates your R expressions into SQL.
Structure of a window function in SQL
In SQL, window functions have the form [expression] OVER ([partition clause] [order clause] [frame clause]):
- The expression is a combination of variable names and window functions. Support for window functions varies from database to database, but most support the ranking functions, lead, lag, nth, first, last, count, min, max, sum, avg and stddev. dplyr generates this from the R expression in your mutate or filter call.
- The partition clause specifies how the window function is broken down over groups. It plays an analogous role to GROUP BY for aggregate functions, and group_by() in dplyr. It is possible for different window functions to be partitioned into different groups, but not all databases support it, and neither does dplyr.
- The order clause controls the ordering (when it makes a difference). This is important for the ranking functions since it specifies which variables to rank by, but it's also needed for cumulative functions and lead. Whenever you're thinking about before and after in SQL, you must always tell it which variable defines the order. In dplyr you do this with arrange(). If the order clause is missing when needed, some databases fail with an error message while others return non-deterministic results.
- The frame clause defines which rows, or frame, are passed to the window function, describing which rows (relative to the current row) should be included. The frame clause provides two offsets which determine the start and end of the frame. There are three special values: -Inf means to include all preceding rows (in SQL, "unbounded preceding"), 0 means the current row ("current row"), and Inf means all following rows ("unbounded following"). The complete set of options is comprehensive, but fairly confusing, and is summarised visually below.
  Of the many possible specifications, there are only three that are commonly used. They select between aggregation variants:
  - Recycled: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  - Cumulative: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  - Rolling: ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  dplyr generates the frame clause based on whether you're using a recycled aggregate or a cumulative aggregate.
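To make the three frame specifications concrete, here is a base R sketch that emulates them for a single partition that is already sorted. The vector g and the helper frame_sum() are hypothetical, written only for this illustration; no database is involved.

```r
# Hypothetical games played by one player, already ordered by year
g <- c(10, 20, 30, 40, 50)

# Sum over the rows from (i + start) to (i + end), clipped to the partition.
# -Inf and Inf stand in for UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING.
frame_sum <- function(values, start, end) {
  n <- length(values)
  sapply(seq_len(n), function(i) {
    lo <- max(1, i + start)
    hi <- min(n, i + end)
    sum(values[lo:hi])
  })
}

recycled   <- frame_sum(g, -Inf, Inf)  # whole partition for every row
cumulative <- frame_sum(g, -Inf, 0)    # everything up to the current row
rolling    <- frame_sum(g, -2, 2)      # two rows either side of the current row

print(recycled)    # 150 150 150 150 150
print(cumulative)  # 10 30 60 100 150
print(rolling)     # 60 100 150 140 120
```

The same aggregate, sum(), produces all three variants; only the frame changes, which is the sense in which the SQL variants are all based on the simple aggregate function.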
It's easiest to understand these specifications by looking at a few examples. Simple examples just need the partition and order clauses:
- Rank each year within a player by number of hits:
  RANK() OVER (PARTITION BY playerID ORDER BY H DESC)
- Compute change in number of games from one year to the next:
  G - LAG(G) OVER (PARTITION BY playerID ORDER BY yearID)
Aggregate variants are more verbose because we also need to supply the frame clause:
- Running sum of G for each player:
  SUM(G) OVER (PARTITION BY playerID ORDER BY yearID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
- Compute the career year:
  yearID - MIN(yearID) OVER (PARTITION BY playerID ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) + 1
- Compute a rolling average of games played:
  AVG(G) OVER (PARTITION BY playerID ORDER BY yearID ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
You'll notice that window functions in SQL are more verbose than in R. This is because different window functions can have different partitions, and the frame specification is more general than the two aggregate variants (recycled and cumulative) provided by dplyr. dplyr makes a trade-off: you can't access rarely used window function capabilities (unless you write raw SQL), but in return, common operations are much more succinct.
Translating dplyr to SQL
To see how individual window functions are translated to SQL, we can use translate_sql() with the argument window = TRUE.

if (has_lahman("postgres")) {
  players_db <- group_by(tbl(lahman_postgres(), "Batting"), playerID)

  print(translate_sql(mean(G), tbl = players_db, window = TRUE))
  print(translate_sql(cummean(G), tbl = players_db, window = TRUE))
  print(translate_sql(rank(G), tbl = players_db, window = TRUE))
  print(translate_sql(ntile(G, 2), tbl = players_db, window = TRUE))
  print(translate_sql(lag(G), tbl = players_db, window = TRUE))
}
#> <SQL> avg("G") OVER (PARTITION BY "playerID" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
#> <SQL> mean("G") OVER (PARTITION BY "playerID" ROWS UNBOUNDED PRECEDING)
#> <SQL> rank() OVER (PARTITION BY "playerID" ORDER BY "G")
#> <SQL> NTILE(2.0) OVER (PARTITION BY "playerID" ORDER BY "G")
#> <SQL> LAG("G", 1, NULL) OVER (PARTITION BY "playerID")

If the tbl has been arranged previously, then that ordering will be used for the order clause:

if (has_lahman("postgres")) {
  players_by_year <- arrange(players_db, yearID)
  print(translate_sql(cummean(G), tbl = players_by_year, window = TRUE))
  print(translate_sql(rank(), tbl = players_by_year, window = TRUE))
  print(translate_sql(lag(G), tbl = players_by_year, window = TRUE))
}
#> <SQL> mean("G") OVER (PARTITION BY "playerID" ORDER BY "yearID" ROWS UNBOUNDED PRECEDING)
#> <SQL> rank() OVER (PARTITION BY "playerID" ORDER BY "yearID")
#> <SQL> LAG("G", 1, NULL) OVER (PARTITION BY "playerID" ORDER BY "yearID")
There are some challenges when translating window functions between R and SQL, because dplyr tries to keep the window functions as similar as possible to both the existing R analogues and to the SQL functions. This means that there are three ways to control the order clause depending on which window function you're using:
- For ranking functions, the ordering variable is the first argument: rank(x), ntile(y, 2). If it is omitted or NULL, the default ordering associated with the tbl (as set by arrange()) is used.
- Accumulating aggregates only take a single argument (the vector to aggregate). To control ordering, use order_by().
- Aggregates implemented in dplyr (lead, lag, nth_value, first_value, last_value) have an order_by argument. Supply it to override the default ordering.
The three options are illustrated in the snippet below:

mutate(players,
  min_rank(yearID),
  order_by(yearID, cumsum(G)),
  lead(G, order_by = yearID)
)

Currently there is no way to order by multiple variables, except by setting the default ordering with arrange(). This will be added in a future release.
Translating filters based on window functions
There are some restrictions on window functions in SQL that make their use with WHERE somewhat challenging. Take this simple example, where we want to find the year each player played the most games:

filter(players, rank(desc(G)) == 1)

The following straightforward translation does not work because window functions are only allowed in SELECT and ORDER BY.

SELECT *
FROM Batting
WHERE rank() OVER (PARTITION BY "playerID" ORDER BY "G" DESC) = 1;

Computing the window function in SELECT and referring to it in WHERE or HAVING doesn't work either, because WHERE and HAVING are computed before windowing functions.

SELECT *, rank() OVER (PARTITION BY "playerID" ORDER BY "G" DESC) as rank
FROM Batting
WHERE rank = 1;

SELECT *, rank() OVER (PARTITION BY "playerID" ORDER BY "G" DESC) as rank
FROM Batting
HAVING rank = 1;

Instead, we must use a subquery:

SELECT *
FROM (
  SELECT *, rank() OVER (PARTITION BY "playerID" ORDER BY "G" DESC) as rank
  FROM Batting
) tmp
WHERE rank = 1;
And even that query is a slight simplification because it will also add a rank column to the original columns. dplyr takes care of generating the full, verbose, query, so you can focus on your data analysis challenges.
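The two-step shape of the subquery can be mimicked locally in R, which may help when reading the generated SQL. This sketch uses a made-up data frame (toy) and a hypothetical column name rnk; it illustrates the idea only and is not the code dplyr runs against a database:

```r
library(dplyr)

# Hypothetical stand-in for the batting data
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  yearID   = c(2001, 2002, 2003, 2001, 2002),
  G        = c(10, 50, 30, 80, 20)
)
by_player <- group_by(toy, playerID)

# Step 1 (the inner SELECT): compute the window function as a new column
ranked <- mutate(by_player, rnk = min_rank(desc(G)))

# Step 2 (the outer WHERE): filter on the column computed in step 1
most_games <- filter(ranked, rnk == 1)

print(most_games)  # "a" in 2002 (50 games) and "b" in 2001 (80 games)
```

As in the SQL version, the intermediate rnk column is carried along into the result unless it is dropped afterwards.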