Advanced Cache Optimizations
ECE 154B
Dmitri Strukov
#1: Way Prediction
How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
Way prediction: keep extra bits in the cache to predict the way, or block within the set, of the next cache access.
The multiplexor is set early to select the desired block, so only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data.
On a miss, first check the other blocks for matches in the next clock cycle.
Timing diagram: Hit Time, Way-Miss Hit Time, Miss Penalty
Accuracy ≈ 85%
Drawback: pipelining the CPU is hard if a hit takes 1 or 2 cycles.
Used for instruction caches vs. L1 data caches.
Also used on the MIPS R10K for the off-chip L2 unified cache, with the way-prediction table on chip.
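As a rough illustration of why the 1-or-2-cycle hit matters, the expected hit time can be computed from the prediction accuracy. This is a minimal sketch: the function name and the 1-cycle and 2-cycle costs are assumptions taken from the "hit takes 1 or 2 cycles" remark, not measured values.

```python
def average_hit_time(accuracy, predicted_cycles=1, mispredicted_cycles=2):
    """Expected hit time in cycles for a way-predicted cache.

    accuracy: fraction of hits found in the predicted way (assumed 0..1).
    predicted_cycles / mispredicted_cycles: assumed costs of a hit in the
    predicted way and of a hit found one cycle later in another way.
    """
    return accuracy * predicted_cycles + (1.0 - accuracy) * mispredicted_cycles


# With the 85% accuracy quoted on the slide, a hit costs about 1.15 cycles on average.
print(round(average_hit_time(0.85), 2))
```

The average stays close to the direct-mapped hit time, but the pipeline still has to cope with the occasional second cycle.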
Victim Cache
Efficient for the thrashing problem in direct-mapped caches.
Removes 20% to 90% of cache misses to the L1 cache.
L1 and the victim cache (VC) are exclusive.
Two cases: miss in L1 but hit in VC; miss in both L1 and VC.
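The two cases above can be sketched with a small simulation. This is a simplified model under stated assumptions: the function `simulate`, the FIFO replacement in the victim cache, and the 4-set direct-mapped L1 are all hypothetical choices for illustration.

```python
from collections import OrderedDict


def simulate(block_addresses, num_sets=4, victim_entries=0):
    """Count misses that go past a direct-mapped L1 and an optional victim cache."""
    l1 = [None] * num_sets      # direct mapped: one block address per set
    victim = OrderedDict()      # small fully associative buffer, oldest entry first
    misses = 0
    for block in block_addresses:
        index = block % num_sets
        if l1[index] == block:
            continue            # L1 hit
        evicted = l1[index]
        if block in victim:
            del victim[block]   # miss in L1 but hit in VC: swap, keeping the two exclusive
        else:
            misses += 1         # miss in both L1 and VC: fetch from the next level
        l1[index] = block
        if evicted is not None and victim_entries > 0:
            victim[evicted] = True
            if len(victim) > victim_entries:
                victim.popitem(last=False)   # drop the oldest victim (assumed FIFO)
    return misses


# Blocks 0 and 4 map to the same set and thrash a direct-mapped cache.
trace = [0, 4] * 8
print(simulate(trace, victim_entries=0))   # 16: every access misses
print(simulate(trace, victim_entries=1))   # 2: only the two cold misses remain
```

A single victim entry is enough here because only two blocks compete for the set; longer conflict chains would need more entries.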
#4: Nonblocking Cache: Basic Idea
A nonblocking cache, or lockup-free cache, allows the data cache to continue to supply cache hits during a miss.
"Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests.
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
The Pentium Pro allows 4 outstanding memory misses (the Cray X1E vector supercomputer allows 2,048 outstanding memory misses).
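The effect of overlapping misses can be sketched with a toy timeline. This is only an illustrative model under loud assumptions: one access issues per cycle, every miss costs the same fixed penalty, and no instruction ever depends on a pending miss. The function name and parameters are hypothetical.

```python
def total_cycles(trace, miss_penalty=10, max_outstanding=0):
    """Cycles to finish a trace of 'H' (hit) and 'M' (miss) accesses.

    max_outstanding=0 models a blocking cache; 1 models hit under miss;
    larger values model miss under miss.
    """
    cycle = 0
    in_flight = []   # completion times of outstanding misses
    for access in trace:
        cycle += 1   # one cycle to issue each access
        in_flight = [t for t in in_flight if t > cycle]
        if access == "M":
            if max_outstanding == 0:
                cycle += miss_penalty              # blocking: stall for the whole miss
            else:
                if len(in_flight) == max_outstanding:
                    cycle = min(in_flight)         # wait for a free miss slot
                    in_flight = [t for t in in_flight if t > cycle]
                in_flight.append(cycle + miss_penalty)
    return max([cycle] + in_flight)


trace = "MHHHMHHH"
print(total_cycles(trace, max_outstanding=0))   # 28: blocking cache
print(total_cycles(trace, max_outstanding=1))   # 21: hit under miss
print(total_cycles(trace, max_outstanding=2))   # 15: miss under miss
```

Real gains are smaller than this toy suggests, because later instructions often need the missing data.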
Nonblocking Cache
Figure 2.5 The effectiveness of a nonblocking cache is evaluated by allowing 1, 2, or 64 hits under a cache miss with 9 SPECINT (on the left) and 9 SPECFP (on the right) benchmarks. The data memory system, modeled after the Intel i7, consists of a 32 KB L1 cache with a four-cycle access latency. The L2 cache (shared with instructions) is 256 KB with a 10-clock-cycle access latency. The L3 is 2 MB with a 36-cycle access latency. All the caches are eight-way set associative and have a 64-byte block size. Allowing one hit under miss reduces the miss penalty by 9% for the integer benchmarks and 12.5% for the floating point. Allowing a second hit improves these results to 10% and 16%, and allowing 64 results in little additional improvement.
Nonblocking Cache Implementation
Requires out-of-order execution.
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
Requires a pipelined or banked memory system (otherwise it cannot support multiple outstanding accesses).
Nonblocking Cache Example
Maximum number of outstanding references to maintain peak bandwidth for a system?
Sustained transfer rate: 16 GB/sec
Memory access: 36 ns
Block size: 64 bytes
50% need not be issued
Answer: $\frac{16 \times 10^{9}}{64} \times 36 \times 10^{-9} \times 2 = 18$
The memory system sustains $16 \times 10^{9} / 64 = 250 \times 10^{6}$ references per second; each takes 36 ns, so 9 references are in flight at peak bandwidth. The factor of 2 accounts for the 50% of references that need not be issued.
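The same arithmetic can be checked in a few lines. This is a minimal sketch; the function name and the `issue_factor` parameter (the factor of 2 from the 50% assumption) are hypothetical names introduced for illustration.

```python
def outstanding_references(bandwidth_bytes_per_s, latency_s, block_bytes, issue_factor=1.0):
    """Bandwidth-delay product expressed in cache blocks.

    issue_factor scales the result for references that need not be issued
    (2.0 corresponds to the 50% case in the example).
    """
    refs_per_second = bandwidth_bytes_per_s / block_bytes
    return refs_per_second * latency_s * issue_factor


base = outstanding_references(16e9, 36e-9, 64)
doubled = outstanding_references(16e9, 36e-9, 64, issue_factor=2.0)
print(round(base), round(doubled))   # 9 18
```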
#5: Increasing Cache Bandwidth via Multiple Banks
Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses.
E.g., 4 banks in L1 and 8 in L2 for the Intel Core i7.
Banking works best when accesses naturally spread themselves across banks, so the mapping of addresses to banks affects the behavior of the memory system.
A simple mapping that works well is sequential interleaving:
Spread block addresses sequentially across banks.
E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; and so on.
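Sequential interleaving reduces to a modulo on the block address. The sketch below assumes a 64-byte block and 4 banks; the function name `bank_of` is hypothetical.

```python
def bank_of(byte_address, block_bytes=64, num_banks=4):
    """Bank index under sequential interleaving: block address modulo number of banks."""
    block_address = byte_address // block_bytes
    return block_address % num_banks


# Eight consecutive blocks rotate through the four banks.
banks = [bank_of(addr) for addr in range(0, 8 * 64, 64)]
print(banks)   # [0, 1, 2, 3, 0, 1, 2, 3]
```

A sequential access stream therefore touches each bank in turn, while a stride of 4 blocks would hit the same bank every time.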
#6: Reduce Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block before restarting the CPU.
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Spatial locality: programs tend to want the next sequential word, so the size of the benefit of early restart alone is not clear.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block.
Long blocks are more popular today, so critical word first is widely used.
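The difference between the two schemes can be sketched by counting how many words must arrive before the CPU restarts. This is a simplified model: the wrap-around fill order and the policy names are assumptions for illustration, and real memory systems add a fixed latency before the first word.

```python
def critical_word_first_order(requested_word, words_per_block=8):
    """Fill order that starts at the missed word and wraps around (assumed order)."""
    return [(requested_word + i) % words_per_block for i in range(words_per_block)]


def words_before_restart(policy, requested_word, words_per_block=8):
    """Number of words that must arrive before the CPU can continue."""
    if policy == "wait_for_block":
        return words_per_block
    if policy == "early_restart":
        return requested_word + 1    # block still arrives in order 0, 1, 2, ...
    if policy == "critical_word_first":
        return 1                     # the missed word is requested first
    raise ValueError(f"unknown policy: {policy}")


print(critical_word_first_order(5))   # [5, 6, 7, 0, 1, 2, 3, 4]
for policy in ("wait_for_block", "early_restart", "critical_word_first"):
    print(policy, words_before_restart(policy, requested_word=5))
```

Early restart helps little when the missed word sits near the end of the block, which is where critical word first pays off.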
#7: Merging Write Buffer to Reduce Miss Penalty
A write buffer allows the processor to continue while waiting to write to memory.
If the buffer contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry.
If so, the new data are combined with that entry.
This increases the block size of writes for a write-through cache when writes go to sequential words or bytes, since multiword writes are more efficient to memory.
The Sun T1 (Niagara) processor, among many others, uses write merging.
Merging Write Buffer Example
Figure 2.7 To illustrate write merging, the write buffer on top does not use it while the write buffer on the bottom does. The four writes are merged into a single buffer entry with write merging; without it, the buffer is full even though three-fourths of each entry is wasted. The buffer has four entries, and each entry holds four 64-bit words. The address for each entry is on the left, with a valid bit (V) indicating whether the next sequential 8 bytes in this entry are occupied. (Without write merging, the words to the right in the upper part of the figure would only be used for instructions that wrote multiple words at the same time.)
Interesting issue with conflicting design objectives, i.e., ejecting entries as soon as possible vs. keeping them longer for merging.
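A small model of the buffer from the figure (four entries, four 64-bit words each) shows how merging changes occupancy. This is a sketch under assumptions: the function name, the 32-byte entry alignment, and the example addresses are hypothetical, and draining entries to memory is not modeled.

```python
WORD_BYTES = 8                      # one 64-bit word
WORDS_PER_ENTRY = 4
ENTRY_BYTES = WORD_BYTES * WORDS_PER_ENTRY


def entries_used(write_addresses, merging, num_entries=4):
    """Return how many buffer entries the writes occupy, or None if the buffer overflows."""
    entries = []                    # each entry: (base address, set of valid word slots)
    for addr in write_addresses:
        base = addr - addr % ENTRY_BYTES
        slot = (addr % ENTRY_BYTES) // WORD_BYTES
        if merging:
            match = next((e for e in entries if e[0] == base), None)
            if match is not None:
                match[1].add(slot)  # combine the new data with the matching entry
                continue
        if len(entries) == num_entries:
            return None             # buffer full: the processor would have to stall
        entries.append((base, {slot}))
    return len(entries)


writes = [0x100, 0x108, 0x110, 0x118]            # four sequential 8-byte writes (example values)
print(entries_used(writes, merging=False))       # 4: buffer is full
print(entries_used(writes, merging=True))        # 1: all four writes share one entry
```

With merging, three entries stay free for later writes, which is the benefit that has to be weighed against holding entries longer.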
#8: Compiler Optimizations
Loop Interchange
Loop Fusion
Blocking
#9: Prefetching
Issues in Prefetching
Hardware Instruction Prefetching
Hardware Data Prefetching
Software Prefetching
Software Prefetching Issues
Summary
Acknowledgements
Some of the slides contain material developed and copyrighted by Arvind, Emer (MIT), Asanovic (UCB/MIT) and instructor material for the textbook.