
Compilers

                                 PGI                         PathScale
  Recommended first compile/run  -fastsse -tp barcelona-64   ftn -O3 -OPT:Ofast -march=barcelona
  Get diagnostics                -Minfo, -Mneginfo           -LNO:simd_verbose=ON
  Inlining                       -Mipa=fast,inline           -ipa
  Recognize OpenMP directives    -mp=nonuma                  -mp
  Automatic parallelization      -Mconcur                    -apo


PGI Basic Compiler Usage
- A compiler driver interprets options and invokes preprocessors, compilers, assembler, linker, etc.
- Options precedence: if options conflict, the last option on the command line takes precedence
- Use -Minfo to see a listing of optimizations and transformations performed by the compiler
- Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
- Use man pages for more details on options, e.g. man pgf90
- Use -v to see under the hood
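For example, a first compile following the recommendations above might look like this (an illustrative command, not from the slides; the file and program names are hypothetical):

    pgf90 -fastsse -tp barcelona-64 -Minfo -Mneginfo -o myprog myprog.f90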

Flags to support language dialects
- Fortran
  - pgf77, pgf90, pgf95, pghpf tools
  - Suffixes: .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
  - -Mextend, -Mfixed, -Mfreeform
  - Type size: -i2, -i4, -i8, -r4, -r8, etc.
  - -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.
- C/C++
  - pgcc, pgCC (a.k.a. pgcpp)
  - Suffixes: .c, .C, .cc, .cpp, .i
  - -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
  - -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs

Specifying the target architecture
- Use the -tp switch (not needed for Dual Core)
  - -tp k8-64 or -tp p7-64 or -tp core2-64 for 64-bit code
  - -tp amd64e for AMD Opteron rev E or later
  - -tp x64 for unified binary
  - -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code
  - -tp barcelona-64

Flags for debugging aids
- -g generates symbolic debug information used by a debugger
- -gopt generates debug information in the presence of optimization
- -Mbounds adds array bounds checking
- -v gives verbose output, useful for debugging system or build problems
- -Mlist will generate a listing
- -Minfo provides feedback on optimizations made by the compiler
- -S or -Mkeepasm to see the exact assembly generated

Basic optimization switches
- Traditional optimization is controlled through -O[<n>], where n is 0 to 4
- The -fast switch combines a common set into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
  - For -Munroll, c specifies to completely unroll loops with this loop count or less
  - -Munroll=n:<m> says unroll other loops m times
  - -Mlre is loop-carried redundancy elimination

Basic optimization switches, cont.
- The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization (illustration below)
- -fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
- -Mcache_align aligns top-level arrays and objects on cache-line boundaries
- -Mflushz flushes SSE denormal numbers to zero
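As an illustration (hypothetical code, not from the slides): a unit-stride loop with independent iterations is the natural target for the packed SSE instructions that -fastsse enables, and -Minfo reports whether it was vectorized.

    subroutine scale_add(n, a, x, y)
      integer :: n, i
      real(8) :: a, x(n), y(n)
      do i = 1, n              ! unit stride, no dependences:
        y(i) = y(i) + a*x(i)   ! a candidate for -Mvect=sse
      end do
    end subroutine scale_add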

Node level tuning

  Vectorization                    packed SSE instructions maximize performance
  Interprocedural Analysis (IPA)   use it! (motivating examples follow)
  Function Inlining                especially important for C and C++
  Parallelization                  for Cray multi-core processors
  Miscellaneous Optimizations      hit or miss, but worth a try


What can Interprocedural Analysis and Optimization with -Mipa do for You?

- Interprocedural constant propagation (sketch below)
- Pointer disambiguation
- Alignment detection, alignment propagation
- Global variable mod/ref detection
- F90 shape propagation
- Function inlining
- IPA optimization of libraries, including inlining
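To illustrate interprocedural constant propagation, here is a minimal hypothetical sketch (not from the benchmark itself) of the situation exploited in the WUPWISE example on the next slide; in the real case the caller and callee sit in separate files, so -Mipa is what carries the constant dimensions across:

    program ipa_demo
      real(8) :: a(4,3), x(3), y(4)
      a = 1.0d0; x = 2.0d0
      call matvec(4, 3, a, x, y)   ! IPA propagates m=4, n=3 into matvec
      print *, y
    contains
      subroutine matvec(m, n, a, x, y)
        integer :: m, n, i, j
        real(8) :: a(m,n), x(n), y(m)
        do i = 1, m          ! with trip counts known at compile time,
          y(i) = 0.0d0       ! the compiler can fully unroll both loops
          do j = 1, n
            y(i) = y(i) + a(i,j)*x(j)
          end do
        end do
      end subroutine matvec
    end program ipa_demo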


Effect of IPA on the WUPWISE Benchmark

  PGF95 Compiler Options        Execution Time in Seconds
  -fastsse                      156.49
  -fastsse -Mipa=fast           121.65
  -fastsse -Mipa=fast,inline     91.72

- -Mipa=fast => constant propagation => compiler sees complex matrices are all 4x3 => completely unrolls loops
- -Mipa=fast,inline => small matrix multiplies are all inlined



Using Interprocedural Analysis

- Must be used at both compile time and link time (see the example below)
- Non-disruptive to the development process (edit/build/run)
- Speed-ups of 5% - 10% are common
- -Mipa=safe:<name> - safe to optimize functions which call or are called from unknown function/library name
- -Mipa=libopt - perform IPA optimizations on libraries
- -Mipa=libinline - perform IPA inlining from libraries
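Because IPA must see both the compile and the link steps, the same -Mipa switch appears on each command; a minimal sketch (file names hypothetical):

    pgf90 -fastsse -Mipa=fast,inline -c main.f90
    pgf90 -fastsse -Mipa=fast,inline -c solver.f90
    pgf90 -fastsse -Mipa=fast,inline main.o solver.o -o prog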


Explicit Function Inlining

  -Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]

    [lib:]<inlib>    inline extracted functions from inlib
    [name:]<func>    inline function func
    except:<func>    do not inline function func
    size:<n>         inline only functions smaller than n statements (approximate)
    levels:<n>       inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!
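Following that recommendation, a C++ build might look like either of these (illustrative commands; the file name is hypothetical):

    pgcpp -fast -Minline=levels:10 app.cpp -o app
    pgcpp -fast -Mipa=fast,inline app.cpp -o app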



Other C++ recommendations

- Encapsulation, data hiding - small functions, inline!
- Exception handling - use --no_exceptions until 7.0
- Overloaded operators, overloaded functions - okay
- Pointer chasing - -Msafeptr, restrict qualifier, 32 bits?
- Templates, generic programming - now okay
- Inheritance, polymorphism, virtual functions - runtime lookup or check, no inlining, potential performance penalties


SMP Parallelization

- -Mconcur for auto-parallelization on multi-core
  - Compiler strives for parallel outer loops, vector SSE inner loops
  - -Mconcur=innermost forces a vector/parallel innermost loop
  - -Mconcur=cncall enables parallelization of loops with calls
- -mp to enable the OpenMP 2.5 parallel programming model
  - See the PGI User's Guide or the OpenMP 2.5 standard
  - OpenMP programs compiled w/out -mp=nonuma
- -Mconcur and -mp can be used together! (sketch below)
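A minimal sketch (hypothetical code, not from the slides): the loop below can be auto-parallelized under -Mconcur, and its directive is honored when compiled with -mp.

    program smp_demo
      implicit none
      integer, parameter :: n = 100000
      integer :: i
      real(8) :: a(n), b(n)
      b = 1.0d0
      !$omp parallel do            ! enabled by -mp
      do i = 1, n                  ! independent iterations, so
        a(i) = 2.0d0*b(i) + 3.0d0  ! -Mconcur can also parallelize
      end do                       ! this loop automatically
      print *, a(1), a(n)
    end program smp_demo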


Introduction to the Cray compiler
- Examples
  - GTC
  - Overflow
  - PARQUET


Cray has a long tradition of high performance compilers
- Vectorization
- Parallelization
- Code transformation
- More
Began an internal investigation leveraging an open source compiler called LLVM
- Initial results and progress better than expected
- Decided to move forward with a Cray X86 compiler
- 7.0 released in December 2008
- 7.1 will be released Q2 2009



[Block diagram: Cray compiler architecture]
- Fortran Source -> Fortran Front End; C and C++ Source -> C & C++ Front End
  (the C and C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support)
- Both front ends feed Interprocedural Analysis, then Optimization and Parallelization (Cray Inc. compiler technology)
- Back ends: X86 Code Generator (X86 code generation from open source LLVM, with additional Cray-developed optimizations and interface support) and Cray X2 Code Generator, producing the Object File

- Make sure it is available: module avail PrgEnv-cray
- To access the Cray compiler: module load PrgEnv-cray
- To target the Barcelona chip: module load xtpe-quadcore
- Once you have loaded the module, cc and ftn are the Cray compilers
- Recommend just using default options
- Use -rm (Fortran) and -h list=m (C) to find out what happened
- man crayftn
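Putting those steps together, an illustrative session (the source file name is hypothetical):

    module load PrgEnv-cray
    module load xtpe-quadcore
    ftn -rm -o mycode mycode.f90   # -rm emits a listing showing what the compiler did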



- Excellent vectorization
  - Vectorizes more loops than other compilers
- OpenMP
  - 2.0 standard
  - Nesting
- PGAS: functional UPC and CAF available today
- Excellent cache optimizations
  - Automatic blocking
  - Automatic management of what stays in cache
  - Prefetching, interchange, fusion, and much more

- C++ support
- Automatic parallelization
  - Modernized version of the Cray X1 streaming capability
  - Interacts with OMP directives
- OpenMP 3.0
- Optimized PGAS
  - Will require the Gemini network to really go fast
- Improved vectorization
- Improved cache optimizations



Plasma fusion simulation (GTC)
- 3D particle-in-cell (PIC) code in toroidal geometry
- Developed by Prof. Zhihong Lin (now at UC Irvine)
- Code has several different characteristics
  - Stride-1 copies
  - Strided memory operations
  - Computationally intensive
  - Gather/scatter
  - Sorting and packing
- The main routine is known as the "pusher"


The main pusher kernel consists of 2 main loop nests. The first loop nest contains groups of 4 statements which include significant indirect addressing:

    e1 = e1 + wp0*wt00*(wz0*gradphi(1,0,ij) + wz1*gradphi(1,1,ij))
    e2 = e2 + wp0*wt00*(wz0*gradphi(2,0,ij) + wz1*gradphi(2,1,ij))
    e3 = e3 + wp0*wt00*(wz0*gradphi(3,0,ij) + wz1*gradphi(3,1,ij))
    e4 = e4 + wp0*wt00*(wz0*phit(0,ij)      + wz1*phit(1,ij))

Turn the 4 statements into 1 short vector loop (with gradphi and phit packed into a temporary array tempphi):

    ev(1:4) = ev(1:4) + wp0*wt00*(wz0*tempphi(1:4,0,ij) + wz1*tempphi(1:4,1,ij))

The second loop is large and computationally intensive, but contains strided loads and a computed gather; CCE automatically vectorizes the loop.

GTC Pusher performance - 3200 MPI ranks and 4 OMP threads
[Bar chart: Billion Particles Pushed/Sec (5.0 to 40.0); series: CCE vs. Previous Best]

GTC performance - 3200 MPI ranks and 4 OMP threads
[Bar chart: Billion Particles Pushed/Sec (2.0 to 16.0); series: CCE vs. Previous Best]

Overflow is a NASA-developed Navier-Stokes flow solver for unstructured grids
- Subroutines consist of two or three simply nested loops
- Inner loops tend to be highly vectorized and have 20-50 Fortran statements
- MPI is used for parallel processing
  - The solver automatically splits grid blocks for load balancing
  - Scaling is limited due to load balancing at >1024 cores
- The code is threaded at a high level via OpenMP


Overflow Scaling
[Plot: Time in Seconds (256 to 4096) vs. Number of Cores (256 to 8192); series: Previous MPI, CCE MPI, CCE OMP 2thr, CCE OMP 4thr]

PARQUET
- Materials science code
- Scales to 1000s of MPI ranks before it runs out of parallelism
- Want to use shared memory parallelism across the entire node
- The main kernel consists of 4 independent zgemms
- Want to use multi-level OMP to scale across the node


    !$omp parallel do
    do i = 1, 4
      call complex_matmul()   ! 4 independent complex matrix multiplies
    end do

    subroutine complex_matmul()
      ! Low-level OMP: parallelize over column blocks of width nb
      !$omp parallel do private(j,jend,jsize) !num_threads(p2)
      do j = 1, n, nb
        jend  = min(n, j+nb-1)
        jsize = jend - j + 1
        call zgemm(transA, transB, m, jsize, k, &
                   alpha, A, ldA, B(j,1), ldB, beta, C(1,j), ldC)
      end do
    end subroutine complex_matmul

ZGEMM 1000x1000
[Bar chart: GFlops (0 to 80) by parallel method and N threads at each level: Serial ZGEMM; High-level OMP ZGEMM 4x1; Nested OMP ZGEMM 3x3; Nested OMP ZGEMM 4x2; Nested OMP ZGEMM 2x4; Low-level OMP ZGEMM 1x8]

ZGEMM 100x100
[Bar chart: GFlops (0 to 35) by parallel method and N threads at each level: Serial ZGEMM; High-level OMP ZGEMM 4x1; Nested OMP ZGEMM 3x3; Nested OMP ZGEMM 4x2; Low-level ZGEMM 1x8]

The Cray Compiling Environment is a new, different, and interesting compiler with several unique capabilities
- Several codes are already taking advantage of CCE
- Development is ongoing
- Consider trying CCE if you think you could take advantage of its capabilities
