PGI / Pathscale
- Recognize OpenMP directives: -mp=nonuma
- Automatic parallelization: -Mconcur
7/17/09
PGI Basic Compiler Usage
- A compiler driver interprets options and invokes preprocessors, compilers, assembler, linker, etc.
- Options precedence: if options conflict, the last option on the command line takes precedence
- Use -Minfo to see a listing of optimizations and transformations performed by the compiler
- Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
- Use man pages for more details on options, e.g. man pgf90
- Use -v to see under the hood
Flags to support language dialects
Fortran
- pgf77, pgf90, pgf95, pghpf tools
- Suffixes: .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
- -Mextend, -Mfixed, -Mfreeform
- Type size: -i2, -i4, -i8, -r4, -r8, etc.
- -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.
C/C++
- pgcc, pgCC (aka pgcpp)
- Suffixes: .c, .C, .cc, .cpp, .i
- -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
- -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs
Specifying the target architecture
- Use the -tp switch. Not needed for Dual Core
- -tp k8-64 or -tp p7-64 or -tp core2-64 for 64-bit code
- -tp amd64e for AMD Opteron rev E or later
- -tp x64 for unified binary
- -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code
- -tp barcelona-64
Flags for debugging aids
- -g generates symbolic debug information used by a debugger
- -gopt generates debug information in the presence of optimization
- -Mbounds adds array bounds checking
- -v gives verbose output, useful for debugging system or build problems
- -Mlist will generate a listing
- -Minfo provides feedback on optimizations made by the compiler
- -S or -Mkeepasm to see the exact assembly generated
Basic optimization switches
- Traditional optimization controlled through -O[<n>], n is 0 to 4
- The -fast switch combines a common set into one simple switch, and is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
- For -Munroll, c specifies completely unrolling loops with this loop count or less
- -Munroll=n:<m> says unroll other loops m times
- -Mlre is loop-carried redundancy elimination
Basic optimization switches, cont.
- The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization
- -fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
- -Mcache_align aligns top-level arrays and objects on cache-line boundaries
- -Mflushz flushes SSE denormal numbers to zero
Node-level tuning
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! motivating examples
- Function Inlining: especially important for C and C++
- Parallelization: for Cray multi-core processors
- Miscellaneous Optimizations: hit or miss, but worth a try
What can Interprocedural Analysis and Optimization with -Mipa do for You?
- Interprocedural constant propagation
- Pointer disambiguation
- Alignment detection, alignment propagation
- Global variable mod/ref detection
- F90 shape propagation
- Function inlining
- IPA optimization of libraries, including inlining
Effect of IPA on the WUPWISE Benchmark

PGF95 Compiler Options          Execution Time in Seconds
-fastsse                        156.49
-fastsse -Mipa=fast             121.65
-fastsse -Mipa=fast,inline       91.72
Using Interprocedural Analysis
- Must be used at both compile time and link time
- Non-disruptive to the development process (edit/build/run)
- Speed-ups of 5% - 10% are common
- -Mipa=safe:<name> - safe to optimize functions which call or are called from unknown function/library name
- -Mipa=libopt - perform IPA optimizations on libraries
- -Mipa=libinline - perform IPA inlining from libraries
Explicit Function Inlining

-Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]
  [lib:]<inlib>    Inline extracted functions from inlib
  [name:]<func>    Inline function func
  except:<func>    Do not inline function func
  size:<n>         Inline only functions smaller than n statements (approximate)
  levels:<n>       Inline n levels of functions
Other C++ recommendations
- Encapsulation, data hiding - small functions: inline!
- Exception handling - use --no_exceptions until 7.0
- Overloaded operators, overloaded functions - okay
- Pointer chasing - -Msafeptr, restrict qualifier, 32 bits?
- Templates, generic programming - now okay
- Inheritance, polymorphism, virtual functions - runtime lookup or check, no inlining, potential performance penalties
SMP Parallelization
- Compiler strives for parallel outer loops, vector SSE inner loops
- -Mconcur=innermost forces a vector/parallel innermost loop
- -Mconcur=cncall enables parallelization of loops with calls
- See the PGI User's Guide or the OpenMP 2.5 standard
- OpenMP programs are compiled with -mp=nonuma
[Block diagram: Cray Compiling Environment]
- Fortran Source / C and C++ Front End: supplied by Edison Design Group, with Cray-developed code for extensions and interface support
- Interprocedural Analysis
- Compiler
- X86 Code Generation: from open-source LLVM, with additional Cray-developed optimizations and interface support
- Object File
Cray Inc. Proprietary Slide 27
Excellent vectorization
- Vectorize more loops than other compilers
OpenMP
- 2.0 standard
- Nesting
PGAS: Functional UPC and CAF available today
Excellent cache optimizations
- Automatic blocking
- Automatic management of what stays in cache
- Prefetching, interchange, fusion, and much more
Cray Inc. Confidential
Plasma Fusion Simulation
- 3D particle-in-cell (PIC) code in toroidal geometry
- Developed by Prof. Zhihong Lin (now at UC Irvine)
- Code has several different characteristics:
  - Stride-1 copies
  - Strided memory operations
  - Computationally intensive
  - Gather/Scatter
  - Sorting and packing
- Main routine is known as the "pusher"
- Main pusher kernel consists of 2 main loop nests
- First loop nest contains groups of 4 statements which include significant indirect addressing
- Turn 4 statements into 1 vector short loop:

  ev(1:4) = ev(1:4) + wp0*wt00*(wz0*tempphi(1:4,0,ij) + wz1*tempphi(1:4,1,ij))

- Second loop is large and computationally intensive, but contains strided loads and computed gather
- CCE automatically vectorizes the loop
[Chart: GTC Pusher performance, 3200 MPI ranks and 4 OMP threads; y-axis: Billion Particles Pushed/Sec (5.0-40.0); series: CCE vs. Previous Best]

[Chart: GTC performance, 3200 MPI ranks and 4 OMP threads; y-axis: Billion Particles Pushed/Sec (2.0-16.0); series: CCE vs. Previous Best]
Overflow is a NASA-developed Navier-Stokes flow solver for [...]

[Chart: Overflow scaling; y-axis: Time in Seconds (512-4096)]
  !$omp parallel do
  do i = 1, 4
     call complex_matmul()
  enddo

  subroutine complex_matmul()
  !$omp parallel do private(j,jend,jsize) ! num_threads(p2)
  do j = 1, n, nb
     jend  = min(n, j+nb-1)
     jsize = jend - j + 1
     call zgemm(transA, transB, m, jsize, k, &
                alpha, A, ldA, B(j,1), ldb, beta, C(1,j), ldC)
  enddo
[Chart: ZGEMM 1000x1000; y-axis: GFlops (0-80); x-axis: parallel method and N threads at each level — Serial ZGEMM, High-Level OMP ZGEMM 4x1, Nested OMP ZGEMM 3x3, Nested OMP ZGEMM 4x2, Nested OMP ZGEMM 2x4, Low-level OMP ZGEMM 1x8]
[Chart: ZGEMM 100x100; y-axis: GFlops (0-35); x-axis: parallel method and N threads at each level — Serial ZGEMM, High-Level OMP ZGEMM 4x1, Nested OMP ZGEMM 3x3, Nested OMP ZGEMM 4x2, Low-Level ZGEMM 1x8]
The Cray Compiling Environment is a new, different, and