
Compilers

                                 PGI                         PathScale
  Recommended first compile/run  -fastsse -tp barcelona-64   ftn -O3 -OPT:Ofast -march=barcelona
  Get diagnostics                -Minfo, -Mneginfo           -LNO:simd_verbose=ON
  Inlining                       -Mipa=fast,inline           -ipa
  Recognize OpenMP directives    -mp=nonuma                  -mp
  Automatic parallelization      -Mconcur                    -apo


PGI Basic Compiler Usage
- A compiler driver interprets options and invokes preprocessors, compilers, assembler, linker, etc.
- Options precedence: if options conflict, the last option on the command line takes precedence
- Use -Minfo to see a listing of optimizations and transformations performed by the compiler
- Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
- Use man pages for more details on options, e.g. man pgf90
- Use -v to see under the hood
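For example, a first compile following the recommendations above might look like this (an illustrative command, not from the slides; the file and program names are hypothetical):

    pgf90 -fastsse -tp barcelona-64 -Minfo -Mneginfo -o myprog myprog.f90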

Flags to support language dialects
- Fortran
  - pgf77, pgf90, pgf95, pghpf tools
  - Suffixes: .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
  - -Mextend, -Mfixed, -Mfreeform
  - Type size: -i2, -i4, -i8, -r4, -r8, etc.
  - -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.
- C/C++
  - pgcc, pgCC (a.k.a. pgcpp)
  - Suffixes: .c, .C, .cc, .cpp, .i
  - -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
  - -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs

Specifying the target architecture
- Use the -tp switch (not needed for Dual Core)
  - -tp k8-64 or -tp p7-64 or -tp core2-64 for 64-bit code
  - -tp amd64e for AMD Opteron rev E or later
  - -tp x64 for unified binary
  - -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code
  - -tp barcelona-64

Flags for debugging aids
- -g generates symbolic debug information used by a debugger
- -gopt generates debug information in the presence of optimization
- -Mbounds adds array bounds checking
- -v gives verbose output, useful for debugging system or build problems
- -Mlist will generate a listing
- -Minfo provides feedback on optimizations made by the compiler
- -S or -Mkeepasm to see the exact assembly generated

Basic optimization switches
- Traditional optimization is controlled through -O[<n>], where n is 0 to 4
- The -fast switch combines a common set into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
  - For -Munroll, c specifies to completely unroll loops with this loop count or less
  - -Munroll=n:<m> says unroll other loops m times
  - -Mlre is loop-carried redundancy elimination

Basic optimization switches, cont.
- The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization (illustration below)
- -fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
- -Mcache_align aligns top-level arrays and objects on cache-line boundaries
- -Mflushz flushes SSE denormal numbers to zero
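As an illustration (hypothetical code, not from the slides): a unit-stride loop with independent iterations is the natural target for the packed SSE instructions that -fastsse enables, and -Minfo reports whether it was vectorized.

    subroutine scale_add(n, a, x, y)
      integer :: n, i
      real(8) :: a, x(n), y(n)
      do i = 1, n              ! unit stride, no dependences:
        y(i) = y(i) + a*x(i)   ! a candidate for -Mvect=sse
      end do
    end subroutine scale_add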

Node level tuning

  Vectorization                    packed SSE instructions maximize performance
  Interprocedural Analysis (IPA)   use it! (motivating examples follow)
  Function Inlining                especially important for C and C++
  Parallelization                  for Cray multi-core processors
  Miscellaneous Optimizations      hit or miss, but worth a try


What can Interprocedural Analysis and Optimization with -Mipa do for You?

- Interprocedural constant propagation (sketch below)
- Pointer disambiguation
- Alignment detection, alignment propagation
- Global variable mod/ref detection
- F90 shape propagation
- Function inlining
- IPA optimization of libraries, including inlining
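To illustrate interprocedural constant propagation, here is a minimal hypothetical sketch (not from the benchmark itself) of the situation exploited in the WUPWISE example on the next slide; in the real case the caller and callee sit in separate files, so -Mipa is what carries the constant dimensions across:

    program ipa_demo
      real(8) :: a(4,3), x(3), y(4)
      a = 1.0d0; x = 2.0d0
      call matvec(4, 3, a, x, y)   ! IPA propagates m=4, n=3 into matvec
      print *, y
    contains
      subroutine matvec(m, n, a, x, y)
        integer :: m, n, i, j
        real(8) :: a(m,n), x(n), y(m)
        do i = 1, m          ! with trip counts known at compile time,
          y(i) = 0.0d0       ! the compiler can fully unroll both loops
          do j = 1, n
            y(i) = y(i) + a(i,j)*x(j)
          end do
        end do
      end subroutine matvec
    end program ipa_demo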


Effect of IPA on the WUPWISE Benchmark

  PGF95 Compiler Options        Execution Time in Seconds
  -fastsse                      156.49
  -fastsse -Mipa=fast           121.65
  -fastsse -Mipa=fast,inline     91.72

- -Mipa=fast => constant propagation => compiler sees complex matrices are all 4x3 => completely unrolls loops
- -Mipa=fast,inline => small matrix multiplies are all inlined



Using Interprocedural Analysis

- Must be used at both compile time and link time (see the example below)
- Non-disruptive to the development process (edit/build/run)
- Speed-ups of 5% - 10% are common
- -Mipa=safe:<name> - safe to optimize functions which call or are called from unknown function/library name
- -Mipa=libopt - perform IPA optimizations on libraries
- -Mipa=libinline - perform IPA inlining from libraries
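Because IPA must see both the compile and the link steps, the same -Mipa switch appears on each command; a minimal sketch (file names hypothetical):

    pgf90 -fastsse -Mipa=fast,inline -c main.f90
    pgf90 -fastsse -Mipa=fast,inline -c solver.f90
    pgf90 -fastsse -Mipa=fast,inline main.o solver.o -o prog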


Explicit Function Inlining

  -Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]

    [lib:]<inlib>    inline extracted functions from inlib
    [name:]<func>    inline function func
    except:<func>    do not inline function func
    size:<n>         inline only functions smaller than n statements (approximate)
    levels:<n>       inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!
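Following that recommendation, a C++ build might look like either of these (illustrative commands; the file name is hypothetical):

    pgcpp -fast -Minline=levels:10 app.cpp -o app
    pgcpp -fast -Mipa=fast,inline app.cpp -o app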



Other C++ recommendations

- Encapsulation, data hiding - small functions, inline!
- Exception handling - use --no_exceptions until 7.0
- Overloaded operators, overloaded functions - okay
- Pointer chasing - -Msafeptr, restrict qualifier, 32 bits?
- Templates, generic programming - now okay
- Inheritance, polymorphism, virtual functions - runtime lookup or check, no inlining, potential performance penalties


SMP Parallelization

- -Mconcur for auto-parallelization on multi-core
  - Compiler strives for parallel outer loops, vector SSE inner loops
  - -Mconcur=innermost forces a vector/parallel innermost loop
  - -Mconcur=cncall enables parallelization of loops with calls
- -mp to enable the OpenMP 2.5 parallel programming model
  - See the PGI User's Guide or the OpenMP 2.5 standard
  - OpenMP programs compiled w/out -mp=nonuma
- -Mconcur and -mp can be used together! (sketch below)
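A minimal sketch (hypothetical code, not from the slides): the loop below can be auto-parallelized under -Mconcur, and its directive is honored when compiled with -mp.

    program smp_demo
      implicit none
      integer, parameter :: n = 100000
      integer :: i
      real(8) :: a(n), b(n)
      b = 1.0d0
      !$omp parallel do            ! enabled by -mp
      do i = 1, n                  ! independent iterations, so
        a(i) = 2.0d0*b(i) + 3.0d0  ! -Mconcur can also parallelize
      end do                       ! this loop automatically
      print *, a(1), a(n)
    end program smp_demo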


Introduction to the Cray compiler
- Examples
  - GTC
  - Overflow
  - PARQUET


Cray has a long tradition of high performance compilers
- Vectorization
- Parallelization
- Code transformation
- More
Began an internal investigation leveraging an open source compiler called LLVM
- Initial results and progress better than expected
- Decided to move forward with a Cray X86 compiler
- 7.0 released in December 2008
- 7.1 will be released Q2 2009



[Block diagram: Cray compiler architecture]
- Fortran Source -> Fortran Front End; C and C++ Source -> C & C++ Front End
  (the C and C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support)
- Both front ends feed Interprocedural Analysis, then Optimization and Parallelization (Cray Inc. compiler technology)
- Back ends: X86 Code Generator (X86 code generation from open source LLVM, with additional Cray-developed optimizations and interface support) and Cray X2 Code Generator, producing the Object File

- Make sure it is available: module avail PrgEnv-cray
- To access the Cray compiler: module load PrgEnv-cray
- To target the Barcelona chip: module load xtpe-quadcore
- Once you have loaded the module, cc and ftn are the Cray compilers
- Recommend just using default options
- Use -rm (Fortran) and -h list=m (C) to find out what happened
- man crayftn
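Putting those steps together, an illustrative session (the source file name is hypothetical):

    module load PrgEnv-cray
    module load xtpe-quadcore
    ftn -rm -o mycode mycode.f90   # -rm emits a listing showing what the compiler did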



- Excellent vectorization
  - Vectorizes more loops than other compilers
- OpenMP
  - 2.0 standard
  - Nesting
- PGAS: functional UPC and CAF available today
- Excellent cache optimizations
  - Automatic blocking
  - Automatic management of what stays in cache
  - Prefetching, interchange, fusion, and much more

- C++ support
- Automatic parallelization
  - Modernized version of the Cray X1 streaming capability
  - Interacts with OMP directives
- OpenMP 3.0
- Optimized PGAS
  - Will require the Gemini network to really go fast
- Improved vectorization
- Improved cache optimizations



Plasma fusion simulation (GTC)
- 3D particle-in-cell (PIC) code in toroidal geometry
- Developed by Prof. Zhihong Lin (now at UC Irvine)
- Code has several different characteristics
  - Stride-1 copies
  - Strided memory operations
  - Computationally intensive
  - Gather/scatter
  - Sorting and packing
- The main routine is known as the "pusher"


The main pusher kernel consists of 2 main loop nests. The first loop nest contains groups of 4 statements which include significant indirect addressing:

    e1 = e1 + wp0*wt00*(wz0*gradphi(1,0,ij) + wz1*gradphi(1,1,ij))
    e2 = e2 + wp0*wt00*(wz0*gradphi(2,0,ij) + wz1*gradphi(2,1,ij))
    e3 = e3 + wp0*wt00*(wz0*gradphi(3,0,ij) + wz1*gradphi(3,1,ij))
    e4 = e4 + wp0*wt00*(wz0*phit(0,ij)      + wz1*phit(1,ij))

Turn the 4 statements into 1 short vector loop (with gradphi and phit packed into a temporary array tempphi):

    ev(1:4) = ev(1:4) + wp0*wt00*(wz0*tempphi(1:4,0,ij) + wz1*tempphi(1:4,1,ij))

The second loop is large and computationally intensive, but contains strided loads and a computed gather; CCE automatically vectorizes the loop.

GTC Pusher performance - 3200 MPI ranks and 4 OMP threads
[Bar chart: Billion Particles Pushed/Sec (5.0 to 40.0); series: CCE vs. Previous Best]

GTC performance - 3200 MPI ranks and 4 OMP threads
[Bar chart: Billion Particles Pushed/Sec (2.0 to 16.0); series: CCE vs. Previous Best]

Overflow is a NASA-developed Navier-Stokes flow solver for unstructured grids
- Subroutines consist of two or three simply nested loops
- Inner loops tend to be highly vectorized and have 20-50 Fortran statements
- MPI is used for parallel processing
  - The solver automatically splits grid blocks for load balancing
  - Scaling is limited due to load balancing at >1024 cores
- The code is threaded at a high level via OpenMP


Overflow Scaling
[Plot: Time in Seconds (256 to 4096) vs. Number of Cores (256 to 8192); series: Previous MPI, CCE MPI, CCE OMP 2thr, CCE OMP 4thr]

PARQUET
- Materials science code
- Scales to 1000s of MPI ranks before it runs out of parallelism
- Want to use shared memory parallelism across the entire node
- The main kernel consists of 4 independent zgemms
- Want to use multi-level OMP to scale across the node


    !$omp parallel do
    do i = 1, 4
      call complex_matmul()   ! 4 independent complex matrix multiplies
    end do

    subroutine complex_matmul()
      ! Low-level OMP: parallelize over column blocks of width nb
      !$omp parallel do private(j,jend,jsize) !num_threads(p2)
      do j = 1, n, nb
        jend  = min(n, j+nb-1)
        jsize = jend - j + 1
        call zgemm(transA, transB, m, jsize, k, &
                   alpha, A, ldA, B(j,1), ldB, beta, C(1,j), ldC)
      end do
    end subroutine complex_matmul

ZGEMM 1000x1000
[Bar chart: GFlops (0 to 80) by parallel method and N threads at each level: Serial ZGEMM; High-level OMP ZGEMM 4x1; Nested OMP ZGEMM 3x3; Nested OMP ZGEMM 4x2; Nested OMP ZGEMM 2x4; Low-level OMP ZGEMM 1x8]

ZGEMM 100x100
[Bar chart: GFlops (0 to 35) by parallel method and N threads at each level: Serial ZGEMM; High-level OMP ZGEMM 4x1; Nested OMP ZGEMM 3x3; Nested OMP ZGEMM 4x2; Low-level ZGEMM 1x8]

The Cray Compiling Environment is a new, different, and interesting compiler with several unique capabilities
- Several codes are already taking advantage of CCE
- Development is ongoing
- Consider trying CCE if you think you could take advantage of its capabilities
