Jeff Beisheim, ANSYS, Inc.
HPC Defined

High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.

It is a hardware and software initiative!
Need for Speed

- Assemblies
- CAD to mesh
- Capture fidelity
A History of HPC Performance

- 2012: GPU acceleration (multiple GPUs; DMP)
- 2010: GPU acceleration (single GPU; SMP)
- 2007-2009: Optimized for multicore processors; teraflop performance at 512 cores
- 2005-2007: Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
- 2004: 1st company to solve 100M structural DOF
- 1999-2000: 64-bit large memory addressing
- 1994: Iterative PCG solver introduced for large analyses
- 1990: Shared Memory Multiprocessing (SMP) available
- 1980s: Vector processing on mainframes
HPC Revolution

Recent advancements have revolutionized the computational speed available on the desktop:

- Multicore processors (every core is really an independent processor)
- Large amounts of RAM and SSDs
- GPUs
Parallel Processing: Hardware

2 types of memory systems:

- Shared memory parallel (SMP): single box, workstation/server
- Distributed memory parallel (DMP): multiple boxes, cluster

[Diagram: a single workstation (SMP) vs. a cluster of machines (DMP)]
Parallel Processing: Software

2 types of parallel processing for Mechanical APDL:

- Shared memory parallel (-np > 1)
  - First available in v4.3
  - Can only be used on a single machine
- Distributed memory parallel (-dis -np > 1)
  - First available in v6.0 with the DDS solver
  - Can be used on a single machine or a cluster
- GPU acceleration (-acc)
  - First available in v13.0 using NVIDIA GPUs
  - Supports using either a single GPU or multiple GPUs
  - Can be used on a single machine or a cluster

The corresponding launch lines are sketched below.
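As a rough sketch, the three modes map onto command-line options like these (the executable name ansys145 and the file names model.dat/model.out are placeholders; exact names vary by release and install):

    # Shared memory parallel (SMP): 4 cores on one machine
    ansys145 -b -np 4 -i model.dat -o model.out

    # Distributed memory parallel (DMP): 4 cores on one machine
    ansys145 -b -dis -np 4 -i model.dat -o model.out

    # DMP with NVIDIA GPU acceleration added
    ansys145 -b -dis -np 4 -acc nvidia -i model.dat -o model.out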
Distributed ANSYS: Design Requirements

- No limitation in simulation capability
  - Must support all features; continually working to add more functionality with each release
- Reproducible and consistent results
  - Same answers achieved using 1 core or 100 cores
  - Same quality checks and testing are done as with the SMP version
  - Uses the same code base as the SMP version of ANSYS
- Support all major platforms
  - Most widely used processors, operating systems, and interconnects
  - Supports the same platforms that the SMP version supports
  - Uses the latest versions of MPI software, which support the latest interconnects
Distributed ANSYS: Design

Distributed steps (-dis -np N):

- At the start of the first load step, decompose the FEA model into N pieces (domains)
- Each domain goes to a different core to be solved
- The solution is not independent!
  - Lots of communication is required to achieve the solution
  - Lots of synchronization is required to keep all processes together

A cluster launch is sketched after this list.
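On a cluster, the total core count implied by the machine list sets N. A minimal sketch (the host names host1/host2 are hypothetical; the -machines option follows the documented host:cores pattern):

    # 8 domains: 4 cores on host1, 4 cores on host2
    ansys145 -b -dis -machines host1:4:host2:4 -i model.dat -o model.out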
Distributed ANSYS: Capabilities

- Static linear or nonlinear analyses
- Buckling analyses
- Modal analyses
- Harmonic response analyses using the FULL method
- Transient response analyses using the FULL method
- Single-field structural and thermal analyses
- Low-frequency electromagnetic analyses
- High-frequency electromagnetic analyses
- Coupled-field analyses
- All widely used element types and materials
- Superelements (use pass)
- NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, Linear Perturbation
- Multiframe restarts
- Cyclic symmetry analyses
- User Programmable Features (UPFs)
Distributed ANSYS: Equation Solvers

- Sparse direct solver (default)
  - Supports SMP, DMP, and GPU acceleration
  - Can handle all analysis types and options
  - Foundation for the Block Lanczos, Unsymmetric, Damped, and QR-damped eigensolvers
- PCG iterative solver
  - Supports SMP, DMP, and GPU acceleration
  - Symmetric, real-valued matrices only (i.e., static/full transient)
  - Foundation for the PCG Lanczos eigensolver
- JCG/ICCG iterative solvers
  - Support SMP only

Solver selection is sketched below.
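In Mechanical APDL the equation solver is picked with the EQSLV command. A minimal sketch (the PCG tolerance shown is the documented default, listed here only for illustration):

    ! Select the sparse direct solver (the default for most analyses)
    EQSLV,SPARSE

    ! Or select the PCG iterative solver with a 1.0E-8 convergence tolerance
    EQSLV,PCG,1.0E-8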
Distributed ANSYS: Eigensolvers

- Block Lanczos eigensolver (including QR damp)
  - Supports SMP and GPU acceleration
- PCG Lanczos eigensolver
  - Supports SMP, DMP, and GPU acceleration
  - Great for large models (>5M DOF) with relatively few modes (<50)
- Supernode eigensolver
  - Supports SMP only
  - Optimal choice when requesting hundreds or thousands of modes
- Subspace eigensolver
  - Supports SMP, DMP, and GPU acceleration
  - Currently only supports buckling analyses; beta for modal in R14.5
- Unsymmetric/Damped eigensolvers
  - Support SMP, DMP, and GPU acceleration

Eigensolver selection is sketched below.
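These eigensolvers are chosen with the MODOPT command in a modal analysis. A minimal sketch (the mode counts are arbitrary examples):

    ANTYPE,MODAL

    ! Block Lanczos: extract the first 10 modes
    MODOPT,LANB,10

    ! Or PCG Lanczos: large models, relatively few modes
    MODOPT,LANPCG,20

    ! Or Supernode: hundreds or thousands of modes
    MODOPT,SNODE,500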
Distributed ANSYS: Benefits

- Better architecture
  - More computations performed in parallel, so faster solution time
- Better speedups than SMP
  - Can achieve >10x on 16 cores (try getting that with SMP!)
  - Can be used for jobs running on 1000+ CPU cores
- Can take advantage of resources on multiple machines
  - Memory usage and bandwidth scale
  - Disk (I/O) usage scales
  - A whole new class of problems can be solved!
Distributed ANSYS: Performance

Need fast interconnects to feed fast processors.

- Two main characteristics for each interconnect: latency and bandwidth
- Distributed ANSYS is highly bandwidth bound

    +--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+

     Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

     Total number of cores available    : 32
     Number of physical cores available : 32
     Number of cores requested          : 4 (Distributed Memory Parallel)
     MPI Type: INTELMPI

     Core  Machine Name   Working Directory
     ---------------------------------------------------
       0   hpclnxsmc00    /data1/ansyswork
       1   hpclnxsmc00    /data1/ansyswork
       2   hpclnxsmc01    /data1/ansyswork
       3   hpclnxsmc01    /data1/ansyswork

     Latency time from master to core 1 = 1.171 microseconds
     Latency time from master to core 2 = 2.251 microseconds
     Latency time from master to core 3 = 2.225 microseconds

     Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
     Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
     Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)
Distributed ANSYS: Performance

Need fast interconnects to feed fast processors.

[Chart: Interconnect Performance. Rating (runs/day) for Gigabit Ethernet vs. DDR InfiniBand.]
Distributed ANSYS: Performance

Need fast hard drives to feed fast processors.

- Check the bandwidth specs
  - ANSYS Mechanical can be highly I/O bandwidth bound
  - The sparse solver in the out-of-core memory mode does lots of I/O
- Distributed ANSYS can be highly I/O latency bound
  - Seek time to read/write each set of files causes overhead
- Consider SSDs
  - High bandwidth and extremely low seek times

One way to cut the sparse solver's I/O is sketched below.
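If enough RAM is available, the sparse solver's I/O can be avoided altogether by forcing the in-core memory mode. A minimal sketch using the documented memory-mode commands (availability of DSPOPTION depends on release):

    ! Shared-memory sparse solver: keep the factored matrix in memory
    BCSOPTION,,INCORE

    ! Distributed-memory sparse solver equivalent
    DSPOPTION,,INCORE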
Distributed ANSYS: Performance

Need fast hard drives to feed fast processors.

[Chart: Hard Drive Performance. Rating (runs/day) for HDD vs. SSD at 1, 2, 4, and 8 cores.]

Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Vol. IV, Issue 1, p. 37, 2010.
Distributed ANSYS: Performance

Avoid waiting for I/O to complete! Check to see if the job is I/O bound or compute bound:

- Check the output file for CPU and Elapsed times
- When Elapsed time >> main thread CPU time, the job is I/O bound:

    Total CPU time for main thread             :      167.8 seconds
    Elapsed Time (sec) =       388.000     Date = 08/21/2012

- Consider adding more RAM or a faster hard drive configuration

A quick way to pull these times is shown below.
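A quick check from the command line (model.out is a placeholder output file name):

    # Compare the main-thread CPU time against the elapsed (wall) time
    grep -i "Total CPU time for main thread" model.out
    grep -i "Elapsed Time" model.out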
Distributed ANSYS: Performance

| Release | Thermal (full model), 3M DOF | Thermomechanical simulation (full model), 7.8M DOF | Interpolation of boundary conditions | Submodel: creep strain analysis, 5.5M DOF | Total time |
|---|---|---|---|---|---|
| ANSYS 11.0 | 4 hours, 8 cores | ~5.5 days, 163 iterations, 8 cores | 37 hours, 16 load steps | ~5.5 days, 492 iterations, 18 cores | 2 weeks |
| ANSYS 12.0 | 4 hours, 8 cores | 34.3 hours, 164 iterations, 20 cores | 37 hours, 16 load steps | 38.5 hours, 492 iterations, 16 cores | 5 days |
| ANSYS 12.1 | 4 hours, 8 cores | 12.5 hours, 195 iterations, 64 cores | 37 hours, 16 load steps | 8.5 hours, 492 iterations, 76 cores | 2 days |
| ANSYS 13.0 SP2 | 4 hours, 8 cores | 9.9 hours, 195 iterations, 64 cores | 0.2 hour (improved algorithm) | 6.1 hours, 488 iterations, 128 cores | 1 day |
| ANSYS 14.0 | 1 hour, 8 cores + 1 GPU (0.8 hour on 32 cores) | 7.5 hours, 195 iterations, 128 cores | 0.2 hour, 16 load steps | 5.9 hours, 498 iterations, 64 cores + 8 GPUs (4.2 hours on 256 cores) | 0.5 day |

Results courtesy of MicroConsult Engineering, GmbH.
Distributed ANSYS: Performance

Minimum time to solution is more important than scaling.

[Chart: Solution Scalability. Speedup vs. number of cores (16 to 64).]
Distributed ANSYS: Performance

Minimum time to solution is more important than scaling.

[Chart: Solution Scalability. Solution elapsed time vs. number of cores (16 to 64); elapsed time falls from 11 hrs 48 mins to 1 hr 20 mins, and then to 30 mins.]
GPU Accelerator Capability

Graphics processing units (GPUs):

- Widely used for gaming and graphics rendering
- Recently been made available as general-purpose accelerators
  - Support for double-precision computations
  - Performance exceeding the latest multicore CPUs

So how can ANSYS make use of this new technology to reduce the overall time to solution?
GPU Accelerator Capability

- Accelerates the sparse direct solver (SMP & DMP)
  - The GPU is used to factor many dense frontal matrices
  - The decision on when to send data to the GPU is made automatically
    - Frontal matrix too small: too much overhead, stays on the CPU
    - Frontal matrix too large: exceeds GPU memory, only partially accelerated
- Accelerates the PCG/JCG iterative solvers (SMP & DMP)
  - The GPU is only used for the sparse matrix-vector multiply (SpMV kernel)
  - The decision on when to send data to the GPU is made automatically
    - Model too small: too much overhead, stays on the CPU
    - Model too large: exceeds GPU memory, only partially accelerated
GPU Accelerator Capability

- Supported hardware
  - Currently supports NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards
  - Next-generation NVIDIA Tesla cards (Kepler) should work with R14.5
  - Installing a GPU requires the following:
    - Larger power supply (a single card needs ~250 W)
    - Open 2x form factor PCIe x16 2.0 (or 3.0) slot
- Supported platforms
  - Windows and Linux 64-bit platforms only
  - Does not include the Linux Itanium (IA-64) platform
GPU Accelerator Capability

Targeted hardware:

[Table: NVIDIA Tesla C2075 and other Fermi/Kepler-class cards, listing power (W), memory, memory bandwidth (GB/s), and peak SP/DP speed (GFlops); the peak SP/DP figures shown include 1030/515, 1331/665, 2290/95, 4577/190, and 5184/1728.]

The NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.
GPU Accelerator Capability

GPUs can offer significantly faster time to solution.

[Chart: GPU Performance. Relative speedup of 8 cores + 1 GPU over 8 cores (no GPU), showing speedups of 2.6x and 3.8x.]
GPU Accelerator Capability

GPUs can offer significantly faster time to solution.

[Chart: GPU Performance. Relative speedup over 2 cores (no GPU): 2.7x with 8 cores + 1 GPU, 5.2x with 16 cores + 4 GPUs.]
GPU Accelerator Capability

- Supports the majority of ANSYS users
  - Covers both the sparse direct and PCG iterative solvers
  - Only a few minor limitations
- Ease of use
  - Requires at least one supported GPU card to be installed
  - Requires at least one HPC Pack license
  - No rebuild, no additional installation steps (see the sketch below)
- Performance
  - ~10-25% reduction in time to solution when using 8 CPU cores
  - Should never slow down your simulation!
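In practice, enabling the GPU is just one extra option on an otherwise unchanged launch line (executable and file names are placeholders, as before):

    # Same distributed job, now with NVIDIA GPU acceleration
    ansys145 -b -dis -np 8 -acc nvidia -i model.dat -o model.out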
Design Optimization

How will you use all of this computing power?

- Higher fidelity
- Full assemblies
- More nonlinear
- Design optimization studies
HPC Licensing

ANSYS HPC Packs enable high-fidelity insight:

- Each simulation consumes one or more packs
- Parallel enabled increases quickly with added packs (cores = 8 x 4^(packs - 1))

| Packs per simulation | Parallel enabled (cores) |
|---|---|
| 1 | 8 + 1 GPU |
| 2 | 32 + 4 GPUs |
| 3 | 128 + 16 GPUs |
| 4 | 512 + 64 GPUs |
| 5 | 2048 + 256 GPUs |
HPC Parametric Pack Licensing

- Scalable, like ANSYS HPC Packs
  - Enhances the customer's ability to include many design points as part of a single study
  - Ensures sound product decision making
- Amplifies the complete workflow
  - Design points can include execution of multiple products (pre, solve, HPC, post)
  - Packaged to encourage adoption of the path to robust design!

[Chart: number of simultaneous design points enabled (4, 8, 16, 32, 64) vs. number of HPC Parametric Pack licenses.]
HPC Revolution

- HDD vs. SSDs
- SMP vs. DMP
- GPUs
- Interconnects