Simulations using ANSYS
Jeff Beisheim
ANSYS, Inc.
HPC Defined

High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.

It is a hardware and software initiative!
Need for Speed

- Assemblies
- CAD to mesh
- Capture fidelity
- Impact product design
- Enable large models
- Allow parametric studies
- Modal
- Nonlinear
- Multiphysics
- Dynamics
A History of HPC Performance

- 2012: GPU acceleration (multiple GPUs; DMP)
- 2010: GPU acceleration (single GPU; SMP)
- 2007–2009: Optimized for multicore processors; teraflop performance at 512 cores
- 2005–2007: Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
- 2000–2004: First company to solve 100M structural DOF
- 1999–2000: 64-bit large memory addressing
- 1990–1994: Iterative PCG solver introduced for large analyses
- 1980–1990: Shared Memory Multiprocessing (SMP) available
- 1980s: Vector processing on mainframes
HPC Revolution

Recent advancements have revolutionized the computational speed available on the desktop:
- Multicore processors (every core is really an independent processor)
- Large amounts of RAM and SSDs
- GPUs
Parallel Processing: Hardware

2 types of memory systems:
- Shared memory parallel (SMP): a single box (workstation/server)
- Distributed memory parallel (DMP): multiple boxes (a cluster)
Parallel Processing: Software

2 types of parallel processing for Mechanical APDL:
- Shared memory parallel (-np > 1)
  - First available in v4.3
  - Can only be used on a single machine
- Distributed memory parallel (-dis -np > 1)
  - First available in v6.0 with the DDS solver
  - Can be used on a single machine or a cluster

GPU acceleration (-acc) can supplement either mode:
- First available in v13.0 using NVIDIA GPUs
- Supports using either a single GPU or multiple GPUs
- Can be used on a single machine or a cluster
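The three modes above map to flags on the Mechanical APDL launch command. A small sketch assembling those argument lists; the executable name `ansys145` and the exact flag spellings are assumptions based on the flags quoted above, so verify against your installed release:

```python
def mapdl_args(mode, ncores=2, exe="ansys145"):
    """Build a Mechanical APDL launch command for each parallel mode.

    Flag spellings (-np, -dis, -acc) follow the text above; the
    executable name is hypothetical and release-dependent."""
    if mode == "smp":
        return [exe, "-np", str(ncores)]            # shared memory parallel
    if mode == "dmp":
        return [exe, "-dis", "-np", str(ncores)]    # distributed memory parallel
    if mode == "gpu":
        return [exe, "-np", str(ncores), "-acc", "nvidia"]  # GPU acceleration
    raise ValueError(f"unknown mode: {mode}")

print(mapdl_args("dmp", 8))  # ['ansys145', '-dis', '-np', '8']
```

The same lists can be handed to `subprocess.run` on a workstation, or embedded in a cluster job script.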
Distributed ANSYS Design Requirements

No limitation in simulation capability:
- Must support all features
- Continually working to add more functionality with each release

Reproducible and consistent results:
- Same answers achieved using 1 core or 100 cores
- Same quality checks and testing are done as with the SMP version
- Uses the same code base as the SMP version of ANSYS

Support all major platforms:
- Most widely used processors, operating systems, and interconnects
- Supports the same platforms that the SMP version supports
- Uses the latest versions of MPI software, which support the latest interconnects
Distributed ANSYS Design

Distributed steps (-dis -np N):
- At the start of the first load step, decompose the FEA model into N pieces (domains)
- Each domain goes to a different core to be solved
- The solution is not independent!
  - Lots of communication is required to achieve the solution
  - Lots of synchronization is required to keep all processes together
- Each process writes its own sets of files (file0*, file1*, file2*, ..., file[N-1]*)
- Results are automatically combined at the end of the solution
  - Facilitates postprocessing in /POST1, /POST26, or Workbench
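The decomposition and per-process file naming described above can be pictured with a toy sketch (hypothetical helper names, plain Python, not ANSYS code; the real decomposition uses a graph partitioner such as METIS to balance work and minimize domain interfaces):

```python
def decompose(element_ids, n_domains):
    """Split element IDs into roughly equal contiguous domains.

    Toy stand-in for the partitioning Distributed ANSYS performs at
    the start of the first load step; one domain goes to each core."""
    size, rem = divmod(len(element_ids), n_domains)
    domains, start = [], 0
    for i in range(n_domains):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        domains.append(element_ids[start:end])
        start = end
    return domains

def file_sets(jobname, n_domains):
    """Per-process file prefixes: file0*, file1*, ..., file[N-1]*."""
    return [f"{jobname}{i}" for i in range(n_domains)]

domains = decompose(list(range(1, 101)), 4)  # 100 elements into 4 domains
prefixes = file_sets("file", 4)              # ['file0', 'file1', 'file2', 'file3']
```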
Distributed ANSYS Capabilities

A wide variety of features & analysis capabilities are supported:
- Static linear or nonlinear analyses
- Buckling analyses
- Modal analyses
- Harmonic response analyses using the FULL method
- Transient response analyses using the FULL method
- Single-field structural and thermal analyses
- Low-frequency electromagnetic analyses
- High-frequency electromagnetic analyses
- Coupled-field analyses
- All widely used element types and materials
- Superelements (use pass)
- NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, Linear Perturbation
- Multiframe restarts
- Cyclic symmetry analyses
- User Programmable Features (UPFs)
Distributed ANSYS Equation Solvers

Sparse direct solver (default):
- Supports SMP, DMP, and GPU acceleration
- Can handle all analysis types and options
- Foundation for the Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers

PCG iterative solver:
- Supports SMP, DMP, and GPU acceleration
- Symmetric, real-valued matrices only (i.e., static/full transient)
- Foundation for the PCG Lanczos eigensolver

JCG/ICCG iterative solvers:
- Support SMP only
Distributed ANSYS Eigensolvers

Block Lanczos eigensolver (including QR damp):
- Supports SMP and GPU acceleration

PCG Lanczos eigensolver:
- Supports SMP, DMP, and GPU acceleration
- Great for large models (>5M DOF) with relatively few modes (<50)

Supernode eigensolver:
- Supports SMP only
- Optimal choice when requesting hundreds or thousands of modes

Subspace eigensolver:
- Supports SMP, DMP, and GPU acceleration
- Currently only supports buckling analyses; beta for modal in R14.5

Unsymmetric/Damped eigensolvers:
- Support SMP, DMP, and GPU acceleration
Distributed ANSYS Benefits

Better architecture:
- More computations performed in parallel, so faster solution time

Better speedups than SMP:
- Can achieve >10x on 16 cores (try getting that with SMP!)
- Can be used for jobs running on 1000+ CPU cores

Can take advantage of resources on multiple machines:
- Memory usage and bandwidth scale
- Disk (I/O) usage scales

A whole new class of problems can be solved!
Distributed ANSYS Performance

Need fast interconnects to feed fast processors:
- Two main characteristics for each interconnect: latency and bandwidth
- Distributed ANSYS is highly bandwidth bound

Sample interconnect statistics from a 4-process run spanning two nodes:

    +------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S -------+

    Release: 14.5            Build: UP20120802
    Date Run: 08/09/2012     Time: 23:07

    Total number of cores available     : 32
    Number of physical cores available  : 32
    Number of processes requested       : 4 (Distributed Memory Parallel)

    Core   Machine Name   Working Directory
    ---------------------------------------
     0     hpclnxsmc00    /data1/ansyswork
     1     hpclnxsmc00    /data1/ansyswork
     2     hpclnxsmc01    /data1/ansyswork
     3     hpclnxsmc01    /data1/ansyswork

    Latency time from master to core 1 = 1.171 microseconds
    Latency time from master to core 2 = 2.251 microseconds
    Latency time from master to core 3 = 2.225 microseconds

    Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
    Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
    Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)
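A back-of-the-envelope model (my sketch, not an ANSYS formula) shows why bandwidth, not latency, dominates for the large messages Distributed ANSYS exchanges: transfer time is roughly latency plus payload size divided by bandwidth.

```python
def transfer_time_us(msg_bytes, latency_us, bandwidth_mb_s):
    """Estimate one-way message transfer time in microseconds.

    MB/sec is the same as bytes per microsecond, so the payload term
    is simply bytes divided by bandwidth."""
    return latency_us + msg_bytes / bandwidth_mb_s

# Numbers from the QDR InfiniBand link measured above:
small = transfer_time_us(1_000, 2.251, 3011.09)        # ~2.6 us: latency-dominated
large = transfer_time_us(100_000_000, 2.251, 3011.09)  # ~33 ms: bandwidth-dominated
```

For a 100 MB message the latency term contributes well under 0.01% of the total, which is why the slides call Distributed ANSYS "highly bandwidth bound".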
Distributed ANSYS Performance

Need fast interconnects to feed fast processors

[Chart: interconnect performance, rating in runs/day (0-60), Gigabit Ethernet vs. DDR InfiniBand as core count grows]

Benchmark: turbine model, 2.1 million DOF, SOLID187 elements, nonlinear static analysis, sparse solver (DMP), Linux cluster (8 cores per node)
Distributed ANSYS Performance

Need fast hard drives to feed fast processors; check the bandwidth specs:
- ANSYS Mechanical can be highly I/O bandwidth bound
  - The sparse solver in the out-of-core memory mode does lots of I/O
- Distributed ANSYS can be highly I/O latency bound
  - Seek time to read/write each set of files causes overhead
- Consider SSDs
  - High bandwidth and extremely low seek times
- Consider RAID configurations
  - RAID 0 for speed
  - RAID 1, 5 for redundancy
  - RAID 10 for speed and redundancy
Distributed ANSYS Performance

Need fast hard drives to feed fast processors

[Chart: hard drive performance, rating in runs/day (0-30), HDD vs. SSD at 1, 2, 4, and 8 cores]

Benchmark: 8 million DOF, linear static analysis, sparse solver (DMP), Dell T5500 workstation (12 Intel Xeon X5675 cores, 48 GB RAM, single 7,200 rpm HDD, single SSD, Win7)

Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, pp. 37, 2010.
Distributed ANSYS Performance

Avoid waiting for I/O to complete! Check to see if the job is I/O bound or compute bound by comparing the CPU and elapsed times in the output file.

When Elapsed time >> main thread CPU time, the job is I/O bound:

    Total CPU time for main thread        =      167.8 seconds
    . . .
    Elapsed Time (sec) =       388.000    Date = 08/21/2012

- Consider adding more RAM or a faster hard drive configuration

When Elapsed time is roughly equal to the main thread CPU time, the job is compute bound:
- Consider moving the simulation to a machine with faster processors
- Consider using Distributed ANSYS (DMP) instead of SMP
- Consider running on more cores or possibly using GPU(s)
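The check is easy to automate. A sketch that scrapes the two timing lines from an output file and flags an I/O-bound run; the string patterns are taken from the excerpt above and may vary between releases, and the 2x threshold is my own rough cutoff, not an ANSYS rule:

```python
import re

def io_bound(output_text, ratio_threshold=2.0):
    """Return True if the job looks I/O bound, i.e. elapsed time is
    much larger than the main thread's CPU time."""
    cpu = re.search(r"Total CPU time for main thread\s*=\s*([\d.]+)", output_text)
    elapsed = re.search(r"Elapsed Time \(sec\)\s*=\s*([\d.]+)", output_text)
    if not (cpu and elapsed):
        raise ValueError("timing lines not found in output")
    return float(elapsed.group(1)) / float(cpu.group(1)) > ratio_threshold

sample = """
Total CPU time for main thread        =      167.8 seconds
Elapsed Time (sec) =       388.000    Date = 08/21/2012
"""
print(io_bound(sample))  # True: 388 / 167.8 is about 2.3
```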
Distributed ANSYS Performance

Production thermomechanical workflow timed across releases (all runs with the sparse solver):

- Thermal analysis, full model (3M DOF): 4 hours at ANSYS 11.0, 12.0, 12.1, and 13.0 SP2; 1 hour at 14.0 (8 cores + 1 GPU); 0.8 hour at 14.0 (32 cores)
- Thermomechanical simulation, full model (7.8M DOF): ~5.5 days (11.0) → 34.3 hours (12.0) → 12.5 hours (12.1) → 9.9 hours (13.0 SP2) → 7.5 hours (14.0); 163–195 iterations on 20–128 cores
- Interpolation of boundary conditions: 37 hours (11.0, 12.0, 12.1) → 0.2 hour (13.0 SP2 and 14.0, improved algorithm); 16 load steps throughout
- Submodel creep strain analysis (5.5M DOF): ~5.5 days (11.0, 18 cores) → 38.5 hours (12.0, 16 cores) → 8.5 hours (12.1, 76 cores) → 6.1 hours (13.0 SP2, 128 cores) → 5.9 hours (14.0, 64 cores + 8 GPUs) → 4.2 hours (14.0, 256 cores); 488–498 iterations
- Total time: ~2 weeks (11.0) → 5 days (12.0) → 2 days (12.1) → 1 day (13.0 SP2) → 0.5 day (14.0)

Hardware, 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node
Hardware, 12.1 + 13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node
ANSYS 12.0 to 14.0 runs with DDR InfiniBand interconnect
ANSYS 14.0 creep runs with NROPT,CRPL + DDOPT,METIS
Results courtesy of MicroConsult Engineering, GmbH
Distributed ANSYS Performance

Minimum time to solution is more important than scaling

[Chart: solution scalability, speedup (0-25) vs. number of cores (0-64)]

Benchmark: turbine model, 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node)
Distributed ANSYS Performance

Minimum time to solution is more important than scaling

[Chart: solution elapsed time (0-45,000 sec) vs. number of cores (0-64): 11 hrs 48 mins on a few cores, down to 1 hr 20 mins and finally 30 mins at the highest core counts]

Benchmark: turbine model, 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node)
GPU Accelerator Capability

Graphics processing units (GPUs):
- Widely used for gaming and graphics rendering
- Recently been made available as general-purpose accelerators
- Support for double precision computations
- Performance exceeding the latest multicore CPUs

So how can ANSYS make use of this new technology to reduce the overall time to solution?
GPU Accelerator Capability

Accelerate the sparse direct solver (SMP & DMP):
- The GPU is used to factor many dense frontal matrices
- The decision on when to send data to the GPU is made automatically
  - Frontal matrix too small: too much overhead, stays on the CPU
  - Frontal matrix too large: exceeds GPU memory, only partially accelerated

Accelerate the PCG/JCG iterative solvers (SMP & DMP):
- The GPU is only used for the sparse matrix-vector multiply (SpMV kernel)
- The decision on when to send data to the GPU is made automatically
  - Model too small: too much overhead, stays on the CPU
  - Model too large: exceeds GPU memory, only partially accelerated
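For reference, the SpMV kernel that gets offloaded is conceptually simple. A minimal pure-Python sketch of a matrix-vector multiply in CSR (compressed sparse row) storage; a real GPU implementation runs the row loop across thousands of threads:

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A in CSR format.

    This is the kernel the PCG solver spends most of its time in,
    and the part the GPU accelerates."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):                              # one output entry per row
        for k in range(row_ptr[i], row_ptr[i + 1]):  # nonzeros of row i
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x2 example: A = [[4, 1], [0, 3]], x = [1, 2]
y = csr_spmv([0, 2, 3], [0, 1, 1], [4.0, 1.0, 3.0], [1.0, 2.0])
print(y)  # [6.0, 6.0]
```

Because each output row is independent, the kernel parallelizes naturally; its low arithmetic intensity is also why GPU memory bandwidth (see the hardware table later) matters so much for PCG performance.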
GPU Accelerator Capability

Supported hardware:
- Currently supports NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards
- Next-generation NVIDIA Tesla cards (Kepler) should work with R14.5

Installing a GPU requires the following:
- A larger power supply (a single card needs ~250 W)
- An open 2x form factor PCIe x16 2.0 (or 3.0) slot

Supported platforms:
- Windows and Linux 64-bit platforms only
- Does not include the Linux Itanium (IA-64) platform
GPU Accelerator Capability

Targeted hardware:

                         Tesla     Tesla     Quadro    Quadro    Tesla     Tesla
                         C2075     M2090     6000      K5000     K10       K20
    Power (W)            225       250       225       122       250       250
    Memory               6 GB      6 GB      6 GB      4 GB      8 GB      6 to 24 GB
    Memory bandwidth
      (GB/s)             144       177.4     144       173       320       288
    Peak speed SP/DP
      (GFlops)           1030/515  1331/665  1030/515  2290/95   4577/190  5184/1728

The NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.
GPU Accelerator Capability

GPUs can offer significantly faster time to solution

[Chart: GPU performance, speedup relative to 2 cores (no GPU): 2.6x at 8 cores (no GPU), 3.8x at 8 cores (1 GPU)]

Benchmark: 6.5 million DOF, linear static analysis, sparse solver (DMP); 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7
GPU Accelerator Capability

GPUs can offer significantly faster time to solution

[Chart: GPU performance, speedup relative to 2 cores (no GPU): 2.7x at 8 cores (1 GPU), 5.2x at 16 cores (4 GPUs)]

Benchmark: 11.8 million DOF, linear static analysis, PCG solver (DMP); 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7
GPU Accelerator Capability

Supports the majority of ANSYS users:
- Covers both the sparse direct and PCG iterative solvers
- Only a few minor limitations

Ease of use:
- Requires at least one supported GPU card to be installed
- Requires at least one HPC Pack license
- No rebuild, no additional installation steps

Performance:
- ~10–25% reduction in time to solution when using 8 CPU cores
- Should never slow down your simulation!
Design Optimization

How will you use all of this computing power?
- Higher fidelity
- Full assemblies
- Design optimization studies
- More nonlinear
HPC Licensing

ANSYS HPC Packs enable high-fidelity insight:
- Each simulation consumes one or more packs
- Parallel capability enabled increases quickly with added packs
- Single solution for all physics and any level of fidelity
- Flexibility as your HPC resources grow: reallocate packs as resources allow

Parallel enabled per simulation vs. packs consumed:

    Packs per simulation    Parallel enabled
    1                       8 cores + 1 GPU
    2                       32 cores + 4 GPUs
    3                       128 cores + 16 GPUs
    4                       512 cores + 64 GPUs
    5                       2048 cores + 256 GPUs
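The chart follows a clean 4x-per-pack progression, which can be captured in a one-line formula; this is my reading of the chart's numbers, not an official licensing formula, so check current ANSYS licensing terms before relying on it:

```python
def parallel_enabled(packs):
    """Cores and GPUs enabled per simulation by ANSYS HPC Packs,
    per the 4x-per-pack pattern in the chart above (an inferred
    pattern, not an official formula)."""
    cores = 2 * 4 ** packs        # 8, 32, 128, 512, 2048 for packs 1..5
    gpus = 4 ** (packs - 1)       # 1, 4, 16, 64, 256 for packs 1..5
    return cores, gpus

print([parallel_enabled(p) for p in range(1, 6)])
# [(8, 1), (32, 4), (128, 16), (512, 64), (2048, 256)]
```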
HPC Parametric Pack Licensing

- Scalable, like ANSYS HPC Packs
- Enhances the customer's ability to include many design points as part of a single study
  - Ensures sound product decision making
- Amplifies the complete workflow
  - Design points can include execution of multiple products (pre, solve, HPC, post)
- Packaged to encourage adoption of the path to robust design!

[Chart: number of simultaneous design points enabled (4, 8, 16, 32, 64) vs. number of HPC Parametric Pack licenses]
HPC Revolution

The right combination of algorithms and hardware leads to maximum efficiency:
- HDD vs. SSDs
- SMP vs. DMP
- GPUs
- Clusters
- Interconnects
HPC Revolution

Every computer today is a parallel computer.
Every simulation in ANSYS can benefit from parallel processing.