Jeff Beisheim, ANSYS, Inc.
HPC Defined

High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.

It is a hardware and software initiative!
Need for Speed

- Assemblies
- CAD to mesh
- Capture fidelity
A History of HPC Performance

- 2012: GPU acceleration (multiple GPUs; DMP)
- 2010: GPU acceleration (single GPU; SMP)
- 2007-2009: Optimized for multicore processors; teraflop performance at 512 cores
- 2005-2007: Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
- 2004: 1st company to solve 100M structural DOF
- 1999-2000: 64-bit large memory addressing
- 1994: Iterative PCG solver introduced for large analyses
- 1990: Shared Memory Multiprocessing (SMP) available
- 1980s: Vector processing on mainframes
HPC Revolution

Recent advancements have revolutionized the computational speed available on the desktop:

- Multicore processors (every core is really an independent processor)
- Large amounts of RAM and SSDs
- GPUs
Parallel Processing: Hardware

2 types of memory systems:

- Shared memory parallel (SMP): single box, workstation/server
- Distributed memory parallel (DMP): multiple boxes, cluster

[Diagram: a single workstation (SMP) vs. a cluster of machines (DMP)]
Parallel Processing: Software

2 types of parallel processing for Mechanical APDL:

- Shared memory parallel (-np > 1)
  - First available in v4.3
  - Can only be used on a single machine
- Distributed memory parallel (-dis -np > 1)
  - First available in v6.0 with the DDS solver
  - Can be used on a single machine or a cluster
- GPU acceleration (-acc)
  - First available in v13.0 using NVIDIA GPUs
  - Supports using either a single GPU or multiple GPUs
  - Can be used on a single machine or a cluster

The corresponding launch lines are sketched below.
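As a rough sketch, the three modes map onto command-line options like these (the executable name ansys145 and the file names model.dat/model.out are placeholders; exact names vary by release and install):

    # Shared memory parallel (SMP): 4 cores on one machine
    ansys145 -b -np 4 -i model.dat -o model.out

    # Distributed memory parallel (DMP): 4 cores on one machine
    ansys145 -b -dis -np 4 -i model.dat -o model.out

    # DMP with NVIDIA GPU acceleration added
    ansys145 -b -dis -np 4 -acc nvidia -i model.dat -o model.out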
Distributed ANSYS: Design Requirements

- No limitation in simulation capability
  - Must support all features; continually working to add more functionality with each release
- Reproducible and consistent results
  - Same answers achieved using 1 core or 100 cores
  - Same quality checks and testing are done as with the SMP version
  - Uses the same code base as the SMP version of ANSYS
- Support all major platforms
  - Most widely used processors, operating systems, and interconnects
  - Supports the same platforms that the SMP version supports
  - Uses the latest versions of MPI software, which support the latest interconnects
Distributed ANSYS: Design

Distributed steps (-dis -np N):

- At the start of the first load step, decompose the FEA model into N pieces (domains)
- Each domain goes to a different core to be solved
- The solution is not independent!
  - Lots of communication is required to achieve the solution
  - Lots of synchronization is required to keep all processes together

A cluster launch is sketched after this list.
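On a cluster, the total core count implied by the machine list sets N. A minimal sketch (the host names host1/host2 are hypothetical; the -machines option follows the documented host:cores pattern):

    # 8 domains: 4 cores on host1, 4 cores on host2
    ansys145 -b -dis -machines host1:4:host2:4 -i model.dat -o model.out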
Distributed ANSYS: Capabilities

- Static linear or nonlinear analyses
- Buckling analyses
- Modal analyses
- Harmonic response analyses using the FULL method
- Transient response analyses using the FULL method
- Single-field structural and thermal analyses
- Low-frequency electromagnetic analyses
- High-frequency electromagnetic analyses
- Coupled-field analyses
- All widely used element types and materials
- Superelements (use pass)
- NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, Linear Perturbation
- Multiframe restarts
- Cyclic symmetry analyses
- User Programmable Features (UPFs)
Distributed ANSYS: Equation Solvers

- Sparse direct solver (default)
  - Supports SMP, DMP, and GPU acceleration
  - Can handle all analysis types and options
  - Foundation for the Block Lanczos, Unsymmetric, Damped, and QR-damped eigensolvers
- PCG iterative solver
  - Supports SMP, DMP, and GPU acceleration
  - Symmetric, real-valued matrices only (i.e., static/full transient)
  - Foundation for the PCG Lanczos eigensolver
- JCG/ICCG iterative solvers
  - Support SMP only

Solver selection is sketched below.
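In Mechanical APDL the equation solver is picked with the EQSLV command. A minimal sketch (the PCG tolerance shown is the documented default, listed here only for illustration):

    ! Select the sparse direct solver (the default for most analyses)
    EQSLV,SPARSE

    ! Or select the PCG iterative solver with a 1.0E-8 convergence tolerance
    EQSLV,PCG,1.0E-8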
Distributed ANSYS: Eigensolvers

- Block Lanczos eigensolver (including QR damp)
  - Supports SMP and GPU acceleration
- PCG Lanczos eigensolver
  - Supports SMP, DMP, and GPU acceleration
  - Great for large models (>5M DOF) with relatively few modes (<50)
- Supernode eigensolver
  - Supports SMP only
  - Optimal choice when requesting hundreds or thousands of modes
- Subspace eigensolver
  - Supports SMP, DMP, and GPU acceleration
  - Currently only supports buckling analyses; beta for modal in R14.5
- Unsymmetric/Damped eigensolvers
  - Support SMP, DMP, and GPU acceleration

Eigensolver selection is sketched below.
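These eigensolvers are chosen with the MODOPT command in a modal analysis. A minimal sketch (the mode counts are arbitrary examples):

    ANTYPE,MODAL

    ! Block Lanczos: extract the first 10 modes
    MODOPT,LANB,10

    ! Or PCG Lanczos: large models, relatively few modes
    MODOPT,LANPCG,20

    ! Or Supernode: hundreds or thousands of modes
    MODOPT,SNODE,500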
Distributed ANSYS: Benefits

- Better architecture
  - More computations performed in parallel, so faster solution time
- Better speedups than SMP
  - Can achieve >10x on 16 cores (try getting that with SMP!)
  - Can be used for jobs running on 1000+ CPU cores
- Can take advantage of resources on multiple machines
  - Memory usage and bandwidth scale
  - Disk (I/O) usage scales
  - A whole new class of problems can be solved!
Distributed ANSYS: Performance

Need fast interconnects to feed fast processors.

- Two main characteristics for each interconnect: latency and bandwidth
- Distributed ANSYS is highly bandwidth bound

    +--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+

     Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

     Total number of cores available    : 32
     Number of physical cores available : 32
     Number of cores requested          : 4 (Distributed Memory Parallel)
     MPI Type: INTELMPI

     Core  Machine Name   Working Directory
     ---------------------------------------------------
       0   hpclnxsmc00    /data1/ansyswork
       1   hpclnxsmc00    /data1/ansyswork
       2   hpclnxsmc01    /data1/ansyswork
       3   hpclnxsmc01    /data1/ansyswork

     Latency time from master to core 1 = 1.171 microseconds
     Latency time from master to core 2 = 2.251 microseconds
     Latency time from master to core 3 = 2.225 microseconds

     Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
     Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
     Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)
Distributed ANSYS: Performance

Need fast interconnects to feed fast processors.

[Chart: Interconnect Performance. Rating (runs/day) for Gigabit Ethernet vs. DDR InfiniBand.]
Distributed ANSYS: Performance

Need fast hard drives to feed fast processors.

- Check the bandwidth specs
  - ANSYS Mechanical can be highly I/O bandwidth bound
  - The sparse solver in the out-of-core memory mode does lots of I/O
- Distributed ANSYS can be highly I/O latency bound
  - Seek time to read/write each set of files causes overhead
- Consider SSDs
  - High bandwidth and extremely low seek times

One way to cut the sparse solver's I/O is sketched below.
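If enough RAM is available, the sparse solver's I/O can be avoided altogether by forcing the in-core memory mode. A minimal sketch using the documented memory-mode commands (availability of DSPOPTION depends on release):

    ! Shared-memory sparse solver: keep the factored matrix in memory
    BCSOPTION,,INCORE

    ! Distributed-memory sparse solver equivalent
    DSPOPTION,,INCORE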
Distributed ANSYS: Performance

Need fast hard drives to feed fast processors.

[Chart: Hard Drive Performance. Rating (runs/day) for HDD vs. SSD at 1, 2, 4, and 8 cores.]

Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Vol. IV, Issue 1, p. 37, 2010.
Distributed ANSYS: Performance

Avoid waiting for I/O to complete! Check to see if the job is I/O bound or compute bound:

- Check the output file for CPU and Elapsed times
- When Elapsed time >> main thread CPU time, the job is I/O bound:

    Total CPU time for main thread             :      167.8 seconds
    Elapsed Time (sec) =       388.000     Date = 08/21/2012

- Consider adding more RAM or a faster hard drive configuration

A quick way to pull these times is shown below.
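A quick check from the command line (model.out is a placeholder output file name):

    # Compare the main-thread CPU time against the elapsed (wall) time
    grep -i "Total CPU time for main thread" model.out
    grep -i "Elapsed Time" model.out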
Distributed ANSYS: Performance

| Release | Thermal (full model), 3M DOF | Thermomechanical simulation (full model), 7.8M DOF | Interpolation of boundary conditions | Submodel: creep strain analysis, 5.5M DOF | Total time |
|---|---|---|---|---|---|
| ANSYS 11.0 | 4 hours, 8 cores | ~5.5 days, 163 iterations, 8 cores | 37 hours, 16 load steps | ~5.5 days, 492 iterations, 18 cores | 2 weeks |
| ANSYS 12.0 | 4 hours, 8 cores | 34.3 hours, 164 iterations, 20 cores | 37 hours, 16 load steps | 38.5 hours, 492 iterations, 16 cores | 5 days |
| ANSYS 12.1 | 4 hours, 8 cores | 12.5 hours, 195 iterations, 64 cores | 37 hours, 16 load steps | 8.5 hours, 492 iterations, 76 cores | 2 days |
| ANSYS 13.0 SP2 | 4 hours, 8 cores | 9.9 hours, 195 iterations, 64 cores | 0.2 hour (improved algorithm) | 6.1 hours, 488 iterations, 128 cores | 1 day |
| ANSYS 14.0 | 1 hour, 8 cores + 1 GPU (0.8 hour on 32 cores) | 7.5 hours, 195 iterations, 128 cores | 0.2 hour, 16 load steps | 5.9 hours, 498 iterations, 64 cores + 8 GPUs (4.2 hours on 256 cores) | 0.5 day |

Results courtesy of MicroConsult Engineering, GmbH.
Distributed ANSYS: Performance

Minimum time to solution is more important than scaling.

[Chart: Solution Scalability. Speedup vs. number of cores (16 to 64).]
Distributed ANSYS: Performance

Minimum time to solution is more important than scaling.

[Chart: Solution Scalability. Solution elapsed time vs. number of cores (16 to 64); elapsed time falls from 11 hrs 48 mins to 1 hr 20 mins, and then to 30 mins.]
GPU Accelerator Capability

Graphics processing units (GPUs):

- Widely used for gaming and graphics rendering
- Recently been made available as general-purpose accelerators
  - Support for double-precision computations
  - Performance exceeding the latest multicore CPUs

So how can ANSYS make use of this new technology to reduce the overall time to solution?
GPU Accelerator Capability

- Accelerates the sparse direct solver (SMP & DMP)
  - The GPU is used to factor many dense frontal matrices
  - The decision on when to send data to the GPU is made automatically
    - Frontal matrix too small: too much overhead, stays on the CPU
    - Frontal matrix too large: exceeds GPU memory, only partially accelerated
- Accelerates the PCG/JCG iterative solvers (SMP & DMP)
  - The GPU is only used for the sparse matrix-vector multiply (SpMV kernel)
  - The decision on when to send data to the GPU is made automatically
    - Model too small: too much overhead, stays on the CPU
    - Model too large: exceeds GPU memory, only partially accelerated
GPU Accelerator Capability

- Supported hardware
  - Currently supports NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards
  - Next-generation NVIDIA Tesla cards (Kepler) should work with R14.5
  - Installing a GPU requires the following:
    - Larger power supply (a single card needs ~250 W)
    - Open 2x form factor PCIe x16 2.0 (or 3.0) slot
- Supported platforms
  - Windows and Linux 64-bit platforms only
  - Does not include the Linux Itanium (IA-64) platform
GPU Accelerator Capability

Targeted hardware:

[Table: NVIDIA Tesla C2075 and other Fermi/Kepler-class cards, listing power (W), memory, memory bandwidth (GB/s), and peak SP/DP speed (GFlops); the peak SP/DP figures shown include 1030/515, 1331/665, 2290/95, 4577/190, and 5184/1728.]

The NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.
GPU Accelerator Capability

GPUs can offer significantly faster time to solution.

[Chart: GPU Performance. Relative speedup of 8 cores + 1 GPU over 8 cores (no GPU), showing speedups of 2.6x and 3.8x.]
GPU Accelerator Capability

GPUs can offer significantly faster time to solution.

[Chart: GPU Performance. Relative speedup over 2 cores (no GPU): 2.7x with 8 cores + 1 GPU, 5.2x with 16 cores + 4 GPUs.]
GPU Accelerator Capability

- Supports the majority of ANSYS users
  - Covers both the sparse direct and PCG iterative solvers
  - Only a few minor limitations
- Ease of use
  - Requires at least one supported GPU card to be installed
  - Requires at least one HPC Pack license
  - No rebuild, no additional installation steps (see the sketch below)
- Performance
  - ~10-25% reduction in time to solution when using 8 CPU cores
  - Should never slow down your simulation!
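In practice, enabling the GPU is just one extra option on an otherwise unchanged launch line (executable and file names are placeholders, as before):

    # Same distributed job, now with NVIDIA GPU acceleration
    ansys145 -b -dis -np 8 -acc nvidia -i model.dat -o model.out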
Design Optimization

How will you use all of this computing power?

- Higher fidelity
- Full assemblies
- More nonlinear
- Design optimization studies
HPC Licensing

ANSYS HPC Packs enable high-fidelity insight:

- Each simulation consumes one or more packs
- Parallel enabled increases quickly with added packs (cores = 8 x 4^(packs - 1))

| Packs per simulation | Parallel enabled (cores) |
|---|---|
| 1 | 8 + 1 GPU |
| 2 | 32 + 4 GPUs |
| 3 | 128 + 16 GPUs |
| 4 | 512 + 64 GPUs |
| 5 | 2048 + 256 GPUs |
HPC Parametric Pack Licensing

- Scalable, like ANSYS HPC Packs
  - Enhances the customer's ability to include many design points as part of a single study
  - Ensures sound product decision making
- Amplifies the complete workflow
  - Design points can include execution of multiple products (pre, solve, HPC, post)
  - Packaged to encourage adoption of the path to robust design!

[Chart: number of simultaneous design points enabled (4, 8, 16, 32, 64) vs. number of HPC Parametric Pack licenses.]
HPC Revolution

- HDD vs. SSDs
- SMP vs. DMP
- GPUs
- Interconnects