
High Performance Computing for Mechanical Simulations using ANSYS

Jeff Beisheim
ANSYS, Inc.

HPC Defined
High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.

It is a hardware and software initiative!

Need for Speed
- Assemblies
- CAD to mesh
- Capture fidelity
- Impact product design
- Enable large models
- Allow parametric studies
- Modal
- Nonlinear
- Multiphysics
- Dynamics

A History of HPC Performance
- 1980s: Vector processing on mainframes
- ~1990: Shared Memory Multiprocessing (SMP) available
- ~1994: Iterative PCG solver introduced for large analyses
- 1999-2000: 64-bit large memory addressing; 1st company to solve 100M structural DOF
- 2005-2007: Distributed ANSYS (DMP) released; distributed PCG solver; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
- 2007-2009: Optimized for multicore processors; teraflop performance at 512 cores
- 2010: GPU acceleration (single GPU; SMP)
- 2012: GPU acceleration (multiple GPUs; DMP)

HPC Revolution
Recent advancements have revolutionized the computational speed available on the desktop:
- Multicore processors (every core is really an independent processor)
- Large amounts of RAM and SSDs
- GPUs

Parallel Processing - Hardware
2 types of memory systems:
- Shared memory parallel (SMP): single box, workstation/server
- Distributed memory parallel (DMP): multiple boxes, cluster

[Figure: a single workstation vs. a multi-node cluster]

Parallel Processing - Software
2 types of parallel processing for Mechanical APDL, plus GPU acceleration as an add-on (launch examples follow this list):
- Shared memory parallel (-np > 1)
  - First available in v4.3
  - Can only be used on a single machine
- Distributed memory parallel (-dis -np > 1)
  - First available in v6.0 with the DDS solver
  - Can be used on a single machine or a cluster
- GPU acceleration (-acc)
  - First available in v13.0 using NVIDIA GPUs
  - Supports using either a single GPU or multiple GPUs
  - Can be used on a single machine or a cluster
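
Each mode maps to command line options when launching Mechanical APDL. A minimal sketch, assuming the R14.5 Linux executable name ansys145 and illustrative file names input.dat/output.out:

    # Shared memory parallel (SMP) on 4 cores
    ansys145 -np 4 -i input.dat -o output.out

    # Distributed memory parallel (DMP) on 8 cores
    ansys145 -dis -np 8 -i input.dat -o output.out

    # DMP on 8 cores with NVIDIA GPU acceleration of the solver
    ansys145 -dis -np 8 -acc nvidia -i input.dat -o output.out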

Distributed ANSYS Design Requirements
- No limitation in simulation capability
  - Must support all features
  - Continually working to add more functionality with each release
- Reproducible and consistent results
  - Same answers achieved using 1 core or 100 cores
  - Same quality checks and testing are done as with the SMP version
  - Uses the same code base as the SMP version of ANSYS
- Support for all major platforms
  - Most widely used processors, operating systems, and interconnects
  - Supports the same platforms that the SMP version supports
  - Uses the latest versions of MPI software, which support the latest interconnects

Distributed ANSYS Design
Distributed steps (-dis -np N):
- At the start of the first load step, decompose the FEA model into N pieces (domains)
- Each domain goes to a different core to be solved
- The solution is not independent!
  - Lots of communication is required to achieve the solution
  - Lots of synchronization is required to keep all processes together
- Each process writes its own set of files (file0*, file1*, file2*, ..., file[N-1]*); an example file set is sketched below
- Results are automatically combined at the end of the solution
  - Facilitates postprocessing in /POST1, /POST26, or Workbench
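
As an illustration, a 4-way distributed run (-dis -np 4) produces numbered per-process file sets along these lines (the exact extensions depend on the analysis type and solver):

    file0.esav  file0.full  file0.rst    <- domain 0 (master process)
    file1.esav  file1.full  file1.rst    <- domain 1
    file2.esav  file2.full  file2.rst    <- domain 2
    file3.esav  file3.full  file3.rst    <- domain 3
    file.rst                             <- combined results written at the end of solution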

Distributed ANSYS Capabilities
A wide variety of features & analysis capabilities are supported:
- Static linear or nonlinear analyses
- Buckling analyses
- Modal analyses
- Harmonic response analyses using the FULL method
- Transient response analyses using the FULL method
- Single-field structural and thermal analyses
- Low-frequency electromagnetic analyses
- High-frequency electromagnetic analyses
- Coupled-field analyses
- All widely used element types and materials
- Superelements (use pass)
- NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, Linear Perturbation
- Multiframe restarts
- Cyclic symmetry analyses
- User Programmable Features (UPFs)

Distributed ANSYS Equation Solvers
Solver selection is made with the EQSLV command (see the sketch after this list):
- Sparse direct solver (default)
  - Supports SMP, DMP, and GPU acceleration
  - Can handle all analysis types and options
  - Foundation for the Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers
- PCG iterative solver
  - Supports SMP, DMP, and GPU acceleration
  - Symmetric, real-valued matrices only (i.e., static/full transient)
  - Foundation for the PCG Lanczos eigensolver
- JCG/ICCG iterative solvers
  - Support SMP only
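
A minimal sketch of solver selection in the solution processor (the PCG tolerance shown is its documented default):

    /SOLU
    EQSLV,SPARSE         ! sparse direct solver (the default)
    ! ...or...
    EQSLV,PCG,1.0E-8     ! PCG iterative solver with the default tolerance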

Distributed ANSYS Eigensolvers
Eigensolver selection is made with the MODOPT command for modal analyses (BUCOPT for buckling); see the sketch after this list:
- Block Lanczos eigensolver (including QR damp)
  - Supports SMP and GPU acceleration
- PCG Lanczos eigensolver
  - Supports SMP, DMP, and GPU acceleration
  - Great for large models (>5 MDOF) with relatively few modes (<50)
- Supernode eigensolver
  - Supports SMP only
  - Optimal choice when requesting hundreds or thousands of modes
- Subspace eigensolver
  - Supports SMP, DMP, and GPU acceleration
  - Currently only supports buckling analyses; beta for modal in R14.5
- Unsymmetric/Damped eigensolvers
  - Support SMP, DMP, and GPU acceleration
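
A minimal modal sketch (the mode count is illustrative):

    /SOLU
    ANTYPE,MODAL
    MODOPT,LANPCG,40     ! PCG Lanczos eigensolver, request 40 modes
    ! alternatives: MODOPT,LANB (Block Lanczos), MODOPT,SNODE (Supernode)
    SOLVE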

Distributed ANSYS Benefits
- Better architecture
  - More computations performed in parallel, hence faster solution time
- Better speedups than SMP
  - Can achieve >10x on 16 cores (try getting that with SMP!)
  - Can be used for jobs running on 1000+ CPU cores
- Can take advantage of resources on multiple machines
  - Memory usage and bandwidth scale
  - Disk (I/O) usage scales
  - A whole new class of problems can be solved!

Distributed ANSYS Performance
- Need fast interconnects to feed fast processors
- Two main characteristics for each interconnect: latency and bandwidth
- Distributed ANSYS is highly bandwidth bound

Example statistics from the output file (core 1 is on the same machine as the master; cores 2 and 3 communicate over QDR InfiniBand):

    +------------ D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+

    Release: 14.5            Build: UP20120802        Platform: LINUX x64
    Date Run: 08/09/2012     Time: 23:07
    Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

    Total number of cores available    : 32
    Number of physical cores available : 32
    Number of cores requested          : 4 (Distributed Memory Parallel)
    MPI Type: INTELMPI

    Core   Machine Name   Working Directory
    ----------------------------------------
      0    hpclnxsmc00    /data1/ansyswork
      1    hpclnxsmc00    /data1/ansyswork
      2    hpclnxsmc01    /data1/ansyswork
      3    hpclnxsmc01    /data1/ansyswork

    Latency time from master to core 1 = 1.171 microseconds
    Latency time from master to core 2 = 2.251 microseconds
    Latency time from master to core 3 = 2.225 microseconds

    Communication speed from master to core 1 = 7934.49 MB/sec
    Communication speed from master to core 2 = 3011.09 MB/sec
    Communication speed from master to core 3 = 3235.00 MB/sec

Distributed ANSYS Performance
Need fast interconnects to feed fast processors.

[Chart: Interconnect Performance - rating (runs/day) at 8, 16, 32, 64, and 128 cores for Gigabit Ethernet vs. DDR InfiniBand]
Benchmark: turbine model, 2.1 million DOF, SOLID187 elements, nonlinear static analysis, sparse solver (DMP), Linux cluster (8 cores per node)

Distributed ANSYS Performance
- Need fast hard drives to feed fast processors
  - Check the bandwidth specs
- ANSYS Mechanical can be highly I/O bandwidth bound
  - The sparse solver in the out-of-core memory mode does lots of I/O
- Distributed ANSYS can be highly I/O latency bound
  - The seek time to read/write each set of files causes overhead
- Consider SSDs
  - High bandwidth and extremely low seek times
- Consider RAID configurations
  - RAID 0: for speed
  - RAID 1, 5: for redundancy
  - RAID 10: for speed and redundancy

Distributed ANSYS Performance
Need fast hard drives to feed fast processors.

[Chart: Hard Drive Performance - rating (runs/day) at 1, 2, 4, and 8 cores for a single HDD vs. a single SSD]
Benchmark: 8 million DOF, linear static analysis, sparse solver (DMP), Dell T5500 workstation (12 Intel Xeon X5675 cores, 48 GB RAM, single 7.2k rpm HDD, single SSD, Win7)

Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, pp. 37, 2010.

Distributed ANSYS Performance
Avoid waiting for I/O to complete!
- Check whether the job is I/O bound or compute bound by comparing the CPU and Elapsed times in the output file (they can be pulled with grep, as sketched below)
- When Elapsed time >> main thread CPU time, the job is I/O bound:

    Total CPU time for main thread             =      167.8 seconds
    ...
    Elapsed Time (sec) =       388.000      Date =   08/21/2012

  - Consider adding more RAM or a faster hard drive configuration
- When Elapsed time ~= main thread CPU time, the job is compute bound
  - Consider moving the simulation to a machine with faster processors
  - Consider using Distributed ANSYS (DMP) instead of SMP
  - Consider running on more cores or possibly using GPU(s)
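
A quick way to pull both timing lines from a completed run; a sketch assuming a Linux shell and a hypothetical output file name job.out:

    # Compare wall-clock time against main thread CPU time
    grep -E "Total CPU time for main thread|Elapsed Time" job.out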

Distributed ANSYS Performance
Thermomechanical simulation workflow, release over release (all runs with the sparse solver; where two values appear under ANSYS 14.0, two run configurations are listed side by side):

| Analysis stage                             |            | ANSYS 11.0 | ANSYS 12.0 | ANSYS 12.1 | ANSYS 13.0 SP2 | ANSYS 14.0            |
|--------------------------------------------|------------|------------|------------|------------|----------------|-----------------------|
| Thermal (full model), 3 MDOF               | Time       | 4 hours    | 4 hours    | 4 hours    | 4 hours        | 1 hour / 0.8 hour     |
|                                            | Cores      | -          | -          | -          | -              | 8 + 1 GPU / 32        |
| Thermomechanical simulation (full model),  | Time       | ~5.5 days  | 34.3 hours | 12.5 hours | 9.9 hours      | 7.5 hours             |
| 7.8 MDOF                                   | Iterations | 163        | 164        | 195        | 195            | 195                   |
|                                            | Cores      | -          | 20         | 64         | 64             | 128                   |
| Interpolation of boundary conditions       | Time       | 37 hours   | 37 hours   | 37 hours   | 0.2 hour*      | 0.2 hour              |
|                                            | Load steps | 16         | 16         | 16         | 16             | 16                    |
| Submodel: creep strain analysis, 5.5 MDOF  | Time       | ~5.5 days  | 38.5 hours | 8.5 hours  | 6.1 hours      | 5.9 hours / 4.2 hours |
|                                            | Iterations | 492        | 492        | 492        | 488            | 498 / 498             |
|                                            | Cores      | 18         | 16         | 76         | 128            | 64 + 8 GPU / 256      |
| Total time                                 |            | ~2 weeks   | 5 days     | 2 days     | 1 day          | 0.5 day               |

* Improved algorithm as of 13.0 SP2.

- Hardware 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node
- Hardware 12.1 + 13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node
- ANSYS 12.0 to 14.0 runs with a DDR InfiniBand interconnect
- ANSYS 14.0 creep runs with NROPT,CRPL + DDOPT,METIS (sketched below)

Results courtesy of MicroConsult Engineering, GmbH
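
The two solution options named in the footnote translate to the following commands; a minimal sketch (DDOPT on the slide abbreviates the DDOPTION command, since APDL matches commands on their first four characters):

    /SOLU
    NROPT,CRPL           ! Newton-Raphson option used for the R14.0 creep runs
    DDOPTION,METIS       ! METIS graph-based domain decomposition for DMP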

Distributed ANSYS Performance
Minimum time to solution is more important than scaling.

[Chart: Solution Scalability - speedup (y-axis 0 to 25) vs. number of cores (0 to 64)]
Benchmark: turbine model, 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node)

Distributed ANSYS Performance
Minimum time to solution is more important than scaling.

[Chart: Solution Elapsed Time - elapsed time (seconds, 0 to 45000) vs. number of cores (0 to 64); annotated points at 11 hrs 48 mins, 1 hr 20 mins, and 30 mins]
Benchmark: turbine model, 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node)

GPU Accelerator Capability
Graphics processing units (GPUs):
- Widely used for gaming and graphics rendering
- Recently made available as general purpose accelerators
  - Support for double precision computations
  - Performance exceeding the latest multicore CPUs

So how can ANSYS make use of this new technology to reduce the overall time to solution?

GPU Accelerator Capability
- Accelerates the sparse direct solver (SMP & DMP)
  - The GPU is used to factor many dense frontal matrices
  - The decision on when to send data to the GPU is made automatically
    - Frontal matrix too small: too much overhead, stays on the CPU
    - Frontal matrix too large: exceeds GPU memory, only partially accelerated
- Accelerates the PCG/JCG iterative solvers (SMP & DMP)
  - The GPU is only used for the sparse matrix-vector multiply (SpMV kernel)
  - The decision on when to send data to the GPU is made automatically
    - Model too small: too much overhead, stays on the CPU
    - Model too large: exceeds GPU memory, only partially accelerated
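
At launch time the accelerator is requested with the -acc option; a sketch of a multi-GPU DMP run, assuming the R14.5 Linux executable name and that your release supports the -na option for setting the GPU count:

    # DMP on 16 cores, accelerating the solver with 4 NVIDIA GPUs
    ansys145 -dis -np 16 -acc nvidia -na 4 -i input.dat -o output.out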

GPU Accelerator Capability
- Supported hardware
  - Currently supports NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards
  - Next generation NVIDIA Tesla cards (Kepler) should work with R14.5
  - Installing a GPU requires the following:
    - A larger power supply (a single card needs ~250 W)
    - An open 2x form factor PCIe x16 2.0 (or 3.0) slot
- Supported platforms
  - Windows and Linux 64-bit platforms only
  - Does not include the Linux Itanium (IA-64) platform
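
Before launching, a quick sanity check that a supported card is visible to the operating system is NVIDIA's nvidia-smi utility, which ships with the driver:

    nvidia-smi    # lists installed NVIDIA GPUs, driver version, and memory usage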

GPU Accelerator Capability
Targeted hardware:

|                            | Tesla C2075 | Tesla M2090 | Quadro 6000 | Quadro K5000* | Tesla K10* | Tesla K20* |
|----------------------------|-------------|-------------|-------------|---------------|------------|------------|
| Power (W)                  | 225         | 250         | 225         | 122           | 250        | 250        |
| Memory                     | 6 GB        | 6 GB        | 6 GB        | 4 GB          | 8 GB       | 6 to 24 GB |
| Memory bandwidth (GB/s)    | 144         | 177.4       | 144         | 173           | 320        | 288        |
| Peak speed SP/DP (GFlops)  | 1030/515    | 1331/665    | 1030/515    | 2290/95       | 4577/190   | 5184/1728  |

* These NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.

GPU Accelerator Capability
GPUs can offer significantly faster time to solution.

[Chart: GPU Performance - relative speedup of 1.0x at 2 cores (no GPU), 2.6x at 8 cores (no GPU), and 3.8x at 8 cores (1 GPU)]
Benchmark: 6.5 million DOF, linear static analysis, sparse solver (DMP), 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7

GPU Accelerator Capability
GPUs can offer significantly faster time to solution.

[Chart: GPU Performance - relative speedup of 1.0x at 2 cores (no GPU), 2.7x at 8 cores (1 GPU), and 5.2x at 16 cores (4 GPUs)]
Benchmark: 11.8 million DOF, linear static analysis, PCG solver (DMP), 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7

GPU Accelerator Capability
- Supports the majority of ANSYS users
  - Covers both the sparse direct and PCG iterative solvers
  - Only a few minor limitations
- Ease of use
  - Requires at least one supported GPU card to be installed
  - Requires at least one HPC Pack license
  - No rebuild, no additional installation steps
- Performance
  - ~10-25% reduction in time to solution when using 8 CPU cores
  - Should never slow down your simulation!

Design Optimization
How will you use all of this computing power?
- Higher fidelity
- Full assemblies
- Design optimization studies
- More nonlinear

HPC Licensing
- ANSYS HPC Packs enable high-fidelity insight
  - Each simulation consumes one or more packs
  - Parallel enabled increases quickly with added packs
- Single solution for all physics and any level of fidelity
- Flexibility as your HPC resources grow
  - Reallocate packs as resources allow

Parallel enabled per simulation:

| Packs per simulation   | 1     | 2      | 3        | 4        | 5          |
| Cores + GPUs enabled   | 8 + 1 | 32 + 4 | 128 + 16 | 512 + 64 | 2048 + 256 |

HPC Parametric Pack Licensing
- Scalable, like ANSYS HPC Packs
- Enhances the customer's ability to include many design points as part of a single study
  - Ensures sound product decision making
- Amplifies the complete workflow
  - Design points can include execution of multiple products (pre, solve, HPC, post)
- Packaged to encourage adoption of the path to robust design!

Number of simultaneous design points enabled: 4, 8, 16, 32, or 64, scaling with the number of HPC Parametric Pack licenses.

HPC Revolution
The right combination of algorithms and hardware leads to maximum efficiency:
- HDDs vs. SSDs
- SMP vs. DMP
- GPUs
- Clusters
- Interconnects

HPC Revolution
- Every computer today is a parallel computer
- Every simulation in ANSYS can benefit from parallel processing
