X. Sherry Li
xsli@lbl.gov
Lawrence Berkeley National Laboratory
Sparse matrix: lots of zeros
fluid dynamics, structural mechanics, chemical process simulation,
circuit simulation, electromagnetic fields, magneto-hydrodynamics,
seismic imaging, economic modeling, optimization, data analysis,
statistics, . . .
Example: A of dimension 10^6, with 10-100 nonzeros per row
MATLAB: >> spy(A)
[Figures: spy plots of Boeing/msc00726 (structural eng.) and Mallya/lhr01 (chemical eng.)]
First step of GE (Gaussian elimination)
A = [ α   w^T ] = [ 1     0 ] · [ α   w^T ]
    [ v   B   ]   [ v/α   I ]   [ 0   C   ]

where C = B - v·w^T/α.

Repeat GE on C
Results in an LU factorization (A = LU)
– L lower triangular with unit diagonal, U upper triangular
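The recursion above is exactly dense right-looking LU. A minimal C sketch of it (no pivoting, which a real solver would add; the matrix values are made up for illustration):

#include <stdio.h>

#define N 3

/* Right-looking LU: at step k, alpha = A[k][k], v = A[k+1:][k],
   w^T = A[k][k+1:]. Store v/alpha in the strictly lower triangle (L)
   and apply the rank-1 update C = B - v*w^T/alpha to the trailing block. */
static void lu_in_place(double A[N][N]) {
    for (int k = 0; k < N; ++k) {
        for (int i = k + 1; i < N; ++i) {
            A[i][k] /= A[k][k];                 /* multiplier v_i / alpha */
            for (int j = k + 1; j < N; ++j)
                A[i][j] -= A[i][k] * A[k][j];   /* rank-1 update          */
        }
    }
}

int main(void) {
    double A[N][N] = {{4, 3, 2}, {8, 7, 9}, {4, 5, 3}};
    lu_in_place(A);
    /* L (unit diagonal, strictly lower part) and U (upper part) overwrite A */
    for (int i = 0; i < N; ++i)
        printf("%6.2f %6.2f %6.2f\n", A[i][0], A[i][1], A[i][2]);
    return 0;
}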
Sparse factorization
Store A explicitly … many sparse compressed formats (see the CSC sketch after this list)
“Fill-in” . . . new nonzeros in L & U
Graph algorithms: directed/undirected graphs, bipartite graphs,
paths, elimination trees, depth-first search, heuristics for NP-hard
problems, cliques, graph partitioning, . . .
Unfriendly to high-performance, parallel computing:
irregular memory access, indirect addressing, strong task/data
dependencies
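A compressed sparse column (CSC) sketch in C, one of the common compressed formats; the struct name and layout here are illustrative, not SuperLU's actual types. The matvec shows the indirect addressing mentioned above:

#include <stdio.h>

/* Compressed sparse column (CSC): for each column j, the nonzero values
   and their row indices sit contiguously in positions
   colptr[j] .. colptr[j+1]-1 of nzval[] and rowind[]. */
typedef struct {
    int     n;        /* matrix dimension                      */
    int     nnz;      /* number of nonzeros                    */
    double *nzval;    /* nonzero values, column by column      */
    int    *rowind;   /* row index of each nonzero             */
    int    *colptr;   /* start of each column in nzval/rowind  */
} CscMatrix;

/* y = A*x via indirect addressing: the irregular access pattern. */
void csc_matvec(const CscMatrix *A, const double *x, double *y) {
    for (int i = 0; i < A->n; ++i) y[i] = 0.0;
    for (int j = 0; j < A->n; ++j)
        for (int p = A->colptr[j]; p < A->colptr[j + 1]; ++p)
            y[A->rowind[p]] += A->nzval[p] * x[j];
}

int main(void) {
    /* 3x3 example: [[1,0,2],[0,3,0],[4,0,5]] */
    double nzval[]  = {1, 4, 3, 2, 5};
    int    rowind[] = {0, 2, 1, 0, 2};
    int    colptr[] = {0, 2, 3, 5};
    CscMatrix A = {3, 5, nzval, rowind, colptr};
    double x[] = {1, 1, 1}, y[3];
    csc_matvec(&A, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);   /* 3 3 9 */
    return 0;
}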
[Figures: two task-graph models of sparse LU: the supernodal DAG over the L and U factors, and the multifrontal elimination tree (nodes 1-9).]
SuperLU direct solver: software aspects
www.crd.lbl.gov/~xiaoye/SuperLU
First release in 1999: serial and multithreaded (Pthreads) versions.
Later: OpenMP, distributed-memory MPI, and MPI + OpenMP + CUDA.
From a single developer to many developers: version control (svn), testing code.
SuperLU numerical testing
The regression tests aim to cover all routines by exercising every function of the user-callable interface.
BERR = max_i |r_i| / (|A|·|x| + |b|)_i      (componentwise backward error, r = b - A·x)

FERR = ||x_true - x||_∞ / ||x||_∞            (forward error bound)
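A sketch of the componentwise backward-error computation, reusing the illustrative CscMatrix type from the CSC sketch earlier; this is a direct reading of the BERR formula, not SuperLU's actual test code:

#include <stdlib.h>
#include <math.h>

/* BERR = max_i |r_i| / (|A|*|x| + |b|)_i with r = b - A*x,
   for a CSC matrix (CscMatrix as defined in the earlier sketch). */
double backward_error(const CscMatrix *A, const double *x, const double *b) {
    int n = A->n;
    double *r = malloc(n * sizeof *r);   /* residual b - A*x            */
    double *d = malloc(n * sizeof *d);   /* denominator (|A|*|x|+|b|)_i */
    for (int i = 0; i < n; ++i) { r[i] = b[i]; d[i] = fabs(b[i]); }
    for (int j = 0; j < n; ++j)
        for (int p = A->colptr[j]; p < A->colptr[j + 1]; ++p) {
            int i = A->rowind[p];
            r[i] -= A->nzval[p] * x[j];
            d[i] += fabs(A->nzval[p]) * fabs(x[j]);
        }
    double berr = 0.0;
    for (int i = 0; i < n; ++i)
        if (d[i] > 0.0 && fabs(r[i]) / d[i] > berr)
            berr = fabs(r[i]) / d[i];
    free(r);
    free(d);
    return berr;
}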
Malloc/free balance check
In debugging mode, allocations go through the SUPERLU_MALLOC / SUPERLU_FREE wrappers, which track the allocated size so that unbalanced malloc/free pairs can be detected.
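The idea in a self-contained sketch (illustrative only, not SuperLU's actual SUPERLU_MALLOC implementation): stash the size in a header ahead of each payload and keep a running byte count, which must return to zero at exit:

#include <stdio.h>
#include <stdlib.h>

/* Running total of live heap bytes; should be 0 at exit. */
static size_t bytes_in_use = 0;

/* Store the size in a header ahead of the user payload so the
   matching free can subtract it back out. */
void *dbg_malloc(size_t size) {
    size_t *p = malloc(sizeof(size_t) + size);
    if (!p) return NULL;
    *p = size;
    bytes_in_use += size;
    return p + 1;
}

void dbg_free(void *ptr) {
    if (!ptr) return;
    size_t *p = (size_t *)ptr - 1;
    bytes_in_use -= *p;
    free(p);
}

int main(void) {
    double *x = dbg_malloc(100 * sizeof(double));
    printf("in use: %zu bytes\n", bytes_in_use);   /* 800          */
    dbg_free(x);
    printf("in use: %zu bytes\n", bytes_in_use);   /* 0: balanced  */
    return 0;
}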
SuperLU distributed factorization
• O(N^2) flops, O(N^{4/3}) memory for typical 3D problems.
Per-rank Schur complement update
[Figure: 2x3 process grid (ranks 0-5) mapped block-cyclically onto the matrix, with a look-ahead window.]

Loop through N steps (Gaussian elimination):
FOR ( k = 1, N ) {
  1) Gather sparse blocks A(:, k) and A(k, :) into dense work[]
  2) Call dense GEMM on work[]
  3) Scatter work[] into the remaining sparse blocks
}
• Graph at step k+1 differs from step k
• Panel factorization is on the critical path (see the sketch below)
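A toy version of the gather/GEMM/scatter step on dense buffers. The layouts, index maps, and sizes are made up; in SuperLU_DIST the GEMM is a single BLAS call on the gathered work[] buffer:

#include <stdio.h>

enum { M = 2, K = 2, N = 2 };

int main(void) {
    /* 1) gathered dense panels at step k: L-panel (M x K), U-panel (K x N) */
    double L[M][K] = {{1, 2}, {3, 4}};
    double U[K][N] = {{5, 6}, {7, 8}};
    double work[M][N];

    /* 2) dense GEMM on work[]: work = L * U (one big BLAS call in practice) */
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            work[i][j] = 0.0;
            for (int t = 0; t < K; ++t) work[i][j] += L[i][t] * U[t][j];
        }

    /* 3) scatter: subtract work[] from the destination blocks;
       rows[]/cols[] map dense positions back to sparse block entries */
    double Ablk[3][3] = {{0}};
    int rows[M] = {0, 2}, cols[N] = {1, 2};
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            Ablk[rows[i]][cols[j]] -= work[i][j];

    printf("%g %g\n", Ablk[0][1], Ablk[2][2]);   /* -19 -50 */
    return 0;
}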
Developers: Sherry Li, Jim Demmel, John Gilbert, Laura Grigori, Piyush Sao, Meiyue Shao, Ichitaro Yamazaki.
SuperLU: deploying GPU accelerators
[Figure: GPU offload of the dense GEMM calls in the Schur-complement update.]
SuperLU optimization on Intel Xeon Phi
Replacing many small, independent single-threaded MKL DGEMM calls with large multithreaded MKL DGEMMs: 15-20% faster.
Using nested parallel-for regions and tasking avoids load imbalance and increases the amount of parallelism: 10-15% faster.
Challenges: non-uniform block sizes, many small blocks (see the OpenMP sketch below).
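A schematic of the tasking idea in OpenMP C (compile with -fopenmp; the block count and per-block costs are artificial):

#include <stdio.h>
#include <omp.h>

#define NBLOCKS 8

/* Stand-in for one block update; cost varies with b to mimic the
   non-uniform block sizes. */
static double update_block(int b) {
    double s = 0.0;
    for (long i = 0; i < (long)(b + 1) * 1000000; ++i) s += 1.0 / (i + 1);
    return s;
}

int main(void) {
    double total = 0.0;
    /* One task per block: the runtime load-balances the non-uniform
       work across threads instead of a static loop partition. */
    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < NBLOCKS; ++b) {
            #pragma omp task firstprivate(b) shared(total)
            {
                double s = update_block(b);
                #pragma omp atomic
                total += s;
            }
        }
    }   /* barrier at end of single waits for all tasks */
    printf("total = %f\n", total);
    return 0;
}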
Factorization time (seconds), mixing MPI ranks (p) and OpenMP threads (t):

              1 node = 64 cores     8 nodes = 512 cores     32 nodes = 2048 cores
              64p x 1t   32p x 2t   256p x 2t   128p x 4t   512p x 4t   256p x 8t
nlpkkt80          --       66.7        35.2        27.5        24.2        25.7
Ga19As19H42     129.0      130.7       28.3        25.6        15.6        16.8
Examples in EXAMPLE/ (a schematic reuse fragment follows this list)
§ pddrive.c: solve one linear system
§ pddrive1.c: solve systems with the same A but different right-hand sides at different times
  § Reuses the factored form of A
§ pddrive2.c: solve systems with the same sparsity pattern as A
  § Reuses the sparsity ordering
§ pddrive3.c: solve systems with the same sparsity pattern and similar values
  § Reuses the sparsity ordering and symbolic factorization
§ pddrive4.c: divide the processes into two subgroups (two grids) such that each subgroup solves a linear system independently of the other
[Figure: twelve processes 0-11 split into two process grids, 0-5 and 6-11.]
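A condensed view of the reuse logic. This is a schematic fragment, not compilable as-is: the process grid, the distributed matrix, and the auxiliary structs are set up as in pddrive.c, and pdgssvx's argument list is abbreviated here. The option values are SuperLU_DIST's fact_t settings:

/* First solve: factor A from scratch. */
set_default_options_dist(&options);
options.Fact = DOFACT;
pdgssvx(&options, /* ... A, b1, grid, LU structs ... */ &info);

/* pddrive1.c: same A, new right-hand side: reuse L and U. */
options.Fact = FACTORED;
pdgssvx(&options, /* ... b2 ... */ &info);

/* pddrive2.c: same sparsity pattern, new values: reuse the ordering. */
options.Fact = SamePattern;

/* pddrive3.c: same pattern and similar values: also reuse the row
   permutation and the symbolic factorization. */
options.Fact = SamePattern_SameRowPerm;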
Domain decomposition with Schur complement
(PDSLin : http://portal.nersc.gov/project/sparse/pdslin/)
Block 2x2 system:

[ A11  A12 ] [ x1 ]   [ b1 ]
[ A21  A22 ] [ x2 ] = [ b2 ]

where A11 is block diagonal over the subdomains:

[ D1                E1  ]
[      D2           E2  ]
[           ...     ... ]
[               Dk  Ek  ]
[ F1   F2   ... Fk  A22 ]
Schur complement:

S = A22 - A21 A11^{-1} A12
  = A22 - (U11^{-T} A21^T)^T (L11^{-1} A12)
  = A22 - W · G,      where A11 = L11 U11

S is defined on the interface (separator) variables and need not be formed explicitly.
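A tiny dense illustration of the formula, using an LU of A11 and triangular solves rather than forming A11^{-1} (sizes and values are made up; no pivoting):

#include <stdio.h>

enum { N1 = 2, N2 = 1 };   /* interior size, interface size */

static void lu(double A[N1][N1]) {              /* in-place LU, A11 = L11*U11 */
    for (int k = 0; k < N1; ++k)
        for (int i = k + 1; i < N1; ++i) {
            A[i][k] /= A[k][k];
            for (int j = k + 1; j < N1; ++j) A[i][j] -= A[i][k] * A[k][j];
        }
}

static void solve(double LU[N1][N1], double x[N1]) {
    for (int i = 1; i < N1; ++i)                /* L y = b (unit diagonal) */
        for (int j = 0; j < i; ++j) x[i] -= LU[i][j] * x[j];
    for (int i = N1 - 1; i >= 0; --i) {         /* U x = y                 */
        for (int j = i + 1; j < N1; ++j) x[i] -= LU[i][j] * x[j];
        x[i] /= LU[i][i];
    }
}

int main(void) {
    double A11[N1][N1] = {{4, 1}, {2, 3}};
    double A12[N1][N2] = {{1}, {2}};
    double A21[N2][N1] = {{3, 1}};
    double A22[N2][N2] = {{5}};

    lu(A11);                                    /* factor once             */
    double g[N1] = {A12[0][0], A12[1][0]};
    solve(A11, g);                              /* g = A11^{-1} * A12      */
    double s = A22[0][0] - (A21[0][0] * g[0] + A21[0][1] * g[1]);
    printf("S = %g\n", s);                      /* S = 4.1                 */
    return 0;
}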
Hierarchical parallelism
Multiple processors per subdomain, e.g., one subdomain handled by a 2x3 process grid (SuperLU_DIST)

[Figure: subdomains D1, D2, D3 and borders E1, E2, E3 assigned to process groups P(0:5), P(6:11), P(12:17); interface blocks F1 ... F4 and A22 along the bottom.]
Advantages:
Constant number of subdomains, Schur complement size, and convergence rate, regardless of core count.
Need only a modest level of parallelism from the direct solver.
PDSLin configurable as hybrid, iterative, or direct

Default (hybrid):
  Subdomain: LU
  Schur: Krylov

Iterative options:
(1) num_doms = 0, Schur = A: Krylov on the whole system
(2) FGMRES inner-outer: subdomain ILU, Schur Krylov

Direct options:
(1) Subdomain: LU; Schur: LU, drop_tol = 0.0
(2) num_doms = 1: LU on the whole matrix as one subdomain
PDSLin in Omega3P: accelerator cavity design
Computation results
§ 2.3M elements
§ Second-order finite elements (p = 2)
- 14M DOFs, 590M nonzeros
- MUMPS on 400 nodes (800 cores): solution time 6:48 min
- PDSLin on Edison, 100 nodes (2,400 cores): solution time 6:20 min
STRUMPACK “inexact” direct solver
portal.nersc.gov/project/sparse/strumpack/
• In addition to structural sparsity, further apply data-sparsity: compress dense blocks with low-rank factorizations:

A ≈ [ D1             U1 B1 V2^T ]
    [ U2 B2 V1^T     D2         ]

with nested bases across levels: big U = [ U1  0 ; 0  U2 ] · U3, applied recursively at every level of the hierarchy.

• O(N log N) flops, O(N) memory for 3D elliptic PDEs.
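To make the data-sparse form concrete, a tiny 2x2 block matvec in C that never forms the off-diagonal blocks, only their U·B·V^T factors (rank 1 here; all sizes and values are made up):

#include <stdio.h>

enum { M = 2, RK = 1 };    /* block size, off-diagonal rank */

int main(void) {
    double D1[M][M] = {{4, 1}, {0, 3}};
    double D2[M][M] = {{5, 2}, {1, 6}};
    double U1[M][RK] = {{1}, {2}}, V2[M][RK] = {{1}, {1}}, B1[RK][RK] = {{0.5}};
    double U2[M][RK] = {{1}, {1}}, V1[M][RK] = {{2}, {0}}, B2[RK][RK] = {{0.25}};
    double x[2 * M] = {1, 1, 1, 1}, y[2 * M] = {0};

    /* diagonal blocks: ordinary dense matvec */
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < M; ++j) {
            y[i]     += D1[i][j] * x[j];
            y[M + i] += D2[i][j] * x[M + j];
        }
    /* off-diagonal blocks: t = V^T x costs O(M*RK), then y += U*B*t */
    double t12 = 0, t21 = 0;
    for (int j = 0; j < M; ++j) { t12 += V2[j][0] * x[M + j]; t21 += V1[j][0] * x[j]; }
    for (int i = 0; i < M; ++i) {
        y[i]     += U1[i][0] * B1[0][0] * t12;
        y[M + i] += U2[i][0] * B2[0][0] * t21;
    }
    for (int i = 0; i < 2 * M; ++i) printf("%g ", y[i]);   /* 6 5 7.5 7.5 */
    printf("\n");
    return 0;
}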
HSS approximation error vs. drop tolerance
Randomized sampling to reveal rank
1. Pick a random matrix Ω of size n × (k+p), where k is the target rank and p a small oversampling parameter, e.g., 10
2. Form the sample matrix S = A·Ω
3. Compute Q = orthonormal basis of S via rank-revealing QR
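A self-contained C sketch of steps 1-3, with plain modified Gram-Schmidt standing in for the rank-revealing QR and a uniform random Ω; all sizes are made up:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

enum { N = 6, R = 3 };   /* n, and k + p sample columns */

/* Step 2: S = A * Omega samples the range of A. */
static void sample(double A[N][N], double Om[N][R], double S[N][R]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < R; ++j) {
            S[i][j] = 0.0;
            for (int t = 0; t < N; ++t) S[i][j] += A[i][t] * Om[t][j];
        }
}

/* Step 3 (simplified): orthonormalize the columns of S in place with
   modified Gram-Schmidt; a rank-revealing QR would be used in practice. */
static void orthonormalize(double S[N][R]) {
    for (int j = 0; j < R; ++j) {
        for (int q = 0; q < j; ++q) {
            double dot = 0.0;
            for (int i = 0; i < N; ++i) dot += S[i][q] * S[i][j];
            for (int i = 0; i < N; ++i) S[i][j] -= dot * S[i][q];
        }
        double nrm = 0.0;
        for (int i = 0; i < N; ++i) nrm += S[i][j] * S[i][j];
        nrm = sqrt(nrm);
        for (int i = 0; i < N; ++i) S[i][j] /= nrm;
    }
}

int main(void) {
    double A[N][N], Om[N][R], S[N][R];
    srand(1);
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) A[i][j] = (double)rand() / RAND_MAX;
        for (int j = 0; j < R; ++j) Om[i][j] = (double)rand() / RAND_MAX - 0.5;  /* step 1 */
    }
    sample(A, Om, S);          /* step 2 */
    orthonormalize(S);         /* step 3 */
    /* columns of S now form an (approximate) orthonormal basis Q of range(A) */
    printf("Q[0][0] = %f\n", S[0][0]);
    return 0;
}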
[Plots: HSS approximation error and rank vs. drop tolerance (1 down to 1e-14), for sampling parameter d = 16, 64, 256, 1024.]
STRUMPACK: parallelism and performance
3 types of tree parallelism:
• Elimination tree
• HSS tree
• Task tree within the dense kernels at each node

[Bar chart: shared-memory OpenMP task parallelism on Intel Ivy Bridge; factor and solve times of MF and MF+HSS vs. MKL PARDISO on atmosmodd, Geo_1438, nlpkkt80, tdr190k, torso3, Transport, A22, Serena, spe10-aniso; PARDISO runs out of memory on some problems.]

[Plots: HSS-GEMM runtime grows like n·log^2 n vs. n^2 to n^3 for LU, for n = 16,000 to 370,000; strong scaling from 1 to 1152 cores (MPI processes).]
Summary
Explore new algorithms with lower arithmetic complexity, lower memory and communication costs, and faster convergence rates.
Higher-fidelity simulations require higher resolution and faster turn-around times.