Parallelism in Spectral Methods

C. CANUTO
Abstract. Several strategies of parallelism for spectral algorithms are discussed.
The investigation shows that, despite the intrinsic lack of locality of spectral
methods, they are amenable to parallel implementations, even on fine grain
architectures. Typical algorithms for the spectral approximation of the viscous,
incompressible Navier-Stokes equations serve as examples in the discussion.
Introduction.
Since their origin in the late sixties, spectral methods in their modern form have
been designed and developed with the aim of solving problems which could not
be tackled by more conventional numerical methods (finite differences, and later
finite elements). The direct simulation of turbulence for incompressible flows is
the most widely known example of such applications: the range of phenomena
amenable to a satisfactory numerical simulation has widened over the years
under the twofold effect of the increase in computer power and the development of sophisticated algorithms of spectral type. The simulation of the same
phenomena by other techniques would have required a computer power larger by
orders of magnitude; hence, it would not have been feasible on the currently
available machines (a discussion of the most significant achievements of spectral
methods in fluid dynamics can be found, e.g., in Chapter 1 of ref. [1]).
Since spectral methods have been constantly used in "extreme" applications,
their implementation has taken place on state-of-the-art computer architectures.
The vectorization of spectral algorithms was a fairly easy task. Nowadays, spectral codes for fluid dynamics run on vector supercomputers such as the Cray
family or the Cyber 205, taking full advantage of their pipeline architectures and
reaching rates of vectorization well above 80% (we refer, e.g., to Appendix B in
ref. [1]).
On the contrary, the implementation of spectral algorithms on parallel computers is still in its infancy. This is partly due to the fact that multiprocessor
supercomputers are only now becoming available to the scientific community.
But there is also a deeper motivation: it is not yet clear whether and how the
global character of spectral methods will fit efficiently into a highly granular
parallel architecture. Thus, a deep investigation, of both a theoretical and an experimental nature, is needed. As a testimony of the present uncertainty on this
topic, we quote the point of view of researchers working at the development of a
multipurpose parallel supercomputer, especially tailored for fluid-dynamics applications, known as the Navier-Stokes Computer (NSC). This is a joint project
between Princeton University and the NASA Langley Research Center, aimed at
building a parallel supercomputer made up of a fairly small number of powerful
nodes. Each node has the performance of a class VI vector supercomputer; the
initial configuration will have 64 such nodes. Despite the superior accuracy of
spectral methods over finite difference methods, the scientists involved in this
project have chosen to employ low-order finite differences at least in the initial
investigation on how well transition and turbulence algorithms can exploit the
NSC architecture. Indeed ,~the much greater communication demands of the
global discretization may well tip the balance in favor of the less accurate, but
simpler local discretizations>> ([ 12]).
Currently, a number of implementations of spectral algorithms on parallel
architectures are documented. Let us refer here to the work done at the IBM
European Center for Scientific and Engineering Computing (ECSEC) in Rome,
at the NASA Langley Research Center by Erlebacher, Bokhari and Hussaini [5],
and at ONERA (France) by Leca and Sacchi-Landriani [11]. The IBM con-
The generic problem to be approximated is of the form

(1.1)    L(u) = f in Ω,    plus boundary conditions on ∂Ω.

The spectral solution is sought as a truncated expansion

(1.2)    u_N = Σ_{|k| ≤ N} û_k φ_k(x),

where

(1.3)    φ_k(x) = Π_{i=1}^d φ_{k_i}^{(i)}(x_i).

Each φ_m^{(i)} is a smooth global basis function on (a_i, b_i), satisfying the orthogonality
condition

(1.4)    ∫_{a_i}^{b_i} φ_m^{(i)}(x) φ_n^{(i)}(x) w^{(i)}(x) dx = c_n δ_mn.

The discrete problem then reads: find u_N ∈ X_N such that
(1.5)    (L_N(u_N), v)_N = (f, v)_N    for all v ∈ Y_N.
Here X_N is the space of trial functions, Y_N is the space of test functions, L_N is an
approximation of the differential operator L, and (u, v)_N is an inner product,
which may depend upon the cut-off number N. In general, when X_N = Y_N and
the inner product is the L²(Ω) inner product, we speak of a Galerkin method; this is
quite common for periodic boundary value problems. Otherwise, for nonperiodic boundary conditions, we have a tau method when the inner product is the
L²(Ω) inner product and Y_N is a space of test functions which do not individually
satisfy the boundary conditions, or a collocation method when the inner product is
an approximation of the L²(Ω) inner product based on a Gaussian quadrature
rule.
In order to have a genuine spectral method, the basis functions in the expansion (1.2) must satisfy a supplementary property, in addition to the orthogonality
condition (1.4): if one expands a smooth function according to this basis, the
"Fourier" coefficients of the function should decay at a rate which is a monotonically increasing, divergent function of the degree of regularity of the function. This
occurs if we approximate a periodic function by the trigonometric system (if u ∈
C^s(0, 2π), then û_k = O(|k|^(-s))). The same property holds if we expand a nonperiodic function according to the eigenfunctions of a singular Sturm-Liouville
problem (such as Jacobi polynomials). The above mentioned property is known
as the spectral accuracy property. When it is satisfied, one can prove
an error estimate for the approximation (1.5) of problem (1.1) of the form
(1.6)    ||u − u_N|| ≤ C(r) N^(−r) ||u||_{H^r},
where the spaces H^r form a scale of Hilbert spaces in which the regularity of u is
measured. Estimate (1.6) gives theoretical evidence of the fundamental property
of spectral methods, namely, that they guarantee an accurate representation of
smooth, although highly structured, phenomena by a "minimal" number of unknowns.
Spectral methods owe their success to the availability of "fast algorithms" to
handle complex problems. The discrete solution u_N is determined by the set of its
"Fourier coefficients" {û_k : |k| ≤ N} according to the expansion (1.2), but it can
also be uniquely defined by the set of its values {u_j = u_N(x_j)} at a selected
set of points G_N = {x_j} in Ω̄. The points in G_N are usually the nodes of Gaussian
quadrature formulae in Ω, such as the points x_j = jπ/N, j = 0, …, 2N−1, in [0, 2π] for the
trigonometric system, or the points x_j = cos(jπ/N), j = 0, …, N, in [−1, 1] for the
Chebyshev system. Thus, we have a double representation of u_N, one in Transform
space, the other in Physical space. The discrete transform

(1.7)    {u_j} ⟷ {û_k}

is a global transformation (each û_k depends upon all the u_j's, and conversely). For
the Fourier and Chebyshev systems, fast transform algorithms are available to
carry out the transformation in a cheap way. Thus, one can use either representation of the discrete solution within a spectral scheme, depending upon which is
the most appropriate and efficient.
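The double representation and the discrete transform (1.7) can be illustrated for the trigonometric system with a standard FFT; the grid size, the normalization and the sample function below are illustrative choices, not taken from the text.

```python
import numpy as np

# 2N equispaced nodes x_j = j*pi/N on [0, 2*pi): Physical-space representation
N = 8
j = np.arange(2 * N)
x = j * np.pi / N

u = np.sin(x)                      # a sample smooth periodic function

# Transform-space representation: discrete Fourier coefficients u_hat_k
u_hat = np.fft.fft(u) / (2 * N)    # global: every u_hat_k uses all the u_j's

# the coefficient of e^{ix} in sin x = (e^{ix} - e^{-ix})/(2i) is -0.5i
assert np.isclose(u_hat[1], -0.5j)

# the transform is invertible: back to Physical space
u_back = np.real(np.fft.ifft(u_hat) * (2 * N))
assert np.allclose(u_back, u)
```

Either set of 2N numbers determines the other, which is the "double representation" exploited throughout spectral algorithms.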
Numerical differentiation, a crucial ingredient of any numerical method for
differential problems, can be executed within a spectral method either in Transform space or in Physical space. Let us confine ourselves to one-dimensional
Fourier or Chebyshev methods.
If

u(x) = Σ_{k=−N}^{N−1} û_k e^{ikx},

then differentiation in Transform space is a diagonal operation:

du/dx = Σ_{m=−N}^{N−1} û_m^{(1)} e^{imx},    û_m^{(1)} = im û_m.

In Physical space, at the nodes x_l = lπ/N, Fourier differentiation reads

(1.8)    (du/dx)(x_l) = Σ_{j=0}^{2N−1} d_lj u(x_j),    0 ≤ l ≤ 2N−1,

with

(1.9)    d_lj = (1/2)(−1)^(l+j) cot((x_l − x_j)/2) if l ≠ j,    d_lj = 0 if l = j.

For the Chebyshev expansion u(x) = Σ_{k=0}^{N} û_k T_k(x), the Transform-space derivative coefficients are

(1.10)    û_m^{(1)} = (2/c_m) Σ_{k=m+1, k+m odd}^{N} k û_k,    0 ≤ m ≤ N−1

(here c_0 = 2, c_m = 1 for m ≥ 1). In Physical space, at the Gauss-Lobatto nodes x_j = cos(jπ/N), we have

(du/dx)(x_l) = Σ_{j=0}^{N} d_lj u(x_j),    0 ≤ l ≤ N,

with

(1.11)    d_lj = (c̄_l/c̄_j)(−1)^(l+j)/(x_l − x_j),    l ≠ j,
          d_ll = −x_l/(2(1 − x_l²)),    1 ≤ l ≤ N−1,
          d_00 = (2N² + 1)/6,    d_NN = −(2N² + 1)/6

(here c̄_0 = c̄_N = 2, c̄_j = 1 for 1 ≤ j ≤ N−1).
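The Transform-space route (multiplying û_k by ik) can be checked numerically in a few lines; the test function and grid size below are arbitrary illustrative choices.

```python
import numpy as np

N = 32                                  # number of grid points
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)        # integer wavenumbers 0..N/2-1, -N/2..-1

u = np.exp(np.sin(x))                   # a smooth periodic test function
du_exact = np.cos(x) * np.exp(np.sin(x))

# differentiation in Transform space: u_hat_k -> i k u_hat_k (diagonal operation)
du = np.real(np.fft.ifft(1j * k * np.fft.fft(u)))

assert np.max(np.abs(du - du_exact)) < 1e-8   # spectral accuracy for smooth data
```

Note that the cost is two FFTs plus a diagonal scaling, whereas the Physical-space form (1.8) is a full matrix-times-vector product.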
The previous relations show that spectral differentiation, like a discrete transform, is again a global transformation (with the lucky exception of Fourier differentiation in Transform space). The global character of spectral methods is coherent with the global structure of the basis functions which are used in the expansion.
Globality is the first feature of spectral methods we have to cope with in
discussing vectorization and parallelization. If we represent the previous transforms in a matrix-times-vector form, they can be easily implemented on a vector
computer, and they take advantage of this architecture because the matrices are
either diagonal, or upper triangular, or full. When the transforms are realized
through the Fast Fourier Transform algorithm, one can use efficiently vectorized
FFTs (see, e.g., [14]).
Conversely, if we are concerned with parallelization, globality implies a greater communication demand among processors. This may not be a major problem
on coarse grain, large shared memory architectures, such as the now commercially available supercomputers (e.g., Cray X-MP, Cray 2, ETA-10, ...). We expect
difficulties on the future fine grain, local memory architectures, where information will be spread over the memories of tens or hundreds of processors.
In order to make our analysis more precise, let us focus on perhaps the most
significant application of spectral methods given so far, i.e., the numerical
simulation of a viscous, incompressible flow. Let us assume we want to discretize
the time-dependent Navier-Stokes equations in primitive variables
(1.12)    u_t − νΔu + ∇p + (u·∇)u = f    in Ω,
          div u = 0                      in Ω,
          u = g (or u periodic)          on ∂Ω × (0, T],

in a bounded domain Ω ⊂ R^d (d = 2 or 3).
So far, nearly all the methods which have been proposed in the literature
(see, e.g., Chapter 7 in [1] for a review) use a spectral method for the discretization in space, and a finite difference scheme to advance the solution in time.
Typically, the convective term is advanced by an explicit scheme (e.g., second-order
Adams-Bashforth, or fourth-order Runge-Kutta) for two reasons: the stability
limit is always larger than the accuracy limit required to preserve overall spectral
accuracy, and the nonlinear terms are easily handled by the pseudospectral technique (see below). Conversely, the viscous and pressure terms are advanced by an
implicit scheme (e.g., Crank-Nicolson), in order to avoid too strict stability limits.
Thus, at each time level, one has to:

i) evaluate the convective term (u·∇)u for one, or several, known velocity fields.
The computed terms appear on the right-hand side G of a Stokes-like problem

(1.13)    αu − νΔu + ∇p = G    in Ω,
          div u = 0            in Ω,
          u = 0                on ∂Ω,

where α = 1/Δt;

ii) solve the spectral discretization of problem (1.13).

In most cases, problem (1.13) is reduced to a sequence of Helmholtz problems. These, in turn, are solved by a direct method or an iterative one. In the
latter case, one has to evaluate residuals of spectral approximations of Helmholtz
problems.
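The explicit/implicit splitting just described can be sketched on a one-dimensional periodic model problem u_t + c u_x = ν u_xx, with the convective term advanced by second-order Adams-Bashforth and the viscous term by Crank-Nicolson, mode by mode in Fourier Transform space; the grid size, time step and coefficients are arbitrary illustrative choices.

```python
import numpy as np

# u_t + c u_x = nu u_xx, periodic on [0, 2*pi); exact: u = exp(-nu t) sin(x - c t)
N, c, nu, dt, steps = 64, 1.0, 0.1, 1e-3, 1000
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)

u_hat = np.fft.fft(np.sin(x))
conv = lambda uh: -c * 1j * k * uh     # explicit (convective) term, per Fourier mode
a = 0.5 * nu * k**2 * dt               # Crank-Nicolson factor for the viscous term

C_old = conv(u_hat)
u_hat = ((1 - a) * u_hat + dt * C_old) / (1 + a)    # first step: explicit Euler
for _ in range(steps - 1):                          # then Adams-Bashforth 2 / CN
    C_new = conv(u_hat)
    u_hat = ((1 - a) * u_hat + dt * (1.5 * C_new - 0.5 * C_old)) / (1 + a)
    C_old = C_new

u = np.real(np.fft.ifft(u_hat))
t = steps * dt
u_exact = np.exp(-nu * t) * np.sin(x - c * t)
assert np.max(np.abs(u - u_exact)) < 1e-4
```

Because the viscous term is treated implicitly, the time step is limited only by the (milder) convective stability constraint, as stated above.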
We conclude that the main steps in a spectral algorithm are:
A) calculation of differential operators on given functions;
B) solution of linear systems.
When the geometry of the physical domain is not Cartesian, one first has to
reduce the computational domain to a simple geometry. In this case, one has to
resort to one of the existing
C) domain decomposition techniques.
In the next sections, we will examine these three steps in some detail in view
of their implementation on a multiprocessor architecture.
Step A) amounts, typically, to the following task: given the representation of u in Transform space, compute the representation of (u·∇)u in the same space. We recall that by representation of a given function v in Transform space we mean the finite set of its
"Fourier" coefficients according to the expansion (1.2); this set will be denoted by
v̂. Similarly, by representation of v in Physical space we mean the set of its values
on a grid G_N in the physical domain, which uniquely determines v; this representation will be denoted by v.

Each component of (u·∇)u is a sum of terms of the form

(2.1)    v Dw,

where v and w are components of u and D denotes differentiation along one space
direction.
An "approximate" representation of (2.1) in Transform space can be computed by the so-called pseudospectral technique, which can be described as follows:

(2.2)    starting from v̂ and ŵ in Transform space, compute Dŵ by differentiation in Transform space; bring v̂ and Dŵ back to Physical space by inverse discrete transforms; form the pointwise product v Dw on the grid G_N; apply a direct discrete transform to obtain (v Dw)^;

(2.3)    starting from v and w in Physical space, transform w, differentiate in Transform space, antitransform to get Dw, form the pointwise product v Dw, and transform the product to obtain (v Dw)^.
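For the trigonometric system, the pseudospectral route can be sketched in a few lines; the functions chosen below make the exact product coefficients easy to check by hand, and no dealiasing is included in this sketch.

```python
import numpy as np

N = 16
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)

v, w = np.sin(x), np.cos(x)

w_hat = np.fft.fft(w)
Dw = np.real(np.fft.ifft(1j * k * w_hat))   # differentiate w in Transform space
prod = v * Dw                               # pointwise product in Physical space
prod_hat = np.fft.fft(prod) / N             # back to Transform space: (v Dw)^

# v Dw = -sin^2 x = -1/2 + (1/2) cos 2x, so the k = 0 coefficient is -1/2
# and the coefficient of e^{2ix} is 1/4
assert np.isclose(prod_hat[0], -0.5)
assert np.isclose(prod_hat[2], 0.25)
```

The pointwise product replaces the expensive convolution sum in Transform space; this is precisely what makes the pseudospectral technique "fast".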
In several space dimensions, both the basis functions (1.3) and the computational grids are tensor products of suitable orthogonal basis functions and grids on intervals of the real
line.

It follows that the elementary transformations (discrete transforms, differentiation, pointwise product, ...) which constitute a spectral method can be
obtained as a sequence (a cascade) of one-dimensional transformations of the
same nature. Each of these transformations (e.g., differentiation in the x direction) can be carried out in parallel over parallel rows or columns of the computational domain (either in Physical space or in Transform space).
Therefore, the simplest strategy of domain decomposition will consist of
assigning "slices" of the computational domain (i.e., groups of contiguous rows or
columns, in a two-dimensional geometry) to different processors. Once again we
stress that we consider slices both in Physical space (i.e., rows/columns of grid values) and in Transform space (i.e., rows/columns of "Fourier" coefficients). After
a transformation along one space direction has been completed, one has to transpose the computational lattice in order to carry out transformations along the
other directions. Transposition should not be a major problem on architectures
with large shared memory or wide-band buses.
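The slice-plus-transpose strategy can be mimicked in a few lines: one-dimensional transforms are applied to independent rows (each row could live on a different processor), the lattice is transposed, and the rows, formerly columns, are transformed in turn; the array size is an arbitrary choice.

```python
import numpy as np

A = np.random.default_rng(0).random((8, 16))   # a small 2D computational lattice

# step 1: 1D transforms along rows; the rows are independent, hence parallelizable
step1 = np.fft.fft(A, axis=1)

# step 2: transpose the lattice, then transform the (former) columns row by row
step2 = np.fft.fft(step1.T, axis=1).T

# the cascade of 1D transforms reproduces the full 2D transform
assert np.allclose(step2, np.fft.fft2(A))
```

On a real multiprocessor the transpose is the only step requiring interprocessor communication; everything else is local to a slice.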
Erlebacher, Bokhari and Hussaini [5] report preliminary experiences of coding a Fourier-Chebyshev method for compressible Navier-Stokes simulations on
a 20-processor Flex/32 computer at the NASA Langley Research Center. Since
the time marching scheme is fully explicit, almost all the work is spent in computing convective or diffusive terms by the spectral technique. Parallelization is
achieved by the strategy described above. The physical variables on the computational domain are stored in shared memory; slices of them are sent to the
processors, which write the results of their computation in shared memory. The
authors' conclusions are summarized in Table 1, where speed-ups (S_p) and efficiencies (E_p) are documented for different choices of the computational grid.
According to the authors, moving variables between shared and local memory
should not cause major overheads even on such a supercomputer as the ETA-10.
Indeed, quoting from [5], "a good algorithm [on the ETA-10] should perform at
least 5 floating point operations per word transferred one way from common
memory". This minimum work is certainly achieved within a spectral code:
think, for instance, of differentiation in Physical space via FFT.
Transposition of the computational lattice will eventually become prohibitive on fine grain, local memory architectures. In this case, small portions of the
computational domain will be permanently resident in local memories, and intercommunication among processors will be the major issue. In order to understand
the communication needs of a spectral method, let us observe that if L(u) is any
Table 1. Performance of the spectral code on the Flex/32 for different computational grids (from [5]): times T, speed-ups S_p and efficiencies E_p on P processors.

Grid       | N_tot×2^-14 | P  | T    | S_p   | E_p (%)
-----------|-------------|----|------|-------|--------
128×16×8   | 12.7        | 8  | 365  | 7.55  | 94.3
           |             | 4  | 705  | 3.91  | 97.6
           |             | 2  | 1386 | 1.99  | 99.4
           |             | 1  | 2757 | 1.00  | 100.0
8×64×32    | 9.0         | 8  | 269  | 7.52  | 94.0
           |             | 4  | 510  | 3.87  | 96.8
           |             | 2  | 1003 | 1.97  | 98.5
           |             | 1  | 1977 | 1.00  | 100.0
64×16×16   | 8.0         | 16 | 138  | 13.01 | 81.3
           |             | 8  | 242  | 7.41  | 92.6
           |             | 4  | 466  | 3.84  | 95.9
           |             | 2  | 916  | 1.95  | 97.7
           |             | 1  | 1786 | 1.00  | 100.0
32×32×16   | 6.7         | 16 | 118  | 12.82 | 80.1
           |             | 8  | 202  | 7.45  | 93.2
           |             | 4  | 388  | 3.89  | 97.3
           |             | 2  | 759  | 1.99  | 99.3
           |             | 1  | 1511 | 1.00  | 100.0
32×16×16   | 2.7         | 16 | 62   | 9.98  | 62.4
           |             | 8  | 92   | 6.72  | 84.0
           |             | 4  | 168  | 3.67  | 91.7
           |             | 2  | 321  | 1.92  | 96.0
           |             | 1  | 616  | 1.00  | 100.0
16×16×16   | 1.0         | 16 | 37   | 6.54  | 40.9
           |             | 8  | 43   | 5.60  | 70.0
           |             | 4  | 71   | 3.44  | 86.0
           |             | 2  | 127  | 1.92  | 95.8
           |             | 1  | 244  | 1.00  | 100.0
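Speed-up and efficiency are presumably the usual ratios S_p = T_1/T_p and E_p = S_p/p; a quick check against the 128×16×8 timings in Table 1 is consistent with this reading.

```python
def speedup(t1, tp):
    # ratio of single-processor time to p-processor time
    return t1 / tp

def efficiency(t1, tp, p):
    # fraction of the ideal speed-up actually achieved, in percent
    return 100.0 * t1 / (p * tp)

# timings for the 128x16x8 grid in Table 1 (P = 8, 4, 2 against P = 1)
assert round(speedup(2757, 365), 2) == 7.55
assert round(speedup(2757, 705), 2) == 3.91
assert round(speedup(2757, 1386), 2) == 1.99
```

Note how the efficiency in the table degrades on the smallest grid (16×16×16) as the processor count grows: the slices become too thin to amortize the parallel overhead.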
differential operator (of any order, with variable coefficients or nonlinear terms,
etc.), then one can compute a spectral approximation to L(u) at a point P of the
computational domain using only information at the points lying on the rows and
columns meeting at P (see Figure 1.a). This means that spectral methods,
although global methods in one space dimension, exhibit a precise sparse structure in multidimensional problems.
[Figure 1: stencil of a spectral approximation at a point P of the computational domain; the approximation of L(u) at P_ij involves only the grid points P_{i,j0} and P_{i0,j} on the row and column meeting at P.]
In the Kleiser-Schumann approach, the pressure and velocity computations are decoupled: one first solves the pressure problem

(3.1)    Δp = div G    in Ω,    p = λ    on ∂Ω,

and then the velocity problem

(3.2)    αu − νΔu = G − ∇p    in Ω,    u = 0    on ∂Ω,

where the boundary values λ of the pressure are chosen, by an influence matrix technique, in such a way that div u = 0 in Ω.
Each one-dimensional Helmholtz problem w'' − σw = f in (−1, 1) is supplemented with the boundary conditions

(3.3)    w(1) = a,    w(−1) = b.

In the Chebyshev tau approximation, the boundary conditions become

Σ_{m=0}^{N} ŵ_m = a,    Σ_{m=0}^{N} (−1)^m ŵ_m = b,

while the differential equation is imposed on the coefficients of index 0 ≤ m ≤ N−2:

(3.4)    ŵ_m^{(2)} − σ ŵ_m = f̂_m,    with    ŵ_m^{(2)} = (1/c_m) Σ_{k=m+2, k+m even}^{N} k(k² − m²) ŵ_k

(here c_0 = 2, c_m = 1 for m ≥ 1).
Several levels of parallelism can be exploited in the Kleiser-Schumann algorithm. The most obvious one consists of splitting the Fourier modes among the
processors. There is no communication needed to solve (3.2), and a perfect balance of work among processors can be easily achieved. This strategy has been
followed by Leca and Sacchi-Landriani [11]. The next level of parallelism originates from the observation that in each tau system (3.4), the odd Chebyshev modes are uncoupled from the even ones. Hence, the task of solving (3.4) can be split
over two processors. Finally, each of the resulting linear systems can be written in
tridiagonal form (see e.g., [1], Chapter 5 for more details).
The last property also holds for tau approximations of the Poisson equation
in several space dimensions, provided a preliminary diagonalization has been
carried out (see, again, [1], Chapter 5). Thus, the communication demand when
solving linear systems originated by tau approximation is essentially related to
the
(3.5) solution of tridiagonal systems.
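A standard sequential method for (3.5) is the Thomas algorithm, i.e., forward elimination followed by back substitution; a minimal sketch, checked on an arbitrary diagonally dominant test system, is given below.

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-diagonal (n-1 entries),
    b = diagonal (n), c = super-diagonal (n-1), d = right-hand side (n)."""
    n = len(b)
    cp, dp = np.zeros(n - 1), np.zeros(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# check against a known solution
n = 6
a = np.ones(n - 1); b = 4.0 * np.ones(n); c = np.ones(n - 1)
A = np.diag(b) + np.diag(a, -1) + np.diag(c, 1)
x_true = np.arange(1.0, n + 1)
assert np.allclose(thomas(a, b, c, A @ x_true), x_true)
```

Both sweeps are first-order recurrences, hence inherently sequential; on parallel machines one resorts to variants such as cyclic reduction (see, e.g., [8]).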
In the collocation approach, the Helmholtz problems are discretized at the interior points of the Chebyshev grid G_N, with the boundary condition

u = 0    on ∂Ω,

yielding an algebraic system

(3.8)    L_sp u = f.

Since L_sp is ill-conditioned, the system is conveniently solved by a preconditioned Richardson iteration

(3.9)    u^(n+1) = u^n + ω A^(-1) (f − L_sp u^n),    n = 0, 1, 2, …,

where A is a preconditioning matrix chosen so that
λ_max(A^(-1)L_sp)/λ_min(A^(-1)L_sp) ≈ 1, as opposed to λ_max(L_sp)/λ_min(L_sp) = O(N^4). (Here λ_max, resp. λ_min,
denote the largest, resp. the smallest, eigenvalue of the indicated matrix.) This is
achieved, for instance, if A is the matrix of a low order finite difference or finite
element method for the Laplace operator on the Chebyshev grid GN. Multilinear
finite elements (Deville and Mund [19]) guarantee exceedingly good preconditioning properties.
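A minimal sketch of the preconditioned Richardson iteration for the one-dimensional model problem −u'' = f on (−1, 1), u(±1) = 0: the spectral operator is the Chebyshev collocation second derivative (built here with the standard negative-sum trick for the diagonal), and the preconditioner is the three-point finite difference operator on the same nonuniform grid; N, ω and the test function are illustrative choices, not taken from the text.

```python
import numpy as np

N = 16
j = np.arange(N + 1)
x = np.cos(np.pi * j / N)                    # Chebyshev Gauss-Lobatto nodes (decreasing)

# Chebyshev collocation differentiation matrix (diagonal via negative-sum trick)
cbar = np.ones(N + 1); cbar[0] = cbar[N] = 2.0
C = cbar * (-1.0) ** j
D = (C[:, None] / C[None, :]) / (x[:, None] - x[None, :] + np.eye(N + 1))
D -= np.diag(D.sum(axis=1))

Lsp = -(D @ D)[1:N, 1:N]                     # spectral -d2/dx2 with u(+-1) = 0

# finite difference preconditioner: 3-point -d2/dx2 on the same nonuniform grid
A = np.zeros((N - 1, N - 1))
for i in range(1, N):
    hm, hp = x[i - 1] - x[i], x[i] - x[i + 1]
    A[i - 1, i - 1] = 2.0 / (hm * hp)
    if i > 1:
        A[i - 1, i - 2] = -2.0 / (hm * (hm + hp))
    if i < N - 1:
        A[i - 1, i] = -2.0 / (hp * (hm + hp))

f = np.pi ** 2 * np.sin(np.pi * x[1:N])      # so that u(x) = sin(pi x) exactly
u = np.zeros(N - 1)
omega = 2.0 / (1.0 + np.pi ** 2 / 4.0)       # eigenvalues of A^{-1} Lsp lie in [1, pi^2/4]
for _ in range(200):
    u += omega * np.linalg.solve(A, f - Lsp @ u)

assert np.max(np.abs(u - np.sin(np.pi * x[1:N]))) < 1e-6
```

The iteration converges to the spectral solution while only finite difference systems are solved at each step, which is the point of the preconditioning strategy discussed above.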
The direct solution of the finite difference or finite element system at each
Richardson iteration may be prohibitive for large problems. An approximate
solution is usually enough for preconditioning purposes. Most of the algorithms
proposed in the literature (see, e.g., [1], Chapter 5 for a review) are global sequential algorithms (say, an incomplete LU factorization).
Recently, Pietra and the author [2] have proposed solving approximately the
trilinear finite element system by a small number of ADI iterations. They use an
efficient ADI scheme for tensor product finite elements in dimension three introduced by Douglas [4]. The method can be easily extended to handle general
variable coefficients. As usual, efficiency in an ADI scheme is gained by cycling
the parameters. The ADI parameters can be automatically chosen in such a way
that the cycle length l_c(ε) needed to reduce the error by a factor ε satisfies l_c(ε) = O(log N · log(1/ε)).
Johnsson, Saad and Schultz [9] discuss highly efficient implementations of ADI methods on several parallel architectures.
For instance, Chebyshev differentiation in Transform space (see (1.10)) amounts to the backward recurrence

c_m û_m^{(1)} = û_{m+2}^{(1)} + 2(m+1) û_{m+1},    m = N−1, …, 0;    û_{N+1}^{(1)} = û_N^{(1)} = 0,

which is intrinsically sequential; alternatively, it can be recast as a triangular
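The recurrence is easily checked on a low-degree example; here u = x³ = (3T₁ + T₃)/4, whose derivative is 3x² = (3/2)(T₀ + T₂).

```python
import numpy as np

def cheb_deriv_coeffs(a):
    """Given Chebyshev coefficients a_0..a_N of u, return the coefficients
    of du/dx via the backward recurrence c_m b_m = b_{m+2} + 2(m+1) a_{m+1}."""
    a = np.asarray(a, dtype=float)
    N = len(a) - 1
    b = np.zeros(N + 2)               # b_{N+1} = b_N = 0
    for m in range(N - 1, -1, -1):
        b[m] = b[m + 2] + 2 * (m + 1) * a[m + 1]
    b[0] /= 2.0                       # c_0 = 2, c_m = 1 otherwise
    return b[:N + 1]

# u = x^3 = (3 T_1 + T_3)/4  ->  u' = 3 x^2 = (3/2) T_0 + (3/2) T_2
assert np.allclose(cheb_deriv_coeffs([0.0, 0.75, 0.0, 0.25]), [1.5, 0.0, 1.5, 0.0])
```

Note that each b_m depends on b_{m+2}: the loop cannot be parallelized as written, which is why a parallel implementation prefers the (triangular) matrix form of (1.10).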
matrix-vector multiplication. In this case, the Nearest Neighbor Network provides the optimal communication scheme.
It is clear from the previous discussion that several intercommunication
paths should co-exist in order to allow an optimal implementation of spectral
algorithms on parallel architectures. The union of the Perfect Shuffle Network
with the Nearest Neighbor Network (PSNN) is an example of a multi-path
scheme, quite appropriate for spectral methods. The PSNN was first proposed by
Grosch [7] for an efficient parallel implementation of fast Poisson solvers.
Although spectral accuracy can still be proved for the multi-domain solution, its actual accuracy may be severely degraded if compared to that of the single-domain solution defined by the same
total number of degrees of freedom.

Let us illustrate the situation with a model problem, taken from [2]. Consider the Dirichlet problem for the Poisson equation in the square (−1, 1)², whose
exact solution is u(x, y) = cos 2πx cos 2πy. We divide the domain into four equal
squares, on each of which we set a Chebyshev collocation method, and we enforce C¹ continuity at the interfaces. The results are compared with those produced by a Chebyshev collocation method on the original square, which uses the
same total number of unknowns. The relative L^∞ errors are reported in Table 2.
Table 2. Relative maximum-norm errors for a Chebyshev collocation method, u(x, y) = cos 2πx cos 2πy (from [2]).

4 DOM, 4×4    | .62 E0
1 DOM, 8×8    | .35 E-1
4 DOM, 8×8    | .12 E-2
1 DOM, 16×16  | .11 E-6
4 DOM, 16×16  | .49 E-10
1 DOM, 32×32  | .38 E-14
Note the loss of four orders of magnitude in replacing the single domain with
16×16 nodes by the four domains, each with an 8×8 grid. Of course, if we have
four processors and we can reach the theoretical speed-up of four in the domain
decomposition technique, we can run four 16×16 subdomains in parallel at the
cost of a single 16×16 domain on a single processor, and gain four orders of magnitude in accuracy. However, if we seek parallelism through the splitting techniques described in Sections 2 and 3, and we maintain a speed-up of four, we can
run for the same cost a 32×32 grid on the single domain, yielding a superior
accuracy, again by a factor of 10^4. Thus, it appears that it is better to keep the
spectral expansion as global as possible, and look for parallelism at the level of the
solution of the algebraic system originated from the discretization method.
We conclude by going back to domain decomposition for spectral methods: a spectral
scheme is set on each "simple" subdomain, supplemented by suitable continuity conditions at the interfaces. Deville and Mund [3] indicated that this can be done by
an iterative procedure such as (3.9), where A^(-1) is a "global preconditioner", i.e.,
an approximation of the differential problem over the whole domain. If the preconditioner is of finite element type, the interface conditions can be incorporated
in the variational formulation, as shown in [2]. Thus, at each iteration, one has to
compute the spectral residuals separately on each subdomain. This can be done
in parallel. Next, one has to (approximately) solve a finite element system. Again,
this can be carried out in parallel form using one of the existing domain decomposition techniques for finite element methods. Note that in principle the domain decomposition used at this stage may be totally independent of the one
introduced for setting up the spectral approximation.
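The per-subdomain residual computation is embarrassingly parallel; a schematic sketch, with random dense matrices standing in for the subdomain spectral operators and threads standing in for processors:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def residual(sub):
    """Spectral residual r = f - L u on one subdomain (no communication needed)."""
    L, u, f = sub
    return f - L @ u

rng = np.random.default_rng(0)
# four subdomains, each with its own operator L, current iterate u and data f
subdomains = [(rng.random((10, 10)), rng.random(10), rng.random(10))
              for _ in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    residuals = list(pool.map(residual, subdomains))

L0, u0, f0 = subdomains[0]
assert np.allclose(residuals[0], f0 - L0 @ u0)
```

Communication is confined to the subsequent preconditioning step, which couples the subdomains through the interface conditions.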
REFERENCES
[1] C. CANUTO, M. Y. HUSSAINI, A. QUARTERONI, T. A. ZANG, Spectral Methods in Fluid Dynamics, Springer-Verlag, New York, 1988.
[2] C. CANUTO, P. PIETRA, Boundary and interface conditions within a finite element preconditioner for spectral methods, I.A.N.-C.N.R. Report n. 555, Pavia, 1987.
[3] M. DEVILLE, E. MUND, Chebyshev pseudospectral solution of second-order elliptic equations with finite element preconditioning, J. Comput. Phys., 60 (1985), 517-533.
[4] J. DOUGLAS, JR., Alternating direction methods for three space variables, Numer. Math., 4 (1962), 41-63.
[5] G. ERLEBACHER, S. BOKHARI, M. Y. HUSSAINI, Three dimensional compressible transition on a 20-processor Flex/32 multicomputer, preprint, NASA Langley Research Center, 1987.
[6] D. GOTTLIEB, S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications, SIAM-CBMS, Philadelphia, 1977.
[7] C. E. GROSCH, Performance analysis of Poisson solvers on array computers, Infotech State of the Art Report: Supercomputers (C. Jesshope and R. Hockney, eds.), Infotech, Maidenhead, 1979, 147-181.
[8] R. HOCKNEY, C. JESSHOPE, Parallel Computers: Architecture, Programming and Algorithms, Adam Hilger, Bristol, 1981.
[9] S. L. JOHNSSON, Y. SAAD, M. H. SCHULTZ, Alternating direction methods on multiprocessors, Report YALEU/DCS/RR-382, October 1985.
[10] L. KLEISER, U. SCHUMANN, Treatment of incompressibility and boundary conditions in 3-D numerical spectral simulations of plane channel flows, Proc. 3rd GAMM