
Simulation Practice and Theory 3 (1995) 223-238

Massively parallel semi-Lagrangian advection


S. Thomas *, J. Côté
Recherche en prévision numérique, Environnement Canada, 2121 route Transcanadienne,
Dorval, PQ, Canada H9P 1J3

Received 7 February 1995; revised 14 August 1995

Abstract

The passive advection problem in computational fluid dynamics is examined. Eulerian methods are restricted by the Courant-Friedrichs-Lewy (CFL) condition and the semi-
Lagrangian method is introduced as an alternative approach for taking longer time steps.
Recent progress in the development of parallel algorithms for semi-Lagrangian advection is
reviewed and implementation strategies for distributed MIMD computation are discussed.
Predicted and actual parallel performance results on the Intel iPSC/860 and the Cray T3D
MPP are presented and it is established that the problem is scalable.

Keywords: Parallel advection; Semi-Lagrangian method; Performance analysis

1. Introduction

To meet the computational demands of high resolution numerical models, it appears that massively parallel processors (MPP) will have an important role to
play in atmospheric research. Indeed, the next generation of high-performance soft-
ware for weather prediction may be expected to rely on this technology. It is predicted
that massively parallel implementations of atmospheric models can meet or surpass
the performance levels achievable on current generation vector supercomputers.
The current focus of our work has been to develop a parallel implementation of a
prototype shallow-water model on the sphere. The shallow-water equations have
been used as a vehicle for testing promising numerical methods for many years by
the atmospheric modeling community. These equations contain the essential mecha-
nisms of more complete models, for example, both the slowly propagating Rossby
motion and the fast-moving gravitational oscillations are present. Our starting point
is to develop a parallel algorithm for the advection problem based on a distributed

* Corresponding author.

0928-4869/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved


SSDI 0928-4869(95)00033-X

MIMD model of computation. The algorithm is a straightforward generalisation of the sequential algorithm and employs data distribution of the underlying computa-
tional grid across a processor mesh. Fixed overlap or “ghost” boundary regions are
maintained. This approach is not unique and other strategies are currently under
investigation. A portable program based on the Parallel Virtual Machine (PVM)
message-passing library illustrates that it is possible to achieve a scaled speedup on
existing MPP machines.

2. Semi-Lagrangian advection

In an Eulerian advection scheme an observer remains in a fixed reference frame as a fluid flows past. Such schemes work well on Cartesian grids, but the size of
time steps may be limited due to stability considerations. In a Lagrangian scheme
an observer moves with the flow as a fluid particle and in the case of geostrophic
flow this allows for a larger time step to be taken. A disadvantage of this scheme is
that a regularly spaced set of particles can evolve into a highly-irregularly spaced
set at later times and thus certain features of the flow may not be well represented.
To maintain the regular resolution of the former scheme and the enhanced stability
of the latter, semi-Lagrangian integration schemes have been developed [14,16]. In
this case a different set of particles is selected at each time step and these particles
are chosen such that they arrive at the nodes of a regular Cartesian grid at the end
of a time step. It follows that this approach is not limited by the CFL condition
associated with finite difference Eulerian schemes and larger time steps can be taken
with no loss of accuracy. In atmospheric models, when coupled with a semi-implicit
treatment of the terms responsible for gravitational oscillations, it is possible to
obtain a significant increase over Eulerian schemes in the maximum stable time step
[14]. It is for this reason that the method has been adopted by many numerical
weather prediction centres worldwide including the European Centre for Medium-Range Weather Forecasts (ECMWF).

2.1. Passive advection

The basic ideas underlying semi-Lagrangian advection can be illustrated by applying the technique to the equation for passive advection in 1-D,

$$ \frac{dF}{dt} = \frac{\partial F}{\partial t} + U\,\frac{\partial F}{\partial x} = 0, \qquad (1) $$

where F is a scalar (e.g. temperature) to be advected and


$$ \frac{dx}{dt} = U(x, t) \qquad (2) $$

is the velocity field. In a Lagrangian frame of reference dF/dt = 0 and the function
F is constant along the path (trajectory, characteristic) of a fluid particle in the x-t
plane. Consider the arrival of a fluid particle at a grid point x_m at time t_n. It is assumed that F(x, t) is known at all grid points x at time t_n − Δt and values of F(x_m, t_n) are sought. Integrate (1) along a particle path and approximate dF/dt by

$$ \frac{F(x_m, t_n) - F(x_m - \alpha_m, t_n - \Delta t)}{\Delta t} = 0, \qquad (3) $$

where α_m is the distance the particle arriving at x_m travels in time Δt. If the displacement α_m is known, then according to (3) the value of F at the arrival point x_m at time t_n is simply the value of F at the upstream departure point x_m − α_m at time t_n − Δt.
In essence, therefore, the semi-Lagrangian method is a problem in numerical interpolation since the point x_m − α_m will generally lie between grid points. To compute F(x_m − α_m, t_n − Δt), the displacement α_m must be determined and then F is interpolated using values at neighbouring grid points. The displacement α_m can be computed by using the equation for the velocity (2), as was suggested by Robert [14],

$$ \alpha_m = \Delta t\, U(x_m - \alpha_m/2,\; t_n - \Delta t/2). \qquad (4) $$
This nonlinear equation can be solved, for example, with a fixed-point iteration

$$ \alpha_m^{[k+1]} = \Delta t\, U(x_m - \alpha_m^{[k]}/2,\; t_n - \Delta t/2), \qquad (5) $$

using the value of α_m at the previous time step as an initial guess for α_m^{[0]}. In addition, values of U between grid points must be obtained by linear interpolation. Semi-
Lagrangian schemes are not restricted to the two-time-level scheme represented by
(3) and (4). For example, a three-time-level scheme is also possible. A three-time-
level scheme would be less efficient since it requires time steps half the size of the
two-time-level scheme in order to achieve the same order of time truncation error.
In either case, it is important to maintain O(Δt²) accuracy comparable to a centred
Eulerian finite difference scheme. In practice, polynomial interpolation is used to
compute F and U between grid points. It is well known, however, that semi-
Lagrangian schemes are diffusive. Several authors have proposed techniques to
preserve mass, monotonicity, or both but these tend to increase diffusion. Bermejo
[1] combines cubic and linear interpolation to better preserve monotonicity. Quasi-conservative versions of the semi-Lagrangian method were proposed by Priestley [11] and Gravel and Staniforth [7]. Shape preserving methods based on rational polynomial interpolation are discussed in Williamson and Rasch [18].
Cubic interpolation gives fourth order spatial truncation errors and exhibits an
acceptable level of dissipation or damping. Linear interpolation can result in excessive
dissipation, therefore, the most common approach is cubic Lagrange or cubic spline
interpolation of F [16]. It has also been found in practice that linear interpolation for U is sufficient for computing the displacement α_m when using cubic interpolation
for F. Furthermore, it has been observed that two iterations of (5) are usually
sufficient for convergence. In one dimension, a cubic Lagrange polynomial can be
constructed from the function values at four neighbouring grid points

$$ x_1 < x_2 < x_3 < x_4, \qquad x_2 \le x_d \le x_3, $$

where x_d = x_m − α_m is a departure point. Alternatively, a compact form for the cubic


P3(x) can be based on the Newton form of the interpolating polynomial by using
function values and second order divided differences at the two nearest grid points,
resulting in better data locality. For non-uniform spacing the polynomial is given
by

$$ P_3(x) = F_2(1 - \hat{x}) + F_3\,\hat{x} + C_2\,\frac{h_2^2}{h_1 + h_2 + h_3}\,\hat{x}(\hat{x} - 1)\,[h_3 + h_2(1 - \hat{x})] + C_3\,\frac{h_2^2}{h_1 + h_2 + h_3}\,\hat{x}(\hat{x} - 1)\,(h_1 + h_2\hat{x}), \qquad (6) $$

where

$$ \hat{x} = \frac{x - x_2}{h_2}, \qquad h_i = \Delta x_i, \qquad F_i = F(x_i), \qquad C_i = \left[\frac{\Delta F_i}{\Delta x_i} - \frac{\Delta F_{i-1}}{\Delta x_{i-1}}\right] \Big/ (x_{i+1} - x_{i-1}) $$

and

$$ \Delta x_i = x_{i+1} - x_i, \qquad \Delta F_i = F_{i+1} - F_i. $$
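To make the preceding steps concrete, the following is a minimal sketch of one semi-Lagrangian step for 1-D passive advection on a uniform periodic grid: two fixed-point iterations of Eq. (5) with linear interpolation of the wind, followed by cubic Lagrange interpolation of F at the departure point. The code and its names are ours, written in modern Fortran for illustration; it is not the authors' implementation.

! Sketch only: one semi-Lagrangian step for 1-D passive advection on the
! uniform periodic grid x(m) = (m-1)*hx, m = 1..n.
subroutine sl_step_1d(n, hx, dt, u, alpha, f, fnew)
   implicit none
   integer, intent(in)    :: n
   real(8), intent(in)    :: hx, dt, u(n), f(n)
   real(8), intent(inout) :: alpha(n)     ! displacements; previous step supplies the first guess
   real(8), intent(out)   :: fnew(n)
   integer :: m, k, ix, i1, i2, i3, i4
   real(8) :: xm, xd, s

   do m = 1, n
      xm = (m - 1) * hx
      do k = 1, 2                          ! two fixed-point iterations of Eq. (5)
         alpha(m) = dt * plin(xm - 0.5d0 * alpha(m))
      end do
      xd = xm - alpha(m)                   ! upstream departure point, Eq. (3)
      ix = floor(xd / hx) + 1              ! grid point at or to the left of xd
      s  = (xd - (ix - 1) * hx) / hx       ! local coordinate, 0 <= s < 1
      i1 = modulo(ix - 2, n) + 1           ! periodic wrap of the 4-point stencil
      i2 = modulo(ix - 1, n) + 1
      i3 = modulo(ix,     n) + 1
      i4 = modulo(ix + 1, n) + 1
      fnew(m) = -s*(s - 1d0)*(s - 2d0)/6d0 * f(i1)        &
              + (s + 1d0)*(s - 1d0)*(s - 2d0)/2d0 * f(i2) &
              - (s + 1d0)*s*(s - 2d0)/2d0 * f(i3)         &
              + (s + 1d0)*s*(s - 1d0)/6d0 * f(i4)
   end do

contains

   real(8) function plin(xp)               ! periodic linear interpolation of the wind U
      real(8), intent(in) :: xp
      integer :: j
      real(8) :: t
      j = floor(xp / hx) + 1
      t = (xp - (j - 1) * hx) / hx
      plin = (1d0 - t) * u(modulo(j - 1, n) + 1) + t * u(modulo(j, n) + 1)
   end function plin
end subroutine sl_step_1d

For the constant-wind test of Section 3.4 the inner iteration can be dropped, since the displacements do not change from one time step to the next.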

In higher dimensions such an approach has several advantages for computer architec-
tures employing a high-speed cache memory. For example, a bicubic Lagrange
polynomial requires function values at 16 neighbouring grid points, whereas function
values and divided differences at the four nearest grid points are needed in the above
formulation. In fact, precomputing the differences is advantageous since the formulae
for bicubic splines turn out to be quite similar. Traditionally, splines have been the
method of choice for semi-Lagrangian advection schemes, but require the solution
of tridiagonal systems to ensure continuity of derivatives at the grid points and
polynomial interpolation may be more efficient. A detailed comparison of Eulerian
and semi-Lagrangian methods applied to passive advection can be found in [9].
For atmospheric modeling, special care must be taken in spherical coordinate systems
to ensure the numerical stability of semi-Lagrangian methods. The main difficulties
are associated with the metric terms in spherical geometry and estimating the
departure points, see [2,12,18,10]. Recent progress has also been made in understanding the response to stationary forcing [13].

2.2. Multidimensional problems and forced advection

The problem of forced advection in two space variables is characterized in [16] by the generic equation

$$ \frac{dF}{dt} + G(x, t) = R(x, t), \qquad \frac{dF}{dt} = \frac{\partial F}{\partial t} + (u\cdot\nabla)F, \qquad (7) $$

where the velocity vector u = (u, v) is

$$ \frac{dx}{dt} = u(x, t). \qquad (8) $$

A semi-implicit, two-time-level semi-Lagrangian time discretisation of Eqs. (7) and (8) is given by

$$ \frac{F^{+} - F^{0}}{\Delta t} + \tfrac{1}{2}\,[G^{+} + G^{0}] = \tfrac{1}{2}\,[R^{+} + R^{0}], $$

$$ \alpha = \Delta t\, u(x - \alpha/2,\; t - \Delta t/2), $$

where the superscripts + and 0 represent evaluation at the arrival point (x, t) and the departure point (x − α, t − Δt) respectively. In practice, the terms at different
time levels are grouped together and interpolation is performed on the combined
right-hand side terms as indicated below.

$$ F^{+} + \tfrac{\Delta t}{2}\,[G - R]^{+} = F^{0} - \tfrac{\Delta t}{2}\,[G - R]^{0}. $$

To illustrate the practical application of this scheme, consider the shallow-water equations representing the inviscid flow of a thin layer of fluid in two dimensions,

$$ \frac{\partial u}{\partial t} + (u\cdot\nabla)u + f\,k \times u + \nabla\phi = 0, \qquad (9) $$

$$ \frac{\partial \phi}{\partial t} + (u\cdot\nabla)\phi + \phi\,\nabla\cdot u = 0, \qquad (10) $$
where 4 = 4’ + 4* is the geopotential height. For a semi-Lagrangian treatment of


advection the equations are written in the advective form

$ - fv + 4, = 0, (11)

$+fu+d,=O, (12)

#*zln 2 +d*D=O, (13)


0
where D = u_x + v_y is the divergence. The functions G and R above are now identified as the terms responsible for gravitational oscillations and the slow-moving Rossby modes in Eqs. (9) and (10). The time-discretized equations corresponding to a two-time-level scheme, Eqs. (14)-(16), follow by applying this discretisation to each of Eqs. (11)-(13), with the combined right-hand-side terms denoted R_u, R_v and R_φ and with [·]⁰ implying upstream interpolation in space. Interpolation of R_u, R_v and R_φ


at the departure point (x − α, t − Δt) relies on the O(Δt²) iterative scheme

$$ \alpha^{[k+1]} = \Delta t\, u(x - \alpha^{[k]}/2,\; t - \Delta t/2). \qquad (17) $$
Self-advection of momentum in Eqs. (14)-(16) requires extrapolation in time [16]. Several schemes are possible, but a simple Adams-Bashforth method is sufficiently accurate to obtain midpoint values of the wind at grid points

$$ u(x, t - \Delta t/2) = \tfrac{3}{2}\, u(x, t - \Delta t) - \tfrac{1}{2}\, u(x, t - 2\Delta t). \qquad (18) $$


The wind u at the upstream midpoint (x − α/2, t − Δt/2) is found by linear interpola-
tion between grid points. Upstream values of R,, R, and R, are then computed with
cubic Lagrange interpolation. At this point there remains an implicit system of
equations to be solved. In a semi-implicit scheme, algebraic elimination of the
momentum equations leads to a nonlinear Helmholtz problem for φ.
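As an illustration of Eqs. (17) and (18), the displacement components on a uniform, doubly periodic grid could be computed as sketched below. This is our own code and notation, not the authors' implementation; the grid layout and the helper bilin for bilinear interpolation are assumptions made only for the example.

! Sketch: upstream displacements via Eqs. (17)-(18) on a uniform, doubly
! periodic nx x ny grid with points ((i-1)*hx, (j-1)*hy).
subroutine displacements_2d(nx, ny, hx, hy, dt, u1, v1, u2, v2, ax, ay)
   implicit none
   integer, intent(in)    :: nx, ny
   real(8), intent(in)    :: hx, hy, dt
   real(8), intent(in)    :: u1(nx,ny), v1(nx,ny)   ! winds at t - dt
   real(8), intent(in)    :: u2(nx,ny), v2(nx,ny)   ! winds at t - 2*dt
   real(8), intent(inout) :: ax(nx,ny), ay(nx,ny)   ! displacements (first guess on input)
   real(8) :: um(nx,ny), vm(nx,ny), xm, ym, xd, yd
   integer :: i, j, k

   ! Eq. (18): extrapolate the winds at grid points to the time midpoint t - dt/2
   um = 1.5d0 * u1 - 0.5d0 * u2
   vm = 1.5d0 * v1 - 0.5d0 * v2

   do j = 1, ny
      do i = 1, nx
         xm = (i - 1) * hx
         ym = (j - 1) * hy
         do k = 1, 2                       ! two iterations of Eq. (17)
            xd = xm - 0.5d0 * ax(i,j)      ! trajectory midpoint
            yd = ym - 0.5d0 * ay(i,j)
            ax(i,j) = dt * bilin(um, xd, yd)
            ay(i,j) = dt * bilin(vm, xd, yd)
         end do
      end do
   end do

contains

   real(8) function bilin(g, x, y)         ! periodic bilinear interpolation
      real(8), intent(in) :: g(nx,ny), x, y
      integer :: i0, j0, i1, j1
      real(8) :: s, t
      i0 = floor(x / hx); s = x / hx - i0
      j0 = floor(y / hy); t = y / hy - j0
      i1 = modulo(i0 + 1, nx) + 1; j1 = modulo(j0 + 1, ny) + 1
      i0 = modulo(i0, nx) + 1;     j0 = modulo(j0, ny) + 1
      bilin = (1d0-s)*(1d0-t)*g(i0,j0) + s*(1d0-t)*g(i1,j0) &
            + (1d0-s)*t*g(i0,j1)      + s*t*g(i1,j1)
   end function bilin
end subroutine displacements_2d

The upstream values of R_u, R_v and R_φ are then obtained with cubic interpolation at (x − α, t − Δt), as described above.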

2.3. Complexity analysis

It is straightforward to determine the number of floating point operations required by the semi-Lagrangian algorithm in 2-D. Bicubic interpolation of a scalar field F employs a generalisation of the polynomial P_3(x) in (6) to P_3(x, y) and this requires divided differences C_ij, denoted F_xx, F_yy and F_xxyy. These values along with F are required at each grid point surrounding an upstream departure point (x − α, t − Δt).
Initially, let us assume that the wind u(x, t) is varying. Divided differences are
computed by tridiagonal matrix multiplication followed by a division, requiring
3 x (5 + 1) = 18 flops. A faster approach uses first-order differences followed by
divisions (two subtractions and two divisions), for 3 x 4 = 12 flops. The coefficients
in P_3(x, y) associated with the local coordinate system (x̂, ŷ) are computed in 24 flops
and then construction of the interpolating polynomial requires an additional 35 flops.
A trajectory computation (17) based on linear interpolation of the wind for two
iterations results in 2 x 36 = 72 flops. Finally, to obtain the origin of the local
coordinate system surrounding the departure point requires 8 flops. This computation
is particularly simple for uniform grids and a Fortran code fragment for the 1-D
case is given below

xd = x(i) - alpha(i)              ! departure point for the arrival point x(i)
ix = int((xd - x(1))/hx) + 1      ! index of the grid point at or to the left of xd
The total flop count per grid point for one time step of semi-Lagrangian advection is 157 flops. If it is known that the wind is constant in time, then the operation count per time step can be reduced since the trajectories, and hence the departure points, need to be computed only once. In this case 53 flops are needed to compute divided differences and the bicubic polynomial.
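The quoted totals are consistent with the itemized counts when the 18-flop form of the divided differences is the one tallied; spelling out the arithmetic (ours, not given explicitly in the text):

$$ 18 + 24 + 35 + 72 + 8 = 157, \qquad 18 + 35 = 53, $$

where the second sum suggests that, for a constant wind, the 24 flops spent on the local-coordinate coefficients are also saved once the departure points, and hence the local coordinates, are fixed.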

3. Parallel implementation

The Courant number C = |U|Δt/Δx has important implications for distributed memory implementations of both Eulerian and semi-Lagrangian advection schemes
which employ data distribution of the computational grid. In an Eulerian leapfrog
advection scheme, values of the advected field F are taken from a stencil containing
the four nearest grid points. Thus, a one grid point wide overlap region or guard
ring is maintained on each processor node and an exchange of values at these grid
points is required after each time step. The maximum length of a time step is limited
by the Courant-Friedrichs-Lewy (CFL) condition C < 1 for numerical stability.
Now consider the upstream interpolation of a scalar field F at time level t - At in a
semi-Lagrangian scheme. The Courant number C determines how far a particle may
travel during one time step and it is for this reason that semi-Lagrangian methods
are favoured since they remain stable for C > 1. For a parallel implementation of
the traditional sequential algorithm, bicubic interpolation at the upstream departure
point requires a 4 x 4 grid point stencil for the Lagrange polynomial or a 2 x 2
stencil of function values and divided differences to form the Newton polynomial. A
two grid point wide overlap region is sufficient in either case to ensure that the data
surrounding a departure point is available when C < 1. From the standpoint of
complexity, there are no tangible gains in efficiency over Eulerian schemes when
C < 1. For each integer increment of C > 1 the overlap region must be increased by
one grid point, increasing the communication overhead. This implies that an estimate
of the maximum wind speed must be available to determine the size of the overlap
region.
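A one-line illustration of this sizing rule (our code; the function name is ours): the two points needed by the cubic stencil when C < 1 are widened by one further point per integer increment of C.

! Hedged sketch: overlap ("ghost") width for upstream cubic interpolation,
! sized from an estimate of the maximum wind speed.
integer function halo_width(umax, dt, dx)
   implicit none
   real(8), intent(in) :: umax, dt, dx
   real(8) :: c
   c = abs(umax) * dt / dx        ! Courant number C = |U| dt / dx
   halo_width = int(c) + 2        ! 2 points for C < 1, plus 1 per integer increment of C
end function halo_width

For the experiments reported in Section 3.4, where C < 2, this gives the fixed three grid point overlap used there.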

3.1. Scalability analysis

The scalability analysis technique described by Foster et al. [5] will be used
throughout the discussion. The terminology and definitions employed by the authors
are widely adopted in the parallel processing literature. Consider the simplified model
of a parallel computer consisting of p processors, each executing at the same speed
and able to exchange data by means of messages sent across a high-speed interconnec-
tion network. During the execution of a parallel program each processor will perform
useful computations, however, there will be overhead associated with communication.
The sequential time T_seq is defined to be the execution time of a good sequential implementation of an algorithm. For a parallel program, the execution time is T = T_comp + T_comm, where T_comp represents the time spent computing and T_comm is the communications overhead. To simplify the analysis it is assumed that T_comp = T_seq/p.

Parallel speedup S and efficiency E are defined as

$$ S = \frac{T_{seq}}{T}, \qquad E = \frac{S}{p} = \frac{1}{1 + f_c}, \qquad f_c = \frac{T_{comm}}{T_{comp}}, \qquad (19) $$

where f_c is usually referred to as the fractional communication overhead. In the analysis which follows it is assumed that both T_comp and T_comm are defined with
respect to one time step of the algorithm. Communication overhead will be characterized by three parameters: the message startup time or latency t_s; the per-hop time t_h, representing the time required to move between two directly connected processors; and the transfer time per word t_w, representing the bandwidth of the communication channel. The time required to send a message of size s is then given by

$$ T_{comm} = t_s + h\,t_h + s\,t_w. \qquad (20) $$


The above equation is an approximation since communication channels may be
shared. In particular, a logical processor topology may not map directly onto a
physical inter-connection network, resulting in several processors attempting to send
messages over the same channel at the same time. A reasonably accurate model of
this behaviour is to scale the transfer rate by the number N of processors concurrently
sending a message [6].

$$ T_{comm} = t_s + h\,t_h + N s\,t_w. \qquad (21) $$


In fact, the above equation yields quite accurate predictions of the communication
overhead for the PVM implementation of the advection problem on both the Intel
iPSC/860 and the Cray T3D.

3.2. Target architectures

In this article, the performance of the semi-Lagrangian algorithm is analyzed for the Intel iPSC/860 and Cray T3D. A comparison of existing MPP architectures
reveals that there exist significant variations in the parameters t_s, t_h and t_w among existing machines. In particular, the message latency t_s continues to decrease in
current generation machines. It is also interesting to compare single node peak
performance with attainable Mflop rates for typical applications.
The Intel iPSC/860 (also known as the Touchstone Gamma System) is based on
the 64 bit i860 RISC processor from Intel. The theoretical peak speed is 80 Mflops
in single precision floating point and 60 Mflops in double precision at a clock rate
of 40 MHz. The chip contains an 8 kbyte data cache and a 4 kbyte instruction cache.
A memory fetch, a floating point add and a floating point multiply can all be initiated
within a single clock period. Actual performance for the single node Linpack double
precision benchmark is 6.0 Mflops and a single precision Fortran “saxpy” achieves
6.8 Mflops. It has been reported that typical computational fluid dynamics kernels
achieve no better than 10 Mflops [15]. An iPSC/860 system contains up to 128
computational nodes, each containing an i860 processor, 8 Mbytes of memory and
a separate communications processor. The native Intel NX message-passing library

is supported by version 3.0 of PVM. The interconnection network is a seven-dimensional hypercube with a peak node-to-node aggregate bandwidth of 2.8 Mbytes/s or t_w = 1.4 µs, an internode hop time of t_h = 2 µs and a message startup time of t_s = 136 µs. Empirical studies have shown, however, that these represent optimal values [5,6]. Realistic values of t_s, t_h and t_w based on scalability studies of a spectral transform
shallow-water model [6] are summarized in Table 1.
The Cray T3D is a massively parallel processor based on a 3-D periodic torus
mesh network and contains up to 2048 processing elements (PEs) [3]. In essence,
it is a distributed memory MIMD architecture which also supports a shared memory
programming model. Communication is characterized by a 1 µs latency, maximum 0.01 µs internode hop time, and a 300 Mbyte/s PE-to-PE throughput. However, the
bandwidth as measured from software is actually 140 Mbytes/s. Each PE contains
a DEC Alpha 21064 RISC CPU, rated at 150 Mflops peak double precision for a
total peak performance of up to 300 Gflops. A PE also contains 1 k words = 8 kbytes
of direct mapped data cache and an 8 kbyte instruction cache. An estimate of
achievable performance is given by the single node double precision Linpack rating
of 20 Mflops. Direct support for PVM version 3.0 is provided. Our experiments have
shown that the effective latency for the Cray T3D implementation of PVM is on the
order of 200 µs, rather than the optimal value of 1 µs. The basic machine parameters
used to predict performance are summarized in Table 1 for w = 8 byte words on the
T3D and w = 4 byte words on the iPSC/860.

3.3. Performance prediction

The problem of passive advection in two dimensions will be considered in the analysis given below and in the performance results which follow. The scalability
analysis techniques introduced above are employed to predict program performance.
To simplify the analysis it is assumed that p processors are configured in a p_x × p_y processor mesh. An n_x × n_y periodic grid with uniform spacing is block distributed across the processor mesh and each processor maintains a fixed [C + 1] grid point
wide overlap region around the local grid partition. Each processor must send and
receive four messages (two in the x-direction and two in the y-direction) after every
time step in order to update these overlap regions. If computation and communica-
tion are not overlapped, then it follows that the communication overhead is

$$ T_{comm} = 2\left( t_s + h\,t_h + t_w\,\frac{n_y[C+1]}{p_x} \right) + 2\left( t_s + h\,t_h + t_w\,\frac{n_x[C+1]}{p_y} \right). \qquad (22) $$

Table 1
Target machine parameters (time in µs)

Machine      Topology     t_s      t_h     t_w      w
iPSC/860     Hypercube    200.0    2.0     1.4      4
Cray T3D     3D mesh      200.0    0.01    0.028    8

If p_x = p_y = √p and n_x = n_y = n, this expression simplifies to

$$ T_{comm} = 4\left( t_s + h\,t_h + t_w\,\frac{n[C+1]}{\sqrt{p}} \right). \qquad (23) $$

The parallel efficiency of the algorithm is therefore

$$ E = \frac{T_{comp}}{T_{comp} + T_{comm}} = \frac{1}{1 + T_{comm}/T_{comp}}, \qquad (24) $$

where T_comp represents the execution time for one time step on a single processor.
The value of T_comp must be "calibrated" for the particular machine and will depend on the problem size (due to cache effects), the flop count per grid point and the execution rate of the processor. Let t_fl represent the time to complete one floating point operation for a given program. If the execution rate is 1/t_fl for the semi-Lagrangian algorithm, then T_comp is approximately

$$ T_{comp} = 53\,\frac{n_x n_y}{p}\, t_{fl}. \qquad (25) $$
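The model of Eqs. (23)-(25) is simple enough to evaluate directly. The sketch below is our own code: the machine parameters are the iPSC/860 entries of Table 1, t_fl is calibrated from the 6.78 Mflops single-node rate quoted in Section 3.4, the channel-sharing factor N = p follows the modelling choice described there, and the hop count h = 1 is an assumption made only for this example.

! Sketch of the performance model of Eqs. (23)-(25) for an n x n grid on a
! square sqrt(p) x sqrt(p) processor mesh; the per-word time is scaled by the
! channel-sharing factor N of Eq. (21).  All times are in seconds.
program predict_efficiency
   implicit none
   real(8), parameter :: ts = 200.0d-6, th = 2.0d-6, tw = 1.4d-6  ! Table 1, iPSC/860
   real(8), parameter :: tfl = 1.0d0 / 6.78d6    ! calibrated from 6.78 Mflops (Section 3.4)
   real(8), parameter :: n = 160.0d0, width = 3.0d0  ! grid size and overlap width [C+1]
   real(8), parameter :: h = 1.0d0               ! hop count (assumed here)
   real(8) :: tcomp, tcomm, eff, speedup, nshare
   integer :: p

   do p = 4, 64, 4
      nshare  = dble(p)                          ! shared channels: N = p on the iPSC/860
      tcomp   = 53.0d0 * n * n / dble(p) * tfl   ! Eq. (25); 0.200 s per step at p = 1
      tcomm   = 4.0d0 * (ts + h*th + nshare * tw * n * width / sqrt(dble(p)))  ! Eqs. (21), (23)
      eff     = 1.0d0 / (1.0d0 + tcomm / tcomp)  ! Eq. (24)
      speedup = dble(p) * eff
      print '(i6, 2f10.3)', p, speedup, eff
   end do
end program predict_efficiency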

3.4. Results and analysis

Our implementation is based on a single program multiple data (SPMD) distributed-memory Fortran program where the grid is block partitioned and local grid
coordinates are computed at run-time. Advection requires a fixed 3 grid point wide
overlap region on each processor, corresponding to C < 2. The distance between
grid points is normalized to Δx = 1.0 and Δy = 1.0 with grid dimensions n_x × n_y. The maximum displacements in the x and y directions are given by α = |u|Δt/Δx and β = |v|Δt/Δy and must respect the C < 2 limit. In a realistic fluid flow problem, the
problem size would be increased by decreasing the grid spacing and thus increasing
the model resolution. In this case the time step At would have to decrease to respect
the maximum C or the size of the overlap region would have to increase. A simple
model problem is the passive advection of a Gaussian hill in 2-D on a rectangular
Cartesian grid with periodic boundary conditions. The functional form of such a
cone is given by

$$ F(x, y) = H \exp(-r^2/a^2), \qquad r^2 = (x - x_0)^2 + (y - y_0)^2, \qquad a = 4\Delta x, $$


where H is the maximum height. The wind has a constant velocity (u, v) = (1.0, 1.0).
For this example, therefore, departure points can be computed a priori for all grid
points. Fig. 1 is a plot of the advected cone, initially at the origin, after 60 time steps
on an 80 × 80 grid. A prediction of parallel performance is obtained for both target
machines by first executing the programs on a single processor for 100 time steps to
obtain a reasonably accurate estimate of Tseq and hence Tcomp. For example, an
execution rate of 6.78 Mflops is achieved on the iPSC/860 with grid dimensions

Fig. 1. Passive advection of a cone.

160 × 160. The resulting execution time is T_seq = 0.200 s per time step. The logical process structure of a p_x × p_y processor mesh implemented using PVM 3.0 on the
iPSC/860 does not map directly onto the hypercube network (e.g. through the use
of binary reflected Gray codes). Therefore, it is assumed that the communications
channels are shared and communications overhead is modeled by N = p in (21).
Eqs. (22) and (23) are then modified accordingly.
To assess how well the problem scales on the iPSC/860, the problem size is
increased up to a grid size of 1280 x 640. Predicted speedup curves are plotted as
solid lines in Fig. 2 and the observed values are plotted as single points. Predicted
and observed execution times for 100 time steps of the program on the iPSC/860
are also plotted in Fig. 3. The predictions fit well with the observed speedup and
execution times, indicating that the available bandwidth is decreasing with the
number of processors and that timings on the iPSC/860 are only influenced by t_s since t_h and t_w are negligible. A single node execution rate of 17.34 Mflops is obtained
on the Cray T3D for a 160 x 160 grid, whereas 16.15 Mflops is obtained for a
640 x 320 grid. It appears, therefore, that performance of the DEC Alpha processor
is affected by problem size and access to cache memory. A logical mesh process
topology maps directly onto the 3-D torus interconnection network of the T3D.
Nevertheless, our tests indicate that available bandwidth is reduced as the number
of processors increases. Sharing of communication channels on the T3D was modeled
by setting N = 2√p in (21).
In order to assess scalability, the problem size was increased once again on the
T3D. Predicted speedup curves are plotted in Fig. 4, where observed values are
plotted as single points. Predicted and observed execution times for 100 time steps
are plotted in Fig. 5. For small size problems, the predictions are quite accurate. In

Fig. 2. Predicted and observed speedup for iPSC/860.

the case of very large grids, the observed speedups are slightly larger. This effect is
most likely due to the decrease in size of the local grids, resulting in fewer cache
misses. Surprisingly, the Cray T3D implementation of PVM appears to impose a
larger latency and for small size problems this could limit performance. A better
choice might be to use the Cray T3D SHMEM shared memory primitives to
minimize communication overheads.

4. Discussion and conclusions

The basic motivation for the analysis presented in this paper is to eventually
develop a full 3-D atmospheric model on the sphere for massively parallel computers.
Even though the implementation of parallel advection on a 2-D Cartesian plane has
provided useful information, several important issues need to be addressed in spheri-
cal geometry. These are the use of fixed overlap regions and the difficulties associated
with the poles.
Williamson [17] and Zero [19] have gathered statistics on the average length of
particle trajectories at different latitudes in existing atmospheric models on the
sphere. Their results indicate that near the poles one can expect to encounter Courant
numbers on the order of C = 15 to C = 20. Therefore, the fixed overlap strategy

Fig. 3. Predicted and observed execution time for iPSC/860.

described herein may require an excessive amount of communication overhead for a mesh partitioning of a latitude-longitude grid. One solution to this problem might
be to dedicate a processor to each of the polar regions. Another approach would be
to use a strip partitioning in the north-south direction, allocating a latitude band to
each processor. In this scheme the communication to computation ratio increases
rapidly as the number of strips increases.
Several alternatives exist to the “maximum” overlap strategy on each processor
node. Both Williamson [17] and Zero [19] describe a parallel algorithm where each processor exports an interpolation problem to the processor that owns the
associated departure point. To cut down on the number of messages, each processor
would build a single message containing all problems destined for another processor.
One possible drawback of such a scheme is that the wind u will, in general, vary
greatly across the grid and so a load imbalance may result. Another approach
consists of using local estimates of the wind speed to maintain “dynamic” overlap
regions which vary in size [4]. Lie and Skålin [8] advocate an entirely different
approach in which trajectories are determined only up to the processor sub-grid
boundary. Thus, interpolation in time as well as in space is required. Research into
parallel semi-Lagrangian advection is active and we expect to see further progress
as MPP machines become widely accessible. The use of portable parallel message-

Fig. 4. Predicted and observed speedup for Cray T3D.

passing libraries such as PVM or MPI should also aid in the development of these
new methods.
To summarize, we found that our performance models provided a good fit to the
experimental results and it was also made clear that the semi-Lagrangian algorithm
based on cubic interpolation is scalable due to a high flop count per grid point. We
believe that our results are very promising for the eventual parallel implementation
of a complete semi-implicit, semi-Lagrangian shallow-water model based on a block
data distribution of the computational grid. Each time step in such a model consists
of semi-Lagrangian advection followed by a semi-implicit correction, requiring the
solution of a nonlinear Helmholtz problem. The task of building an efficient parallel
elliptic solver still remains.

Acknowledgements

We are most grateful to John Drake of the Oak Ridge National Laboratory for
providing access to an Intel iPSC/860. The efforts of John Tulley and Jimmy Scott
of Cray Canada in obtaining time on a Cray T3D are very much appreciated. In
particular, we would like to thank Ivar Lie and Roar Skålin for sharing their insights

Fig. 5. Predicted and observed execution time for Cray T3D.

into parallel SLT algorithms with us. We would also like to thank Pierre Gauthier
and Michel Valin of RPN for reviewing the manuscript.

References

[1] R. Bermejo and A. Staniforth, The conversion of semi-Lagrangian schemes to quasi-monotone schemes, Monthly Weather Review 120 (1992) 2622-2632.
[2] J. Côté, A Lagrange multiplier approach for the metric terms of semi-Lagrangian models on the sphere, Quart. J. Roy. Meteor. Soc. 114 (1988) 1347-1352.
[3] Cray Research Inc., MPP Software Guide (October 1994), SG-2508 1.2.
[4] J.B. Drake, Parallel semi-Lagrangian transport, May 1994, Presentation at the Workshop on Parallel Semi-Lagrangian Algorithms, NASA Goddard, Washington, DC.
[5] I. Foster, W. Gropp and R. Stevens, The parallel scalability of the spectral transform method, Monthly Weather Review 120 (1992) 835-850.
[6] I. Foster and P. Worley, Parallel algorithms for the spectral transform method, Technical Report MCS-P426-0494, Argonne National Laboratory, Argonne, IL, 1994.
[7] S. Gravel and A. Staniforth, A mass-conserving semi-Lagrangian integration scheme for the shallow-water equations, Monthly Weather Review 122 (1994) 242-248.
[8] I. Lie and R. Skålin, Parallelism in semi-Lagrangian transport, in: Parallel Supercomputing in Atmospheric Science, Proceedings of the 6th ECMWF Workshop on the Use of Parallel Processors in Meteorology, Singapore, 1994 (World Scientific, Singapore), in press.
[9] A. McDonald and R. Bates, Accuracy of multiply-upstream semi-Lagrangian advection schemes, Monthly Weather Review 112 (1984) 1267-1275.
[10] A. McDonald and R. Bates, Semi-Lagrangian integration of a gridpoint shallow-water model on the sphere, Monthly Weather Review 117 (1989) 130-137.
[11] A. Priestley, A quasi-conservative version of the semi-Lagrangian advection scheme, Monthly Weather Review 121 (1993) 621-629.
[12] H. Ritchie, Semi-Lagrangian advection on a Gaussian grid, Monthly Weather Review 115 (1987) 608-619.
[13] C. Rivest, A. Staniforth and A. Robert, Spurious resonant response of semi-Lagrangian discretizations to orographic forcing: diagnosis and solution, Monthly Weather Review 122 (2) (1994) 366-376.
[14] A. Robert, A semi-Lagrangian, semi-implicit numerical integration scheme for the primitive meteorological equations, Atmos.-Ocean 19 (1981) 35-46.
[15] H. Simon, ed., Parallel Computational Fluid Dynamics: Implementations and Results, Scientific and Engineering Computation Series (MIT Press, Cambridge, MA, 1992).
[16] A. Staniforth and J. Côté, Semi-Lagrangian integration schemes for atmospheric models - A review, Monthly Weather Review 119 (9) (1991) 2206-2223.
[17] D.L. Williamson, MPP implementations of the NCAR CCM2, November 1993, Presented at the Workshop on the Algorithmic Implications of MPP Architectures in Atmospheric and Oceanic Modeling, RPN, Dorval, PQ, Canada.
[18] D.L. Williamson and P.J. Rasch, Two-dimensional semi-Lagrangian transport with shape-preserving interpolation, Monthly Weather Review 117 (1988) 102-129.
[19] J. Zero, Data-parallel semi-Lagrangian advection, Portland, OR, November 1993, Poster presented at Supercomputing '93.
