
Fig. 1.1. A complicated computational domain.

1 Grid generation
A complex geometry could be something like in Figure 1.1. Assume that we want to
solve a partial differential equation (pde) on the domain in Figure 1.1. It could be a pde
describing the flow of air or water past the object, or a pde describing the electric and
magnetic fields around the object. We must then discretize the domain, i.e., divide it into
a number of cells where the solution can be represented. Numerical solutions are often
represented as point values of a quantity (e.g., the local velocity of a fluid) at the grid
points, or as the average of the quantity over one cell, see Figure 1.2.
Fig. 1.2. Grid points and cell centers.
We next approximate the pde on the discretized domain, using a finite difference
method, finite element method, or any other method which is suitable for the problem at
hand.
The more densely clustered the grid points are, the better the accuracy will be in the
computed solution. There are two distinct difficulties when we discretize the domain:
how to resolve the geometry, and how to resolve the computed solution.
Fig. 1.3. Compressible fluid flow past a step.
In Figure 1.3 we see a computation of flow past a forward facing step. We have to
make several decisions about this computation.
We could put many grid points at the discontinuities, the sharp transitions, in
order to resolve the solution.
We could put many points near the corner of the step in order to resolve the geometry.
Which resolution is desired? If we are only interested in quantities on the boundary,
perhaps it is not necessary to resolve all features of the solution far away from the
boundary?
1.1 Handling of Geometry
We distinguish several ways of discretizing a complex geometry:
1. Unstructured grids
2. Structured grids
(a) Rectangular grids
(b) Multi block grids
(c) Overlapping grids.
Unstructured grids. The domain is divided into polygons; triangles are often used.
See Figure 1.4 for a triangulation of the unit square. Software to generate this type of
discretization normally requires the user to input an initial, very coarse, triangulation,
perhaps only containing points on the boundary of the domain. Techniques for automatic
refinement are then used. In the data structure, each triangle has pointers to its neighbors,
but there is no information on coordinate directions. If there are n nodes they are
enumerated i = 1, . . . , n, and the coordinates, (x_i, y_i), of each node are stored in an array,
double x[n], y[n];.
Information on the connectivity of the nodes is also required. Exactly which information
is required depends on the type of discretization used. Sometimes the triangles themselves
are needed. They are then described by an array,
int tri[m][3];,
where tri[i][0], tri[i][1], tri[i][2] are the numbers of the three nodes in triangle
i. Note that the number of triangles, m, is different from the number of nodes, n. The
coordinates of the first node in triangle i are thus given by the indirect reference
x[tri[i][0]], y[tri[i][0]]
Fig. 1.4. Unstructured grid on the unit square.
Advantages with unstructured grids are:
+ Generality, no need to think about block decomposition.
+ Grid refinement is straightforward.
Disadvantages:
- Inefficient on many computer architectures.
- Not straightforward to get an efficient parallelization.
- Discretization formulas are often more complicated.
We will not discuss unstructured grids further in this course. We refer to the courses in
the finite element method.
Fig. 1.5. Structured grid.
Structured grids. A structured grid is something which is indexed along coordinate
directions. We think of a grid as a mapping

x(ξ, η, ζ), y(ξ, η, ζ), z(ξ, η, ζ) (1.1)

from the unit cube 0 ≤ ξ ≤ 1, 0 ≤ η ≤ 1, 0 ≤ ζ ≤ 1 to the physical space. See Figure 1.5.
For finite difference approximations, we want the grids to be smooth transformations,
so that we can use (1.1) to transform the pde to the unit cube, and solve it there. The
transformed problem will contain derivatives of the grid as coefficients. This is in analogy
with the smooth local coordinate maps of differential geometry. It is possible to define
finite volume or finite difference approximations which do not require smoothness of the
grid, but these approximations are in general more complicated, and computationally
expensive.
Advantages with structured grids are:
+ Easy to implement, and good efficiency.
+ With a smooth grid transformation, discretization formulas can be written in the
same form as in the case of rectangular grids.
Disadvantages:
- Difficult to keep the structured nature of the grid when doing local grid refinement.
- Difficult to handle complex geometry. (Try mapping an aircraft to the unit cube!)
There are several ways to overcome the difficulty of using structured grids for complex
geometry. The most common device is to divide the domain into blocks, where each block
can be mapped to the unit cube separately. A division into blocks can sometimes come
naturally from a cad representation of the geometry. The blocks can be adjacent, as
in Figure 1.6 a), or overlapping, as in Figure 1.6 b). In both cases special interpolating
boundary conditions are needed at the interfaces.
Note that block subdivision is like using unstructuredness on a coarser level. The block
layout is like an unstructured decomposition of the domain, but where each component
is further refined in a structured way. The first step in the discretization is to define
suitable blocks, and the second step is to generate one grid for each block.
An alternative to this is to use purely rectangular grids, and cut out the objects as
holes in the grid, as shown in Figure 1.7. This is a very general method. However, it is
difficult to achieve good accuracy in the boundary conditions for this method. Furthermore,
since cells are cut arbitrarily, some cells at the boundary can become very small, and cause
stability problems for explicit difference schemes. The discretization is required to work
for all shapes of cells. The method is used when the geometry is extremely complicated,
so that generating separate grids around each little detail is not a feasible technique.
We next concentrate on the problem of generating one single grid. There are some
fairly general techniques for doing this. On the other hand, the division into blocks is
usually done manually for each particular configuration, and can require many days of
work by an engineer.
Fig. 1.6a. Adjacent grids. Fig. 1.6b. Overlapping grids.
Fig. 1.7 Geometry cut out from a rectangular grid.
1.2 Single structured grids
For single grid components, we have several methods of constructing grids.
1. Analytical transformations
2. Algebraic grid generation
3. Elliptic grid generation
4. Variational grid generation
5. Hyperbolic grid generation
In general, the procedure to generate a grid goes as follows. First generate grids on the
edges of the domain; second, from the given edges, generate the sides; and finally generate
the volume grid from the six given side grids. In the description below, we will assume
two space dimensions, but the methods extend straightforwardly to three dimensions.
Analytical transformations. This is the most obvious type of grid generation. If the
geometry is simple enough that we know an analytical transformation to the unit square,
the grid generation becomes very easy. E.g., the domain

D = {(x, y) | r_i^2 ≤ x^2 + y^2 ≤ r_o^2}

can be mapped to the unit square by the mapping

x = (r_i + (r_o - r_i)r) cos 2πθ
y = (r_i + (r_o - r_i)r) sin 2πθ

where now 0 ≤ r ≤ 1 and 0 ≤ θ ≤ 1. The grid is then obtained by a uniform subdivision
of the (r, θ) coordinates. This method of generating grids is simple and efficient. Another
advantage is that we can generate orthogonal grids, if a conformal mapping can be found.
Orthogonal grids are grids where grid lines always intersect at right angles. Often pde
approximations become more accurate, and have stencils with fewer points, on orthogonal
grids. Of course, this method has very limited generality, and can only be used for very
special geometries.
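As a small illustration, the following C++ sketch generates the annulus grid by uniform
subdivision of (r, θ). The function name and the array layout are illustrative choices,
not from the notes.

#include <cmath>
#include <vector>

// Sketch: structured grid on the annulus r_i^2 <= x^2 + y^2 <= r_o^2,
// obtained by uniform subdivision of the (r,theta) unit square.
void annulus_grid( double ri, double ro, int nr, int nt,
                   std::vector<double>& x, std::vector<double>& y )
{
    const double pi = 3.14159265358979323846;
    x.resize( nr*nt );
    y.resize( nr*nt );
    for( int i = 0; i < nr; i++ )
        for( int j = 0; j < nt; j++ )
        {
            double r     = double(i)/(nr-1);   // 0 <= r <= 1
            double theta = double(j)/(nt-1);   // 0 <= theta <= 1
            double rad   = ri + (ro - ri)*r;
            x[i + nr*j] = rad*std::cos( 2*pi*theta );
            y[i + nr*j] = rad*std::sin( 2*pi*theta );
        }
}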
Algebraic grid generation. This type of grid generation is also called transfinite
interpolation. In its simplest form, it is described by the formula

x(ξ, η) = (1 - ξ)x(0, η) + ξx(1, η) + (1 - η)x(ξ, 0) + ηx(ξ, 1)
        - (1 - ξ)(1 - η)x(0, 0) - ξ(1 - η)x(1, 0) - (1 - ξ)ηx(0, 1) - ξηx(1, 1)
y(ξ, η) = (1 - ξ)y(0, η) + ξy(1, η) + (1 - η)y(ξ, 0) + ηy(ξ, 1)
        - (1 - ξ)(1 - η)y(0, 0) - ξ(1 - η)y(1, 0) - (1 - ξ)ηy(0, 1) - ξηy(1, 1)
(1.2)

Here it is assumed that (x(ξ, η), y(ξ, η)) are known on the sides of the domain. Formula
(1.2) then gives the grid in the entire domain. The formula is an interpolation from the
sides to the interior. The first two terms are the interpolation between ξ = 0 and ξ = 1, the
next two terms are the same for the η direction. Finally four corner terms are subtracted.
One can easily verify that (1.2) is exact on the boundary by putting ξ = 0, ξ = 1, η = 0,
or η = 1.
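As an illustration, formula (1.2) is only a few lines of code. The sketch below assumes
that a boundary function f(side, t) (a hypothetical name) returns the x (or y) values on
side 0: ξ = 0, side 1: ξ = 1, side 2: η = 0, side 3: η = 1.

// Transfinite interpolation (1.2), evaluated at one interior point.
double tfi( double (*f)(int, double), double xi, double eta )
{
    return (1-xi)*f(0,eta) + xi*f(1,eta) + (1-eta)*f(2,xi) + eta*f(3,xi)
         - (1-xi)*(1-eta)*f(2,0.0) - xi*(1-eta)*f(2,1.0)
         - (1-xi)*eta*f(3,0.0)     - xi*eta*f(3,1.0);
}

The interior grid is then obtained as x[i][j] = tfi(xb, i/(n-1.0), j/(m-1.0)), and
similarly for y, with xb, yb the two boundary functions.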
This formula can be generalized to the case where
- x and y are given on several coordinate lines in the interior of the domain,
- both x and y and their derivatives are given on the sides.
A general form of (1.2) is obtained by introducing the blending functions φ_0 and φ_1,
with the properties

φ_0(0) = 1,  φ_0(1) = 0,  φ_1(0) = 0,  φ_1(1) = 1.
The general transfinite interpolation is then defined by

x(ξ, η) = φ_0(ξ)x(0, η) + φ_1(ξ)x(1, η) + φ_0(η)x(ξ, 0) + φ_1(η)x(ξ, 1)
        - φ_0(ξ)φ_0(η)x(0, 0) - φ_1(ξ)φ_0(η)x(1, 0) - φ_0(ξ)φ_1(η)x(0, 1) - φ_1(ξ)φ_1(η)x(1, 1)
By introducing the projector

P_ξ(x) = φ_0(ξ)x(0, η) + φ_1(ξ)x(1, η)

we can write the transfinite interpolation above as the boolean sum

x(ξ, η) = P_ξ(x) + P_η(x) - P_ξ(P_η(x)) (1.3)
This form is convenient for describing generalizations of the method. E.g., if the derivatives
of the grid are prescribed on the boundary, we can use the projector

P_ξ(x) = φ_0(ξ)x(0, η) + φ_1(ξ)x(1, η) + ψ_0(ξ) ∂x(0, η)/∂ξ + ψ_1(ξ) ∂x(1, η)/∂ξ

in (1.3) instead. Here the new blending functions ψ_0 and ψ_1 satisfy

ψ_0(0) = 0,  ψ_0(1) = 0,  ψ_0'(0) = 1,  ψ_0'(1) = 0,
ψ_1(0) = 0,  ψ_1(1) = 0,  ψ_1'(0) = 0,  ψ_1'(1) = 1.
An example of when we would like to prescribe the derivatives on the boundary is when
we want the grid to be orthogonal to the boundary. This can sometimes be needed in
order to simplify boundary conditions. Of course, it is possible to use different projectors
and blending functions in the different coordinate directions.
The advantage of algebraic grid generation is its efficiency, and ease of implementation.
A disadvantage is that there is no guarantee that the method will be successful. For very
curved boundaries, it can often happen that grid lines intersect, as in the example in
Figure 1.8 a). Another problem is the propagation of singularities from the boundary. If
the boundary has an interior corner, as in Figure 1.8 b), the corner will be seen in the
interior of the domain, making it difficult to generate smooth grids.
Fig. 1.8a. Folding of grid lines. Fig. 1.8b. Propagating corner.
Elliptic grid generation. This type of grid generation is motivated by the maximum
principle for elliptic pdes. We define the inverse grid transformation, ξ(x, y), η(x, y), as
the solution of

ξ_xx + ξ_yy = 0
η_xx + η_yy = 0
(1.4)
We know that 0 ≤ ξ ≤ 1 and 0 ≤ η ≤ 1 are monotone on the boundaries. It then follows
from the maximum principle that ξ and η will stay between these values. Furthermore,
there will be no local extrema in the interior, and thus grid lines can not fold. The
equations (1.4) are formulated in the x-y domain, and have to be transformed to the unit
square, so that we can solve them there. We use the unknown transformation itself to
transform the equations (1.4). The transformed system then becomes

(x_η^2 + y_η^2)x_ξξ - 2(x_ξ x_η + y_ξ y_η)x_ξη + (x_ξ^2 + y_ξ^2)x_ηη = 0
(x_η^2 + y_η^2)y_ξξ - 2(x_ξ x_η + y_ξ y_η)y_ξη + (x_ξ^2 + y_ξ^2)y_ηη = 0.
This problem can be solved as a Dirichlet problem if (x, y) are given on the boundaries, or
as a Neumann problem if the normal derivatives of (x, y) are specified on the boundaries.
Specifying normal derivatives is here equivalent to specifying the distance between the
first and second grid lines. These equations are then approximated by, e.g.,
x_ξ ≈ (x_{i+1,j} - x_{i-1,j})/2,   x_η ≈ (x_{i,j+1} - x_{i,j-1})/2,
x_ξξ ≈ x_{i+1,j} - 2x_{i,j} + x_{i-1,j},   etc.,

where now the index space 1 ≤ i ≤ n_i, 1 ≤ j ≤ n_j is a uniform subdivision of the
(ξ, η) coordinates, ξ = (i - 1)/(n_i - 1), η = (j - 1)/(n_j - 1). The number of grid points
is specified as n_i × n_j.
      subroutine GAUSEI2D( ni, nj, x, y, err )
      integer ni, nj, i, j
      real*8 x(ni,nj), y(ni,nj), xtemp, ytemp, err
      real*8 g11, g12, g22
      err = 0
      do j=2,nj-1
         do i=2,ni-1
c           metric coefficients from centered differences
            g11 =((x(i+1,j)-x(i-1,j))**2 + (y(i+1,j)-y(i-1,j))**2 )/4
            g22 =((x(i,j+1)-x(i,j-1))**2 + (y(i,j+1)-y(i,j-1))**2 )/4
            g12 =(x(i+1,j)-x(i-1,j))*(x(i,j+1)-x(i,j-1))/4+
     *           (y(i+1,j)-y(i-1,j))*(y(i,j+1)-y(i,j-1))/4
            xtemp = 1/(2*(g11+g22))*(
     *        g22*x(i+1,j) - 0.5*g12*x(i+1,j+1) + 0.5*g12*x(i+1,j-1)+
     *        g11*x(i,j+1) + g11*x(i,j-1) +
     *        g22*x(i-1,j) - 0.5*g12*x(i-1,j-1) + 0.5*g12*x(i-1,j+1) )
            ytemp = 1/(2*(g11+g22))*(
     *        g22*y(i+1,j) - 0.5*g12*y(i+1,j+1) + 0.5*g12*y(i+1,j-1)+
     *        g11*y(i,j+1) + g11*y(i,j-1) +
     *        g22*y(i-1,j) - 0.5*g12*y(i-1,j-1) + 0.5*g12*y(i-1,j+1) )
            err = err + (x(i,j)-xtemp)**2+(y(i,j)-ytemp)**2
            x(i,j) = xtemp
            y(i,j) = ytemp
         enddo
      enddo
      err = SQRT( err/((nj-2)*(ni-2)) )
      return
      end
Code 1.1. Gauss-Seidel iteration for elliptic grid generator.
The equations can then be solved by a standard elliptic solver such as, e.g., conjugate
gradients, Gauss-Seidel, or the multigrid method. In Code 1.1, we show a Fortran
subroutine for doing one Gauss-Seidel iteration on the system.
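For readers who prefer C++: the sketch below transcribes the same Gauss-Seidel sweep,
together with the outer iteration loop that Code 1.1 leaves to the caller. The storage
convention x[i + ni*j] and the function name are illustrative choices.

#include <cmath>
#include <vector>

// One Gauss-Seidel sweep for the elliptic grid equations, as in Code 1.1.
// Returns the root mean square change, so the caller can iterate to tolerance.
double gauss_seidel_sweep( int ni, int nj,
                           std::vector<double>& x, std::vector<double>& y )
{
    double err = 0;
    auto id = [ni]( int i, int j ){ return i + ni*j; };
    for( int j = 1; j < nj-1; j++ )
        for( int i = 1; i < ni-1; i++ )
        {
            double xr = x[id(i+1,j)]-x[id(i-1,j)], yr = y[id(i+1,j)]-y[id(i-1,j)];
            double xs = x[id(i,j+1)]-x[id(i,j-1)], ys = y[id(i,j+1)]-y[id(i,j-1)];
            double g11 = (xr*xr + yr*yr)/4, g22 = (xs*xs + ys*ys)/4;
            double g12 = (xr*xs + yr*ys)/4;
            double cx = x[id(i+1,j+1)]-x[id(i+1,j-1)]-x[id(i-1,j+1)]+x[id(i-1,j-1)];
            double cy = y[id(i+1,j+1)]-y[id(i+1,j-1)]-y[id(i-1,j+1)]+y[id(i-1,j-1)];
            double xt = ( g22*(x[id(i+1,j)]+x[id(i-1,j)])
                        + g11*(x[id(i,j+1)]+x[id(i,j-1)]) - 0.5*g12*cx )/(2*(g11+g22));
            double yt = ( g22*(y[id(i+1,j)]+y[id(i-1,j)])
                        + g11*(y[id(i,j+1)]+y[id(i,j-1)]) - 0.5*g12*cy )/(2*(g11+g22));
            err += (x[id(i,j)]-xt)*(x[id(i,j)]-xt) + (y[id(i,j)]-yt)*(y[id(i,j)]-yt);
            x[id(i,j)] = xt;
            y[id(i,j)] = yt;
        }
    return std::sqrt( err/((ni-2)*(nj-2)) );
}

// Typical usage: start from an algebraic grid, then
//   while( gauss_seidel_sweep( ni, nj, x, y ) > 1e-10 ) { }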
If you want to specify both the grid and its normal derivatives on the boundary, the
second order elliptic pde above can not be used, but it is possible to define an elliptic
problem with fourth order derivatives instead.
Elliptic grid generation is very reliable, and will always produce a grid. However, it
might not always be the grid you want. For example, grid lines tend to cluster near
convex boundaries, but will be very sparsely distributed near concave boundaries. In
Figures 1.9a and 1.9b we show the same examples as in Figures 1.8a and 1.8b, but now
the grid is generated by the elliptic equations (1.4). Clearly, the problems which occurred
with transfinite interpolation do not happen here.
Fig. 1.9a. No folding. Fig. 1.9b. Corner smoothed.
To introduce more control over the grid, so called control functions are introduced into
(1.4). The system then becomes

ξ_xx + ξ_yy = P
η_xx + η_yy = Q
Alternatively, the functions (P, Q) can be scaled so that the right hand sides become
(ξ_x^2 + ξ_y^2)P and (η_x^2 + η_y^2)Q. The maximum principle is now lost, so that we have no
guarantee that a grid will be successfully generated. There is a risk that grids with folded
coordinate lines occur.
The functions P and Q can be chosen to attract grid lines to certain lines or points
in space. To do this by hand is very difficult. Often P and Q are specified from a
weight function, which specifies the grid density at each point in the domain. This is often
used in adaptive methods, where the weight function measures the error in a computed
solution.
Elliptic grid generators are often used to post process grids computed by algebraic
grid generators. If (1.4) is solved by an iterative method for elliptic pdes, it is sufficient
to do a few iterations to smooth out irregularities in the grid. We do not need to solve the
equations (1.4) themselves to any high accuracy, thereby saving computational work.
Variational methods. These methods have evolved from elliptic grid generation. To
solve an elliptic pde is often equivalent to minimizing a functional. We distinguish three
functionals which describe the grid quality. The orthogonality functional is

O = O_1 + O_2 + O_3 + O_4,

where each of the four components O_i is related to one of the angles between grid lines.
See Figure 1.10.
We have for the first angle

O_1 = Σ_{i=1,j=1}^{n_i-1,n_j-1} ( (r_{i,j+1} - r_{i,j})^T (r_{i+1,j} - r_{i,j}) )^2

where r = (x y)^T. Here we let a^T denote the transposed vector, so that a^T b is
the scalar product between a and b. O is zero for an orthogonal grid.
Fig. 1.10 Scalar product between intersecting grid lines.
The spacing functional measures how fast the spacing changes, and is defined by

S = S_H + S_V.

Here S_H and S_V measure the horizontal and vertical spacing, respectively. The definition
of S_H is

S_H = Σ_{i=1}^{n_i-1} Σ_{j=1}^{n_j} |r_{i+1,j} - r_{i,j}|^2

and similarly for S_V. S is minimal for a uniform grid. The cell area functional is defined
by
A = Σ_{i=1,j=1}^{n_i-1,n_j-1} A_{i,j}^2,

and measures how much the cell area changes over the grid. A_{i,j} is defined as the area of
the cell with corners (i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1).
These three functionals are combined into one, which is minimized. We obtain the
problem to minimize

V = aA + bS + cO

where a, b, and c are parameters which the user should provide. They indicate the relative
importance of the different quality measures for the problem at hand. To obtain, e.g., a
grid which is close to orthogonal, c should be large, and a and b small.
The minimization can be done by a suitable numerical method, such as, e.g., the
conjugate gradient method.
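As an illustration, the sketch below evaluates V for a given grid. For brevity only one of
the four corner angles of each cell enters O (the notes sum all four), and the cell area is
computed from the cross product of the cell diagonals; names and storage layout are
illustrative assumptions.

#include <cmath>
#include <vector>

// Sketch: evaluate V = a*A + b*S + c*O for an ni x nj grid stored as x[i + ni*j].
double grid_functional( int ni, int nj, const std::vector<double>& x,
                        const std::vector<double>& y,
                        double a, double b, double c )
{
    auto id = [ni]( int i, int j ){ return i + ni*j; };
    auto sq = []( double t ){ return t*t; };
    double O = 0, S = 0, A = 0;
    for( int j = 0; j < nj; j++ )          // spacing: all horizontal edges
        for( int i = 0; i < ni-1; i++ )
            S += sq(x[id(i+1,j)]-x[id(i,j)]) + sq(y[id(i+1,j)]-y[id(i,j)]);
    for( int j = 0; j < nj-1; j++ )        // spacing: all vertical edges
        for( int i = 0; i < ni; i++ )
            S += sq(x[id(i,j+1)]-x[id(i,j)]) + sq(y[id(i,j+1)]-y[id(i,j)]);
    for( int j = 0; j < nj-1; j++ )
        for( int i = 0; i < ni-1; i++ )
        {
            // scalar product of the two cell edges meeting at corner (i,j)
            double dot = (x[id(i,j+1)]-x[id(i,j)])*(x[id(i+1,j)]-x[id(i,j)])
                       + (y[id(i,j+1)]-y[id(i,j)])*(y[id(i+1,j)]-y[id(i,j)]);
            O += dot*dot;
            // quadrilateral cell area from the cross product of the diagonals
            double area = 0.5*std::fabs(
                 (x[id(i+1,j+1)]-x[id(i,j)])*(y[id(i,j+1)]-y[id(i+1,j)])
               - (y[id(i+1,j+1)]-y[id(i,j)])*(x[id(i,j+1)]-x[id(i+1,j)]) );
            A += area*area;
        }
    return a*A + b*S + c*O;
}

The gradient of V with respect to the interior grid points is what a minimization method
such as conjugate gradients would then work with.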
The advantage of this method is flexibility. However, we have no guarantee of success.
E.g., minimizing the spacing functional corresponds to solving the elliptic pde

x_ξξ + x_ηη = 0
y_ξξ + y_ηη = 0

which, since it is formulated for the physical coordinates, is not the same as (1.4). Here
we have no help from the maximum principle.
Remark. It can be advantageous to mix grid generation methods. E.g., one could use
an elliptic method to generate grids on the sides of the domain, and then an algebraic
method to generate the grid in the entire volume. This gives good performance, since
algebraic grid generation is inexpensive compared with elliptic grid generation.
1.3 Implementational considerations
It is convenient to describe all boundary curves in a compatible parameterization. It is a
good idea to use the normalized arc length as curve parameter. To see why, consider the
airfoil (NACA0012) in Figure 1.11 below. It is described by the equation

y = a_0 √x + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4,   0 ≤ x ≤ 1

The constants a_i have given values. The most straightforward way to place points on
the airfoil is to make a uniform discretization of the x-axis, x_i = iΔx, i = 1, . . . , N, and
use the points (x_i, y(x_i)) as boundary grid points. However, we would then obtain the
picture in Figure 1.12a. There are very few points at the sharp gradient in the front. This
local sparseness influences the entire grid when later the points on the airfoil are used
to generate the grid in the entire domain. Of course stretching functions can be used to
cluster more points near the leading edge, but it is difficult to tune the right stretching
by hand.
We instead write the airfoil as a curve (x(s), y(s)), where s ∈ [0, 1] is proportional to
the arc length. This is done by the reparameterization

s(p) = (1/L) ∫_a^p √(x'(t)^2 + y'(t)^2) dt (1.5)

Here L is the total arc length, L = ∫_a^b √(x'(t)^2 + y'(t)^2) dt. The function x(s) is implemented
according to the following algorithm. Given a value of s, we solve equation (1.5) for p,
using Newton's method, and numerical quadrature for the integral. The x value to return
is then x(p), which we easily compute from the original parameterization.
Using the same number of points as in Figure 1.12a, but with the arc length as
parameter, we obtain the picture in Figure 1.12b. There is clearly an improvement.
Another problem that has to be dealt with is the coordinate direction of opposite sides.
Figure 1.13 illustrates this. The arc length coordinate must run in the same direction on
opposite sides. Every grid generating software must be capable of keeping track of the
coordinate directions on facing sides, and, if necessary, apply the transformation s → 1 - s.
This problem becomes more severe in three space dimensions.
Fig. 1.11 NACA0012 airfoil.
Fig. 1.12a. x as parameter. Fig. 1.12b. Arc length as parameter.
Fig. 1.13 Coordinate directions on upper and lower sides do not match.
1.4 Stretching transformations
It is often necessary to cluster points more densely near certain critical parts of the
computational domain. This can be achieved by a so called stretching function. This is
a function which maps the unit interval onto itself. We denote the stretching function
by s(u) : [0, 1] → [0, 1]. Stretching is applied in only one coordinate direction. It is of
course possible to define different stretching functions in different coordinate directions.
Along one coordinate, we go through the following steps in order to generate grid points on
one coordinate line.
1. Take a uniform grid, u_i = (i - 1)/(n - 1), i = 1, . . . , n.
2. Apply the stretching, s_i = s(u_i).
3. Generate points in the arc length parameter, i.e., use the grid points (x(s_i), y(s_i), z(s_i))
on one edge.
The philosophy behind this is that grids should be stretched as a separate first step, before
the grid generation. This is in contrast with the discussion on elliptic grid generators,
where one takes the point of view that all grid stretchings should be built into the control
functions P, Q.
We next give some examples of suitable grid stretching functions. Assume that we
have the grid points x_1 = 0, x_2, x_3, . . . , x_n = 1. A common practice for the computation
of boundary layers in fluid dynamics is to use the spacing

x_{i+1} - x_i = τ(x_i - x_{i-1}) (1.6)

with τ > 1. Here we assume that the layer is near the boundary x_1 = 0. We specify the
smallest spacing x_2 - x_1, and then gradually increase the spacing with i through (1.6).
We have x_{i+1} - x_i = τ^{i-1}(x_2 - x_1), and thus

x_i = x_{i-1} + τ^{i-2}(x_2 - x_1) = . . . = x_1 + (1 + τ + . . . + τ^{i-2})(x_2 - x_1)
    = (τ^{i-1} - 1)/(τ - 1) (x_2 - x_1).

Because x_n = 1, we obtain x_2 - x_1 = (τ - 1)/(τ^{n-1} - 1), and thus

x_i = (τ^{i-1} - 1)/(τ^{n-1} - 1).

Let the uniform spacing variable be u = (i - 1)/(n - 1). The stretching function associated
with (1.6) is then

x(u) = (τ^{(n-1)u} - 1)/(τ^{n-1} - 1).

This function must be well defined for all n, which implies that τ = 1 + O(1/n). Since
τ^{n-1} → e^β as n → ∞, we prefer to define the stretching as

x(u) = (e^{βu} - 1)/(e^β - 1)
where the parameter β > 0 is chosen to control the strength of the stretching. Often the
derivative x'(0) is specified as a measure of the stretching near the boundary; β can
then easily be computed from x'(0).
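A minimal C++ sketch of this stretching (the function name is illustrative):

#include <cmath>
#include <vector>

// Exponential stretching x(u) = (exp(beta*u) - 1)/(exp(beta) - 1).
// beta > 0 clusters the points near x = 0; note x'(0) = beta/(exp(beta) - 1).
std::vector<double> stretch_exp( int n, double beta )
{
    std::vector<double> x(n);
    for( int i = 0; i < n; i++ )
    {
        double u = double(i)/(n-1);
        x[i] = ( std::exp(beta*u) - 1.0 )/( std::exp(beta) - 1.0 );
    }
    return x;
}

Given a desired x'(0), the relation x'(0) = β/(e^β - 1) can be inverted for β with a few
Newton or bisection steps.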
The exponential stretching is very simple. Analysis of truncation errors has led to
more advanced stretching functions, such as the hyperbolic tangent,

x(u) = 1 + tanh(δ(u - 1)/2)/tanh(δ/2). (1.7)

The parameter δ is used to control the strength of the stretching, and can be determined
from a specified value of x'(0). Examples of grid point distributions are given below.
Fig. 1.14a. β = 1.2. Fig. 1.14b. β = 2.
Exponential stretching.
Fig. 1.15a. δ = 1.2. Fig. 1.15b. δ = 2.
Hyperbolic tangent stretching.
More general forms of the functions above can be derived to place the stretching at an
arbitrary point, not just near the left boundary. An example of the hyperbolic tangent
stretching function is then

x(u) = (tanh(b(u - c)) - A)/(B - A)

where A = tanh(-bc) and B = tanh(b(1 - c)). Here b is the strength and 1 - c the
location of the attraction point. The formula above becomes (1.7) under the restriction
c = 1 (with b = δ/2).
1.5 Further reading
Some references on grid generation are:
J. Thompson, Z. Warsi, and C. Mastin, Numerical Grid Generation, North-Holland, 1985,
ISBN 0-444-00985-X.
P. Knupp and S. Steinberg, Fundamentals of Grid Generation, CRC Press, 1993, ISBN
0-8493-8987-9.
J. Thompson, B. Soni, and N. Weatherill, Handbook of Grid Generation, CRC Press,
1999, ISBN 0-8493-2687-7.
2 Representation in C++.
The computer language C++ is well suited for implementing a grid generator. C++ is a
superset of C. The discussion below assumes knowledge of the language C.
One major difference between C++ and C is that C++ has a data type called a class.
A class is a generalization of the struct in C, or the record in Pascal. In a class we can
gather together a number of variables. We also have the possibility to let functions be
part of a class. Furthermore, we can choose to make certain variables and functions
inaccessible to all code outside the class itself. The main idea is that a class should be
a well encapsulated entity, isolated from the rest of the code. The class can only be
used through well defined interfaces. It is often difficult to design good classes; however,
for boundary curves it is fairly easy to get a clean representation with classes.
2.1 A first example of a class
Below we show an example of a class which implements the boundary curve

y = a_0 √x + a_1 x,   a ≤ x ≤ b (2.8)

where x is used as curve parameter.
class bcurve
{
private:
double a, b;
double a0, a1;
int dir;
public:
bcurve( double a_in, double b_in )
{
a0 = 0.3033;
a1 = 0.67;
a = a_in;
b = b_in;
dir = 0;
}
double x( double p )
{
return p;
}
double y( double p )
{
return a0*sqrt(p) + a1*p;
}
void reverse()
{
dir = !dir;
}
};
A class declaration has the structure
class name
{
// declarations
};
The word class is a reserved word; the name of the class is an identifier chosen by the
programmer.
The class bcurve now contains the variables a, b, the end points of the curve parameter,
a0, a1, describing the parameter functions, and dir, which is the direction of the
curve. The value 0 of dir means that p goes from a to b; the value 1 means from b to a.
bcurve also contains the functions bcurve, x, y, and reverse.
The reserved word private means that these variables can only be accessed from
functions inside the class. Under public we declare the variables and functions which
any user of the class will be able to access directly. This constitutes the interface to the
class. A class should be designed such that specific details that we may want to change
later are kept under private. Changes become in that case local to the class, and the
rest of the code using the class can remain unchanged.
Another reason to use private is to prevent users of the class from changing variables by
mistake. For example, the variable dir can not be accessed directly, but can only be
changed in a very controlled way through the function reverse.
The first function under public is a function with the same name as the class, and
with no return type. This is an initialization function, called a constructor, which is called
automatically each time a variable of the class type is created. The intended use of the
constructor is to give values to all variables in the class, so that we do not risk using
uninitialized variables by mistake. The other two public functions are the parameterization
(x(p), y(p)).
In a main program, the class is used as in the example below, where 10 points on the
curve are generated.
// The class declaration here.
int main()
{
bcurve bc(0,1.0);
double x[10], y[10];
for( int i=0; i<10; i++ )
{
x[i] = bc.x( i/9.0 );
y[i] = bc.y( i/9.0 );
}
...
};
A variable of class type, such as bc in the example above, is called an object. When the
program is executed, the first declaration bcurve bc(0,1.0); will immediately cause the
constructor of the bcurve class to be called with the parameter values a_in=0, b_in=1.0.
Note the dot notation bc.x(0.0) to access a member function of a class. Variables are
accessed in the same way, e.g., bc.a denotes the left endpoint of the curve. However, in
this example, trying to access a in the main program will lead to a compilation error,
because a is declared as private.
Normally, functions are not defined inside the class definition as in the example above.
C++ allows us to define the functions outside the class, and only declare them inside the
class. The same example can in that case be written
class bcurve
{
private:
double a, b;
double a0, a1;
int dir;
public:
bcurve( double a_in, double b_in );
double x(double p );
double y(double p );
void reverse();
};
bcurve::bcurve( double a_in, double b_in )
{
a0 = 0.3033;
a1 = 0.67;
a = a_in;
b = b_in;
dir = 0;
}
double bcurve::x( double p )
{
return p;
}
double bcurve::y( double p )
{
return a0*sqrt(p) + a1*p;
}
void bcurve::reverse()
{
dir = !dir;
}
When the function definitions are lifted out of the class, they must be preceded by
classname::, so that the compiler knows where the functions belong. (It is possible
that two different classes contain functions with the same name.) The function definitions
can now be put anywhere in the program. Often they are written in a separate file.
Every part of the code which will use the class bcurve must include the class declaration
at the top of its file. Therefore one usually puts the class declaration in a special file, called
in this case bcurve.h, which we give below.
class bcurve
{
private:
double a, b;
double a0, a1;
int dir;
public:
bcurve( double a_in, double b_in );
double x(double p );
double y(double p );
void reverse();
};
The file bcurve.h
The function definitions are put in a separate file, bcurve.C, which is compiled separately
once and for all. We give bcurve.C below.
#include "bcurve.h"
bcurve::bcurve( double a_in, double b_in )
{
a0 = 0.3033;
a1 = 0.67;
a = a_in;
b = b_in;
dir = 0;
}
double bcurve::x( double p )
{
return p;
}
double bcurve::y( double p )
{
return a0*sqrt(p) + a1*p;
}
void bcurve::reverse()
{
dir = !dir;
}
The file bcurve.C
In the main program where we use the class bcurve, we include bcurve.h as shown
below:
#include "bcurve.h"
int main()
{
bcurve lower_boundary(0,0.5);
....
}
The advantage of this is that the functions inside the class bcurve are compiled once.
We can compile the main program without recompiling the functions in bcurve. Had
they been written inside the class, as in the first example in this chapter, the function
definitions would have been present in the include file, and therefore recompiled every
time the main program was recompiled.
When several files include each other in a non-trivial way, it can easily happen that a
file like bcurve.h becomes included more than once. To avoid this, it is common practice
to put extra preprocessor directives in bcurve.h, as follows:
#ifndef BCURVE
#define BCURVE
class bcurve
{
private:
double a, b;
double a0, a1;
int dir;
public:
bcurve( double a_in, double b_in );
double x(double p );
double y(double p );
void reverse();
};
#endif
The file bcurve.h, with preprocessor directives
This will ensure that bcurve.h is included only once.
One way to improve execution speed is to inline small functions which are called often.
Inlining means that the compiler does not generate calls to the function; instead, it
inserts the entire source code of the function at each place where it is called. In C++,
we tell the compiler to inline a function by using the keyword inline in front of the
function:
inline double x( double p )
{
return p;
}
Furthermore, all functions which are defined inside the class declaration become inline
by default. Inlined functions are similar to macros, but give less risk for unwanted side
effects. An inline function must be defined in the same file as it is called from; separate
compilation of inline functions is not allowed in C++. Inline functions are therefore usually
defined in the include file, as shown in the class declaration bcurve.h below.
#ifndef BCURVE
#define BCURVE
class bcurve
{
private:
double a, b;
double a0, a1;
int dir;
public:
bcurve( double a_in, double b_in );
inline double x(double p )
{
return p;
}
inline double y(double p )
{
return a0*sqrt(p) + a1*p;
}
void reverse();
};
#endif
We next improve the class above. Assume that we want to reparameterize, using the
arc length normalized to [0, 1]. We would then define the class as
class bcurve
{
private:
double a, b;
double a0, a1;
int dir;
double length;
double xo( double p );
double yo( double p );
public:
bcurve( double a_in, double b_in );
double x( double s );
double y( double s );
};
The private functions xo, yo describe the curve in the original parameterization. These are
private because the user of the class should not have to worry about the exact description
of the parameterization. Instead, the curve should be accessed from the outside through
the public functions x, y, in which the arc length is used as parameter. In this way, the
curve can be implemented with any equivalent parameterization in xo, yo; the user will
not notice the difference in the functions x, y. Assume that we put the class declaration
above in the file bcurve.h. The member functions can then be defined as below.
above on the le bcurve.h. The member functions can then be dened as below.
#include "bcurve.h"
double bcurve::xo( double p )
{
return p;
}
double bcurve::yo( double p )
{
return a0*sqrt(p) + a1*p;
}
bcurve::bcurve( double a_in, double b_in )
{
a0 = 0.3033;
a1 = 0.67;
a = a_in;
b = b_in;
dir = 0;
length = // numerical approximation of \int_a^b sqrt( xo'(p)^2 + yo'(p)^2 ) dp
}
double bcurve::x( double s )
{
// solve s = (1/length) \int_a^p sqrt( xo'(t)^2 + yo'(t)^2 ) dt for p
return xo(p);
}
double bcurve::y( double s )
{
// solve s = (1/length) \int_a^p sqrt( xo'(t)^2 + yo'(t)^2 ) dt for p
return yo(p);
}
The main program given previously can still be used unchanged with this modified class.
Assume that we want to use the parameter t instead of x, where the curve now is defined
by

x = t^2
y = a_0 t + a_1 t^2

This curve has the same graph as the curve (2.8), but a different parameterization (and
different parameter values at the end points). We could then make this change in the
functions xo, yo, but the main program (or any other function using the class) can remain
unchanged.
Remark: An efficient algorithm to solve the arc length equation

s = (1/L) ∫_a^p √(x'(t)^2 + y'(t)^2) dt

is Newton's method,

p_0 = guess
p_{k+1} = p_k - ( ∫_a^{p_k} √(x'(t)^2 + y'(t)^2) dt - sL ) / √(x'(p_k)^2 + y'(p_k)^2),   k = 0, 1, 2, . . .

where the iteration proceeds until the difference |p_{k+1} - p_k| is less than some given
tolerance. The integral is evaluated by, e.g., Simpson's method, where the interval [a, p_k]
is divided into a sufficiently large number of subintervals. It is possible to implement the
integral evaluation such that only the update ∫_{p_k}^{p_{k+1}} . . . dt needs to be computed in each
iteration. However, I have not investigated whether this is advantageous or not.
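A C++ sketch of this algorithm is given below. The fixed number of Simpson subintervals
and the unsafeguarded Newton iteration are simplifying assumptions; a production version
would adapt the quadrature and guard the iteration.

#include <cmath>
#include <functional>

// Simpson's rule on [lo,hi] with n (even) subintervals.
double simpson( const std::function<double(double)>& f,
                double lo, double hi, int n )
{
    double h = (hi - lo)/n, sum = f(lo) + f(hi);
    for( int i = 1; i < n; i++ )
        sum += ( i % 2 ? 4.0 : 2.0 )*f( lo + i*h );
    return sum*h/3;
}

// Solve s = (1/L) \int_a^p speed(t) dt for p by Newton's method, where
// speed(t) = sqrt( x'(t)^2 + y'(t)^2 ) is supplied by the caller.
double arclength_to_param( const std::function<double(double)>& speed,
                           double a, double b, double s, double tol = 1e-12 )
{
    double L = simpson( speed, a, b, 1000 );   // total arc length
    double p = a + s*(b - a);                  // initial guess
    for( int k = 0; k < 50; k++ )
    {
        double F  = simpson( speed, a, p, 1000 ) - s*L;  // want F(p) = 0
        double dp = F/speed(p);                          // F'(p) = speed(p)
        p -= dp;
        if( std::fabs(dp) < tol ) break;
    }
    return p;
}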
2.2 Dynamic allocation of objects
The declaration
bcurve bc(0,1.0);
allocates the object bc automatically, on the stack. It is common to use dynamic allocation of
objects instead. In order to do that, we first declare a pointer to the object
bcurve *bc;
The memory allocation is done by the statement
bc = new bcurve(0,1.0);
The parameters (0,1.0) are passed to the constructor, which is called automatically when
the object is allocated. To access a member function of the object, we can use the
expression
x[i] = (*bc).x(i/9.0);
The parentheses are necessary, because the member selection operator, ., has higher
priority than the dereference operator, *. There is an alternative way to write the same
expression, namely
x[i] = bc->x(i/9.0);
When an object is no longer needed, it can be deallocated by the delete operator,
delete bc;
Unused objects should be deallocated, since C++ does not have automatic garbage
collection. Programs that break during execution due to memory leaks are not uncommon
with C and C++.
2.3 Base classes and virtual functions
Consider the domain in Fig. 2.1 below. It consists of four sides: one is a circular arc, two
sides are straight lines, and one side is defined by spline points.
Fig. 2.1 Example of boundary curves.
We could think of introducing three C++ classes to describe the boundary: the class
circle to describe the arc, line to describe the straight line sides, and spline_curve
to describe the outer boundary. We could then write the class domain to describe the
entire domain. It contains the four boundaries as members, and the class could have the
capability to generate a grid on the domain.
class domain
{
private:
circle *side1;
spline_curve *side2;
line *side3, *side4;
double *x, *y;
.....
public:
...
void generate_grid( int n, int m );
....
};
The members x, y are arrays which will contain the grid.
The problem with this approach is that it lacks generality. Tomorrow, we might want
to generate a grid in a different domain. We would then have to rewrite the domain class,
in order to fit in other types of boundary curves.
We would like to have a way to write a general domain class in which we can declare
the boundary curves as
bcurve *sides[4];
and where the domain class does not need to care about which type of curve each side is.
This can be accomplished by using inheritance, which we now describe by a simple
example. Consider the class below, describing a polygon.
#include <iostream>
#include <cmath>
using namespace std;
struct point
{
double x;
double y;
};
class polygon
{
private:
int npts;
point *vertices;
double distance( point, point );
public:
polygon( int n, point *vert );
void translate( double dx, double dy );
void draw( Widget w, GC& gc );
void rotate( double alpha );
double perimeter();
};
We have introduced a data type point, describing a point in the plane. In C++, a struct
is the same thing as a class, except for the difference that all members are public by default
in a struct; in a class the default is to make a member private. For example, some of the
member functions are defined as
polygon::polygon( int n, point *vert )
{
npts = n;
vertices = new point[npts];
for( int i=0 ; i<npts ; i++ )
vertices[i] = vert[i];
}
double polygon::perimeter( )
{
double p=0;
for( int i=0 ; i<npts-1 ; i++ )
p += distance( vertices[i], vertices[i+1] );
p += distance( vertices[npts-1], vertices[0] );
return p;
}
The function distance gives the distance between two points in the plane.
Assume that we use this class to represent a square. The square has additional properties
that make it a very special type of polygon. For example, the perimeter of a
square can be computed more efficiently than in the general case above, by just taking 4
times the side. We want to exploit the special properties of the square. We do that by
introducing the class square as a derived class from the class polygon. The declaration of
square is given below.
class square : public polygon
{
double side;
public:
square( point vert[4] );
double perimeter();
};
square::square( point vert[4] ) : polygon( 4, vert )
{
side = distance( vert[0], vert[1] );
}
double square::perimeter()
{
return side*4;
}
The first line, class square : public polygon, means that the square class contains
the entire polygon class. We say that square inherits polygon. The public means that
all the public members in polygon are also public in square.
The constructor of square takes the four corner points as a parameter. Before the
body of square::square is executed, the constructor of the base class polygon is called.
This is the meaning of the notation : polygon( 4, vert ).
Note how we have defined a new version of the function perimeter in square. In order
for everything to work as intended we have to make a few small changes in the base class.
We change the base class to
class polygon
{
protected:
int npts;
point *vertices;
double distance( point, point );
public:
polygon( int n, point *vert );
void translate( double dx, double dy );
void draw( Widget w, GC& gc );
void rotate( double alpha );
virtual double perimeter();
};
We have changed private to protected. A derived class, such as square, can not access the
private variables in the base class; protected is used to allow derived classes to access
the variables. In other respects the variables under protected are inaccessible from the
outside.
The second change is that we have added the reserved word virtual to the function
perimeter. The meaning of virtual is to tell the compiler that this is a function which can
be redefined by a derived class.
In the main program we can now use the classes as usual,
point verts[5], sqverts[4];
...
polygon *p = new polygon( 5, verts );
square *s = new square( sqverts );
What is more important is that we can assign squares to polygons,
point verts[5], sqverts[4];
...
polygon *p = new polygon( 5, verts );
polygon *s = new square( sqverts );
// call to polygon::perimeter
cout << "perimeter of polygon is " << p->perimeter() << endl;
// call to square::perimeter
cout << "perimeter of square is " << s->perimeter() << endl;
If we had not declared perimeter to be a virtual function, the call s->perimeter() above
would have been a call to polygon::perimeter().
Although *s is of type polygon, it can be assigned a square. This is called polymorphism,
and is a powerful technique, which is often used to avoid alternative statements
like
switch( type )
{
case s : per = perimeter_square(); break;
case r : per = perimeter_rectangle(); break;
case c : per = perimeter_circle(); break;
default: per = perimeter_general();
}
Codes which rely on a special type variable to choose between cases, as in the example
above, are difficult to maintain and modify. With a good hierarchy of inheriting classes,
such switch statements can be replaced by a single call to a virtual function.
The assignment
square s;
polygon p;
p = s;
is thus allowed, but the opposite assignment s=p is forbidden. This is not strange, since
it is unlikely that a polygon is a square. However, a square is always a polygon.
2.4 Abstract base classes
In the previous example, the function polygon::perimeter() was replaced by a function
with the same name in the derived class square. It is often advantageous to instead define
a base class with unspecified virtual functions, a so called abstract base class. Consider
the example
class shape
{
private:
point *vertices;
int npts;
...
public:
shape( );
virtual void rotate( double alpha ) = 0;
virtual void translate( double dx, double dy ) = 0;
virtual double perimeter() = 0;
...
};
By using the notation = 0 after the function, we have here declared the functions rotate,
translate, and perimeter to be pure virtual functions. This means that the
functions are not defined in shape, but that they will be specified in classes which are
derived from shape.
No objects of the abstract base class can be created. The declaration of an object
shape sh;
is wrong, since the pure virtual functions are not defined in shape. An abstract class
can only be used as a base for other classes.
An abstract class can be used to create an interface to a class without giving any
information about the implementation.
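As a small illustration (separate from the exercise in the next section), a concrete class
can be derived from shape by defining all the pure virtual functions. The circle class
below is a hypothetical example, a minimal sketch rather than a complete design.

class circle : public shape
{
private:
    double xc, yc, radius;
public:
    circle( double x0, double y0, double r ) : xc(x0), yc(y0), radius(r) {}
    void rotate( double alpha ) { /* a circle is unchanged by rotation */ }
    void translate( double dx, double dy ) { xc += dx; yc += dy; }
    double perimeter() { return 2*3.14159265358979*radius; }
};

Since all pure virtual functions are defined, objects of type circle can be created, and a
circle can be assigned to a shape pointer.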
2.5 Programming exercise: A base class for boundary curves
We would now like to use the techniques described above to redesign the bcurve and
domain classes described in Section 2.3. First we define an abstract base class for boundary
curves. It should contain at least the following information:
class curvebase
{
protected:
double pmin, pmax; // minimum and maximum values of the curve parameter
int rev; // pmin to pmax or vice versa
virtual double xp( double p ) = 0; // x(p)
virtual double yp( double p ) = 0; // y(p)
virtual double dxp( double p ) = 0; // x'(p)
virtual double dyp( double p ) = 0; // y'(p)
double integrate( double a, double b ); // arc length integral
....
public:
curvebase() ; // constructor
double x( double s ); // arc length parameterization
double y( double s ); // arc length parameterization
....
};
Exercise 1: Complete the class by writing the non-virtual functions. Add more variables
or functions to the class, if you find it necessary.
Exercise 2: You will generate a grid on the domain in Fig. 2.2 below.
Fig. 2.2 Computational domain.
The corners are located at (-3,1), (3,0), (3,3), and (-3,3). The lower boundary is given
by the function y = 1/(e^{5x} + 1). (The lower corners do not match perfectly, but for all
practical purposes we can assume that 1/(e^{15} + 1) = 0 and 1/(e^{-15} + 1) = 1.) Derive
the classes that are needed to represent the boundary curves of the domain in Fig. 2.2 from
the base class. Test the classes by using them in a simple main program.
Exercise 3: Design a class domain as outlined in Section 2.3. The class should contain
four boundary curves of type curvebase, and have the capability to generate a grid on
the domain. Write a main program which generates the grid. Use the algebraic grid
generation formula (1.2).
Exercise 4: Add a function to the class domain to write the grid to a file. The simplest
is to use cout to write an ascii file. A better way is to use the unix functions open, write,
and close to output the grid in binary format. We give an example below of how they are
used to write a vector x consisting of n*m doubles.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
.....
int fd = open( "outfile.bin", O_CREAT | O_RDWR | O_TRUNC, 0660 );
int nr = write( fd, x, n*m*sizeof(double) );
close(fd);
Use the unix commands man open and man -s 2 write to obtain more information
about these functions. The grid can be viewed in Matlab. To read a binary file, use the
Matlab functions fopen and fread.
Exercise 5: This exercise is not compulsory; just do it if you have time, and would like
to make a better program. The four boundary curves are somehow input to the class
domain, e.g., through a constructor. You have probably up to now used some convention
of the type "the first argument to the constructor should be the lower boundary", etc. Add
a function order_boundaries to the class domain. This function should order a random
set of four boundaries such that bcurve[0] and bcurve[1] are always facing each other,
and have consistent coordinate directions, and the same for bcurve[2] and bcurve[3].
When this function is used, the boundary curves can be input to domain in any order,
with any direction of the parameter.
3 Finite difference approximations
3.1 Partial differential equations
Many phenomena in physics can be modeled by a system of partial differential equations.
For example, the Navier-Stokes equations for compressible fluid flow have the form

u_t + f(u)_x + g(u)_y + h(u)_z = f_v(u, u_x, u_y, u_z)_x + g_v(u, u_x, u_y, u_z)_y + h_v(u, u_x, u_y, u_z)_z + s(u)

The unknown u(x, y, z, t) is a vector containing the density, momenta, and energy of the
fluid. We use subscripts to denote partial derivatives, i.e., u_x = ∂u/∂x, u_y = ∂u/∂y, etc.
Bold face letters denote vector quantities. The equations should be supplemented by an
initial state at a starting time, and boundary conditions on the domain in which we want
to solve the system. The functions f, f_v, g, . . . are known. The left hand side terms f, . . .
are responsible for convective effects, while the right hand side terms f_v, . . . are responsible
for diffusive effects. The low order term s appears when chemical reactions are taking
place in the fluid.
Other equations from physics can be cast into the form above. Examples are Maxwell's
equations, the equations of magnetohydrodynamics, and the wave equation.
Difference approximations of such equations are often tested on simpler model
problems. Some simple model problems are the equations

u_t + au_x = 0,  t > 0,
u(0, x) = u_0(x),

and

u_t = bu_xx,  t > 0,  b > 0,
u(0, x) = u_0(x).

These are solved for t > 0, from a given function at t = 0. The wave equation u_t + au_x = 0
transports data along characteristic lines. The solution on an infinite domain is u(x, t) =
u_0(x - at). The heat equation u_t = bu_xx diffuses the data.
These equations represent features of the full problem. To see this, linearize f(u)_x ≈
Au_x, and similarly for the other terms. We then obtain

u_t + Au_x + Bu_y + Cu_z = Du_xx + Eu_xy + Fu_yy + Gu_xz + Hu_yz + Iu_zz + Ju

where the upper case letters denote matrices obtained by the linearization. Restrict this
to a scalar one dimensional problem,

u_t + au_x = bu_xx + cu.

This leads us to the study of the three separate problems

u_t + au_x = 0,   u_t = bu_xx,   u_t = cu

as indicated above.
Sometimes the steady problem is of interest. This is the pde obtained when u_t = 0,
e.g., f(u)_x + g(u)_y = 0. Elliptic problems arise in this way, as steady solutions of diffusion
problems, e.g.,

u_t = u_xx + u_yy  becomes  u_xx + u_yy = 0

at steady state. Another case of interest is the time periodic problem, where it is assumed
that u(x, t) = e^{iωt}v(x), for some fixed frequency ω. From this assumption, we obtain
eigenvalue problems, e.g.,

u_t = bu_xx  gives  v_xx = (iω/b)v.
3.2 Difference schemes
To solve an equation such as u_t + au_x = 0 numerically, we divide the x-axis into grid points
x_j = jΔx, j = 0, 1, . . . , n. We assume here that the grid is uniform, with a constant grid
spacing Δx. We introduce the time levels t_n = nΔt, with a uniform time step Δt. The
approximation of the exact solution u(x_j, t_n) is denoted u^n_j. A (linear) difference scheme
is a formula

Σ_{m=-q}^{1} Σ_{k=-p}^{r} c_{m,k} u^{n+m}_{j+k} = 0 (3.1)

from which we can compute the solution at t_{n+1}, u^{n+1}_j, j = 0, . . . , n, from given grid values
at t_n, t_{n-1}, . . . , t_{n-q}. An example of a difference scheme, approximating u_t + au_x = 0, is
u^{n+1}_j = (1/2)(u^n_{j+1} + u^n_{j-1}) - (aΔt/(2Δx))(u^n_{j+1} - u^n_{j-1})

which is called explicit, because the values u^{n+1}_j can be computed directly. On the other
hand, the scheme

u^{n+1}_j = u^n_j - (aΔt/(4Δx))(u^{n+1}_{j+1} - u^{n+1}_{j-1} + u^n_{j+1} - u^n_{j-1})

is called implicit, because computing u^{n+1}_j involves solving an algebraic system of
equations.
The order of accuracy of the difference scheme (3.1) is α in space and β in time, if
the order of the residual when the exact solution is inserted into the difference scheme is
(α, β), or more precisely,

Σ_{m=-q}^{1} Σ_{k=-p}^{r} c_{m,k} u(t_{n+m}, x_{j+k}) = O(Δt(Δx^α + Δt^β)).
Difference schemes can be derived by term wise application of difference operators.
Examples of difference operators are

D_0 u_j = (u_{j+1} - u_{j-1})/(2Δx) = u_x(x_j) + O(Δx^2)
D_+ u_j = (u_{j+1} - u_j)/Δx = u_x(x_j) + O(Δx)
D_04 u_j = (-u_{j+2} + 8u_{j+1} - 8u_{j-1} + u_{j-2})/(12Δx) = u_x(x_j) + O(Δx^4)
D_+ D_- u_j = (u_{j+1} - 2u_j + u_{j-1})/Δx^2 = u_xx(x_j) + O(Δx^2)
For example, the differential equation

u_t + au_x = bu_xx

can be approximated in space, to get the semi-discrete

du_j(t)/dt + aD_0 u_j = bD_+ D_- u_j.
We can solve in time using a second order accurate ode solver. Total second order
accuracy is then guaranteed, since each term is approximated to second order.
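As a sketch, the semi-discrete right hand side above is only a few lines of C++; the
periodic boundary treatment and the function name are illustrative assumptions.

#include <vector>

// Right hand side of du_j/dt = -a*D0 u_j + b*D+D- u_j on a uniform grid
// with spacing dx and periodic boundaries.
std::vector<double> semidiscrete_rhs( const std::vector<double>& u,
                                      double a, double b, double dx )
{
    int n = (int)u.size();
    std::vector<double> f(n);
    for( int j = 0; j < n; j++ )
    {
        int jp = (j+1) % n, jm = (j+n-1) % n;             // periodic neighbors
        double D0   = ( u[jp] - u[jm] )/(2*dx);           // u_x + O(dx^2)
        double DpDm = ( u[jp] - 2*u[j] + u[jm] )/(dx*dx); // u_xx + O(dx^2)
        f[j] = -a*D0 + b*DpDm;
    }
    return f;
}

Feeding this right hand side to any second order ode solver (e.g., Heun's method)
preserves the overall second order accuracy.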
An alternative way to derive difference schemes is to expand the formula (3.1) using
Taylor expansion, and zero out as many orders as possible in the truncation error. This
is a more general method, since some methods can not easily be derived by term wise
approximation. Take for example the Lax-Wendroff scheme. It is given by

u^{n+1}_j = u^n_j - (aΔt/(2Δx))(u^n_{j+1} - u^n_{j-1}) + (a^2Δt^2/(2Δx^2))(u^n_{j+1} - 2u^n_j + u^n_{j-1})

and has order of accuracy 2 in both space and time. The centered D_0 approximation of
u_x is clear, but the time derivative is not obviously second order. The scheme can be
derived from the expansion
u^{n+1} = u^n + Δt u^n_t + (Δt^2/2) u^n_tt + O(Δt^3).

The pde is then used to replace u_t by -au_x, and u_tt = -au_xt = a^2 u_xx, giving

u^{n+1}_j = u^n_j - aΔt u^n_x + (a^2Δt^2/2) u^n_xx + . . . .
Terms of order higher than two are thrown away, and the standard spatial approximations
D_0 and D_+D_- are used to obtain the final scheme. This technique gives good schemes,
but one has to be careful when the scheme is generalized. If, e.g., we want to add a
term bu_xx, or we want to approximate in two space dimensions, u_t + au_x + bu_y = 0,
we have to redo the derivation of the scheme. The original scheme was derived under
the assumption u_t = -au_x, which is no longer true. On the other hand, the term wise
approximation technique is very flexible when adding more terms to the pde. The new
terms are introduced directly into the difference scheme, approximated to the desired
accuracy.
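For concreteness, a single Lax-Wendroff time step can be coded as below; periodic
boundaries are again an illustrative simplification.

#include <vector>

// One Lax-Wendroff step for u_t + a u_x = 0; lam = a*dt/dx is the cfl number.
void lax_wendroff_step( std::vector<double>& u, double lam )
{
    int n = (int)u.size();
    std::vector<double> un = u;                // copy of time level n
    for( int j = 0; j < n; j++ )
    {
        int jp = (j+1) % n, jm = (j+n-1) % n;
        u[j] = un[j] - 0.5*lam*( un[jp] - un[jm] )
                     + 0.5*lam*lam*( un[jp] - 2*un[j] + un[jm] );
    }
}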
Some difference schemes give solutions which increase faster than exponentially in
time. These are unstable difference schemes. We have three different techniques for
analyzing stability:
1. Fourier method
2. Energy method
3. Normal mode analysis
In the Fourier method, periodic boundary conditions are assumed. The grid function, at
a fixed time, is expanded in a Fourier basis

u^n_j = Σ_{ω=-r}^{r} û^n_ω e^{iωx_j}

where 2r + 1 = n, the number of grid points. Difference operators become diagonal in the
new Fourier basis; some examples:

D_0 → i sin(ωΔx)/Δx
D_+ D_- → -4 sin^2(ωΔx/2)/Δx^2
The difference scheme is then transformed to an equation

û^{n+1}_ω = Q̂(ωΔx) û^n_ω

where the amplification factor Q̂ multiplies û^n_ω. Thus each Fourier mode can be analyzed
separately. The stability can now be investigated by inspecting the eigenvalues and
eigenvectors of the matrix Q̂ (Q̂ is a matrix if û^n_ω is a vector, otherwise it is a scalar
factor).
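For the Lax-Wendroff scheme, for example, the factor is the scalar Q̂(θ) = 1 - λ^2(1 - cos θ) - iλ sin θ,
with θ = ωΔx and λ = aΔt/Δx. The small self-contained program below (an illustrative
check, not from the notes) scans |Q̂| numerically and confirms that max|Q̂| ≤ 1 exactly
when λ ≤ 1.

#include <algorithm>
#include <cmath>
#include <cstdio>

// Scan the amplification factor of the Lax-Wendroff scheme,
//   Q(theta) = 1 - lam^2*(1 - cos(theta)) - i*lam*sin(theta),
// and report max |Q| over theta; the scheme is stable when max |Q| <= 1.
int main()
{
    for( double lam = 0.5; lam <= 1.51; lam += 0.5 )
    {
        double qmax = 0;
        for( double theta = 0; theta <= 3.1416; theta += 1e-3 )
        {
            double re = 1 - lam*lam*( 1 - std::cos(theta) );
            double im = -lam*std::sin(theta);
            qmax = std::max( qmax, std::sqrt( re*re + im*im ) );
        }
        std::printf( "lam = %4.2f   max|Q| = %.6f\n", lam, qmax );
    }
    return 0;
}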
When explicit methods are stable, they are so under an extra condition, the so called
cfl-condition. For the equation u_t + au_x = 0, the cfl-condition is

aΔt ≤ c_0 Δx

where c_0 is a constant, usually of size one. The exact stability limit of the cfl-condition
varies from scheme to scheme, but is derived from stability analysis. The parabolic equation
u_t = bu_xx has a cfl-condition of the type

bΔt ≤ c_0 Δx^2

which is more restrictive, since Δt is now proportional to Δx^2 << Δx. This may force
us to take very small time steps, smaller than we need for the accuracy requirements.
Implicit methods are usually unconditionally stable, i.e., stable for all (positive) values of
Δt, Δx. For implicit methods, only accuracy considerations decide the size of the time
step, but unfortunately the linear system of equations that appears can be very difficult
to solve when there are several space dimensions in the problem.
The energy method for analyzing stability has to be adapted to each problem
separately. It can handle stability for problems with non-periodic boundary conditions.
The normal mode method for analyzing stability is general, but complicated to apply.
It handles problems with non-periodic boundaries.
3.3 Non uniform grid
When the grid is not uniform, we have the possibility to generalize the difference schemes
in two different ways:
1. Discretization directly in the physical domain.
2. Grid mapping to the unit cube, and uniform discretization there.
We illustrate the difference between these two approaches by the problem of approximating
the derivative in one space dimension. Consider the grid in Fig. 3.1. The grid points x_j
are given. We let Δx_{j+1/2} = x_{j+1} - x_j denote the non uniform grid size.
Fig. 3.1. Non uniform grid.
Let us try to find a three point approximation of the derivative du/dx in the point
x_j. We start with a direct discretization in the physical coordinates. A direct Taylor
expansion of the expression

a_- u(x_{j-1}) + a_0 u(x_j) + a_+ u(x_{j+1})

gives
a_- = -Δx_{j+1/2} / (Δx_{j-1/2}(Δx_{j+1/2} + Δx_{j-1/2}))
a_0 = (Δx_{j+1/2} - Δx_{j-1/2}) / (Δx_{j-1/2} Δx_{j+1/2})
a_+ = Δx_{j-1/2} / (Δx_{j+1/2}(Δx_{j+1/2} + Δx_{j-1/2}))
(3.2)
if the first order errors in the expansions are set equal to zero. With these coefficients we
have

a_- u(x_{j-1}) + a_0 u(x_j) + a_+ u(x_{j+1}) = u_x(x_j) - (1/6) Δx_{j-1/2} Δx_{j+1/2} u_xxx(x_j) + . . .

Thus the truncation error is second order in the max grid parameter h =
max_j Δx_{j+1/2}. Note that here we have not relied on a grid mapping. The points can
be distributed in any way, and the formula will be second order.
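A direct implementation of the coefficients (3.2) could look as follows; a sketch, where
the endpoint treatment is omitted for brevity.

#include <vector>

// du/dx at the interior points of a non uniform grid, using the direct
// second order coefficients (3.2); du[0] and du[n-1] are left as zero here.
std::vector<double> ddx_direct( const std::vector<double>& x,
                                const std::vector<double>& u )
{
    int n = (int)x.size();
    std::vector<double> du( n, 0.0 );
    for( int j = 1; j < n-1; j++ )
    {
        double hm = x[j]   - x[j-1];           // dx_{j-1/2}
        double hp = x[j+1] - x[j];             // dx_{j+1/2}
        double am = -hp/( hm*(hp + hm) );
        double a0 = ( hp - hm )/( hm*hp );
        double ap =  hm/( hp*(hp + hm) );
        du[j] = am*u[j-1] + a0*u[j] + ap*u[j+1];
    }
    return du;
}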
In the alternative technique, using a grid mapping, we start instead by transforming

du/dx = (1/x'(s)) du/ds

Here x(s) (s ∈ [0, 1]) is the given grid mapping, and our grid is given by the points
x_j = x(j/n), j = 0, 1, 2, . . . , n. The second order approximation of the derivative is then
done by a standard centered difference,

u_x(x_j) ≈ (1/x'(s_j)) (u_{j+1} - u_{j-1})/(2Δs) (3.3)

where Δs = 1/n, and s_j = j/n. If the mapping is known only at the grid points, the
grid derivative can be approximated too, giving the formula

u_x(x_j) ≈ (u_{j+1} - u_{j-1})/(x_{j+1} - x_{j-1}). (3.4)

The second formula (3.4) gives in general a smaller truncation error than (3.3). In general,
it is advisable to approximate metric derivatives by the same formula as used for the
approximation of the derivatives of the function itself.
We now compare the truncation errors from the formulas (3.3) and (3.4). For (3.3)
we obtain

(1/x'(s_j)) (u_{j+1} - u_{j-1})/(2Δs) = (1/x')(u' + (Δs^2/6) u''' + . . .)
    = u_x(x_j) + (Δs^2/6) ((x'''/x') u_x + 3x'' u_xx + (x')^2 u_xxx) + . . .
(3.5)

The primes denote differentiation with respect to s. Thus the approximation is second
order in the inverse number of points (Δs), but only under the conditions that the grid
mapping is kept fixed as the number of grid points is increased, and that the grid mapping
has three continuous derivatives.
We expand the formula (3.4) first only in u with respect to x, and obtain
$$\frac{u_{j+1}-u_{j-1}}{x_{j+1}-x_{j-1}} = u_x(x_j)
+ \frac{1}{2}\bigl(\Delta x_{j+1/2}-\Delta x_{j-1/2}\bigr)u_{xx}
+ \frac{1}{6}\bigl(\Delta x_{j+1/2}^2 - \Delta x_{j-1/2}\,\Delta x_{j+1/2} + \Delta x_{j-1/2}^2\bigr)u_{xxx} + \ldots$$
We see how the grid smoothness enters this formula. The term $(\Delta x_{j+1/2}-\Delta x_{j-1/2})u_{xx}$ is
second order if the grid spacing changes smoothly, $\Delta x_{j+1/2} - \Delta x_{j-1/2} = O(\Delta x_{j+1/2}^2)$, but
otherwise this is a first order approximation. We now assume a smooth grid mapping,
and expand the grid function around $x_j$, to obtain
$$\frac{u_{j+1}-u_{j-1}}{x_{j+1}-x_{j-1}} = u_x(x_j) + \frac{\Delta s^2}{2}\,x''\,u_{xx} + \frac{\Delta s^2}{6}\,(x')^2\,u_{xxx} + \ldots$$
The term $u_{xxx}$ is the truncation error on a uniform grid; the other error term has to do
with the non-uniform grid. Note that the error term $x''\,u_{xx}$ is a second derivative, and
thus acts as a numerical diffusion. If $x''$ is large and has the wrong sign, this term may
be destabilizing. Compare this to (3.5). We see that using numerical evaluation of the
grid derivative gives one term less in the truncation error.
These results generalize to higher order of approximation. It is then important to
approximate the metric derivatives by the same order of accuracy as used in the approximation of the derivative of the function.
We finally show a numerical example of truncation errors. We have put a grid on the
unit interval [0, 1], and we have evaluated the derivative of the function sin x on the
grid by the two different numerical approximation techniques described above. The two
numerical experiments below are set up to test the limits of the grid mapping method. In
most cases, when a smooth, well resolved grid mapping is available, the difference between
the two approximation techniques is not very big.
Fig. 3.2. Non-uniform grid. Fig. 3.3. Truncation errors.
In Figure 3.2, we show the grid used in the rst experiment. The stretching function
was x(s) = (e
100s
1)/(e
100
1), there are 100 points in the domain, most of them
clustered near x = 0. In Figure 3.3, the error in the approximation of du/dx are given.
The error with smaller magnitude is obtained with the direct discretization given by the
coecients (3.2). The error with larger magnitude was obtained with the grid mapping
method, (u
j+1
u
j1
)/(x
j+1
x
j1
). The grid mapping is smooth, but under resolved, and
the direct approximation method gives an error which is one order of magnitude smaller
than with the grid mapping method.
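A sketch of how this experiment can be set up is shown below. The stretching function and the two formulas are those of the text, while the error measurement is just one possible choice; no error values are claimed here.

#include <cstdio>
#include <cmath>

// Sketch of the first experiment: derivative of sin(x) on the grid
// x(s) = (e^{100 s}-1)/(e^{100}-1), comparing the direct coefficients (3.2)
// with the grid mapping formula (3.4).
int main()
{
   const int n = 100;
   double x[n+1], u[n+1];
   for( int j = 0; j <= n; j++ )
   {
      double s = double(j)/n;
      x[j] = (std::exp(100.0*s) - 1.0)/(std::exp(100.0) - 1.0);
      u[j] = std::sin(x[j]);
   }
   double emax_direct = 0, emax_mapped = 0;
   for( int j = 1; j < n; j++ )
   {
      double hm = x[j]-x[j-1], hp = x[j+1]-x[j];
      double direct = -hp/(hm*(hp+hm))*u[j-1]
                    + (hp-hm)/(hm*hp)*u[j]
                    + hm/(hp*(hp+hm))*u[j+1];                // coefficients (3.2)
      double mapped = (u[j+1]-u[j-1])/(x[j+1]-x[j-1]);       // formula (3.4)
      double exact  = std::cos(x[j]);
      emax_direct = std::fmax(emax_direct, std::fabs(direct-exact));
      emax_mapped = std::fmax(emax_mapped, std::fabs(mapped-exact));
   }
   std::printf("max error, direct: %e  grid mapping: %e\n", emax_direct, emax_mapped);
   return 0;
}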
In a second experiment, we investigate the effect of a grid discontinuity.
Fig. 3.4. Grid with discontinuous spacing.
In Figure 3.4 we show the grid used in this experiment. We used the same two
difference formulas as in the previous experiment to compute the derivative of the
function sin x. The error of the direct approximation (3.2) is shown in Figure 3.5, and the
error of the grid mapping method in Figure 3.6. Of course, this is a case where we do not
expect the grid mapping method to work. In Figure 3.6 we see how it fails. An error of
size $10^{-3}$ appears at the spacing discontinuity, whereas the other method gives an error
of size $10^{-5}$ uniformly in the domain.
Fig. 3.5. Error in direct approximation. Fig. 3.6. Error in grid mapping method.
3.4 Several space dimensions
In two and three space dimensions, we can use the physical space approximation by
determining coefficients such that
$$\sum_{l,m} \alpha_{l,m}\, u_{i+l,j+m}$$
or, in three dimensions,
$$\sum_{l,m,n} \alpha_{l,m,n}\, u_{i+l,j+m,k+n}$$
are difference approximations of the derivative at the point $(x_{i,j,k}, y_{i,j,k}, z_{i,j,k})$. The summation is over the stencil, a limited number of points, such as, e.g., (l, m, n) from -1 to
1. How wide stencils do we need? We expand the two dimensional expression around
$(x_{i,j}, y_{i,j})$, and obtain
$$\sum_{l,m} \alpha_{l,m}\, u_{i+l,j+m}
= \sum_{l,m} \alpha_{l,m} \sum_{\nu=0}^{\infty} \frac{1}{\nu!}\Bigl((x_{i+l,j+m}-x_{i,j})\,\partial_x + (y_{i+l,j+m}-y_{i,j})\,\partial_y\Bigr)^{\nu} u$$
$$= \sum_{l,m} \sum_{\nu=0}^{\infty} \sum_{p=0}^{\nu} \alpha_{l,m}\, \frac{1}{\nu!} \binom{\nu}{p}\, (x_{i+l,j+m}-x_{i,j})^{p}\,(y_{i+l,j+m}-y_{i,j})^{\nu-p}\; \partial_x^{p}\,\partial_y^{\nu-p}\, u$$
From this we see that in order to have a first order approximation of the x-derivative, we
should require
$$\sum_{l,m} \alpha_{l,m} = 0, \qquad
\sum_{l,m} \alpha_{l,m}\,(x_{i+l,j+m}-x_{i,j}) = 1, \qquad
\sum_{l,m} \alpha_{l,m}\,(y_{i+l,j+m}-y_{i,j}) = 0,$$
that is, three linear equations. It is reasonable then to try using a three point stencil. In
order to have second order accuracy, three more equations are added for the three second
derivative terms $u_{xx}$, $u_{xy}$ and $u_{yy}$. Thus we have six equations, and a six point stencil
would be a natural choice. The number of nodes in a triangular mesh is perfectly suited
for this, since three points make up a triangle, and one triangle together with its three
neighbors provides six points.
The number of equations for a pth order accurate approximation is (p + 1)(p + 2)/2.
We expect a stencil for a pth order accurate approximation to contain that many grid
function values.
Note however, that on completely unstructured grids, one has to take care that the
systems are non-singular. A singularity occurs when the points line up in one single
direction: e.g., if all points are on a single line, then no derivatives normal to the line can
be approximated from these points. A sketch of the direct construction is given below.
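The sketch assembles the six equations for a second order approximation of $u_x$ and solves them numerically by Gaussian elimination. The stencil offsets are made-up example values; any six points in sufficiently general position give a non-singular system.

#include <cstdio>
#include <cmath>

// Gauss-Jordan elimination with partial pivoting on the 6x6 system,
// augmented with the right hand side in column 6.
static void solve( double A[6][7] )
{
   for( int c = 0; c < 6; c++ )
   {
      int piv = c;
      for( int r = c+1; r < 6; r++ )
         if( std::fabs(A[r][c]) > std::fabs(A[piv][c]) ) piv = r;
      for( int k = c; k < 7; k++ )
      { double t = A[c][k]; A[c][k] = A[piv][k]; A[piv][k] = t; }
      for( int r = 0; r < 6; r++ )
         if( r != c )
         {
            double m = A[r][c]/A[c][c];
            for( int k = c; k < 7; k++ ) A[r][k] -= m*A[c][k];
         }
   }
   for( int r = 0; r < 6; r++ ) A[r][6] /= A[r][r];
}

int main()
{
   // Offsets (x_{i+l,j+m}-x_{i,j}, y_{i+l,j+m}-y_{i,j}) of six stencil points,
   // an assumed example configuration (cf. Fig. 3.8).
   double dx[6] = { 0.0, 1.0, -1.1, 0.1, -0.2, 0.9 };
   double dy[6] = { 0.0, 0.1, -0.1, 1.0, -1.2, 1.1 };
   double A[6][7] = {{0}};
   for( int k = 0; k < 6; k++ )
   {
      A[0][k] = 1.0;                 // sum alpha        = 0
      A[1][k] = dx[k];               // sum alpha dx     = 1
      A[2][k] = dy[k];               // sum alpha dy     = 0
      A[3][k] = 0.5*dx[k]*dx[k];     // u_xx term        = 0
      A[4][k] = dx[k]*dy[k];         // u_xy term        = 0
      A[5][k] = 0.5*dy[k]*dy[k];     // u_yy term        = 0
   }
   A[1][6] = 1.0;                    // right hand side
   solve( A );
   for( int k = 0; k < 6; k++ )
      std::printf("alpha[%d] = % f\n", k, A[k][6]);
   return 0;
}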
We now describe the alternative method, using a grid mapping. In the grid mapping
method we transform the pde to the unit cube, and approximate it there. As an example
we consider the pde
$$u_t + f(u)_x + g(u)_y + h(u)_z = 0 \qquad (3.6)$$
Assume that we are given the grid mapping x(q, r, s), y(q, r, s), z(q, r, s), from the unit
cube to the physical domain ($(q, r, s) \in [0,1]^3$). The chain rule gives the following
relations
$$\partial_x = q_x\,\partial_q + r_x\,\partial_r + s_x\,\partial_s, \qquad
\partial_y = q_y\,\partial_q + r_y\,\partial_r + s_y\,\partial_s, \qquad
\partial_z = q_z\,\partial_q + r_z\,\partial_r + s_z\,\partial_s.$$
The metric coefficients $q_x$, $q_y$, etc. can not be computed directly. Instead, from the chain
rule we have the inverse relation
$$\partial_q = x_q\,\partial_x + y_q\,\partial_y + z_q\,\partial_z, \qquad
\partial_r = x_r\,\partial_x + y_r\,\partial_y + z_r\,\partial_z, \qquad
\partial_s = x_s\,\partial_x + y_s\,\partial_y + z_s\,\partial_z,$$
which we use to obtain the relation
$$\begin{pmatrix} q_x & r_x & s_x \\ q_y & r_y & s_y \\ q_z & r_z & s_z \end{pmatrix}
= \begin{pmatrix} x_q & y_q & z_q \\ x_r & y_r & z_r \\ x_s & y_s & z_s \end{pmatrix}^{-1}$$
where now the right hand side matrix can be computed numerically from the given grid,
or grid mapping. Let J denote the determinant of the matrix
$$\begin{pmatrix} x_q & y_q & z_q \\ x_r & y_r & z_r \\ x_s & y_s & z_s \end{pmatrix}.$$
We can then write the explicit form of the inverted matrix as
$$\begin{pmatrix} q_x & r_x & s_x \\ q_y & r_y & s_y \\ q_z & r_z & s_z \end{pmatrix}
= \frac{1}{J}\begin{pmatrix}
y_r z_s - z_r y_s & z_q y_s - y_q z_s & y_q z_r - z_q y_r \\
z_r x_s - x_r z_s & x_q z_s - z_q x_s & z_q x_r - x_q z_r \\
x_r y_s - y_r x_s & y_q x_s - x_q y_s & x_q y_r - x_r y_q
\end{pmatrix} \qquad (3.7)$$
The pde (3.6) becomes
$$u_t + q_x f(u)_q + q_y g(u)_q + q_z h(u)_q
+ r_x f(u)_r + r_y g(u)_r + r_z h(u)_r
+ s_x f(u)_s + s_y g(u)_s + s_z h(u)_s = 0.$$
Because of the relations
$$(Jq_x)_q + (Jr_x)_r + (Js_x)_s = 0, \qquad
(Jq_y)_q + (Jr_y)_r + (Js_y)_s = 0, \qquad
(Jq_z)_q + (Jr_z)_r + (Js_z)_s = 0, \qquad (3.8)$$
we can write the differential equation in the original conservation law form
$$u_t + \frac{1}{J}\Bigl(\bigl(J(q_x f(u) + q_y g(u) + q_z h(u))\bigr)_q
+ \bigl(J(r_x f(u) + r_y g(u) + r_z h(u))\bigr)_r
+ \bigl(J(s_x f(u) + s_y g(u) + s_z h(u))\bigr)_s\Bigr) = 0.$$
The relations (3.8) can be proved by inserting the expressions (3.7). New fluxes are defined
by $\tilde f = Jq_x f(u) + Jq_y g(u) + Jq_z h(u)$, etc., and we write
$$(Ju)_t + \tilde f_q + \tilde g_r + \tilde h_s = 0$$
now in the uniform (q, r, s) coordinates. The finite difference approximation can now
be done on a uniform grid. The transformed problem has variable coefficients, but is
otherwise exactly like the original pde. As pointed out previously, in order for this to
work, the grid mapping has to be sufficiently smooth.
For the single derivative $\partial u/\partial x$, we have two equivalent expressions, the conservative
$$u_x = \frac{1}{J}\bigl((Jq_x u)_q + (Jr_x u)_r + (Js_x u)_s\bigr)$$
and the non-conservative
$$u_x = q_x u_q + r_x u_r + s_x u_s.$$
When these formulas are discretized, it is advantageous to use the same difference
formula for metric derivatives and function derivatives. Especially for the conservative
form this is important. We consider a two dimensional case in order to simplify the
formulas. Let us approximate the derivative
$$u_x = \frac{1}{J}\bigl((y_r u)_q - (y_q u)_r\bigr).$$
If u is constant, the formula degenerates to
$$u_x = \frac{1}{J}\,(y_{rq} - y_{qr})\,u,$$
which is zero because $y_{rq} = y_{qr}$. We want this property to hold for the numerical approximation as well. We use the difference operators $D^M_i$ and $D^M_j$ for the metric derivatives.
E.g., we could use
$$y_q \approx D^M_i y_{i,j} = (y_{i+1,j} - y_{i-1,j})/2, \qquad
y_r \approx D^M_j y_{i,j} = (y_{i,j+1} - y_{i,j-1})/2.$$
For the function derivative approximation we use operators $D^F_i$ and $D^F_j$. The derivative
is then approximated by
$$u_x \approx \frac{1}{J_{i,j}}\Bigl(D^F_i\bigl(D^M_j(y_{i,j})\,u_{i,j}\bigr) - D^F_j\bigl(D^M_i(y_{i,j})\,u_{i,j}\bigr)\Bigr).$$
If u is constant, the approximation becomes
$$\frac{1}{J_{i,j}}\Bigl(D^F_i D^M_j(y_{i,j}) - D^F_j D^M_i(y_{i,j})\Bigr)\,u.$$
It is clear that we should require $D^M = D^F$ in order for this expression to become zero.
If different operators are used for the metric and the function, the approximation is still
accurate on smooth grids, but it is not certain that the derivative of constants is zero.
This is important in steady state computations, where it is often advantageous to have a
constant as exact solution of the steady problem. Furthermore, using $D^M = D^F$ removes
one term from the truncation error. The better accuracy is most clearly visible on grids
which have strongly varying grid spacing, so that the grid is close to non-smooth. If the
grid is very smooth, and no particular steady state properties are desired, using $D^M \ne D^F$
does not make a big difference. Nevertheless it is advisable to use such a discretization,
since it then gives good performance on a larger class of problems. Furthermore, in order
to have order of accuracy p, it is necessary to use order p in both the metric derivative
and the function derivative approximation. Using the same operator is then natural. A
small numerical check of the constant-preservation property is sketched below.
In summary, what we required from the two dimensional formulas was that the rule
$(y_r)_q - (y_q)_r = 0$ holds for the numerical approximation, where the innermost derivatives
come from the approximation of the metric derivatives, and the outermost derivatives
come from the flux derivative approximation. Similarly, in the three dimensional case
we can ascertain that constants have numerical derivative zero by requiring that (3.8)
holds numerically for the approximation. In order to do this, it is necessary to rewrite
the elements of the metric derivative matrix (3.7) as
$$y_r z_s - z_r y_s = \tfrac{1}{2}(y_r z - z_r y)_s - \tfrac{1}{2}(y_s z - z_s y)_r$$
$$z_q y_s - y_q z_s = \tfrac{1}{2}(z_q y - y_q z)_s - \tfrac{1}{2}(z_s y - y_s z)_q$$
$$y_q z_r - z_q y_r = \tfrac{1}{2}(y_q z - z_q y)_r - \tfrac{1}{2}(y_r z - z_r y)_q$$
Only the first row of (3.7) is shown; the other two rows are rewritten similarly. If the
metric derivatives are approximated on this conservative form, it follows that (3.8) holds
for the approximation, under the condition of commuting operators for the flux and grid
derivative approximations, e.g., that the same difference operator is used for the grid and
flux approximations.
It is not difficult to generalize the transformation formulas to the case of moving grids.
We then let the grid transformation depend on time, x(q, r, s, t), y(q, r, s, t), z(q, r, s, t), and
transform as before. The matrices above now become of size 4 × 4, and the derivation of
the transformed relations is straightforward.
What is the gain of the grid mapping method, if we compare with the physical space
discretization? With the grid mapping approach, we need fewer points in the stencils.
This leads to better computational efficiency.
Fig. 3.7. Stencil second order grid mapped. Fig. 3.8. Stencil second order direct.
For second order accuracy in two space dimensions, we can use three grid points for
the x derivative and three grid points for the y derivative. The stencil for approximating
$f_x + g_y$ has the shape of a cross, and contains five points, see Figure 3.7. In the physical
space discretization, described at the beginning of this section, we need six points for
second order accuracy, i.e., one point more, for example the stencil in Figure 3.8. The
difference becomes larger when the order of accuracy is increased: for example, a fourth
order approximation of $f_x + g_y$ can be done with 9 points in the grid mapped approach,
while the physical space discretization requires at least 15 points in the stencil. For pth
order accuracy, we need a (p+1)(p+2)/2 point stencil in the direct discretization, and a
2p+1 point stencil for the grid mapping method (p+1 in each direction, counting the center
point only once).
Fig. 3.9. Smooth grid.
We conclude with some numerical examples in two dimensions. In the first example,
we use a smooth grid mapping. A part of the grid is shown in Figure 3.9. Figures 3.10
and 3.11 show the error in approximating the derivative of u = sin x cos y on the
grid. We implemented the direct discretization method by numerically solving the 6 × 6
system obtained for second order accuracy, as described at the beginning of this section.
The stencil was the one in Figure 3.8.
Fig. 3.10. Error, second order grid mapped. Fig. 3.11. Error, second order direct.
The direct discretization method in Figure 3.11 has a somewhat smaller error than
the grid mapping method. The scalings of the two plots are somewhat different; maximum
norm errors are 0.004 in Fig. 3.10 and 0.001 in Fig. 3.11.
In the second example we make the same computation as in the previous example,
except that the grid is now the severely stretched grid shown in Figure 3.12. This grid
was made to resolve boundary layers in the Navier-Stokes equations. Here we expect
problems for the grid mapping method, and as can be seen in Figures 3.13 and 3.14,
the error in the direct discretization method is much smaller. Note that the scalings are
completely different in the two figures. The maximum norm of the error is $8.8 \cdot 10^{-6}$ for
the direct discretization and 0.082 for the grid mapping. The grid was generated by transfinite
interpolation, without additional smoothing (see Chapter 1). In Figure 3.13, we see
clearly the problem with the kink at the internal corner, propagating into the domain.
The error level away from the troublesome spots is about four orders of magnitude higher in the
grid mapping discretization. We show the error along one radial grid line in Figure 3.15.
This grid line is located towards the rear of the domain, away from the kink. The dashed
line is the direct discretization.
Fig. 3.12. Severely stretched grid.
Fig. 3.13. Error, second order direct. Fig. 3.14. Error, second order grid mapped.
Fig. 3.15. One dimensional cut. Errors for the two methods (grid mapping vs. direct).
4 Implementation
4.1 Program structure
The traditional CFD code is a fortran 77 program, having a fairly long main program
consisting of the following steps:
1. Read numerical and physical problem parameters.
2. Read the problem size and allocate memory.
3. Read the grids, and information on boundary conditions.
4. Get initial data.
5. Iterate in time, either a fixed number of steps, or to a specific time, or until a residual
becomes small. Each iteration consists of evaluating the difference approximation,
and imposing boundary conditions.
6. Write out the result.
The iteration is not necessarily time stepping, but can be any iterative solver suitable for
the problem.
It is not very difficult to add a command driven interface to such a code. One can
define a small number of commands corresponding to the different steps above. The main
program would then have the following structure
get(cmd);
while( notequal(cmd,"quit") )
{
if( equal(cmd,"readparameters") )
readparameters();
else if( equal(cmd,"allocate") )
allocate();
else if( equal(cmd,"readgrid") )
readgrid();
else if( equal(cmd,"initialdata") )
initialdata();
else if( equal(cmd,"step") )
step();
else if( equal(cmd,"writesolution") )
writesolution();
else if( equal(cmd,"readsolution") )
readsolution();
get(cmd);
}
where for the step command we could give the possibility of a numeric argument specifying the number of steps (iterations) that should be taken. The command style main
program leads to greater flexibility, especially useful in the debugging phase of code
development. Furthermore, this new main program can be written in C++, making it
easy to allocate memory dynamically. The functions called consist of different parts of
the original fortran program, that has been split into smaller pieces. On most computers,
it is possible to call fortran subroutines from C++.
We will examine some of the parts of the code separately.
4.2 Memory allocation
Let us investigate memory allocation for a multi-block pde problem, with a static configuration. Assume that we have a grid generating program which produces a multi-block
grid which is written to a data file. Assume, furthermore, that we are given the old
fortran77 subroutines GETNG, GETDIM, GETGRD, which make it possible to read the grid
data file. We are also given the routine ONESTEP, which performs one time step. These
routines are documented below.
subroutine GETNG( ng )
c reads the number of grid blocks stored on a data file.
c Output: ng - number of blocks
c
integer ng
........
end
subroutine GETDIM( n, g )
c Reads the dimensions of one block g from the file.
c Input: g - The number of the grid for which the information is wanted.
c Output: n - The number of grid points in the three dimensions.
c
integer n(3)
.....
end
subroutine GETGRD( n1, n2, n3, grid, g )
c read grid g from the data file.
c Input: n1, n2, n3 - The number of grid points in the three dimensions
c g - The number of the grid for which the information is wanted.
c Output: grid - The grid points, grid(1,:,:,:) are x-coordinates,
c grid(2,:,:,:) are y-coordinates, grid(3,:,:,:) are z-coordinates.
integer n1, n2, n3
double precision grid(3,n1,n2,n3)
........
end
subroutine ONESTEP( n1, n2, n3, grid, w )
c Take one time step w := F(w)
c Input: n1, n2, n3 - The number of grid points in the three dimensions.
c grid - The grid.
c w - The solution, a vector of 5 components in each grid point.
c Output: w - w is updated with the value at the next time.
c
integer n1, n2, n3
double precision grid(3,n1,n2,n3), w(5,n1,n2,n3)
........
end
The unknown is an array of 5 components in each grid point, e.g., for fluid flow w could
represent (density, x-, y-, z-velocity, pressure) in each grid point.
Fortran77 does not have dynamic memory capabilities, thus many codes have a fixed
maximum problem size. It is necessary to recompile the main program when the maximum
size is increased.
A scalar quantity defined on a three dimensional multi-block grid could in principle be
declared as an array with four indices; however, this is very wasteful with memory, since
the dimensions must be set to the maximum value over all blocks.
Instead, we make the main program such that there is only one size parameter. All
arrays are allocated as one dimensional in the main program, and passed to subroutines
where they are received as multi dimensional arrays. Pointers are set up as integers giving
locations in the one dimensional data arrays.
program P3
integer size, mxblock
parameter( size=1000000, mxblock=60 )
double precision w(5*size), grid(3*size)
integer pw(mxblock), pgrid(mxblock)
integer dim(3,mxblock)
integer ng, g
call GETNG( ng )
if( ng.gt.mxblock )then
write(*,*) 'out of memory'
stop
endif
do g=1,ng
call GETDIM( dim(1,g), g )
enddo
pw(1) = 1
pgrid(1) = 1
do g=2,ng
pw(g) = pw(g-1) + 5*dim(1,g-1)*dim(2,g-1)*dim(3,g-1)
pgrid(g) = pgrid(g-1) + 3*dim(1,g-1)*dim(2,g-1)*dim(3,g-1)
enddo
c should check that not out of memory here....
do g=1,ng
call GETGRD( dim(1,g), dim(2,g), dim(3,g), grid(pgrid(g)), g )
enddo
c ..... etc .....
do i=1,nsteps
do g=1,ng
call ONESTEP( dim(1,g), dim(2,g), dim(3,g),
* grid(pgrid(g)), w(pw(g)) )
enddo
enddo
c ....
Code 4.1. Outline of fortran 77 main program.
The example Code 4.1 shows a typical fortran77 main program. We use the fact that formal array
parameters to subroutines can have integer parameters as index bounds, and that arrays
in the calling program do not need to have the same shape as the corresponding formal
parameter.
In a language like C++, memory allocation is straightforward. A C++ version is
shown in Code 4.2. Here the pointer data type is used to allocate dynamically one
pointer for each grid. Each of these pointers points to the array on the particular grid.
int main()
{
double **grid, **w;
int **dim, ng, g, nsteps;
getng( &ng );
// Allocate pointers
dim = new int*[ng];
grid = new double*[ng];
w = new double*[ng];
// Read dimensions and allocate arrays
for( g=0; g < ng ; g++ )
{
dim[g] = new int[3];
getdim( dim[g], &g );
grid[g] = new double[3*dim[g][0]*dim[g][1]*dim[g][2]];
w[g] = new double[5*dim[g][0]*dim[g][1]*dim[g][2]];
}
// Read grid
for( g=0; g < ng ; g++ )
getgrd( &dim[g][0], &dim[g][1], &dim[g][2], grid[g], &g );
// etc...
for( int i=1;i<=nsteps;i++ )
for( g = 0; g<ng ; g++ )
onestep( &dim[g][0], &dim[g][1], &dim[g][2], grid[g], w[g] );
....
}
Code 4.2. Outline of C++ main program.
In Code 4.2, we call fortran 77 subroutines from C++. We must then put a declaration,
extern "C" void getgrd( int*, int*, int*, double*, int* );
before the main program. This declaration says that getgrd is compiled externally
(extern "C"). Remember that fortran77 only has call-by-reference parameters; the parameters must therefore be declared as pointers. In the call to the function, the address-of
operator is used to pass a pointer. On some unix systems (e.g. SUN solaris) the fortran
compiler adds an underscore after the function name, on other systems (IBM Aix) there
is no underscore on fortran subroutine names. This is normally dealt with through the
cpp preprocessor by writing for example
#ifdef IBM
#define GETGRD getgrd
#endif
#ifdef SUN
#define GETGRD getgrd_
#endif
...
extern "C" void GETGRD( int*, int*, int*, double*, int* );
If the entire code is written in C++, one usually starts with defining an array class, which
becomes the basic data structure for the solver. We will discuss this more later.
In fortran 90, dynamic memory allocation is straightforward; it is shown in Code 4.3.
program P3
!
implicit none
!
!
!
type wtype
real(kind=kind(0d0)), dimension(:,:,:,:), pointer :: wp
end type
!
!
type(wtype), dimension(:), allocatable :: w, grid
integer, dimension(:,:), allocatable :: dim
integer ng, g
!
!
call GETNG( ng )
!
allocate( dim(3,ng) )
!
allocate( w(ng), grid(ng) )
!
do g=1,ng
call GETDIM( dim(1,g), g )
enddo
!
do g = 1, ng
allocate( grid(g)%wp(3, dim(1,g), dim(2,g), dim(3,g)) )
allocate( w(g)%wp(5, dim(1,g), dim(2,g), dim(3,g)) )
end do
!
do g = 1, ng
! read in grid
call GETGRD( dim(1,g), dim(2,g), dim(3,g), grid(g)%wp, g )
enddo
..... etc .....
do i=1,nsteps
do g=1,ng
call ONESTEP( dim(1,g), dim(2,g), dim(3,g), grid(g)%wp, w(g)%wp )
enddo
enddo
....
Code 4.3. Outline of fortran 90 main program.
In programs which use adaptive refinement, grids can be added and deleted during the
computation. For such programs, dynamical data structures are necessary. The advantage
of a language like C++ then becomes more pronounced. It is possible to make a
linked list, or a tree, of objects, where each node contains a grid variable and a w variable,
as in the sketch below.
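A minimal sketch of such a structure is given here; the class and member names are made up for the illustration.

#include <cstddef>

// A singly linked list where each node owns one grid and the solution w on it.
struct GridNode
{
   int       ni, nj, nk;
   double   *grid;        // 3*ni*nj*nk coordinates
   double   *w;           // 5*ni*nj*nk solution components
   GridNode *next;

   GridNode( int n1, int n2, int n3, GridNode *tail )
      : ni(n1), nj(n2), nk(n3), next(tail)
   {
      grid = new double[3*ni*nj*nk];
      w    = new double[5*ni*nj*nk];
   }
   ~GridNode() { delete [] grid; delete [] w; }
};

// Add a refined grid at the head of the list.
GridNode *add_grid( GridNode *head, int n1, int n2, int n3 )
{
   return new GridNode( n1, n2, n3, head );
}

// Delete the whole list, e.g. at the end of the computation.
void delete_all( GridNode *head )
{
   while( head != NULL ) { GridNode *t = head->next; delete head; head = t; }
}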
4.3 Array classes in C++
In languages permitting array operations, like fortran90 or matlab, programs become
shorter and easier to write and understand; see examples at the end of this chapter.
In C++ we have the possibility to use the same type of array operations, by defining an
array class, for which we can define the arithmetic operators to do array operations. An
example is given below.
class arc
{
double *w;
int ni, nj, nk;
public:
inline double ind(int i, int j, int k )
{
return w[i-1 + ni*(j-1) + ni*nj*(k-1)];
}
arc( int, int, int );
arc operator+( arc a1 );
};
arc::arc( int nii, int nji, int nki )
{
ni = nii;
nj = nji;
nk = nki;
w = new double[ni*nj*nk];
}
arc arc::operator+( arc a1 )
{
arc res( a1.ni, a1.nj, a1.nk );
for( int i=0 ; i<a1.ni*a1.nj*a1.nk ; i++ )
res.w[i] = a1.w[i] + w[i];
return res;
}
We have two possibilities to operate on the arrays. The first is to use array operations, as in the
main program below, where we assume that the class declaration has been put in the file
arc.h.
#include "arc.h"
int main()
{
arc a(10,10,10), b(10,10,10), c(10,10,10); // Calls to constructor
...
c = a+b;
...
}
We can also access the elements of the array through the index function ind, as in the
example below.
#include "arc.h"
int main()
{
arc a(10,10,10), b(10,10,10); // Calls to constructor
int ni=10, nj=10, nk=10;      // loop bounds matching the constructor arguments
...
for( int k = 1 ; k<=nk ; k++ )
for( int j = 1 ; j<=nj ; j++ )
for( int i = 1 ; i<=ni ; i++ )
cout << "a(" << i << "," << j << "," << k << ") = " << a.ind(i,j,k) << endl;
...
}
The inline declaration of ind makes it possible to use the index function without suffering the penalty on efficiency caused by having a function call in the innermost loop.
We have enumerated the elements in columnwise order (as in fortran), and we let the
lower bound be one instead of zero. It is not difficult to verify that the mapping between
the three dimensional index space $1 \le i \le ni$, $1 \le j \le nj$, $1 \le k \le nk$, and the one
dimensional $0 \le ind \le ni \cdot nj \cdot nk - 1$ is
$$ind = (i-1) + ni\,(j-1) + ni\,nj\,(k-1).$$
We can improve the syntax of the indexation, if we realize that the function call operator
() can be overloaded to act on the array class as follows,
class arc
{
double *w;
int ni, nj, nk;
public:
inline double operator()(int i, int j, int k )
{
return w[i-1 + ni*(j-1) + ni*nj*(k-1)];
}
arc( int, int, int );
arc operator+( arc a1 );
};
With this definition, it is possible to use the fortran style syntax array(i,j,k) to access
an element of the array, as shown in the example below,
#include "arc.h"
int main()
{
arc a(10,10,10), b(10,10,10); // Calls to constructor
int ni=10, nj=10, nk=10;      // loop bounds matching the constructor arguments
...
for( int k = 1 ; k<=nk ; k++ )
for( int j = 1 ; j<=nj ; j++ )
for( int i = 1 ; i<=ni ; i++ )
cout << "a(" << i << "," << j << "," << k << ") = " << a(i,j,k) << endl;
...
}
In this way, old fortran code segments containing long arithmetic operations on array
elements can be copied unchanged to a new C++ code.
It is possible to provide index bound checking, by providing a safer (but slower) index
operator
inline double operator()(int i, int j, int k )
{
if( i<1 || i>ni || j<1 || j>nj || k<1 || k>nk )
{
cout << "index out of bounds" << i << " " << j << " " << k << endl;
exit(-1);
}
return w[i-1 + ni*(j-1) + ni*nj*(k-1)];
}
It is advisable to use such bound checking during code development. When the
program is debugged, the bound checking can be removed to improve the efficiency.
There is one remaining problem with our array class: we can not use a call to the index
function as an lvalue. For example, we would like to write
w(i,j,k) = a(i+1,j,k)
but w(i,j,k) is a function call, and can not be assigned to. We resolve this difficulty by
defining the return type as a reference
inline double& operator()(int i, int j, int k )
{
return w[i-1 + ni*(j-1) + ni*nj*(k-1)];
}
The return type can now be used as an lvalue, so that the statement
w(i,j,k) = a(i+1,j,k)
is valid, and means that element (i,j,k) in array w is assigned the value of element
(i+1,j,k) in array a.
One has to be careful with assignments of arrays. In C++, the assignment between
two objects
w = a;
means that each member variable is copied. If w and a are of type arc above, this means
that the integers ni, nj, nk and the pointer w are copied from a to w. However, it does
not mean that the array pointed to by w is copied. After the assignment, w and a will
contain one pointer each, both pointing to the same location in memory. In order to have
a copy of the elements of the array, it is necessary to overload the assignment operator.
In class arc we can write the function
void arc::operator=( arc a )
{
if( a.ni != ni || a.nj != nj || a.nk != nk )
cout << "incompatible array sizes" << endl;
else
for( int ind = 0 ; ind < ni*nj*nk ; ind++ )
w[ind] = a.w[ind];
}
Note that we make the copy with a one dimensional loop, for better efficiency. In order to
allow the three way assignment a=b=c, we should return an object of type arc, instead
of having the return type void as in the example above.
The principle of encapsulation in object oriented programming can be illustrated by
the example of computing metric coefficients. We assume a two dimensional problem,
where we want to store the metric coefficients $y_r$, $y_s$, $x_r$, $x_s$ in a 2×2 matrix in each
grid point, represented by the array met(2,2,ni,nj), where ni, nj are the number of
grid points in the coordinate directions. We can imagine an array class containing the
grid,
class grid
{
double *x, *y;
double *metric;
int ni, nj;
public:
grid( char* filename )
{
// Precompute metric coefficients and store in array metric.
}
inline int ind( int i , int j )
{
return i-1 + ni*(j-1);
}
inline double met( int m, int k, int i, int j )
{
return metric[m-1 + 2*(k-1) + 4*(i-1) + 4*ni*(j-1)];
}
...
Assume that we instead want to recompute the coefficients each time they are used. We
then change the function met to
inline double met( int m, int k, int i, int j )
{
if( m == 1 && k == 1 )
return (y[ind(i+1,j)]-y[ind(i-1,j)])/2;
else if( m == 1 && k == 2 )
return -(x[ind(i+1,j)]-x[ind(i-1,j)])/2;
....
}
This change is done locally in the class grid, and parts of the code where grid is used
do not have to know which version of met is being used. Furthermore, if the
grid is rectangular with a uniform step, we can write another version of met as follows
inline double met( int m, int k, int i, int j )
{
if( m == 1 && k == 1 )
return 0;
else if( m == 1 && k == 2 )
return -dx;
....
}
where we assume that the spacings in the x- and y-directions are stored in the class as
variables dx and dy. Again this is a change that can be done internally to grid without
affecting any code which uses the class, thereby giving good encapsulation.
The technique with array classes and indexation functions has its drawbacks. The
following example was run on a SUN workstation using the CC compiler, and the optimization
flag -O3. First a one dimensional array
#include <iostream.h>
class A
{
int ni;
double* v;
public:
inline double& ind( int i )
{
return v[i-1];
}
A(int i);
~A()
{
delete [] v;
}
};
A::A( int i )
{
ni = i;
v = new double[ni];
}
int main()
{
A vekt(493039);
for( int i=1; i<=493039; i++ )
vekt.ind(i) = 1;
}
The execution time was 0.09 seconds. We next change the example to a three dimensional
array,
#include <iostream.h>
class A
{
int ni, nj, nk;
double* v;
public:
inline double& ind( int i, int j, int k )
{
return v[i-1 + ni*(j-1) + ni*nj*(k-1)];
}
A(int i, int j, int k );
};
A::A( int i, int j, int k )
{
ni = i; nj = j; nk = k;
v = new double[ni*nj*nk];
}
int main()
{
A vekt(79,79,79);
for( int k=1; k<=79; k++ )
for( int j=1; j<=79; j++ )
for( int i=1; i<=79; i++ )
vekt.ind(i,j,k) = 1;
}
The problem size is the same, since $79^3 = 493039$, but the execution time to fill the array
with a number is now 0.35 seconds, i.e., 4 times bigger. I can think of the following
possible explanations for the degradation of performance.
Copying of arguments in the function call. Although the function is inlined, there
might still be some copying of arguments to avoid conflict with names in the main
program.
The compiler can not know that we will step through the entire array in sequence;
instead of updating the index by one, the total index computation is redone for each
step.
Potentially, ni, nj and nk could change value between two calls to ind, thus the
expression ni*nj is recomputed between the calls.
There is nothing inherent in the language that makes the three dimensional indexation
slow, it is just a question of how intelligent the compiler is.
The fortran version of this code (no classes, only one main program) takes 0.06
seconds to run, for both the one dimensional and the three dimensional array.
An array class with the same functionality as the fortran90 array requires more work
than shown here. For example, it is necessary to have operators for selecting subarrays,
such as taking out a column from a matrix. An attempt to make an array class with full
functionality is the A++ library, written by Dan Quinlan at the Los Alamos Laboratories
in the USA. It is available from its homepage,
http://www.c3.lanl.gov/~dquinlan/A++P++.html/
where also manuals and examples can be found.
4.4 The innermost loops
The time step involves one or more evaluations of the difference scheme. One way of
thinking is to write the semi-discrete problem as
$$\frac{du_{i,j}}{dt} = r(\ldots, u_{i,j}, \ldots)$$
where r(u) is called the residual, and contains all spatial derivatives of the PDE. With
the notation $r(\ldots, u_{i,j}, \ldots)$, we mean that r at the point (i, j) depends on $u_{i,j}$ and some
neighbouring values such as, e.g., $u_{i+1,j}$ or $u_{i,j-1}$.
This is especially common for steady state problems, where we want to solve r(u) = 0.
The residual is often computed by successive function calls, each function adding one part
of the PDE. E.g., for the problem $u_t + f(u)_x + g(u)_y = \epsilon(u_{xx} + u_{yy})$, solved by a centered
difference method and fourth order artificial diffusion terms, we get
$$r(u_{i,j}) = -D_{0x} f(u_{i,j}) - D_{0y} g(u_{i,j}) + \epsilon\,(D_{+x}D_{-x} u_{i,j} + D_{+y}D_{-y} u_{i,j})
- d\,(\Delta x^3 (D_{+x}D_{-x})^2 u_{i,j} + \Delta y^3 (D_{+y}D_{-y})^2 u_{i,j})$$
In several dimensions, we put an index on the difference operators to show in which
direction they act, for example
$$D_{0r} u_{i,j} = (u_{i+1,j} - u_{i-1,j})/(2\Delta r), \qquad
D_{+s} u_{i,j} = (u_{i,j+1} - u_{i,j})/\Delta s.$$
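As a sketch, these operators could be implemented as inline functions on a columnwise stored two dimensional array; the argument list is an assumption made for the illustration.

// u_{i,j} is stored at u[i-1 + ni*(j-1)]; dr, ds are the uniform step sizes.
inline double D0r( const double u[], int i, int j, int ni, double dr )
{
   return ( u[i   + ni*(j-1)]        // u_{i+1,j}
          - u[i-2 + ni*(j-1)] )/(2*dr);
}
inline double Dps( const double u[], int i, int j, int ni, double ds )
{
   return ( u[i-1 + ni*j]            // u_{i,j+1}
          - u[i-1 + ni*(j-1)] )/ds;
}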
The residual computation would be divided into three function calls:
r = 0;
convective( u, r );      // Computes r = r - (D0x f(u) + D0y g(u))
diffusive ( u, r, eps ); // Computes r = r + eps*(D+x D-x u + D+y D-y u)
artificial( u, r, d );   // Computes r = r - d*(dx^3 (D+x D-x)^2 u + dy^3 (D+y D-y)^2 u)
This is a structured way of arranging the computation, and it gives good flexibility. It is easy
to, e.g., change the diffusion model, but keep the other terms. On modern workstations, however,
memory accesses are fairly expensive, and the three sweeps through memory needed give
a significant slow down, when compared with doing all terms of the PDE in the same
function.
We assume in this section that the known function on the grid is w, which is an array
of $ni \cdot nj \cdot nk$ grid points, and that we want to solve a problem of the type
$$w_t + f(w)_x + g(w)_y + h(w)_z = 0$$
or the two dimensional equivalent problem
$$u_t + f(u)_x + g(u)_y = 0$$
which is transformed to a grid
$$u_t + \frac{1}{J}\bigl((y_s f - x_s g)_r + (-y_r f + x_r g)_s\bigr) = 0 \qquad (4.4)$$
and where f(u) and g(u) are given functions. We will use the standard second order
centered difference operator,
$$D_0 u_j = (u_{j+1} - u_{j-1})/(2\Delta x) = u_x(x_j) + O(\Delta x^2).$$
The straightforward semi-discrete approximation of (4.4) then becomes
$$du_{i,j}/dt + \frac{1}{J_{i,j}}\Bigl(D_{0r}\bigl(D_{0s}(y_{i,j})f(u_{i,j}) - D_{0s}(x_{i,j})g(u_{i,j})\bigr)
+ D_{0s}\bigl(-D_{0r}(y_{i,j})f(u_{i,j}) + D_{0r}(x_{i,j})g(u_{i,j})\bigr)\Bigr) = 0$$
It follows from the discussion in the previous chapter that, for this approximation, the
flux derivatives become identically zero when f and g are constant. In practice we often
have more complicated difference approximations. We use the centered $D_0$ approximation
here as an example; furthermore, this type of approximation often appears as a part of a
more advanced approximation.
We can solve in time like a system of ordinary differential equations, since the problem
is of the type
$$du/dt = F(u)$$
with $u = (u_{1,1}\; u_{1,2}\; \ldots\; u_{m,n})^T$. This means that we will have to iterate a number of steps in
time. The metric derivatives $x_r$, etc. do not change in time, and we can either compute
and store these quantities, or recompute them each time we need them. This is a case
of trade-off between higher computational cost and higher memory consumption. In three
space dimensions the precomputed metric derivatives give a 3 × 3 matrix at each grid
point, whereas the grid itself has only 3 components at each point.
A fortran routine to compute and store the metric derivatives is shown below. The metric is stored in an array met(9,ni,nj,nk), which has been defined such that met(1,:,:,:)
is an approximation of $Jq_x$, met(2,:,:,:) is an approximation of $Jq_y$, etc. We use the
formulas in Chapter 3, (3.7), and store the coefficients multiplied by J, since this is what
appears naturally in the formulas for the conservative derivatives.
The boundary points of the metric array are not assigned any values. We can provide
boundary values by changing to a one sided difference operator near the boundary. One
sided operators at boundaries will be described later in this chapter.
subroutine METCOMP3( ni, nj, nk, grid, met )
*** Input: ni, nj, nk - Number of grid points
*** grid - The grid, grid(1,:,:,:) is x, grid(2,:,:,:) is y,
*** grid(3,:,:,:) is z
*** Output: met - Metric derivatives. Boundary points not treated.
real*8 half
parameter( half = 0.5d0 )
integer ni, nj, nk, i, j, k
real*8 grid(3,ni,nj,nk), met(9,ni,nj,nk)
real*8 xi, xj, xk, yi, yj, yk, zi, zj, zk
do k=2,nk-1
do j=2,nj-1
do i=2,ni-1
xk = (grid(1,i,j,k+1)-grid(1,i,j,k-1))*half
yk = (grid(2,i,j,k+1)-grid(2,i,j,k-1))*half
zk = (grid(3,i,j,k+1)-grid(3,i,j,k-1))*half
xj = (grid(1,i,j+1,k)-grid(1,i,j-1,k))*half
yj = (grid(2,i,j+1,k)-grid(2,i,j-1,k))*half
zj = (grid(3,i,j+1,k)-grid(3,i,j-1,k))*half
xi = (grid(1,i+1,j,k)-grid(1,i-1,j,k))*half
yi = (grid(2,i+1,j,k)-grid(2,i-1,j,k))*half
zi = (grid(3,i+1,j,k)-grid(3,i-1,j,k))*half
met(1,i,j,k) = yj*zk-zj*yk
met(2,i,j,k) = zj*xk-xj*zk
met(3,i,j,k) = xj*yk-yj*xk
met(4,i,j,k) = zi*yk-yi*zk
met(5,i,j,k) = xi*zk-zi*xk
met(6,i,j,k) = yi*xk-yk*xi
met(7,i,j,k) = yi*zj-zi*yj
met(8,i,j,k) = zi*xj-xi*zj
met(9,i,j,k) = xi*yj-xj*yi
enddo
enddo
enddo
return
end
Code 4.4. Metric derivative computation.
The computation of the derivatives of the functions f(u), g(u), and h(u) can be organized in different ways.
To make the exposition simple, let us first consider the one dimensional computation
$$\frac{\partial f(u(x_j))}{\partial x} \approx \sum_{k=-q}^{p} \alpha_k\, f(u_{j+k}).$$
The straightforward implementation
for( int i=1+q; i<= ni-p ; i++ )
r[i] = r[i] + al[-q]*f(w[i-q]) + al[-q+1]*f(w[i-q+1]) + ...+ al[p]*f(w[i+p]);
has the disadvantage that f is computed several times for the same $u_j$ value. This can be
costly, since f(u) often involves complicated arithmetical formulae. On the other hand it
is a simple and straightforward implementation. The second approach is with so called
lagging fluxes, where we hold only the fluxes needed in a short array
for( int k=1;k<=p+q+1 ; k++ )
f[k] = f(w[k]);
for( int i=1+q; i<= ni-p ; i++ )
{
f[1+p+q] = f(w[i+p]);
r[i] = r[i] + al[-q]*f[1] + al[-q+1]*f[2] + ...+ al[p]*f[1+p+q];
f[1] = f[2];
f[2] = f[3];
...
f[p+q] = f[p+q+1];
}
Here f(u) is computed only once. On the other hand, it has become complicated to
start the loop. Furthermore, vector computers, like for example the Cray, have problems
vectorizing wrap-around variables. Another problem is that we unnecessarily move data
between different memory locations in the flux copy. One could avoid the flux copy by
using an indirect addressing of the array f[ ], but that would also cause a certain (but
probably smaller) overhead.
It can be faster to precompute all fluxes first, as in the example below
for( int i=1; i<=ni ; i++ )
f[i] = f(w[i]);
for( int i=1+q; i<= ni-p ; i++ )
{
r[i] = r[i] + al[-q]*f[i-q] + al[-q+1]*f[i-q+1] + ...+ al[p]*f[i+p];
}
This is easy to implement and gives a fairly clean code. However, we are wasting
memory, but this is normally not severe if we use the technique line wise in several
space dimensions. Another problem is that we have to fetch the elements of f from
memory twice, once in each loop, which can cause a little performance degradation on
risc processors.
The fourth code organization that we would like to show here is the unwrapped
computation
for( int i=1; i<= ni ; i++ )
{
flx = f(w[i]);
if( i-p >= 1 && i-p <= ni )
r[i-p] += al[p]*flx;
if( i-p+1 >= 1 && i-p+1 <= ni )
r[i-p+1] += al[p-1]*flx;
...
if( i+q >= 1 && i+q <= ni )
r[i+q] += al[-q]*flx;
}
The summation of $\sum_k \alpha_k f(u_{j+k})$ is spread out over several loop iterations. Some performance
can be lost by this, but it is not important if the computation of f(u) is the dominating
cost. If f(u) is very costly to compute, this is a good technique. Furthermore, we need no
extra arrays, and we sweep through memory only once.
The techniques above can be extended to several space dimensions. It is usual to use
the same function for all directions. For example we can define a function
void d0( int ni, int nj, double w[], double met[], double r[], char dir )
{
int io, jo, m;
double fp, fm;
if( dir == 'i' )
{
io = 1; jo = 0; m=1;
}
else
{
io = 0; jo = 1; m=2;
}
for( int j=2 ; j<=nj-1; j++ )
for( int i=2 ; i<=ni-1; i++ )
{
fp = met[m,1,i+io,j+jo]*f(w[i+io,j+jo]) + met[m,2,i+io,j+jo]*g(w[i+io,j+jo]);
fm = met[m,1,i-io,j-jo]*f(w[i-io,j-jo]) + met[m,2,i-io,j-jo]*g(w[i-io,j-jo]);
r[i,j] += (fp-fm)/2;
}
}
where we for the moment assume that we have an indexation possibility for the arrays,
w[i,j]; however, this is not valid C++ syntax, unless we have overloaded the indexation
operator. The idea with this routine is that we call it first with dir equal to 'i' and
then with dir equal to 'j'. The array met contains the metric derivatives, precomputed as
described previously. The code becomes easier to modify and debug when we use one
function for all directions. The disadvantage is that some vector machines have problems
vectorizing the code, when the information about the sweep direction is not available at
compile time. We also lose somewhat in efficiency on risc machines, since memory is swept
through two times.
In the technique with precomputed fluxes, we save memory by precomputing fluxes
only along lines. We show below an example of what we mean, in three dimensions.
for( int k=1 ; k<=nk ;k++ )
for( int j=1 ; j<=nj ;j++ )
{
for( int i=1 ; i<=ni ;i++ )
f[i] = f(w[i,j,k]);
for( int i=2 ; i<=ni-1 ;i++ )
r[i,j,k] += ( f[i+1] - f[i-1] )/2;
}
It is not completely obvious how to merge the i-, j-, and k-direction sweeps into one
routine. We have to remap the index (i, j, k) to a new index $(i_1, i_2, i_3)$ so that the sweep
direction is always innermost. This means for i sweeps $(i,j,k) \to (i_1, i_2, i_3)$, for j sweeps
$(i,j,k) \to (i_2, i_1, i_3)$, and for k sweeps $(i,j,k) \to (i_2, i_3, i_1)$. We then need a flexible indexation, so we transfer the arrays as one-dimensional, and use our own index expression, as
shown in the code below.
void d0( int ni, int nj, int nk, double w[], double met[], double r[], char dir )
{
...
if( dir == 'i' )
{
n1 = ni; n2 = nj; n3=nk;
a1 = 1; a2 = ni; a3 = ni*nj;
}
else if( dir == 'j' )
{
n1 = nj; n2 = ni; n3=nk;
a1 = ni; a2 = 1; a3 = ni*nj;
}
else
{
n1 = nk; n2 = ni; n3=nj;
a1 = ni*nj; a2 = 1; a3 = ni;
}
b = -a1 - a2 - a3;
for( int i3 = 1 ; i3 <= n3 ; i3++ )
for( int i2 = 1 ; i2 <= n2 ; i2++ )
{
for( int i1 = 1 ; i1 <= n1 ; i1++ )
{
ind = b + a1*i1 + a2*i2 + a3*i3;
f[i1] = f(w[ind]);
}
for( int i1 = 2 ; i1 <= n1-1 ; i1++ )
{
ind = b + a1*i1 + a2*i2 + a3*i3;
r[ind] += (f[i1+1]-f[i1-1])/2;
}
}
}
The index expression $ind = i - 1 + ni\,(j-1) + ni\,nj\,(k-1)$ is used. The part
of the indexation expression which does not change with i1 can be moved out and be
computed in the i2 loop. The above code can be called three times, once for each of the
three coordinate directions. The met array consists of precomputed metric derivatives,
done as in Code 4.4, and r is the resulting residual; w is the function defined on the grid.
The technique above can also be used when complicated subexpressions are common
to several terms, as in the example
$$r_i = h(u_{i+1} - s_{i+1}/2,\; u_i + s_i/2) - h(u_i - s_i/2,\; u_{i-1} + s_{i-1}/2)$$
where $s_i = s(u_{i+1}, u_i, u_{i-1})$. Here the function s should be evaluated only once, though
the value $s_i$ is needed in two different h evaluations. It can then be advantageous to
precompute s along lines, as done with the flux above. A second loop performs the h
evaluation and adds to r.
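A sketch of this two-loop organization is given below. The minmod limiter and the simple Burgers-type flux used for s and h are only example choices made for the illustration, not the scheme of the text.

#include <cmath>

// Example limiter: minmod of the two one sided slopes.
static double s( double up, double u0, double um )
{
   double a = up - u0, b = u0 - um;
   if( a*b <= 0 ) return 0;
   return std::fabs(a) < std::fabs(b) ? a : b;
}
// Example numerical flux (average of Burgers fluxes), just for illustration.
static double h( double ul, double ur ) { return 0.5*(ul*ul/2 + ur*ur/2); }

void residual_line( const double u[], double r[], int ni )
{
   const int NMAX = 1024;                  // assumed maximum line length
   static double sl[NMAX];
   for( int i = 1 ; i < ni-1 ; i++ )
      sl[i] = s(u[i+1], u[i], u[i-1]);     // one s evaluation per point
   for( int i = 2 ; i < ni-2 ; i++ )       // boundary treatment omitted
      r[i] = h(u[i+1] - sl[i+1]/2, u[i]   + sl[i]/2)
           - h(u[i]   - sl[i]/2,   u[i-1] + sl[i-1]/2);
}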
Let us finally comment on flux differences and finite volumes. It is very common to
rewrite the centered difference approximation to a flux difference formula; the computation
then has only two terms, which makes it simple. The flux derivatives
$$D_{0r}\bigl(D_{0s}(y_{i,j})f(u_{i,j}) - D_{0s}(x_{i,j})g(u_{i,j})\bigr)
+ D_{0s}\bigl(-D_{0r}(y_{i,j})f(u_{i,j}) + D_{0r}(x_{i,j})g(u_{i,j})\bigr)$$
can be written on flux-difference (conservative) form as
$$(h^r_{i+1/2,j} - h^r_{i-1/2,j})/\Delta r + (h^s_{i,j+1/2} - h^s_{i,j-1/2})/\Delta s$$
where the numerical fluxes $h^r$, $h^s$ are defined as
$$h^r_{i-1/2,j} = \bigl(D_{0s}(y_{i,j})f(u_{i,j}) + D_{0s}(y_{i-1,j})f(u_{i-1,j})\bigr)/2
- \bigl(D_{0s}(x_{i,j})g(u_{i,j}) + D_{0s}(x_{i-1,j})g(u_{i-1,j})\bigr)/2$$
$$h^s_{i,j-1/2} = -\bigl(D_{0r}(y_{i,j})f(u_{i,j}) + D_{0r}(y_{i,j-1})f(u_{i,j-1})\bigr)/2
+ \bigl(D_{0r}(x_{i,j})g(u_{i,j}) + D_{0r}(x_{i,j-1})g(u_{i,j-1})\bigr)/2$$
This follows from the observation that
$$D_0 u_j = (u_{j+1} - u_{j-1})/(2\Delta x) = \bigl((u_{j+1}+u_j)/2 - (u_j+u_{j-1})/2\bigr)/\Delta x.$$
Alternatively, one can define the numerical fluxes as
$$h^r_{i-1/2,j} = D_{0s}y_{i-1/2,j}\,\bigl(f(u_{i,j}) + f(u_{i-1,j})\bigr)/2
- D_{0s}x_{i-1/2,j}\,\bigl(g(u_{i,j}) + g(u_{i-1,j})\bigr)/2$$
$$h^s_{i,j-1/2} = -D_{0r}y_{i,j-1/2}\,\bigl(f(u_{i,j}) + f(u_{i,j-1})\bigr)/2
+ D_{0r}x_{i,j-1/2}\,\bigl(g(u_{i,j}) + g(u_{i,j-1})\bigr)/2$$
where $D_{0s}y_{i-1/2,j}$ is the derivative at a location intermediate to the grid points, e.g.,
$$D_{0s}y_{i-1/2,j} = (D_{0s}y_{i,j} + D_{0s}y_{i-1,j})/2.$$
In the finite volume method, this is the formula that appears naturally, since we then
have the grid points given at half index points (i + 1/2, j + 1/2). The approximation
$$y_{s,i-1/2,j} \approx y_{i-1/2,j+1/2} - y_{i-1/2,j-1/2}$$
is then second order accurate.
4.5 Some numerical tests
How much difference does the implementation make? It depends, of course, on the
computer on which the program will be run. We will show examples for a SUN Ultra. One
often used measure of computational speed is the megaflop rate. One megaflop is defined
as one million floating point operations per second. For example, 15 mflop stands for 15
million floating point operations per second.
First we compare the cost of evaluating the metric coefficients vs. precomputing them.
In Table 4.1 we show the execution time in seconds, and the megaflop rate, for the example
of the Euler equations of compressible fluid flow. These consist of five equations and have the
form $u_t + f(u)_x + g(u)_y + h(u)_z = 0$, with u a vector of five components. The exact form
is
$$\begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ e \end{pmatrix}_t
+ \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ \rho u w \\ u(e+p) \end{pmatrix}_x
+ \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ \rho v w \\ v(e+p) \end{pmatrix}_y
+ \begin{pmatrix} \rho w \\ \rho u w \\ \rho v w \\ \rho w^2 + p \\ w(e+p) \end{pmatrix}_z
= \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}$$
Here, $\rho$ is the density, u, v, w are the velocities in the (x, y, z) directions, and e is the energy of
the fluid. The pressure p is related to the other components through an equation of state,
and we assume that it has been computed elsewhere in the program.
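For instance, for an ideal gas the pressure could be computed as in the sketch below, assuming the conservative variables $(\rho, \rho u, \rho v, \rho w, e)$ stored as in the text; the value $\gamma = 1.4$ is the usual one for air and is an assumption here, not something prescribed by the text.

// Pressure from the ideal gas equation of state,
// p = (gamma-1)*(e - (m.m)/(2 rho)), with m the momentum vector.
void pressure( const double w[], double p[], int npts, double gamma = 1.4 )
{
   for( int i = 0; i < npts; i++ )
   {
      const double *q = w + 5*i;    // (rho, rho u, rho v, rho w, e) at point i
      double kin = (q[1]*q[1] + q[2]*q[2] + q[3]*q[3])/(2*q[0]);
      p[i] = (gamma - 1.0)*(q[4] - kin);
   }
}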
In the computer program the array w(5,ni,nj,nk) represents the five components
$(\rho, \rho u, \rho v, \rho w, e)$ in each grid point. The results below were obtained by evaluating the
centered difference approximation, described previously, for these equations. Thus there
is no time stepping, only a single evaluation of the residual r(u).
In Table 4.1 we compare a function where the metric coefficients are computed directly in the approximation with another function using precomputed metric coefficients.
Otherwise it is the same computation. We see that the loss is approximately 10 % in computational time from performing the extra computation of metric coefficients. The mflop
rate increases when the metric is computed: on this risc computer, we make more efficient use of the processor's arithmetic unit when we increase the number of arithmetic
operations and decrease the number of array element accesses. The programs were compiled with optimization flags, to make them as fast as possible. The programs
are shown in Code 4.5 and 4.7, respectively.
Table 4.1, 3D Euler computation, different treatment of metric

Code                              Cpu seconds   Mflop
Computing metric, Code 4.7        0.048         30
Precomputed metric, Code 4.5      0.043         20
Next we compare some different languages.

Table 4.2, 3D Euler computation, different languages, Mflop rates

Code                                    10^3 points   30^3 points
f77 1 loop, Code 4.6                    33            31
f77 3 loops, Code 4.5                   28            22
f90, Code 4.10                          16            16
C++, Code 4.8                           20            20
C++, simple array class, Code 4.9       10            9
A++, array class, Code 4.11             1.7           1.7
In Table 4.2, we first see that the computer is sensitive to how many times the memory
is swept through: the 1-loop implementation is faster than the three-loop version. The
simple C++ code with direct array access is only slightly slower than f77. When a
simple home-made array class is used the performance drops considerably, and the
professional array class library A++ shows the slowest performance. A++ is, however,
very general, showing the tradeoff between generality and performance.
On a machine like the Cray, which is not of risc type, there is no gain in having one
loop instead of three.
4.6 Example codes
We here give the codes 4.5-4.10 that were used in the section on performance. They
all compute the same centered difference approximation of the three dimensional Euler
equations. Only the treatment of the boundary points differs between the implementations.
subroutine D0B( ni, nj, nk, u, p, met, res )
real*8 half
parameter( half=0.5d0)
integer ni, nj, nk, i, j, k
real*8 u(5,ni,nj,nk), p(ni,nj,nk)
real*8 met(3,3,ni,nj,nk), res(5,ni,nj,nk)
real*8 f1, f2, f3, f4, f5, irh, u1, u2, u3
do k=1,nk
do j=1,nj
do i=1,ni
f1 = met(1,1,i,j,k)*u(2,i,j,k) +
* met(1,2,i,j,k)*u(3,i,j,k) +
* met(1,3,i,j,k)*u(4,i,j,k)
irh = 1/u(1,i,j,k)
f2 = half*(f1*u(2,i,j,k)*irh + met(1,1,i,j,k)*p(i,j,k))
f3 = half*(f1*u(3,i,j,k)*irh + met(1,2,i,j,k)*p(i,j,k))
f4 = half*(f1*u(4,i,j,k)*irh + met(1,3,i,j,k)*p(i,j,k))
f5 = half*(f1*(u(5,i,j,k)+p(i,j,k))*irh)
f1 = half*f1
if( i.lt.ni )then
res(1,i+1,j,k) = res(1,i+1,j,k) - f1
res(2,i+1,j,k) = res(2,i+1,j,k) - f2
res(3,i+1,j,k) = res(3,i+1,j,k) - f3
res(4,i+1,j,k) = res(4,i+1,j,k) - f4
res(5,i+1,j,k) = res(5,i+1,j,k) - f5
endif
if( i.gt.1 )then
res(1,i-1,j,k) = res(1,i-1,j,k) + f1
res(2,i-1,j,k) = res(2,i-1,j,k) + f2
res(3,i-1,j,k) = res(3,i-1,j,k) + f3
res(4,i-1,j,k) = res(4,i-1,j,k) + f4
res(5,i-1,j,k) = res(5,i-1,j,k) + f5
endif
enddo
enddo
enddo
do k=1,nk
do i=1,ni
do j=1,nj
f1 = met(2,1,i,j,k)*u(2,i,j,k) +
* met(2,2,i,j,k)*u(3,i,j,k) +
* met(2,3,i,j,k)*u(4,i,j,k)
irh = 1/u(1,i,j,k)
f2 = half*(f1*u(2,i,j,k)*irh + met(2,1,i,j,k)*p(i,j,k))
f3 = half*(f1*u(3,i,j,k)*irh + met(2,2,i,j,k)*p(i,j,k))
f4 = half*(f1*u(4,i,j,k)*irh + met(2,3,i,j,k)*p(i,j,k))
f5 = half*(f1*(u(5,i,j,k)+p(i,j,k))*irh)
f1 = half*f1
if( j.lt.nj )then
res(1,i,j+1,k) = res(1,i,j+1,k) - f1
res(2,i,j+1,k) = res(2,i,j+1,k) - f2
res(3,i,j+1,k) = res(3,i,j+1,k) - f3
res(4,i,j+1,k) = res(4,i,j+1,k) - f4
res(5,i,j+1,k) = res(5,i,j+1,k) - f5
endif
if( j.gt.1 )then
res(1,i,j-1,k) = res(1,i,j-1,k) + f1
res(2,i,j-1,k) = res(2,i,j-1,k) + f2
res(3,i,j-1,k) = res(3,i,j-1,k) + f3
res(4,i,j-1,k) = res(4,i,j-1,k) + f4
res(5,i,j-1,k) = res(5,i,j-1,k) + f5
endif
enddo
enddo
enddo
do j=1,nj
do i=1,ni
do k=1,nk
f1 = met(3,1,i,j,k)*u(2,i,j,k) +
* met(3,2,i,j,k)*u(3,i,j,k) +
* met(3,3,i,j,k)*u(4,i,j,k)
irh = 1/u(1,i,j,k)
f2 = half*(f1*u(2,i,j,k)*irh + met(3,1,i,j,k)*p(i,j,k))
f3 = half*(f1*u(3,i,j,k)*irh + met(3,2,i,j,k)*p(i,j,k))
f4 = half*(f1*u(4,i,j,k)*irh + met(3,3,i,j,k)*p(i,j,k))
f5 = half*(f1*(u(5,i,j,k)+p(i,j,k))*irh)
f1 = half*f1
if( k.lt.nk )then
res(1,i,j,k+1) = res(1,i,j,k+1) - f1
res(2,i,j,k+1) = res(2,i,j,k+1) - f2
res(3,i,j,k+1) = res(3,i,j,k+1) - f3
res(4,i,j,k+1) = res(4,i,j,k+1) - f4
res(5,i,j,k+1) = res(5,i,j,k+1) - f5
endif
if( k.gt.1 )then
res(1,i,j,k-1) = res(1,i,j,k-1) + f1
res(2,i,j,k-1) = res(2,i,j,k-1) + f2
res(3,i,j,k-1) = res(3,i,j,k-1) + f3
res(4,i,j,k-1) = res(4,i,j,k-1) + f4
res(5,i,j,k-1) = res(5,i,j,k-1) + f5
endif
enddo
enddo
enddo
end
Code 4.5. Fortran 77, three loops, precomputed metric.
subroutine D0B( ni, nj, nk, u, p, met, res )
real*8 half
parameter( half=0.5d0)
integer ni, nj, nk, i, j, k
real*8 u(5,ni,nj,nk), p(ni,nj,nk)
real*8 met(3,3,ni,nj,nk), res(5,ni,nj,nk)
real*8 f1, f1i, f1j, f1k, ui, uj, uk, irh
do k=2,nk-1
do j=2,nj-1
do i=2,ni-1
f1i = met(1,1,i,j,k)*u(2,i,j,k) +
* met(1,2,i,j,k)*u(3,i,j,k) +
* met(1,3,i,j,k)*u(4,i,j,k)
f1j = met(2,1,i,j,k)*u(2,i,j,k) +
* met(2,2,i,j,k)*u(3,i,j,k) +
* met(2,3,i,j,k)*u(4,i,j,k)
f1k = met(3,1,i,j,k)*u(2,i,j,k) +
* met(3,2,i,j,k)*u(3,i,j,k) +
* met(3,3,i,j,k)*u(4,i,j,k)
irh = 1d0/u(1,i,j,k)
ui = f1i*irh
uj = f1j*irh
uk = f1k*irh
f1i = half*f1i
f1j = half*f1j
f1k = half*f1k
res(1,i+1,j,k) = res(1,i+1,j,k) - f1i
res(1,i-1,j,k) = res(1,i-1,j,k) + f1i
res(1,i,j+1,k) = res(1,i,j+1,k) - f1j
res(1,i,j-1,k) = res(1,i,j-1,k) + f1j
res(1,i,j,k+1) = res(1,i,j,k+1) - f1k
res(1,i,j,k-1) = res(1,i,j,k-1) + f1k
f1 = half*(u(2,i,j,k)*ui + met(1,1,i,j,k)*p(i,j,k))
res(2,i+1,j,k) = res(2,i+1,j,k) - f1
res(2,i-1,j,k) = res(2,i-1,j,k) + f1
f1 = half*(u(2,i,j,k)*uj + met(2,1,i,j,k)*p(i,j,k))
res(2,i,j+1,k) = res(2,i,j+1,k) - f1
res(2,i,j-1,k) = res(2,i,j-1,k) + f1
f1 = half*(u(2,i,j,k)*uk + met(3,1,i,j,k)*p(i,j,k))
res(2,i,j,k+1) = res(2,i,j,k+1) - f1
res(2,i,j,k-1) = res(2,i,j,k-1) + f1
f1 = half*(u(3,i,j,k)*ui + met(1,2,i,j,k)*p(i,j,k))
res(3,i+1,j,k) = res(3,i+1,j,k) - f1
res(3,i-1,j,k) = res(3,i-1,j,k) + f1
f1 = half*(u(3,i,j,k)*uj + met(2,2,i,j,k)*p(i,j,k))
res(3,i,j+1,k) = res(3,i,j+1,k) - f1
res(3,i,j-1,k) = res(3,i,j-1,k) + f1
f1 = half*(u(3,i,j,k)*uk + met(3,2,i,j,k)*p(i,j,k))
res(3,i,j,k+1) = res(3,i,j,k+1) - f1
res(3,i,j,k-1) = res(3,i,j,k-1) + f1
f1 = half*(u(4,i,j,k)*ui + met(1,3,i,j,k)*p(i,j,k))
res(4,i+1,j,k) = res(4,i+1,j,k) - f1
res(4,i-1,j,k) = res(4,i-1,j,k) + f1
f1 = half*(u(4,i,j,k)*uj + met(2,3,i,j,k)*p(i,j,k))
res(4,i,j+1,k) = res(4,i,j+1,k) - f1
res(4,i,j-1,k) = res(4,i,j-1,k) + f1
f1 = half*(u(4,i,j,k)*uk + met(3,3,i,j,k)*p(i,j,k))
res(4,i,j,k+1) = res(4,i,j,k+1) - f1
res(4,i,j,k-1) = res(4,i,j,k-1) + f1
f1 = half*(u(5,i,j,k)+p(i,j,k))*ui
res(5,i+1,j,k) = res(5,i+1,j,k) - f1
res(5,i-1,j,k) = res(5,i-1,j,k) + f1
f1 = half*(u(5,i,j,k)+p(i,j,k))*uj
res(5,i,j+1,k) = res(5,i,j+1,k) - f1
res(5,i,j-1,k) = res(5,i,j-1,k) + f1
f1 = half*(u(5,i,j,k)+p(i,j,k))*uk
res(5,i,j,k+1) = res(5,i,j,k+1) - f1
res(5,i,j,k-1) = res(5,i,j,k-1) + f1
enddo
enddo
enddo
end
Code 4.6. Fortran 77, one loop, precomputed metric.
subroutine D0C( ni, nj, nk, u, p, grid, res )
real*8 half, fourth
parameter( half=0.5d0, fourth=0.25d0)
integer ni, nj, nk, i, j, k
real*8 u(5,ni,nj,nk), p(ni,nj,nk)
real*8 grid(3,ni,nj,nk), res(5,ni,nj,nk)
real*8 f1, f2, f3, f4, f5, irh, u1, u2, u3
real*8 k1, k2, k3
do k=1,nk
do j=1,nj
do i=1,ni
k1 = fourth*((grid(2,i,j+1,k)-grid(2,i,j-1,k))*
* (grid(3,i,j,k+1)-grid(3,i,j,k-1)) -
* (grid(2,i,j,k+1)-grid(2,i,j,k-1))*
* (grid(3,i,j+1,k)-grid(3,i,j-1,k)) )
k2 = fourth*((grid(3,i,j+1,k)-grid(3,i,j-1,k))*
* (grid(1,i,j,k+1)-grid(1,i,j,k-1)) -
* (grid(3,i,j,k+1)-grid(3,i,j,k-1))*
* (grid(1,i,j+1,k)-grid(1,i,j-1,k)) )
k3 = fourth*((grid(1,i,j+1,k)-grid(1,i,j-1,k))*
* (grid(2,i,j,k+1)-grid(2,i,j,k-1)) -
* (grid(1,i,j,k+1)-grid(1,i,j,k-1))*
* (grid(2,i,j+1,k)-grid(2,i,j-1,k)) )
f1 = k1*u(2,i,j,k) +
* k2*u(3,i,j,k) +
* k3*u(4,i,j,k)
irh = 1/u(1,i,j,k)
u1 = u(2,i,j,k)*irh
u2 = u(3,i,j,k)*irh
u3 = u(4,i,j,k)*irh
f2 = half*(f1*u1 + k1*p(i,j,k))
f3 = half*(f1*u2 + k2*p(i,j,k))
f4 = half*(f1*u3 + k3*p(i,j,k))
f5 = half*(f1*(u(5,i,j,k)+p(i,j,k))*irh)
f1 = half*f1
if( i.lt.ni )then
res(1,i+1,j,k) = res(1,i+1,j,k) - f1
res(2,i+1,j,k) = res(2,i+1,j,k) - f2
res(3,i+1,j,k) = res(3,i+1,j,k) - f3
res(4,i+1,j,k) = res(4,i+1,j,k) - f4
res(5,i+1,j,k) = res(5,i+1,j,k) - f5
endif
if( i.gt.1 )then
res(1,i-1,j,k) = res(1,i-1,j,k) + f1
res(2,i-1,j,k) = res(2,i-1,j,k) + f2
res(3,i-1,j,k) = res(3,i-1,j,k) + f3
res(4,i-1,j,k) = res(4,i-1,j,k) + f4
res(5,i-1,j,k) = res(5,i-1,j,k) + f5
endif
enddo
enddo
enddo
do k=1,nk
do i=1,ni
do j=1,nj
k1 = fourth*((grid(3,i+1,j,k)-grid(3,i-1,j,k))*
* (grid(2,i,j,k+1)-grid(2,i,j,k-1)) -
* (grid(3,i,j,k+1)-grid(3,i,j,k-1))*
* (grid(2,i+1,j,k)-grid(2,i-1,j,k)) )
k2 = fourth*((grid(1,i+1,j,k)-grid(1,i-1,j,k))*
* (grid(3,i,j,k+1)-grid(3,i,j,k-1)) -
* (grid(1,i,j,k+1)-grid(1,i,j,k-1))*
* (grid(3,i+1,j,k)-grid(3,i-1,j,k)) )
k3 = fourth*((grid(2,i+1,j,k)-grid(2,i-1,j,k))*
* (grid(1,i,j,k+1)-grid(1,i,j,k-1)) -
* (grid(2,i,j,k+1)-grid(2,i,j,k-1))*
* (grid(1,i+1,j,k)-grid(1,i-1,j,k)) )
f1 = k1*u(2,i,j,k) +
* k2*u(3,i,j,k) +
* k3*u(4,i,j,k)
irh = 1/u(1,i,j,k)
u1 = u(2,i,j,k)*irh
u2 = u(3,i,j,k)*irh
u3 = u(4,i,j,k)*irh
f2 = half*(f1*u1 + k1*p(i,j,k))
f3 = half*(f1*u2 + k2*p(i,j,k))
f4 = half*(f1*u3 + k3*p(i,j,k))
f5 = half*(f1*(u(5,i,j,k)+p(i,j,k))*irh)
f1 = half*f1
if( j.lt.nj )then
res(1,i,j+1,k) = res(1,i,j+1,k) - f1
res(2,i,j+1,k) = res(2,i,j+1,k) - f2
res(3,i,j+1,k) = res(3,i,j+1,k) - f3
res(4,i,j+1,k) = res(4,i,j+1,k) - f4
res(5,i,j+1,k) = res(5,i,j+1,k) - f5
endif
if( j.gt.1 )then
res(1,i,j-1,k) = res(1,i,j-1,k) + f1
res(2,i,j-1,k) = res(2,i,j-1,k) + f2
res(3,i,j-1,k) = res(3,i,j-1,k) + f3
res(4,i,j-1,k) = res(4,i,j-1,k) + f4
res(5,i,j-1,k) = res(5,i,j-1,k) + f5
endif
enddo
enddo
enddo
do j=1,nj
do i=1,ni
do k=1,nk
k1 = fourth*((grid(2,i+1,j,k)-grid(2,i-1,j,k))*
* (grid(3,i,j+1,k)-grid(3,i,j-1,k)) -
* (grid(2,i,j+1,k)-grid(2,i,j-1,k))*
* (grid(3,i+1,j,k)-grid(3,i-1,j,k)) )
k2 = fourth*((grid(3,i+1,j,k)-grid(3,i-1,j,k))*
* (grid(1,i,j+1,k)-grid(1,i,j-1,k)) -
* (grid(3,i,j+1,k)-grid(3,i,j-1,k))*
* (grid(1,i+1,j,k)-grid(1,i-1,j,k)) )
k3 = fourth*((grid(1,i+1,j,k)-grid(1,i-1,j,k))*
* (grid(2,i,j+1,k)-grid(2,i,j-1,k)) -
* (grid(1,i,j+1,k)-grid(1,i,j-1,k))*
* (grid(2,i+1,j,k)-grid(2,i-1,j,k)) )
f1 = k1*u(2,i,j,k) +
* k2*u(3,i,j,k) +
* k3*u(4,i,j,k)
irh = 1/u(1,i,j,k)
u1 = u(2,i,j,k)*irh
u2 = u(3,i,j,k)*irh
u3 = u(4,i,j,k)*irh
f2 = half*(f1*u1 + k1*p(i,j,k))
f3 = half*(f1*u2 + k2*p(i,j,k))
f4 = half*(f1*u3 + k3*p(i,j,k))
f5 = half*(f1*(u(5,i,j,k)+p(i,j,k))*irh)
f1 = half*f1
if( k.lt.nk )then
res(1,i,j,k+1) = res(1,i,j,k+1) - f1
res(2,i,j,k+1) = res(2,i,j,k+1) - f2
res(3,i,j,k+1) = res(3,i,j,k+1) - f3
res(4,i,j,k+1) = res(4,i,j,k+1) - f4
res(5,i,j,k+1) = res(5,i,j,k+1) - f5
endif
if( k.gt.1 )then
res(1,i,j,k-1) = res(1,i,j,k-1) + f1
res(2,i,j,k-1) = res(2,i,j,k-1) + f2
res(3,i,j,k-1) = res(3,i,j,k-1) + f3
res(4,i,j,k-1) = res(4,i,j,k-1) + f4
res(5,i,j,k-1) = res(5,i,j,k-1) + f5
endif
enddo
enddo
enddo
end
Code 4.7. Fortran 77, three loops, metric computed.
void d02c( double* u, double* p, double* met, double* res,
int ni, int nj, int nk )
{
int i, j, k;
int b3, b4, b5, a32, a42, a43, a52, a53;
const double half = 0.5;
double f1, f2, f3, f4, f5, u1, u2, u3, irh;
b3 = -1 - ni - ni*nj;
a32 = ni*nj;
int nc = 5;
b4 = -1-nc-nc*ni-nc*ni*nj;
a42 = nc*ni;
a43 = nc*ni*nj;
nc = 9;
b5 = -1 -3 -nc -nc*ni-nc*ni*nj;
a52 = nc*ni;
a53 = nc*ni*nj;
for( k=1 ; k<=nk ; k++ )
for( j=1 ; j<=nj ; j++ )
for( i=1 ; i<=ni ; i++ )
{
f1 = met[1+ 3 + 9*i + a52*j + a53*k + b5]*
u[2+5*i+a42*j+a43*k + b4] +
met[ 1+3*2+9*i + a52*j + a53*k + b5]*
u[3+5*i+a42*j+a43*k + b4] +
met[ 1 + 3*3+ 9*i + a52*j + a53*k + b5 ]*
u[4+5*i+a42*j+a43*k + b4];
irh = f1/u[1+5*i+a42*j+a43*k + b4];
u1 = u[2+5*i+a42*j+a43*k + b4]*irh;
u2 = u[3+5*i+a42*j+a43*k + b4]*irh;
u3 = u[4+5*i+a42*j+a43*k + b4]*irh;
f2 = half*(u1 + met[1+ 3 + 9*i + a52*j + a53*k + b5]*
p[i+ni*j+a32*k+b3]);
f3 = half*(u2 + met[ 1+3*2+9*i + a52*j + a53*k + b5]*
p[i+ni*j+a32*k+b3]);
f4 = half*(u3 + met[ 1 + 3*3+ 9*i + a52*j + a53*k + b5 ]*
p[i+ni*j+a32*k+b3]);
f5 = half*((u[5+5*i+a42*j+a43*k + b4]+ p[i+ni*j+a32*k+b3])*irh);
f1 = half*f1;
if( i < ni )
{
res[1+5*(i+1)+a42*j+a43*k + b4] -= f1;
res[2+5*(i+1)+a42*j+a43*k + b4] -= f2;
res[3+5*(i+1)+a42*j+a43*k + b4] -= f3;
res[4+5*(i+1)+a42*j+a43*k + b4] -= f4;
res[5+5*(i+1)+a42*j+a43*k + b4] -= f5;
}
if( i>1 )
{
res[1+5*(i-1)+a42*j+a43*k + b4] += f1;
res[2+5*(i-1)+a42*j+a43*k + b4] += f2;
res[3+5*(i-1)+a42*j+a43*k + b4] += f3;
res[4+5*(i-1)+a42*j+a43*k + b4] += f4;
res[5+5*(i-1)+a42*j+a43*k + b4] += f5;
}
}
for( k=1 ; k<=nk ; k++ )
for( j=1 ; j<=nj ; j++ )
for( i=1 ; i<=ni ; i++ )
{
f1 = met[2+ 3 + 9*i + a52*j + a53*k + b5]*
u[2+5*i+a42*j+a43*k + b4] +
met[ 2+3*2+9*i + a52*j + a53*k + b5]*
u[3+5*i+a42*j+a43*k + b4] +
met[ 2 + 3*3+ 9*i + a52*j + a53*k + b5 ]*
u[4+5*i+a42*j+a43*k + b4];
irh = f1/u[1+5*i+a42*j+a43*k + b4];
u1 = u[2+5*i+a42*j+a43*k + b4]*irh;
u2 = u[3+5*i+a42*j+a43*k + b4]*irh;
u3 = u[4+5*i+a42*j+a43*k + b4]*irh;
f2 = half*(u1 + met[2+ 3 + 9*i + a52*j + a53*k + b5]*
p[i+ni*j+a32*k+b3]);
f3 = half*(u2 + met[ 2+3*2+9*i + a52*j + a53*k + b5]*
p[i+ni*j+a32*k+b3]);
f4 = half*(u3 + met[ 2 + 3*3+ 9*i + a52*j + a53*k + b5 ]*
p[i+ni*j+a32*k+b3]);
f5 = half*((u[5+5*i+a42*j+a43*k + b4]+ p[i+ni*j+a32*k+b3])*irh);
f1 = half*f1;
if( j<nj )
{
res[1+5*i+a42*(j+1)+a43*k + b4] -= f1;
res[2+5*i+a42*(j+1)+a43*k + b4] -= f2;
res[3+5*i+a42*(j+1)+a43*k + b4] -= f3;
res[4+5*i+a42*(j+1)+a43*k + b4] -= f4;
res[5+5*i+a42*(j+1)+a43*k + b4] -= f5;
}
if( j>1 )
{
res[1+5*i+a42*(j-1)+a43*k + b4] += f1;
res[2+5*i+a42*(j-1)+a43*k + b4] += f2;
res[3+5*i+a42*(j-1)+a43*k + b4] += f3;
res[4+5*i+a42*(j-1)+a43*k + b4] += f4;
res[5+5*i+a42*(j-1)+a43*k + b4] += f5;
}
}
for( k=1 ; k<=nk ; k++ )
for( j=1 ; j<=nj ; j++ )
for( i=1 ; i<=ni ; i++ )
{
f1 = met[3+ 3 + 9*i + a52*j + a53*k + b5]*
u[2+5*i+a42*j+a43*k + b4] +
met[ 3+3*2+9*i + a52*j + a53*k + b5]*
u[3+5*i+a42*j+a43*k + b4] +
met[ 3 + 3*3+ 9*i + a52*j + a53*k + b5 ]*
u[4+5*i+a42*j+a43*k + b4];
irh = f1/u[1+5*i+a42*j+a43*k + b4];
u1 = u[2+5*i+a42*j+a43*k + b4]*irh;
u2 = u[3+5*i+a42*j+a43*k + b4]*irh;
u3 = u[4+5*i+a42*j+a43*k + b4]*irh;
f2 = half*(u1 + met[3+ 3 + 9*i + a52*j + a53*k + b5]*
p[i+ni*j+a32*k+b3]);
f3 = half*(u2 + met[ 3+3*2+9*i + a52*j + a53*k + b5]*
p[i+ni*j+a32*k+b3]);
f4 = half*(u3 + met[ 3 + 3*3+ 9*i + a52*j + a53*k + b5 ]*
p[i+ni*j+a32*k+b3]);
f5 = half*((u[5+5*i+a42*j+a43*k + b4]+ p[i+ni*j+a32*k+b3])*irh);
f1 = half*f1;
if( k < nk )
{
res[1+5*i+a42*j+a43*(k+1) + b4] -= f1;
res[2+5*i+a42*j+a43*(k+1) + b4] -= f2;
res[3+5*i+a42*j+a43*(k+1) + b4] -= f3;
res[4+5*i+a42*j+a43*(k+1) + b4] -= f4;
res[5+5*i+a42*j+a43*(k+1) + b4] -= f5;
}
if( k > 1 )
{
res[1+5*i+a42*j+a43*(k-1) + b4] += f1;
res[2+5*i+a42*j+a43*(k-1) + b4] += f2;
res[3+5*i+a42*j+a43*(k-1) + b4] += f3;
res[4+5*i+a42*j+a43*(k-1) + b4] += f4;
res[5+5*i+a42*j+a43*(k-1) + b4] += f5;
}
}
}
Code 4.8. C++, raw version.
class arc
{
int nc, ni, nj, nk;
double* _v;
int b3, b4, b5, a32, a42, a43, a52, a53;
public:
arc( int, int, int );
arc( int, int, int, int );
arc( int, int, int, int, int );
~arc();
inline double& ind(int m, int i, int j, int k )
{
// return _v[ m-1 +nc*(i-1)+nc*ni*(j-1)+nc*ni*nj*(k-1)];
return _v[ m + nc*i + a42*j + a43*k + b4 ];
}
inline double& ind( int i, int j, int k )
{
// return _v[ (i-1)+ni*(j-1)+ni*nj*(k-1)];
return _v[ i + ni*j + a32*k + b3];
}
inline double& ind(int m, int l, int i, int j, int k )
{
// return _v[ m-1 +3*(l-1) +nc*(i-1)+nc*ni*(j-1)+nc*ni*nj*(k-1)];
return _v[ m + 3*l + nc*i +a52*j + a53*k + b5];
}
int dimi();
int dimj();
int dimk();
};
void d02( arc& u, arc& p, arc& met, arc& res )
{
int i, j, k, ni, nj, nk;
const double half = 0.5;
double f1, f2, f3, f4, f5, u1, u2, u3, irh;
ni = u.dimi();
nj = u.dimj();
nk = u.dimk();
for( k=1 ; k<=nk ; k++ )
for( j=1 ; j<=nj ; j++ )
for( i=1 ; i<=ni ; i++ )
{
f1 = met.ind(1,1,i,j,k)*u.ind(2,i,j,k) +
met.ind(1,2,i,j,k)*u.ind(3,i,j,k) +
met.ind(1,3,i,j,k)*u.ind(4,i,j,k);
irh = 1/u.ind(1,i,j,k);
u1 = u.ind(2,i,j,k)*irh;
u2 = u.ind(3,i,j,k)*irh;
u3 = u.ind(4,i,j,k)*irh;
f2 = half*(f1*u1 + met.ind(1,1,i,j,k)*p.ind(i,j,k));
f3 = half*(f1*u2 + met.ind(1,2,i,j,k)*p.ind(i,j,k));
f4 = half*(f1*u3 + met.ind(1,3,i,j,k)*p.ind(i,j,k));
f5 = half*(f1*(u.ind(5,i,j,k)+p.ind(i,j,k))*irh);
f1 = half*f1;
if( i < ni )
{
res.ind(1,i+1,j,k) -= f1;
res.ind(2,i+1,j,k) -= f2;
res.ind(3,i+1,j,k) -= f3;
res.ind(4,i+1,j,k) -= f4;
res.ind(5,i+1,j,k) -= f5;
}
if( i>1 )
{
res.ind(1,i-1,j,k) += f1;
res.ind(2,i-1,j,k) += f2;
res.ind(3,i-1,j,k) += f3;
res.ind(4,i-1,j,k) += f4;
res.ind(5,i-1,j,k) += f5;
}
}
for( k=1 ; k<=nk ; k++ )
for( j=1 ; j<=nj ; j++ )
for( i=1 ; i<=ni ; i++ )
{
f1 = met.ind(2,1,i,j,k)*u.ind(2,i,j,k) +
met.ind(2,2,i,j,k)*u.ind(3,i,j,k) +
met.ind(2,3,i,j,k)*u.ind(4,i,j,k);
irh = 1/u.ind(1,i,j,k);
u1 = u.ind(2,i,j,k)*irh;
u2 = u.ind(3,i,j,k)*irh;
u3 = u.ind(4,i,j,k)*irh;
f2 = half*(f1*u1 + met.ind(2,1,i,j,k)*p.ind(i,j,k));
f3 = half*(f1*u2 + met.ind(2,2,i,j,k)*p.ind(i,j,k));
f4 = half*(f1*u3 + met.ind(2,3,i,j,k)*p.ind(i,j,k));
f5 = half*(f1*(u.ind(5,i,j,k)+p.ind(i,j,k))*irh);
f1 = half*f1;
if( j<nj )
{
res.ind(1,i,j+1,k) -= f1;
res.ind(2,i,j+1,k) -= f2;
res.ind(3,i,j+1,k) -= f3;
res.ind(4,i,j+1,k) -= f4;
res.ind(5,i,j+1,k) -= f5;
}
if( j>1 )
{
res.ind(1,i,j-1,k) += f1;
res.ind(2,i,j-1,k) += f2;
res.ind(3,i,j-1,k) += f3;
res.ind(4,i,j-1,k) += f4;
res.ind(5,i,j-1,k) += f5;
}
}
for( k=1 ; k<=nk ; k++ )
for( j=1 ; j<=nj ; j++ )
for( i=1 ; i<=ni ; i++ )
{
f1 = met.ind(3,1,i,j,k)*u.ind(2,i,j,k) +
met.ind(3,2,i,j,k)*u.ind(3,i,j,k) +
met.ind(3,3,i,j,k)*u.ind(4,i,j,k);
irh = 1/u.ind(1,i,j,k);
u1 = u.ind(2,i,j,k)*irh;
u2 = u.ind(3,i,j,k)*irh;
u3 = u.ind(4,i,j,k)*irh;
f2 = half*(f1*u1 + met.ind(3,1,i,j,k)*p.ind(i,j,k));
f3 = half*(f1*u2 + met.ind(3,2,i,j,k)*p.ind(i,j,k));
f4 = half*(f1*u3 + met.ind(3,3,i,j,k)*p.ind(i,j,k));
f5 = half*(f1*(u.ind(5,i,j,k)+p.ind(i,j,k))*irh);
f1 = half*f1;
if( k < nk )
{
res.ind(1,i,j,k+1) -= f1;
res.ind(2,i,j,k+1) -= f2;
res.ind(3,i,j,k+1) -= f3;
res.ind(4,i,j,k+1) -= f4;
res.ind(5,i,j,k+1) -= f5;
}
if( k > 1 )
{
res.ind(1,i,j,k-1) += f1;
res.ind(2,i,j,k-1) += f2;
res.ind(3,i,j,k-1) += f3;
res.ind(4,i,j,k-1) += f4;
res.ind(5,i,j,k-1) += f5;
}
}
}
Code 4.9. C++, using simple array class.
subroutine d02( u, p, met, res, ni, nj, nk )
integer ni, nj, nk
real(kind=kind(1d0)), dimension(5,ni,nj,nk) :: u, res, flx
real(kind=kind(1d0)), dimension(9,ni,nj,nk) :: met
real(kind=kind(1d0)), dimension(ni,nj,nk) :: p, irh
real(kind=kind(1d0)), parameter :: half=0.5d0
u(1,:,:,:) = 1;
flx(1,:,:,:) = met(1,:,:,:)*u(2,:,:,:) + &
met(2,:,:,:)*u(3,:,:,:) + &
met(3,:,:,:)*u(4,:,:,:);
irh = flx(1,:,:,:)/u(1,:,:,:);
flx(1,:,:,:) = half*flx(1,:,:,:);
flx(2,:,:,:) = half*( u(2,:,:,:)*irh + met(1,:,:,:)*p );
flx(3,:,:,:) = half*( u(3,:,:,:)*irh + met(2,:,:,:)*p);
flx(4,:,:,:) = half*( u(4,:,:,:)*irh + met(3,:,:,:)*p);
flx(5,:,:,:) = half*( u(5,:,:,:) + p )*irh;
res(:,2:ni-1,:,:) = res(:,2:ni-1,:,:) + &
flx(:,3:ni,:,:) - flx(:,1:ni-2,:,:);
flx(1,:,:,:) = met(4,:,:,:)*u(2,:,:,:) + &
met(5,:,:,:)*u(3,:,:,:) + &
met(6,:,:,:)*u(4,:,:,:);
irh = flx(1,:,:,:)/u(1,:,:,:);
flx(1,:,:,:) = half*flx(1,:,:,:);
flx(2,:,:,:) = half*( u(2,:,:,:)*irh + met(4,:,:,:)*p );
flx(3,:,:,:) = half*( u(3,:,:,:)*irh + met(5,:,:,:)*p );
flx(4,:,:,:) = half*( u(4,:,:,:)*irh + met(6,:,:,:)*p );
flx(5,:,:,:) = half*( u(5,:,:,:) + p )*irh;
res(:,:,2:nj-1,:) = res(:,:,2:nj-1,:) + &
( flx(:,:,3:nj,:) - flx(:,:,1:nj-2,:) );
flx(1,:,:,:) = met(7,:,:,:)*u(2,:,:,:) + &
met(8,:,:,:)*u(3,:,:,:) + &
met(9,:,:,:)*u(4,:,:,:);
irh = flx(1,:,:,:)/u(1,:,:,:);
flx(1,:,:,:) = half*flx(1,:,:,:);
flx(2,:,:,:) = half*( u(2,:,:,:)*irh + met(7,:,:,:)*p );
flx(3,:,:,:) = half*( u(3,:,:,:)*irh + met(8,:,:,:)*p );
flx(4,:,:,:) = half*( u(4,:,:,:)*irh + met(9,:,:,:)*p );
flx(5,:,:,:) = half*( u(5,:,:,:) + p )*irh;
res(:,:,:,2:nk-1) = res(:,:,:,2:nk-1) + &
( flx(:,:,:,3:nk) - flx(:,:,:,1:nk-2) );
end subroutine
Code 4.10. Fortran 90, using array notation.
#include <A++.h>
void d02( doubleArray& u, doubleArray& p, doubleArray& met, doubleArray& res,
int ni, int nj, int nk )
{
setGlobalBase(1);
int i, j, k;
const double half = 0.5;
Index I(1,ni), J(1,nj), K(1,nk), M(1,5);
Index II(2,ni-1), JI(2,nj-1), KI(2,nk-1);
Index IM(1,ni-2), JM(1,nj-2), KM(1,nk-2);
Index IP(3,ni), JP(3,nj), KP(3,nk);
u(1,I,J,K) = 1;
doubleArray flx(5,ni,nj,nk);
flx(1,I,J,K) = met(1,I,J,K)*u(2,I,J,K) +
met(2,I,J,K)*u(3,I,J,K) +
met(3,I,J,K)*u(4,I,J,K);
doubleArray irh(1,ni,nj,nk);
irh = flx(1,I,J,K)/u(1,I,J,K);
flx(1,I,J,K) = half*flx(1,I,J,K);
flx(2,I,J,K) = half*( u(2,I,J,K)*irh + met(1,I,J,K)*p );
flx(3,I,J,K) = half*( u(3,I,J,K)*irh + met(2,I,J,K)*p);
flx(4,I,J,K) = half*( u(4,I,J,K)*irh + met(3,I,J,K)*p);
flx(5,I,J,K) = half*( u(5,I,J,K) + p )*irh;
res(M,II,J,K) += flx(M,II+1,J,K) - flx(M,II-1,J,K);
flx(1,I,J,K) = met(4,I,J,K)*u(2,I,J,K) +
met(5,I,J,K)*u(3,I,J,K) +
met(6,I,J,K)*u(4,I,J,K);
irh = flx(1,I,J,K)/u(1,I,J,K);
flx(1,I,J,K) = half*flx(1,I,J,K);
flx(2,I,J,K) = half*( u(2,I,J,K)*irh + met(4,I,J,K)*p );
flx(3,I,J,K) = half*( u(3,I,J,K)*irh + met(5,I,J,K)*p );
flx(4,I,J,K) = half*( u(4,I,J,K)*irh + met(6,I,J,K)*p );
flx(5,I,J,K) = half*( u(5,I,J,K) + p )*irh;
res(M,I,JI,K) += ( flx(M,I,JI+1,K) - flx(M,I,JI-1,K) );
flx(1,I,J,K) = met(7,I,J,K)*u(2,I,J,K) +
met(8,I,J,K)*u(3,I,J,K) +
met(9,I,J,K)*u(4,I,J,K);
irh = flx(1,I,J,K)/u(1,I,J,K);
flx(1,I,J,K) = half*flx(1,I,J,K);
flx(2,I,J,K) = half*( u(2,I,J,K)*irh + met(7,I,J,K)*p );
flx(3,I,J,K) = half*( u(3,I,J,K)*irh + met(8,I,J,K)*p );
flx(4,I,J,K) = half*( u(4,I,J,K)*irh + met(9,I,J,K)*p );
flx(5,I,J,K) = half*( u(5,I,J,K) + p )*irh;
res(M,I,J,KI) += ( flx(M,I,J,KI+1) - flx(M,I,J,KI-1) );
}
Code 4.11. A++ array class, using array notation.
5 Boundary Conditions
The necessity to impose boundary conditions greatly complicates the implementation of
a pde solver. In this chapter we will show how boundaries can be treated in a general
way.
We distinguish between physical and numerical boundary conditions. Take the example
$$u_t + a u_x = 0, \quad 0 < t, \quad 0 < x < 1, \quad a > 0,$$
$$u(0, t) = 1, \qquad u(x, 0) = \cos x.$$
Approximate by a centered difference formula in space, and leave the time discretization unspecified. We then obtain
$$\frac{du_j(t)}{dt} + a D_0 u_j(t) = 0, \quad 0 < t, \quad j = 2, \ldots, N-1.$$
The grid points are enumerated from 1 to $N$, and the approximation can only be defined for the interior points $j = 2, \ldots, N-1$, since the $D_0$ operator has a stencil which is three points wide. At the point $x_1 = 0$ we have a given boundary condition, $u_1(t) = 1$, from the original problem, a so-called physical boundary condition. This entails no additional difficulties. However, at the point $x_N = 1$, the problem does not have any specified boundary data. Here we need to introduce a new, so-called numerical boundary condition. One possibility is to define a one-sided difference operator at $x_N$, i.e., using the approximation
$$\frac{du_j(t)}{dt} + a D_- u_j(t) = 0, \quad 0 < t, \quad j = N,$$
with $D_- u_j = (u_j - u_{j-1})/\Delta x$. Higher order one-sided operators can be used instead, for example,
$$(u_{j-2} - 4u_{j-1} + 3u_j)/(2\Delta x) = u_x(x_j) + O(\Delta x^2)$$
$$(-u_{j+2} + 4u_{j+1} - 3u_j)/(2\Delta x) = u_x(x_j) + O(\Delta x^2),$$
which give full second order accuracy.
Instead of using one-sided operators, it is possible to define the value at $x_N$ by extrapolation from some of the points $x_{N-1}, x_{N-2}, \ldots$ With this type of boundary conditions, we often add extra points, so-called ghost points. For example, in the example above we introduce the new grid point $x_{N+1}$. The difference operator $D_0$ can now be applied at the point $x_N$, since the right neighbor exists. Define the extra value by, e.g., linear extrapolation
$$u_{N+1}(t) = 2u_N(t) - u_{N-1}(t). \qquad (5.1)$$
For the linear example problem, this is equivalent to using the one-sided operator $D_-$. To see this, substitute the value of $u_{N+1}$ from (5.1) into the formula
$$\frac{du_N(t)}{dt} + a\,\frac{u_{N+1} - u_{N-1}}{2\Delta x} = 0.$$
We then get
$$\frac{du_N(t)}{dt} + a\,\frac{u_N - u_{N-1}}{\Delta x} = 0,$$
i.e., the $D_-$ operator approximation.
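This equivalence is easy to check numerically. The following small program (a sketch; the grid size and the test function $\cos x$ are arbitrary choices, not taken from the text) applies $D_0$ at the last grid point with a linearly extrapolated ghost value and compares the result with $D_-$:

#include <cstdio>
#include <cmath>

int main()
{
    const int N = 11;                      // grid points x_1,...,x_N, stored 0-based
    const double dx = 1.0/(N-1);
    double u[N];
    for( int j=0 ; j<N ; j++ )
        u[j] = cos(j*dx);                  // any smooth test function

    // ghost value by linear extrapolation (5.1): u_{N+1} = 2u_N - u_{N-1}
    double ughost = 2*u[N-1] - u[N-2];

    double d0 = (ughost - u[N-2])/(2*dx);  // D0 at x_N, using the ghost point
    double dm = (u[N-1] - u[N-2])/dx;      // one-sided D- at x_N

    printf("D0 with ghost point: %f  D-: %f\n", d0, dm);  // agree up to rounding
    return 0;
}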
The number of numerical boundary conditions increases when the order of accuracy increases. Consider the fourth order accurate approximation
$$\frac{du_j(t)}{dt} + a\,\frac{-u_{j+2} + 8u_{j+1} - 8u_{j-1} + u_{j-2}}{12\Delta x} = 0, \quad j = 3, 4, \ldots, N-2.$$
The difference operator is five points wide, and consequently needs two boundary conditions at each boundary. One way would be to impose the physical boundary condition $u_1(t) = 1$ at $x_1$, and then use special boundary operators at the points $x_2$, $x_{N-1}$, and $x_N$.
If ghost points and extrapolation are used, we insert two extra grid points at each boundary, $x_{-1}$ and $x_0$ at the lower boundary, and $x_{N+1}$ and $x_{N+2}$ at the upper boundary. We impose $u_1(t) = 1$ directly and extrapolate the values $u_{-1}$ and $u_0$. For the other boundary, we extrapolate the values $u_{N+2}$ and $u_{N+1}$ from the interior.
A numerical method which is stable by Fourier analysis for a periodic boundary condition is not necessarily stable with other boundary conditions. The analysis of stability for problems with boundary conditions is complicated, and in many practical computations stability is not investigated. The stability problems are greater for higher orders of accuracy, when many numerical boundary conditions are needed. In computations, loss of stability does occur frequently. Furthermore, in three space dimensions, not only sides, but also corners and edges need special boundary conditions. Very little is known about stability for corners.
5.1 When to impose boundary conditions
We denoted the spatially discretized pde by the residual, $r(w)$, and wrote the problem as
$$\frac{dw}{dt} = r(w),$$
where $w$ is the matrix of unknown grid point values. Once the method is discretized in time by an explicit method, we take one time step to get $w^{n+1}$ from $w^n$. If $r(w)$ does not use boundary operators, we obtain $w^{n+1}$ only in the interior of the domain, i.e., at the points that are not on the boundary. The boundary values are then provided afterwards by calling a function which imposes the boundary conditions. The boundary values are thus imposed after each time step. If a Runge-Kutta method is used in time, boundary conditions are imposed at each stage of the method.
If instead one-sided operators are used near the boundary, these are usually incorporated into the $r(w)$ function, so that numerical boundary conditions are not needed as a separate function. Nevertheless, a function imposing boundary conditions has to be called after each time step here, too, in order to give the physical boundary conditions.
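As an illustration, an explicit Euler step organized this way might look as follows. The functions residual and impose_bc are hypothetical placeholders, assumed to evaluate $r(w)$ at interior points and to fill in the boundary values, respectively; they are not from a particular code.

// Assumed to be provided elsewhere: r(w) at the interior points,
// and the physical/numerical boundary conditions.
void residual( int n, const double* w, double* r );
void impose_bc( int n, double* w );

// One explicit Euler step: advance the interior, then impose
// boundary conditions on the new solution.
void time_step( int n, double* w, double* r, double dt )
{
    residual( n, w, r );
    for( int i=1 ; i<n-1 ; i++ )
        w[i] += dt*r[i];       // w^{n+1} = w^n + dt*r(w^n), interior only
    impose_bc( n, w );         // boundary values after the step
}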
5.2 Standard method
Here we assume a three dimensional computation on one structured grid. It means that the computed solution is an array with three indices, u[i][j][k], with limits 1 ≤ i ≤ ni, 1 ≤ j ≤ nj, and 1 ≤ k ≤ nk. The problem now is that imposing boundary conditions on different sides leads to slightly different codes of the same type. For example, extrapolation (of order zero) at the side i = 1 is given by u[i][j][k] = u[i+1][j][k], but extrapolation at the side k = nk is given by u[i][j][k] = u[i][j][k-1]. At a first glance, it seems that we have to write six different functions for extrapolation boundary conditions, one for each side. This would be tedious to write, and difficult to modify when introducing changes in existing boundary conditions, or when adding new types of boundary conditions. We will here show that it is possible to use only one function for each boundary condition type. We introduce an enumeration for the sides, for example
Table 5.1. Enumeration of sides.
side number   boundary
0             i = 1
1             i = ni
2             j = 1
3             j = nj
4             k = 1
5             k = nk
The different types of boundary conditions are assigned integer codes, for example
Table 5.2. Boundary condition codes.
code   boundary condition
1      Give u
2      Give ∂u/∂n
3      Extrapolate u
4      Periodic boundary
5      Reflecting boundary
An integer array bc can be used to describe the boundaries, where bc[side] contains the boundary condition code for the side side (side is an integer between 0 and 5). Boundary conditions are imposed as follows
for( side=0;side< 6;side++)
{
if( bc[side] == 1 )
bcgive( ni, nj, nk, w, side, val[side] );
else if( bc[side] == 2 )
bcgiveder( ni, nj, nk, w, side, dval[side] );
else if( bc[side] == 3 )
bcextrapol( ni, nj, nk, w, side );
else if( bc[side] == 4 )
bcper( ni, nj, nk, w, side );
}
(This does not take into account special treatment of corners.) We have here written one function for each type of boundary condition, each of which provides the boundary information to the array w. Here val[6] are constant values imposed on the boundary, but these can of course be made arrays, to describe given data which varies along the boundary.
The functions that impose boundary conditions can use the side variable to determine the direction and the coordinate to extrapolate, as shown in the example below.
void bcextrapol( int ni, int nj, int nk, array w, int side )
{
int ib=1, ie=ni, jb=1, je=nj, kb=1, ke=nk;
int il=0, jl=0, kl=0;
if( side == 0 )
{
ie = 1; il = 1;
}
else if( side == 1 )
{
ib = ni; il = -1;
}
else if( side == 2 )
{
je = 1; jl = 1;
}
else if( side == 3 )
{
jb = nj; jl = -1;
}
else if( side == 4 )
{
ke = 1; kl = 1;
}
else if( side == 5 )
{
kb = nk; kl = -1;
}
for( int k=kb ; k<=ke ; k++ )
for( int j=jb ; j<=je ; j++ )
for( int i=ib ; i<=ie ; i++ )
// Assume three dimensional indexation possible
w[i,j,k] = 2*w[i+il,j+jl,k+kl]-w[i+2*il,j+2*jl,k+2*kl];
}
The loop is restricted to run over one of the six sides only. The vector (il,jl,kl) is the normal direction, pointing into the domain. The side of the domain and the normal direction are first determined from the side number; the loop that imposes the boundary condition then works for any of the sides. The other types of boundary conditions (give, periodic, etc.) can be implemented similarly. With this technique, we avoid writing special boundary condition code for each particular side.
The method can be made more general by defining boundary windows. For each boundary condition, we declare a variable wind[6], which holds the limits of the part of the boundary where it should be imposed. Another variable dir[3] holds the normal direction. With these variables the extrapolation subroutine would be as follows
void bcextrapol( array w, int wind[6], int dir[3] )
{
for( int k=wind[4] ; k<=wind[5] ; k++ )
for( int j=wind[2] ; j<=wind[3] ; j++ )
for( int i=wind[0] ; i<=wind[1] ; i++ )
// Assume three dimensional indexation possible
w[i,j,k] = 2*w[i+dir[0],j+dir[1],k+dir[2]]-w[i+2*dir[0],j+2*dir[1],k+2*dir[2]];
}
Now we can have any number of windows which together cover the boundary. If we want to change the type of boundary condition in the middle of a side, we can easily create two different windows for the two parts of the side. It is also possible to select corners and edges in this way. The code needed to impose boundary conditions would be as follows
for( s=0; s<nw; s++ )
{
if( bc[s] == 1 )
bcgive( w, wind[s], dir[s], val[s] );
else if( bc[s] == 2 )
bcgiveder( w, wind[s], dir[s], dval[s] );
else if( bc[s] == 3 )
bcextrapol( w, wind[s], dir[s] );
else if( bc[s] == 4 )
bcper( w, wind[s], dir[s] );
}
The arrays int wind[nw][6] and int dir[nw][3] are set up during the problem definition, and the array int bc[nw] holds the boundary condition type codes as previously; nw is the number of windows. For example, consider the domain in the figure below.
Fig. 5.1. Example of boundary types: a 50 × 30 grid (i = 1, ..., 50, j = 1, ..., 30) with "reflect u" on part of the lower side, "give u" on the rest of the lower side and on the upper side, and "give du/dn" on the two vertical sides.
There are 50 × 30 grid points. On the lower side we want reflecting boundary conditions from i=1 to i=20, and given boundary data from i=21 to i=50. On the sides i=1 and i=50, we give the normal derivative of the solution, and on j=30 we give boundary data. We want the corners to belong to the lower and upper boundaries, not to the vertical sides. In that case the definition of the windows and boundary codes would be
wind[0][0] = 1; wind[0][1]=20; wind[0][2]=1; wind[0][3]=1; bc[0]=5;
wind[1][0] = 21; wind[1][1]=50; wind[1][2]=1; wind[1][3]=1; bc[1]=1;
wind[2][0] = 1; wind[2][1]=50; wind[2][2]=30; wind[2][3]=30; bc[2]=1;
wind[3][0] = 1; wind[3][1]=1; wind[3][2]=2; wind[3][3]=29; bc[3]=2;
wind[4][0] = 50; wind[4][1]=50; wind[4][2]=2; wind[4][3]=29; bc[4]=2;
Here nw is 5. In this two dimensional example the window has 4 components instead of 6 as in three dimensions. The normal directions in dir[nw][2] should also be defined.
5.3 Object oriented representation
The technique with integer codes for the sides and boundary windows is standard in many CFD codes, and is often implemented in Fortran. In C++ we can use inheritance to get a nice and clean representation of boundary conditions. Consider the base class
class bcbase
{
protected:
int wind[6];
int dir[3];
public:
// Other functions, constructor etc.
virtual void impose( array w )=0;
};
We then let each type of boundary condition be represented by a class derived from bcbase. For example, the general extrapolation formula
$$w_i = \sum_{m=1}^{p} c_m w_{i+m}$$
can be represented as follows.
class bcextrapol : public bcbase
{
double* cofs;
int ncof;
public:
// Constructor and other member functions here
void impose( array w );
};
void bcextrapol::impose( array w )
{
for( int k=wind[4] ; k<=wind[5] ; k++ )
for( int j=wind[2] ; j<=wind[3] ; j++ )
for( int i=wind[0] ; i<=wind[1] ; i++ )
{
w[i,j,k]=0;
// the coefficients c_m are stored in cofs[1],...,cofs[ncof-1]
for( int m=1 ; m<ncof ; m++ )
w[i,j,k] += cofs[m]*w[i+m*dir[0],j+m*dir[1],k+m*dir[2]];
}
}
Other boundary conditions can be derived from bcbase similarly,
class bcgive : public bcbase
{
double value;
public:
// Constructor and other member functions here
void impose( array w );
};
void bcgive::impose( array w )
{
for( int k=wind[4] ; k<=wind[5] ; k++ )
for( int j=wind[2] ; j<=wind[3] ; j++ )
for( int i=wind[0] ; i<=wind[1] ; i++ )
w[i,j,k]=value;
}
With these classes, we implement the boundary condition as an array of bcbase objects.
int main()
{
....
bcbase **bcs; // pointers to the boundary conditions
int nbc; // Number of boundary conditions
bcs = new bcbase*[nbc];
...
bcs[0] = new bcextrapol(2,-1,wind0);
bcs[1] = new bcgive(4.5,wind1);
...
Once the boundary objects are allocated, imposing boundary conditions on the array
w becomes one simple loop
for( int b=0 ; b<nbc ; b++ )
bcs[b]->impose(w);
6 Distribution of an array on a parallel computer
We consider the distributed memory model, in which the parallel computer is described by
a set of processors, each with its own local memory. The processors can send data to each
other through a network. This has to be done explicitly by the programmer through calls
to functions in a special communication library. Examples of communication libraries are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).
A one dimensional array is distributed onto the parallel machine as follows.
Distribution of a one dimensional domain onto 4 processors, P1, P2, P3, P4.
Assume that there are p processors in the computer, and that the array has N elements. Divide the domain into p pieces, one for each processor. Let s be the number of points in one piece, and let a be the amount of overlap between neighbors. In the figure above, p = 4, a = 2, s = 5, N = 14. The size of the overlap depends on the width of the computational stencil that we will use. For a three point explicit difference method, a = 2 is appropriate. The number of elements inside one processor is denoted by s, and the array elements are locally enumerated from 1 to s.
We want the number of points in each processor to be equal, so that the work in parallel becomes balanced, i.e., each processor performs the same amount of work. We assume that the same operation is going to be made at all elements of the array, such as, e.g., the evaluation of a difference formula. We have introduced a certain amount of overlap between the processors. Points near the boundaries are replicated in the neighboring processor. This is done to facilitate the evaluation of difference stencils; for example, letting two points overlap, we can evaluate a three point stencil at any point of the interior of the array.
Assume that we are given a difference scheme
$$u^{n+1}_i = G(u^n_{i-q}, \ldots, u^n_{i+q}), \quad i = 1, 2, \ldots, N,$$
where n is the time iteration index, and i denotes grid points in one space dimension. On a one-processor computer, we represent the solution at a certain time as an array double u[N], and use the difference scheme to advance u in time (at boundaries special formulas might be required).
On a parallel computer, the function $u_i$ is represented by arrays of length s in each processor, v[s].
To advance the solution one step in time by an explicit difference scheme, the following steps are taken. We assume that the width of the stencil is 2q + 1, which makes it natural to take the overlap a = 2q. (For example, a three point scheme has q = 1, which gives a = 2.)
1. Use the difference method $u^{n+1}_i = G(u^n_{i-q}, \ldots, u^n_{i+q})$ at the interior points ($q + 1 \le i \le s - q$) in each processor. This step is done in parallel in all processors at the same time.
2. Send elements $q + 1 \le i \le 2q$ to the left.
3. Receive elements $s - q + 1 \le i \le s$ from the right.
4. Send elements $s - 2q + 1 \le i \le s - q$ to the right.
5. Receive elements $1 \le i \le q$ from the left.
6. Impose boundary conditions at the ends of the domain; this step is only done in processor 1 and processor p.
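A minimal MPI sketch of steps 2-5 might look as follows. It uses 0-based local indices, v[0..s-1], and combines each matching send/receive pair into one MPI_Sendrecv call (the steps above describe them as separate operations; combining them is one common choice). Boundary processors pass MPI_PROC_NULL, which turns the missing exchange into a no-op.

#include <mpi.h>

// Exchange the overlap regions of the local array v (length s) with
// the left and right neighbors, for a stencil of half-width q (a = 2q).
void exchange_overlap( double* v, int s, int q, MPI_Comm comm )
{
    int myid, np, tag = 0;
    MPI_Comm_rank( comm, &myid );
    MPI_Comm_size( comm, &np );
    int left  = myid > 0    ? myid-1 : MPI_PROC_NULL;
    int right = myid < np-1 ? myid+1 : MPI_PROC_NULL;
    MPI_Status status;

    // send q points to the left, receive q points from the right
    MPI_Sendrecv( &v[q],   q, MPI_DOUBLE, left,  tag,
                  &v[s-q], q, MPI_DOUBLE, right, tag, comm, &status );
    // send q points to the right, receive q points from the left
    MPI_Sendrecv( &v[s-2*q], q, MPI_DOUBLE, right, tag,
                  &v[0],     q, MPI_DOUBLE, left,  tag, comm, &status );
}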
6.1 Useful formulas
We next give general formulas for distributing an array on a parallel computer in a balanced way. The formulas are useful when all processors are of equal type, and the arithmetic work to be done is equal in each point of the array. When the processors are not identical, so that performance characteristics differ between the nodes, for example in a cluster of heterogeneous workstations, the distribution of the array should be done to take this into account. Faster processors should hold a larger part of the array than slower processors. Similar considerations must be made if the computation itself is not homogeneous, for example if some part of the domain is not used due to a cut out hole in an overlapping grid configuration.
Assume that the processors are enumerated from 1 to $p$. We put $s_k$ points in processor $k$. We let the local index in processor $k$, $i_k$, vary from 1 to $s_k$. The width of the overlap regions is $a$ ($a = 2$ in the figure above). The global index, $i$, varies between 0 and $n$. The total number of points is
$$s_1 - a + s_2 - a + \ldots + s_{p-1} - a + s_p = n + 1.$$
Let $N = n + 1$; we then obtain
$$\sum_{k=1}^{p} s_k = N + (p-1)a.$$
To distribute this as evenly as possible, we choose
$$s = \left[ \frac{N + (p-1)a}{p} \right],$$
where $[x]$ denotes the integer part. The remaining points, if there are any, we distribute evenly to the first processors. Define the remainder,
$$r = \left( N + (p-1)a \right) \bmod p,$$
and take
$$s_k = \begin{cases} s + 1 & \text{for } k \le r \\ s & \text{for } k > r. \end{cases}$$
The index transformation $t_k$ is an integer such that $i_k + t_k = i$, with $i_k$ the local processor index and $i$ the global domain index. The formula for $t_k$ is
$$t_k = -1 + (k-1)(s-a) + \min(k-1, r). \qquad (6.1)$$
To derive this, we start from the identity
$$s_{k-1} + t_{k-1} = a + t_k,$$
which states that the last point in processor $k-1$ is identical to the point at local index $a$ in processor $k$ (see the figure above). We then iterate down to $k = 1$ as follows:
$$t_k = -a + s_{k-1} + t_{k-1} = -2a + s_{k-1} + s_{k-2} + t_{k-2} = \ldots = -(k-1)a + s_{k-1} + \ldots + s_1 + t_1.$$
Now $t_1 = -1$, and $s_k = s$ or $s_k = s + 1$, so that we obtain
$$t_k = -(k-1)a - 1 + (k-1)s + \min(k-1, r),$$
from which (6.1) follows.
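As a sketch, the formulas for $s_k$ and $t_k$ translate directly into code (function and variable names are placeholders):

// Local size s_k and index offset t_k for processor k (1 <= k <= p),
// given N global points and overlap width a.
void local_piece( int N, int p, int a, int k, int* sk, int* tk )
{
    int s = (N + (p-1)*a)/p;          // integer part
    int r = (N + (p-1)*a) % p;        // remainder
    *sk = ( k <= r ) ? s+1 : s;
    int mn = ( k-1 < r ) ? k-1 : r;   // min(k-1, r)
    *tk = -1 + (k-1)*(s-a) + mn;      // formula (6.1)
}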
If the number of space dimensions is greater than one, we first divide the processors into a processor grid, i.e., we split the number of processors $p$ into factors $p = p_1 p_2$ for two dimensions, or $p = p_1 p_2 p_3$ for three dimensions.
The one dimensional procedure above can then be used dimension by dimension. If the array has $n_1 \times n_2 \times n_3$ elements, $n_1$ is distributed onto $p_1$ nodes, $n_2$ onto $p_2$ nodes, and $n_3$ onto $p_3$ nodes. The processor identity $(c_1, c_2, c_3)$ of my processor in the processor cube of $p_1 \times p_2 \times p_3$ processors is then required. We determine the processor coordinates by inverting the mapping
$$pr = c_1 - 1 + p_1(c_2 - 1) + p_1 p_2 (c_3 - 1). \qquad (6.2)$$
Here $1 \le c_i \le p_i$, and $0 \le pr \le p - 1$; $pr$ is my identity in the one dimensional processor enumeration, which is usually given by the communication library, for example by the function MPI_Comm_rank in MPI. $c_1$ is obtained by taking the remainder from dividing $pr$ by $p_1$, and the quotient gives a new similar problem for $c_2$. $c_i$ then plays the role of $k$ in the one dimensional formulas above, and $p_i$ plays the role of $p$. Thus, we can compute the required quantities $s_k$, $t_k$ for each direction separately.
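As a sketch, inverting (6.2) amounts to two integer divisions with remainder (1-based coordinates, 0-based rank):

// Processor coordinates (c1,c2,c3) from the one dimensional rank pr,
// by inverting the mapping (6.2).
void proc_coords( int pr, int p1, int p2, int* c1, int* c2, int* c3 )
{
    *c1 = pr % p1 + 1;
    pr  = pr / p1;
    *c2 = pr % p2 + 1;
    *c3 = pr / p2 + 1;
}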
When computing with three dimensional difference schemes, we use the same procedure as described above for one dimension, communicating overlap boundaries to neighboring processors. The communication is done dimension by dimension: first the communication is done one dimensionally in the i-direction for all points on the i boundaries, where one or several planes of varying j, k indices are sent and received. Then the same thing is done in the j-direction, and finally in the k-direction.
We know that the neighbors in the i-direction of the processor with id $(c_1, c_2, c_3)$ are $(c_1 \pm 1, c_2, c_3)$. However, the send and receive operations in the communication library often require the one dimensional processor enumeration. We thus have to use the mapping (6.2) here to determine the one dimensional id of the neighbors.
For optimal performance, the factorization of the processors should be done such that each processor holds a piece that is as close to a square or a cube as possible. This will minimize the size of the communication boundary. In two dimensions, for a matrix of $n_1 \times n_2$ points we would like to have
$$n_1/p_1 = n_2/p_2.$$
This is of course not always possible to achieve with integers; we then try instead to determine $p_1$ and $p_2$ such that the expression
$$m = |n_1/p_1 - n_2/p_2|$$
is minimized under the condition that $p = p_1 p_2$. If the number of processors is moderate, it is not unreasonable to explicitly evaluate $m$ for all possible factorizations of $p$, and return the minimum. If $p$ is a prime number, only the cases $p_1 = p$, $p_2 = 1$ and $p_1 = 1$, $p_2 = p$ are possible. Taking the number of processors, $p$, to be a number with many factors greatly increases the possibility of distributing the array efficiently.
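Such an exhaustive search over the factorizations is only a few lines of code (a sketch; integer division stands in for the exact ratios):

#include <cstdlib>

// Choose p1, p2 with p = p1*p2, minimizing |n1/p1 - n2/p2|.
void best_factors( int p, int n1, int n2, int* p1, int* p2 )
{
    int best = -1;
    for( int f=1 ; f<=p ; f++ )
    {
        if( p % f != 0 )
            continue;
        int m = abs( n1/f - n2/(p/f) );
        if( best < 0 || m < best )
        {
            best = m;
            *p1 = f;
            *p2 = p/f;
        }
    }
}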
6.2 Programming exercise: A parallel array class
In this exercise you will define a two dimensional array class on a parallel computer with distributed memory. The array should be distributed over the processors as evenly as possible.
In the included example we show a simple array class for two dimensional arrays. First study this example program, until you are sure that you understand it well. (this is a keyword in C++, giving a pointer to the object itself.)
#include <iostream>
using namespace std;
class arc
{
int m, n;
double *v;
public:
arc( int i, int j );
arc( const arc& a );
~arc();
void out();
inline double &operator()( int i, int j )
{
return v[i-1 + m*(j-1)];
}
};
arc::arc( int i, int j )
// Constructor, giving dimensions only
{
m = i; n = j;
v = new double[m*n];
}
arc::arc( const arc& a )
// Copy-constructor
{
m = a.m; n = a.n;
v = new double[m*n];
for( int i=0 ; i < m*n ; i++ )
v[i] = a.v[i];
}
void arc::out()
{
for( int j= 1 ; j<=n ; j++ )
for( int i= 1 ; i<=m ; i++ )
cout << "ar(" << i << "," << j << ") = " <<
(*this)(i,j) << endl;
}
arc::~arc()
{
delete[] v;
}
// Example main program
int main()
{
arc d1(4,5);
for( int j=1 ; j<=5 ; j++ )
for( int i=1 ; i<=4 ; i++ )
{
d1(i,j) = i*10 +j;
}
d1.out();
}
You can use the file starting-point.C as a starting point for your parallel array class. In that file, the necessary variables are given, and the member functions are described by comments. It will be your task to implement these functions. The file can be downloaded from the course home page.
1. Add the extra variables necessary to describe the distribution on the parallel machine. In starting-point.C this is already done. You will have to write the code necessary for initializing these variables with appropriate values. It is suggested that you do this in the function arc::set_up. Use MPI to obtain information about the number of nodes and your processor identity. Use the formulas given in Section 6.1 for some of the variables.
2. Write the functions for reading and writing the array to and from disk. You can either send one patch at a time to a dedicated processor which handles all the I/O, or you can let each processor write its own patch. In the latter case, it is necessary to synchronize the processors to make sure that two processors do not access the file at the same time. The I/O must be done such that you never collect the entire global array in one node. Use the unix functions open, read, and write as you did in the grid generation exercise. You will need the function lseek to address the local patches in the file (to get information, do man lseek in an xterm window); see the sketch after this list.
3. Write a function, belonging to the array class, which communicates the overlap regions of the array between nodes. Use MPI. This is done one-dimensionally, one dimension at a time.
4. Read the grid you constructed in the first exercise into array objects x and y. Create another array object, representing the function $u(x, y) = \sin(x^2)\cos x + y$ on the grid. Use the centered difference operator $D_0$ to compute an approximation of the derivative with respect to x of the function. Call the communication function, and apply the operator a second time to compute the second derivative with respect to x. Store the result in a new object. Write the result to a file, and use Matlab to visualize the result, and compare with the exact second derivative.
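For item 2, a sketch of per-processor output using lseek could look as follows. It assumes the global $n_1 \times n_2$ array is stored on disk column by column, that the local patch is $m_1 \times m_2$, and that $(t_1, t_2)$ are the index offsets of Section 6.1 in each direction; all names are placeholders and error checking is omitted.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

// Each processor writes its own m1 x m2 patch of a global n1 x n2
// array, stored column by column in the file. Local indices are
// 1-based in Section 6.1, so local column j+1 (j = 0,...,m2-1)
// starts at global element (t2+j+1)*n1 + (t1+1).
void write_patch( const char* fname, const double* v,
                  int m1, int m2, int n1, int t1, int t2 )
{
    int fd = open( fname, O_WRONLY | O_CREAT, 0644 );
    for( int j=0 ; j<m2 ; j++ )
    {
        off_t pos = ( (off_t)(t2+j+1)*n1 + t1+1 )*sizeof(double);
        lseek( fd, pos, SEEK_SET );
        write( fd, v + (size_t)j*m1, m1*sizeof(double) );
    }
    close( fd );
}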
6.2.1 The parallel computer Kallsup2
Kallsup2 is a part of the IBM/SP2 machine at PDC. Kallsup2 is the name of three power3 boards, each with 16 cpus, in total 48 processors.
To log in to the machine, give the command
rxtelnet kallsup2
If rxtelnet is not defined, you have to add the module athena:
module add athena
On kallsup2, the command to compile C++ programs is
xlC file.C
When using a parallel program with MPI, the command
mpCC file.C
should be used instead. mpCC sets up paths to include files and libraries required for MPI and uses xlC for compiling. To run in parallel interactively, you must first issue the
command
spattach -i -p 2N
to run on two processors. Substitute 2 for another number of processors that you would
like to use. You can now run your parallel program by typing
a.out
To run with fewer processors than originally allocated
a.out -procs 1
When you have finished using the parallel program, you have to give back the processors
by doing
exit
This interactive processor allocation is not guaranteed to give physically different processors. Some of your MPI processes might be executed as different processes on the same processor. Furthermore, there might be more than one user executing on the cpus at the same time. This means that measuring the performance of parallel programs is not completely reliable.
Another possibility is to run the program in batch mode. In batch mode, you send
the program to a queue, where it waits its turn to be executed. In batch mode, only your
own program executes on the processor. To run on one Kallsup node in batch mode (=
16 processors) for 10 minutes give the command
spsubmit -p 1K -t 10 batchscript
where batchscript is a shell script which runs your program. An example script is given
below.
81
#!/bin/csh
#
# Submit MPI job on SMP nodes. SP_PROCS given as argument to
# script is number of nodes.
#
# PROCS_PER_NODE <= 4 when using the old IBM SP switch.
# PROCS_PER_NODE <= 16 on Nighthawks at PDC with switch 2.
#
# M - nodes have 4 procs/node
# N - nodes have 8 procs/node
# K - nodes have 16 procs/node
#
#
#-------- Set number of processes and switch type
setenv PROCS_PER_NODE 16
@ totproc = $PROCS_PER_NODE * $SP_PROCS
setenv MP_PROCS $totproc
setenv MP_HOSTFILE $SP_HOSTFILE
setenv MP_EUILIB us
setenv MP_EUIDEVICE css0
setenv MP_PULSE 0
#-------- Create host file
setenv MP_HOSTFILE /tmp/${USER}_hostfile_${SP_JID}
if( -e $MP_HOSTFILE )then
setenv MP_HOSTFILE ${MP_HOSTFILE}_$$
endif
touch $MP_HOSTFILE
awk "{ for(i=$PROCS_PER_NODE;i>0;i--) print \$0 }" \
$SP_HOSTFILE >> $MP_HOSTFILE
cd $SP_INITIALDIR
#--------- User commands
echo "Using $SP_PROCS nodes with $PROCS_PER_NODE processes per node"
echo "Started execution `date`"
poe a.out
echo "Done job `date`"
7 Parallel computing
Fig. 7.1. Computer with distributed memory. Fig. 7.2. Computer with shared memory.
In this section we will consider computers with more than one processor. These machines
can be either of the type with distributed memory, or of the type with shared memory.
A computer with distributed memory can be thought of as in Fig. 7.1. The processors,
P, have their own memory, M, and can talk to each other through a communication net-
work. A computer with shared memory is like the picture in Fig. 7.2, where all processors
are connected to the same memory through a network.
Computers with shared memory often have fewer processors (typically up to 8 or 16) than computers with distributed memory. The connection from processor to memory can easily become congested when the number of processors is very large. On the other hand, computers with shared memory are often easier to program. Many modern workstation servers come with a few processors, connected as a shared memory computer.
Computers with distributed memory scale better with the number of processors, and there exist today such computers which have more than a thousand processors. These computers often come with a special communication network, which allows very high communication speeds. However, it is also possible to use a number of connected workstations as a parallel computer with distributed memory. The communication speed is then of course not as high as in a specialized network.
Recently, the IBM/SP2 computer at PDC has been provided with so-called SMP nodes. Each node is a part of the distributed memory machine, as in Fig. 7.1, but the
nodes consist of several processors on a shared memory. The situation is outlined in
Fig. 7.3.
Fig. 7.3. One SMP node in a computer with distributed memory.
Figs. 7.1-7.3 are only models of computers. The reality is of course more complicated. A computer with shared memory can have software which emulates distributed memory, and vice versa. There are smaller cache memories close to the processors, which are not considered in the figures. Disks are connected to the networks.
Parallel execution of a program can be achieved in different ways. The simplest is to have a compiler which automatically makes the program parallel. Another way of getting parallel execution is to use calls to library functions in the program, to get full control over the parallelism. Programming on a lower level (such as explicitly sending and receiving data by function calls) usually gives lower execution times, but more programming work. The final choice depends on the problem, the computer, and its software.
In the distributed memory model, the parallel computer is described by a set of processors, each with its own local memory. The processors can send data to each other through a network. This has to be done explicitly by the programmer through calls to functions in a special communication library. Examples of communication libraries are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine). We give an example of an MPI program in Code 7.1. The average
$$v_i = (u_{i+1} + 2u_i + u_{i-1})/4$$
is computed. The arrays are first distributed onto the computer (with overlap two) as described in Chapter 6. This is assumed to have been done before the function in Code 7.1 is called.
//
// Average. It is assumed that the array u has been distributed
// onto the computer
//
void average( int n, double *u, double* v )
{
for( int i=1 ; i<n-1 ;i++ )
v[i] = (u[i+1]+2*u[i]+u[i-1])/4;
int myid, np;
MPI_Comm_size( MPI_COMM_WORLD, &np );
MPI_Comm_rank( MPI_COMM_WORLD, &myid );
int tag1=99, tag2=98;
MPI_Status status;
if( myid != 0 )
MPI_Send(&v[1], 1, MPI_DOUBLE, myid-1, tag1, MPI_COMM_WORLD );
if( myid != np-1 )
{
MPI_Recv(&v[n-1], 1, MPI_DOUBLE, myid+1, tag1, MPI_COMM_WORLD, &status );
MPI_Send(&v[n-2], 1, MPI_DOUBLE, myid+1, tag2, MPI_COMM_WORLD );
}
if( myid != 0 )
MPI_Recv(&v[0], 1, MPI_DOUBLE, myid-1, tag2, MPI_COMM_WORLD, &status );
}
Code 7.1. Parallel average computation by MPI.
For shared memory, parallelization is often done by the compiler on the loop level, by inserting directives to the compiler into the code. Different vendors have different standards for directives. Recently a common standard for such compiler directives, called OpenMP, has emerged. In Code 7.2 we show an example of parallelization by OpenMP. Here the execution is serial until the loop is reached. To execute the loop, the compiler divides ng by the number of processors to get a local loop length, n. Each processor then executes a loop of this local length, i.e., processor 0 iterates from 1 to n, processor 1 from n + 1 to 2n, etc. The loop iteration variable i is copied so that each processor gets its own local copy of i. More advanced constructs are possible: variables can be declared local, or to be kept in shared memory, etc.
//
// Average. The entire array is in shared memory.
//
void ave( int ng, double *u, double *v )
{
#pragma omp parallel for
for( int i=1 ; i<ng-1 ; i++ )
v[i] = (u[i+1]+2*u[i]+u[i-1])/4;
}
Code 7.2. Parallel average computation by OpenMP.
The directive to the compiler begins with #pragma omp. The distribution of the array,
is done automatically by the compiler.
7.1 Performance models of parallel programs
The purpose of parallel computing is to reduce the execution time. Before implementing a parallel algorithm, it is advantageous to first develop a model for the execution time. The execution time consists of arithmetic time and communication time. For the arithmetic time, we introduce the time it takes to do one arithmetic operation, $T_a$.
The time to send a message of $n$ bytes between two processors is often modelled as
$$T_c(n) = \alpha + \beta n.$$
Here $\alpha$ is the start-up time to initialize the communication, and $\beta$ is the time to send one byte of data. In practice, $\alpha \gg \beta$, so that the start-up time is significant. Avoid sending the data one byte at a time; it is better to send it in larger chunks. The above is, of course, a very simplified model of reality. The model gives, however, a good approximation to the real execution time, and it is often used for comparison between different algorithms. It furthermore gives insight into which parts of the problem are critical for performance, and how these bottlenecks depend on the characteristic parameters of the computer.
For the computers at PDC, I have estimated $\alpha$ and $\beta$ by running a program which passes a message around the computer, with the result shown in Table 7.1.

Table 7.1. Parameters in the communication time model.
Computer      α        β
IBM/SP2       43 µs    0.01 µs/byte
Cray J932     10 ms    0.10 µs/byte
Pile of PCs   10 ms    0.56 µs/byte
The final result of the modelling is often an approximate expression for the total execution time of the program,
$$T(N, P),$$
where $N$ is the problem size (for example the number of grid points), and $P$ is the number of processors used. The behavior for large $N$ is often interesting, and is described by the ordo notation. For example, $T(N) = O(N^3)$ means that $T(N)$ increases cubically with $N$ when $N$ becomes large. More precisely expressed: there exist an $N_0$ and a constant $c$ such that $T(N) \le cN^3$ for all $N > N_0$.
The speed-up is defined as
$$S(N, P) = T(N, 1)/T(N, P),$$
the ratio of the time to run the problem on one processor to the time on $P$ processors. Note that these are actual execution times; some authors use as the one-processor time $T(N, 1)$ the time for the best possible implementation, not necessarily the same program as used for the $P$ processor computation. Perfect speed-up is obtained when $S(N, P) = P$, i.e., running on $P$ processors gives $P$ times lower execution time. Computations where $N \approx P$ have fine grained parallelism, and when $N \gg P$, we say that the parallelism is coarse grained.
Note that the speed-up depends on two quantities, the problem size and the number of processors. To present a meaningful picture of the execution time, both parameters should be varied. For example, we could present the measured execution time as a function of $P$ for a few fixed values of $N$. The parallel properties of a program are most easily seen when $P$ varies and $N$ is fixed.
In some texts, the efficiency, $E$, is used instead of the speed-up. The efficiency is defined as
$$E(N, P) = S(N, P)/P.$$
Let us next consider an example. We are given a PDE problem in two space dimensions. The number of grid points is $n_1 \times n_2$. We step forward in time (from $t_n$ to $t_{n+1}$) by the difference scheme
$$u^{n+1}_{ij} = S(u^n_{i+1,j}, u^n_{i-1,j}, u^n_{i,j+1}, u^n_{i,j-1}), \quad 1 \le i \le n_1, \quad 1 \le j \le n_2.$$
The notation is as usual: $u^n_{i,j}$ denotes the numerical approximation of the function at the time $t_n$ at the grid point $(i, j)$, i.e., $u(t_n, x_{i,j}, y_{i,j})$. Assume now that the computational domain has been split up onto $P$ processors. There are $p_1$ processors in the i-direction, and $p_2$ processors in the j-direction, $P = p_1 p_2$. The number of grid points in each processor is then $m_1 \times m_2$, where $m_1 = n_1/p_1$ and $m_2 = n_2/p_2$. We assume that $p_1$ and $p_2$ divide $n_1$ and $n_2$ exactly. To advance one time step we
1. Take one step with S in each processor.
2. Communicate the overlap region to the right in the i-direction.
3. Communicate the overlap region to the left in the i-direction.
4. Communicate the overlap region to the right in the j-direction.
5. Communicate the overlap region to the left in the j-direction.
This is exactly the same procedure as was given in one space dimension in Chapter 6; see the discussion there for details. We now model the time it takes to do one time step. Assume that $C$ arithmetic operations are needed to evaluate the difference scheme $S$. If the program were run on only one processor, there would be no communication, and the execution time would be $T(N, 1) = n_1 n_2 C T_a = N C T_a$. On $P = p_1 p_2$ processors, we obtain
$$T(N, P) = m_1 m_2 C T_a + 2T_c(m_1 a) + 2T_c(m_2 a),$$
where $a$ is the size of the overlap. Using $T_c(n) = \alpha + \beta n$, we get
$$T(N, P) = N C T_a / P + 4\alpha + 2a\beta(n_1/p_1 + n_2/p_2). \qquad (7.1)$$
The speed-up becomes
$$S(N, P) = \frac{N C T_a}{N C T_a/P + 4\alpha + 2a\beta(n_1/p_1 + n_2/p_2)} = \frac{P}{1 + 4\alpha P/(T_a N C) + (2a\beta/(C T_a))(p_2/n_2 + p_1/n_1)}.$$
From this we can see that the speed-up is close to perfect if
$$\frac{\alpha P}{C T_a N} \ll 1 \quad \text{and} \quad \frac{\beta}{C T_a m_i} \ll 1, \quad i = 1, 2.$$
The second condition is usually true: $C$ and $m_1$ or $m_2$ are usually much larger than one, and on many computers $T_a$ is of the same order of magnitude as $\beta$. The first condition is always true if $N$ is taken sufficiently large, but on most computers $\alpha \gg T_a$, so it is possible that $N$ must be fairly large.
From the expression (7.1), we can compute the best possible factorization of the processors, $P = p_1 p_2$. The first two terms of (7.1) do not depend on the factorization. We write the third term as
$$2a\beta(n_1/p_1 + n_2 p_1/P)$$
and minimize this with respect to $p_1$. The $p_1$ derivative is
$$2a\beta(-n_1/p_1^2 + n_2/P),$$
which is zero for $p_1^2 = P n_1/n_2$. It is easy to see that this is a minimum. We should thus take $p_1^2 = p_1 p_2 n_1/n_2$, i.e., such that
$$n_1/p_1 = n_2/p_2:$$
the local number of grid points in each processor should be the same in each direction.
The typical behavior of the execution time as a function of $P$ is shown in Fig. 7.4. We note that the problem is perfectly parallel in the sense that the time always goes down when $P$ is increased. However, when $P$ is large, increasing $P$ does not give a very large improvement.
Fig. 7.4. Execution time vs. number of processors.