
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com


Volume 8, Issue 2, March - April 2019 ISSN 2278-6856

Distributed Performance Improvement of Alternating Iterative Methods Running on Master-Worker Paradigm with MPI

Ewedafe Simon Uzezi (1), Rio Hirowati Shariffudin (2)

(1) Department of Computing, Faculty of Science and Technology, The University of the West Indies, Mona Kingston 7, Jamaica
(2) Institute of Mathematical Sciences, Faculty of Science, University of Malaya, 50603 Kuala Lumpur, Malaysia.

Abstract: In this paper we present methods that can improve numerical parallel implementation, using a scalability-related measure called the input file affinity measure, for the 2-D Telegraph Equation running on a master-worker paradigm. The implementation was carried out using the Message Passing Interface (MPI). The Telegraph Equation was discretized using the Alternating Direction Implicit (ADI) method, the Iterative Alternating Direction Explicit method by D'Yakonov (IADE-DY) and the Mitchell-Fairweather Double Sweep method (MF-DS). The parallel performance of these methods was experimentally evaluated using the Input File Affinity Measure (Iaff), and numerical results show their effectiveness and parallel performance. We show in this paper that the input file affinity measure scales well while requiring little information to schedule tasks.

Keywords: Parallel Performance, Input File Affinity, Master-Worker, 2-D Telegraph Equation, MPI.

1. INTRODUCTION

Computing infrastructures are reaching an unprecedented degree of complexity. First, parallel processing is coming to the mainstream, because of the frequency and power consumption wall that leads to the design of multi-core processors. Second, there is a wide adoption of distributed processing technologies because of the deployment of the Internet and, consequently, of large-scale grid and cloud infrastructures. In the master-worker paradigm a master node is responsible for scheduling computation among the workers and collecting the results [5]. The master-worker paradigm has fundamental limitations: both communications from the master and contention in accessing file repositories may become bottlenecks to the overall scheduling scheme, causing scalability problems. Parallelization of Partial Differential Equations (PDEs) by time decomposition was first proposed by [22], following earlier efforts at space-time methods [15, 16, 30]. The application of alternating iterative methods to the 2-D telegraph equation has shown that they require high computational power and communication. In the world of parallel computing MPI is the de facto standard for implementing programs on multiprocessors. To help with program development under a distributed computing environment a number of software tools have been developed; MPI [13] is chosen here since it has a large user group.

An alternative and cost-effective means of achieving comparable performance is by way of distributed computing, using a system of processors loosely connected through a Local Area Network (LAN). Relevant data need to be passed from processor to processor through a message passing mechanism [12, 18, 27]. The telegraph equation is important for modeling several relevant problems such as wave propagation [24], among others. In this paper, a number of iterative methods are developed to solve the telegraph equation [8]. Some of these iterative schemes have been employed on various parallel platforms [15, 16, 30]. Parallel algorithms have been implemented for the finite difference method [9, 10, 8]; the discrete eigenfunctions method was used in [1], and [27] used the AGE method on a 1-D telegraph problem, but without a parallel implementation aimed at performance improvement. We present in this paper algorithms which have scalability performance comparable to other algorithms in several circumstances. We describe the design of our parallel system through Iaff for understanding parallel execution. Through several implementations and performance improvement analyses of the alternating iterative methods, we explore how to use the Iaff on the master-slave paradigm for identifying parallel performance on the 2-D telegraph equation. We assume the amount of work is increased exclusively by increasing the number of tasks. The reason for this is that it is relatively easy to increase the range of parameters to be analyzed in an application. On the other hand, in order to increase the amount of work related to each individual task, the input data for that task would have to be changed, which may change the nature of the problem to be processed by those tasks and is not always feasible.

It is worth noting that some of the related works mentioned focus on the problem of efficiently scheduling the decomposition to execute on distributed architectures organized either as pure master-slave or hierarchical platforms, while lacking scalability analysis. We consider that the amount of computation associated with each task is fixed, each task depends on one or more input files for execution, and an input file can be shared among tasks [14].


In this work we assume that both the size of the input files and the dependencies among input files and tasks are known. We also consider that the master node has access to a storage system which serves as the repository of all input and output files. This paper is organized as follows: Section 1.1 reviews previous research work. Section 2 presents the model for the 2-D telegraph equation and introduces the IADE-DY and DS-MF schemes. Section 3 describes the input file affinity measure. Section 4 describes the parallel implementation of the algorithms. Section 5 presents the numerical and experimental results of the schemes under consideration. Finally, Section 6 presents our conclusions.

1.1 Previous research work
Parallel performance of algorithms has been studied in several papers during the last years [7]. Bag of Tasks (BoT) applications composed of independent tasks with file sharing have appeared in [8, 14, 19]. Their relevance has motivated the development of specialized environments which aim to facilitate the execution of large BoT applications on computational grids and clusters, such as the AppLeS Parameter-Sweep Template (APST) [4]. Giersch et al. [14] proved theoretical limits for the computational complexity associated with the scheduling problem. In [19], an iterative scheduling approach was proposed that produces effective and efficient schedules compared to the previous works in [5, 14]. However, for homogeneous platforms, the algorithms proposed in [19] can be considerably simplified, in a way that they become equivalent to algorithms proposed previously. Fabricio [11] analyzes the scalability of BoT applications running on master-slave platforms and proposed a scalability-related measure. Eric [2] presents a detailed study on the scheduling of tasks in the parareal algorithm; it proposes two algorithms, one which uses a manager-worker paradigm with overlap of sequential and parallel phases, and a second that is completely distributed. Hinde et al. [17] proposed a generic approach to embed the master-worker paradigm into software component models and describe how this generic approach can be implemented within an existing software component model. On the numerical computing side, [29] went in search of numerical consistency in parallel programming and presented methods that can drastically improve numerical consistency for parallel calculations across varying numbers of processors. In [9, 10], parallel implementations of the 2-D telegraph equation using the SPMD technique and a domain decomposition strategy were researched, and a model was presented that overlaps communication with computation to avoid unnecessary synchronization, thus yielding significant speed-up. On the parallel computing front, [28] proposed a parallel ADI solver for a linear array of processors. The ADI method has been used in [23, 26] for solving heat equations in 2-D. Approaches to solving the telegraph equation using different numerical schemes have been treated in [8].

2. TELEGRAPH EQUATION
We consider the second order telegraph equation as given in [8]:

\frac{\partial^2 v}{\partial t^2} + a\frac{\partial v}{\partial t} = b\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right), \quad 0 \le x \le 1, \; 0 \le y \le 1, \; t \ge 0    (2.1)

Extending the finite difference scheme to the telegraph equation (2.1) gives:

\frac{v_{i,j}^{n+1} - 2v_{i,j}^{n} + v_{i,j}^{n-1}}{(\Delta t)^2} + a\,\frac{v_{i,j}^{n+1} - v_{i,j}^{n-1}}{2\Delta t} - b\left(\frac{v_{i-1,j}^{n+1} - 2v_{i,j}^{n+1} + v_{i+1,j}^{n+1}}{(\Delta x)^2} + \frac{v_{i,j-1}^{n+1} - 2v_{i,j}^{n+1} + v_{i,j+1}^{n+1}}{(\Delta y)^2}\right) = 0    (2.2)

Although this simple implicit scheme is unconditionally stable, we need to solve a penta-diagonal system of algebraic equations at each time step; therefore the computational time is huge.

2.1 The 2-D ADI Method
In this section, we derive the 2-D ADI scheme of the simple implicit FDTD method by using a general ADI procedure extended to (2.1). Equation (2.2) can be rewritten as:

\left(I + \sum_{m=1}^{2} A_m\right) v_{i,j}^{n+1} - 2C_o v_{i,j}^{n} + C_1 v_{i,j}^{n-1} = 0    (2.3)

Sub-iteration 1 is given by:

-\rho_x v_{i-1,j}^{n+1(1)} + (1 + 2\rho_x)\, v_{i,j}^{n+1(1)} - \rho_x v_{i+1,j}^{n+1(1)} = -(A_2)\, v_{i,j}^{n+1(*)} + \left(2C_o v_{i,j}^{n} - C_1 v_{i,j}^{n-1}\right), \quad \forall\, i,j    (2.4)

For various values of i and j, (2.4) can be written in a more compact matrix form at the (k + 1/2) time level as:

A v_j^{(k+1/2)} = f_k, \quad j = 1, 2, \dots, n.    (2.5)

Sub-iteration 2 is given by:

-\rho_y v_{i,j-1}^{n+1(2)} + (1 + 2\rho_y)\, v_{i,j}^{n+1(2)} - \rho_y v_{i,j+1}^{n+1(2)} = v_{i,j}^{n+1(1)} + A_2 v_{i,j}^{n+1(*)}, \quad \forall\, i,j    (2.6)

For various values of i and j, (2.6) can be written in a more compact matrix form as:

B v_i^{(k+1)} = g_{k+1/2}, \quad i = 1, 2, \dots, m    (2.7)


n 1
this is a prediction of vi , j by the extrapolation method. involving at each iterates the solution of an explicit
equation.
Splitting by using an ADI procedure, we get a set of
recursion relations as follows:
2.3 DS-MF

( I  A1 )vin, j 1(1)  ( A2 )vin, j 1(*)  (2Covin, j  C1vin, j 1 ) (2.8) The numerical representative of Eq. (2.8) and (2.9) using
the Mitchell and Fairweather scheme is as follows:

( I  A2 )vin, j 1(2)  vin, j 1(1)  A2 vin, j 1(*) (2.9)


 1 1   n 1(1)
where v
n 1(1)
is the intermediate solution and the desired 1  2  ρ x  6  A1  vi,j 
i, j    
n 1 (2.2)
solution is v i, j  vin, j 1(2) . Finally, expanding A1 and A2 on  1 1   n 1(*)
the left side of (2.8) and (2.9), we get the 2D ADI
n
 n 1
1  2  ρ y  6  A 2  vi,j  2Co vi,j  C1vi,j
  

algorithm. 

2.2 IADE-DY  1 1   n 1(2)


 1   ρ y   A 2  vi,j 
The matrices derived from the discretization resulting to A  2 6 
(2.13)
(2.5) and B (2.7) is respectively tridiagonal of size (mxm)  1 1   n 1(1) n 1(*)
and (nxn). Hence, at each of the (k + ½) and (k + 1) time  1   ρ x   A1  vi,j  A 2 vi,j
levels, these matrices can be decomposed into  2  6  
1
G1  G2  G1G2 , where G1 and G2 are lower and The horizontal sweep (2.12) and the vertical sweep (2.13)
6
formulas can be manipulated and written in a compact
upper bidiagonal matrices. By carrying out the relevant
matrix.
multiplications, the following equations for computation at
each of the intermediate levels are obtained:
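Each horizontal or vertical sweep above reduces to solving (or explicitly sweeping through) independent tridiagonal systems along grid lines. As an illustration only, the following minimal C sketch shows the classical Thomas algorithm for one such tridiagonal line solve; the array names and calling convention are our own assumptions and do not reproduce the authors' code.

#include <stdlib.h>

/* Solve a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
 * i = 0..n-1, with a[0] and c[n-1] unused.  This is the classical Thomas
 * algorithm, shown here only to illustrate one ADI line solve. */
static void thomas_solve(int n, const double *a, const double *b,
                         const double *c, const double *d, double *x)
{
    double *cp = malloc(n * sizeof *cp);   /* modified upper diagonal   */
    double *dp = malloc(n * sizeof *dp);   /* modified right-hand side  */

    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (int i = 1; i < n; i++) {          /* forward elimination */
        double m = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / m;
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
    }

    x[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; i--)       /* back substitution */
        x[i] = dp[i] - cp[i] * x[i + 1];

    free(cp);
    free(dp);
}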
3 THE IAFF
With reference to [11], we introduce the Iaff. First, we describe a simplified execution model that captures several issues related to the execution of the telegraph equation on the master-worker paradigm. Typically, each task goes through three phases during the execution of a parameter-sweep application: (1) an initialization phase, where the necessary files are sent from the master to the slave node and the task is started; the duration of this phase is equal to tinit. Note that this phase includes the overhead incurred by the master to initiate a data transfer to a slave. (2) A computational phase, where the task processes the parameter file at the slave node and produces an output file; the duration of this phase is equal to tcomp. Any additional overhead related to the reception of input files by a worker node is also included in this phase. (3) A completion phase, where the output file is sent back to the master and the master task is completed; the duration of this phase is equal to tend. This phase may require some processing at the master, mainly related to writing out files to the repository. Since this writing may be deferred until the disk is available, we assume that this processing time is negligible. Therefore, the initialization phase of one slave can occur concurrently with the completion phase of another slave node. Given these three phases, the total execution time of a task is equal to

t_{total} = t_{init} + t_{comp} + t_{end}    (3.1)
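To make the phase model concrete, the small C sketch below encodes the three phase durations of a task and the total time of Eq. (3.1); the struct and field names are illustrative assumptions, not part of the paper's code.

/* Illustrative model of the three task phases of Section 3 (names assumed). */
typedef struct {
    double t_init;   /* master -> worker transfer and task start-up */
    double t_comp;   /* computation at the worker node              */
    double t_end;    /* output transfer back to the master          */
} task_phases;

/* Total execution time of one task, Eq. (3.1). */
static double task_total_time(const task_phases *t)
{
    return t->t_init + t->t_comp + t->t_end;
}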


As the machine model, we consider a cluster composed of P + 1 processors. For the rest of this paper we assume that T >> P. One processor is the master and the other processors are workers. Communication between master and workers is carried out through Ethernet, and the master can only send files through the network to a single worker at a given time. We assume the communication link is full-duplex, i.e., the master can receive an output file from a worker at the same time it sends an input file to another worker. We also assume that computation begins as soon as the input files are completely received. This assumption is consistent with the master-worker paradigm currently available in [14].

For the sake of simplicity and without loss of generality, we consider in this section that there is no contention related to the transmission of output files from worker nodes to the master. Indeed, it is possible to merge the computation phase with the completion phase without affecting the results of this section. Therefore, we merge both phases as t'_comp in the equations that follow. A worker is idle when it is not involved with the execution of any of the three phases of a task. For the results below, the task model is composed of T heterogeneous tasks. All tasks and files have the same size and each task depends upon a single non-shared file. Note that the problem of scheduling an application where each task depends upon a single non-shared file, all tasks and files have the same size, and the master-worker platform is heterogeneous has polynomial complexity [14]. These assumptions are considered in the analysis that follows. We define the effective number of processors Peff as the maximum number of workers needed to run an application with no idle periods on any worker processor. Taking into account the task and platform models described in this section, a processor may have idle periods if:

t'_{comp} < (P - 1)\, t_{init}    (3.2)

Peff is then given by the following equation:

P_{eff} = \left\lceil \frac{t'_{comp}}{t_{init}} \right\rceil + 1    (3.3)

The total number of tasks to be executed on a processor is at most

M = \left\lceil \frac{T}{P} \right\rceil    (3.4)

For a platform with Peff processors, the upper bound for the total execution time (makespan) will be

t_{makespan} = M\,(t_{init} + t'_{comp}) + (P - 1)\, t_{init}    (3.5)

The second term on the right hand side of Eq. (3.5) is the time needed to start the first (P - 1) tasks on the other P - 1 processors. If we have a platform where the number of processors is larger than Peff, the overall makespan is dominated by the communication times between the master and the workers. We then have:

t_{makespan} = M P\, t_{init} + t'_{comp}    (3.6)

The set of Eqs. (3.2) - (3.6) will be considered in subsequent sections of this paper. It is worth noting that Eq. (3.5) is valid when workers are constantly busy, either performing computation or communication. Eq. (3.6) is applicable when workers have idle periods, i.e., are not performing either computation or communication. Eq. (3.2) occurs mainly in two cases: (1) for very large platforms (P large); (2) for applications with a small tcomp/tinit ratio, such as fine-grain applications.
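Under the stated assumptions, Eqs. (3.3)-(3.6) can be evaluated directly from the phase times. The C sketch below computes Peff, M and the two makespan bounds; the function and parameter names, and the dispatch on whether P exceeds Peff, are our own illustration of the formulas rather than the paper's implementation.

#include <math.h>

/* Effective number of processors, Eq. (3.3).
 * Here t_comp stands for t'_comp (computation merged with completion). */
static int p_eff(double t_comp, double t_init)
{
    return (int)ceil(t_comp / t_init) + 1;
}

/* Upper bound on the makespan for T tasks on P workers.
 * Eq. (3.5) applies while the workers can be kept busy; Eq. (3.6) applies
 * when the platform is larger than Peff and the master link dominates. */
static double makespan_bound(int T, int P, double t_init, double t_comp)
{
    int M = (int)ceil((double)T / P);                      /* Eq. (3.4) */
    if (P <= p_eff(t_comp, t_init))
        return M * (t_init + t_comp) + (P - 1) * t_init;   /* Eq. (3.5) */
    return (double)M * P * t_init + t_comp;                /* Eq. (3.6) */
}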
In order to measure the degree of affinity of a set of tasks concerning their input files, we introduce the concept of input file affinity. Given a set G of tasks, composed of K tasks, G = {T_1, T_2, \dots, T_K}, and the set F of the Y input files needed by the tasks belonging to group G, F = {f_1, f_2, \dots, f_Y}, we define I_aff as follows:

I_{aff}(G) = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}    (3.7)

where |f_i| is the size in bytes of file f_i and N_i (1 \le N_i \le K) is the number of tasks in group G which have file f_i as an input file. The term (N_i - 1) in the numerator of the above equation can be explained as follows: if N_i tasks share an input file f_i, that file may be sent only once (instead of N_i times) when the group of tasks is executed on a worker node. The potential reduction of the number of bytes transferred from the master node to a worker node considering only input file f_i is then (N_i - 1)|f_i|. Therefore, the input file affinity indicates the overall reduction of the amount of data that needs to be transferred to a worker node when all tasks of a group are sent to one node. Note that 0 \le I_{aff} < 1. For the special case where all tasks share the same input file, I_{aff} = \frac{k-1}{k}, where k is the number of tasks of the group.
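As a small illustration of Eq. (3.7), the C function below computes Iaff from the file sizes and the per-file sharing counts N_i of a group; the data layout is an assumption made for the example, not the paper's implementation.

/* Input file affinity of a group, Eq. (3.7).
 * size[i]  : size in bytes of input file f_i
 * shared[i]: number of tasks N_i in the group that need f_i (1 <= N_i <= K)
 * y        : number of distinct input files of the group
 * The layout of these arrays is assumed for illustration only. */
static double input_file_affinity(int y, const double *size, const int *shared)
{
    double saved = 0.0, total = 0.0;
    for (int i = 0; i < y; i++) {
        saved += (shared[i] - 1) * size[i];  /* bytes not re-sent: (N_i - 1)|f_i| */
        total += shared[i] * size[i];        /* bytes sent without grouping: N_i |f_i| */
    }
    return total > 0.0 ? saved / total : 0.0;
}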


4 PARALLEL IMPLEMENTATION

The implementation is done on the Geranium Cadcam Cluster, consisting of 48 Intel Pentium processors at 1.73 GHz with 0.99 GB RAM each. Communication is through a fast Ethernet of 100 Mbits per second, and the cluster runs Linux. The cluster has high memory bandwidth, with message passing supported by MPI [13]. The program, written in C, accesses MPI by calling MPI library routines. At each time-step we have to evaluate v^{n+1} values at lm grid points, where l is the number of grid points along the x-axis. Suppose we are implementing this method on an R x S mesh-connected computer. Denote the workers by P_{i1,j1}, i1 = 1, 2, ..., R with R <= l, and j1 = 1, 2, ..., S with S <= M. The workers P_{i1,j1} are connected as shown in Fig. 4. Let L1 = floor(l / R) and M1 = floor(M / S), where floor() denotes the integer part. Divide the lm grid points into RS groups so that each group contains at most (L1 + 1)(M1 + 1) grid points and at least L1 M1 grid points. Denote these groups by G_{i1,j1}, i1 = 1, 2, ..., R, j1 = 1, 2, ..., S. Design G_{i1,j1} such that it contains the following grid points:

G_{i1,j1} = { (X_{(i1-1)L1+i}, Y_{(j1-1)M1+j}) : i = 1, 2, ..., L1 or L1 + 1; j = 1, 2, ..., M1 or M1 + 1 }

Assign the group G_{i1,j1} to the worker P_{i1,j1}, i1 = 1, 2, ..., R, j1 = 1, 2, ..., S. Each worker computes the v^{n+1}_{i,j} values of its assigned group in the required number of sweeps. At the (p + 1/2)-th sweep the workers compute the v^{(p+1/2)}_{i,j} values of their assigned groups. For the (p + 1/2)-th level the worker P_{i1,j1} requires one value from the worker P_{i1-1,j1} or P_{i1+1,j1}. At the (p + 1/2)-th level the communication between the workers is done row-wise. After communication between the workers is completed, each worker P_{i1,j1} computes the v^{p+1/2}_{i,j} values. For the (p + 1)-th sweep each worker P_{i1,j1} requires one value from the worker P_{i1,j1-1} or P_{i1,j1+1}, as sketched in the code fragment below.
4.1 MPI Communication Service Design

MPI, like most other network-oriented middleware services, communicates data from one worker to another across a network. However, MPI's higher level of abstraction provides an easy-to-use interface more appropriate for distributed parallel computing applications. We focus our evaluation on [13] because MPI serves as an important foundation for a large group of applications. Moreover, MPI provides a wide variety of communication operations, including both blocking and non-blocking sends and receives, and collective operations such as broadcast and global reductions. We concentrate on the basic message operations: blocking send, blocking receive, non-blocking send and non-blocking receive. Note that MPI provides a rather comprehensive set of messaging operations. A blocking receive (MPI_Recv) returns when a message that matches its specification has been copied to the buffer. As an alternative to blocking communication operations, MPI provides non-blocking communication to allow an application to overlap communication and computation; this overlap improves application performance. In non-blocking communication, initiation and completion of communication operations are distinct. A non-blocking send has a send start call (MPI_Isend) that initiates the send operation and returns before the message is copied from the send buffer, and a send complete call (MPI_Wait) that completes the non-blocking send by verifying that the data has been copied from the send buffer. It is this separation of send start and send complete that provides the application with the opportunity to perform computations.
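To illustrate the overlap of communication and computation described above, here is a minimal, hedged C/MPI fragment in the spirit of the blocking/non-blocking discussion. The buffer names, tag and neighbour ranks are placeholders, and the calls shown (MPI_Isend, MPI_Irecv, MPI_Waitall) are standard MPI operations rather than the authors' actual code.

#include <mpi.h>

/* Exchange one boundary line with neighbouring workers while updating the
 * interior points, using non-blocking sends/receives to overlap the transfer
 * with computation.  npts, the neighbour ranks and the update steps are
 * placeholders for this sketch. */
void exchange_and_update(double *send_line, double *recv_line, int npts,
                         int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* start the transfers; both calls return before the data movement is done */
    MPI_Isend(send_line, npts, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(recv_line, npts, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ... update interior grid points that do not depend on recv_line ... */

    /* complete the non-blocking operations before touching the buffers */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ... now update the boundary points that need recv_line ... */
}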
4.2 Speedup, Efficiency and Effectiveness

Speedup and efficiency give a measure of the improvement of performance experienced by an application when executed on a parallel system [7]. In traditional parallel systems they are widely defined as:

S(N) = \frac{T(s)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}    (4.1)

The total efficiency is usually decomposed as follows:

E(N) = E_{num}(N)\, E_{par}(N)\, E_{load}(N),    (4.2)

where E_par is the parallel efficiency, defined as the ratio of the CPU time taken on one worker to that on N workers. The parallel efficiency and the corresponding speedup are commonly written as follows:

S_{par}(N) = \frac{T(1)}{T(N)}, \qquad E_{par}(N) = \frac{S_{par}(N)}{N}    (4.3)

The CPU time for the parallel computations with N workers can be written as follows:


T(N) = T_m(N) + T_{sd}(N) + T_{sc}(N)    (4.4)

Generally,

T_m(N) \approx T_m(1), \qquad T_{sd}(N) \approx T_{sd}(1), \qquad T_{sc}(N) \approx \frac{T_{sc}(1)}{N},    (4.5)

therefore the speedup can be written as:

S_{par}(N) = \frac{T(1)}{T(N)} = \frac{T_{ser}(1) + T_{sc}(1)}{T_{ser}(1) + T_{sc}(1)/N} \le \frac{T_{ser}(1) + T_{sc}(1)}{T_{ser}(1)}    (4.6)

The corresponding efficiency is given by:

E_{par}(N) = \frac{T(1)}{N\,T(N)} = \frac{T_{ser}(1) + T_{sc}(1)}{N\,T_{ser}(1) + T_{sc}(1)} \le \frac{T_{ser}(1) + T_{sc}(1)}{N\,T_{ser}(1)}    (4.7)

The total efficiency can be decomposed as follows:

E(N) = \frac{T_{s}^{N_o}(1)}{N\,T_{B1B}^{N_1 L}(N)} = \frac{T_{s}^{N_o}(1)}{T_{B1}^{N_o}(1)} \cdot \frac{T_{B1}^{N_o}(1)}{T_{BB}^{N_o}(1)} \cdot \frac{T_{BB}^{N_o}(1)}{T_{B1B}^{N_1}(1)} \cdot \frac{T_{B1B}^{N_1}(1)}{T_{B1B}^{N_1}(N)} \cdot \frac{T_{B1B}^{N_1}(N)}{T_{B1B}^{N_1 L}(N)},    (4.8)

where T_{B1B}^{N_1}(N) has the same meaning as T_{B1B}^{N_1 L}(N) except that the idle time is not included. Comparing (4.8) and (4.2), we obtain:

E_{load}(N) = \frac{T_{B1B}^{N_1}(N)}{T_{B1B}^{N_1 L}(N)}, \qquad E_{par}(N) = \frac{T_{B1B}^{N_1}(1)}{N\,T_{B1B}^{N_1}(N)}, \qquad E_{num}(N) = \frac{T_{s}^{N_o}(1)}{T_{B1B}^{N_1}(1)} = \frac{T_{s}^{N_o}(1)}{T_{B1}^{N_o}(1)} \cdot \frac{T_{B1}^{N_o}(1)}{T_{BB}^{N_o}(1)} \cdot \frac{T_{BB}^{N_o}(1)}{T_{B1B}^{N_1}(1)}.    (4.9)

Therefore,

E_{num}(N) = \frac{N_o}{N_1}\, E_{dd}, \qquad E_{dd} = \frac{T_{B1}^{N_o}(1)}{T_{BB}^{N_o}(1)}    (4.10)

We call E_dd in (4.10) the domain decomposition (DD) efficiency, which includes the increase of CPU time induced by grid overlap at interfaces and the CPU time variation generated by DD techniques. Effectiveness is given by:

L_n = \frac{S_n}{C_n}    (4.11)

where C_n = n T_n, T_1 is the execution time on a serial machine and T_n is the computing time on a parallel machine with n processors. Hence, effectiveness can be written as:

L_n = \frac{S_n}{n T_n} = \frac{E_n}{T_n} = \frac{E_n S_n}{T_1}    (4.12)

5 RESULTS AND DISCUSSION

5.1 Benchmark Problem
The application of the above mentioned algorithms was compared in terms of performance by simulating their executions in the master-worker paradigm on several problem sizes. We assume a platform composed of a variable number of heterogeneous processors. The solution domain was divided into rectangular blocks. The experiment is demonstrated on meshes of 200x200 and 300x300 for block sizes of 100 and 200 respectively, for MPI. Tables 1 - 5 show the various performance timings.

Table 1 The wall time TW, the master time TM, the worker data time TSD, the worker computational time TSC, the total time T, the parallel speed-up Spar and the efficiency Epar for a mesh of 300x300, with B = 50 blocks and Niter = 100 for PVM and MPI.

Table 2 The wall time TW, the master time TM, the worker data time TSD, the worker computational time TSC, the total time T, the parallel speed-up Spar and the efficiency Epar for a mesh of 300x300, with B = 100 blocks and Niter = 100 for MPI.


Consider the telegraph equation of the form:

\frac{\partial^2 U}{\partial x^2} + \frac{\partial^2 U}{\partial y^2} = \frac{\partial^2 U}{\partial t^2} + \frac{\partial U}{\partial t} + U    (5.1)

Table 3 The wall time TW, the master time TM, the worker data time TSD, the worker computational time TSC, the total time T, the parallel speed-up Spar and the efficiency Epar for a mesh of 300x300, with B = 200 blocks and Niter = 100 for MPI.

5.2 Parallel Efficiency

To obtain a high efficiency, the worker computational time Tsc(1) should be significantly larger than the serial time Tser. The speed-up and efficiency obtained for various mesh sizes from 200x200 to 300x300 and for various numbers of sub-domains, from B = 50 to 200, are listed in Tables 1 - 3 for the MPI application. In these tables we list the wall (elapsed) time for the master task, TW (this is necessarily greater than the maximum wall time returned by the slaves), the master CPU time, TM, the average worker computational time, TSC, and the average worker data communication time, TSD, all in seconds. The speed-up and efficiency versus the number of processors are shown in Fig. 1(a,b) and Fig. 2(a,b) respectively, with block number B as a parameter. The results show that the parallel efficiency increases with increasing grid size for a given block number for MPI, and decreases with increasing block number for a given grid size. Given the other parameters, the speed-up increases with the number of processors. When the number of processors is small the wall time decreases with the number of processors. As the number of processors becomes large, the wall time increases with the number of processors, as observed from the figures and tables. Data communication at the end of every iteration is necessary in this strategy. Indeed, the updated values of the solution variables on the full domain are multicast to all workers after each iteration, since a worker can be assigned a different sub-domain under the pool-of-tasks paradigm. In Tables 1 - 3, the master time TM is constant when the number of processors increases for a given grid size and number of sub-domains. The master program is responsible for (1) sending updated variables to the workers (T1), (2) assigning tasks to workers (T2), (3) waiting for the workers to execute tasks (T3), and (4) receiving the results (T4).
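A hedged sketch of the master loop implied by responsibilities (1)-(4) is given below in C with MPI; the tags, message layout and result type are invented for this example, worker termination messages are omitted, and the code does not reproduce the authors' program.

#include <mpi.h>

#define TAG_TASK   1   /* assumed tags, for illustration only */
#define TAG_RESULT 2

/* Master side of the pool-of-tasks scheme: send updated variables and a task
 * id to each free worker, then wait for results until all tasks are done. */
void master_loop(int n_tasks, int n_workers, const double *updated,
                 int n_vars, MPI_Comm comm)
{
    int sent = 0, received = 0;
    double result;
    MPI_Status st;

    /* (1)+(2): prime every worker with the current variables and a task id */
    for (int w = 1; w <= n_workers && sent < n_tasks; w++, sent++) {
        MPI_Send(updated, n_vars, MPI_DOUBLE, w, TAG_TASK, comm);
        MPI_Send(&sent, 1, MPI_INT, w, TAG_TASK, comm);
    }

    /* (3)+(4): collect results, handing out remaining tasks as workers free up */
    while (received < n_tasks) {
        MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT, comm, &st);
        received++;
        if (sent < n_tasks) {
            MPI_Send(updated, n_vars, MPI_DOUBLE, st.MPI_SOURCE, TAG_TASK, comm);
            MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, comm);
            sent++;
        }
    }
}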

Table 4 Effectiveness of the various schemes with MPI for a 300x300 mesh size

  Scheme   N    MPI T(s)    Ln
  ADI      2     816.05    0.118
           8     371.35    0.142
           16    261.35    0.143
           20    217.92    0.165
           30    171.73    0.177
           48    137.65    0.172
  IADE     2     992.30    0.098
           8     417.55    0.138
           16    293.22    0.140
           20    250.52    0.154
           30    202.77    0.157
           48    161.03    0.155
  MF-DS    2    1329.61    0.074
           8     495.18    0.133
           16    357.94    0.128
           20    307.31    0.139
           30    259.00    0.130
           48    207.38    0.127

Table 5 Performance improvement of different schemes for a 300x300 mesh size

  N     ADI T(s)   IADE T(s)   MF-DS T(s)   ADI vs IADE (%)   IADE vs MF-DS (%)
  2      816.05     992.30      1329.61         17.76              25.37
  8      371.35     417.55       495.18         11.06              15.68
  16     261.35     293.22       357.94         10.87              18.08
  20     217.92     250.52       307.31         13.01              18.48
  30     171.73     202.77       259.00         15.31              21.71
  48     137.65     161.03       207.38         14.52              22.35

Fig. 1 Speed-up versus the number of workers for various block sizes: (a) mesh 200x200, MPI; (b) mesh 300x300, MPI.

Fig. 2 Parallel efficiency versus the number of workers for various block sizes: (a) mesh 200x200, MPI; (b) mesh 300x300, MPI.

Table 4 shows the effectiveness of the various schemes with MPI. As the number of processors increases, the MF-DS scheme performs significantly better in effectiveness than the ADI scheme. As the total number of processors increases, the bottleneck of parallel computers appears and the global reduction consumes a large part of the time; we anticipate that the improvement in Table 5 will become more significant. Fig. 3 and Fig. 4 show the efficiency using MPI with varying block sizes, converging under the Iaff measure.

5.3 Numerical Efficiency

The numerical efficiency E_num includes the domain decomposition efficiency E_DD and the convergence rate behavior N_o / N_1, as defined in Eq. (4.10). The DD efficiency E_dd = T_{B1}^{N_o}(1) / T_{BB}^{N_o}(1) includes the increase of floating point operations induced by grid overlap at interfaces and the CPU time variation generated by DD techniques. In Tables 2 and 3, we listed the total CPU time distribution over various grid sizes and block numbers running on different processors for MPI. The DD efficiency EDD is calculated and the result is shown in Fig. 3. Note that the DD efficiency can be greater than one, even with one processor. Fig. 3 and Fig. 4 show that the optimum number of sub-domains, which maximizes the DD efficiency EDD, increases with the grid size. The convergence rate behavior N_o / N_1, the ratio of the iteration number for the best sequential CPU time on one processor to the iteration number for the parallel CPU time on N processors, describes the increase in the number of iterations required by the parallel method to achieve a specified accuracy as compared to the serial method. This increase is caused mainly by the deterioration in the rate of convergence with increasing number of processors and sub-domains. Because the best serial algorithm is generally not known, we take the existing parallel program running on one processor to replace it. The question now is how the decomposition strategy affects the convergence rate. The results are summarized in Table 6 and Fig. 5.

It can be seen that N_o / N_1 decreases with increasing block number and increasing number of processors for a given grid size. The larger the grid size, the higher the convergence rate. For a given block number, a higher convergence rate is obtained with fewer processors. This is because one processor may be responsible for a few sub-domains at each iteration. If some of these sub-domains share some common interfaces, the subsequent blocks to be computed will use the newly updated boundary values, and therefore an improved convergence rate results. The convergence rate is reduced when the block number is large. The reason for this is evident: the boundary conditions propagate to the interior domain in the serial computation after one iteration, but this is delayed in the parallel computation. In addition, the values of variables at the interfaces used in the current iteration are the previous values obtained in the last iteration. Therefore, the parallel algorithm is less "implicit" than the serial one. Despite these inherent shortcomings, a high efficiency is obtained for large-scale problems.

Fig. 3 The DD efficiency versus the number of sub-domains for various blocks of MPI

Table 6 The number of iterations to achieve a given tolerance of 10^-3 for a grid of 300x300 for MPI

  Scheme   N    B=20   B=50   B=100   B=200
  ADI      2    1818   3249   3611    1521
           4    1818   3568   3823    1768
           8    1818   3992   4094    1904
           16   1818   4327   4468    2112
           30   1818   4509   4713    2307
           48   1818   4813   4994    2548
  IADE     2    2751   3976   4418    2154
           4    2751   4132   4632    2294
           8    2751   4378   4895    2478
           16   2751   4599   5109    2653
           30   2751   4708   5343    2801
           48   2751   4996   5599    2994
  MF-DS    2    3101   4321   4852    3211
           4    3101   4582   5074    3528
           8    3101   4718   5362    3769
           16   3101   4992   5621    3928
           30   3101   5284   5895    4325
           48   3101   5476   6122    4518

Fig. 4 Convergence behavior with domain decomposition for mesh 200x200, MF-DS, MPI


Fig. 5 Convergence behavior with domain decomposition for mesh 300x300, MF-DS, MPI

5.4 Total Efficiency

We implemented the serial computations on one of the processors and calculated the total efficiencies. The total efficiencies E(N) for grid sizes 200x200 and 300x300 have been shown respectively. From Eq. (4.8), we know that the total efficiency depends on N_o / N_1, E_par and the DD efficiency EDD, since load balancing is not the real problem here. For a given grid size and block number, the DD efficiency is constant. Thus, the variation of E(N) with the processor number n is governed by E_par and N_o / N_1. When the processor number becomes large, E(N) decreases with n due to the effect of both the convergence rate and the parallel efficiency. With the MPI version we were able to achieve a good implementation for the efficiency of different mesh sizes with different block numbers, as observed in Fig. 4 and Fig. 5. We observed that the implementation with MPI achieved conformity to unity.

6. CONCLUSION
We have implemented the solution of the IADE-DY scheme for the 2-D telegraph equation on a parallel platform of an MPI/PVM cluster via the domain decomposition method. We have reported a detailed study on the computational efficiency of a parallel finite difference iterative alternating direction explicit method under a distributed environment with PVM and MPI. The computational results obtained have clearly shown the benefits of using parallel algorithms. We have come to the following conclusions: (1) the parallel efficiency is strongly dependent on the problem size, block numbers and the number of processors, as observed in the figures both for PVM and MPI; (2) a high parallel efficiency can be obtained with large-scale problems; (3) the decomposition of the domain greatly influences the performance of the parallel computation (Fig. 4 - 5); (4) the convergence rate depends upon the block numbers and the number of processors for a given grid. For a given number of blocks, the convergence rate increases with decreasing number of processors, and for a given number of processors it decreases with increasing block number for both MPI and PVM (Fig. 4 - 5). On the basis of the current parallelization strategy, more sophisticated models can be attacked efficiently.

References
[1] W. Barry, A. Michael, 2003. Parallel Programming Techniques and Application using Networked Workstation and Parallel Computers. Prentice Hall, New Jersey.
[2] A. Beverly, et al., 2005. The Algorithmic Structure Design Space in Parallel Programming. Wesley Professional.
[3] D. Callahan, K. Kennedy, 1988. Compiling Programs for Distributed Memory Multiprocessors. Journal of Supercomputer 2, pp 151 - 169.
[4] J.C. Chato, 1998. Fundamentals of Bio-Heat Transfer. Springer-Verlag, Berlin.
[5] H. Chi-Chung, G. Ka-Kaung, et al., 1994. Solving Partial Differential Equations on a Network of Workstations. IEEE, pp 194 - 200.
[6] M. Chinmay, 2005. Bio-Heat Transfer Modeling. Infrared Imaging, pp 15 - 31.
[7] R. Chypher, A. Ho, et al., 1993. Architectural Requirements of Parallel Scientific Applications with Explicit Communications. Computer Architecture, pp 2 - 13.
[8] P.J. Coelho, M.G. Carvalho, 1993. Application of a Domain Decomposition Technique to the Mathematical Modeling of a Utility Boiler. Journal of Numerical Methods in Eng., 36, pp 3401 - 3419.
[9] D'Ambra P., M. Danelutto, Daniela S., L. Marco, 2002. Advanced Environments for Parallel and Distributed Applications: a view of current status. Parallel Computing 28, pp 1637 - 1662.
[10] Z.S. Deng, J. Liu, 2002. Analytic Study on Bio-Heat Transfer Problems with Spatial Heating on Skin Surface or Inside Biological Bodies. ASME Journal of Biomechanics Eng., 124, pp 638 - 649.
[11] H.S. Dou, Phan-Thien, 1997. A Domain Decomposition Implementation of the Simple Method with PVM. Computational Mechanics 20, pp 347 - 358.
[12] F. Durst, M. Perie, D. Chafer, E. Schreck, 1993. Parallelization of Efficient Numerical Methods for Flows in Complex Geometries. Flow Simulation with High Performance Computing I, pp 79 - 92, Vieweg, Braunschweig.
[13] Eduardo J.H., Yero M.A., Amaral H., 2007. Speedup and Scalability Analysis of Master-Slave Applications on Large Heterogeneous Clusters. Journal of Parallel & Distributed Computing, Vol. 67, Issue 11, pp 1155 - 1167.
[14] Fan C., Jiannong C., Yudong S., 2003. High Abstractions for Message Passing Parallel Programming. Parallel Computing 29, pp 1589 - 1621.
[15] I. Foster, J. Geist, W. Groop, E. Lust, 1998. Wide-Area Implementations of the MPI. Parallel Computing 24, pp 1735 - 1749.

[16] A. Geist, A. Beguelin, J. Dongarra, 1994. Parallel Virtual Machine (PVM). Cambridge, MIT Press.
[17] G.A. Geist, V.M. Sunderami, 1992. Network Based Concurrent Computing on the PVM System. Concurrency Practice and Experience, pp 293 - 311.
[18] W. Groop, E. Lusk, A. Skjellum, 1999. Using MPI, Portable and Parallel Programming with the Message Passing Interface, 2nd Ed., Cambridge MA, MIT Press.
[19] Guang-Wei Y., Long-Jun S., Yu-Lin Z., 2001. Unconditional Stability of Parallel Alternating Difference Schemes for Semilinear Parabolic Systems. Applied Mathematics and Computation 117, pp 267 - 283.
[20] M. Gupta, P. Banerjee, 1992. Demonstration of Automatic Data Partitioning for Parallelizing Compilers on Multi-Computers. IEEE Trans. Parallel Distributed Systems, vol. 3, no. 2, pp 179 - 193.
[21] K. Jaris, D.G. Alan, 2003. A High-Performance Communication Service for Parallel Computing on Distributed Systems. Parallel Computing 29, pp 851 - 878.
[22] J.Z. Jennifer, et al., 2002. A Two Level Finite Difference Scheme for 1-D Pennes' Bio-Heat Equation.
[23] J. Liu, L.X. Xu, 2001. Estimation of Blood Perfusion Using Phase Shift Temperature Response to Sinusoidal Heating at Skin Surfaces. IEEE Trans. Biomed. Eng., Vol. 46, pp 1037 - 1043.
[24] A.R. Mitchell, G. Fairweather, 1964. Improved forms of the Alternating direction methods of Douglas, Peaceman and Rachford for solving parabolic and elliptic equations. Numer. Math., 6, pp 285 - 292.
[25] J. Noye, 1996. Finite Difference Methods for Partial Differential Equations. Numerical Solutions of Partial Differential Equations. North-Holland Publishing Company.
[26] D.W. Peaceman, H.H. Rachford, 1955. The Numerical Solution of Parabolic and Elliptic Differential Equations. Journal of Soc. Indust. Applied Math. 8 (1), pp 28 - 41.
[27] L. Peizong, Z. Kedem, 2002. Automatic Data and Computation Decomposition on Distributed Memory Parallel Computers. ACM Transactions on Programming Languages and Systems, vol. 24, no. 1, pp 1 - 50.
[28] H.H. Pennes, 1948. Analysis of Tissue and Arterial Blood Temperature in the Resting Human Forearm. Journal of Applied Physiol. 1, pp 93 - 122.
[29] M.J. Quinn, 2001. Parallel Programming in C. McGraw-Hill Higher Education, New York.
[30] R. Rajamony, A.L. Cox, 1997. Performance Debugging Shared Memory Parallel Programs Using Run-Time Dependence Analysis. Performance Review 25 (1), pp 75 - 87.
[31] J. Rantakokko, 2000. Partitioning Strategies for Structured Multi Block Grids. Parallel Computing 26, pp 1661 - 1680.
[32] B.V. Rathish Kumar, et al., 2001. A Parallel MIMD Cell Partitioned ADI Solver for Parabolic Partial Differential Equations on VPP 700. Parallel Computing 42, pp 324 - 340.
[33] B. Rubinsky, et al., 1976. Analysis of a Stefan-Like Problem in a Biological Tissue Around a Cryosurgical Probe. ASME J. Biomech. Eng., vol. 98, pp 514 - 519.
[34] M.S. Sahimi, E. Sundararajan, M. Subramaniam, N.A.A. Hamid, 2001. The D'Yakonov fully explicit variant of the iterative decomposition method. International Journal of Computers and Mathematics with Applications, 42, pp 1485 - 1496.
[35] V.T. Sahni, 1996. Performance Metrics: Keeping the Focus in Routine. IEEE Parallel and Distributed Technology, Spring, pp 43 - 56.
[36] G.D. Smith, 1985. Numerical Solution of Partial Differential Equations: Finite Difference Methods, 3rd Ed., Oxford University Press, New York.
[37] M. Snir, S. Otto, et al., 1998. MPI: The Complete Reference, 2nd Ed., Cambridge, MA: MIT Press.
[38] X.H. Sun, J. Gustafson, 1991. Toward a Better Parallel Performance Metric. Parallel Computing 17.
[39] M. Tian, D. Yang, 2007. Parallel Finite-Difference Schemes for Heat Equation based upon Overlapping Domain Decomposition. Applied Maths and Computation, 186, pp 1276 - 1292.

AUTHOR

Ewedafe Simon Uzezi received the B.Sc. and M.Sc. degrees in Industrial Mathematics and Mathematics from Delta State University and The University of Lagos in 1998 and 2003 respectively. He further obtained his Ph.D. in Numerical Parallel Computing in 2010 from the University of Malaya, Malaysia. In 2011 he joined UTAR in Malaysia as an Assistant Professor and later lectured in Oman and Nigeria as a Senior Lecturer in Computing. Currently he is a Senior Lecturer in the Department of Computing, UWI, Jamaica.

