Abstract: In this paper we present methods that can improve numerical parallel implementations using a scalability-related measure, the input file affinity measure, on the 2-D Telegraph Equation running under a master-worker paradigm. The implementation was carried out using the Message Passing Interface (MPI). The Telegraph Equation was discretized using the Alternating Direction Implicit method (ADI), the Iterative Alternating Direction Explicit method of D'Yakonov (IADE-DY) and the Mitchell-Fairweather Double Sweep method (MF-DS). The parallel performance of these methods under the Input File Affinity Measure (Iaff) was experimentally evaluated, and the numerical results show their effectiveness and parallel performance. We show in this paper that the input file affinity measure scales well while requiring less information to schedule tasks.

Keywords: Parallel Performance, Input File Affinity, Master-Worker, 2-D Telegraph Equation, MPI.

1. INTRODUCTION
Computing infrastructures are reaching an unprecedented degree of complexity. First, parallel processing is becoming mainstream, because the frequency and power-consumption wall has led to the design of multi-core processors. Second, distributed processing technologies are being widely adopted because of the deployment of the Internet and, consequently, of large-scale grid and cloud infrastructures. In the master-worker paradigm a master node is responsible for scheduling computation among the workers and collecting the results [5]. The master-worker paradigm has fundamental limitations: both communications from the master and contention in accessing file repositories may become bottlenecks in the overall scheduling scheme, causing scalability problems. Parallelization of Partial Differential Equations (PDEs) by time decomposition was first proposed in [22], following earlier efforts at space-time methods [15, 16, 30]. Applications of the alternating iterative methods to 2-D telegraph equations have shown that they demand substantial computational power and communication. In the world of parallel computing MPI is the de facto standard for implementing programs on multiprocessors. To help with program development under a distributed computing environment a number of software tools have been developed; MPI [13] is chosen here since it has a large user group.

An alternative and cost-effective means of achieving comparable performance is distributed computing, using a system of processors loosely connected through a Local Area Network (LAN). Relevant data need to be passed from processor to processor through a message-passing mechanism [12, 18, 27]. The telegraph equation is important for modeling several relevant problems, such as wave propagation [24], among others. In this paper, a number of iterative methods are developed to solve the telegraph equation [8]. Some of these iterative schemes have been employed on various parallel platforms [15, 16, 30]. Parallel algorithms have been implemented for the finite difference method [9, 10, 8] and the discrete eigenfunctions method [1]; [27] used the AGE method on the 1-D telegraph problem, but it was not implemented on a parallel platform for parallel improvement. We present in this paper algorithms whose scalability is comparable to other algorithms in several circumstances. We describe the design of our parallel system through Iaff for understanding parallel execution. Through several implementations and performance-improvement analyses of the alternating iterative methods, we explore how to use Iaff in the master-worker paradigm for identifying parallel performance on the 2-D telegraph equation. We assume the amount of work is increased exclusively by increasing the number of tasks, because it is relatively easy to widen the range of parameters to be analyzed in an application. On the other hand, to increase the amount of work associated with each individual task, the input data for that task would have to change, which may change the nature of the problem processed by the task and is not always feasible.

It is worth noting that some of the related works mentioned focus on efficiently scheduling the decomposition onto distributed architectures organized either as pure master-slave or as hierarchical platforms, while lacking scalability analysis. We consider that the amount of computation associated with each task is fixed, that each task depends on one or more input files for execution, and that input files can be shared among tasks [14]. In this work we assume that the sizes of the input files and the dependencies among input files and tasks are known. We also consider that the master node has access to a storage system which serves as the repository of all input and output files. This paper is organized as follows: Section 1.1 reviews previous research work. Section 2 presents the model for the 2-D telegraph equation and introduces the IADE-DY and MF-DS schemes. Section 3 describes the input file affinity measure. Section 4 describes the parallel implementation of the algorithms. Section 5 presents the numerical and experimental results of the schemes under consideration. Finally, Section 6 presents our conclusions.

1.1 Previous research work
The parallel performance of algorithms has been studied in several papers over the last years [7]. Bag-of-Tasks (BoT) applications, composed of independent tasks with file sharing, have appeared in [8, 14, 19]. Their relevance has motivated the development of specialized environments which aim to facilitate the execution of large BoT applications on computational grids and clusters, such as the AppLeS Parameter-Sweep Template (APST) [4]. Giersch et al. [14] proved theoretical limits for the computational complexity associated with the scheduling problem. In [19], an iterative scheduling approach was proposed that produces effective and efficient schedules compared to the previous works in [5, 14]. However, for homogeneous platforms, the algorithms proposed in [19] can be considerably simplified, to the point that they become equivalent to algorithms proposed previously. Fabricio [11] analyzes the scalability of BoT applications running on master-slave platforms and proposed a scalability-related measure. Eric [2] presents a detailed study of the scheduling of tasks in the parareal algorithm, proposing two algorithms: one which uses a manager-worker paradigm with overlap of sequential and parallel phases, and a second that is completely distributed. Hinde et al. [17] proposed a generic approach to embed the master-worker paradigm into software component models and describe how this generic approach can be implemented within an existing software component model. On the numerical computing side, [29] went in search of numerical consistency in parallel programming and presented methods that can drastically improve numerical consistency for parallel calculations across varying numbers of processors. In [9, 10], parallel implementations of the 2-D telegraph equation using the SPMD technique and a domain decomposition strategy were researched, presenting a model that overlaps communication with computation to avoid unnecessary synchronization and thus yields significant speed-up. On the parallel computing front, [28] proposed a parallel ADI solver for a linear array of processors. The ADI methods in [23, 26] have been used for solving heat equations in 2-D. Approaches to solving the telegraph equation using different numerical schemes are treated in [8].

2. TELEGRAPH EQUATION
We consider the second-order telegraph equation as given in [8]:

$$\frac{\partial^2 v}{\partial t^2} + a\frac{\partial v}{\partial t} + bv = \frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}, \qquad 0 \le x \le 1,\; 0 \le y \le 1,\; t \ge 0. \tag{2.1}$$

Extending the finite difference scheme to the telegraph equation (2.1) gives:

$$\frac{v_{i,j}^{n+1} - 2v_{i,j}^{n} + v_{i,j}^{n-1}}{(\Delta t)^2} + a\,\frac{v_{i,j}^{n+1} - v_{i,j}^{n-1}}{2\Delta t} + b\,v_{i,j}^{n+1} - \frac{v_{i+1,j}^{n+1} - 2v_{i,j}^{n+1} + v_{i-1,j}^{n+1}}{(\Delta x)^2} - \frac{v_{i,j+1}^{n+1} - 2v_{i,j}^{n+1} + v_{i,j-1}^{n+1}}{(\Delta y)^2} = 0. \tag{2.2}$$

Although this simple implicit scheme is unconditionally stable, we need to solve a penta-diagonal system of algebraic equations at each time step; the computational time is therefore huge.

2.1 The 2-D ADI Method
In this section, we derive the 2-D ADI scheme of the simple implicit FDTD method by using a general ADI procedure extended to (2.1). Equation (2.2) can be rewritten as:

$$\Big(I + \sum_{m=1}^{2} A_m\Big)v_{i,j}^{n+1} - \big(2C_o v_{i,j}^{n} + C_1 v_{i,j}^{n-1}\big) = 0. \tag{2.3}$$

Sub-iteration 1 is given by:

$$-\rho_x v_{i-1,j}^{n+1(1)} + (1 + 2\rho_x)v_{i,j}^{n+1(1)} - \rho_x v_{i+1,j}^{n+1(1)} = -(A_2)\,v_{i,j}^{n+1(*)} + \big(2C_o v_{i,j}^{n} + C_1 v_{i,j}^{n-1}\big), \qquad \forall\, i,j. \tag{2.4}$$

For various values of i and j, (2.4) can be written in a more compact matrix form at the (k + 1/2) time level as:

$$A\,v_j^{(k+1/2)} = f^{k}, \qquad j = 1, 2, \ldots, n. \tag{2.5}$$

Sub-iteration 2 is given by:

$$-\rho_y v_{i,j-1}^{n+1(2)} + (1 + 2\rho_y)v_{i,j}^{n+1(2)} - \rho_y v_{i,j+1}^{n+1(2)} = v_{i,j}^{n+1(1)} + A_2\,v_{i,j}^{n+1(*)}, \qquad \forall\, i,j. \tag{2.6}$$

For various values of i and j, (2.6) can be written in a more compact matrix form as:

$$A\,v_i^{(k+1)} = f^{k+1/2}, \qquad i = 1, 2, \ldots, n, \tag{2.7}$$

where $v_{i,j}^{n+1(*)}$ is a prediction of $v_{i,j}^{n+1}$ by the extrapolation method.

2.2 IADE-DY
Splitting by using an ADI procedure, we get a set of recursion relations as follows:

$$(I + A_1)\,v_{i,j}^{n+1(1)} = -(A_2)\,v_{i,j}^{n+1(*)} + \big(2C_o v_{i,j}^{n} + C_1 v_{i,j}^{n-1}\big). \tag{2.8}$$

The two-stage iterative procedure in the IADE-DY algorithm corresponds to sweeping through the mesh, involving at each iterate the solution of an explicit equation.

2.3 DS-MF
The numerical representation of Eqs. (2.8) and (2.9) is obtained by applying the Mitchell and Fairweather scheme.
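Both sub-iterations (2.4) and (2.6) reduce to independent tridiagonal solves along grid lines, which is what makes the ADI splitting cheap compared with the penta-diagonal system of (2.2). The sketch below illustrates this structure only; it is not the authors' MPI code, the coefficient rho and the right-hand sides are placeholders, and homogeneous Dirichlet boundary values are assumed folded into the right-hand side.

```python
def thomas(a, b, c, d):
    """Solve the tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):               # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):      # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def adi_sweep_x(rhs, rho_x):
    """One x-direction sub-iteration in the form of Eq. (2.4): for each row j,
    solve -rho_x*v[i-1][j] + (1 + 2*rho_x)*v[i][j] - rho_x*v[i+1][j] = rhs[i][j]."""
    nx, ny = len(rhs), len(rhs[0])
    out = [[0.0] * ny for _ in range(nx)]
    for j in range(ny):                 # each grid line is an independent solve
        a = [-rho_x] * nx
        b = [1.0 + 2.0 * rho_x] * nx
        c = [-rho_x] * nx
        d = [rhs[i][j] for i in range(nx)]
        col = thomas(a, b, c, d)
        for i in range(nx):
            out[i][j] = col[i]
    return out
```

The y-direction sweep of Eq. (2.6) is the same operation applied column-wise; in a master-worker setting each worker can sweep its own block of grid lines independently.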
3. INPUT FILE AFFINITY MEASURE
The execution of a task comprises three phases (initialization, computation and completion), so that the total execution time of a task is

$$t_{total} = t_{init} + t_{comp} + t_{end}. \tag{3.1}$$

As the machine model, we consider a cluster composed of P + 1 processors. For the rest of this paper we assume that T » P. One processor is the master and the other processors are workers. Communication among master and workers is carried out through Ethernet, and the master can only send files through the network to a single worker at a given time. We assume the communication link is full-duplex, i.e., the master can receive an output file from a worker at the same time it sends an input file to another worker. We also assume that computation begins as soon as the input files are completely received. This assumption is coherent with the master-worker paradigm as currently available [14].

For the sake of simplicity and without loss of generality, we consider in this section that there is no contention related to the transmission of output files from worker nodes to the master. Indeed, it is possible to merge the computation phase with the completion phase without affecting the results of this section; therefore, we merge both phases as t'_comp in the equations that follow. A worker is idle when it is not involved with the execution of any of the three phases of a task. For the results below, the task model is composed of T homogeneous tasks: all tasks and files have the same size and each task depends upon a single non-shared file. Note that the problem of scheduling an application where each task depends upon a single non-shared file, all tasks and files have the same size, and the master-worker platform is heterogeneous has polynomial complexity [14]. These assumptions are considered in the analysis that follows. We define the effective number of processors Peff as the maximum number of workers needed to run an application with no idle periods on any worker processor. Taking into account the task and platform models described in this section, a processor may have idle periods if:

$$t'_{comp} < (P - 1)\,t_{init}. \tag{3.2}$$

Peff is then given by the following equation:

$$P_{eff} = \left\lceil \frac{t'_{comp}}{t_{init}} \right\rceil + 1. \tag{3.3}$$

The total number of tasks to be executed on a processor is at most

$$M = \left\lceil \frac{T}{P} \right\rceil. \tag{3.4}$$

For a platform with Peff processors, the upper bound for the total execution time (makespan) will be

$$t_{makespan} = M\,(t_{init} + t'_{comp}) + (P - 1)\,t_{init}, \tag{3.5}$$

where the second term on the right-hand side of Eq. (3.5) is the time needed to start the first (P - 1) tasks on the other P - 1 processors. If we have a platform where the number of processors is larger than Peff, the overall makespan is dominated by communication times between the master and the workers. We then have:

$$t_{makespan} = M\,P\,t_{init} + t'_{comp}. \tag{3.6}$$

The set of Eqs. (3.2) - (3.6) will be considered in subsequent sections of this paper. It is worth noting that Eq. (3.5) is valid when workers are constantly busy, either performing computation or communication. Eq. (3.6) is applicable when workers have idle periods, i.e., are not performing either computation or communication. Eq. (3.2) occurs mainly in two cases: (1) for very large platforms (P large); (2) for applications with a small ratio t_comp / t_init, such as fine-grain applications.

In order to measure the degree of affinity of a set of tasks concerning their input files, we introduce the concept of input file affinity. Given a set G of tasks, composed of K tasks, $G = \{T_1, T_2, \ldots, T_K\}$, and the set F of the Y input files needed by the tasks belonging to group G, $F = \{f_1, f_2, \ldots, f_Y\}$, we define $I_{aff}$ as follows:

$$I_{aff}(G) = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}, \tag{3.7}$$

where $|f_i|$ is the size in bytes of file $f_i$ and $N_i$ ($1 \le N_i \le K$) is the number of tasks in group G which have file $f_i$ as an input file. The term $(N_i - 1)$ in the numerator of the above equation can be explained as follows: if $N_i$ tasks share an input file $f_i$, that file may be sent only once (instead of $N_i$ times) when the group of tasks is executed on a worker node. The potential reduction of the number of bytes transferred from a master node to a worker node considering only input file $f_i$ is then $(N_i - 1)|f_i|$. Therefore, the input file affinity indicates the overall reduction of the amount of data that needs to be transferred to a worker node when all tasks of a group are sent to one node. Note that $0 \le I_{aff} < 1$. For the special case where all tasks share the same input file, $I_{aff} = \frac{k-1}{k}$, where k is the number of tasks of a group.
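Eq. (3.7) and the platform bound (3.3) are straightforward to compute. A small sketch under the model above (task and file names are illustrative; sizes are in bytes):

```python
from math import ceil

def input_file_affinity(task_files, file_sizes):
    """I_aff(G) per Eq. (3.7): task_files maps each task in the group to the
    set of input files it needs; file_sizes maps each file to its size."""
    # N_i = number of tasks in the group that need file f_i
    counts = {}
    for files in task_files.values():
        for f in files:
            counts[f] = counts.get(f, 0) + 1
    num = sum((counts[f] - 1) * file_sizes[f] for f in counts)
    den = sum(counts[f] * file_sizes[f] for f in counts)
    return num / den

def p_eff(t_comp, t_init):
    """Eq. (3.3): maximum number of workers with no idle periods."""
    return ceil(t_comp / t_init) + 1
```

With four tasks sharing a single file this returns (4 - 1)/4 = 0.75, matching the special case I_aff = (k - 1)/k; with no sharing at all it returns 0, since every byte must be sent once per task anyway.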
4. PARALLEL IMPLEMENTATION
The total time on N processors decomposes into the master time, the worker data time and the worker computational time:

$$T(N) = T_m(N) + T_{sd}(N) + T_{sc}(N). \tag{4.4}$$

Generally,

$$T_m(N) \approx T_m(1), \qquad T_{sd}(N) \approx T_{sd}(1), \qquad T_{sc}(N) \approx \frac{T_{sc}(1)}{N}, \tag{4.5}$$

where $T_{B \cdot B,L}^{N}(n)$ has the same meaning as $T_{B \cdot B}^{N}(n)$ except that the idle time is not included. Comparing these times, we obtain:

$$E_{load}(N) = \frac{T_{B \cdot B}^{N}(N)}{T_{B \cdot B,L}^{N}(N)}, \qquad E_{par}(n) = \frac{T_{B \cdot B}^{N}(1)}{n\,T_{B \cdot B}^{N}(N)}, \tag{4.9}$$

$$E_{num}(N) = \frac{T_{s}^{N_o}(1)}{T_{B \cdot B}^{N_1}(1)} = \frac{T_{s}^{N_o}(1)}{T_{B \cdot 1}^{N_o}(1)} \cdot \frac{T_{B \cdot 1}^{N_o}(1)}{T_{B \cdot B}^{N_o}(1)} \cdot \frac{T_{B \cdot B}^{N_o}(1)}{T_{B \cdot B}^{N_1}(1)},$$

where $C_n = nT_n$, $T_1$ is the execution time on a serial machine and $T_n$ is the computing time on a parallel machine with N processors. Hence, effectiveness can be written as:

$$L_n = \frac{S_n}{nT_n} = \frac{E_n}{T_n} = \frac{E_n S_n}{T_1}. \tag{4.12}$$

5. Results and Discussion
Table 1: The wall time TW, the master time TM, the worker data time TSD, the worker computational time TSC, the total time T, the parallel speed-up Spar and the efficiency Epar for a mesh of 300x300, with B = 50 blocks and Niter = 100 for PVM and MPI.

Table 2: The wall time TW, the master time TM, the worker data time TSD, the worker computational time TSC, the total time T, the parallel speed-up Spar and the efficiency Epar for a mesh of 300x300, with B = 100 blocks and Niter = 100 for MPI.

Table 3: The wall time TW, the master time TM, the worker data time TSD, the worker computational time TSC, the total time T, the parallel speed-up Spar and the efficiency Epar for a mesh of 300x300, with B = 200 blocks and Niter = 100 for MPI.

The schemes are applied to the model problem

$$\frac{\partial^2 U}{\partial x^2} + \frac{\partial^2 U}{\partial y^2} = \frac{\partial^2 U}{\partial t^2} + 2\frac{\partial U}{\partial t} + U. \tag{5.1}$$
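The algebra of the effectiveness measure in Eq. (4.12) can be checked numerically. A minimal sketch with hypothetical timings (T1, Tn and n below are made-up values, not measurements from this paper):

```python
def performance_metrics(t1, tn, n):
    """Speed-up S_n, efficiency E_n and effectiveness L_n = S_n / (n*T_n),
    following Eq. (4.12) with cost C_n = n*T_n."""
    s_n = t1 / tn            # speed-up
    e_n = s_n / n            # parallel efficiency
    l_n = s_n / (n * tn)     # effectiveness
    # The alternative forms in Eq. (4.12) must agree:
    assert abs(l_n - e_n / tn) < 1e-12
    assert abs(l_n - e_n * s_n / t1) < 1e-12
    return s_n, e_n, l_n
```

For T1 = 100 s and Tn = 30 s on n = 4 processors this gives S about 3.33, E about 0.83 and L about 0.028, and the three forms of L_n coincide.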
Table 4: Effectiveness of the various schemes with MPI for a 300x300 mesh size, B = 200.

Scheme | N | T(s) | Ln
ADI | 2 | 816.05 | 0.118
ADI | 8 | 371.35 | 0.142
ADI | 16 | 261.35 | 0.143
ADI | 20 | 217.92 | 0.165
ADI | 30 | 171.73 | 0.177
ADI | 48 | 137.65 | 0.172
IADE | 2 | 992.30 | 0.098
IADE | 8 | 417.55 | 0.138
IADE | 16 | 293.22 | 0.140
IADE | 20 | 250.52 | 0.154
IADE | 30 | 202.77 | 0.157
IADE | 48 | 161.03 | 0.155
MF-DS | 2 | 1329.61 | 0.074
MF-DS | 8 | 495.18 | 0.133
MF-DS | 16 | 357.94 | 0.128
MF-DS | 20 | 307.31 | 0.139
MF-DS | 30 | 259.00 | 0.130
MF-DS | 48 | 207.38 | 0.127

Table 5: Performance improvement of the different schemes for a 300x300 mesh size (times in seconds; the percentage columns give the improvement of ADI over IADE and of IADE over MF-DS).

N | ADI | IADE | MF-DS | % ADI vs IADE | % IADE vs MF-DS
16 | 261.35 | 293.22 | 357.94 | 10.87 | 18.08
20 | 217.92 | 250.52 | 307.31 | 13.01 | 18.48
30 | 171.73 | 202.77 | 259.00 | 15.31 | 21.71
48 | 137.65 | 161.03 | 207.38 | 14.52 | 22.35

Fig.1: Speed-up versus the number of workers for various block sizes: (a) mesh 200x200, MPI; (b) mesh 300x300, MPI.

Fig.2: Parallel efficiency versus the number of workers for various block sizes: (a) mesh 200x200, MPI; (b) mesh 300x300, MPI.
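The percentage columns of Table 5 can be reproduced from the Table 4 timings: the improvement of ADI over IADE is 100(T_IADE - T_ADI)/T_IADE, and that of IADE over MF-DS is 100(T_MFDS - T_IADE)/T_MFDS. A quick check against the N = 16 row:

```python
def improvement(t_slow, t_fast):
    """Percentage reduction in wall time of the faster scheme, as in Table 5."""
    return round(100.0 * (t_slow - t_fast) / t_slow, 2)

# Timings in seconds from Table 4 (300x300 mesh, B = 200, N = 16)
t_adi, t_iade, t_mfds = 261.35, 293.22, 357.94
print(improvement(t_iade, t_adi))   # ADI over IADE   -> 10.87
print(improvement(t_mfds, t_iade))  # IADE over MF-DS -> 18.08
```

The same formula reproduces every row of Table 5, e.g. 13.01 and 18.48 for N = 20.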
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 8, Issue 2, March - April 2019, ISSN 2278-6856
Table 4 shows the effectiveness of the various schemes with MPI. As the number of processors increases, the effectiveness of the MF-DS scheme improves significantly more than that of the ADI. As the total number of processors increases, the bottleneck of parallel computers appears and the global reduction consumes a large part of the time; we anticipate that the improvement in Table 5 will become significant. Fig. 3 and Fig. 4 show the efficiency using MPI with varying block sizes, which converges using the Iaff measure.

5.3 Numerical Efficiency
The numerical efficiency E_num includes the Domain Decomposition efficiency E_DD and the convergence rate behavior N_o / N_1, as defined in Section 4. The DD efficiency $E_{DD} = T_{B \cdot 1}^{N_o}(1) / T_{B \cdot B}^{N_o}(1)$ includes the increase of floating point operations induced by grid overlap at interfaces and the CPU time variation generated by DD techniques. In Tables 2 and 3, we listed the total CPU time distribution over various grid sizes and block numbers running on different processors for MPI. The DD efficiency E_DD is calculated and the result is shown in Fig. 3. Note that the DD efficiency can be greater than one, even with one processor. Fig. 3 and Fig. 4 show that the optimum number of sub-domains, which maximizes the DD efficiency E_DD, increases with the grid size. The convergence rate behavior N_o / N_1, the ratio of the iteration number for the best sequential CPU time on one processor to the iteration number for the parallel CPU time on N processors, describes the increase in the number of iterations required by the parallel method to achieve a specified accuracy as compared to the serial method. This increase is caused mainly by the deterioration in the rate of convergence with an increasing number of processors and sub-domains. Because the best serial algorithm is generally not known, we take the existing parallel program running on one processor in its place. The question is then how the decomposition strategy affects the convergence rate. The results are summarized in Table 6 and Fig. 4.

Table 6: The number of iterations to achieve a given tolerance of 10^-3 for a grid of 300x300 for MPI.

Scheme | N | B=20 | B=50 | B=100 | B=200
ADI | 2 | 1818 | 3249 | 3611 | 1521
ADI | 4 | 1818 | 3568 | 3823 | 1768
ADI | 8 | 1818 | 3992 | 4094 | 1904
ADI | 16 | 1818 | 4327 | 4468 | 2112
ADI | 30 | 1818 | 4509 | 4713 | 2307
ADI | 48 | 1818 | 4813 | 4994 | 2548
IADE | 2 | 2751 | 3976 | 4418 | 2154
IADE | 4 | 2751 | 4132 | 4632 | 2294
IADE | 8 | 2751 | 4378 | 4895 | 2478
IADE | 16 | 2751 | 4599 | 5109 | 2653
IADE | 30 | 2751 | 4708 | 5343 | 2801
IADE | 48 | 2751 | 4996 | 5599 | 2994
MF-DS | 2 | 3101 | 4321 | 4852 | 3211
MF-DS | 4 | 3101 | 4582 | 5074 | 3528
MF-DS | 8 | 3101 | 4718 | 5362 | 3769
MF-DS | 16 | 3101 | 4992 | 5621 | 3928
MF-DS | 30 | 3101 | 5284 | 5895 | 4325
MF-DS | 48 | 3101 | 5476 | 6122 | 4518

It can be seen that N_o / N_1 decreases with increasing block number and increasing number of processors for a given grid size. The larger the grid size, the higher the convergence rate. For a given block number, a higher convergence rate is obtained with fewer processors. This is because one processor may be responsible for a few sub-domains at each iteration. If some of these sub-domains share common interfaces, the subsequent blocks to be computed will use the newly updated boundary values, and therefore an improved convergence rate results. The convergence rate is reduced when the block number is large. The reason for this is evident: the boundary conditions propagate to the interior domain in the serial computation after one iteration, but this is delayed in the parallel computation. In addition, the values of variables at the interfaces used in the current iteration are the previous values obtained in the last iteration. Therefore, the parallel algorithm is less "implicit" than the serial one. Despite these inherent shortcomings, a high efficiency is obtained for large-scale problems.

Fig.3: The DD efficiency versus the number of sub-domains for various blocks of MPI.

Fig.4: Convergence behavior with domain decomposition for mesh 200x200, MF-DS, MPI.
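The convergence-rate degradation just described can be read directly off Table 6. The serial iteration count N_o is not listed in this excerpt, so the sketch below takes the N = 2 row as an illustrative baseline for the ADI scheme with B = 50:

```python
# Iteration counts for ADI with B = 50, from Table 6 (N = 2, 4, 8, 16, 30, 48)
iters = [3249, 3568, 3992, 4327, 4509, 4813]
baseline = iters[0]  # N = 2, stand-in for the serial count N_o
ratios = [baseline / k for k in iters]
# More processors require more iterations for the same 1e-3 tolerance,
# so the ratio N_o / N_1 decreases monotonically.
assert all(ratios[i] > ratios[i + 1] for i in range(len(ratios) - 1))
print([round(r, 3) for r in ratios])
```

The same monotone decrease holds for every scheme and block size in Table 6 except B = 20, where the iteration count is independent of N.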
[16.] A. Geist, A. Beguelin, J. Dongarra, 1994. Parallel Virtual Machine (PVM). Cambridge, MA: MIT Press.
[17.] G.A. Geist, V.M. Sunderam, 1992. Network Based Concurrent Computing on the PVM System. Concurrency: Practice and Experience, pp 293 – 311.
[18.] W. Gropp, E. Lusk, A. Skjellum, 1999. Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd Ed., Cambridge, MA: MIT Press.
[19.] Guang-Wei Y., Long-Jun S., Yu-Lin Z., 2001. Unconditional Stability of Parallel Alternating Difference Schemes for Semilinear Parabolic Systems. Applied Mathematics and Computation 117, pp 267 – 283.
[20.] M. Gupta, P. Banerjee, 1992a. Demonstration of Automatic Data Partitioning for Parallelizing Compilers on Multi-Computers. IEEE Trans. Parallel Distributed Systems, 3, vol. 2, pp 179 – 193.
[21.] K. Jaris, D.G. Alan, 2003. A High-Performance Communication Service for Parallel Computing on Distributed Systems. Parallel Computing 29, pp 851 – 878.
[22.] J.Z. Jennifer, et al., 2002. A Two Level Finite Difference Scheme for 1-D Penne's Bio-Heat Equation.
[23.] J. Liu, L.X. Xu, 2001. Estimation of Blood Perfusion Using Phase Shift Temperature Response to Sinusoidal Heating at Skin Surfaces. IEEE Trans. Biomed. Eng., vol. 46, pp 1037 – 1043.
[24.] Mitchell, A.R., Fairweather, G., 1964. Improved Forms of the Alternating Direction Methods of Douglas, Peaceman and Rachford for Solving Parabolic and Elliptic Equations. Numer. Math., 6, pp 285 – 292.
[25.] J. Noye, 1996. Finite Difference Methods for Partial Differential Equations. Numerical Solutions of Partial Differential Equations. North-Holland Publishing Company.
[26.] D.W. Peaceman, H.H. Rachford, 1955. The Numerical Solution of Parabolic and Elliptic Differential Equations. Journal of Soc. Indust. Applied Math. 8 (1), pp 28 – 41.
[27.] Peizong L., Z. Kedem, 2002. Automatic Data and Computation Decomposition on Distributed Memory Parallel Computers. ACM Transactions on Programming Languages and Systems, vol. 24, no. 1, pp 1 – 50.
[28.] H.H. Pennes, 1948. Analysis of Tissue and Arterial Blood Temperature in the Resting Human Forearm. Journal of Applied Physiol. 1, pp 93 – 122.
[29.] M.J. Quinn, 2001. Parallel Programming in C. McGraw-Hill Higher Education, New York.
[30.] R. Rajamony, A.L. Cox, 1997. Performance Debugging Shared Memory Parallel Programs Using Run-Time Dependence Analysis. Performance Review 25 (1), pp 75 – 87.
[31.] J. Rantakokko, 2000. Partitioning Strategies for Structured Multi-Block Grids. Parallel Computing 26, pp 1661 – 1680.
[32.] B.V. Rathish Kumar, et al., 2001. A Parallel MIMD Cell Partitioned ADI Solver for Parabolic Partial Differential Equations on VPP 700. Parallel Computing 42, pp 324 – 340.
[33.] B. Rubinsky, et al., 1976. Analysis of a Stefan-Like Problem in a Biological Tissue Around a Cryosurgical Probe. ASME J. Biomech. Eng., vol. 98, pp 514 – 519.
[34.] Sahimi, M.S., Sundararajan, E., Subramaniam, M., Hamid, N.A.A., 2001. The D'Yakonov Fully Explicit Variant of the Iterative Decomposition Method. International Journal of Computers and Mathematics with Applications, 42, pp 1485 – 1496.
[35.] V.T. Sahni, 1996. Performance Metrics: Keeping the Focus in Routine. IEEE Parallel and Distributed Technology, Spring, pp 43 – 56.
[36.] G.D. Smith, 1985. Numerical Solution of Partial Differential Equations: Finite Difference Methods, 3rd Ed., Oxford University Press, New York.
[37.] M. Snir, S. Otto, et al., 1998. MPI: The Complete Reference, 2nd Ed., Cambridge, MA: MIT Press.
[38.] X.H. Sun, J. Gustafson, 1991. Toward a Better Parallel Performance Metric. Parallel Computing 17.
[39.] M. Tian, D. Yang, 2007. Parallel Finite-Difference Schemes for Heat Equation Based upon Overlapping Domain Decomposition. Applied Maths and Computation, 186, pp 1276 – 1292.

AUTHOR
Ewedafe Simon Uzezi received the B.Sc. and M.Sc. degrees in Industrial-Mathematics and Mathematics from Delta State University and The University of Lagos in 1998 and 2003 respectively. He further obtained his Ph.D. in Numerical Parallel Computing in 2010 from the University of Malaya, Malaysia. In 2011 he joined UTAR in Malaysia as an Assistant Professor and later lectured in Oman and Nigeria as a Senior Lecturer in Computing. Currently he is a Senior Lecturer in the Department of Computing, UWI, Jamaica.