
CMPE 4003/6004

Dept of ECE, Curtin University

Module 3
Distributed Memory
Programming with
MPI

Roadmap

Writing your first MPI program.
Using the common MPI functions.
The Trapezoidal Rule in MPI.
Collective communication.
MPI derived datatypes.
Performance evaluation of MPI programs.
Parallel sorting.
Safety in MPI programs.

A distributed memory system

A shared memory system

Hello World!

(a classic)
5

Identifying MPI processes

Common practice to identify processes by nonnegative integer ranks.
p processes are numbered 0, 1, 2, ..., p-1.

Our first MPI program
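A minimal sketch of such a program, consistent with the output shown on the following slides (the buffer size and message text are illustrative):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void) {
   char greeting[100];   /* message buffer */
   int  comm_sz;         /* number of processes */
   int  my_rank;         /* my process rank */

   MPI_Init(NULL, NULL);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank != 0) {
      /* Every process other than 0 sends a greeting to process 0 */
      sprintf(greeting, "Greetings from process %d of %d !", my_rank, comm_sz);
      MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
   } else {
      /* Process 0 prints its own greeting, then the others' in rank order */
      printf("Greetings from process %d of %d !\n", my_rank, comm_sz);
      for (int q = 1; q < comm_sz; q++) {
         MPI_Recv(greeting, 100, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         printf("%s\n", greeting);
      }
   }

   MPI_Finalize();
   return 0;
}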

Compilation

mpicc -g -Wall -o mpi_hello mpi_hello.c

mpicc: wrapper script to compile the source file
-g: produce debugging information
-Wall: turns on all warnings
-o mpi_hello: create this executable file name (as opposed to the default a.out)

Execution
mpiexec -n <number of processes> <executable>

mpiexec -n 1 ./mpi_hello
run with 1 process

mpiexec -n 4 ./mpi_hello
run with 4 processes

Execution
mpiexec -n 1 ./mpi_hello
Greetings from process 0 of 1 !
mpiexec -n 4 ./mpi_hello
Greetings from process 0 of 4 !
Greetings from process 1 of 4 !
Greetings from process 2 of 4 !
Greetings from process 3 of 4 !
10

MPI Programs

Written in C.
Has main.
Uses stdio.h, string.h, etc.

Need to add the mpi.h header file.

Identifiers defined by MPI start with MPI_.
The first letter following the underscore is uppercase.
For function names and MPI-defined types.
Helps to avoid confusion.
11

MPI Components

MPI_Init

Tells MPI to do all the necessary setup.

MPI_Finalize

Tells MPI we're done, so clean up anything allocated for this program.

12

Basic Outline

13

Communicators

A collection of processes that can send messages to each other.
MPI_Init defines a communicator that
consists of all the processes created when
the program is started.

Called MPI_COMM_WORLD.

14

Communicators

number of processes in the communicator

my rank
(the process making this call)
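The two calls these annotations refer to are presumably MPI_Comm_size and MPI_Comm_rank; a minimal sketch:

int comm_sz;   /* number of processes in the communicator */
int my_rank;   /* rank of the process making this call    */
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);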

15

SPMD

Single-Program, Multiple-Data

We compile one program.
Process 0 does something different.
Receives messages and prints them while the other processes do the work.
The if-else construct makes our program SPMD.

16

Communication

17

Data types

18

Communication

19

Message matching

Process q sends with MPI_Send, specifying dest = r.
Process r receives with MPI_Recv, specifying src = q.
The message is matched when the communicator and tag on the send agree with those on the receive.
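In code (q and r are the ranks from the slide; the buffer, count, and tag are illustrative):

/* On process q: send one double to process r with tag 0 */
double x = 3.14;
MPI_Send(&x, 1, MPI_DOUBLE, r, 0, MPI_COMM_WORLD);

/* On process r: receive it, naming q as the source and using the same tag */
double y;
MPI_Recv(&y, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);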

20

Receiving messages

A receiver can get a message without knowing:
the amount of data in the message,
the sender of the message,
or the tag of the message.

21

status_p argument

The status_p argument has type MPI_Status*.

MPI_Status status;
status.MPI_SOURCE
status.MPI_TAG

MPI_Status is a struct with at least the members MPI_SOURCE, MPI_TAG, and MPI_ERROR.

22

How much data am I receiving?
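A minimal sketch using MPI_Get_count together with the status object above (buffer size and element type are illustrative):

double buf[100];
MPI_Status status;
int count;

/* Receive from any source, with any tag */
MPI_Recv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

/* How many MPI_DOUBLEs were actually received? */
MPI_Get_count(&status, MPI_DOUBLE, &count);
printf("Received %d doubles from process %d with tag %d\n",
       count, status.MPI_SOURCE, status.MPI_TAG);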

23

Issues with send and receive

Exact behavior is determined by the MPI implementation.
MPI_Send may behave differently with regard to buffer size, cutoffs, and blocking.
MPI_Recv always blocks until a matching message is received.
Know your implementation; don't make assumptions!

24

TRAPEZOIDAL RULE IN MPI

25

The Trapezoidal Rule

26

The Trapezoidal Rule

27

One trapezoid

28

Pseudo-code for a serial program
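A minimal serial sketch of the rule, assuming an integrand f defined elsewhere:

/* Estimate the integral of f from a to b using n trapezoids of width h */
double Trap(double a, double b, int n, double h) {
   double approx = (f(a) + f(b)) / 2.0;
   for (int i = 1; i <= n - 1; i++)
      approx += f(a + i * h);
   return approx * h;
}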

29

Parallelizing the Trapezoidal Rule


1. Partition the problem solution into tasks.
2. Identify communication channels between tasks.
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.

30

Parallel pseudo-code
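A minimal sketch of the first (point-to-point) version as a complete program; the integrand f and the values of a, b, and n are illustrative, and Trap is the serial helper above:

#include <stdio.h>
#include <mpi.h>

double f(double x) { return x * x; }   /* illustrative integrand */

double Trap(double a, double b, int n, double h) {
   double approx = (f(a) + f(b)) / 2.0;
   for (int i = 1; i <= n - 1; i++)
      approx += f(a + i * h);
   return approx * h;
}

int main(void) {
   int my_rank, comm_sz;
   double a = 0.0, b = 3.0;   /* interval endpoints (illustrative) */
   int n = 1024;              /* total number of trapezoids        */

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

   double h = (b - a) / n;        /* width of each trapezoid */
   int local_n = n / comm_sz;     /* trapezoids per process  */
   double local_a = a + my_rank * local_n * h;
   double local_b = local_a + local_n * h;
   double local_int = Trap(local_a, local_b, local_n, h);

   if (my_rank != 0) {
      MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
   } else {
      double total_int = local_int;
      for (int source = 1; source < comm_sz; source++) {
         MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         total_int += local_int;
      }
      printf("With n = %d trapezoids, the estimate is %.15e\n", n, total_int);
   }

   MPI_Finalize();
   return 0;
}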

31

Tasks and communications for the Trapezoidal Rule

32

First version (1)

33

First version (2)

34

First version (3)

35

Dealing with I/O


Each process just
prints a message.

36

Running with 6 processes

unpredictable output

37

Input

Most MPI implementations only allow process 0 in MPI_COMM_WORLD access to stdin.
Process 0 must read the data (scanf) and send it to the other processes.

38

Function for reading user input
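A minimal sketch of such a function, using only point-to-point communication (the broadcast version appears later):

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
   if (my_rank == 0) {
      printf("Enter a, b, and n\n");
      scanf("%lf %lf %d", a_p, b_p, n_p);
      /* Send the three values to every other process */
      for (int dest = 1; dest < comm_sz; dest++) {
         MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         MPI_Send(n_p, 1, MPI_INT,    dest, 0, MPI_COMM_WORLD);
      }
   } else {
      MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(n_p, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
   }
}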

39

COLLECTIVE
COMMUNICATION
40

Tree-structured communication
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and 7 sends to 6.
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their new values.
2. (a) Process 4 sends its newest value to process 0.
(b) Process 0 adds the received value to its newest value.

41

A tree-structured global sum

42

An alternative tree-structured
global sum

43

MPI_Reduce
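A minimal usage sketch, summing each process's local_int (from the trapezoidal-rule example) into total_int on process 0:

/* Combine the local estimates into a global sum on process 0 */
double total_int = 0.0;
MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);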

44

Predefined reduction operators in MPI

45

Collective vs. Point-to-Point Communications

All the processes in the communicator must call the same collective function.
For example, a program that attempts to match a call to MPI_Reduce on one process with a call to MPI_Recv on another process is erroneous, and, in all likelihood, the program will hang or crash.
46

Collective vs. Point-to-Point Communications

The arguments passed by each process to an MPI collective communication must be compatible.
For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash.
47

Collective vs. Point-to-Point Communications

The output_data_p argument is only used on dest_process.
However, all of the processes still need to pass in an actual argument corresponding to output_data_p, even if it's just NULL.

48

Collective vs. Point-to-Point Communications

Point-to-point communications are matched on the basis of tags and communicators.
Collective communications don't use tags.
They're matched solely on the basis of the communicator and the order in which they're called.
49

Example (1)

Multiple calls to MPI_Reduce

50

Example (2)

Suppose that each process calls MPI_Reduce with operator MPI_SUM and destination process 0.
At first glance, it might seem that after the two calls to MPI_Reduce, the value of b will be 3, and the value of d will be 6.
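A hedged reconstruction of the calls behind this example, consistent with the sums worked out on the next slide (three processes, each with a = 1 and c = 2, and process 1 issuing its two calls in the opposite order):

On processes 0 and 2:
   MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, comm);   /* first call  */
   MPI_Reduce(&c, &d, 1, MPI_INT, MPI_SUM, 0, comm);   /* second call */
On process 1:
   MPI_Reduce(&c, &d, 1, MPI_INT, MPI_SUM, 0, comm);   /* first call  */
   MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, comm);   /* second call */

The first calls on the three processes match each other, and so do the second calls.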

51

Example (3)

However, the names of the memory locations are irrelevant to the matching of the calls to MPI_Reduce.
The order of the calls will determine the matching, so the value stored in b will be 1+2+1 = 4, and the value stored in d will be 2+1+2 = 5.

52

MPI_Allreduce

Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger computation.
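A minimal sketch; the argument list is the same as MPI_Reduce's except that there is no destination process, since every process receives the result:

double total_int = 0.0;
MPI_Allreduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
/* total_int now holds the global sum on every process */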

53

A global sum followed by distribution of the result.

54

A butterfly-structured global sum.

55

Broadcast

Data belonging to a single process is sent to all of the processes in the communicator.

56

A tree-structured broadcast.

57

A version of Get_input that uses MPI_Bcast
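A minimal sketch of such a function:

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
   if (my_rank == 0) {
      printf("Enter a, b, and n\n");
      scanf("%lf %lf %d", a_p, b_p, n_p);
   }
   /* Process 0's values are copied to every other process */
   MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   MPI_Bcast(n_p, 1, MPI_INT,    0, MPI_COMM_WORLD);
}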

58

Data distributions

Compute a vector sum.

59

Serial implementation of vector addition

60

Different partitions of a 12-component vector among 3 processes

61

Partitioning options

Block partitioning
Assign blocks of consecutive components to each process.

Cyclic partitioning
Assign components in a round-robin fashion.

Block-cyclic partitioning
Use a cyclic distribution of blocks of components.
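A small sketch of which process owns component i under each scheme, assuming n is divisible by comm_sz and a block size b for the block-cyclic case:

int local_n = n / comm_sz;                   /* components per process                  */
int owner_block        = i / local_n;        /* block: consecutive chunks               */
int owner_cyclic       = i % comm_sz;        /* cyclic: round robin, one at a time      */
int owner_block_cyclic = (i / b) % comm_sz;  /* block-cyclic: round robin, b at a time  */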

62

Parallel implementation of vector addition
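A minimal sketch, assuming a block partition in which each process works only on its own local_n components:

void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
   for (int local_i = 0; local_i < local_n; local_i++)
      local_z[local_i] = local_x[local_i] + local_y[local_i];
}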

63

Scatter

MPI_Scatter can be used in a function that reads in an entire vector on process 0 but only sends the needed components to each of the other processes.
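A minimal sketch of the distribution step, assuming n is divisible by comm_sz and local_a has room for local_n doubles:

double* a = NULL;
if (my_rank == 0) {
   a = malloc(n * sizeof(double));
   for (int i = 0; i < n; i++) scanf("%lf", &a[i]);   /* read the full vector */
}
/* Every process (including 0) receives its own block of local_n components */
MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if (my_rank == 0) free(a);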

64

Reading and distributing a vector

65

Gather

Collect all of the components of the vector onto process 0, and then process 0 can process all of the components.
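A minimal sketch, gathering the distributed vector onto process 0 and printing it there:

double* b = NULL;
if (my_rank == 0) b = malloc(n * sizeof(double));
/* Collect each process's local_n components, in rank order, on process 0 */
MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if (my_rank == 0) {
   for (int i = 0; i < n; i++) printf("%f ", b[i]);
   printf("\n");
   free(b);
}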

66

Print a distributed vector (1)

67

Print a distributed vector (2)

68

Allgather

Concatenates the contents of each process's send_buf_p and stores this in each process's recv_buf_p.
As usual, recv_count is the amount of data being received from each process.
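A minimal sketch, assembling the full vector x on every process from each process's local_x block:

/* Gather every block of x and concatenate them, on every process */
MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, MPI_COMM_WORLD);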

69

Matrix-vector multiplication

i-th component of y
Dot product of the i-th row of A with x.

70

Matrix-vector multiplication

71

Multiply a matrix by a vector

Serial pseudo-code

72

C-style arrays

stored as

73

Serial matrix-vector
multiplication
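A minimal serial sketch, with A stored as a one-dimensional array in row-major order as on the previous slide:

void Mat_vect_mult(double A[], double x[], double y[], int m, int n) {
   for (int i = 0; i < m; i++) {
      y[i] = 0.0;
      for (int j = 0; j < n; j++)
         y[i] += A[i*n + j] * x[j];   /* dot product of row i of A with x */
   }
}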

74

An MPI matrix-vector
multiplication function (1)

75

An MPI matrix-vector
multiplication function (2)
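A minimal sketch of such a function, assuming a block-row distribution of A, block distributions of x and y, and n divisible by comm_sz; each process first assembles all of x with MPI_Allgather, then computes its own rows of y:

void Mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
   double* x = malloc(n * sizeof(double));

   /* Every process needs all of x to form its dot products */
   MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

   for (int local_i = 0; local_i < local_m; local_i++) {
      local_y[local_i] = 0.0;
      for (int j = 0; j < n; j++)
         local_y[local_i] += local_A[local_i*n + j] * x[j];
   }
   free(x);
}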

76

MPI DERIVED DATATYPES

77

Derived datatypes

Used to represent any collection of data items in memory by storing both the types of the items and their relative locations in memory.
The idea is that if a function that sends data knows this information about a collection of data items, it can collect the items from memory before they are sent.
Similarly, a function that receives data can distribute the items into their correct destinations in memory when they're received.

78

Derived datatypes

Formally, a derived datatype consists of a sequence of basic MPI data types together with a displacement for each of the data types.
Trapezoidal Rule example:

79

MPI_Type_create_struct

Builds a derived datatype that consists of individual elements that have different basic types.

80

MPI_Get_address

Returns the address of the memory location referenced by location_p.
The special type MPI_Aint is an integer type that is big enough to store an address on the system.

81

MPI_Type_commit

Allows the MPI implementation to optimize its internal representation of the datatype for use in communication functions.

82

MPI_Type_free

When we're finished with our new type, this frees any additional storage used.
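A minimal sketch tying the four calls together for the trapezoidal-rule inputs a, b, and n (the helper name Build_mpi_type is illustrative):

void Build_mpi_type(double* a_p, double* b_p, int* n_p,
                    MPI_Datatype* input_mpi_t_p) {
   int array_of_blocklengths[3] = {1, 1, 1};
   MPI_Datatype array_of_types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
   MPI_Aint a_addr, b_addr, n_addr;
   MPI_Aint array_of_displacements[3] = {0};

   /* Displacements are measured relative to the address of a */
   MPI_Get_address(a_p, &a_addr);
   MPI_Get_address(b_p, &b_addr);
   MPI_Get_address(n_p, &n_addr);
   array_of_displacements[1] = b_addr - a_addr;
   array_of_displacements[2] = n_addr - a_addr;

   MPI_Type_create_struct(3, array_of_blocklengths, array_of_displacements,
                          array_of_types, input_mpi_t_p);
   MPI_Type_commit(input_mpi_t_p);
}

/* After the last communication that uses it: MPI_Type_free(input_mpi_t_p); */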

83

Get_input function with a derived datatype (1)

84

Get_input function with a derived datatype (2)

85

Get_input function with a derived datatype (3)

86

PERFORMANCE EVALUATION

87

Elapsed parallel time

MPI_Wtime returns the number of seconds that have elapsed since some time in the past.

88

Elapsed serial time

In this case, you don't need to link in the MPI libraries.
Returns time in microseconds elapsed from some point in the past.

89

Elapsed serial time

90

MPI_Barrier

Ensures that no process will return from calling it until every process in the communicator has started calling it.
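A minimal sketch combining MPI_Barrier and MPI_Wtime to time a block of parallel code and report the slowest process's time:

double local_start, local_finish, local_elapsed, elapsed;

MPI_Barrier(MPI_COMM_WORLD);          /* start everyone at (roughly) the same time */
local_start = MPI_Wtime();
/* ... code being timed ... */
local_finish  = MPI_Wtime();
local_elapsed = local_finish - local_start;

/* The parallel run-time is the time taken by the slowest process */
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (my_rank == 0) printf("Elapsed time = %e seconds\n", elapsed);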

91

MPI_Barrier

92

Run-times of serial and parallel matrix-vector multiplication
(Seconds)

93

Speedup

94

Efficiency
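The formulas behind these two slides, following the definitions restated in the concluding remarks (speedup is the ratio of the serial run-time to the parallel run-time; efficiency is the speedup divided by the number of processes):

S(n, p) = T_serial(n) / T_parallel(n, p)
E(n, p) = S(n, p) / p = T_serial(n) / (p * T_parallel(n, p))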

95

Speedups of Parallel Matrix-Vector Multiplication

96

Efficiencies of Parallel Matrix-Vector Multiplication

97

Scalability

A program is scalable if the problem size can be increased at a rate so that the efficiency doesn't decrease as the number of processes increases.

98

Scalability

Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be strongly scalable.
Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number of processes are sometimes said to be weakly scalable.
99

A PARALLEL SORTING
ALGORITHM
100

Sorting

n keys and p = comm_sz processes.
n/p keys assigned to each process.
No restrictions on which keys are assigned to which processes.
When the algorithm terminates:
The keys assigned to each process should be sorted in (say) increasing order.
If 0 ≤ q < r < p, then each key assigned to process q should be less than or equal to every key assigned to process r.
101

Serial bubble sort

102

Odd-even transposition sort

A sequence of phases.
Even phases, compare-swaps: (a[0], a[1]), (a[2], a[3]), (a[4], a[5]), ...
Odd phases, compare-swaps: (a[1], a[2]), (a[3], a[4]), (a[5], a[6]), ...

103

Example
Start: 5, 9, 4, 3
Even phase: compare-swap (5,9) and (4,3), getting the list 5, 9, 3, 4
Odd phase: compare-swap (9,3), getting the list 5, 3, 9, 4
Even phase: compare-swap (5,3) and (9,4), getting the list 3, 5, 4, 9
Odd phase: compare-swap (5,4), getting the list 3, 4, 5, 9
104

Serial odd-even transposition sort
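A minimal serial sketch (a holds the n keys):

void Odd_even_sort(int a[], int n) {
   int phase, i, tmp;
   for (phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {                 /* even phase: pairs (0,1), (2,3), ... */
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) { tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp; }
      } else {                              /* odd phase: pairs (1,2), (3,4), ...  */
         for (i = 1; i < n - 1; i += 2)
            if (a[i] > a[i+1]) { tmp = a[i]; a[i] = a[i+1]; a[i+1] = tmp; }
      }
   }
}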

105

Communications among tasks in odd-even sort

Tasks determining a[i] are labeled with a[i].

106

Parallel odd-even transposition sort

107

Pseudo-code

108

Compute_partner
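A minimal sketch of how each process can compute its partner for a given phase (MPI_PROC_NULL marks a process that sits the phase out):

int Compute_partner(int phase, int my_rank, int comm_sz) {
   int partner;
   if (phase % 2 == 0) {                        /* even phase */
      if (my_rank % 2 != 0) partner = my_rank - 1;
      else                  partner = my_rank + 1;
   } else {                                     /* odd phase */
      if (my_rank % 2 != 0) partner = my_rank + 1;
      else                  partner = my_rank - 1;
   }
   if (partner == -1 || partner == comm_sz)
      partner = MPI_PROC_NULL;                  /* idle in this phase */
   return partner;
}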

109

Safety in MPI programs

The MPI standard allows MPI_Send to behave in two different ways:
it can simply copy the message into an MPI-managed buffer and return,
or it can block until the matching call to MPI_Recv starts.

110

Safety in MPI programs

Many implementations of MPI set a threshold at which the system switches from buffering to blocking.
Relatively small messages will be buffered by MPI_Send.
Larger messages will cause it to block.

111

Safety in MPI programs

If the MPI_Send executed by each process blocks, no process will be able to start executing a call to MPI_Recv, and the program will hang or deadlock.
Each process is blocked waiting for an event that will never happen.
(see pseudo-code)
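A hedged sketch of the pattern being described, in which every process sends before any process receives:

/* Every process does this, e.g. exchanging data with a partner: */
MPI_Send(msg_out, size, MPI_INT, partner, 0, comm);
MPI_Recv(msg_in,  size, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
/* If every MPI_Send blocks, no one ever reaches MPI_Recv: deadlock. */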
112

Safety in MPI programs

A program that relies on MPI-provided buffering is said to be unsafe.
Such a program may run without problems for various sets of input, but it may hang or crash with other sets.

113

MPI_Ssend

An alternative to MPI_Send defined by the MPI standard.
The extra "s" stands for synchronous, and MPI_Ssend is guaranteed to block until the matching receive starts.

114

Restructuring communication

115

MPI_Sendrecv

An alternative to scheduling the communications ourselves.
Carries out a blocking send and a receive in a single call.
The dest and the source can be the same or different.
Especially useful because MPI schedules the communications so that the program won't hang or crash.
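A minimal sketch of the exchange from the unsafe example rewritten as a single call:

/* Send msg_out to partner and receive msg_in from partner in one call */
MPI_Sendrecv(msg_out, size, MPI_INT, partner, 0,
             msg_in,  size, MPI_INT, partner, 0,
             comm, MPI_STATUS_IGNORE);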
116

MPI_Sendrecv

117

Safe communication with five processes

118

Parallel odd-even transposition sort

119

Run-times of parallel odd-even sort
(times are in milliseconds)

120

Concluding Remarks (1)

MPI, or the Message-Passing Interface, is a library of functions that can be called from C, C++, or Fortran programs.
A communicator is a collection of processes that can send messages to each other.
Many parallel programs use the single-program, multiple-data, or SPMD, approach.

121

Concluding Remarks (2)

Most serial programs are deterministic: if we run the same program with the same input we'll get the same output.
Parallel programs often don't possess this property.
Collective communications involve all the processes in a communicator.

122

Concluding Remarks (3)

When we time parallel programs, we're usually interested in elapsed time or "wall clock" time.
Speedup is the ratio of the serial run-time to the parallel run-time.
Efficiency is the speedup divided by the number of parallel processes.

123

Concluding Remarks (4)

If it's possible to increase the problem size (n) so that the efficiency doesn't decrease as p is increased, a parallel program is said to be scalable.
An MPI program is unsafe if its correct behavior depends on the fact that MPI_Send is buffering its input.

124
