
Parallel Programming and MPI

A course for IIT-M, September 2008
R Badrinath, STSD Bangalore
(ramamurthy.badrinath@hp.com)

© 2006 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice.

Context and Background

IIT-Madras has recently added a good deal of compute power.

Why?
- Further R&D in sciences, engineering
- Provide computing services to the region
- Create new opportunities in education and skills

Why this course?
- Update skills to program modern cluster computers

Length: 2 theory and 2 practice sessions, 4 hrs each

Audience check

Contents

1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Create_comm
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather

Instead of walking through this list call by call, we:
- Understand issues
- Understand concepts
- Go by motivating examples
- Try out some of the examples
- Learn enough to pick up the rest from the man pages

Outline

- Sequential vs Parallel programming
- Shared vs Distributed Memory
- Parallel work breakdown models
- Communication vs Computation
- MPI Examples
- MPI Concepts
- The role of IO

Sequential vs Parallel

- We are used to sequential programming: C, Java, C++, etc. E.g., Bubble Sort, Binary Search, Strassen Multiplication, FFT, BLAST, ...
- Main idea: specify the steps in perfect order.
- Reality: we are used to parallelism a lot more than we think, as a concept; not for programming.
- Methodology: launch a set of tasks; communicate to make progress. E.g., sort 500 answer papers by making 5 equal piles, have them sorted by 5 people, then merge them together.

Shared vs Distributed Memory Programming

- Shared memory: all tasks access the same memory, hence the same data (e.g., pthreads).
- Distributed memory: all memory is local; data sharing is by explicitly transporting data from one task to another (send-receive pairs in MPI, e.g.).

[Figure: each task has its own program and memory, connected by a communications channel]

- HW / programming-model relationship: tasks vs CPUs; SMPs vs clusters.

Designing Parallel Programs

Simple parallel program: sorting numbers in a large array A
- Notionally divide A into 5 pieces [0..99; 100..199; 200..299; 300..399; 400..499].
- Each part is sorted by an independent sequential algorithm and left within its region.
- The resultant parts are merged by simply reordering among adjacent parts.

What is different? Think about:

- How many people are doing the work (Degree of parallelism)
- What is needed to begin the work (Initialization)
- Who does what (Work distribution)
- Access to the work part (Data/IO access)
- Whether they need info from each other to finish their own job (Communication)
- When are they all done (Synchronization)
- What needs to be done to collate the result

Work Break-down

- Parallel algorithm: prefer simple, intuitive breakdowns.
- Usually, highly optimized sequential algorithms are not easily parallelizable.
- Breaking work often involves some pre- or post-processing (much like divide and conquer).
- Fine vs large grain parallelism and its relationship to communication.

Digression: Let's get a simple MPI program to work

#include <mpi.h>
#include <stdio.h>

int main()
{
    int total_size, my_rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("\n Total number of programs = %d, out of which rank of this process is %d\n",
           total_size, my_rank);
    MPI_Finalize();
    return 0;
}

Getting it to work

Compile it:
  mpicc -o simple simple.c
  # If you want HP-MPI, set your path to include /opt/hpmpi/bin

Run it (this depends a bit on the system):
  mpirun -np 2 simple
  qsub -l ncpus=2 -o simple.out /opt/hpmpi/bin/mpirun <your program location>/simple
  [Fun: qsub -l ncpus=2 -I hostname ]

Results are in the output file.

- What is mpirun?
- What does qsub have to do with MPI? ... More about qsub in a separate talk.

What goes on

- The same program is run at the same time on 2 different CPUs.
- Each instance is slightly different in that each returns different values for some simple calls like MPI_Comm_rank.
- This gives each instance its identity.
- We can make different instances run different pieces of code based on this identity difference.
- Typically it is an SPMD model of computation.
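For example, here is a minimal sketch (not from the slides) of branching on the rank so that the same SPMD program does different things on different tasks:

/* Minimal SPMD sketch: every task runs this same program, but branches
   on its rank to do different work. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("I am the coordinator among %d tasks\n", size);  /* only rank 0 runs this */
    else
        printf("I am worker %d of %d\n", rank, size);           /* all other ranks run this */

    MPI_Finalize();
    return 0;
}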

Continuing work breakdown

Simple example: find shortest path distances.

PROBLEM: Find the shortest path distances between all pairs of nodes.
- Let nodes be numbered 0, 1, ..., n-1
- Let us put all of this in a matrix: A[i][j] is the distance from i to j

[Figure: a small weighted graph and the corresponding distance matrix A]

Floyd's (sequential) algorithm:

for (k=0; k<n; k++)
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      a[i][j] = min( a[i][j], a[i][k] + a[k][j] );

Observation: for a fixed k, computing the i-th row needs the i-th row and the k-th row.

Parallelizing Floyd

- Actually we just need n^2 tasks, with each task iterating n times (once for each value of k).
- After each iteration we need to make sure everyone sees the matrix.
- Ideal for shared memory programming.
- What if we have fewer than n^2 tasks? Say p < n.
- Need to divide the work among the p tasks.
- We can simply divide up the rows.

Dividing the work

- Each task gets about n/p rows, with the last possibly getting a little more.

[Figure: row-block distribution; task Tq's block starts at row q x (n/p); the i-th and k-th rows are marked]

Remember the observation

/* id is the TASK NUMBER; each node has only the part of A that it owns.
   This is approximate code. Note that each node calls its own matrix by
   the same name a[][], but has only about n/p rows. */

for (k = 0; k < n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j = 0; j < n; j++)
            rowk[j] = a[k_here][j];
    }
    /* rowk is broadcast by the owner and received by the others.
       The MPI code will come here later. */
    for (i = 0; i < GET_MY_BLOCK_SIZE(id); i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
}

The MPI model: all nodes run the same code, p replica tasks!
Distributed memory model: sometimes they need to do different things.
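The block helpers used above (GET_BLOCK_OWNER, LOW_END_OF_MY_BLOCK, GET_MY_BLOCK_SIZE) are not defined on the slides. One consistent way to define them, a sketch only, assuming the globals n and p from the prologue, with tasks 0..p-2 each owning n/p rows and the last task taking the remainder:

/* Sketch: one possible definition of the row-block helpers. */
int LOW_END_OF_MY_BLOCK(int id)  { return id * (n / p); }

int GET_MY_BLOCK_SIZE(int id)    { return (id == p - 1) ? n - LOW_END_OF_MY_BLOCK(id)
                                                        : n / p; }

int GET_BLOCK_OWNER(int k)       { int owner = k / (n / p);          /* nominal owner */
                                   return (owner >= p) ? p - 1 : owner; /* tail rows go to the last task */ }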

The MPI model

- Recall: MPI tasks are typically created when the job is launched, not inside the MPI program (no forking).
  - mpirun usually creates the task set:
      mpirun -np 2 a.out <args to a.out>
  - a.out is run on all nodes and a communication channel is set up between them.
- Functions allow tasks to find out:
  - the size of the task group
  - one's own position within the group

MPI Notions [taking from the example]

- Communicator: a group of tasks in a program
- Rank: each task's ID in the group
    MPI_Comm_rank()  /* use this to set id */
- Size: of the group
    MPI_Comm_size()  /* use to set p */
- Notion of send/receive/broadcast
    MPI_Bcast()  /* use to broadcast rowk[] */

For actual syntax use a good MPI book or manual.
Online resource: http://www-unix.mcs.anl.gov/mpi/www/

MPI Prologue to our Floyd example

int a[MAX][MAX];
int n = 20;   /* real size of the matrix, can be read in */
int id, p;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
...
/* This is where all the real work happens */
...
MPI_Finalize();
/* Epilogue */

This is the time to try out several simple MPI programs using the few functions we have seen.
- use mpicc
- use mpirun

Visualizing the execution

Job is launched: multiple tasks on CPUs, maybe on the same node; the scheduler ensures 1 task per CPU.

- MPI_Init, MPI_Comm_rank, MPI_Comm_size, etc.
- Other initializations, like reading in the array
- For the initial values of k, the task with rank 0 broadcasts row k; the others receive it
- For each value of k they do their computation with the correct row k
- Loop the above for all values of k
- Task 0 receives all blocks of the final array and prints them out
- MPI_Finalize

Communication vs Computation

- Often communication is needed between iterations to complete the work.
- Often, the more the tasks, the more the communication can become.
  - In Floyd, a bigger p means that rowk will be sent to a larger number of tasks.
  - If each iteration depends on more data, it can get very busy.
- This may mean network contention, i.e., delays.
  - Try to count the number of 'a's in a string. Time vs p.
- This is why, for a fixed problem size, increasing the number of CPUs does not continually increase performance.
- This needs experimentation; it is problem specific.
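One way to run such an experiment, as a sketch (not from the slides): bracket a phase with MPI_Wtime and barriers, then repeat the run with different task counts. The loop body here is just a stand-in for real per-task work.

/* Sketch: timing a phase with MPI_Wtime. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double t0, t1, local = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);          /* start everyone together */
    t0 = MPI_Wtime();

    /* Stand-in for each task's share of the work. */
    for (long i = 0; i < 10 * 1000 * 1000; i++)
        local += 1.0 / (i + 1 + rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* wait for the slowest task */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("Elapsed time with %d tasks: %f s (checksum %f)\n",
               size, t1 - t0, local);

    MPI_Finalize();
    return 0;
}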

Communication primitives

MPI_Send(sendbuffer, senddatalength, datatype,
         destination, tag, communicator);

MPI_Send("Hello", strlen("Hello"), MPI_CHAR,
         2, 100, MPI_COMM_WORLD);

MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR,
         source, tag, MPI_COMM_WORLD, &status);

Send and Recv happen in pairs.
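A minimal runnable sketch (not from the slides), assuming the job is launched with at least 2 tasks: rank 0 sends a string and rank 1 receives it.

/* Minimal send/recv pair. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char buf[64];
    int rank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "Hello");
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD); /* send to rank 1, tag 100 */
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status); /* matching tag and source */
        printf("Rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}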

Collectives

- Broadcast is one-to-all communication.
- Both receivers and the sender call the same function.
- All MUST call it. All end up with the SAME result.

MPI_Bcast(buffer, count, type, root, comm);

Examples:
  MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
    Task 0 sends its integer k and all others receive it.
  MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
    current_owner_task sends rowk to all others.

Try out a simple MPI program with send-recvs and broadcasts.
Try out Floyd's algorithm.
What if you have to read a file to initialize Floyd's algorithm?

A bit more on Broadcast

[Figure: tasks with ranks 0, 1, 2 each hold a local x (0, 1, 2); every task calls
MPI_Bcast(&x, 1, .., 0, ..); afterwards every task's x holds rank 0's value, 0.]
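The same behaviour in code, as a minimal sketch (not from the slides):

/* Every task calls the same MPI_Bcast; afterwards all copies of x hold rank 0's value. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    x = rank;                                      /* x differs per task before the call */
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* root 0 sends, everyone else receives */
    printf("Rank %d now has x = %d\n", rank, x);   /* prints 0 on every task */

    MPI_Finalize();
    return 0;
}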

Other useful collectives

MPI_Reduce(&values, &results, count, type, operator, root, comm);

MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 9, MPI_COMM_WORLD);

- Task number 9 gets in the variable res the sum of whatever was in x in all of the tasks (including itself).
- Must be called by ALL tasks.
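A minimal sketch (not from the slides), with root 0 instead of 9: every task contributes its rank and rank 0 receives the sum.

/* Reduce the ranks with MPI_SUM onto rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All tasks must make this call; only the root (0 here) gets the result. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of all ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}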

Scattering as opposed to broadcasting

MPI_Scatterv(sndbuf, sndcount[], send_disp[], sendtype,
             recvbuf, recvcount, recvtype,
             root, comm);

- All nodes MUST call it.

[Figure: the root's buffer is split into pieces delivered to Rank0, Rank1, Rank2, Rank3]
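For the simpler fixed-size case, a minimal sketch (not from the slides) of MPI_Scatter where rank 0 hands one int to every task:

/* Rank 0 scatters one int to each task, including itself. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, mine;
    int *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* only the root fills the send buffer */
        all = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++)
            all[i] = 10 * i;
    }

    /* Every task calls MPI_Scatter; each receives its own element. */
    MPI_Scatter(all, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d got %d\n", rank, mine);

    free(all);
    MPI_Finalize();
    return 0;
}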

Common Communication pitfalls!!

- Make sure that communication primitives are called by the right number of tasks.
- Make sure they are called in the right sequence.
- Make sure that you use the proper tags.
- If not, you can easily get into deadlock ("My program seems to be hung").

More on work breakdown

- Finding the right work breakdown can be challenging.
- Sometimes dynamic work breakdown is good.
- A master (usually task 0) decides who will do what and collects the results.
- E.g., you have a huge number of 5x5 matrices to multiply (chained matrix multiplication).
- E.g., search for a substring in a huge collection of strings.

Master-slave dynamic work assignment

[Figure: a master task distributing work items to a pool of slave tasks and collecting results]

Master-slave example: reverse strings (slave)

Slave(){
  do{
    MPI_Recv(&work,MAX,MPI_CHAR,0,0,MPI_COMM_WORLD,&stat);  /* receive from the master */
    n=strlen(work);
    if(n==0) break;  /* detecting the end */
    reverse(work);
    MPI_Send(&work,n+1,MPI_CHAR,0,0,MPI_COMM_WORLD);
  } while (1);
  MPI_Finalize();
}

Master-slave example: reverse strings (master)

Master(){  /* rank 0 task */
  initialize_work_items();
  for(i=1;i<np;i++){  /* initial work distribution */
    work=next_work_item();
    n = strlen(work)+1;
    MPI_Send(&work,n,MPI_CHAR,i,0,MPI_COMM_WORLD);
  }
  unfinished_work=np-1;  /* number of slaves still working */
  while (unfinished_work!=0) {
    MPI_Recv(&res,MAX,MPI_CHAR,MPI_ANY_SOURCE,0,
             MPI_COMM_WORLD,&status);
    process(res);
    work=next_work_item();
    if(work==NULL) {
      /* no more work: a zero-length string tells this slave to stop */
      MPI_Send("",1,MPI_CHAR,status.MPI_SOURCE,0,MPI_COMM_WORLD);
      unfinished_work--;
    }
    else {
      n=strlen(work)+1;
      MPI_Send(&work,n,MPI_CHAR,status.MPI_SOURCE,
               0,MPI_COMM_WORLD);
    }
  }
}

Master-slave example: main program

main(){
  ...
  MPI_Comm_rank(MPI_COMM_WORLD,&id);
  MPI_Comm_size(MPI_COMM_WORLD,&np);
  if (id == 0)
    Master();
  else
    Slave();
  ...
}

Matrix Multiply and Communication Patterns

Block Distribution of Matrices

Matrix multiply: Cij = Σk (Aik * Bkj)

BMR algorithm:
- Each task owns a block: its own part of A, B and C.
- The old formula holds for blocks!
- Example:
    C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
- Each term is a smaller block, a submatrix.

Block Distribution of Matrices

Matrix multiply: Cij = Σk (Aik * Bkj)

BMR algorithm, one step of C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31:
- A22 is row-broadcast.
- A22*B21 is added into C21.
- The column of B blocks is rolled up one slot; our task now has B31.
- Now repeat the above, except the item to broadcast is A23.

Each term is a smaller block, a submatrix.

Attempt doing this with just Send-Recv and Broadcast.

Communicators and Topologies

- The BMR example shows the limitations of broadcast, although there is a pattern.
- Communicators can be created on subgroups of processes.
- Communicators can be created that have a topology.
  - Will make programming natural.
  - Might improve performance by matching to hardware.

for (k = 0; k < s; k++) {
    sender = (my_row + k) % s;
    if (sender == my_col) {
        MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
        T = my_A;
    } else {
        MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);
    }
    my_C = my_C + T x my_B;   /* block multiply-accumulate (pseudocode) */
    MPI_Sendrecv_replace(my_B, m*m, MPI_INT,
                         dest, 0, source, 0, col_comm, &status);
}

Creating topologies and communicators

Creating a grid:

  int dim_sizes[2], istorus[2], canreorder;
  MPI_Comm grid_comm;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus,
                  canreorder, &grid_comm);

Dividing a grid into rows, each with its own communicator:

  MPI_Comm row_comm;
  int free[2];
  MPI_Cart_sub(grid_comm, free, &row_comm);
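A minimal runnable sketch (not from the slides), assuming the job is launched with exactly 4 tasks: build a 2x2 torus and split it into row communicators.

/* 2x2 torus of tasks, split into row communicators. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int dim_sizes[2]   = {2, 2};   /* 2 x 2 grid */
    int istorus[2]     = {1, 1};   /* wrap around in both dimensions */
    int canreorder     = 1;        /* let MPI reorder ranks to match hardware */
    int free_coords[2] = {0, 1};   /* keep the column dimension: sub-grids are rows */
    int rank, coords[2], row_rank;
    MPI_Comm grid_comm, row_comm;

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
    MPI_Comm_rank(grid_comm, &rank);
    MPI_Cart_coords(grid_comm, rank, 2, coords);

    MPI_Cart_sub(grid_comm, free_coords, &row_comm);
    MPI_Comm_rank(row_comm, &row_rank);
    printf("Grid rank %d is at (%d,%d), rank %d within its row\n",
           rank, coords[0], coords[1], row_rank);

    MPI_Finalize();
    return 0;
}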

Try implementing the BMR algorithm with communicators.

A brief on other MPI topics: the last leg

- MPI + multi-threaded / OpenMP
- One-sided communication
- MPI and IO

MPI and OpenMP

- Grain and communication considerations
- Where does the interesting "#pragma omp for" fit in our MPI Floyd?
- How do I assign exactly one MPI task per CPU?
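As one illustration (a sketch, not from the slides): in the Floyd code the per-task i/j update loop is a natural place for a "#pragma omp parallel for". The hybrid hello below just shows how OpenMP threads nest inside MPI tasks; compile with something like "mpicc -fopenmp".

/* Hybrid MPI + OpenMP sketch: each MPI task spawns a team of threads. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, provided;

    /* FUNNELED: threads exist, but only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* The MPI rank is the coarse grain; the OpenMP thread id is the fine grain. */
        printf("MPI task %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}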

One-Sided Communication

- Has no corresponding send-recv pairs!
- RDMA
  - Get
  - Put
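A minimal sketch (not from the slides), assuming an MPI-2 implementation and at least 2 tasks: rank 0 puts a value directly into rank 1's memory window, with no matching receive on rank 1.

/* One-sided Put into a remote window. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, local = -1, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every task exposes one int as a window of remotely accessible memory. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);            /* completes the one-sided transfer */

    if (rank == 1)
        printf("Rank 1's window now holds %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}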

IO in Parallel Programs

- Typically a root task does the IO.
  - Simpler to program.
  - Natural because some post-processing is occasionally needed (sorting).
  - All nodes generating IO requests might overwhelm the fileserver, essentially sequentializing it.
- Performance: not the limitation for Lustre/SFS.
- Parallel IO interfaces such as MPI-IO can make use of parallel filesystems such as Lustre.

MPI-BLAST: exec time vs other time [4]

[Figure: chart omitted]

How IO/Comm optimizations help MPI-BLAST [4]

[Figure: chart omitted]

What did we learn?

- Distributed memory programming model
- Parallel algorithm basics
- Work breakdown
- Topologies in communication
- Communication overhead vs computation
- Impact of parallel IO

What MPI calls did we see here?

1. MPI_Init
2. MPI_Finalize
3. MPI_Comm_size
4. MPI_Comm_rank
5. MPI_Send
6. MPI_Recv
7. MPI_Sendrecv_replace
8. MPI_Bcast
9. MPI_Reduce
10. MPI_Cart_create
11. MPI_Cart_sub
12. MPI_Scatter

References

1. Parallel Programming in C with MPI and OpenMP, M. J. Quinn, TMH. This is an excellent practical book; it motivated much of the material here, specifically Floyd's algorithm.
2. The BMR algorithm for matrix multiply and the topology ideas are motivated by http://www.cs.indiana.edu/classes/b673/notes/matrix_mult.html
3. MPI online manual: http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access for Parallel BLAST, IPDPS 2005
