
Parallel Programming and MPI

A course for IIT-M, September 2008
R Badrinath, STSD Bangalore
(ramamurthy.badrinath@hp.com)

© 2006 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice.

Context and Background

IIT-Madras has recently added a good deal of compute power.

Why?
- Further R&D in sciences, engineering
- Provide computing services to the region
- Create new opportunities in education and skills

Why this course?
- Update skills to program modern cluster computers

Length: 2 theory and 2 practice sessions, 4 hrs each

Audience check

Contents

1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Create_comm
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather

Instead of walking through this list call by call, we:
- Understand issues
- Understand concepts
- Go by motivating examples
- Try out some of the examples
- Learn enough to pick up the rest from the man pages

Outline

- Sequential vs Parallel programming
- Shared vs Distributed Memory
- Parallel work breakdown models
- Communication vs Computation
- MPI Examples
- MPI Concepts
- The role of IO

Sequential vs Parallel

- We are used to sequential programming: C, Java, C++, etc. E.g., Bubble Sort, Binary Search, Strassen Multiplication, FFT, BLAST, ...
- Main idea: specify the steps in perfect order.
- Reality: we are used to parallelism a lot more than we think, as a concept; not for programming.
- Methodology: launch a set of tasks; communicate to make progress. E.g., sort 500 answer papers by making 5 equal piles, have them sorted by 5 people, then merge them together.

Shared vs Distributed Memory Programming

- Shared memory: all tasks access the same memory, hence the same data (e.g., pthreads).
- Distributed memory: all memory is local; data sharing is by explicitly transporting data from one task to another (send-receive pairs in MPI, e.g.).

[Figure: each task has its own program and memory, connected by a communications channel]

- HW / programming-model relationship: tasks vs CPUs; SMPs vs clusters.

Designing Parallel Programs

Simple parallel program: sorting numbers in a large array A
- Notionally divide A into 5 pieces [0..99; 100..199; 200..299; 300..399; 400..499].
- Each part is sorted by an independent sequential algorithm and left within its region.
- The resultant parts are merged by simply reordering among adjacent parts.

What is different? Think about:

- How many people are doing the work (Degree of parallelism)
- What is needed to begin the work (Initialization)
- Who does what (Work distribution)
- Access to the work part (Data/IO access)
- Whether they need info from each other to finish their own job (Communication)
- When are they all done (Synchronization)
- What needs to be done to collate the result

Work Break-down

- Parallel algorithm: prefer simple, intuitive breakdowns.
- Usually, highly optimized sequential algorithms are not easily parallelizable.
- Breaking work often involves some pre- or post-processing (much like divide and conquer).
- Fine vs large grain parallelism and its relationship to communication.

Digression: Let's get a simple MPI program to work

#include <mpi.h>
#include <stdio.h>

int main()
{
    int total_size, my_rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("\n Total number of programs = %d, out of which rank of this process is %d\n",
           total_size, my_rank);
    MPI_Finalize();
    return 0;
}

Getting it to work

Compile it:
  mpicc -o simple simple.c
  # If you want HP-MPI, set your path to include /opt/hpmpi/bin

Run it (this depends a bit on the system):
  mpirun -np 2 simple
  qsub -l ncpus=2 -o simple.out /opt/hpmpi/bin/mpirun <your program location>/simple
  [Fun: qsub -l ncpus=2 -I hostname ]

Results are in the output file.

- What is mpirun?
- What does qsub have to do with MPI? ... More about qsub in a separate talk.

What goes on

- The same program is run at the same time on 2 different CPUs.
- Each instance is slightly different in that each returns different values for some simple calls like MPI_Comm_rank.
- This gives each instance its identity.
- We can make different instances run different pieces of code based on this identity difference.
- Typically it is an SPMD model of computation.
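For example, here is a minimal sketch (not from the slides) of branching on the rank so that the same SPMD program does different things on different tasks:

/* Minimal SPMD sketch: every task runs this same program, but branches
   on its rank to do different work. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("I am the coordinator among %d tasks\n", size);  /* only rank 0 runs this */
    else
        printf("I am worker %d of %d\n", rank, size);           /* all other ranks run this */

    MPI_Finalize();
    return 0;
}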

Continuing work breakdown

Simple example: find shortest path distances.

PROBLEM: Find the shortest path distances between all pairs of nodes.
- Let nodes be numbered 0, 1, ..., n-1
- Let us put all of this in a matrix: A[i][j] is the distance from i to j

[Figure: a small weighted graph and the corresponding distance matrix A]

Floyd's (sequential) algorithm:

for (k=0; k<n; k++)
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      a[i][j] = min( a[i][j], a[i][k] + a[k][j] );

Observation: for a fixed k, computing the i-th row needs the i-th row and the k-th row.

Parallelizing Floyd

- Actually we just need n^2 tasks, with each task iterating n times (once for each value of k).
- After each iteration we need to make sure everyone sees the matrix.
- Ideal for shared memory programming.
- What if we have fewer than n^2 tasks? Say p < n.
- Need to divide the work among the p tasks.
- We can simply divide up the rows.

Dividing the work

- Each task gets about n/p rows, with the last possibly getting a little more.

[Figure: row-block distribution; task Tq's block starts at row q x (n/p); the i-th and k-th rows are marked]

Remember the observation

/* id is the TASK NUMBER; each node has only the part of A that it owns.
   This is approximate code. Note that each node calls its own matrix by
   the same name a[][], but has only about n/p rows. */

for (k = 0; k < n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j = 0; j < n; j++)
            rowk[j] = a[k_here][j];
    }
    /* rowk is broadcast by the owner and received by the others.
       The MPI code will come here later. */
    for (i = 0; i < GET_MY_BLOCK_SIZE(id); i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
}

The MPI model: all nodes run the same code, p replica tasks!
Distributed memory model: sometimes they need to do different things.
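The block helpers used above (GET_BLOCK_OWNER, LOW_END_OF_MY_BLOCK, GET_MY_BLOCK_SIZE) are not defined on the slides. One consistent way to define them, a sketch only, assuming the globals n and p from the prologue, with tasks 0..p-2 each owning n/p rows and the last task taking the remainder:

/* Sketch: one possible definition of the row-block helpers. */
int LOW_END_OF_MY_BLOCK(int id)  { return id * (n / p); }

int GET_MY_BLOCK_SIZE(int id)    { return (id == p - 1) ? n - LOW_END_OF_MY_BLOCK(id)
                                                        : n / p; }

int GET_BLOCK_OWNER(int k)       { int owner = k / (n / p);          /* nominal owner */
                                   return (owner >= p) ? p - 1 : owner; /* tail rows go to the last task */ }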

The MPI model

- Recall: MPI tasks are typically created when the job is launched, not inside the MPI program (no forking).
  - mpirun usually creates the task set:
      mpirun -np 2 a.out <args to a.out>
  - a.out is run on all nodes and a communication channel is set up between them.
- Functions allow tasks to find out:
  - the size of the task group
  - one's own position within the group

MPI Notions [taking from the example]

- Communicator: a group of tasks in a program
- Rank: each task's ID in the group
    MPI_Comm_rank()  /* use this to set id */
- Size: of the group
    MPI_Comm_size()  /* use to set p */
- Notion of send/receive/broadcast
    MPI_Bcast()  /* use to broadcast rowk[] */

For actual syntax use a good MPI book or manual.
Online resource: http://www-unix.mcs.anl.gov/mpi/www/

MPI Prologue to our Floyd example

int a[MAX][MAX];
int n = 20;   /* real size of the matrix, can be read in */
int id, p;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
...
/* This is where all the real work happens */
...
MPI_Finalize();
/* Epilogue */

This is the time to try out several simple MPI programs using the few functions we have seen.
- use mpicc
- use mpirun

Visualizing the execution

Job is launched: multiple tasks on CPUs, maybe on the same node; the scheduler ensures 1 task per CPU.

- MPI_Init, MPI_Comm_rank, MPI_Comm_size, etc.
- Other initializations, like reading in the array
- For the initial values of k, the task with rank 0 broadcasts row k; the others receive it
- For each value of k they do their computation with the correct row k
- Loop the above for all values of k
- Task 0 receives all blocks of the final array and prints them out
- MPI_Finalize

Communication vs Computation

- Often communication is needed between iterations to complete the work.
- Often, the more the tasks, the more the communication can become.
  - In Floyd, a bigger p means that rowk will be sent to a larger number of tasks.
  - If each iteration depends on more data, it can get very busy.
- This may mean network contention, i.e., delays.
  - Try to count the number of 'a's in a string. Time vs p.
- This is why, for a fixed problem size, increasing the number of CPUs does not continually increase performance.
- This needs experimentation; it is problem specific.
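One way to run such an experiment, as a sketch (not from the slides): bracket a phase with MPI_Wtime and barriers, then repeat the run with different task counts. The loop body here is just a stand-in for real per-task work.

/* Sketch: timing a phase with MPI_Wtime. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double t0, t1, local = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);          /* start everyone together */
    t0 = MPI_Wtime();

    /* Stand-in for each task's share of the work. */
    for (long i = 0; i < 10 * 1000 * 1000; i++)
        local += 1.0 / (i + 1 + rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* wait for the slowest task */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("Elapsed time with %d tasks: %f s (checksum %f)\n",
               size, t1 - t0, local);

    MPI_Finalize();
    return 0;
}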

Communication primitives

MPI_Send(sendbuffer, senddatalength, datatype,
         destination, tag, communicator);

MPI_Send("Hello", strlen("Hello"), MPI_CHAR,
         2, 100, MPI_COMM_WORLD);

MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR,
         source, tag, MPI_COMM_WORLD, &status);

Send and Recv happen in pairs.
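A minimal runnable sketch (not from the slides), assuming the job is launched with at least 2 tasks: rank 0 sends a string and rank 1 receives it.

/* Minimal send/recv pair. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char buf[64];
    int rank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "Hello");
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD); /* send to rank 1, tag 100 */
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status); /* matching tag and source */
        printf("Rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}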

Collectives

- Broadcast is one-to-all communication.
- Both receivers and the sender call the same function.
- All MUST call it. All end up with the SAME result.

MPI_Bcast(buffer, count, type, root, comm);

Examples:
  MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
    Task 0 sends its integer k and all others receive it.
  MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
    current_owner_task sends rowk to all others.

Try out a simple MPI program with send-recvs and broadcasts.
Try out Floyd's algorithm.
What if you have to read a file to initialize Floyd's algorithm?

A bit more on Broadcast

[Figure: tasks with ranks 0, 1, 2 each hold a local x (0, 1, 2); every task calls
MPI_Bcast(&x, 1, .., 0, ..); afterwards every task's x holds rank 0's value, 0.]
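The same behaviour in code, as a minimal sketch (not from the slides):

/* Every task calls the same MPI_Bcast; afterwards all copies of x hold rank 0's value. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    x = rank;                                      /* x differs per task before the call */
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* root 0 sends, everyone else receives */
    printf("Rank %d now has x = %d\n", rank, x);   /* prints 0 on every task */

    MPI_Finalize();
    return 0;
}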

Other useful collectives

MPI_Reduce(&values, &results, count, type, operator, root, comm);

MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 9, MPI_COMM_WORLD);

- Task number 9 gets in the variable res the sum of whatever was in x in all of the tasks (including itself).
- Must be called by ALL tasks.
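A minimal sketch (not from the slides), with root 0 instead of 9: every task contributes its rank and rank 0 receives the sum.

/* Reduce the ranks with MPI_SUM onto rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All tasks must make this call; only the root (0 here) gets the result. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of all ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}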

Scattering as opposed to broadcasting

MPI_Scatterv(sndbuf, sndcount[], send_disp[], sendtype,
             recvbuf, recvcount, recvtype,
             root, comm);

- All nodes MUST call it.

[Figure: the root's buffer is split into pieces delivered to Rank0, Rank1, Rank2, Rank3]
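For the simpler fixed-size case, a minimal sketch (not from the slides) of MPI_Scatter where rank 0 hands one int to every task:

/* Rank 0 scatters one int to each task, including itself. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, mine;
    int *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* only the root fills the send buffer */
        all = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++)
            all[i] = 10 * i;
    }

    /* Every task calls MPI_Scatter; each receives its own element. */
    MPI_Scatter(all, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d got %d\n", rank, mine);

    free(all);
    MPI_Finalize();
    return 0;
}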

Common Communication pitfalls!!

- Make sure that communication primitives are called by the right number of tasks.
- Make sure they are called in the right sequence.
- Make sure that you use the proper tags.
- If not, you can easily get into deadlock ("My program seems to be hung").

More on work breakdown

- Finding the right work breakdown can be challenging.
- Sometimes dynamic work breakdown is good.
- A master (usually task 0) decides who will do what and collects the results.
- E.g., you have a huge number of 5x5 matrices to multiply (chained matrix multiplication).
- E.g., search for a substring in a huge collection of strings.

Master-slave dynamic work assignment

[Figure: a master task distributing work items to a pool of slave tasks and collecting results]

Master-slave example: reverse strings (slave)

Slave(){
  do{
    MPI_Recv(&work,MAX,MPI_CHAR,0,0,MPI_COMM_WORLD,&stat);  /* receive from the master */
    n=strlen(work);
    if(n==0) break;  /* detecting the end */
    reverse(work);
    MPI_Send(&work,n+1,MPI_CHAR,0,0,MPI_COMM_WORLD);
  } while (1);
  MPI_Finalize();
}

Master-slave example: reverse strings (master)

Master(){  /* rank 0 task */
  initialize_work_items();
  for(i=1;i<np;i++){  /* initial work distribution */
    work=next_work_item();
    n = strlen(work)+1;
    MPI_Send(&work,n,MPI_CHAR,i,0,MPI_COMM_WORLD);
  }
  unfinished_work=np-1;  /* number of slaves still working */
  while (unfinished_work!=0) {
    MPI_Recv(&res,MAX,MPI_CHAR,MPI_ANY_SOURCE,0,
             MPI_COMM_WORLD,&status);
    process(res);
    work=next_work_item();
    if(work==NULL) {
      /* no more work: a zero-length string tells this slave to stop */
      MPI_Send("",1,MPI_CHAR,status.MPI_SOURCE,0,MPI_COMM_WORLD);
      unfinished_work--;
    }
    else {
      n=strlen(work)+1;
      MPI_Send(&work,n,MPI_CHAR,status.MPI_SOURCE,
               0,MPI_COMM_WORLD);
    }
  }
}

Master-slave example: main program

main(){
  ...
  MPI_Comm_rank(MPI_COMM_WORLD,&id);
  MPI_Comm_size(MPI_COMM_WORLD,&np);
  if (id == 0)
    Master();
  else
    Slave();
  ...
}

Matrix Multiply and Communication Patterns

Block Distribution of Matrices

Matrix multiply: Cij = Σk (Aik * Bkj)

BMR algorithm:
- Each task owns a block: its own part of A, B and C.
- The old formula holds for blocks!
- Example:
    C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
- Each term is a smaller block, a submatrix.

Block Distribution of Matrices

Matrix multiply: Cij = Σk (Aik * Bkj)

BMR algorithm, one step of C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31:
- A22 is row-broadcast.
- A22*B21 is added into C21.
- The column of B blocks is rolled up one slot; our task now has B31.
- Now repeat the above, except the item to broadcast is A23.

Each term is a smaller block, a submatrix.

Attempt doing this with just Send-Recv and Broadcast.

Communicators and Topologies

- The BMR example shows the limitations of broadcast, although there is a pattern.
- Communicators can be created on subgroups of processes.
- Communicators can be created that have a topology.
  - Will make programming natural.
  - Might improve performance by matching to hardware.

for (k = 0; k < s; k++) {
    sender = (my_row + k) % s;
    if (sender == my_col) {
        MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
        T = my_A;
    } else {
        MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);
    }
    my_C = my_C + T x my_B;   /* block multiply-accumulate (pseudocode) */
    MPI_Sendrecv_replace(my_B, m*m, MPI_INT,
                         dest, 0, source, 0, col_comm, &status);
}

Creating topologies and communicators

Creating a grid:

  int dim_sizes[2], istorus[2], canreorder;
  MPI_Comm grid_comm;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus,
                  canreorder, &grid_comm);

Dividing a grid into rows, each with its own communicator:

  MPI_Comm row_comm;
  int free[2];
  MPI_Cart_sub(grid_comm, free, &row_comm);
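A minimal runnable sketch (not from the slides), assuming the job is launched with exactly 4 tasks: build a 2x2 torus and split it into row communicators.

/* 2x2 torus of tasks, split into row communicators. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int dim_sizes[2]   = {2, 2};   /* 2 x 2 grid */
    int istorus[2]     = {1, 1};   /* wrap around in both dimensions */
    int canreorder     = 1;        /* let MPI reorder ranks to match hardware */
    int free_coords[2] = {0, 1};   /* keep the column dimension: sub-grids are rows */
    int rank, coords[2], row_rank;
    MPI_Comm grid_comm, row_comm;

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
    MPI_Comm_rank(grid_comm, &rank);
    MPI_Cart_coords(grid_comm, rank, 2, coords);

    MPI_Cart_sub(grid_comm, free_coords, &row_comm);
    MPI_Comm_rank(row_comm, &row_rank);
    printf("Grid rank %d is at (%d,%d), rank %d within its row\n",
           rank, coords[0], coords[1], row_rank);

    MPI_Finalize();
    return 0;
}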

Try implementing the BMR algorithm with communicators.

A brief on other MPI topics: the last leg

- MPI + multi-threaded / OpenMP
- One-sided communication
- MPI and IO

MPI and OpenMP

- Grain and communication considerations
- Where does the interesting "#pragma omp for" fit in our MPI Floyd?
- How do I assign exactly one MPI task per CPU?
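As one illustration (a sketch, not from the slides): in the Floyd code the per-task i/j update loop is a natural place for a "#pragma omp parallel for". The hybrid hello below just shows how OpenMP threads nest inside MPI tasks; compile with something like "mpicc -fopenmp".

/* Hybrid MPI + OpenMP sketch: each MPI task spawns a team of threads. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, provided;

    /* FUNNELED: threads exist, but only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* The MPI rank is the coarse grain; the OpenMP thread id is the fine grain. */
        printf("MPI task %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}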

One-Sided Communication

- Has no corresponding send-recv pairs!
- RDMA
  - Get
  - Put
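A minimal sketch (not from the slides), assuming an MPI-2 implementation and at least 2 tasks: rank 0 puts a value directly into rank 1's memory window, with no matching receive on rank 1.

/* One-sided Put into a remote window. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, local = -1, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every task exposes one int as a window of remotely accessible memory. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);            /* completes the one-sided transfer */

    if (rank == 1)
        printf("Rank 1's window now holds %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}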

IO in Parallel Programs

- Typically a root task does the IO.
  - Simpler to program.
  - Natural because some post-processing is occasionally needed (sorting).
  - All nodes generating IO requests might overwhelm the fileserver, essentially sequentializing it.
- Performance: not the limitation for Lustre/SFS.
- Parallel IO interfaces such as MPI-IO can make use of parallel filesystems such as Lustre.

MPI-BLAST: exec time vs other time [4]

[Figure: chart omitted]

How IO/Comm optimizations help MPI-BLAST [4]

[Figure: chart omitted]

What did we learn?

- Distributed memory programming model
- Parallel algorithm basics
- Work breakdown
- Topologies in communication
- Communication overhead vs computation
- Impact of parallel IO

What MPI calls did we see here?

1. MPI_Init
2. MPI_Finalize
3. MPI_Comm_size
4. MPI_Comm_rank
5. MPI_Send
6. MPI_Recv
7. MPI_Sendrecv_replace
8. MPI_Bcast
9. MPI_Reduce
10. MPI_Cart_create
11. MPI_Cart_sub
12. MPI_Scatter

References

1. Parallel Programming in C with MPI and OpenMP, M. J. Quinn, TMH. This is an excellent practical book; it motivated much of the material here, specifically Floyd's algorithm.
2. The BMR algorithm for matrix multiply and the topology ideas are motivated by http://www.cs.indiana.edu/classes/b673/notes/matrix_mult.html
3. MPI online manual: http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access for Parallel BLAST, IPDPS 2005
