
Beginner MPI Tutorial

Welcome to the MPI tutorial for beginners! In this tutorial, you will learn all of the basic concepts of MPI by going through various examples. The different parts of the tutorial are meant to build on top of one another. If you feel lost during any lesson, feel free to leave a comment on the post explaining your dilemma. Either I or another MPI expert will likely be able to get back to you soon. This beginning tutorial assumes that the reader has a general knowledge of how parallel programming works, has experience working on a Linux system, and can understand the C programming language.

Introduction

MPI Introduction
Installing MPICH2
Running an MPI Hello World Application

Blocking Point-to-Point Communication


MPI Send and Receive
Dynamic Receiving with MPI_Probe (and MPI_Status)
Point-to-Point Communication Application Example: Random Walk

Collective Communication

MPI Broadcast and Collective Communication
MPI Scatter, Gather, and Allgather

MPI Introduction
The Message Passing Interface (MPI) first appeared as a standard in 1994 for performing distributed-memory parallel computing. Since then, it has become the dominant model for high-performance computing, and it is used widely in research, academia, and industry. The functionality of MPI is extremely rich, offering the programmer the ability to perform point-to-point communication, collective communication, one-sided communication, parallel I/O, and even dynamic process management. These terms probably sound quite strange to a beginner, but by the end of all of the tutorials, the terminology will be commonplace.

Before starting the tutorials, familiarize yourself with the basic concepts below. These are all related to MPI, and many of these concepts are referred to throughout the tutorials.

The Message Passing Model

The message passing model is a model of parallel programming in which processes can only share data by exchanging messages. MPI adheres to this model. If one process wishes to transfer data to another, it must initiate a message and explicitly send data to that process. The other process must also explicitly receive the message (except in the case of one-sided communication, but we will get to that later).

Forcing communication to happen in this way offers several advantages for parallel programs. For example, the message passing model is portable across a wide range of architectures. An MPI program can run across computers that are spread across the globe and connected by the internet, or it can execute on tightly-coupled clusters. An MPI program can even run on the cores of a shared-memory processor and pass messages through the shared memory. All of these details are abstracted by the interface. Debugging these programs is often easier too, since one does not need to worry about one process overwriting the address space of another.

MPI's Design for the Message Passing Model

MPI has a couple of classic concepts that encourage clear parallel program design using the message passing model. The first is the notion of a communicator. A communicator defines a group of processes that have the ability to communicate with one another. In this group of processes, each is assigned a unique rank, and they explicitly communicate with one another by their ranks.

The foundation of communication is built upon the simple send and receive operations. A process may send a message to another process by providing the rank of the process and a tag to identify the message. The receiver can then post a receive for a message with a given tag (or it may not even care about the tag), and then handle the data accordingly. Communications such as this, which involve one sender and one receiver, are known as point-to-point communications.
There are many cases where processes may need to communicate with everyone else, for example when a master process needs to broadcast information to all of its worker processes. In this case, it would be cumbersome to write code that does all of the sends and receives. In fact, it would often not use the network in an optimal manner. MPI can handle a wide variety of these types of collective communications that involve all processes.

Mixtures of point-to-point and collective communications can be used to create highly complex parallel programs. In fact, this functionality is so powerful that it is not even necessary to start describing the advanced mechanisms of MPI. We will save that until a later lesson. For now, you should work on installing MPI on your machine. If you already have MPI installed, great! You can head over to the MPI Hello World lesson.

Installing MPICH2
MPI is simply a standard which others follow in their implementation. Because of this, there are a wide variety of MPI implementations out there. One of the most popular implementations, MPICH2, will be used for all of the examples provided through this site. Users are free to use any implementation they wish, but only instructions for installing MPICH2 will be provided. Furthermore, the scripts and code provided for the lessons are only guaranteed to execute and run with the latest version of MPICH2.

MPICH2 is a widely-used implementation of MPI that is developed primarily by Argonne National Laboratory in the United States. The main reason for choosing MPICH2 over other implementations is simply my familiarity with the interface and my close relationship with Argonne National Laboratory. I also encourage others to check out Open MPI, which is also a widely-used implementation.

Installing MPICH2

The latest version of MPICH2 is available here. The version that I will be using for all of the examples on the site is 1.4, which was released June 16, 2011. Go ahead and download the source code, uncompress the folder, and change into the MPICH2 directory.

Once you have done this, you should be able to configure your installation by running ./configure. I added a couple of parameters to my configuration to avoid building the MPI Fortran library. If you need to install MPICH2 to a local directory (for example, if you don't have root access to your machine), type ./configure --prefix=/installation/directory/path. For more information about possible configuration parameters, type ./configure --help.

When configuration is done, it should say "Configuration completed." Once this is through, it is time to build and install MPICH2 with make; sudo make install.

If your build was successful, you should be able to type mpich2version and see something similar to this.

Hopefully your build finished successfully. If not, you may have issues with missing dependencies. For any issue, I highly recommend copying and pasting the error message directly into Google.

Running an MPI Program

Now that you have installed MPICH2, whether it's on your local machine or cluster, it is time to run a simple application. The MPI Hello World lesson goes over the basics of an MPI program, along with a guide on how to run MPICH2 for the first time.

MPI Hello World


In this lesson, I will show you a basic MPI Hello World application and also discuss how to run an MPI program. The lesson will cover the basics of initializing MPI and running an MPI job across several processes. This lesson is intended to work with installations of MPICH2 (specifically 1.4). If you have not installed MPICH2, please refer back to the installing MPICH2 lesson.

MPI Hello World

First of all, the source code for this lesson can be downloaded here. Download it, extract it, and change to the example directory. The directory should contain three files: makefile, mpi_hello_world.c, and run.perl.

Open the mpi_hello_world.c source code. Below are some excerpts from the code.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // Print off a hello world message
  printf("Hello world from processor %s, rank %d"
         " out of %d processors\n",
         processor_name, world_rank, world_size);

  // Finalize the MPI environment.
  MPI_Finalize();
}

You will notice that the first step to building an MPI program is including the MPI header file with #include <mpi.h>. After this, the MPI environment must be initialized with MPI_Init(NULL, NULL). During MPI_Init, all of MPI's global and internal variables are constructed. For example, a communicator is formed around all of the processes that were spawned, and unique ranks are assigned to each process. MPI_Init takes two arguments, pointers to main's argc and argv, so that an implementation can inspect the command line. This example does not need that, and passing NULL for both is permitted. After MPI_Init, there are two main functions that are called. These two functions are used in almost every single MPI program that you will write.

MPI_Comm_size(MPI_Comm communicator, int* size)

Returns the size of a communicator. In our example, MPI_COMM_WORLD (which is constructed for us by MPI) encloses all of the processes in the job, so this call should return the number of processes that were requested for the job.

MPI_Comm_rank(MPI_Comm communicator, int* rank)

Returns the rank of a process in a communicator. Each process inside of a communicator is assigned an incremental rank starting from zero. The ranks of the processes are primarily used for identification purposes when sending and receiving messages.

A miscellaneous and less-used function in this program is MPI_Get_processor_name(char* name, int* name_length), which obtains the actual name of the processor on which the process is executing. The final call in this program, MPI_Finalize(), is used to clean up the MPI environment. No more MPI calls can be made after this one.

Running MPI Hello World

Now compile the example by typing make. My makefile looks for the MPICC environment variable. If you installed MPICH2 to a local directory, set your MPICC environment variable to point to your mpicc binary. The mpicc program in your installation is really just a wrapper around gcc, and it makes compiling and linking all of the necessary MPI routines much easier.

After your program is compiled, it is ready to be executed. Now comes the part where you might have to do some additional configuration. If you are running MPI programs on a cluster of nodes, you will have to set up a host file. If you are simply running MPI on a laptop or a single machine, disregard the next piece of information. The host file contains the names of all of the computers on which your MPI job will execute. For ease of execution, you should be sure that all of these computers have SSH access, and you should also set up an authorized keys file to avoid a password prompt for SSH. My host file looks like this.

For the run script that I have provided in the download, you should set an environment variable called MPI_HOSTS and have it point to your hosts file. My script will automatically include it in the command line when the MPI job is launched. If you do not need a hosts file, simply do not set the environment variable. Also, if you have a local installation of MPI, you should set the MPIRUN environment variable to point to the mpirun binary from the installation. After this, call ./run.perl mpi_hello_world to run the example application.

As expected, the MPI program was launched across all of the hosts in my host file. Each process was assigned a unique rank, which was printed off along with the process name. As one can see from my example output, the output of the processes is in an arbitrary order since there is no synchronization involved before printing.

Notice how the script called mpirun. This is the program that the MPI implementation uses to launch the job. Processes are spawned across all the hosts in the host file and the MPI program executes across each process. My script automatically supplies the -n flag to set the number of MPI processes to four. Try changing the run script and launching more processes! Don't accidentally crash your system though.

Now you might be asking, "My hosts are actually dual-core machines. How can I get MPI to spawn processes across the individual cores first before individual machines?" The solution is pretty simple. Just modify your hosts file and place a colon and the number of cores per processor after the host name. For example, I specified that each of my hosts has two cores.

When I execute the run script again, voila!, the MPI job spawns two processes on only two of my hosts.

Up Next

Now that you have a basic understanding of how an MPI program is executed, it is time to learn fundamental point-to-point communication routines. In the next lesson, I cover basic sending and receiving routines in MPI. Feel free to also examine the beginner MPI tutorial for a complete reference of all of the beginning MPI lessons.

MPI Send and Receive


Sending and receiving are the two foundational concepts of MPI. Almost every single function in MPI can be implemented with basic send and receive calls. In this lesson, I will discuss how to use MPI's blocking sending and receiving functions, and I will also give an overview of other basic concepts associated with transmitting data using MPI. The code for this tutorial is available here.

Overview of Sending and Receiving with MPI

MPI's send and receive calls operate in the following manner. First, process A decides a message needs to be sent to process B. Process A then packs up all of its necessary data into a buffer for process B. These buffers are often referred to as envelopes since the data is being packed into a single message before transmission (similar to how letters are packed into envelopes before transmission to the post office). After the data is packed into a buffer, the communication device (which is often a network) is responsible for routing the message to the proper location. The location of the message is defined by the process's rank.

Even though the message is routed to B, process B still has to acknowledge that it wants to receive A's data. Once it does this, the data has been transmitted. Process A is notified that the data has been transmitted and may go back to work.

Sometimes there are cases when A might have to send many different types of messages to B. Instead of B having to go through extra measures to differentiate all these messages, MPI allows senders and receivers to also specify message IDs with the message (known as tags). When process B only requests a message with a certain tag number, messages with different tags will be buffered by the network until B is ready for them.

With these concepts in mind, let's look at the prototypes for the MPI sending and receiving functions.
MPI_Send(void* data, int count, MPI_Datatype datatype, int destination,
         int tag, MPI_Comm communicator)

MPI_Recv(void* data, int count, MPI_Datatype datatype, int source,
         int tag, MPI_Comm communicator, MPI_Status* status)

Although this might seem like a mouthful when reading all of the arguments, they become easier to remember since almost every MPI call uses similar syntax. The first argument is the data buffer. The second and third arguments describe the count and type of elements that reside in the buffer. MPI_Send sends the exact count of elements, and MPI_Recv will receive at most the count of elements (more on this in the next lesson). The fourth and fifth arguments specify the rank of the other process (the destination for MPI_Send, the source for MPI_Recv) and the tag of the message. The sixth argument specifies the communicator and the last argument (for MPI_Recv only) provides information about the received message.

Elementary MPI Datatypes

The MPI_Send and MPI_Recv functions utilize MPI datatypes as a means to specify the structure of a message at a higher level. For example, if the process wishes to send one integer to another, it would use a count of one and a datatype of MPI_INT. The other elementary MPI datatypes are listed below with their equivalent C datatypes.

MPI datatype              C equivalent
MPI_CHAR                  char
MPI_SHORT                 short int
MPI_INT                   int
MPI_LONG                  long int
MPI_LONG_LONG             long long int
MPI_UNSIGNED_CHAR         unsigned char
MPI_UNSIGNED_SHORT        unsigned short int
MPI_UNSIGNED              unsigned int
MPI_UNSIGNED_LONG         unsigned long int
MPI_UNSIGNED_LONG_LONG    unsigned long long int
MPI_FLOAT                 float
MPI_DOUBLE                double
MPI_LONG_DOUBLE           long double
MPI_BYTE

For now, we will only make use of these datatypes in the beginner MPI tutorial. Once we have covered enough basics, you will learn how to create your own MPI datatypes for characterizing more complex types of messages.

MPI Send / Recv Program

The code for this tutorial is available here. Go ahead and download and extract the code. I refer the reader back to the MPI Hello World lesson for instructions on how to use my code packages. The first example is in send_recv.c. Some of the major parts of the program are shown below.
// Find out rank, size
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int number;
if (world_rank == 0) {
  number = -1;
  MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
  MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
  printf("Process 1 received number %d from process 0\n", number);
}

MPI_Comm_rank and MPI_Comm_size are first used to determine the world size along with the rank of the process. Then process zero initializes a number to the value of negative one and sends this value to process one. As you can see in the else if statement, process one is calling MPI_Recv to receive the number. It also prints off the received value. Since we are sending and receiving exactly one integer, each process requests that one MPI_INT be sent/received. Each process also uses a tag number of zero to identify the message. The receiving process could also have used the predefined constant MPI_ANY_TAG for the tag number, since only one type of message was being transmitted; the sender must always supply a specific tag. Running the example program looks like this.

As expected, process one receives negative one from process zero.

MPI Ping Pong Program

The next example is a ping pong program. In this example, processes use MPI_Send and MPI_Recv to continually bounce messages off of each other until they decide to stop. Take a look at ping_pong.c in the example code download. The major portions of the code look like this.
int ping_pong_count = 0;
int partner_rank = (world_rank + 1) % 2;
while (ping_pong_count < PING_PONG_LIMIT) {
  if (world_rank == ping_pong_count % 2) {
    // Increment the ping pong count before you send it
    ping_pong_count++;
    MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
             MPI_COMM_WORLD);
    printf("%d sent and incremented ping_pong_count "
           "%d to %d\n", world_rank, ping_pong_count, partner_rank);
  } else {
    MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("%d received ping_pong_count %d from %d\n",
           world_rank, ping_pong_count, partner_rank);
  }
}

This example is meant to be executed with only two processes. The processes first determine their partner with some simple arithmetic. A ping_pong_count is initialized to zero and is incremented at each ping pong step by the sending process. As the ping_pong_count is incremented, the processes take turns being the sender and receiver. Finally, after the limit is reached (ten in my code), the processes stop sending and receiving. The output of the example code will look something like this.
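The turn-taking rule can also be checked by hand, without running MPI. The sketch below is plain C with no MPI calls. The helper names partner_of, sender_for_count, and sends_by_rank are made up for illustration and do not appear in ping_pong.c. It only mirrors the arithmetic from the excerpt: a process's partner is (rank + 1) % 2, and the process whose rank equals ping_pong_count % 2 is the next sender.

```c
#include <assert.h>

/* Hypothetical helper: the partner of a rank in a two-process job. */
int partner_of(int world_rank) {
    assert(world_rank == 0 || world_rank == 1);
    return (world_rank + 1) % 2;
}

/* Hypothetical helper: which rank sends when the counter has this value.
   Mirrors the test `world_rank == ping_pong_count % 2` from the excerpt. */
int sender_for_count(int ping_pong_count) {
    assert(ping_pong_count >= 0);
    return ping_pong_count % 2;
}

/* Count how many messages a given rank sends before the limit is reached. */
int sends_by_rank(int world_rank, int limit) {
    int sends = 0;
    for (int count = 0; count < limit; count++) {
        if (sender_for_count(count) == world_rank) {
            sends++;
        }
    }
    return sends;
}
```

With a limit of ten, sends_by_rank(0, 10) and sends_by_rank(1, 10) both return 5: rank zero sends when the counter is 0, 2, 4, 6, 8, and rank one sends when it is 1, 3, 5, 7, 9.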

Your output will likely look different. However, as you can see, process zero and one are both taking turns sending and receiving the ping pong counter to each other.

Ring Program

I have included one more example of MPI_Send and MPI_Recv using more than two processes. In this example, a value is passed around by all processes in a ring-like fashion. Take a look at ring.c in the example code download. The major portion of the code looks like this.
int token;
if (world_rank != 0) {
  MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0,
           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  printf("Process %d received token %d from process %d\n",
         world_rank, token, world_rank - 1);
} else {
  // Set the token's value if you are process 0
  token = -1;
}
MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size,
         0, MPI_COMM_WORLD);

// Now process 0 can receive from the last process.
if (world_rank == 0) {
  MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0,
           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  printf("Process %d received token %d from process %d\n",
         world_rank, token, world_size - 1);
}

The ring program initializes a value from process zero, and the value is passed around every single process. The program terminates when process zero receives the value from the last process. As you can see from the program, extra care is taken to ensure that it doesn't deadlock. In other words, process zero makes sure that it has completed its first send before it tries to receive the value from the last process. All of the other processes simply call MPI_Recv (receiving from their neighboring lower process) and then MPI_Send (sending the value to their neighboring higher process) to pass the value along the ring. MPI_Send and MPI_Recv will block until the message has been transmitted. Because of this, the printfs should occur in the order in which the value is passed. Using five processes, the output should look like this.
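The neighbor arithmetic behind the ring can be sketched in plain C, without any MPI calls. The helper names ring_next, ring_prev, and hops_around_ring are illustrative and are not part of ring.c, which handles rank zero with a separate branch instead of a modular "previous" computation. Treat this as a sketch of the index arithmetic only.

```c
#include <assert.h>

/* Hypothetical helper: rank that receives the token from `rank`. */
int ring_next(int rank, int world_size) {
    assert(world_size > 0 && rank >= 0 && rank < world_size);
    return (rank + 1) % world_size;
}

/* Hypothetical helper: rank that `rank` receives the token from.
   Adding world_size before the modulo keeps the result
   non-negative for rank 0. */
int ring_prev(int rank, int world_size) {
    assert(world_size > 0 && rank >= 0 && rank < world_size);
    return (rank + world_size - 1) % world_size;
}

/* Follow the token from process 0 and count the sends needed
   for it to come back to process 0. */
int hops_around_ring(int world_size) {
    int hops = 0;
    int holder = 0;
    do {
        holder = ring_next(holder, world_size);
        hops++;
    } while (holder != 0);
    return hops;
}
```

For five processes, ring_prev(0, 5) is 4 and ring_next(4, 5) is 0, which matches the two special cases in the excerpt, and hops_around_ring(5) returns 5, one send per process.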

As we can see, process zero first sends a value of negative one to process one. This value is passed around the ring until it gets back to process zero.

Up Next

Now that you have a basic understanding of MPI_Send and MPI_Recv, it is time to go a little bit deeper into these functions. In the next lesson, I cover how to probe and dynamically receive messages. Feel free to also examine the beginner MPI tutorial for a complete reference of all of the beginning MPI lessons.

Dynamic Receiving with MPI_Probe (and MPI_Status)


In the previous lesson, I discussed how to use MPI_Send and MPI_Recv to perform standard point-to-point communication. I only covered how to send messages in which the length of the message was known beforehand. Although it is possible to send the length of the message as a separate send / recv operation, MPI natively supports dynamic messages with just a few additional function calls. I will be going over how to use these functions in this lesson. The code for this tutorial is located here.

The MPI_Status Structure

As covered in the previous lesson, the MPI_Recv operation takes the address of an MPI_Status structure as an argument (which can be ignored with MPI_STATUS_IGNORE). If we pass an MPI_Status structure to the MPI_Recv function, it will be populated with additional information about the receive operation after it completes. The three primary pieces of information include:

1. The rank of the sender. The rank of the sender is stored in the MPI_SOURCE element of the structure. That is, if we declare an MPI_Status stat variable, the rank can be accessed with stat.MPI_SOURCE.
2. The tag of the message. The tag of the message can be accessed by the MPI_TAG element of the structure (similar to MPI_SOURCE).
3. The length of the message. The length of the message does not have a predefined element in the status structure. Instead, we have to find out the length of the message with

MPI_Get_count(MPI_Status* status, MPI_Datatype datatype, int* count)

where count is the total number of datatype elements that were received.

Why would any of this information be necessary? It turns out that MPI_Recv can take MPI_ANY_SOURCE for the rank of the sender and MPI_ANY_TAG for the tag of the message. For this case, the MPI_Status structure is the only way to find out the actual sender and tag of the message. Furthermore, MPI_Recv is not guaranteed to receive the full number of elements passed as the argument to the function call. Instead, it receives the number of elements that were sent to it (and returns an error if more elements were sent than the desired receive amount). The MPI_Get_count function is used to determine the actual receive amount.

An Example of Querying the MPI_Status Structure

The program that queries the MPI_Status structure, check_status.c, is provided in the example code. The program sends a random amount of numbers to a receiver, and the receiver then finds out how many numbers were sent. The main part of the code looks like this.
const int MAX_NUMBERS = 100;
int numbers[MAX_NUMBERS];
int number_amount;
if (world_rank == 0) {
  // Pick a random amount of integers to send to process one
  srand(time(NULL));
  number_amount = (rand() / (float)RAND_MAX) * MAX_NUMBERS;

  // Send the amount of integers to process one
  MPI_Send(numbers, number_amount, MPI_INT, 1, 0, MPI_COMM_WORLD);
  printf("0 sent %d numbers to 1\n", number_amount);
} else if (world_rank == 1) {
  MPI_Status status;
  // Receive at most MAX_NUMBERS from process zero
  MPI_Recv(numbers, MAX_NUMBERS, MPI_INT, 0, 0, MPI_COMM_WORLD,
           &status);

  // After receiving the message, check the status to determine
  // how many numbers were actually received
  MPI_Get_count(&status, MPI_INT, &number_amount);

  // Print off the amount of numbers, and also print additional
  // information in the status object
  printf("1 received %d numbers from 0. Message source = %d, "
         "tag = %d\n",
         number_amount, status.MPI_SOURCE, status.MPI_TAG);
}

As we can see, process zero randomly sends up to MAX_NUMBERS integers to process one. Process one then calls MPI_Recv for a total of MAX_NUMBERS integers. Although process one is passing MAX_NUMBERS as the argument to MPI_Recv, process one will receive at most this many numbers. In the code, process one calls MPI_Get_count with MPI_INT as the datatype to find out how many integers were actually received. Along with printing off the size of the received message, process one also prints off the source and tag of the message by accessing the MPI_SOURCE and MPI_TAG elements of the status structure.

As a clarification, the count reported by MPI_Get_count is relative to the datatype which is passed. If the user were to use MPI_CHAR as the datatype, the reported amount would be four times as large (assuming an integer is four bytes and a char is one byte). If you run the check_status program, the output should look similar to this.
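The dependence of the count on the datatype comes down to simple division. The sketch below is a plain-C model of that arithmetic, not MPI's implementation. The name count_for_type is made up for illustration, and COUNT_UNDEFINED is only a stand-in for MPI_UNDEFINED, which MPI_Get_count reports when the received data is not a whole number of elements of the requested type.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for MPI_UNDEFINED; the real constant's value is
   implementation-defined. */
#define COUNT_UNDEFINED (-1)

/* Hypothetical model of how an element count depends on the
   datatype size: the same payload holds more elements when each
   element is smaller. */
int count_for_type(size_t message_bytes, size_t type_size) {
    assert(type_size > 0);
    if (message_bytes % type_size != 0) {
        /* Not a whole number of elements of this type. */
        return COUNT_UNDEFINED;
    }
    return (int)(message_bytes / type_size);
}
```

A message of 25 ints occupies 25 * sizeof(int) bytes, so count_for_type(25 * sizeof(int), sizeof(int)) returns 25, while count_for_type(25 * sizeof(int), sizeof(char)) returns 25 * sizeof(int), which is 100 on a system where an int is four bytes.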

As expected, process zero sends a random amount of integers to process one, which prints off information about the received message.

Using MPI_Probe to Find Out the Message Size

Now that you understand how the MPI_Status object works, we can use it to our advantage a little bit more. Instead of posting a receive and simply providing a really large buffer to handle all possible sizes of messages (as we did in the last example), you can use MPI_Probe to query the message size before actually receiving it. The function prototype looks like this.

MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status* status)

MPI_Probe looks quite similar to MPI_Recv. In fact, you can think of MPI_Probe as an MPI_Recv that does everything but receive the message. Similar to MPI_Recv, MPI_Probe will block for a message with a matching tag and sender. When the message is available, it will fill the status structure with information. The user can then use MPI_Recv to receive the actual message. The provided code has an example of this in probe.c. Here's what the main source code looks like.
int number_amount;
if (world_rank == 0) {
  const int MAX_NUMBERS = 100;
  int numbers[MAX_NUMBERS];
  // Pick a random amount of integers to send to process one
  srand(time(NULL));
  number_amount = (rand() / (float)RAND_MAX) * MAX_NUMBERS;

  // Send the random amount of integers to process one
  MPI_Send(numbers, number_amount, MPI_INT, 1, 0, MPI_COMM_WORLD);
  printf("0 sent %d numbers to 1\n", number_amount);
} else if (world_rank == 1) {
  MPI_Status status;
  // Probe for an incoming message from process zero
  MPI_Probe(0, 0, MPI_COMM_WORLD, &status);

  // When probe returns, the status object has the size and other
  // attributes of the incoming message. Get the size of the message
  MPI_Get_count(&status, MPI_INT, &number_amount);

  // Allocate a buffer just big enough to hold the incoming numbers
  int* number_buf = (int*)malloc(sizeof(int) * number_amount);

  // Now receive the message with the allocated buffer
  MPI_Recv(number_buf, number_amount, MPI_INT, 0, 0,
           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  printf("1 dynamically received %d numbers from 0.\n",
         number_amount);
  free(number_buf);
}

Similar to the last example, process zero picks a random amount of numbers to send to process one. What is different in this example is that process one now calls MPI_Probe to find out how many elements process zero is trying to send (using MPI_Get_count). Process one then allocates a buffer of the proper size and receives the numbers. Running the code will look similar to this.

Although this example is trivial, MPI_Probe forms the basis of many dynamic MPI applications. For example, master / slave programs will often make heavy use of MPI_Probe when exchanging variable-sized worker messages. As an exercise, make a wrapper around MPI_Recv that uses MPI_Probe for any dynamic applications you might write. It makes the code look much nicer.

Up Next

Do you feel comfortable using the standard blocking point-to-point communication routines? If so, then you already have the ability to write endless amounts of parallel applications! Let's look at a more advanced example of using the routines you have learned. Check out the application example using MPI_Send, MPI_Recv, and MPI_Probe.

Point-to-Point Communication Application: Random Walk


It's time to go through an application example using some of the concepts introduced in the sending and receiving tutorial and the MPI_Probe and MPI_Status lesson. The code for the application can be downloaded here. The application simulates a process which I refer to as random walking.

The basic problem definition of a random walk is as follows. Given a Min, Max, and random walker W, make walker W take S random walks of arbitrary length to the right. If the walker goes out of bounds, it wraps back around. W can only move one unit to the right or left at a time.

Although the application in itself is very basic, the parallelization of random walking can simulate the behavior of a wide variety of parallel applications. More on that later. For now, let's go over how to parallelize the random walk problem.

Parallelization of the Random Walking Problem

Our first task, which is pertinent to many parallel programs, is splitting the domain across processes. The random walk problem has a one-dimensional domain of size Max - Min + 1 (since Max and Min are inclusive to the walker). Assuming that walkers can only take integer-sized steps, we can easily partition the domain into near-equal-sized chunks across processes. For example, if Min is 0 and Max is 20 and we have four processes, the domain would be split like this.
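The arithmetic of that split can be sketched in a few lines of plain C, with no MPI calls. The helper names units_owned and owner_of are hypothetical and are not taken from the tutorial's code. The sketch assumes positions are numbered from 0 (so Min = 0) and that the last process absorbs any remainder.

```c
#include <assert.h>

/* Hypothetical helper: number of domain units owned by a rank
   when the last rank absorbs the remainder. */
int units_owned(int domain_size, int world_rank, int world_size) {
    assert(world_size > 0 && world_size <= domain_size);
    assert(world_rank >= 0 && world_rank < world_size);
    int chunk = domain_size / world_size;
    if (world_rank == world_size - 1) {
        return chunk + domain_size % world_size;
    }
    return chunk;
}

/* Hypothetical helper: rank that owns a position, where positions
   are numbered from 0 to domain_size - 1. */
int owner_of(int position, int domain_size, int world_size) {
    assert(world_size > 0 && world_size <= domain_size);
    assert(position >= 0 && position < domain_size);
    int chunk = domain_size / world_size;
    int rank = position / chunk;
    /* Positions in the remainder belong to the last rank. */
    return rank < world_size ? rank : world_size - 1;
}
```

With Min = 0 and Max = 20 the domain has 21 units, so for four processes units_owned returns 5 for ranks 0 to 2 and 6 for rank 3, while owner_of(4, 21, 4) is 0, owner_of(5, 21, 4) is 1, and owner_of(20, 21, 4) is 3.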

The first three processes own five units of the domain while the last process takes the last five units plus the one remaining unit. Once the domain has been partitioned, the application will initialize walkers. As explained earlier, a walker will take S walks with a random total walk size. For example, if the walker takes a walk of size six on process zero (using the previous domain decomposition), the execution of the walker will go like this:

1. The walker starts taking incremental steps. When it hits value four, however, it has reached the end of the bounds of process zero. Process zero now has to communicate the walker to process one.
2. Process one receives the walker and continues walking until it has reached its total walk size of six. The walker can then proceed on a new random walk.

In this example, W only had to be communicated one time from process zero to process one. If W had to take a longer walk, however, it may have needed to be passed through more processes along its path through the domain.

Coding the Application using MPI_Send and MPI_Recv

This application can be coded using MPI_Send and MPI_Recv. Before we begin looking at code, let's establish some preliminary characteristics and functions of the program:

- Each process determines its part of the domain.
- Each process initializes exactly N walkers, all of which start at the first value of their local domain.
- Each walker has two associated integer values: the current position of the walker and the number of steps left to take.
- Walkers start traversing through the domain and are passed to other processes until they have completed their walk.
- The processes terminate when all walkers have finished.

Let's begin by writing code for the domain decomposition. The function will take in the total domain size and find the appropriate subdomain for the MPI process. It will also give any remainder of the domain to the final process. For simplicity, I just call MPI_Abort for any errors that are found. The function, called decompose_domain, looks like this:

void decompose_domain(int domain_size, int world_rank,
                      int world_size, int* subdomain_start,
                      int* subdomain_size) {
  if (world_size > domain_size) {
    // Don't worry about this special case. Assume the domain size
    // is greater than the world size.
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  *subdomain_start = domain_size / world_size * world_rank;
  *subdomain_size = domain_size / world_size;
  if (world_rank == world_size - 1) {
    // Give remainder to last process
    *subdomain_size += domain_size % world_size;
  }
}

As you can see, the function splits the domain in even chunks, taking care of the case when a remainder is present. The function returns a subdomain start and a subdomain size. Next, we need to create a function that initializes walkers. We first define a walker structure that looks like this:
typedef struct {
  int location;
  int num_steps_left_in_walk;
} Walker;

Our initialization function, called initialize_walkers, takes the subdomain bounds and adds walkers to an incoming_walkers vector (by the way, this application is in C++).
void initialize_walkers(int num_walkers_per_proc, int max_walk_size,
                        int subdomain_start, int subdomain_size,
                        vector<Walker>* incoming_walkers) {
  Walker walker;
  for (int i = 0; i < num_walkers_per_proc; i++) {
    // Initialize walkers at the start of the subdomain
    walker.location = subdomain_start;
    walker.num_steps_left_in_walk =
      (rand() / (float)RAND_MAX) * max_walk_size;
    incoming_walkers->push_back(walker);
  }
}

After initialization, it is time to progress the walkers. Let's start off by making a walking function. This function is responsible for progressing the walker until it has finished its walk. If it goes out of local bounds, it is added to the outgoing_walkers vector.

void walk(Walker* walker, int subdomain_start, int subdomain_size,
          int domain_size, vector<Walker>* outgoing_walkers) {
  while (walker->num_steps_left_in_walk > 0) {
    if (walker->location == subdomain_start + subdomain_size) {
      // Take care of the case when the walker is at the end
      // of the domain by wrapping it around to the beginning
      if (walker->location == domain_size) {
        walker->location = 0;
      }
      outgoing_walkers->push_back(*walker);
      break;
    } else {
      walker->num_steps_left_in_walk--;
      walker->location++;
    }
  }
}
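To see how a single walker moves between subdomains, the following is a serial sketch in plain C that mirrors the loop above without the C++ vector. The function name walk_in_subdomain and its boolean return value are assumptions made for this illustration: instead of pushing the walker onto outgoing_walkers, it simply reports whether the walker has to be handed to the next process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int location;
    int num_steps_left_in_walk;
} Walker;

/* Serial stand-in for the walk step (hypothetical, for illustration only).
 * Advances the walker inside one subdomain and returns true if the walker
 * left the subdomain and must be passed to the next process. */
bool walk_in_subdomain(Walker* walker, int subdomain_start,
                       int subdomain_size, int domain_size) {
    assert(walker != NULL && subdomain_size > 0);
    while (walker->num_steps_left_in_walk > 0) {
        if (walker->location == subdomain_start + subdomain_size) {
            /* Past the end of the whole domain: wrap to the beginning */
            if (walker->location == domain_size) {
                walker->location = 0;
            }
            return true;
        }
        walker->num_steps_left_in_walk--;
        walker->location++;
    }
    return false; /* walk finished inside this subdomain */
}
```

For a walker that starts at location 0 with six steps in a subdomain of size five, the first call returns true with one step left. A second call with the next subdomain (starting at 5) finishes the walk at location 6.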

Now that we have established an initialization function (that populates an incoming walker list) and a walking function (that populates an outgoing walker list), we only need two more functions: a function that sends outgoing walkers and a function that receives incoming walkers. The sending function looks like this:

void send_outgoing_walkers(vector<Walker>* outgoing_walkers,
                           int world_rank, int world_size) {
  // Send the data as an array of MPI_BYTEs to the next process.
  // The last process sends to process zero.
  MPI_Send((void*)outgoing_walkers->data(),
           outgoing_walkers->size() * sizeof(Walker), MPI_BYTE,
           (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
  // Clear the outgoing walkers
  outgoing_walkers->clear();
}

The function that receives incoming walkers should use MPI_Probe since it does not know beforehand how many walkers it will receive. This is what it looks like:
void receive_incoming_walkers(vector<Walker>* incoming_walkers,
                              int world_rank, int world_size) {
  // Probe for new incoming walkers
  MPI_Status status;
  // Receive from the process before you. If you are process zero,
  // receive from the last process
  int incoming_rank =
    (world_rank == 0) ? world_size - 1 : world_rank - 1;
  MPI_Probe(incoming_rank, 0, MPI_COMM_WORLD, &status);
  // Resize your incoming walker buffer based on how much data is
  // being received
  int incoming_walkers_size;
  MPI_Get_count(&status, MPI_BYTE, &incoming_walkers_size);
  incoming_walkers->resize(incoming_walkers_size / sizeof(Walker));
  MPI_Recv((void*)incoming_walkers->data(), incoming_walkers_size,
           MPI_BYTE, incoming_rank, 0, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
}
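Both functions treat the walkers as a flat run of bytes, and the receiver recovers the walker count by dividing the byte count by sizeof(Walker). The plain C sketch below shows that round trip without MPI. The helper names walkers_in_message and unpack_walkers are made up for this illustration, and the sketch assumes that sender and receiver use the same struct layout, which is also what sending raw MPI_BYTE data relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    int location;
    int num_steps_left_in_walk;
} Walker;

/* Hypothetical helper: number of whole Walker structs in a message of
 * byte_count bytes (the byte count is what a probe would report). */
size_t walkers_in_message(size_t byte_count) {
    assert(byte_count % sizeof(Walker) == 0);
    return byte_count / sizeof(Walker);
}

/* Hypothetical helper: copies the raw bytes of a message into an array of
 * walkers and returns how many walkers were unpacked. */
size_t unpack_walkers(const unsigned char* bytes, size_t byte_count,
                      Walker* out, size_t out_capacity) {
    size_t n = walkers_in_message(byte_count);
    assert(n <= out_capacity);
    if (n > 0) {
        memcpy(out, bytes, n * sizeof(Walker));
    }
    return n;
}
```

An empty outgoing list corresponds to a zero-byte message, which unpacks to zero walkers; this is why a process can still take part in every exchange even when it has nothing to pass along.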

Now we have established the main functions of the program. We have to tie all these functions together as follows:

1. Initialize the walkers.
2. Progress the walkers with the walk function.
3. Send out any walkers in the outgoing_walkers vector.
4. Receive new walkers and put them in the incoming_walkers vector.
5. Repeat steps two through four until all walkers have finished.

The first attempt at writing this program is below. For now, we will not worry about how to determine when all walkers have finished. Before you look at the code, I must warn you: this code is incorrect! With this in mind, let's look at my code, and hopefully you can see what might be wrong with it.

// Find your part of the domain
decompose_domain(domain_size, world_rank, world_size,
                 &subdomain_start, &subdomain_size);
// Initialize walkers in your subdomain
initialize_walkers(num_walkers_per_proc, max_walk_size,
                   subdomain_start, subdomain_size,
                   &incoming_walkers);

while (!all_walkers_finished) { // Determine walker completion later
  // Process all incoming walkers
  for (int i = 0; i < incoming_walkers.size(); i++) {
    walk(&incoming_walkers[i], subdomain_start, subdomain_size,
         domain_size, &outgoing_walkers);
  }
  // Send all outgoing walkers to the next process.
  send_outgoing_walkers(&outgoing_walkers, world_rank,
                        world_size);
  // Receive all the new incoming walkers
  receive_incoming_walkers(&incoming_walkers, world_rank,
                           world_size);
}

Everything looks normal, but the order of function calls has introduced a very likely scenario: deadlock.

Deadlock and Prevention

According to Wikipedia, deadlock refers to a specific condition when two or more processes are each waiting for the other to release a resource, or more than two processes are waiting for resources in a circular chain. In our case, the above code will result in a circular chain of MPI_Send calls.

It is worth noting that the above code will actually not deadlock most of the time. Although MPI_Send is a blocking call, the MPI specification says that MPI_Send blocks until the send buffer can be reclaimed. This means that MPI_Send will return when the network can buffer the message. If the sends eventually can't be buffered by the network, they will block until a matching receive is posted. In our case, there are enough small sends and frequent matching receives that deadlock is unlikely. However, a big enough network buffer should never be assumed. Since we are only focusing on MPI_Send and MPI_Recv in this lesson, the best way to avoid the possible sending and receiving deadlock is to order the messaging such that sends will have matching receives and vice versa. One easy way to do this is to change our loop around such that even-numbered processes send outgoing walkers before receiving walkers and odd-numbered processes do the opposite. Given two stages of execution, the sending and receiving will now look like this:

Note: Executing this with one process can still deadlock. To avoid this, simply don't perform sends and receives when using one process. You may be asking, does this still work with an odd number of processes? We can go through a similar diagram again with three processes:
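The ordering argument can also be checked with a small serial model. The sketch below is not MPI code: it assumes the worst case, in which a send never completes until the matching receive is posted (no buffering at all), and the names ring_exchange_completes and is_send are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 64

/* Each rank performs two operations per round: one send to the next rank
 * and one receive from the previous rank. Returns true if operation number
 * `step` (0 or 1) of `rank` is the send. */
static bool is_send(int rank, int step, bool even_odd) {
    bool sends_first = even_odd ? (rank % 2 == 0) : true;
    return sends_first ? (step == 0) : (step == 1);
}

/* Returns true if every rank in a ring can finish both operations when a
 * send only completes once its matching receive is posted. With even_odd
 * set, even ranks send first and odd ranks receive first; otherwise every
 * rank sends first (the ordering of the first attempt). */
bool ring_exchange_completes(int world_size, bool even_odd) {
    assert(world_size >= 1 && world_size <= MAX_RANKS);
    int ops_done[MAX_RANKS] = {0};
    bool progress = true;
    while (progress) {
        progress = false;
        for (int r = 0; r < world_size; r++) {
            int next = (r + 1) % world_size;
            if (next == r) continue; /* a blocking send to self never matches */
            if (ops_done[r] >= 2 || ops_done[next] >= 2) continue;
            bool r_sending = is_send(r, ops_done[r], even_odd);
            bool next_receiving = !is_send(next, ops_done[next], even_odd);
            if (r_sending && next_receiving) {
                /* The send and its matching receive complete together */
                ops_done[r]++;
                ops_done[next]++;
                progress = true;
            }
        }
    }
    for (int r = 0; r < world_size; r++) {
        if (ops_done[r] != 2) return false; /* some rank is stuck */
    }
    return true;
}
```

Under that assumption, the model reports that the even/odd ordering completes for two or more processes (including odd counts such as three or five), that the send-first ordering stalls, and that a single process sending to itself also stalls, which is the case the note above warns about.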

As you can see, at all three stages, there is at least one posted MPI_Send that matches a posted MPI_Recv, so we don't have to worry about the occurrence of deadlock.

Determining Completion of All Walkers

Now comes the final step of the program: determining when every single walker has finished. Since walkers can walk for a random length, they can finish their journey on any process. Because of this, it is difficult for all processes to know when all walkers have finished without some sort of additional communication. One possible solution is to have process zero keep track of all of the walkers that have finished and then tell all the other processes when to terminate. This solution, however, is quite cumbersome since each process would have to report any completed walkers to process zero and then also handle different types of incoming messages. For this lesson, we will keep things simple. Since we know the maximum distance that any walker can travel and the smallest total size it can travel for each pair of sends and receives (the subdomain size), we can figure out the number of sends and receives each process should do before termination. Using this characteristic of the program along with our strategy to avoid deadlock, the final main part of the program looks like this:

// Find your part of the domain
decompose_domain(domain_size, world_rank, world_size,
                 &subdomain_start, &subdomain_size);
// Initialize walkers in your subdomain
initialize_walkers(num_walkers_per_proc, max_walk_size,
                   subdomain_start, subdomain_size,
                   &incoming_walkers);

// Determine the maximum amount of sends and receives needed to
// complete all walkers
int maximum_sends_recvs =
  max_walk_size / (domain_size / world_size) + 1;
for (int m = 0; m < maximum_sends_recvs; m++) {
  // Process all incoming walkers
  for (int i = 0; i < incoming_walkers.size(); i++) {
    walk(&incoming_walkers[i], subdomain_start, subdomain_size,
         domain_size, &outgoing_walkers);
  }
  // Send and receive if you are even and vice versa for odd
  if (world_rank % 2 == 0) {
    send_outgoing_walkers(&outgoing_walkers, world_rank,
                          world_size);
    receive_incoming_walkers(&incoming_walkers, world_rank,
                             world_size);
  } else {
    receive_incoming_walkers(&incoming_walkers, world_rank,
                             world_size);
    send_outgoing_walkers(&outgoing_walkers, world_rank,
                          world_size);
  }
}
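As a worked instance of the loop bound, the sketch below isolates the maximum_sends_recvs expression in a plain C function. The function wrapper and its assertions are additions for illustration; the arithmetic is the same integer division used in the loop above.

```c
#include <assert.h>

/* Illustrative wrapper around the loop bound used above: a walker crosses at
 * least one smallest-size subdomain per send/receive pair, so this many
 * rounds are enough for a walk of at most max_walk_size steps. */
int maximum_sends_recvs(int max_walk_size, int domain_size, int world_size) {
    assert(world_size > 0 && domain_size >= world_size && max_walk_size >= 0);
    int smallest_subdomain = domain_size / world_size; /* integer division */
    return max_walk_size / smallest_subdomain + 1;
}
```

For example, with a domain size of 100, a maximum walk size of 500, and five processes, each subdomain holds 20 values, so every process performs 500 / 20 + 1 = 26 rounds of sending and receiving.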

Running the Application

The code for the application can be downloaded here. In contrast to the other lessons, this code uses C++. When installing MPICH2, you also installed the C++ MPI compiler (unless you explicitly configured it otherwise). If you installed MPICH2 in a local directory, make sure that you have set your MPICXX environment variable to point to the correct mpicxx compiler in order to use my makefile. In my code, I have set up the run script to provide default values for the program: 100 for the domain size, 500 for the maximum walk size, and 20 for the number of walkers per process. The run script should spawn five MPI processes, and the output should look similar to this:

The output continues until processes finish all sending and receiving of all walkers.

So What's Next?

If you have made it through this entire application and feel comfortable, then good! This application is quite advanced for a first real application. If you still don't feel comfortable with MPI_Send, MPI_Recv, and MPI_Probe, I'd recommend going through some of the examples in my recommended books for more practice. Next, we will start learning about collective communication in MPI, so stay tuned! Also, at the beginning, I told you that the concepts of this program are applicable to many parallel programs. I don't want to leave you hanging, so I have included some additional reading material below for anyone that wishes to learn more. Enjoy!

ADDITIONAL READING

Random Walking and Its Similarity to Parallel Particle Tracing

The random walk problem that we just coded, although seemingly trivial, can actually form the basis of simulating many types of parallel applications. Some parallel applications in the scientific domain require many types of randomized sends and receives. One example application is parallel particle tracing. Parallel particle tracing is one of the primary methods that are used to visualize flow fields. Particles are inserted into the flow field and then traced along the flow using numerical integration techniques (such as Runge-Kutta). The traced paths can then be rendered for visualization purposes. One example rendering is of the tornado image at the top left.

Performing efficient parallel particle tracing can be very difficult. The main reason is that the direction in which particles travel can only be determined after each incremental step of the integration. Therefore, it is hard for processes to coordinate and balance all communication and computation. To understand this better, let's look at a typical parallelization of particle tracing.

In this illustration, we see that the domain is split among six processes. Particles (sometimes referred to as seeds) are then placed in the subdomains (similar to how we placed walkers in subdomains), and then they begin tracing. When particles go out of bounds, they have to be exchanged with the processes that own the proper subdomain. This process is repeated until the particles have either left the entire domain or have reached a maximum trace length. The parallel particle tracing problem can be solved with MPI_Send, MPI_Recv, and MPI_Probe in a similar manner to the application that we just coded. There are, however, much more sophisticated MPI routines that can get the job done more efficiently. We will talk about these in the coming lessons. I hope you can now see at least one example of how the random walk problem is similar to other parallel applications. Stay tuned for more lessons and applications!

MPI Broadcast and Collective Communication


So far in the beginner MPI tutorial, we have examined point-to-point communication, which is communication between two processes. This lesson is the start of the collective communication section. Collective communication is a method of communication which involves participation of all processes in a communicator. In this lesson, we will discuss the implications of collective communication and go over a standard collective routine: broadcasting. The code for the lesson can be downloaded here.

Collective Communication and Synchronization Points

One of the things to remember about collective communication is that it implies a synchronization point among processes. This means that all processes must reach a point in their code before they can all begin executing again. Before going into detail about collective communication routines, let's examine synchronization in more detail. As it turns out, MPI has a special function that is dedicated to synchronizing processes:
MPI_Barrier(MPI_Comm communicator)

The name of the function is quite descriptive: the function forms a barrier, and no processes in the communicator can pass the barrier until all of them call the function. Here's an illustration. Imagine the horizontal axis represents execution of the program and the circles represent different processes:

Process zero first calls MPI_Barrier at the first time snapshot (T 1). While process zero is hung up at the barrier, processes one and three eventually make it (T 2). When process two finally makes it to the barrier (T 3), all of the processes then begin execution again (T 4). MPI_Barrier can be useful for many things. One of the primary uses of MPI_Barrier is to synchronize a program so that portions of the parallel code can be timed accurately. Want to know how MPI_Barrier is implemented? Sure you do! Do you remember the ring program from the MPI_Send and MPI_Recv tutorial? To refresh your memory, we wrote a program that passed a token around all processes in a ring-like fashion. This type of program is one of the simplest methods to implement a barrier since a token can't be passed around completely until all processes work together.

One final note about synchronization: always remember that every collective call you make is synchronized. In other words, if you can't successfully complete an MPI_Barrier, then you also can't successfully complete any collective call. If you try to call MPI_Barrier or other collective routines without ensuring all processes in the communicator will also call it, your program will hang. This can be very confusing for beginners, so be careful!

Broadcasting with MPI_Bcast

A broadcast is one of the standard collective communication techniques. During a broadcast, one process sends the same data to all processes in a communicator. One of the main uses of broadcasting is to send out user input to a parallel program, or send out configuration parameters to all processes. The communication pattern of a broadcast looks like this:

In this example, process zero is the root process, and it has the initial copy of data. All of the other processes receive the copy of data. In MPI, broadcasting can be accomplished by using MPI_Bcast. The function prototype looks like this:
MPI_Bcast(void* data, int count, MPI_Datatype datatype, int root, MPI_Comm communicator)

Although the root process and receiver processes do different jobs, they all call the same MPI_Bcast function. When the root process (in our example, it was process zero) calls MPI_Bcast, the data variable will be sent to all other processes. When all of the receiver processes call MPI_Bcast, the data variable will be filled in with the data from the root process.

Broadcasting with MPI_Send and MPI_Recv

At first, it might seem that MPI_Bcast is just a simple wrapper around MPI_Send and MPI_Recv. In fact, we can make this wrapper function right now. Our function, called my_bcast, can be downloaded in the example code for this lesson (my_bcast.c). It takes the same arguments as MPI_Bcast and looks like this:

void my_bcast(void* data, int count, MPI_Datatype datatype, int root,
              MPI_Comm communicator) {
  int world_rank;
  MPI_Comm_rank(communicator, &world_rank);
  int world_size;
  MPI_Comm_size(communicator, &world_size);

  if (world_rank == root) {
    // If we are the root process, send our data to everyone
    int i;
    for (i = 0; i < world_size; i++) {
      if (i != world_rank) {
        MPI_Send(data, count, datatype, i, 0, communicator);
      }
    }
  } else {
    // If we are a receiver process, receive the data from the root
    MPI_Recv(data, count, datatype, root, 0, communicator,
             MPI_STATUS_IGNORE);
  }
}

The root process sends the data to everyone else while the others receive from the root process. Easy, right? If you download the code and run the program, the program will print output like this:

Believe it or not, our function is actually very inefficient! Imagine that each process has only one outgoing/incoming network link. Our function is only using one network link from process zero to send all the data. A smarter implementation is a tree-based communication algorithm that can use more of the available network links at once. For example:

In this illustration, process zero starts off with the data and sends it to process one. Similar to our previous example, process zero also sends the data to process two in the second stage. The difference with this example is that process one is now helping out the root process by forwarding the data to process three. During the second stage, two network connections are being utilized at a time. The network utilization doubles at every subsequent stage of the tree communication until all processes have received the data. Do you think you can code this? Writing this code is a bit outside of the purpose of the lesson. If you are feeling brave, Parallel Programming with MPI is an excellent book with a complete example of the problem with code.

Comparison of MPI_Bcast with MPI_Send and MPI_Recv

The MPI_Bcast implementation utilizes a similar tree broadcast algorithm for good network utilization. How does our broadcast function compare to MPI_Bcast? We can run compare_bcast, an example program included in the lesson code. Before looking at the code, let's first go over one of MPI's timing functions: MPI_Wtime. MPI_Wtime takes no arguments, and it simply returns a floating-point number of seconds since a set time in the past. Similar to C's time function, you can call MPI_Wtime multiple times throughout your program and subtract the returned values to obtain timing of code segments.

Let's take a look at our code that compares my_bcast to MPI_Bcast:

for (i = 0; i < num_trials; i++) {
  // Time my_bcast
  // Synchronize before starting timing
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time -= MPI_Wtime();
  my_bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  // Synchronize again before obtaining final time
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time += MPI_Wtime();

  // Time MPI_Bcast
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time -= MPI_Wtime();
  MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time += MPI_Wtime();
}

In this code, num_trials is a variable stating how many timing experiments should be executed. We keep track of the accumulated time of both functions in two different variables. The average times are printed at the end of the program. To see the entire code, just download the lesson code and look at compare_bcast.c. When you use the run script to execute the code, the output will look similar to this.

The run script executes the code using 16 processors, 100,000 integers per broadcast, and 10 trial runs for timing results. As you can see, my experiment using 16 processors connected via ethernet shows significant timing differences between our naive implementation and MPI's implementation. Here is what the timing results look like at all scales.

Processors    my_bcast    MPI_Bcast
2             0.0344      0.0344
4             0.1025      0.0817
8             0.2385      0.1084
16            0.5109      0.1296

As you can see, there is no difference between the two implementations at two processors. This is because MPI_Bcast's tree implementation does not provide any additional network utilization when using two processors. However, the differences can clearly be observed when going up to even as few as 16 processors.
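The shape of these results can be reasoned about by counting communication stages. The plain C sketch below does this for the naive broadcast and for one common tree layout (a binomial tree, in which the number of processes holding the data doubles each stage, as in the tree described earlier). The function names are made up for this illustration, and the model is an assumption rather than a description of what any particular MPI library does: it treats every send as one stage and gives each process a single link.

```c
#include <assert.h>

/* Naive broadcast: the root sends to every other process one after another. */
int naive_bcast_stages(int world_size) {
    assert(world_size >= 1);
    return world_size - 1;
}

/* Tree broadcast: the set of processes holding the data doubles each stage. */
int tree_bcast_stages(int world_size) {
    assert(world_size >= 1);
    int stages = 0;
    int holders = 1; /* only the root has the data at the start */
    while (holders < world_size) {
        holders *= 2;
        stages++;
    }
    return stages;
}

/* In this tree (rooted at rank 0), the rank that forwards the data to `rank`
 * is found by subtracting the largest power of two not exceeding `rank`. */
int tree_bcast_source(int rank) {
    assert(rank >= 1);
    int highest = 1;
    while (highest * 2 <= rank) {
        highest *= 2;
    }
    return rank - highest;
}
```

In this model, two processes need one stage either way, which is consistent with the identical timings in the first row. At 16 processes the naive version needs 15 sequential sends from the root while the tree needs 4 stages.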

Try running the code yourself and experiment at larger scales!

Conclusions / Up Next

Feel a little better about collective routines? In the next MPI tutorial, I go over other essential collective communication routines: gathering and scattering. For all beginner lessons, go to the beginner MPI tutorial.

MPI Scatter, Gather, and Allgather


In the previous lesson, we went over the essentials of collective communication. We covered the most basic collective communication routine: MPI_Bcast. In this lesson, we are going to expand on collective communication routines by going over two very important routines: MPI_Scatter and MPI_Gather. We will also cover a variant of MPI_Gather, known as MPI_Allgather. The code for this tutorial is available here.

An Introduction to MPI_Scatter

MPI_Scatter is a collective routine that is very similar to MPI_Bcast (if you are unfamiliar with these terms, please read the previous lesson). MPI_Scatter involves a designated root process sending data to all processes in a communicator. The primary difference between MPI_Bcast and MPI_Scatter is small but important. MPI_Bcast sends the same piece of data to all processes while MPI_Scatter sends chunks of an array to different processes. Check out the illustration below for further clarification.

In the illustration, MPI_Bcast takes a single data element at the root process (the red box) and copies it to all other processes. MPI_Scatter takes an array of elements and distributes the elements in the order of process rank. The first element (in red) goes to process zero, the second element (in green) goes to process one, and so on. Although the root process (process zero) contains the entire array of data, MPI_Scatter will copy the appropriate element into the receiving buffer of the process. Here is what the function prototype of MPI_Scatter looks like.
MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_datatype, void* recv_data, int recv_count, MPI_Datatype recv_datatype, int root, MPI_Comm communicator)

Yes, the function looks big and scary, but let's examine it in more detail. The first parameter, send_data, is an array of data that resides on the root process. The second and third parameters, send_count and send_datatype, dictate how many elements of a specific MPI Datatype will be sent to each process. If send_count is one and send_datatype is MPI_INT, then process zero gets the first integer of the array, process one gets the second integer, and so on. If send_count is two, then process zero gets the first and second integers, process one gets the third and fourth, and so on. In practice, send_count is often equal to the number of elements in the array divided by the number of processes. What's that you say? The number of elements isn't divisible by the number of processes? Don't worry, we will cover that in a later lesson.

The receiving parameters of the function prototype are nearly identical to the sending parameters. The recv_data parameter is a buffer of data that can hold recv_count elements that have a datatype of recv_datatype. The last parameters, root and communicator, indicate the root process that is scattering the array of data and the communicator in which the processes reside.

An Introduction to MPI_Gather

MPI_Gather is the inverse of MPI_Scatter. Instead of spreading elements from one process to many processes, MPI_Gather takes elements from many processes and gathers them to one single process. This routine is highly useful to many parallel algorithms, such as parallel sorting and searching. Below is a simple illustration of this algorithm.

Similar to MPI_Scatter, MPI_Gather takes elements from each process and gathers them to the root process. The elements are ordered by the rank of the process from which they were received. The function prototype for MPI_Gather is identical to that of MPI_Scatter.
MPI_Gather(void* send_data, int send_count, MPI_Datatype send_datatype, void* recv_data, int recv_count, MPI_Datatype recv_datatype, int root, MPI_Comm communicator)

In MPI_Gather, only the root process needs to have a valid receive buffer. All other calling processes can pass NULL for recv_data. Also, don't forget that the recv_count parameter is the count of elements received per process, not the total summation of counts from all processes. This can often confuse beginning MPI programmers.

Computing Average of Numbers with MPI_Scatter and MPI_Gather

In the code for this lesson, I have provided an example program that computes the average across all numbers in an array. The program is in avg.c. Although the program is quite simple, it demonstrates how one can use MPI to divide work across processes, perform computation on subsets of data, and then aggregate the smaller pieces into the final answer. The program takes the following steps:

1. Generate a random array of numbers on the root process (process 0).
2. Scatter the numbers to all processes, giving each process an equal amount of numbers.
3. Each process computes the average of its subset of the numbers.
4. Gather all averages to the root process. The root process then computes the average of these numbers to get the final average.

The main part of the code with the MPI calls looks like this:

if (world_rank == 0) {
  rand_nums = create_rand_nums(elements_per_proc * world_size);
}

// Create a buffer that will hold a subset of the random numbers
float *sub_rand_nums = malloc(sizeof(float) * elements_per_proc);

// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);
// Gather all partial averages down to the root process
float *sub_avgs = NULL;
if (world_rank == 0) {
  sub_avgs = malloc(sizeof(float) * world_size);
}
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0,
           MPI_COMM_WORLD);

// Compute the total average of all numbers.
if (world_rank == 0) {
  float avg = compute_avg(sub_avgs, world_size);
}

At the beginning of the code, the root process creates an array of random numbers. When MPI_Scatter is called, each process now contains elements_per_proc elements of the original data. Each process computes the average of their subset of data and then the root process gathers each individual average. The total average is computed on this much smaller array of numbers.
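The same data flow can be traced in a single process. The sketch below is plain C without MPI: a loop over ranks stands in for the scatter (each rank's chunk starts at rank * elements_per_proc) and for the gather (partial averages are stored in rank order). The function names average_of and serial_scatter_gather_avg are made up for this illustration. Note that averaging the partial averages matches the overall average only because every process receives the same number of elements.

```c
#include <assert.h>
#include <stdlib.h>

/* Plain average of num_elements floats */
float average_of(const float* array, int num_elements) {
    assert(array != NULL && num_elements > 0);
    float sum = 0.0f;
    for (int i = 0; i < num_elements; i++) {
        sum += array[i];
    }
    return sum / num_elements;
}

/* Serial stand-in for the scatter / compute / gather pattern (hypothetical,
 * for illustration only). `nums` holds elements_per_proc * world_size values. */
float serial_scatter_gather_avg(const float* nums, int elements_per_proc,
                                int world_size) {
    assert(nums != NULL && elements_per_proc > 0 && world_size > 0);
    float* sub_avgs = malloc(sizeof(float) * (size_t)world_size);
    assert(sub_avgs != NULL);

    for (int rank = 0; rank < world_size; rank++) {
        /* "Scatter": the chunk that would land on this rank */
        const float* chunk = nums + (size_t)rank * elements_per_proc;
        /* "Gather": partial averages are collected in rank order */
        sub_avgs[rank] = average_of(chunk, elements_per_proc);
    }

    float avg = average_of(sub_avgs, world_size);
    free(sub_avgs);
    return avg;
}
```

For the eight values 1 through 8 split across four ranks, the partial averages are 1.5, 3.5, 5.5, and 7.5, and their average is 4.5, the same as the average of all eight values.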

Using the run script included in the code for this lesson, the output of your program should be similar to the following. Note that the numbers are randomly generated, so your final result might be different from mine.

MPI_Allgather and Modification of Average Program

So far, we have covered two MPI routines that perform many-to-one or one-to-many communication patterns, which simply means that many processes send/receive to one process. Oftentimes it is useful to be able to send many elements to many processes (i.e. a many-to-many communication pattern). MPI_Allgather has this characteristic. Given a set of elements distributed across all processes, MPI_Allgather will gather all of the elements to all the processes. In the most basic sense, MPI_Allgather is an MPI_Gather followed by an MPI_Bcast. The illustration below shows how data is distributed after a call to MPI_Allgather.

Just like MPI_Gather, the elements from each process are gathered in order of their rank, except this time the elements are gathered to all processes. Pretty easy, right? The function declaration for MPI_Allgather is almost identical to MPI_Gather with the difference that there is no root process in MPI_Allgather.
MPI_Allgather(void* send_data, int send_count, MPI_Datatype send_datatype, void* recv_data, int recv_count, MPI_Datatype recv_datatype, MPI_Comm communicator)
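The "gather followed by broadcast" view can be traced in a single process. The sketch below is plain C without MPI and assumes one float per rank; the function name serial_allgather and the layout of recv_buffers (one block of world_size floats per rank, stored back to back) are choices made for this illustration only.

```c
#include <assert.h>
#include <string.h>

/* Serial stand-in for an allgather of one float per rank (hypothetical).
 * contributions[r] is the value owned by rank r. recv_buffers must hold
 * world_size * world_size floats: rank r's receive buffer starts at
 * recv_buffers + r * world_size. */
void serial_allgather(const float* contributions, int world_size,
                      float* recv_buffers) {
    assert(contributions != NULL && recv_buffers != NULL && world_size > 0);

    /* Gather step: rank 0's buffer collects every value in rank order */
    float* root_buffer = recv_buffers;
    for (int rank = 0; rank < world_size; rank++) {
        root_buffer[rank] = contributions[rank];
    }

    /* Broadcast step: every other rank gets a copy of the gathered array */
    for (int rank = 1; rank < world_size; rank++) {
        memcpy(recv_buffers + (size_t)rank * world_size, root_buffer,
               sizeof(float) * (size_t)world_size);
    }
}
```

After the call, every rank's block holds the same values in rank order, which is why no root argument is needed in the prototype above.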

I have modified the average computation code to use MPI_Allgather. You can view the source in all_avg.c from the lesson code. The main difference in the code is shown below.
// Gather all partial averages down to all the processes
float *sub_avgs = (float *)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);

// Compute the total average of all numbers.
float avg = compute_avg(sub_avgs, world_size);

The partial averages are now gathered to everyone using MPI_Allgather. The averages are now printed off from all of the processes. Example output of the program should look like the following:

As you may have noticed, the only difference between all_avg.c and avg.c is that all_avg.c prints the average across all processes with MPI_Allgather.

Up Next

In the next lesson, I will cover some of the more complex collective communication algorithms. Stay tuned! Feel free to leave any comments or questions about the lesson. For all beginner lessons, go to the beginner MPI tutorial.