
Course Title: Introduction to OpenMP

1 Introduction

1.1 Introduction


In high performance computing, there are tools that assist programmers with multi-threaded parallel processing on distributed-memory and shared-memory multiprocessor platforms.

On distributed-memory multiprocessor platforms, each processor has its own memory whose content is not readily available to other processors. Sharing of information among processors is customarily facilitated by message passing using routines from standard message passing libraries such as MPI.

On shared-memory multiprocessors, memory among processors can be shared. Message passing libraries such as MPI can be, and are, used for the parallel processing tasks. However, a directive-based OpenMP Application Program Interface (API) has been developed specifically for shared-memory parallel processing.

OpenMP has broad support from many major computer hardware and software manufacturers. Similar to MPI's achievement as the standard for distributed-memory parallel processing, OpenMP has emerged as the standard for shared-memory parallel computing. Both of these standards can be used in conjunction with Fortran 77, Fortran 90, C or C++ for parallel computing applications. It is worth noting that for a cluster of single-processor and shared-memory multiprocessor computers, it is possible to use both paradigms in the same application program to increase the (aggregate) processing power. MPI is used to connect all machines within the cluster to form one virtual machine, while OpenMP is used to exploit the shared-memory parallelism on individual shared-memory machines within the cluster. This approach is commonly referred to as Multi-Level Parallel Programming (MLP).

In this course, we will focus on the fundamentals of OpenMP. The topics of MPI and MLP are covered in two separate tutorials on the CI-Tutor site - "Introduction to MPI" and "Multi-Level Parallel Programming."

1.2 What is OpenMP?


OpenMP comprises three complementary components:

1. a set of directives used by the programmer to communicate with the compiler on parallelism.
2. a runtime library which enables the setting and querying of parallel parameters such as the number of participating threads and the thread number.
3. a limited number of environment variables that can be used to define runtime system parallel parameters such as the number of threads.

Figure 1.1. The three components of the OpenMP API.

A Compiler Directive Example

Code segments that consume substantial CPU cycles frequently involve do loops (or for loops in C). For loops that are parallelizable, OpenMP provides a rather simple directive to instruct the compiler to parallelize the loop immediately following the directive. Let's take a look at a Fortran example:
call omp_set_num_threads(nthread)   ! requests "nthread" threads
!$OMP PARALLEL DO
      DO i = 1, N
         DO j = 1, M
            . . .
         END DO
      END DO
!$OMP END PARALLEL DO

In this Fortran code fragment, the OpenMP library function OMP_SET_NUM_THREADS is called to set the number of threads to "nthread". Next, the !$OMP PARALLEL DO directive notifies the compiler to parallelize the (outer) do loop that follows. The current, or master, thread is responsible for spawning "nthread-1" child threads. The matching !$OMP END PARALLEL DO directive makes clear to the compiler (and the programmer) the extent of the PARALLEL DO directive. It also provides a barrier to ensure that all threads complete their tasks before the child threads are released. In the figure below, the execution stream starts in a serial region, followed by a parallel region using four threads. Upon completion, the child threads are released and serial execution continues until the next parallel region, with two threads, is in effect.

Figure 1.2. OpenMP Programming Model.

For C programs, a similar set of rules applies:

omp_set_num_threads(nthread);   /* requests nthread threads */
#pragma omp parallel for
for (i = 0; i < n; i++) {
   for (j = 0; j < m; j++) {
      . . .
   }
}

Note that in C the parallel for directive applies to the for loop that immediately follows it; the extent of the directive is the loop itself, so no extra curly braces are needed to delimit it.

1.3 Why OpenMP?


For computers with shared-memory architecture, directives that assist the compiler in parallelizing application codes have been available for many years. Almost all major manufacturers of high performance shared-memory multiprocessor computers have their own sets of directives. Unfortunately, the functionality and syntax of these directive sets vary among vendors, which makes code portability (from the viewpoint of directives) practically impossible. Primarily driven by the need of both the high performance computing user community and industry for a standard to ensure code portability across shared-memory platforms, an independent organization, openmp.org, was established in 1996. This organization's charter is to formulate and oversee the establishment and maintenance of the OpenMP standard. As a result, the OpenMP API came into being in 1997. The primary benefit of using OpenMP is the relative ease of code parallelization made possible by the shared-memory architecture.

1.4 Pros and Cons of OpenMP


Pros

- Due to its shared-memory attribute, the programmer need not deal with message passing, which is relatively difficult and potentially harder to debug.
- Generally the bulk of data decomposition is handled automatically by directives; hence, data layout effort is minimal.
- Unlike message passing with MPI or PVM, OpenMP directives or library calls may be incorporated incrementally. Since codes that need parallelism are often large, incremental implementation allows for gradual realization of performance gains instead of having to convert the whole code at once, as is often the case with an MPI code.
- Since directives, as well as OpenMP function calls, are treated as comments when OpenMP invocation is not preferred or not available during compilation, the code is in effect a serial code. This affords a unified code for both serial and parallel applications, which can ease code maintenance.
- Original (serial) code statements need not, in general, be modified when parallelized with OpenMP. This reduces the chance of inadvertently introducing bugs.
- Code size increase due to OpenMP is generally smaller than that due to MPI (or other message passing methods).
- OpenMP-enabled codes tend to be more readable than an equivalent MPI-enabled version. This could, at least indirectly, help in code maintainability.

Cons

- Codes parallelized with OpenMP can only be run in multiprocessor mode on shared-memory environments; this restricts the portability (in the multiprocessing sense) of the programs to distributed-memory environments.
- Requires a compiler that supports OpenMP.
- Because OpenMP codes tend to rely more on parallelizable loops, a relatively high percentage of a code may remain in serial processing mode, which lowers parallel efficiency; by Amdahl's Law, a code that is 10% serial can achieve at best roughly a tenfold reduction in wall clock time, even if a large number of processors is used.
- Codes implemented to run on shared-memory parallel systems, for example those parallelized via OpenMP, are limited by the number of processors available on the respective systems. Conceptually, parallel paradigms such as MPI do not have such a hardware limitation.

The above summary of advantages and disadvantages should be taken in a broad sense, as exceptions do exist. Some classes of embarrassingly parallel codes can be trivial to parallelize with MPI, such as some Monte Carlo codes. On the other hand, while OpenMP's primary model is to target do loops, it can also be used in a data decomposition model, akin to the MPI approach. With coarser-grain parallelism, a higher percentage of a code can then be parallelized to achieve better parallel efficiency.

2 Basics

2.1 Basics


The primary goal of code parallelization is to distribute the code's operations among a number of processors so they can be performed simultaneously. This goal can be elusive since there are usually operations that must be performed in a certain sequence and therefore can't be performed in parallel. Codes where a vast majority of the operations are completely independent are relatively easy to parallelize. Some examples of such codes are ones that perform Monte-Carlo simulations and optimization problems. However, most codes contain a rather intricate combination of independent (parallelizable) and dependent (serial) operations. One common example is the solution of hyperbolic partial differential equations such as the simulation of time-varying phenomena. Since the state of a system at a later time is a function of the state at the previous time, states at various times must be deduced in a sequential fashion. In these cases one must look for other aspects of the algorithm that can be parallelized. In the following sections we will discuss a few basic concepts of code parallelization and introduce our first simple, but powerful, OpenMP directive. Although the amount of material presented in this chapter is limited, it can go quite far in parallelizing many real-world applications in an efficient manner.

2.2 Basics - Approaches to Parallelism


In OpenMP, there are two main approaches for assigning work to threads:

1. loop-level
2. parallel regions

In the first approach, loop-level, individual loops are parallelized with each thread being assigned a unique range of the loop index. This is sometimes called fine-grained parallelism and is the approach taken by many automatic parallelizers. Code that is not within parallel loops is executed serially (on a single thread).

In the second approach, parallel regions, any sections of the code can be parallelized, not just loops. This is sometimes called coarse-grained parallelism. The work within the parallel regions is explicitly distributed among the threads using the unique identifier assigned to each thread. This is frequently done by using if statements, e.g., if(myid == 0) ..., where myid is the thread identifier. At the limit, the entire code can be executed on each thread, as is usually done with message-passing approaches such as MPI.
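To make the contrast concrete, here is a small sketch (not code from the tutorial; the array and variable names are illustrative) showing the same loop handled in each style:

/* Loop-level: the directive parallelizes just this one loop; everything
   outside it runs serially. */
#pragma omp parallel for
for (i = 0; i < n; i++)
   a[i] = b[i] + c[i];

/* Parallel region: the whole block runs on every thread and the work is
   divided explicitly using each thread's identifier. */
#pragma omp parallel private(myid, nthreads, i)
{
   myid     = omp_get_thread_num();
   nthreads = omp_get_num_threads();
   for (i = myid; i < n; i += nthreads)   /* interleave iterations among threads */
      a[i] = b[i] + c[i];
   if (myid == 0)
      printf("loop done\n");              /* work restricted to one thread */
}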

The two approaches are illustrated in the following diagram in which vertical lines represent individual threads. In the loop-level approach, execution starts on a single thread. Then, when a parallel loop is encountered, multiple threads are spawned. When the parallel loop is finished, the extra threads are discarded and the execution is once again serial until the next parallel loop (or the end of the code) is reached. In the parallel-regions approach, multiple threads are maintained, irrespective of whether or not loops are encountered.

The main advantage of loop-level parallelism is that it's relatively easy to implement. It is most effective when a small number of loops perform a large amount of the work in the code. The parallel regions method requires a bit more work to implement, but is more flexible. Any parallelization task performed using the loop-level method can also be implemented using parallel regions by adding logic to the code, but not vice-versa. The disadvantage of the loop-level approach is the overhead incurred in creating new threads at the beginning of a parallel loop and in destroying and resynchronizing threads and data with the master thread at the end of the parallel loop. The amount of overhead incurred depends on the details of how the slave threads are implemented, e.g., as separate OS-level processes or as so-called "light-weight" processes such as pthreads. Parallel regions allow the programmer to exploit data parallelism on a scale larger than the "do loop" by avoiding the need to resynchronize after every loop.

2.3 Data Dependencies


Unfortunately, not all of the operations in a code can be performed simultaneously (except in rare instances). Some operations must wait for the completion of other operations before they can be performed. When an operation depends upon completion of another operation, it is called a data dependency (or sometimes data dependence). A simple data dependency is shown in the following code fragment:

Fortran

do i = 2, 5
   a(i) = a(i) + a(i-1)
enddo

C/C++

for(i=2; i<=5; i++)
   a[i] = a[i] + a[i-1];

In this example, each element i of the 1-dimensional array a is replaced by the sum of the original elements of the array up to index i. Assume that the array a has been initialized with the integers 1-5. The final values of the a array obtained by serial execution of the loop above are shown in the following table:

i      1   2   3   4   5
a(i)   1   3   6  10  15

Consider what might happen if we executed the loop in parallel using 2 threads. Assume the first thread is assigned loop indices 2 and 3, and the second thread is assigned 4 and 5. One possible order of execution is that thread 1 performs the computation for i=4, reading the value of a(3), before thread 0 has completed the computation for i=3, which updates a(3). In this case, the results on each thread are:

Thread 0
a(2) = a(2) + a(1) = 2 + 1 = 3
a(3) = a(3) + a(2) = 3 + 3 = 6

Thread 1
a(4) = a(4) + a(3) = 4 + 3 = 7
a(5) = a(5) + a(4) = 5 + 4 = 9

Comparing the above tables, it's clear that a(4) should be equal to 10, not 7, and a(5) should be equal to 15, not 9. The problem is that the values of a(3) and a(4) were used by thread 1 before the new values were calculated by thread 0. This is the simplest example of a "race condition", in which the result of the operation depends upon the order in which the data is accessed by the threads. There are three simple criteria that, if satisfied, guarantee that there is no data dependency in a loop:

- All assignments are performed on arrays.
- Each element of an array is assigned to by at most one iteration.
- No loop iteration reads array elements modified by any other iteration.

If these criteria are not met, you should carefully examine your loop for data dependency. OpenMP will do exactly what it is instructed to do, and if a loop with a data dependency is parallelized naively, it will give the wrong result. The programmer is responsible for the correctness of the code!
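For contrast, here is a loop (an illustrative example, not from the tutorial) that satisfies all three criteria: only array elements are assigned, each element is assigned by exactly one iteration, and no iteration reads an element written by another iteration, so it can be parallelized safely:

/* no data dependency: iteration i touches only a[i], b[i] and c[i] */
for (i = 0; i < n; i++)
   a[i] = 2.0*b[i] + c[i];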

2.4 PARALLEL DO / PARALLEL FOR


In OpenMP, the primary means of parallelization is through the use of directives inserted in the source code. One of the most fundamental and most powerful of these directives is PARALLEL DO (Fortran) or PARALLEL FOR (C). Here are examples of these directives:

Fortran
!$omp parallel do
do i = 1, n
   a(i) = b(i) + c(i)
enddo

C/C++
#pragma omp parallel for
for(i=1; i<=n; i++)
   a[i] = b[i] + c[i];

2.5 Clauses
There are many situations in which we would like to modify the behavior of directives in some way in order to solve a specific problem or to make the directive more convenient to use. Some directives can be modified using clauses.

Shared vs. Private Variables

Since OpenMP is used on shared-memory systems, all variables in a given loop share the same address space. This means that all threads can modify and access all variables (except the loop index), and sometimes this results in undesirable behavior. Consider the following example:

Fortran
!$omp parallel do
do i = 1, n
   temp = 2.0*a(i)
   a(i) = temp
   b(i) = c(i)/temp
enddo

C/C++
#pragma omp parallel for
for(i=1; i<=n; i++) {
   temp = 2.0*a[i];
   a[i] = temp;
   b[i] = c[i]/temp;
}
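In this example every thread reads and writes the shared scratch variable temp, so one thread can overwrite the value another thread is about to use, a race condition like the one described in Section 2.3. The usual remedy, which the clauses in this section provide, is to give each thread its own copy of temp with the PRIVATE clause. A minimal sketch in C (the Fortran directive takes private(temp) in the same way):

/* each thread gets its own private copy of temp; the loop index i is
   private automatically */
#pragma omp parallel for private(temp)
for(i=1; i<=n; i++) {
   temp = 2.0*a[i];
   a[i] = temp;
   b[i] = c[i]/temp;
}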

2.6 Self Test


Question 1
Which of the following is NOT true? The Parallel Regions approach
- is used to parallelize individual loops only.
- is also sometimes called Coarse-Grained parallelism.
- can emulate message-passing type parallelism, such as that which is used with MPI.
- can be used to parallelize regions of code with loops within them.

Question 2
The following code fragment contains a data dependency:

do i = 1, 10
   x(i) = x(i) + y(i)
enddo

- True
- False

Question 3
The following code fragment contains a data dependency:

for(i=2; i<=10; i++)
   x[i] = x[i] + y[i-1];

- True
- False

Question 4
The default shared/private clause is
- shared
- private

Question 5
Firstprivate
- must be used to qualify the first private variable to be accessed in a loop.
- makes the first access of the specified variable private, with subsequent accesses being shared.
- copies the value of the specified variable(s) from the master thread to all threads.
- causes the default to be private rather than shared.

Question 6
Lastprivate retains the value of the specified variable(s) after the parallel region has finished. The value of the variable is
- the final value on the master thread.
- the value from the last thread to finish the parallel region.
- indeterminate.
- the value that would have been obtained from serial execution.

Question 7
In Fortran, .and., and in C, &&, are allowable reduction operations.
- True
- False

Question 8
The 'ordered' directive causes the affected part of the code to run serially.
- True
- False

3 Compile and Run


Having been introduced to the workhorse PARALLEL DO directive and several associated clauses, you are now armed with a surprisingly powerful set of tools for parallelizing many codes. This being the case, we will take a slight detour here in order to briefly discuss how to compile and run OpenMP codes so you can try out some of these concepts. Compilation is, of course, platform dependent. Here are the compiler flags for several popular platforms:

Platform                Compiler Flag
SGI IRIX                -mp
IBM AIX                 -qsmp=omp
Portland Group linux    -mp
Intel linux             -openmp

Adding the appropriate flag causes the compiler to interpret OpenMP directives, functions, etc. For C/C++ codes, the header file omp.h should be included when using OpenMP functions. So far, we have one way to specify the number of threads, the OMP_NUM_THREADS environment variable. (Later a function will be introduced to do the same thing.) The behavior of the code if you fail to specify the number of threads is platform dependent, and it is good practice to always specify the number of threads. It may be convenient to create a simple run script such as those shown in the following examples.

Example 1
#!/bin/tcsh
setenv OMP_NUM_THREADS 4
mycode my.out
exit

Example 2
#!/bin/tcsh
setenv OMP_NUM_THREADS $1
mycode my.out
exit

In the first case the number of threads is hard-wired, and in the second it's a substitutable argument.
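For instance, with the Intel compiler on Linux the whole cycle might look like the following (the file names are illustrative, not from the tutorial):

icc -openmp mycode.c -o mycode     # compile with OpenMP directives enabled
setenv OMP_NUM_THREADS 4           # request four threads (tcsh)
./mycode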

4 Conditional Compilation
Portability among different platforms is often a concern for writers of large-scale scientific codes. Lines starting with OpenMP sentinels (the first part of all directives, e.g., !$OMP, C$OMP, *$OMP, #pragma omp) are ignored if the code is compiled on a system that doesn't support OpenMP. This allows the same source code to be used to create serial and parallel executables. Conditional compilation is also possible with constructs other than directives. It is handled differently in Fortran and C/C++, so they will be discussed individually.

Fortran

When a Fortran/OpenMP compiler encounters a !$, c$, or *$ sentinel, the two characters are replaced with spaces. When the first character of the sentinel is a comment character, a compiler without OpenMP will simply interpret the line as a comment. This behavior can be used for conditional compilation. Suppose there are lines of code that are to be executed only in parallel versions of the code. If a !$ prefix is added to each such line, it results in conditional compilation:

Fortran
!$ call parallel_stuff(num_threads)

If compiled without OpenMP, this line will simply be interpreted as a comment. With OpenMP, the compiler will replace !$ with spaces, and the line will be compiled as an executable statement.

C/C++

In C and C++, there is a macro name, _OPENMP, that is automatically defined by OpenMP. This can be used for conditional compilation as follows:

C/C++
#ifdef _OPENMP
   parallel_stuff(num_threads);
#endif

5 PARALLEL Directive
Up to this point, we have been examining ways to parallelize loops with the PARALLEL DO (PARALLEL FOR) directive. It is possible to break this directive into separate PARALLEL and DO (FOR) directives. For example, the following parallel loop:

Fortran
!$OMP PARALLEL DO
do i = 1, maxi
   a(i) = b(i)
enddo

C/C++
#pragma omp parallel for
for(i=1; i<=maxi; i++)
   a[i] = b[i];
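The equivalent code with the combined directive broken into its two parts looks like the following sketch (shown in C; the Fortran version uses !$OMP PARALLEL, !$OMP DO, !$OMP END DO and !$OMP END PARALLEL in the same way). The PARALLEL directive creates the team of threads, and the work-sharing FOR (DO) directive distributes the loop iterations among them:

#pragma omp parallel
{
   #pragma omp for
   for(i=1; i<=maxi; i++)
      a[i] = b[i];
}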

6 Basic Functions

6.1 Basic Functions


In this chapter, we will cover three very basic OpenMP library functions: OMP_SET_NUM_THREADS, OMP_GET_NUM_THREADS and OMP_GET_THREAD_NUM. These functions enable us to set the thread count, find out how many threads are in use as well as determine the rank of individual threads. These basic functionalities, together with basic directives such as PARALLEL DO or PARALLEL FOR, are sufficient for many applications.

However, some applications may require more specialized functionalities. For these occasions, the OpenMP library provides additional functions to deal with them. These functions will be introduced in Chapter 10.

6.2 OMP_GET_THREAD_NUM
Returns the thread rank in a parallel region. Note that:
- The rank of threads ranges from 0 to OMP_GET_NUM_THREADS() - 1.
- When invoked in a serial region, this function returns 0, which is the rank of the master thread.

C/C++
#include <omp.h>

int omp_get_thread_num()

Fortran
INTEGER FUNCTION OMP_GET_THREAD_NUM()

Example

Print thread numbers in a parallel region. If four processors are used, the output may look like this:

Thread rank: 2
Thread rank: 0
Thread rank: 3
Thread rank: 1

Note that in general the ranks are not printed in order.

C/C++
#pragma omp parallel
{
   printf("Thread rank: %d\n", omp_get_thread_num());
}

Fortran (replace "C" in column 1 with "!" for F90)


C$OMP PARALLEL
      write(*,*) 'Thread rank: ', OMP_GET_THREAD_NUM()
C$OMP END PARALLEL

6.3 OMP_SET_NUM_THREADS
Sets the number of threads for use in subsequent parallel region(s). Note that:
- The number of threads deployed in a run is determined by the user. There are two ways to do this:
  1. Call OMP_SET_NUM_THREADS prior to the beginning of a parallel region for it to take effect; it can be called as often as needed to dynamically control the thread counts in different parallel regions. The result is undefined if this subroutine is called within a parallel region.
  2. Alternatively, set the environment variable OMP_NUM_THREADS before the run:

     setenv OMP_NUM_THREADS threads     (C shell)
     OMP_NUM_THREADS=threads            (Korn shell)
     export OMP_NUM_THREADS             (Korn shell)

     This method can be employed if the thread count need not change in the entire code.

- The thread count remains fixed until the next call to this subprogram.
- Use of this subprogram to set the number of threads has precedence over the environment variable OMP_NUM_THREADS.
- The number of threads used in a parallel region is guaranteed to be what is set via a call to OMP_SET_NUM_THREADS provided that the thread dynamic status is FALSE. Otherwise, the actual number of threads used at runtime is subject to what is available at the time the parallel region is executed; furthermore, the number of threads used cannot exceed what is set by OMP_SET_NUM_THREADS.

C/C++
#include <omp.h>

void omp_set_num_threads(int num_threads)

num_threads -- Number of threads (input)

Fortran
SUBROUTINE OMP_SET_NUM_THREADS(num_threads)

num_threads -- Number of threads (input)

The example below demonstrates:


- Request threads by OMP_SET_NUM_THREADS.
- Use OMP_GET_NUM_THREADS in a parallel region to see how many threads are active.

C/C++
/* set thread size before entering parallel region */
num_threads = 4;
omp_set_num_threads(num_threads);

#pragma omp parallel
{
   printf("Threads allocated : %d\n", omp_get_num_threads());
}

Fortran (replace "C" in column 1 with "!" for F90)

C     set thread size before entering parallel region
      num_threads = 4
      call OMP_SET_NUM_THREADS(num_threads)
C$OMP PARALLEL
      write(*,*) 'Threads allocated : ', OMP_GET_NUM_THREADS()
C$OMP END PARALLEL

6.4 OMP_GET_NUM_THREADS
Returns the number of threads used in a parallel region. Note that:
- When invoked in a parallel region, this function reports the number of participating threads.
- It returns a value of one (1) when invoked in: (a) a serial region; (b) a nested parallel region that has been serialized, e.g., if nested parallelism is turned off or not implemented by the vendor. See OMP_SET_NESTED in Section 11.1 for details.
- If the thread dynamic status is disabled, the thread count returned by this function is determined by the user's call to the subprogram OMP_SET_NUM_THREADS or by the environment variable OMP_NUM_THREADS.
- If the thread dynamic status is enabled, the thread count returned by this subprogram cannot be larger than what is returned by OMP_GET_MAX_THREADS().

C/C++
int omp_get_num_threads( )

Fortran
integer function omp_get_num_threads( )

Example

C/C++

#include <omp.h>

omp_set_num_threads(num_threads);
#pragma omp parallel
{
   printf("Threads allocated : %d\n", omp_get_num_threads());
}

Fortran (replace "C" in column 1 with "!" for F90)

      call OMP_SET_NUM_THREADS(num_threads)
C$OMP PARALLEL
      write(*,*) 'Threads allocated : ', OMP_GET_NUM_THREADS()
C$OMP END PARALLEL

6.5 Self Test


Question 1
Assuming that multiple threads are active, which one of the following is true?
- omp_get_num_threads() returns the number of active threads.
- omp_get_num_threads() returns the number of threads requested via omp_set_num_threads.
- omp_get_num_threads() returns the number of active threads in a parallel region and 1 otherwise.

Question 2
OMP_GET_THREAD_NUM() returns the current thread's rank number, which
- ranges from 1 to omp_get_num_threads().
- ranges from 0 to omp_get_num_threads()-1.
- is the id of the physical processor.

Question 3
omp_set_num_threads
- has precedence over the environment variable OMP_NUM_THREADS.
- is overridden by OMP_NUM_THREADS.
- like the environment variable OMP_NUM_THREADS, can only be called once in the program to set the number of threads.

Question 4
omp_set_num_threads
- can be called from anywhere in the application program.
- can only be called in a parallel region.
- can only be called in a serial region.

7 Parallel Regions
Discussion to this point has been concerned with methods for parallelizing individual loops. This is sometimes called fine-grained or loop-level parallelism. The term "fine-grained" can be a misnomer, since loops can be large, sometimes encompassing a majority of the work performed in the code, so we will use "loop-level" here.

We saw in Chapter 5, PARALLEL Directive, that the entire region of code between a PARALLEL directive and an END PARALLEL directive in Fortran, or within braces enclosing a parallel region in C, will be duplicated on all threads. This allows more flexibility than restricting parallel regions of code to loops, and one can parallelize code in a manner much like that used with MPI or other message-passing libraries. This approach is called coarse-grained parallelism or parallel regions. We will use the latter term for the same reason discussed above.

In the loop-level approach, domain decomposition is performed automatically by distributing loop indices among the threads. In the parallel regions approach, domain decomposition is performed manually. Starting and ending loop indices are computed for each thread based on the number of threads available and the index of the current thread. The following code fragment shows a simple example of how this works.

Fortran
!$OMP PARALLEL &
!$OMP PRIVATE(myid,istart,iend,nthreads,nper)
      nthreads = OMP_GET_NUM_THREADS()
      nper = imax/nthreads
      myid = OMP_GET_THREAD_NUM()
      istart = myid*nper + 1
      iend = istart + nper - 1
      call do_work(istart,iend)
      do i = istart, iend
         a(i) = b(i)*c(i)
      enddo
!$OMP END PARALLEL

C/C++
#pragma omp parallel \
        private(myid,istart,iend,nthreads,nper)
{
   nthreads = omp_get_num_threads();
   nper = imax/nthreads;
   myid = omp_get_thread_num();
   istart = myid*nper + 1;
   iend = istart + nper - 1;

   do_work(istart,iend);
   for(i=istart; i<=iend; i++)
      a[i] = b[i]*c[i];
}
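Note that the simple istart/iend computation above leaves some iterations unassigned whenever imax is not evenly divisible by the number of threads. A common fix (a sketch, not part of the tutorial's code) is to let the last thread absorb the remainder:

/* give the last thread whatever is left when imax % nthreads != 0 */
if (myid == nthreads - 1)
   iend = imax;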

8 Thread Control

8.1 Thread Control


There are some instances in which additional control over the operation of the threads is required. Sometimes threads must be synchronized, such as at the beginning of serial parts of the code. There are cases in which a task needs to be performed only on one thread, and there are cases in which all threads must perform a task and do it one at a time. Also, you may want to assign specific tasks to specific threads. All of these functions are available in OpenMP using the directives discussed in this section.

8.2 BARRIER
There are instances in which threads must be synchronized. This can be effected through the use of the BARRIER directive. Each thread waits at the BARRIER directive until all threads have reached this point in the source code, and they then resume parallel execution. In the following example, an array a is filled, and then operations are performed on a in the subprogram DOWORK.

Fortran
do i = 1, n
   a(i) = a(i) - b(i)
enddo
call dowork(a)

C/C++
for(i=1; i<=n; i++)
   a[i] = a[i] - b[i];
dowork(a);
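The parallel version might look like the following sketch (not the tutorial's exact code; the per-thread range computation follows the pattern shown in Chapter 7). The BARRIER guarantees that every thread has finished updating its portion of a before dowork is called, here by every thread on the now-complete array:

#pragma omp parallel private(myid, nthreads, nper, istart, iend, i)
{
   nthreads = omp_get_num_threads();
   nper     = n/nthreads;
   myid     = omp_get_thread_num();
   istart   = myid*nper + 1;
   iend     = (myid == nthreads-1) ? n : istart + nper - 1;

   for(i=istart; i<=iend; i++)
      a[i] = a[i] - b[i];

   #pragma omp barrier     /* wait until all of a has been updated      */
   dowork(a);              /* every thread now sees the finished array  */
}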

8.3 MASTER
In a parallel region, one might want to perform a certain task on the master thread only. Rather than ending the parallel region, performing the task, and re-starting the parallel region, the MASTER directive can be used to restrict a region of source code to the master thread. In Fortran, all operations between the MASTER and END MASTER directives are performed on the master thread only. In C, the operations in the structured block (between curly braces) following the MASTER directive are performed on the master thread only. In the following example, an array is computed and written to a file.

Fortran
do i = 1, n
   a(i) = b(i)
enddo
write(21) a
call do_work(1, n)

C/C++
for(i=1; i<=n; i++)
   a[i] = b[i];
/* write array a to a file, then: */
do_work(1, n);
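Inside a parallel region, the same example might be written as in the following sketch (not the tutorial's exact code; write_a is a hypothetical routine standing in for the file write). The work-sharing for distributes the loop, the MASTER directive restricts the write to the master thread, and an explicit barrier keeps the other threads from calling do_work until the file has been written, since MASTER itself has no implied barrier:

#pragma omp parallel
{
   int i;
   #pragma omp for
   for(i=1; i<=n; i++)
      a[i] = b[i];          /* implied barrier at the end of the for     */

   #pragma omp master
   write_a(a);              /* hypothetical file-writing routine; master only */

   #pragma omp barrier      /* MASTER has no implied barrier, so add one */
   do_work(1, n);
}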

8.4 SINGLE
Threads do not execute lines of source code in lockstep. There may be explicit logic in the code assigning different tasks to different threads, and different threads may execute specific lines of source code at different speeds, due to differing cache access patterns for example. The SINGLE directive is similar to the MASTER directive except that the specified region of code will be performed on the thread which is the first to reach the directive, not necessarily the master thread. Also, unlike the MASTER directive, there is an implied barrier at the end of the SINGLE region. Below is a serial example. Note that the routine DO_SOME_WORK has a(1) as its argument, while DO_MORE_WORK performs work on the whole array.

Fortran
do i = 1, n
   a(i) = b(i)
enddo
call do_some_work(a(1))
call do_more_work(a, 1, n)

C/C++
for(i=1; i<=n; i++)
   a[i] = b[i];
do_some_work(&a[1]);
do_more_work(a, 1, n);
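A parallel version using SINGLE might look like the following sketch (not the tutorial's exact code). Whichever thread reaches the SINGLE region first calls do_some_work, and the implied barrier at the end of the region holds the other threads until it finishes:

#pragma omp parallel
{
   int i;
   #pragma omp for
   for(i=1; i<=n; i++)
      a[i] = b[i];

   #pragma omp single
   do_some_work(&a[1]);     /* executed by one (arbitrary) thread;
                               implied barrier releases the rest */

   do_more_work(a, 1, n);   /* all threads participate */
}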

8.5 CRITICAL
The CRITICAL directive is similar to the ORDERED directive in that only one thread executes the specified section of source code at a time. With the CRITICAL directive, however, the threads can perform the task in any order. Suppose we want to determine the maximum value in the 1-dimensional array a. Let the values of the array be computed in the subprogram COMPUTE_A. In Fortran, the intrinsic function MAXVAL, which returns the largest value in its argument, will be used, and it will be assumed that a similar function has been provided in C. The serial code looks like this:

Fortran
call compute_a(a)
the_max = maxval(a)

C/C++
compute_a(a);
the_max = maxval(a);

In parallel, a different section of a will be computed on each thread. The maximum value will then be found for each section, and a global maximum will be computed using the Fortran MAX function, which returns the maximum of a series of scalar arguments. As before, it will be assumed that an analogous function has been written in C.

Fortran

the_max = 0.0
!$omp parallel private(myid, istart, iend)
      call myrange(myid, nthreads, global_start, global_end, istart, iend)
      call compute_a(a(istart:iend))
!$omp critical
      the_max = max( maxval(a(istart:iend)), the_max )
!$omp end critical
      call more_work_on_a(a)
!$omp end parallel

C/C++

the_max = 0.0;
#pragma omp parallel private(myid, istart, iend, nvals)
{
   myrange(myid, nthreads, global_start, global_end, &istart, &iend);
   nvals = iend-istart+1;
   compute_a(&a[istart], nvals);
   #pragma omp critical
   the_max = max( maxval(&a[istart], nvals), the_max );
   more_work_on_a(a);
}

8.6 SECTIONS

There are some tasks which must be performed serially due to data dependencies, calls to serial libraries, input/output issues, etc. If there is more than one such task and they are independent, they can be performed by individual threads at the same time. Note that each task is still only performed by a single thread, i.e., the individual tasks are not parallelized. This configuration can be effected through the use of the SECTION and SECTIONS directives. Suppose we have a code which solves for a field on a computational grid. Two of the first steps are to initialize the field and to check the grid quality. These tasks can be performed through function or subroutine calls:

Fortran
call init_field(field)
call check_grid(grid)

C/C++
init_field(field);
check_grid(grid);

Since these are independent tasks, we would like to perform them in parallel. The SECTIONS directive is used within a parallel region to indicate that the designated block of code (a structured block in C; the region of code between the SECTIONS directive and an END SECTIONS directive in Fortran) will contain a number of individual sections, each of which is to be executed on its own thread. Within the designated region, SECTION directives are used to delimit the individual sections. There is an implied barrier at the end of the SECTIONS region. Be careful to note that the overall region of code which includes sections is designated with the SECTIONS directive (plural), and each individual section is designated with the SECTION directive (singular). Here's the same code fragment using sections:

Fortran
!$omp parallel
!$omp sections
!$omp section
      call init_field(field)
!$omp section
      call check_grid(grid)
!$omp end sections
!$omp end parallel

C/C++
#pragma omp parallel
{
   #pragma omp sections
   {
      #pragma omp section
      init_field(field);
      #pragma omp section
      check_grid(grid);
   }
}

Each of the two tasks will now be performed in parallel on individual threads. In this example, exactly two threads are used irrespective of the number of threads available in the current parallel region. There is also a PARALLEL SECTIONS directive, analogous to PARALLEL DO; it both spawns the threads and begins the sections construct in a single directive.

Fortran
!$omp parallel sections
!$omp section
      call init_field(field)
!$omp section
      call check_grid(grid)
!$omp end parallel sections

C/C++
#pragma omp parallel sections
{
   #pragma omp section
   init_field(field);
   #pragma omp section
   check_grid(grid);
}

8.7 Self Test


Question 1
The barrier directive
- causes a thread to wait until a specified event has occurred before continuing.
- causes threads to execute one at a time (serially).
- causes threads to wait at the barrier directive until all threads have reached that point in the code.
- causes a thread to wait for another specified thread to reach the barrier directive before continuing.

Question 2
The code enclosed by master/end master directives will only execute on the master thread, and will be skipped over by the other threads.
- True
- False

Question 3
The single directive causes the associated code to be executed on one thread at a time.
- True
- False

Question 4
The only difference between the master and single directives is that the master directive specifies that only the master thread will execute the specified region of code, while the single directive allows it to execute on any available thread.
- True
- False

Question 5
The critical directive
- causes the affected region of code to execute in the same way as it would execute on a single thread.
- causes the affected region of code to execute one thread at a time.
- causes the affected region of code to execute on the master thread only.
- causes the affected region of code to execute on an arbitrary thread.

Question 6
In using the sections directive, each section is assigned to a single thread.
- True
- False

9 More Directives

9.1 More Directives


The previous section focused on exploiting loop-level parallelism using OpenMP. This form of parallelism is relatively easy to exploit and provides an incremental approach towards parallelizing an application, one loop at a time. However, since loop-level parallelism is based on local analysis of individual loops, it is limited in the forms of parallelism that it can exploit. A global analysis of the algorithm, potentially including multiple loops as well as other non-iterative constructs, can often be used to parallelize larger portions of an application such as an entire phase of an algorithm. Parallelizing larger and larger portions of an application in turn yields improved speedups and scalable performance.

In the previous sections of this tutorial, we mentioned the support provided in OpenMP for moving beyond loop-level parallelism. For example, we discussed the generalized parallel region construct to express parallel execution. Rather than being restricted to a loop as with the PARALLEL DO construct discussed previously, this construct is attached to an arbitrary body of code that is executed concurrently by multiple threads. This form of replicated execution, with the body of code executing in a replicated fashion across multiple threads, is commonly referred to as "SPMD" style parallelism, for "single-program multiple-data."

Some clauses which modify the PARALLEL DO directive were introduced, such as PRIVATE, SHARED, DEFAULT, and REDUCTION. They will continue to provide exactly the same behavior for the PARALLEL construct as they did for the PARALLEL DO construct. In the following sections we will discuss a few more directives: THREADPRIVATE, COPYIN, ATOMIC, and FLUSH.

9.2 THREADPRIVATE

9.2.1 THREADPRIVATE


A parallel region may include calls to other subprograms such as subroutines or functions. The lexical or static extent of a parallel region is defined as the code that is lexically within the PARALLEL/END PARALLEL directive. The dynamic extent of a parallel region includes not only the code that is directly within the PARALLEL/END PARALLEL directive (the static extent), but also all the code in subprograms that are invoked either directly or indirectly from within the parallel region. This distinction is illustrated in the figure below.

Figure 1: Code illustrating the difference between the static (lexical) and dynamic extents. The static extent includes the region highlighted in yellow. The dynamic extent includes both of the highlighted regions.

The importance of this distinction is that the data scoping clauses apply only to the lexical extent of a parallel region and not to the entire dynamic extent of the region. For variables which are global in scope (e.g., common block variables in Fortran and global variables in C/C++), references from within the lexical extent of a parallel region are affected by the data scoping clause (such as PRIVATE) on the parallel directive. However, references to such global variables from the portion of the dynamic extent outside of the lexical extent are not affected by any of the data scoping clauses and always refer to the global shared instance of the variable. Hence, if a global variable is declared private, references to it from the static extent of a parallel region and from the portion of the dynamic extent outside the static extent may not refer to the same memory location. This choice was made to simplify the implementation of the data scoping clauses.

A simple way to control the scope of such variables is to pass them as arguments of the subroutine or function being referenced. By passing them as arguments, all references to the variables now refer to the private copy of the variables within the parallel region. While the problem can be solved this way, it is often cumbersome when the common blocks appear in several subprograms or when the list of variables in the common blocks is lengthy. The OpenMP THREADPRIVATE directive provides an easier method that does not require modification of argument lists. The syntax and specification for this directive are discussed in the following two sub-sections.

9.2.2 Specification
The THREADPRIVATE directive tells the compiler that a common block (or global variables in C/C++) is private to each thread. A private copy of each common block marked as threadprivate is created for each thread, and within each thread all references to variables within that common block anywhere in the entire program refer to the variable instance within the private copy. Threads cannot refer to the private instance of the common block belonging to another thread. The important distinction between the PRIVATE clause and the THREADPRIVATE directive is that the threadprivate directive affects the scope of the variable within the entire program, not just within the lexical scope of a parallel region.

The THREADPRIVATE directive is provided after the declaration of the common block (global variable in C/C++) within a subprogram unit, not in the declaration of a parallel region. Moreover, the THREADPRIVATE directive must be supplied after the declaration of the common block in every subprogram unit that references the common block. Variables from threadprivate common blocks cannot appear in any other data scope clauses, nor are they affected by the DEFAULT(SHARED) clause. Thus, it is safe to use the default(shared) clause even when threadprivate common block variables are being referenced in the parallel region.

How are threadprivate variables initialized? And what happens to them when the program moves from a parallel region to a serial region and back? When the program begins execution, the only executing thread is the master thread, which has its own private copies of the threadprivate common blocks. As in the serial case, these blocks can be initialized by block data statements (Fortran) or by providing initial values with the definition of the variables (C/C++). When the first parallel region is entered, the slave threads get their own copies of the common blocks, and the slave copies are initialized via the block data or initial value mechanisms. Any changes to common block variables made by executable statements within the master thread before that point are not reflected in the slave copies. When the first parallel region exits, the slave threads stop executing, but they do not go away. Rather, they persist, retaining the states of their private copies of the common blocks for when the next parallel region is entered. There is one exception to this: if the user modifies the number of threads through an OpenMP runtime library call, then the common blocks are reinitialized.

9.2.3 Syntax and Sample


The syntax of the threadprivate directive is:

Fortran
!$omp threadprivate(/blk1/[, /blk2/]...)

C/C++
#pragma omp threadprivate(varlist)

where blk1, blk2, etc. are the names of common blocks to be made threadprivate. Note, threadprivate common blocks must be named common blocks. Unnamed or "blank" common blocks cannot be threadprivate. varlist is a list of named file scope or namespace scope variables. Sample Code in Fortran:

      program thrd_pvt_example
      integer ibegin, iend, iblock
      integer iarray(5000), jarray(5000), karray(5000)
      integer N
      integer nthreads, ithread
      integer omp_get_num_threads, omp_get_thread_num
      common /DO_LOOP_BOUNDS/ ibegin, iend
!$omp threadprivate(/DO_LOOP_BOUNDS/)
      N = 5000
!$omp parallel private(nthreads, ithread, iblock)
      nthreads = omp_get_num_threads()
      ithread = omp_get_thread_num()
      iblock = (N+nthreads-1)/nthreads
      ibegin = ithread*iblock + 1
      iend = min((ithread+1)*iblock, N)
      call add_array(iarray, jarray, karray)
!$omp end parallel
      end

      subroutine add_array(iarray, jarray, karray)
      common /DO_LOOP_BOUNDS/ ibegin, iend
!$omp threadprivate(/DO_LOOP_BOUNDS/)
      integer iarray(5000), jarray(5000), karray(5000)
      do i = ibegin, iend
         iarray(i) = jarray(i) + karray(i)
      enddo
      return
      end

9.3 COPYIN
Specification

As mentioned previously, initialization of threadprivate data occurs when the slave threads are started for the first time (i.e., in the first parallel region) and then only by means of block data statements (Fortran) or initialization in the declaration statement (C/C++). OpenMP also provides limited support for another kind of initialization at the beginning of a parallel region via the COPYIN clause. The COPYIN clause allows a slave thread to read in a copy of the master thread's threadprivate variables.

The COPYIN clause is supplied along with a parallel directive to initialize a threadprivate variable or set of threadprivate variables within a slave thread to the values of the threadprivate variables in the master thread's copy at the time that the parallel region starts. It takes as arguments either a list of variables from a THREADPRIVATE common block or names of entire THREADPRIVATE common blocks. The COPYIN clause is useful when the threadprivate variables are used for temporary storage within each thread but still need initial values that are either computed or read from an input file by the master thread.

Syntax

The syntax of the COPYIN clause is
copyin (list)

where list is a comma-separated list of names of either threadprivate common blocks or individual threadprivate common block variables (Fortran), or of file scope or global threadprivate variables (C/C++). In Fortran, the names of threadprivate common blocks appear between slashes.

Fortran Example:

      program copyinexample
      integer N
      common /blk/ N
!$omp threadprivate(/blk/)
      N = 5000
!$omp parallel copyin(N)
      ! Slave's copy of N initialized to 5000.
      ! Use N or modify N.
      N = N + 1
      print *, "slave thread:", N
!$omp end parallel
!$omp parallel
      ! Initial value of the slave's copy of N is whatever
      ! it was at the end of the previous parallel region.
      ! Use N or modify N.
      N = N + 1
!$omp end parallel
      N = 10000
      print *, "master thread:", N
!$omp parallel copyin(N)
      ! Slave's copy of N initialized to 10000.
      ! Use N or modify N.
      print *, "slave thread:", N
!$omp end parallel
      end

9.4 ATOMIC
Specification

The ATOMIC directive may be regarded as a special case of the CRITICAL directive. It provides exclusive access by a single thread to a shared variable. Unlike the CRITICAL directive, which can enclose an arbitrary block of code, the ATOMIC directive applies only to a critical section that consists of a single assignment statement updating a scalar variable. This directive is provided to take advantage of the hardware support provided by most modern multiprocessors for atomically updating a single location in memory. This hardware support typically consists of special machine instructions used for performing common operations such as incrementing loop variables. These hardware instructions have the property that they maintain exclusive access to this single memory location for the duration of the update. Because the full overhead of the locking mechanism provided by a critical section is not needed, these primitives can greatly improve performance.

One restriction on the use of the ATOMIC directive is that within sections of the program that will be running concurrently, one cannot use a mixture of the ATOMIC and CRITICAL directives to provide mutually exclusive access to a shared variable. Instead, you must consistently use one of the two directives. In regions of the program that will not run concurrently, you are free to use different mechanisms.

Syntax

The syntax for the atomic directive is:

Fortran
!$omp atomic
      x = x operator expr
......
!$omp atomic
      x = intrinsic(x, expr)

C/C++
#pragma omp atomic
x <binop>= expr
......
#pragma omp atomic
x++;    /* or ++x, x--, or --x */

where x is a scalar variable of an intrinsic type, operator is one of a set of pre-defined operators (including most arithmetic and logical operators), intrinsic is one of a set of predefined intrinsic functions (including min, max, and logical intrinsics), and expr is a scalar expression that does not reference x. The following table provides a complete list of operators and intrinsics in the Fortran and C/C++ languages.

Language    Operators and Intrinsics
Fortran     +, *, -, /, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, IEOR
C/C++       +, *, -, /, &, ^, |, <<, >>
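As a simple illustration (a sketch, not from the tutorial; the variable names are made up), ATOMIC can protect a single shared counter that many threads increment:

#include <omp.h>
#include <stdio.h>

int main(void)
{
   int i, n = 1000, count = 0;

   #pragma omp parallel for
   for (i = 0; i < n; i++) {
      if (i % 3 == 0) {
         #pragma omp atomic
         count++;          /* one scalar update, performed atomically */
      }
   }
   printf("count = %d\n", count);
   return 0;
}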

9.5 FLUSH
Specification

The FLUSH directive in OpenMP is used to identify a sequence point in the execution of the program at which the executing threads need to have a consistent view of memory. All memory accesses (reads and writes) that occur before the FLUSH must be completed before the sequence point, and all memory accesses that occur after the FLUSH must occur after the sequence point. For example, threads must ensure that all registers are written to memory and that all write buffers are flushed, ensuring that any shared variables are made visible to other threads. After the FLUSH, a thread must assume that shared variables may have been modified by other threads and read back all data from memory prior to using it. The FLUSH directive is useful for the writing of client-server applications.

The FLUSH directive is implied by many directives except when the NOWAIT clause is used. For example, some of the directives for which a FLUSH directive is implied include:
- BARRIER
- CRITICAL and END CRITICAL
- ORDERED and END ORDERED
- PARALLEL and END PARALLEL
- PARALLEL DO and END PARALLEL DO

By default, a FLUSH directive applies to all variables that could potentially be accessed by another thread. However, rather than applying to all shared variables, the user can also choose to provide an optional list of variables with the FLUSH directive. In this case, only the named variables are required to be synchronized. This allows the compiler to optimize performance by rearranging operations involving the variables that are not named.

In summary, the FLUSH directive does not, by itself, perform any synchronization. It only provides memory consistency between the executing thread and global memory, and it must be used in combination with other read/write operations to implement synchronization between threads.

Syntax

The syntax for the flush directive is as follows:

Fortran
!$omp flush [(list)]

C/C++
#pragma omp flush [(list)]

where list is an optional list of variable names.
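As an illustration of how FLUSH can be combined with ordinary reads and writes to synchronize two threads (a sketch under simplifying assumptions, not code from the tutorial), one thread can publish a value and another can wait for it:

#include <omp.h>
#include <stdio.h>

int main(void)
{
   int data = 0, flag = 0;

   #pragma omp parallel sections shared(data, flag)
   {
      #pragma omp section
      {                              /* producer */
         data = 42;                  /* write the payload first          */
         #pragma omp flush(data)
         flag = 1;                   /* then publish the "ready" signal  */
         #pragma omp flush(flag)
      }
      #pragma omp section
      {                              /* consumer */
         int ready = 0;
         while (!ready) {            /* spin until the signal is visible */
            #pragma omp flush(flag)
            ready = flag;
         }
         #pragma omp flush(data)     /* re-read memory before using data */
         printf("data = %d\n", data);
      }
   }
   return 0;
}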

9.6 Self Test


Question 1
The threadprivate directive is used to identify a common block as:
- being private to each thread.
- being private to the master thread.
- being private to the slave thread.
- being private to the specifically declared thread.

Question 2
Through the copyin clause, which thread can get access to which other thread's copy of threadprivate data?
- master thread to slave thread.
- slave thread to master thread.
- each thread to all of the other threads.
- None of the above.

Question 3
In C/C++, which of the following operators can not be used with the atomic directive?
- ++
- --
- %
- |

Question 4
Does the flush directive perform any synchronization by itself?
- Yes
- No

10 More Functions

10.1 More Functions


In Lesson 6, Basic Functions, we introduced three basic functions: OMP_SET_NUM_THREADS, OMP_GET_NUM_THREADS, and OMP_GET_THREAD_NUM. In this chapter, we will cover the following OpenMP library functions:

- OMP_SET_DYNAMIC provides the programmer the option to allow the runtime system to adjust the number of threads dynamically based on availability.
- OMP_GET_DYNAMIC reports the status of dynamic adjustment.
- OMP_GET_MAX_THREADS returns the maximum number of threads that may be available for use in a parallel region.
- OMP_GET_NUM_PROCS returns the number of processors available on the system.
- OMP_IN_PARALLEL reports whether the code region is parallel or serial.

10.2 OMP_SET_DYNAMIC
Sets the thread dynamic status. On a multi-user, time-sharing computer, when a user requests more threads than there are processors, an individual processor may have to handle the work load of multiple threads. Consequently, while the job still gets done, the parallel performance suffers. To solve this problem, the OpenMP omp_set_dynamic function provides the capability to permit automatic adjustment of the number of threads to match available processors. For application programs that depend on a fixed number of threads, the dynamic threading feature can be turned off to ensure that the preset thread count is used. Note that:
- For Fortran, this function admits a logical value (true or false). For C, it takes nonzero or 0.
- Upon setting the status to ".true." (or "nonzero" for C), the runtime system is free to adjust the number of threads used in a parallel region, conditioned upon thread availability. The number of threads used, however, can be no larger than OMP_GET_MAX_THREADS(). On the other hand, if the dynamic status is set to ".false.", then the parallel region uses the requested number of threads determined by a call to OMP_SET_NUM_THREADS.
- Alternatively, the dynamic adjustment can also be controlled by the environment variable OMP_DYNAMIC. If a specific number of threads must be used in a parallel region, disabling the dynamic control ensures that the requested number of threads will be used.

C/C++

#include <omp.h>

void omp_set_dynamic(int dynamic_threads)

Fortran
subroutine omp_set_dynamic(dynamic_threads)
logical dynamic_threads

The example below demonstrates these functions:

- Set the thread dynamic status to ".true." ("nonzero" for C).
- Query the thread dynamic status.

C/C++
omp_set_dynamic(1);
printf(" The status of thread dynamic is %d\n", omp_get_dynamic());

Fortran (replace "C" in column 1 with "!" for F90)

      call omp_set_dynamic(.true.)
      write(*,*) 'The status of thread dynamic is : ', omp_get_dynamic()

10.3 OMP_GET_DYNAMIC
Returns dynamic thread status. Note that:
- In Fortran, this function returns a logical value (true or false). In C, it returns 1 or 0.
- If the status is ".true." (or "1" for C), the runtime system is free to adjust the number of threads used in a parallel region, conditioned upon thread availability. However, the number of threads used can be no larger than OMP_GET_MAX_THREADS(). On the other hand, if the dynamic status is ".false.", then the parallel region uses the requested number of threads determined by a call to OMP_SET_NUM_THREADS or the OMP_NUM_THREADS environment variable.
- The dynamic status can be set either by calling OMP_SET_DYNAMIC or by setting the environment variable OMP_DYNAMIC.

C/C++

#include <omp.h>

int omp_get_dynamic()

Fortran
LOGICAL FUNCTION omp_get_dynamic()

The example below demonstrates these functions:


- Set the thread dynamic status to ".true." ("1" for C).
- Query the thread dynamic status.

C/C++

omp_set_dynamic(1);
printf(" The status of thread dynamic is %d\n", omp_get_dynamic());

Fortran (replace "C" in column 1 with "!" for F90)


      call omp_set_dynamic(.true.)
      write(*,*) 'The status of thread dynamic is : ', omp_get_dynamic()

10.4 OMP_GET_MAX_THREADS
Returns the maximum number of threads that may be available for use in a parallel region. Note that:
- This function can be used in serial and parallel regions.
- OMP_GET_NUM_THREADS() <= OMP_GET_MAX_THREADS() if the thread dynamic status is ".true." (or 1 for C).
- OMP_GET_NUM_THREADS() = OMP_GET_MAX_THREADS() if the thread dynamic status is ".false." (or 0 for C).

C/C++

#include <omp.h>

int omp_get_max_threads()

Fortran:
INTEGER FUNCTION omp_get_max_threads()

The example below demonstrates these functions:


- Query the thread dynamic status.
- Query the maximum thread count in a serial region.
- Query the maximum thread count in a parallel region.

C/C++
printf(" The thread dynamic status is : %d\n", omp_get_dynamic());
printf(" In a serial region; max threads are : %d\n", omp_get_max_threads());
#pragma omp parallel
{
   printf(" In a parallel region; max threads are : %d\n", omp_get_max_threads());
}

Fortran (replace "C" in column 1 with "!" for F90)

      write(*,*) 'The thread dynamic status is : ', omp_get_dynamic()
      write(*,*) 'In a serial region, max threads are : ', omp_get_max_threads()
C$OMP PARALLEL
      write(*,*) 'In a parallel region, max threads are : ', omp_get_max_threads()
C$OMP END PARALLEL

10.5 OMP_GET_NUM_PROCS
Queries the number of processors available on the system. Note that:
- This function can be used in serial and parallel regions.
- You may or may not be able to request as many threads as reported by OMP_GET_NUM_PROCS(), as further restrictions, such as batch queue process limits, may apply.

C:
#include <omp.h>
int omp_get_num_procs()

Fortran:
INTEGER FUNCTION omp_get_num_procs()

The example below demonstrates these functions:


- In a serial region, use OMP_IN_PARALLEL to confirm that the region is serial.
- Use OMP_GET_NUM_PROCS in a serial region to see how many processors are available on the system.
- In a parallel region, use OMP_IN_PARALLEL to confirm that the region is indeed parallel.
- Use OMP_GET_NUM_PROCS in a parallel region to see how many processors are available on the system.

C/C++

printf("In parallel region (0)? %d\n", omp_in_parallel());
printf("Processors available in system : %d\n", omp_get_num_procs());
#pragma omp parallel
{
   printf("In parallel region (1)? %d\n", omp_in_parallel());
   printf("Processors available in system : %d\n", omp_get_num_procs());
}

Fortran (replace "C" in column 1 with "!" for F90)

      write(*,*)'In parallel region (F)? ', omp_in_parallel()
      write(*,*)'Processors available in system : ', omp_get_num_procs()
C$OMP PARALLEL
      write(*,*)'In parallel region (T)? ', omp_in_parallel()
      write(*,*)'Processors available in system : ', omp_get_num_procs()
C$OMP END PARALLEL

10.6 OMP_IN_PARALLEL
Returns a value indicating whether the region (code segment) in question is parallel or serial. A region is parallel when enclosed by the parallel directive. A region is serial otherwise. Note that:
- For Fortran, this function returns a logical value (true or false); for C, it returns an int of 1 or 0.
- As expected, in a parallel region executed by only one thread, OMP_IN_PARALLEL returns "false" (for Fortran) or "0" (for C).

C/C++
#include <omp.h>
int omp_in_parallel()

Fortran:
LOGICAL FUNCTION OMP_in_parallel()

The example below demonstrates these functions:


- Request threads with OMP_SET_NUM_THREADS.
- In a serial region, use OMP_IN_PARALLEL to confirm that the region is indeed serial.
- Use OMP_IN_PARALLEL in a parallel region to verify that it reports "true" for Fortran and "1" for C.

C/C++
omp_set_num_threads(num_threads);
printf("In parallel region (0)? %d\n", omp_in_parallel());
#pragma omp parallel
{
   printf("In parallel region (1)? %d\n", omp_in_parallel());
}

Fortran (replace "C" in column 1 with "!" for F90)

      call OMP_SET_NUM_THREADS(num_threads)
      write(*,*)'In parallel region (F)? ', OMP_IN_PARALLEL()
C$OMP PARALLEL
      write(*,*)'In parallel region (T)? ', OMP_IN_PARALLEL()
C$OMP END PARALLEL

10.7 Self Test


Question 1
omp_in_parallel:
- returns an integer for Fortran and an int for C.
- returns a logical for Fortran and an int for C.
- must be called in a parallel region.

Question 2
omp_get_num_procs():
- must be called from a parallel region.
- can be called from parallel regions or serial regions.
- must only be called from a serial region.

Question 3
omp_get_max_threads:
- returns a value set by a call to omp_get_num_threads.
- is the same as omp_get_num_procs.
- is the same as omp_get_num_threads.

Question 4
omp_get_dynamic:
- returns the status of dynamic memory allocation.
- returns an integer for both Fortran and C.
- returns the dynamic status of threads.

11 Nested Parallelism 11.1 Nested Parallelism


Often, application codes have nested do/for loops. At times, individual loops may have loop counts too small to make parallel processing of them efficient. Grouped together, however, these loops may present potential efficiency gains from parallelism (if they are parallelizable, of course). Nested parallelism, as the phrase implies, is the feature in OpenMP that deals with multiple levels of parallelism. If nested parallelism is implemented in the OpenMP API and is enabled by the user, multiple levels of nested loops or parallel regions are executed in parallel. At present, nested parallelism has not been implemented by any vendor; this topic is covered only for completeness.

11.2 OMP_SET_NESTED
This subprogram enables or disables nested parallelism; its only argument is set to .true. or .false. for Fortran, or 1 or 0 for C. Note that:
- This feature has not been implemented by any vendor; only a single level of parallel do loops (Fortran) or for loops (C) is permitted. On such machines, the user's action of enabling nested parallelism may either be ignored (e.g., the IBM compiler) or result in a warning message that the feature is not implemented (e.g., the SGI compiler).
- Alternatively, nested parallelism can be enabled through the environment variable OMP_NESTED before a run, as follows:

  setenv OMP_NESTED TRUE          (C shell)
  OMP_NESTED=TRUE                 (Korn shell)
  export OMP_NESTED               (Korn shell)

C/C++
#include <omp.h>
void omp_set_nested(int nested)

Fortran
subroutine omp_set_nested(nested)
logical nested

- NESTED (logical, input) -- .true. to enable nested parallelism; .false. otherwise.

Example

C/C++

omp_set_nested(1);   /* enables nested parallelism */
printf("Status of nested parallelism is %d\n", omp_get_nested());

Fortran

call omp_set_nested(.true.)   ! enables nested parallelism
write(*,*)'Status of nested parallelism is : ', omp_get_nested()

11.3 OMP_GET_NESTED
This function returns the status of the nested parallelism setting. Note that:
- As of September 2001, no vendor has implemented the nested parallelism feature; the query function always returns false (or 0 for C), even if the user enables it explicitly.

C/C++
#include <omp.h>
int omp_get_nested()

Fortran
logical function omp_get_nested()

Example

C/C++

omp_set_nested(1);   /* enables nested parallelism */
printf("Status of nested parallelism is %d\n", omp_get_nested());

Fortran

call omp_set_nested(.TRUE.)   ! enables nested parallelism
write(*,*)'Status of nested parallelism is : ', omp_get_nested()
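Putting the two routines together, the sketch below shows what a two-level nested region could look like on an implementation that actually supports nesting. The thread counts, function name, and printed messages are illustrative only and not part of the OpenMP API; on implementations that ignore nesting, the inner region simply runs on one thread.

#include <stdio.h>
#include <omp.h>

void nested_demo(void)
{
   omp_set_nested(1);                       /* request nested parallelism */

   #pragma omp parallel num_threads(2)      /* outer team of 2 threads */
   {
      int outer = omp_get_thread_num();

      #pragma omp parallel num_threads(2)   /* inner team of 2 threads each,
                                               only if nesting is supported */
      {
         printf("outer thread %d, inner thread %d\n",
                outer, omp_get_thread_num());
      }
   }
}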

12 LOCKS
We have seen how the CRITICAL, MASTER, SINGLE, and ORDERED directives can be used to control the execution of a single block of code. The SECTION directive can be used to control the parallel execution of different blocks of code, but the number of threads is restricted to the number of sections. If additional control over the parallel execution of different blocks of code is required, OpenMP offers a set of LOCK routines.

LOCK routines operate very much as the name implies: a given thread takes "ownership" of a lock, and no other thread can execute a specified block of code until the lock is relinquished, i.e., the other threads are "locked out" until the lock is "opened." One useful application of locks arises when a code performs a time-consuming serial task; using locks, other useful work can be done by the other processors while the serial task is processing.

A name must be declared for each lock. In Fortran, the name must be an integer large enough to hold an address; for 64-bit addresses, it can be declared as INTEGER*8 or INTEGER(SELECTED_INT_KIND(18)). In C or C++, the lock name must be declared with type omp_lock_t (which is defined in the omp.h header file). Every lock routine has a single argument, which is the lock name in Fortran, or a pointer to the lock name in C or C++.

Before using a lock, the lock name must be initialized through the OMP_INIT_LOCK subroutine (Fortran) or function (C/C++):

C/C++
omp_init_lock(&mylock);

Fortran

call omp_init_lock(mylock)

where mylock is the lock name. Similarly, when the lock is no longer needed it should be destroyed using OMP_DESTROY_LOCK:

C/C++
omp_destroy_lock(&mylock);

Fortran
call omp_destroy_lock(mylock)

A thread gains ownership of a lock by calling

C/C++
omp_set_lock(&mylock);

Fortran
call omp_set_lock(mylock)

If OMP_SET_LOCK is called by a thread and a different thread already has ownership of the specified lock, the calling thread will remain blocked at the call until the lock becomes available. The companion function to OMP_SET_LOCK is OMP_UNSET_LOCK, which releases ownership of the specified lock:

C/C++
omp_unset_lock(&mylock);

Fortran
call omp_unset_lock(mylock)
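As a minimal sketch of the set/unset pattern (do_local_work and the variable names here are hypothetical, not part of OpenMP), each thread must acquire the lock before adding its contribution to the shared total, so the updates cannot interleave:

#include <omp.h>

double do_local_work(void);           /* hypothetical per-thread task */

double locked_sum(void)
{
   omp_lock_t sumlock;
   double total = 0.0;

   omp_init_lock(&sumlock);
   #pragma omp parallel
   {
      double my_part = do_local_work();

      omp_set_lock(&sumlock);         /* wait here until the lock is free  */
      total += my_part;               /* only one thread updates at a time */
      omp_unset_lock(&sumlock);
   }
   omp_destroy_lock(&sumlock);
   return total;
}

For a simple accumulation like this, a REDUCTION clause or a CRITICAL section would normally be preferred; the lock calls are shown only to make the pattern concrete.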

There is one additional lock routine, OMP_TEST_LOCK, which is related to OMP_SET_LOCK:

C/C++
did_it_set = omp_test_lock(&mylock);

Fortran
did_it_set = omp_test_lock(mylock)

In Fortran this is a logical function, not a subroutine like the other lock routines; in C/C++ it is an integer function rather than a void function like the others. The OMP_TEST_LOCK routine is like OMP_SET_LOCK in that the calling thread takes ownership of the specified lock if it is available. However, if the lock is currently owned by another thread, the code continues to the next line rather than blocking to wait for the lock. The function's return value indicates whether or not the lock was acquired: in Fortran, .true. is returned if the calling thread successfully took ownership of the lock and .false. if the lock was owned by a different thread; in C/C++, the function returns a non-zero integer if the lock was successfully set and zero if the lock was owned by a different thread.

Example

Below is an example of the use of the lock routines. One thread performs a long serial task. While it is doing so, the other threads perform a parallel task. The routine that performs the parallel task takes an index as its argument so that each time it is called it can restart from wherever it left off in the previous call. An example of such a task could be searching a database.

C/C++
omp_init_lock(&mylock);
#pragma omp parallel private(index)
{
   if (omp_test_lock(&mylock)) {
      long_serial_task();
      omp_unset_lock(&mylock);
   } else {
      while (!omp_test_lock(&mylock))
         short_parallel_task(index);
      omp_unset_lock(&mylock);
   }
}
omp_destroy_lock(&mylock);

Fortran
      call OMP_INIT_LOCK(mylock)
!$OMP PARALLEL PRIVATE(index)
      if (OMP_TEST_LOCK(mylock)) then
         call long_serial_task
         call OMP_UNSET_LOCK(mylock)
      else
         do while (.not. OMP_TEST_LOCK(mylock))
            call short_parallel_task(index)
         end do
         call OMP_UNSET_LOCK(mylock)
      endif
!$OMP END PARALLEL
      call OMP_DESTROY_LOCK(mylock)

A lock called "mylock" is first initialized, and then a PARALLEL directive spawns multiple threads. This is followed by a call to OMP_TEST_LOCK. Whichever thread reaches this line first will find the lock to be free and will take ownership of it; this thread then goes on to perform the long serial task. The remaining threads repeatedly check the lock and perform the short parallel task as long as the lock is still owned by another thread.

As soon as the long serial task has been completed, the thread performing it releases the lock and goes to the end of the parallel block, where there is an implied barrier. Each of the other threads will take ownership of the lock, perform its final short parallel task, unset the lock, and go to the end of the parallel block. (The "final short parallel task" is performed in order to keep this example as simple as possible; in practice, if this task is not required, it could be bypassed with additional logic.) Finally, once all threads have synchronized at the implied barrier, the lock is destroyed.

13 SCHEDULE 13.1 SCHEDULE


The way in which iterations of a parallel loop are distributed among the executing threads is called the loop's SCHEDULE. In OpenMP's default scheduling scheme, the executing threads are assigned nearly equal numbers of iterations. If each iteration contains approximately the same amount of work, the threads will finish the loop at about the same time; this situation is called load-balanced and yields optimal performance.

In some cases, different iterations of a loop may perform different amounts of work. When threads are assigned differing amounts of work, the load is said to be unbalanced. In the example below, each iteration of the loop calls one of the subroutines FAST or SLOW depending on the value of y. If the iterations assigned to each thread have very different proportions of FAST and SLOW calls, then the speed with which the threads complete their work will vary considerably. The threads that complete earlier do no useful work while they wait for the slower threads to catch up, so the performance is not optimal.

Fortran

!$omp parallel do private(y)
      do i = 1, n
         y = f(i)
         if (y .lt. 0.5e0) then
            call fast(x(i))
         else
            call slow(x(i))
         endif
      enddo

If the work done by each iteration of the loop varies in some systematic fashion, then it may be possible to speed up the execution of the loop by changing its schedule. In OpenMP, iterations are assigned to threads in contiguous ranges called chunks. By controlling how these chunks are assigned to threads, either dynamically or in some static fashion, and the number of iterations per chunk, the so-called chunk size, a scheduling scheme attempts to balance the work across threads.

A schedule is specified by a SCHEDULE clause on the PARALLEL DO or DO directive, or it may be optionally specified by an environment variable. In the next section, we describe each of the options OpenMP provides for scheduling. We focus on how each schedule assigns iterations to threads and on the overhead each schedule imposes, and we provide guidelines for choosing an appropriate schedule.

Syntax

The syntax of a schedule clause is


schedule(type[, chunk_size])

Type is one of static, dynamic, guided or runtime. If it is present, chunk_size must be a scalar integer value. The kind of schedule specified by the schedule clause depends on the combination of the type and optional chunk_size parameter. If no schedule clause is specified, the choice of schedule is implementation dependent. The various types are discussed in the following sections and then summarized in a table.
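For example, the following sketch attaches a dynamic schedule to a work-sharing loop in C; the chunk size of 4 and the function and variable names are arbitrary choices for illustration:

void scale(double *x, int n)
{
   int i;
   /* Hand out chunks of 4 consecutive iterations to whichever
      thread asks for work next (dynamic schedule, chunk size 4). */
   #pragma omp parallel for schedule(dynamic, 4)
   for (i = 0; i < n; i++)
      x[i] = 2.0 * x[i];
}

In Fortran, the same clause would appear on the !$OMP PARALLEL DO (or !$OMP DO) directive.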

13.2 Static
In a static schedule, each thread is assigned a fixed number of chunks to work on. If the type is static and the chunk_size parameter is not present, then each thread is given a single chunk of iterations to perform. The runtime system attempts to make the chunks as equal in size as possible, but the precise assignment of iterations to threads is implementation dependent. For example, if the number of iterations is not evenly divisible by the number of threads, the remaining iterations may be distributed among the threads in any suitable fashion. This kind of schedule is called "simple static."

If the type is static and the chunk_size parameter is present, iterations are divided into chunks of size chunk_size until fewer than chunk_size iterations are left. The remaining iterations are divided into chunks in an implementation-dependent fashion. Threads are then assigned chunks in a round-robin fashion: the first thread gets the first chunk, the second thread gets the second chunk, and so on, until no more chunks remain. This kind of schedule is called "interleaved."

The simple static scheme is appropriate if the work per iteration is nearly equal. The interleaved scheme may be useful if the work per iteration varies systematically; for example, if the work per iteration increases monotonically, then an interleaved scheme will more evenly distribute work among the threads, but at a cost of a small amount of additional overhead. Simple static scheduling is usually the default.
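The two static variants might be written as follows. This is a sketch only: f, a, i, and n stand in for whatever the real loop uses, and the chunk size of 10 is arbitrary.

double f(int i);                  /* hypothetical per-iteration work */

void static_examples(double *a, int n)
{
   int i;

   /* Simple static: each thread gets one contiguous block of
      roughly n/P iterations. */
   #pragma omp parallel for schedule(static)
   for (i = 0; i < n; i++)
      a[i] = f(i);

   /* Interleaved: chunks of 10 iterations are dealt to the threads
      round-robin, which helps when the cost of f(i) grows with i. */
   #pragma omp parallel for schedule(static, 10)
   for (i = 0; i < n; i++)
      a[i] = f(i);
}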

13.3 Dynamic
In a dynamic schedule, the assignment of iterations to threads is determined at runtime. As in the static case, the iterations are broken up into a number of chunks, which are then farmed out to the threads one at a time. As threads complete work on a chunk, they request another chunk until the supply of chunks is exhausted. If the scheduling type is dynamic, iterations are divided into chunks of size chunk_size, similar to an interleaved schedule. If chunk_size is not present, the size of all chunks is 1. This kind of schedule is called "simple dynamic."

A simple dynamic schedule is more flexible than an interleaved schedule because faster threads are assigned more iterations, but it has greater overhead, in the form of synchronization costs, because the OpenMP runtime system must coordinate the assignment of iterations to threads.
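A sketch of the simple dynamic case, assuming a hypothetical routine expensive_step whose cost varies unpredictably from one element to the next:

double expensive_step(double x);      /* hypothetical, uneven-cost work */

void apply_steps(double *x, int n)
{
   int i;
   /* Default chunk size of 1: each thread asks the runtime for one
      iteration at a time, so faster threads take on more of the work. */
   #pragma omp parallel for schedule(dynamic)
   for (i = 0; i < n; i++)
      x[i] = expensive_step(x[i]);
}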

13.4 Guided
The guided type is a variant of dynamic scheduling. In this type, the first chunk of iterations is of some implementation-dependent size, and the size of each successive chunk is a fixed fraction of the preceding chunk until a minimum size of chunk_size is reached. Hence, size(chunk_n) = max(chunk_size, r^n * size(chunk_0)), where the ratio r < 1 is also implementation dependent. Frequently, chunk_0 is chosen to be about N/P, where N is the number of iterations and P is the number of threads, and r is chosen as (1 - 1/P). If fewer than chunk_size iterations are left, how the remaining iterations are divided into chunks also depends on the implementation. If chunk_size is not specified, the minimum chunk size is 1. Chunks are assigned to threads dynamically. Guided scheduling is sometimes called "guided self-scheduling" or "GSS."

The advantage of the guided type over the dynamic type is that guided schedules use fewer chunks, reducing the amount of synchronization overhead, i.e., the number of times a thread must ask for new work. The number of chunks produced increases linearly with the number of iterations for the dynamic type but only logarithmically for the guided type, so the advantage grows as the number of iterations in the loop increases.
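As a rough illustration of how the chunk sizes shrink (the exact values are implementation dependent): with N = 1000 iterations, P = 4 threads, chunk_0 ~ N/P = 250, and r = 1 - 1/P = 0.75, successive chunks would be roughly 250, 188, 141, 106, 79, ... iterations, never falling below the requested minimum. In code, the clause itself might look like this sketch, where the minimum chunk size of 16 is an arbitrary choice:

double guided_sum(const double *x, int n)
{
   int i;
   double s = 0.0;

   /* Early chunks are large; later chunks shrink toward the
      minimum chunk size of 16 requested here. */
   #pragma omp parallel for schedule(guided, 16) reduction(+:s)
   for (i = 0; i < n; i++)
      s += x[i];

   return s;
}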

13.5 Runtime
The runtime type allows the schedule to be determined at runtime. The chunk_size parameter must not appear. The schedule type is chosen at runtime based on the value of the environment variable OMP_SCHEDULE. The environment variable is set to a string that matches the parameters that would appear in the parentheses of a SCHEDULE clause. For example, setting OMP_SCHEDULE via the C shell command
%setenv OMP_SCHEDULE "guided, 100"

before executing the program would result in the loops having a guided schedule with a minimum chunk size of 100. If OMP_SCHEDULE is not set, the choice of schedule depends on the implementation.
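On the code side, the loop simply names the runtime schedule and leaves the actual choice to the environment. A minimal sketch (the function name is illustrative):

void square_all(double *x, int n)
{
   int i;
   /* The actual schedule (e.g. "guided, 100") is read from the
      OMP_SCHEDULE environment variable when the program runs. */
   #pragma omp parallel for schedule(runtime)
   for (i = 0; i < n; i++)
      x[i] = x[i] * x[i];
}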

13.6 Schedule Clause


The table below summarizes the different scheduling options and compares them in terms of several characteristics that affect performance.

Summary of scheduling options

  Name             Type      Chunk      Chunk Size            Number of Chunks   Static or Dynamic   Compute Overhead
  Simple static    static    no         N/P                   P                  static              lowest
  Interleaved      static    yes        C                     N/C                static              low
  Simple dynamic   dynamic   optional   C                     N/C                dynamic             medium
  Guided           guided    optional   decreasing from N/P   fewer than N/C     dynamic             high
  Runtime          runtime   no         varies                varies             varies              varies

In this table, N is the number of iterations of the parallel loop, P is the number of threads executing the loop, and C is the user-specified chunk size.

A note of caution: the correctness of a program should not depend on the schedule chosen for its parallel loops. If the correctness of the results depends on the choice of schedule, it is likely that you missed a source of dependency in one or more of your loop parallelizations. For example, if the correct results depend on the sequential execution of some of the iterations, then the results will depend on whether those iterations are assigned to the same chunk and/or thread. A program may get correct results at first, but then mysteriously stop working if the schedule is changed while tuning performance. If the schedule is dynamic, the situation is potentially more challenging, as the program may fail only intermittently.
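A hedged sketch of the kind of bug described above: the loop below carries a dependence from iteration i-1 to iteration i, so it is not valid to parallelize at all, yet with a simple static schedule most consecutive iterations land in the same chunk and the error may only surface at chunk boundaries, or only after the schedule is changed. The function and array names are illustrative.

/* INCORRECT parallelization, shown only as a cautionary example:
   iteration i reads x[i-1], which iteration i-1 writes, so the
   result depends on which thread (and which chunk) owns each
   iteration. */
void running_sum_wrong(double *x, const double *y, int n)
{
   int i;
   #pragma omp parallel for schedule(static)
   for (i = 1; i < n; i++)
      x[i] = x[i-1] + y[i];
}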

13.7 Self Test


Question 1
Which one of the following is NOT a correct type for the schedule clause?
- static
- dynamic
- guided
- shared

Question 2
About the static type, which one of the following statements is correct?
- The choice of which thread performs a particular iteration is only a function of the iteration number.
- The first thread gets the first chunk.
- If the type is static and chunk_size is not present, the chunk size can't be determined.
- Static scheduling has higher overhead.

Question 3
How are the iterations assigned to the threads in the dynamic schedule?
- All iterations are assigned to the threads at the beginning of the loop.
- Each thread requests more iterations after it has completed the work already assigned to it.
- All iterations are assigned to all threads evenly.
- Iterations are assigned to all threads randomly.

Question 4
In the guided type, how is the chunk size distributed?
- The chunk size is distributed evenly.
- The chunk size is distributed randomly.
- The first chunk size is implementation dependent, and the size of each successive chunk decreases exponentially.
- The first chunk size is implementation dependent, and the size of each successive chunk decreases one by one.

Question 5
Which one of the following UNIX C shell commands sets the environment variable for the runtime type correctly?
- %setenv OMP_SCHEDULE = "dynamic, 3"
- %setenv OMP_SCHEDULE dynamic:3
- %setenv OMP_SCHEDULE "dynamic", "3"
- %setenv OMP_SCHEDULE "dynamic, 3"

CI-Tutor content for personal use only. All rights reserved. 2013 Board of Trustees of the University of Illinois.
