PGI Workstation User's Guide - 10 OpenMP Parallelization Directives for Fortran


10 OpenMP Parallelization Directives for Fortran

The PGF77 and PGF90 Fortran compilers support the OpenMP Fortran Application Program Interface. The OpenMP shared-memory parallel programming model is defined by a collection of compiler directives, library routines, and environment variables that can be used to specify shared-memory parallelism in Fortran programs. The directives include a parallel region construct for writing coarse grain SPMD programs, work-sharing constructs which specify that DO loop iterations should be split among the available threads of execution, and synchronization constructs. The data environment is controlled using clauses on the directives or with additional directives. Run-time library routines are provided to query the parallel runtime environment, for example to determine how many threads are participating in execution of a parallel region. Finally, environment variables are provided to control the execution behavior of parallel programs. For more information on OpenMP, see

For an introduction to executing programs that use multiple processors, along with some pointers to example code, see Section 1.4, Parallel Programming Using the PGI Compilers. The file fftpde.tar.gz contains a more advanced self-guided tutorial on how to parallelize the NAS FT fast Fourier transform benchmark using OpenMP directives. You can retrieve it using a web browser, and unpack it using the following commands within a UNIX shell window or a BASH for Win32 command window:

% gunzip fftpde.tar.gz
% tar xvf fftpde.tar

Follow the instructions in the README file to work through the tutorial.

10.1 Parallelization Directives

Parallelization directives are comments in a program that are interpreted by the PGI Fortran compilers when the option -mp is specified on the command line. The form of a parallelization directive is:

sentinel	directive_name	[clauses]

With the exception of the SGI-compatible DOACROSS directive, the sentinel must be !$OMP, C$OMP, or *$OMP, must start in column 1 (one), and must appear as a single word without embedded white space. The sentinel marking a DOACROSS directive is C$. Standard Fortran syntax restrictions (line length, case insensitivity, etc.) apply to the directive line. Initial directive lines must have a space or zero in column six and continuation directive lines must have a character other than space or zero in column six. Continuation lines for C$DOACROSS directives are specified using the C$& sentinel.

The order in which clauses appear in the parallelization directives is not significant. Commas separate clauses within the directives, but commas are not allowed between the directive name and the first clause. Clauses on directives may be repeated as needed subject to the restrictions listed in the description of each clause.

The compiler option -mp enables recognition of the parallelization directives. The use of this option also implies:

  • Local variables are placed on the stack and optimizations that may result in non-reentrant code are disabled (e.g., -Mnoframe).
  • Critical sections are generated around Fortran I/O statements.

Many of the directives are presented in pairs and must be used in pairs. In the examples given with each section, the routines omp_get_num_threads() and omp_get_thread_num() are used (refer to section 10.16, Run-time Library Routines). They return the number of threads currently in the team executing the parallel region and the thread number within the team, respectively.



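10.2 PARALLEL ... END PARALLEL

!$OMP PARALLEL [Clauses]
< Fortran code executed in body of parallel region >
!$OMP END PARALLEL

Clauses:

PRIVATE (list)
SHARED (list)
DEFAULT (PRIVATE | SHARED | NONE)
FIRSTPRIVATE (list)
REDUCTION([{operator | intrinsic}:] list)
COPYIN (list)
IF (scalar_logical_expression)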

This directive pair declares a region of parallel execution. It directs the compiler to create an executable in which the statements between PARALLEL and END PARALLEL are executed by multiple lightweight threads. The code that lies between PARALLEL and END PARALLEL is called a parallel region.

The OpenMP parallelization directives support a fork/join execution model in which a single thread executes all statements until a parallel region is encountered. At the entrance to the parallel region, a system-dependent number of symmetric parallel threads begin executing all statements in the parallel region redundantly. These threads share work by means of work-sharing constructs such as parallel DO loops (see below). The number of threads in the team is controlled by the OMP_NUM_THREADS environment variable. If OMP_NUM_THREADS is not defined, the program will execute parallel regions using only one processor. Branching into or out of a parallel region is not supported.

All other shared-memory parallelization directives must occur within the scope of a parallel region. Nested PARALLEL ... END PARALLEL directive pairs are not supported and are ignored. The END PARALLEL directive denotes the end of the parallel region, and is an implicit barrier. When all threads have completed execution of the parallel region, a single thread resumes execution of the statements that follow.

It should be emphasized that by default there is no work distribution in a parallel region. Each active thread executes the entire region redundantly until it encounters a directive that specifies work distribution. For work distribution, see the DO, PARALLEL DO, or DOACROSS directives.


      PROGRAM WHICH_PROCESSOR_AM_I
      INTEGER A(0:1)
      INTEGER omp_get_thread_num
      A(0) = -1
      A(1) = -1
!$OMP PARALLEL
      A(omp_get_thread_num()) = omp_get_thread_num()
!$OMP END PARALLEL
      PRINT *, "A(0)=",A(0), " A(1)=",A(1)
      END

The variables specified in a PRIVATE list are private to each thread in a team. In effect, the compiler creates a separate copy of each of these variables for each thread in the team. When an assignment to a private variable occurs, each thread assigns to its local copy of the variable. When operations involving a private variable occur, each thread performs the operations using its local copy of the variable. Other important points to note about private variables are the following:

  • Variables declared private in a parallel region are undefined upon entry to the parallel region. If the first use of a private variable within the parallel region is in a right-hand-side expression, the results of the expression will be undefined (i.e. this is probably a coding error).
  • Likewise, variables declared private in a parallel region are undefined when serial execution resumes at the end of the parallel region.

The variables specified in a SHARED list are shared between all threads in a team, meaning that all threads access the same storage area for SHARED data.

The DEFAULT clause allows the user to specify the default attribute for variables in the lexical extent of the parallel region. Individual clauses specifying PRIVATE, SHARED, etc. status override the declared DEFAULT. Specifying DEFAULT(NONE) declares that there is no implicit default, and in this case each variable in the parallel region must be explicitly listed with an attribute of PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION.
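As a minimal sketch of DEFAULT(NONE), the program below lists every variable referenced inside the region with an explicit attribute. The program name, the variables ID, BASE, and TID, and the array bound of 63 are illustrative choices, not part of the guide; the bound check simply keeps the sketch safe if more than 64 threads are requested.

      PROGRAM DEFAULT_NONE_USE
      INTEGER ID(0:63), BASE, TID
      INTEGER omp_get_thread_num
      BASE = 100
      DO TID = 0, 63
        ID(TID) = -1
      ENDDO
C     With DEFAULT(NONE) each variable used in the region must
C     appear in a PRIVATE or SHARED list (hypothetical names).
!$OMP PARALLEL DEFAULT(NONE), PRIVATE(TID), SHARED(ID, BASE)
      TID = omp_get_thread_num()
      IF (TID .LE. 63) ID(TID) = TID + BASE
!$OMP END PARALLEL
      PRINT *, "ID(0)=", ID(0)
      END

Removing ID or BASE from the SHARED list should cause the compiler to reject the region, which is the usual reason for choosing DEFAULT(NONE).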

Variables that appear in the list of a FIRSTPRIVATE clause are subject to the same semantics as PRIVATE variables, but in addition are initialized from the original object existing prior to entering the parallel region. Variables that appear in the list of a REDUCTION clause must be SHARED. A private copy of each variable in list is created for each thread as if the PRIVATE clause had been specified. Each private copy is initialized according to the operator as specified in table 10-1:

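Table 10-1 Initialization of REDUCTION Variables

Operator / Intrinsic    Initialization
+                       0
*                       1
-                       0
.AND.                   .TRUE.
.OR.                    .FALSE.
.EQV.                   .TRUE.
.NEQV.                  .FALSE.
MAX                     Smallest Representable Number
MIN                     Largest Representable Number
IAND                    All bits on
IOR                     0
IEOR                    0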

At the end of the parallel region, a reduction is performed on the instances of variables appearing in list using operator or intrinsic as specified in the REDUCTION clause. The initial value of each REDUCTION variable is included in the reduction operation. If the {operator | intrinsic}: portion of the REDUCTION clause is omitted, the default reduction operator is "+" (addition).
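The behavior described above can be illustrated with a small sketch in which each thread adds its thread number plus one to a reduction variable. The program name and the variable TOTAL are assumptions made for the illustration.

      PROGRAM REDUCTION_USE
      INTEGER TOTAL
      INTEGER omp_get_thread_num
      TOTAL = 10
C     Each private copy of TOTAL starts at 0, the initial value
C     for the + operator.
!$OMP PARALLEL REDUCTION(+: TOTAL)
      TOTAL = TOTAL + omp_get_thread_num() + 1
!$OMP END PARALLEL
C     The original value 10 takes part in the final sum.
      PRINT *, "TOTAL =", TOTAL
      END

With two threads the private copies hold 1 and 2 at the end of the region, so the combined result would be 10 + 1 + 2 = 13.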

The COPYIN clause applies only to THREADPRIVATE common blocks. In the presence of the COPYIN clause, data from the master thread's copy of the common block is copied to the threadprivate copies upon entry to the parallel region.

In the presence of an IF clause, the parallel region will be executed in parallel only if the corresponding scalar_logical_expression evaluates to .TRUE.. Otherwise, the code within the region will be executed by a single processor regardless of the value of the environment variable OMP_NUM_THREADS.
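A short sketch of the IF clause follows. The threshold of 1000 and the names N and INPAR are hypothetical; the example relies on omp_in_parallel(), described in section 10.16, to report whether the region actually ran in parallel.

      PROGRAM IF_USE
      INTEGER N
      LOGICAL INPAR
      LOGICAL omp_in_parallel
      N = 10
      INPAR = .FALSE.
C     The region is parallel only when N exceeds 1000 (an
C     arbitrary threshold chosen for this sketch).
!$OMP PARALLEL IF(N .GT. 1000), SHARED(INPAR)
!$OMP MASTER
      INPAR = omp_in_parallel()
!$OMP END MASTER
!$OMP END PARALLEL
      PRINT *, "EXECUTED IN PARALLEL: ", INPAR
      END

Because N is 10 here, the region is serialized and INPAR stays false; raising N above 1000 lets the region run with the full team.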



10.3 CRITICAL ... END CRITICAL

!$OMP CRITICAL [(name)]
< Fortran code executed in body of critical section >
!$OMP END CRITICAL [(name)]

Within a parallel region, the user may have code that will not execute properly when multiple threads act upon the same sub-region of code. This is often due to a shared variable that is written and then read again.

The CRITICAL ... END CRITICAL directive pair defines a subsection of code within a parallel region, referred to as a critical section, which will be executed one thread at a time. The optional name argument identifies the critical section. The first thread to arrive at a critical section will be the first to execute the code within the section. The second thread to arrive will not begin execution of statements in the critical section until the first thread has exited the critical section. Likewise each of the remaining threads will wait its turn to execute the statements in the critical section.

Critical sections cannot be nested, and any such specifications are ignored. Branching into or out of a critical section is illegal. If a name argument appears on a CRITICAL directive, the same name must appear on the END CRITICAL directive.


REAL A(100,100), MX, LMX
MX = -1.0
LMX = -1.0
DO J=1,100
DO I=1,100

Note that this program could also be implemented without the critical region by declaring MX as a reduction variable and performing the MAX calculation in the loop using MX directly rather than using LMX. See sections 10.2 and 10.6 for more information on how to use the REDUCTION clause on a parallel DO loop.
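A complete program built around the declarations and loops above could look like the following sketch. The way A is filled and the program name are illustrative assumptions; the use of LMX as a per-thread maximum that is merged into MX inside a critical section follows the note above.

      PROGRAM CRITICAL_USE
      REAL A(100,100), MX, LMX
      INTEGER I, J
      DO J=1,100
        DO I=1,100
          A(I,J) = REAL(I + J)
        ENDDO
      ENDDO
      MX = -1.0
      LMX = -1.0
!$OMP PARALLEL PRIVATE(I), FIRSTPRIVATE(LMX)
!$OMP DO
      DO J=1,100
        DO I=1,100
          LMX = MAX(A(I,J), LMX)
        ENDDO
      ENDDO
!$OMP END DO
C     One thread at a time merges its local maximum into MX.
!$OMP CRITICAL
      MX = MAX(MX, LMX)
!$OMP END CRITICAL
!$OMP END PARALLEL
      PRINT *, "MAX VALUE OF A IS ", MX
      END

With this fill pattern the largest element is A(100,100), so the program should report 200.0 regardless of the number of threads.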



10.4 MASTER ... END MASTER

!$OMP MASTER
< Fortran code in body of MASTER section >
!$OMP END MASTER

In a parallel region of code, there may be a sub-region of code that should execute only on the master thread. Instead of ending the parallel region before this subregion, and then starting it up again after this subregion, the MASTER ... END MASTER directive pair allows the user to conveniently designate code that executes on the master thread and is skipped by the other threads. There is no implied barrier on entry to or exit from a MASTER ... END MASTER section of code. Nested master sections are ignored. Branching into or out of a master section is not supported.


INTEGER omp_get_thread_num
A(omp_get_thread_num()) = omp_get_thread_num()
PRINT *, "A(0)=", A(0), " A(1)=", A(1)
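A complete program built around the statements above might read as follows. This is a sketch: the program name and the printed message are illustrative, and the two-element array assumes that OMP_NUM_THREADS is at most 2.

      PROGRAM MASTER_USE
      INTEGER A(0:1)
      INTEGER omp_get_thread_num
      A(0) = -1
      A(1) = -1
!$OMP PARALLEL
      A(omp_get_thread_num()) = omp_get_thread_num()
C     Only the master thread (thread 0) executes the PRINT.
!$OMP MASTER
      PRINT *, "YOU SHOULD ONLY SEE THIS ONCE"
!$OMP END MASTER
!$OMP END PARALLEL
      PRINT *, "A(0)=", A(0), " A(1)=", A(1)
      END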



10.5 SINGLE ... END SINGLE

!$OMP SINGLE [Clauses]
< Fortran code in body of SINGLE processor section >
!$OMP END SINGLE [NOWAIT]

Clauses:

PRIVATE (list)
FIRSTPRIVATE (list)



In a parallel region of code, there may be a sub-region of code that will only execute correctly on a single thread. Instead of ending the parallel region before this subregion, and then starting it up again after this subregion, the SINGLE ... END SINGLE directive pair allows the user to conveniently designate code that executes on a single thread and is skipped by the other threads. There is an implied barrier on exit from a SINGLE ... END SINGLE section of code unless the optional NOWAIT clause is specified.

Nested single process sections are ignored. Branching into or out of a single process section is not supported.


INTEGER omp_get_thread_num
A(omp_get_thread_num()) = omp_get_thread_num()
PRINT *, "A(0)=", A(0), " A(1)=", A(1)
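A complete program along the same lines might read as follows. It is a sketch only: the program name and message are illustrative, and the two-element array assumes that OMP_NUM_THREADS is at most 2.

      PROGRAM SINGLE_USE
      INTEGER A(0:1)
      INTEGER omp_get_thread_num
      A(0) = -1
      A(1) = -1
!$OMP PARALLEL
      A(omp_get_thread_num()) = omp_get_thread_num()
C     Whichever thread arrives first executes the PRINT; the
C     others wait at the barrier implied by END SINGLE.
!$OMP SINGLE
      PRINT *, "YOU SHOULD ONLY SEE THIS ONCE"
!$OMP END SINGLE
!$OMP END PARALLEL
      PRINT *, "A(0)=", A(0), " A(1)=", A(1)
      END

Unlike a MASTER section, the thread that executes the SINGLE section is not necessarily thread 0.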

The PRIVATE and FIRSTPRIVATE clauses are as described in section 10.2.

10.6 DO ... END DO


!$OMP DO [Clauses]
< Fortran DO loop to be executed in parallel >
!$OMP END DO [NOWAIT]

Clauses:

PRIVATE (list)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
REDUCTION({operator | intrinsic} : list)
SCHEDULE (type [, chunk])
ORDERED

The real purpose of supporting parallel execution is the distribution of work across the available threads. The user can explicitly manage work distribution with constructs such as:

IF (omp_get_thread_num() .EQ. 0) THEN
...
ELSE IF (omp_get_thread_num() .EQ. 1) THEN
...
ENDIF

However, these constructs are not in the form of directives. The DO ... END DO directive pair provides a convenient mechanism for the distribution of loop iterations across the available threads in a parallel region.

Variables declared in a PRIVATE list are treated as private to each processor participating in parallel execution of the loop, meaning that a separate copy of the variable exists on each processor. Variables declared in a FIRSTPRIVATE list are PRIVATE, and in addition are initialized from the original object existing before the construct. Variables declared in a LASTPRIVATE list are PRIVATE, and in addition the thread that executes the sequentially last iteration updates the version of the object that existed before the construct. The REDUCTION clause is as described in section 10.2. The SCHEDULE clause is explained below. If ORDERED code blocks are contained in the dynamic extent of the DO directive, the ORDERED clause must be present. See section 10.12 for more information on ORDERED code blocks.

The DO ... END DO directive pair directs the compiler to distribute the iterative DO loop immediately following the !$OMP DO directive across the threads available to the program. The DO loop is executed in parallel by the team that was started by an enclosing parallel region. If the !$OMP END DO directive is not specified, the !$OMP DO is assumed to end with the enclosed DO loop. DO ... END DO directive pairs may not be nested. Branching into or out of a !$OMP DO loop is not supported.

By default, there is an implicit barrier after the end of the parallel loop; the first thread to complete its portion of the work will wait until the other threads have finished their portion of work. If NOWAIT is specified, the threads will not synchronize at the end of the parallel loop.

Other items to note about !$OMP DO loops:

  • The DO loop index variable is always private.
  • !$OMP DO loops must be executed by all threads participating in the parallel region or none at all.
  • The END DO directive is optional, but if it is present it must appear immediately after the end of the enclosed DO loop.


REAL A(1000), B(1000)
DO I=1,1000
DO I=1,1000
A(I) = SQRT(B(I))
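A complete program built around the two loops above could look like the following sketch. The initialization of B and the program name are assumptions made for illustration.

      PROGRAM DO_USE
      REAL A(1000), B(1000)
      INTEGER I
C     Serial loop: fill B with illustrative values.
      DO I=1,1000
        B(I) = REAL(I)
      ENDDO
!$OMP PARALLEL
C     The iterations of the next loop are divided among threads.
!$OMP DO
      DO I=1,1000
        A(I) = SQRT(B(I))
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
      PRINT *, "A(100)=", A(100)
      END

Each iteration writes a distinct element of A, so no synchronization beyond the implicit barrier at END DO is needed.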

The SCHEDULE clause specifies how iterations of the DO loop are divided up between processors. Given a SCHEDULE (type [, chunk]) clause, type can be STATIC, DYNAMIC, GUIDED, or RUNTIME. These are defined as follows:

When SCHEDULE (STATIC, chunk) is specified, iterations are allocated in contiguous blocks of size chunk. The blocks of iterations are statically assigned to threads in a round-robin fashion in order of the thread ID numbers. The chunk must be a scalar integer expression. If chunk is not specified, a default chunk size is chosen equal to:

(number_of_iterations + omp_get_num_threads() - 1) / omp_get_num_threads()

When SCHEDULE (DYNAMIC, chunk) is specified, iterations are allocated in contiguous blocks of size chunk. As each thread finishes a piece of the iteration space, it dynamically obtains the next set of iterations. The chunk must be a scalar integer expression. If no chunk is specified, a default chunk size is chosen equal to 1.

When SCHEDULE (GUIDED, chunk) is specified, the chunk size is reduced in an exponentially decreasing manner with each dispatched piece of the iteration space. Chunk specifies the minimum number of iterations to dispatch each time, except when there are fewer than chunk iterations remaining to be processed, at which point all remaining iterations are assigned. If no chunk is specified, a default chunk size is chosen equal to 1.

When SCHEDULE (RUNTIME) is specified, the decision regarding iteration scheduling is deferred until runtime. The schedule type and chunk size can be chosen at runtime by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the resulting schedule is equivalent to SCHEDULE(STATIC).
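One way to observe a schedule is to record which thread executes each iteration. The sketch below does this for SCHEDULE(STATIC, 4); the program name and the array OWNER are hypothetical.

      PROGRAM SCHEDULE_USE
      INTEGER I, OWNER(16)
      INTEGER omp_get_thread_num
!$OMP PARALLEL
C     Blocks of 4 iterations are assigned to threads round-robin.
!$OMP DO SCHEDULE(STATIC, 4)
      DO I=1,16
        OWNER(I) = omp_get_thread_num()
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
      PRINT *, OWNER
      END

With two threads, iterations 1-4 and 9-12 would go to thread 0 and iterations 5-8 and 13-16 to thread 1. Replacing the clause with SCHEDULE(DYNAMIC, 4) keeps the block size but makes the assignment depend on which thread finishes first.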




10.7 BARRIER

!$OMP BARRIER

There may be occasions in a parallel region when it is necessary that all threads complete work to that point before any thread is allowed to continue. The BARRIER directive synchronizes all threads at such a point in a program. Multiple barrier points are allowed within a parallel region. The BARRIER directive must either be executed by all threads executing the parallel region or by none of them.
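A typical use is a two-phase computation in which each thread reads data written by another thread. The following sketch assumes at most 64 threads; the program and variable names are illustrative.

      PROGRAM BARRIER_USE
      INTEGER A(0:63), B(0:63), TID, NT, NEXT
      INTEGER omp_get_thread_num, omp_get_num_threads
!$OMP PARALLEL PRIVATE(TID, NT, NEXT)
      TID = omp_get_thread_num()
      NT = omp_get_num_threads()
      A(TID) = TID * 10
C     Wait until every thread has stored its element of A.
!$OMP BARRIER
      NEXT = MOD(TID + 1, NT)
      B(TID) = A(NEXT)
!$OMP END PARALLEL
      PRINT *, "B(0)=", B(0)
      END

Without the barrier, a thread could read A(NEXT) before its neighbor has written it.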


10.8 DOACROSS

The C$DOACROSS directive is not part of the OpenMP standard, but is supported for compatibility with programs parallelized using legacy SGI-style directives.


C$DOACROSS [ Clauses ]
< Fortran DO loop to be executed in parallel >


Clauses:

[ {PRIVATE | LOCAL} (list) ]
[ {SHARED | SHARE} (list) ]
[ CHUNK=<integer_expression> ]
[ IF (logical_expression) ]

The C$DOACROSS directive has the effect of a combined parallel region and parallel DO loop applied to the loop immediately following the directive. It is very similar to the OpenMP PARALLEL DO directive, but provides for backward compatibility with codes parallelized for SGI systems prior to the OpenMP standardization effort. The C$DOACROSS directive must not appear within a parallel region. It is a short-hand notation which tells the compiler to parallelize the loop to which it applies, even though that loop is not contained within a parallel region. While this syntax is more convenient, it should be noted that if multiple successive DO loops are to be parallelized it is more efficient to define a single enclosing parallel region and parallelize each loop using the OpenMP DO directive.

A variable declared PRIVATE or LOCAL to a C$DOACROSS loop is treated the same as a private variable in a parallel region or DO (see above). A variable declared SHARED or SHARE to a C$DOACROSS loop is shared among the threads, meaning that only 1 copy of the variable exists to be used and/or modified by all of the threads. This is equivalent to the default status of a variable that is not listed as PRIVATE in a parallel region or DO (this same default status is used in C$DOACROSS loops as well).
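As a minimal sketch, a single loop can be parallelized with C$DOACROSS as shown below. The program name and the variables A and T are illustrative, and only the LOCAL and SHARE clauses listed above are used.

      PROGRAM DOACROSS_USE
      REAL A(1000), T
      INTEGER I
C     T is a per-thread temporary; A is shared by all threads.
C$DOACROSS LOCAL(T), SHARE(A)
      DO I=1,1000
        T = REAL(I)
        A(I) = T * T
      ENDDO
      PRINT *, "A(10)=", A(10)
      END

No enclosing parallel region is written, since the directive itself implies one for the loop that follows it.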


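10.9 PARALLEL DO

The OpenMP PARALLEL DO directive is supported using the following syntax:


!$OMP PARALLEL DO [Clauses]
< Fortran DO loop to be executed in parallel >
!$OMP END PARALLEL DO

Clauses:

PRIVATE (list)
SHARED (list)
DEFAULT (PRIVATE | SHARED | NONE)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
REDUCTION({operator | intrinsic} : list)
COPYIN (list)
IF (scalar_logical_expression)
ORDERED
SCHEDULE (type [, chunk])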

The semantics of the PARALLEL DO directive are identical to those of a parallel region containing only a single parallel DO loop and directive. Note that the END PARALLEL DO directive is optional. The available clauses are as defined in sections 10.2 and 10.6.
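The combined directive is convenient for a single loop with a reduction, as in the following sketch. The program name and the variables X and S are assumptions for the illustration.

      PROGRAM PARALLEL_DO_USE
      INTEGER I
      REAL X(1000), S
      DO I=1,1000
        X(I) = 1.0
      ENDDO
      S = 0.0
C     Each thread sums its share of X into a private copy of S;
C     the copies are added together at the end of the loop.
!$OMP PARALLEL DO REDUCTION(+: S)
      DO I=1,1000
        S = S + X(I)
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, "SUM =", S
      END

Since every element of X is 1.0, the sum is 1000.0 for any number of threads.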


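10.10 SECTIONS ... END SECTIONS

The OpenMP SECTIONS / END SECTIONS directive pair is supported using the following syntax:


!$OMP SECTIONS [ Clauses ]
[!$OMP SECTION]
< Fortran code block executed by processor i >
[!$OMP SECTION]
< Fortran code block executed by processor j >
!$OMP END SECTIONS [NOWAIT]

Clauses:

PRIVATE (list)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
REDUCTION({operator | intrinsic} : list)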

The SECTIONS / END SECTIONS directives define a non-iterative work-sharing construct within a parallel region. Each section is executed by a single processor. If there are more processors than sections, some processors will have no work and will jump to the implied barrier at the end of the construct. If there are more sections than processors, one or more processors will execute more than one section.

A SECTION directive may only appear within the lexical extent of the enclosing SECTIONS / END SECTIONS directives. In addition, the code within the SECTIONS / END SECTIONS directives must be a structured block, and the code in each SECTION must be a structured block.

The available clauses are as defined in section 10.2 and 10.6.
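The sketch below uses two sections to fill two independent arrays. The program name and the arrays A and B are illustrative.

      PROGRAM SECTIONS_USE
      INTEGER I
      REAL A(100), B(100)
!$OMP PARALLEL PRIVATE(I)
!$OMP SECTIONS
!$OMP SECTION
      DO I=1,100
        A(I) = REAL(I)
      ENDDO
!$OMP SECTION
      DO I=1,100
        B(I) = REAL(2*I)
      ENDDO
!$OMP END SECTIONS
!$OMP END PARALLEL
      PRINT *, "A(100)=", A(100), " B(100)=", B(100)
      END

With one thread both sections run one after the other; with two or more, the two loops can run concurrently. I is declared PRIVATE because both sections use it as a loop index.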


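10.11 PARALLEL SECTIONS

The OpenMP PARALLEL SECTIONS / END PARALLEL SECTIONS directive pair is supported using the following syntax:


!$OMP PARALLEL SECTIONS [Clauses]
[!$OMP SECTION]
< Fortran code block executed by processor i >
[!$OMP SECTION]
< Fortran code block executed by processor j >
!$OMP END PARALLEL SECTIONS

Clauses:

PRIVATE (list)
SHARED (list)
DEFAULT (PRIVATE | SHARED | NONE)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
REDUCTION({operator | intrinsic} : list)
COPYIN (list)
IF (scalar_logical_expression)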

The PARALLEL SECTIONS / END PARALLEL SECTIONS directives define a non-iterative work-sharing construct without the need to define an enclosing parallel region. Each section is executed by a single processor. If there are more processors than sections, some processors will have no work and will jump to the implied barrier at the end of the construct. If there are more sections than processors, one or more processors will execute more than one section.

A SECTION directive may only appear within the lexical extent of the enclosing PARALLEL SECTIONS / END PARALLEL SECTIONS directives. In addition, the code within the PARALLEL SECTIONS / END PARALLEL SECTIONS directives must be a structured block, and the code in each SECTION must be a structured block.

The available clauses are as defined in section 10.2 and 10.6.


10.12 ORDERED

The OpenMP ORDERED directive is supported using the following syntax:


!$OMP ORDERED
< Fortran code block executed by processor >
!$OMP END ORDERED

The ORDERED directive can appear only in the dynamic extent of a DO or PARALLEL DO directive that includes the ORDERED clause. The code block between the ORDERED / END ORDERED directives is executed by only one thread at a time, and in the order of the loop iterations. This sequentializes the ordered code block while allowing parallel execution of statements outside the code block. The following additional restrictions apply to the ORDERED directive:

  • The ORDERED code block must be a structured block. It is illegal to branch into or out of the block.
  • A given iteration of a loop with a DO directive cannot execute the same ORDERED directive more than once, and cannot execute more than one ORDERED directive.
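A common use of ORDERED is to keep output in loop order while the rest of each iteration runs in parallel. The following is a sketch with illustrative names.

      PROGRAM ORDERED_USE
      INTEGER I
      REAL A(8)
C     The ORDERED clause is required because the loop body
C     contains an ORDERED code block.
!$OMP PARALLEL DO ORDERED, SCHEDULE(DYNAMIC)
      DO I=1,8
        A(I) = SQRT(REAL(I))
!$OMP ORDERED
        PRINT *, I, A(I)
!$OMP END ORDERED
      ENDDO
!$OMP END PARALLEL DO
      END

The square roots may be computed in any order, but the lines are printed for I = 1 through 8 in sequence.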

10.13 ATOMIC

The OpenMP ATOMIC directive is supported using the following syntax:


!$OMP ATOMIC

The ATOMIC directive is semantically equivalent to enclosing the following single statement in a CRITICAL / END CRITICAL directive pair. The statement must be of one of the following forms:

  • x = x operator expr
  • x = expr operator x
  • x = intrinsic (x, expr)
  • x = intrinsic (expr, x)

where x is a scalar variable of intrinsic type, expr is a scalar expression that does not reference x, intrinsic is one of MAX, MIN, IAND, IOR, or IEOR, and operator is one of +, *, -, /, .AND., .OR., .EQV., or .NEQV..
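As an illustration of the first form, x = x operator expr, the sketch below updates shared counters from a parallel loop. The program name and the array HITS are hypothetical.

      PROGRAM ATOMIC_USE
      INTEGER I, HITS(0:3)
      DO I=0,3
        HITS(I) = 0
      ENDDO
!$OMP PARALLEL DO
      DO I=1,1000
C       Several threads may update the same element of HITS, so
C       the increment is protected.
!$OMP ATOMIC
        HITS(MOD(I,4)) = HITS(MOD(I,4)) + 1
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, HITS
      END

Each of the four counters should end at 250. Without the ATOMIC directive, concurrent updates of the same element could be lost.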

10.14 FLUSH

The OpenMP FLUSH directive is supported using the following syntax:


!$OMP FLUSH [(list)]

The FLUSH directive ensures that all processor-visible data items, or only those specified in list when it's present, are written back to memory at the point at which the directive appears.


10.15 THREADPRIVATE

The OpenMP THREADPRIVATE directive is supported using the following syntax:


!$OMP THREADPRIVATE ( [ /common_block1/ [, /common_block2/] ...] )

Where common_blockn is the name of a common block to be made private to each thread but global within the thread. This directive must appear in the declarations section of a program unit after the declaration of any common blocks listed. On entry to a parallel region, data in a THREADPRIVATE common block is undefined unless COPYIN is specified on the PARALLEL directive. When a common block that is initialized using DATA statements appears in a THREADPRIVATE directive, each thread's copy is initialized once prior to its first use.

The following restrictions apply to the THREADPRIVATE directive:

  • The THREADPRIVATE directive must appear after every declaration of a thread private common block.
  • Only named common blocks can be made thread private.
  • It is illegal for a THREADPRIVATE common block or its constituent variables to appear in any clause other than a COPYIN clause.
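The sketch below combines THREADPRIVATE with the COPYIN clause of section 10.2. The common block name /WORK/, the variables SEED and OUT, and the assumption of at most 64 threads are all illustrative.

      PROGRAM THREADPRIVATE_USE
      INTEGER SEED, OUT(0:63)
      INTEGER omp_get_thread_num
      COMMON /WORK/ SEED
!$OMP THREADPRIVATE(/WORK/)
      SEED = 42
C     COPYIN gives every thread's copy of /WORK/ the value held
C     by the master thread on entry to the region.
!$OMP PARALLEL COPYIN(/WORK/)
      SEED = SEED + omp_get_thread_num()
      OUT(omp_get_thread_num()) = SEED
!$OMP END PARALLEL
      PRINT *, "OUT(0)=", OUT(0)
      END

Each thread modifies only its own copy of SEED, so thread 0 stores 42 in OUT(0) and, when a second thread is present, thread 1 stores 43 in OUT(1).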

10.16 Run-time Library Routines

User-callable functions are available to the Fortran programmer to query and alter the parallel execution environment.

integer function omp_get_num_threads()

returns the number of threads in the team executing the parallel region from which it is called. When called from a serial region, this function returns 1. A nested parallel region is the same as a single parallel region. By default, the value returned by this function is equal to the value of the environment variable OMP_NUM_THREADS or to the value set by the last previous call to the omp_set_num_threads() subroutine defined below.

subroutine omp_set_num_threads(scalar_integer_exp)

sets the number of threads to use for the next parallel region. This subroutine can only be called from a serial region of code. If it is called from within a parallel region, or within a subroutine or function that is called from within a parallel region, the results are undefined. This subroutine has precedence over the OMP_NUM_THREADS environment variable.

integer function omp_get_thread_num()

returns the thread number within the team. The thread number lies between 0 and omp_get_num_threads()-1. When called from a serial region, this function returns 0. A nested parallel region is the same as a single parallel region.

integer function omp_get_max_threads()

returns the maximum value that can be returned by calls to omp_get_num_threads(). If omp_set_num_threads() is used to change the number of threads, subsequent calls to omp_get_max_threads() will return the new value. This function returns the maximum value whether executing from a parallel or serial region of code.

integer function omp_get_num_procs()

returns the number of processors that are available to the program.

logical function omp_in_parallel()

returns .TRUE. if called from within a parallel region and .FALSE. if called outside of a parallel region. When called from within a parallel region that is serialized, for example in the presence of an IF clause evaluating .FALSE., the function will return .FALSE..

subroutine omp_set_dynamic(scalar_logical_exp)

is designed to allow automatic dynamic adjustment of the number of threads used for execution of parallel regions. This subroutine is recognized, but currently has no effect.

logical function omp_get_dynamic()

is designed to allow the user to query whether automatic dynamic adjustment of the number of threads used for execution of parallel regions is enabled. This function is recognized, but currently always returns .FALSE..

subroutine omp_set_nested(scalar_logical_exp)

is designed to allow enabling/disabling of nested parallel regions. This subroutine is recognized, but currently has no effect.

logical function omp_get_nested()

is designed to allow the user to query whether nested parallel regions are enabled. This function is recognized, but currently always returns .FALSE..

subroutine omp_init_lock(integer_var)

initializes a lock associated with the variable integer_var for use in subsequent calls to lock routines. The initial state of integer_var is unlocked. It is illegal to make a call to this routine if integer_var is already associated with a lock.

subroutine omp_destroy_lock(integer_var)

disassociates a lock associated with the variable integer_var.

subroutine omp_set_lock(integer_var)

causes the calling thread to wait until the specified lock is available. The thread gains ownership of the lock when it is available. It is illegal to make a call to this routine if integer_var has not been associated with a lock.

subroutine omp_unset_lock(integer_var)

causes the calling thread to release ownership of the lock associated with integer_var. It is illegal to make a call to this routine if integer_var has not been associated with a lock.

logical function omp_test_lock(integer_var)

causes the calling thread to try to gain ownership of the lock associated with integer_var. The function returns .TRUE. if the thread gains ownership of the lock, and .FALSE. otherwise. It is illegal to make a call to this routine if integer_var has not been associated with a lock.
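The lock routines above can be combined as in the following sketch, which protects an update of a shared total. The program and variable names are illustrative, and declaring the lock variable as INTEGER*8 is an assumption made so that it is large enough on 64-bit systems; the guide itself only calls for an integer variable.

      PROGRAM LOCK_USE
C     INTEGER*8 is an assumption; see the note above.
      INTEGER*8 LCK
      INTEGER TOTAL
      INTEGER omp_get_thread_num
      TOTAL = 0
      CALL omp_init_lock(LCK)
!$OMP PARALLEL SHARED(LCK, TOTAL)
      CALL omp_set_lock(LCK)
C     Only the thread that owns the lock updates TOTAL.
      TOTAL = TOTAL + omp_get_thread_num() + 1
      CALL omp_unset_lock(LCK)
!$OMP END PARALLEL
      CALL omp_destroy_lock(LCK)
      PRINT *, "TOTAL =", TOTAL
      END

With two threads the total would be 1 + 2 = 3. A named CRITICAL section would serve equally well here; locks are mainly useful when the protected region cannot be expressed as a single lexical block.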

10.17 Environment Variables

OMP_NUM_THREADS - specifies the number of threads to use during execution of parallel regions. The default value for this variable is 1. For historical reasons, the environment variable NCPUS is supported with the same functionality. In the event that both OMP_NUM_THREADS and NCPUS are defined, the value of OMP_NUM_THREADS takes precedence.


OMP_NUM_THREADS threads will be used to execute the program regardless of the number of physical processors available in the system. As a result, you can run programs using more threads than physical processors and they will execute correctly. However, performance of programs executed in this manner can be unpredictable, and oftentimes will be inefficient.

OMP_SCHEDULE - specifies the type of iteration scheduling to use for DO and PARALLEL DO loops which include the SCHEDULE(RUNTIME) clause. The default value for this variable is "STATIC". If the optional chunk size is not set, a chunk size of 1 is assumed except in the case of a STATIC schedule. For a STATIC schedule, the default is as defined in section 10.6. Examples of the use of OMP_SCHEDULE are as follows:

$ setenv OMP_SCHEDULE "STATIC, 5"
$ setenv OMP_SCHEDULE "GUIDED, 8"
$ setenv OMP_SCHEDULE "DYNAMIC"


OMP_DYNAMIC - currently has no effect.

OMP_NESTED - currently has no effect.

MPSTKZ - increases the size of the stacks used by threads executing in parallel regions. It is intended for programs that use large amounts of thread-local storage in the form of private variables or local variables in functions or subroutines called within parallel regions. The value should be an integer <n> concatenated with M or m to specify stack sizes of n megabytes. For example:

$ setenv MPSTKZ 8M
