Compaq Fortran
User Manual for
Tru64 UNIX and Linux Alpha Systems

6.2.5.2 Controlling Data Scope Attributes

You can use several options to control the data scope attributes of variables for the duration of the construct in which you specify them. If you do not specify a data scope attribute option on a directive, the default is SHARED for those variables affected by the directive.

Each of the data scope attribute options accepts a list, which is a comma-separated list of named variables or named common blocks that are accessible in the scoping unit. When you specify named common blocks, they must appear between slashes (/name/).

Not all of the options are allowed on all directives, but the directives to which each option applies are listed in the clause descriptions.

The data scope attribute options are:

COPYIN
DEFAULT
FIRSTPRIVATE
LASTLOCAL or LAST LOCAL
PRIVATE or LOCAL
REDUCTION
SHARED or SHARE

COPYIN Option

Use the COPYIN option on the PARALLEL, PARALLEL DO, and PARALLEL SECTIONS directives to copy named common block values from the master thread's copy to the other threads' copies at the beginning of a parallel region. The COPYIN option applies only to named common blocks that have been previously declared thread private using the TASKCOMMON or the INSTANCE PARALLEL directive (see Section 6.2.5.1).

Use a comma-separated list to name the common blocks and variables in common blocks you want to copy.

DEFAULT Option

This option is the same as the OpenMP Fortran API DEFAULT clause (see Section 6.1.5.2).

FIRSTPRIVATE Option

The FIRSTPRIVATE option is the same as the OpenMP Fortran API FIRSTPRIVATE clause (see Section 6.1.5.2).

LASTLOCAL or LAST LOCAL Option

Except for differences in directive name spelling, the LASTLOCAL or LAST LOCAL option is the same as the OpenMP Fortran API LASTPRIVATE clause (see Section 6.1.5.2).

PRIVATE or LOCAL Option

Except for the alternate directive spelling of LOCAL, the PRIVATE (or LOCAL) option is the same as the OpenMP Fortran API PRIVATE clause (see Section 6.1.5.2).

REDUCTION Option

Use the REDUCTION option on the PDO directive to declare variables that are to be the object of a reduction operation. Use a comma-separated list to name the variables you want to declare as objects of a reduction.

The REDUCTION option in the Compaq Fortran parallel compiler directive set is different from the REDUCTION clause in the OpenMP Fortran API directive set. In the OpenMP Fortran API directive set, both a variable and an operator type are given. In the Compaq Fortran parallel compiler directive set, the operator is not given in the directive. The compiler must be able to determine the reduction operation from the source code. The REDUCTION option can be applied to a variable in a DO loop only if the variable meets the following criteria:

must be scalar
must be assigned to exactly once in the DO loop
must be read from exactly once in the DO loop and only in the right side of the assignment
the assignment must be one of the following forms:

    x = x operator expr
    x = expr operator x (except for subtraction)
    x = operator(x, expr)
    x = operator(expr, x)

where operator is one of the following supported reduction operations: +, -, *, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, or IOR.

The compiler rewrites the reduction operation by computing partial results into local variables and then combining the results into the reduction variable. The reduction variable must be SHARED in the enclosing context.
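The rewriting described above can be sketched by hand. The following is an illustration only, written in the OpenMP API directive format rather than the Compaq Fortran directive set; the names SUM_BY_PARTIALS and PARTIAL are hypothetical, and the code the compiler actually generates may differ. Each thread accumulates into its own private partial result, and the partial results are then combined into the shared reduction variable one thread at a time.

```fortran
! Sketch only: a hand-written equivalent of a sum reduction.
! SUM_BY_PARTIALS and PARTIAL are illustrative names, not from the manual.
FUNCTION SUM_BY_PARTIALS(A, N) RESULT(TOTAL)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(IN) :: A(N)
  REAL :: TOTAL, PARTIAL
  INTEGER :: I

  TOTAL = 0.0                    ! reduction variable, SHARED in the region
!$OMP PARALLEL PRIVATE(I, PARTIAL) SHARED(A, N, TOTAL)
  PARTIAL = 0.0                  ! each thread's local partial result
!$OMP DO
  DO I = 1, N
    PARTIAL = PARTIAL + A(I)     ! assignment form: x = x operator expr
  END DO
!$OMP END DO NOWAIT
!$OMP CRITICAL
  TOTAL = TOTAL + PARTIAL        ! combine partial results, one thread at a time
!$OMP END CRITICAL
!$OMP END PARALLEL
END FUNCTION SUM_BY_PARTIALS
```

Because the partial results are combined in whatever order the threads arrive, a floating-point sum computed this way can differ slightly from the serial result.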

SHARED or SHARE Option

Except for the alternate directive spelling of SHARE, the SHARED (or SHARE) option is the same as the OpenMP Fortran API SHARED clause (see Section 6.1.5.2).

6.2.6 Parallel Region Construct

The concepts of using a parallel region construct are the same as those for OpenMP Fortran API (see Section 6.1.6). However, the environment variable you use to set the default number of threads is MP_THREAD_COUNT and the run-time library routine is OtsSetNumThreads.

6.2.7 Worksharing Constructs

At the heart of parallel processing is the concept of the worksharing construct. A worksharing construct divides the execution of the enclosed code region among the members of the team created upon entering the enclosing parallel region construct.

A worksharing construct must be enclosed lexically within a parallel region if the worksharing directive is to execute in parallel. No new threads are launched and there is no implied barrier upon entry to a worksharing construct.

The worksharing constructs are:

PDO and END PDO directives (see Section 6.2.7.1)
PSECTIONS, SECTION, and END PSECTIONS directives (see Section 6.2.7.2)
SINGLE PROCESS and END SINGLE PROCESS directives (see Section 6.2.7.3)

6.2.7.1 PDO and END PDO Directives

The PDO directive specifies that the iterations of the immediately following DO loop must be dispatched across the team of threads so that each iteration is executed in parallel by a single thread. The loop that follows a PDO directive cannot be a DO WHILE or a DO loop that does not have loop control. The iterations of the DO loop are divided among and dispatched to the existing threads in the team.

You cannot use a GOTO statement, or any other statement, to transfer control into or out of the PDO construct.

If you specify the optional END PDO directive, it must appear immediately after the end of the DO loop. If you do not specify the END PDO directive, an END PDO directive is assumed at the end of the DO loop.

If you do not specify the optional NOWAIT clause on the END PDO directive, threads synchronize at the END PDO directive. If you specify NOWAIT, threads do not synchronize at the END PDO directive. Threads that finish early proceed directly to the instructions following the END PDO directive.

The PDO directive optionally lets you:

Control data scope attributes (see Section 6.2.5.2)
Specify chunk size
Specify schedule type
Terminate loop execution early
Override implicit synchronization

Specifying Chunk Size

A chunk is a contiguous group of iterations dispatched to a thread. You can explicitly define a chunk size for the current PDO directive by using the CHUNK or BLOCKED option. Chunk size must be a scalar integer expression. The specified chunk size overrides any chunk size specified by an earlier CHUNK directive, and applies only to the current PDO directive.

Refer to Section 6.2.10 for information about how chunk size and schedule type interact.

You can determine the chunk size for the current PDO or PARALLEL DO directive by using the following prioritized list. The available chunk size closest to the top of the list is used:

The chunk size specified in the CHUNK or BLOCKED option of the current PDO or PARALLEL DO directive
The value specified in the most recent CHUNK directive
If the schedule type for the current PDO or PARALLEL DO directive is INTERLEAVED, DYNAMIC, GUIDED, or RUNTIME, the default chunk size specified in the MP_CHUNK_SIZE environment variable
The compiler default chunk size value of one

Specifying Schedule Type

The schedule type specifies a scheduling algorithm that determines how chunks of loop iterations are dispatched to the threads of a team. You can explicitly define a schedule type for the current PDO or PARALLEL DO directive by using the MP_SCHEDTYPE option. The specified schedule type overrides any default schedule type specified by an earlier MP_SCHEDTYPE directive, and applies to the current PDO or PARALLEL DO directive only.

You can determine the schedule type used for the current PDO or PARALLEL DO directive by using the following prioritized list. The available schedule type closest to the top of the list is used:

The schedule type specified in the MP_SCHEDTYPE option of the current PDO or PARALLEL DO directive
The schedule type specified in the most recent MP_SCHEDTYPE directive
If the schedule type for the current PDO or PARALLEL DO directive is RUNTIME, the default value specified in the MP_SCHEDTYPE environment variable
The compiler default schedule type of STATIC

For information about schedule types, see Section 6.2.11.

Another option you can use to affect the way threads are dispatched is the ORDERED option. When you specify this option, iterations are dispatched to threads in the same order they would be for sequential execution.

Terminating Loop Execution Early

If you want to terminate loop execution early because a specified condition has been satisfied, use the PDONE directive. This is an executable directive and any undispatched iterations are not executed. However, all previously dispatched iterations are completed. When the schedule type is STATIC or INTERLEAVED, this directive has no effect because all iterations are dispatched prior to loop execution.

Overriding Implicit Synchronization

Whether or not you include the END PDO directive at the end of the DO loop, by default an implicit synchronization point exists immediately after the last statement in the loop. Threads reaching this point wait until all threads complete their work and reach this synchronization point.

If there are no data dependences between the variables inside the loop and those outside the loop, there may be no reason to make threads wait. In this case, use the NOWAIT clause on the END PDO directive to override synchronization and allow threads to continue.
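As a minimal sketch of this case, the following subroutine runs two independent worksharing loops in one parallel region. It uses the OpenMP API directive format (END DO NOWAIT), which corresponds to NOWAIT on the END PDO directive in the Compaq Fortran directive set; the subroutine name SCALE_TWO and its arguments are hypothetical. The second loop touches only B, so threads that finish the first loop early can start on it without waiting.

```fortran
! Sketch only: SCALE_TWO and its arguments are illustrative names.
SUBROUTINE SCALE_TWO(A, B, N, M)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N, M
  REAL, INTENT(INOUT) :: A(N), B(M)
  INTEGER :: I

!$OMP PARALLEL PRIVATE(I) SHARED(A, B, N, M)
!$OMP DO
  DO I = 1, N
    A(I) = 2.0 * A(I)
  END DO
!$OMP END DO NOWAIT
  ! No barrier above: the next loop reads and writes only B,
  ! so it does not depend on any result of the first loop.
!$OMP DO
  DO I = 1, M
    B(I) = B(I) + 1.0
  END DO
!$OMP END DO
!$OMP END PARALLEL
END SUBROUTINE SCALE_TWO
```

If the second loop read elements of A, removing the synchronization point would no longer be safe.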

6.2.7.2 PSECTIONS, SECTION, and END PSECTIONS Directives

Except for the different PSECTIONS directive name, this directive is the same as the OpenMP Fortran API SECTIONS directive (see Section 6.1.7.2).

6.2.7.3 SINGLE PROCESS and END SINGLE PROCESS Directives

Except for the different SINGLE PROCESS directive name, this directive is the same as the OpenMP Fortran API SINGLE directive (see Section 6.1.7.3).

6.2.8 Combined Parallel/Worksharing Constructs

The combined parallel/worksharing constructs provide an abbreviated way to specify a parallel region that contains a single worksharing construct. The combined parallel/worksharing constructs are:

PARALLEL DO (see Section 6.2.8.1)
PARALLEL SECTIONS (see Section 6.2.8.2)

6.2.8.1 PARALLEL DO and END PARALLEL DO Directives

This directive is the same as the OpenMP Fortran API PARALLEL DO directive with the following exceptions:

You can use the alternate DOACROSS directive name instead of PARALLEL DO
The options can be one or more of the options for the PARALLEL and PDO directives

For information about the OpenMP Fortran API PARALLEL DO directive, see Section 6.1.8.1.

6.2.8.2 PARALLEL SECTIONS and END PARALLEL SECTIONS Directives

This directive is the same as the OpenMP Fortran API PARALLEL SECTIONS directive with the following exception:

The options can be one or more of the options for the PARALLEL and PSECTIONS directives

For more information about the OpenMP Fortran API PARALLEL SECTIONS directive, see Section 6.1.8.2.

6.2.9 Synchronization Constructs

Synchronization refers to the interthread communication that ensures the consistency of shared data and coordinates parallel execution among threads.

Shared data is consistent within a team of threads when all threads obtain the identical value when the data is accessed.

To achieve explicit thread synchronization, you can use:

BARRIER directive (see Section 6.2.9.1)
CRITICAL SECTION directive (see Section 6.2.9.2)

6.2.9.1 BARRIER Directive

The BARRIER directive is the same as the OpenMP Fortran API BARRIER directive (see Section 6.1.9.2).

6.2.9.2 CRITICAL SECTION and END CRITICAL SECTION Directives

The CRITICAL SECTION and END CRITICAL SECTION directives are the same as the OpenMP Fortran API CRITICAL and END CRITICAL directives with the following exceptions:

The directive names are CRITICAL SECTION and END CRITICAL SECTION.
You can specify an optional latch variable name.
If you do not specify a latch variable name, the compiler assigns a unique name.
The END CRITICAL SECTION directive does not take a latch variable name.
You must explicitly initialize a latch variable to zero before any critical section using that latch variable is executed.
You must not reuse that latch variable in anything other than a critical section until all uses as a latch variable are complete.

For additional information about the OpenMP Fortran API CRITICAL directive, see Section 6.1.9.3.

6.2.10 Specifying a Default Chunk Size

To specify a default chunk size, use the CHUNK directive. Chunk size must be a scalar integer expression. Chunk size and schedule type interact as follows:

For the DYNAMIC and INTERLEAVED schedule types, iterations are always dispatched to threads in chunk size groups. If the total number of iterations is not evenly divisible by chunk size, the last group dispatched has fewer iterations.
For the GUIDED schedule type, chunk size is the minimum number of iterations that can be dispatched to a thread. If fewer than chunk size iterations remain, the remaining iterations are dispatched to the next available thread.
For the STATIC schedule type, chunk size is ignored.

You can also specify a chunk size by using the CHUNK option of the PDO or PARALLEL DO directive (see Specifying Chunk Size).

6.2.11 Specifying a Default Schedule Type

To specify a default schedule type, use the MP_SCHEDTYPE directive. The following list describes the schedule types and how the chunk size affects scheduling:

For the STATIC or SIMPLE schedule types, one contiguous group of iterations is dispatched to each thread, with each group having approximately the same number of iterations.
For the INTERLEAVED or INTERLEAVE schedule types, a chunk sized group of iterations is dispatched to each thread in a round-robin manner.
For the DYNAMIC schedule type, a chunk sized group of the remaining iterations is dispatched to the next available thread. If fewer than chunk size iterations remain, all the remaining iterations are dispatched.
For the GUIDED or GSS schedule types (similar to the DYNAMIC schedule type), the number of iterations dispatched is relatively large at the beginning of the loop and decreases exponentially. The number of iterations dispatched is not necessarily evenly divisible by chunk size.
The specified chunk size is the minimum number of iterations that can be dispatched when a thread becomes available. When the number of remaining iterations is less than or equal to chunk size, all of the remaining iterations are dispatched to the next available thread.
In some cases, setting a chunk size greater than 1 improves execution efficiency as the loop nears termination. This is because contention between threads for the small number of remaining iterations is reduced.
For the RUNTIME schedule type, the schedule type and the chunk size are those specified in the MP_SCHEDTYPE environment variable.

The DYNAMIC and GUIDED schedule types introduce some amount of overhead required to manage the continuing dispatching of iterations to threads. However, this overhead is sometimes offset by better load balancing when the average execution time of iterations is not uniform throughout the loop.

The STATIC and INTERLEAVED schedule types dispatch all of the iterations to the threads in advance, with each thread receiving approximately equal numbers of iterations. One of these types is usually the most efficient schedule type when the average execution time of iterations is uniform throughout the loop.

You can also specify a schedule type using the MP_SCHEDTYPE option of the PDO or PARALLEL DO directive (see Specifying Schedule Type).
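To make the round-robin dispatch of the INTERLEAVED schedule type concrete, the following sketch computes which thread would receive each iteration. It is a serial model of the description above, not the run-time library's implementation; the subroutine name INTERLEAVED_OWNERS, the 0-based thread numbering, and the assumption that the first chunk goes to thread 0 are all illustrative.

```fortran
! Sketch only: models round-robin dispatch of chunk sized groups.
! OWNER(I) is the (0-based, assumed) thread that would run iteration I.
SUBROUTINE INTERLEAVED_OWNERS(NITER, CHUNK, NTHREADS, OWNER)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: NITER, CHUNK, NTHREADS
  INTEGER, INTENT(OUT) :: OWNER(NITER)
  INTEGER :: I, ICHUNK

  DO I = 1, NITER
    ICHUNK = (I - 1) / CHUNK           ! 0-based index of the chunk holding I
    OWNER(I) = MOD(ICHUNK, NTHREADS)   ! chunks cycle through the threads
  END DO
END SUBROUTINE INTERLEAVED_OWNERS
```

With 10 iterations, a chunk size of 3, and 2 threads, this model assigns iterations 1-3 and 7-9 to the first thread and iterations 4-6 and 10 to the second; the last group has fewer iterations because 10 is not evenly divisible by 3.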

6.3 Decomposing Loops for Parallel Processing

Note

The following sections contain information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives. The code examples use the OpenMP API directive format.

The term loop decomposition is used to specify the process of dividing the iterations of an iterated DO loop and running them on two or more threads of a shared-memory multi-processor computer system.

To run in parallel, the source code in iterated DO loops must be decomposed by the user, and adequate system resources must be made available. Decomposition is the process of analyzing code for data dependences, dividing up the workload, and ensuring correct results when iterations run concurrently. The only type of decomposition available with Compaq Fortran is directed decomposition using a set of parallel compiler directives.

The following sections describe how to decompose loops and how to use the OpenMP Fortran API and the Compaq Fortran parallel compiler directives to achieve parallel processing.

6.3.1 Directed Decomposition

When a program is compiled using the -omp or the -mp option, the compiler parses the parallel compiler directives. However, you must transform the source code to resolve any loop-carried dependences and improve run-time performance. ¹

To use directed decomposition effectively, take the following steps:

1. Identify the loops that benefit most from parallel processing.
   - Consider whether another algorithm might achieve more parallelism in general.
   - Evaluate any caller or called loops and decompose the most CPU-intensive loops in the application (as long as there are no interfering dependences).
     If a parallel DO loop invokes a subprogram containing another parallel DO loop, only the parallel DO loop of the calling program will be run in parallel. Each of the threads executing the outermost parallel DO loop will execute all of the iterations in the innermost parallel DO loop in a serial, nonparallel fashion.
   - Make sure the loop contains enough CPU work to outweigh the parallel-processing startup overhead.
2. Analyze the loop and resolve dependences as needed (see Section 6.3.1.1). If you cannot resolve loop-carried dependences, you cannot safely decompose the loop.
3. Make sure the shared or private attributes inside the loop are consistent with corresponding use outside the loop. By default, common blocks and individual variables are shared, except for the loop control variable and variables referenced in a subprogram called from within a parallel loop (in which case they are private by default).
4. Precede the loop with the PARALLEL directive followed by the DO directive. You can combine the two directives by using the PARALLEL DO directive.
5. As needed, manually optimize the loop.
6. Make sure the loop complies with restrictions of the parallel-processing environment.
7. Without using the -omp option or the -mp option, compile, test, and debug the program.
8. Using -omp (or -mp), repeat the previous step.
9. Evaluate the parallel run:
   - If you reach an acceptable level of performance and if the results are correct, stop.
   - If the results are inaccurate, analyze the manually decomposed loops for dependences, apply a method to resolve them, and retest the parallel run.
   - If performance is inadequate, consider adjusting the run-time environment (see Section 6.3.1.4) or performing other manual optimizations, or consider other alternatives discussed in this manual. Then reenter the cycle by retesting the parallel program.

6.3.1.1 Resolving Dependences Manually

In directed decomposition, you must resolve loop-carried dependences and dependences involving temporary variables to ensure safe parallel execution. Only cycles of dependences are nearly impossible to resolve.

If a loop contains a cycle of dependences, do one of the following:

Let the loop execute serially (possibly decompose an outer loop level)
Use a lock (CRITICAL) to force the critical section to execute serially
Recode or restructure the loop
Find another algorithm that does not have cycles of dependences

There are several methods for resolving dependences manually:

For dependences on variables used as temporaries, declare them PRIVATE; this effectively makes separate copies of temporary values for each thread.
Recode the loop so that the loop-carried dependence becomes loop independent, with each thread having the involved store and fetch operation contained in a single iteration.
Insert locks (CRITICAL) around the critical section containing the dependence.
Use this technique only for very CPU-intensive loops, when no other method is possible, and for the smallest amount of code possible. The locks extend processing time by making individual threads wait while only one executes the critical region at a time.
Recode loops with cycles of dependences (these are typically linear recurrences).

Resolving Dependences Involving Temporary Variables

Declare temporary variables PRIVATE to resolve dependences involving them. Temporary variables are used in intermediate calculations. If they are used in more than one iteration of a parallel loop, the program can produce incorrect results.

One thread might define a value and another thread use that value instead of the one it defined for a particular iteration. Loop control variables are prime examples of temporary variables that are declared PRIVATE by default within a parallel region. For example:

      DO I = 1,100
        TVAR = A(I) + 2
        D(I) = TVAR + Y(I-1)
      END DO

As long as certain criteria are met, you can resolve this kind of dependence by declaring the temporary variable (TVAR, in the example) PRIVATE. That way, each thread keeps its own copy of the variable.

For the criteria to be met, the values of the temporary variable must be all of the following:

Defined in each iteration, inside the loop
Meant to be used inside the same iteration that established it
Used nowhere outside the loop unless it is redefined outside the loop before subsequent use

The default for variables in a parallel loop is SHARED, so you must explicitly declare these variables PRIVATE to resolve this kind of dependence.
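A minimal sketch of the temporary-variable loop with TVAR declared PRIVATE follows, in the OpenMP API directive format. The subroutine wrapper ADD_WITH_TEMP, the argument N, and the array bounds (in particular Y dimensioned from 0 so that Y(I-1) is defined for I = 1) are assumptions added to make the fragment self-contained.

```fortran
! Sketch only: ADD_WITH_TEMP, N, and the array bounds are illustrative.
SUBROUTINE ADD_WITH_TEMP(A, Y, D, N)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(IN) :: A(N), Y(0:N)
  REAL, INTENT(OUT) :: D(N)
  REAL :: TVAR
  INTEGER :: I

  ! TVAR is defined and used within the same iteration, so giving
  ! each thread its own copy removes the dependence between iterations.
!$OMP PARALLEL DO PRIVATE(I, TVAR) SHARED(A, Y, D, N)
  DO I = 1, N
    TVAR = A(I) + 2
    D(I) = TVAR + Y(I-1)
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE ADD_WITH_TEMP
```

If TVAR were left SHARED, one thread could overwrite it between another thread's assignment and use, and D could receive values computed from the wrong iteration.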

Resolving Loop-Carried Dependences

You can often resolve loop-carried dependences using one or more of the following loop transformations:

Loop alignment
Code replication
Loop distribution
Restructure the loop into an inner and outer loop

These techniques also resolve dependences that inhibit autodecomposition.

Loop Alignment

Loop alignment offsets memory references in the loop so that the dependence is no longer loop carried. The following example shows a loop that is aligned to resolve the dependence in array A.

Loop with Dependence:

      DO I = 2,N
        A(I) = B(I)
        C(I) = A(I+1)
      END DO

Aligned Statements:

      C(I-1) = A(I)
      A(I) = B(I)

To compensate for the alignment and achieve the same calculations as the original loop, you probably have to perform one or more of the following:

Change the loop control variable.
Add IF constructs.
Switch the order of the statements (this preserves the relative store-fetch order of the original loop).

Example 6-1 shows two possible forms of the final loop.

Example 6-1 Aligned Loop

! First possible form:
!$OMP PARALLEL PRIVATE (I)
!$OMP DO
      DO I = 2,N+1
        IF (I .GT. 2) C(I-1) = A(I)
        IF (I .LE. N) A(I) = B(I)
      END DO
!$OMP END DO
!$OMP END PARALLEL
!
! Second possible form; more efficient because the tests are
! performed outside the loop:
!
!$OMP PARALLEL
!$OMP DO
      DO I = 3,N
        C(I-1) = A(I)
        A(I) = B(I)
      END DO
!$OMP END DO
!$OMP END PARALLEL
      IF (N .GE. 2) THEN
        A(2) = B(2)
        C(N) = A(N+1)
      END IF
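One way to gain confidence in an alignment is to run the original serial loop and the aligned loop on the same data and compare the results. The following sketch does this for the second form of the aligned loop; the subroutine names ORIGINAL_LOOP and ALIGNED_LOOP and the array bounds (A dimensioned N+1 so that A(N+1) is defined) are assumptions added for illustration.

```fortran
! Sketch only: subroutine names and array bounds are illustrative.
SUBROUTINE ORIGINAL_LOOP(A, B, C, N)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(INOUT) :: A(N+1), C(N)
  REAL, INTENT(IN) :: B(N)
  INTEGER :: I
  DO I = 2, N               ! serial loop with the loop-carried dependence
    A(I) = B(I)
    C(I) = A(I+1)
  END DO
END SUBROUTINE ORIGINAL_LOOP

SUBROUTINE ALIGNED_LOOP(A, B, C, N)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(INOUT) :: A(N+1), C(N)
  REAL, INTENT(IN) :: B(N)
  INTEGER :: I
!$OMP PARALLEL DO PRIVATE(I) SHARED(A, B, C, N)
  DO I = 3, N               ! fetch and store of A(I) now share one iteration
    C(I-1) = A(I)
    A(I) = B(I)
  END DO
!$OMP END PARALLEL DO
  IF (N .GE. 2) THEN        ! statements peeled off by the alignment
    A(2) = B(2)
    C(N) = A(N+1)
  END IF
END SUBROUTINE ALIGNED_LOOP
```

Calling both subroutines with identical copies of A, B, and C should leave identical contents in A and C.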

Code Replication

When a loop contains a loop-independent dependence as well as a loop-carried dependence, loop alignment alone is usually not adequate. By resolving the loop-carried dependence, you often misalign another dependence. Code replication creates temporary variables that duplicate operations and keep the loop-independent dependences inside each iteration.

In S₂ of the following loop, aligning the A(I-1) reference without code replication would misalign the A(I) reference:

Loop with Multiple Dependences:

      DO I = 2,100
S₁      A(I) = B(I) + C(I)
S₂      D(I) = A(I) + A(I-1)
      END DO

Misaligned Dependence:

      D(I-1) = A(I-1) + A(I)
      A(I) = B(I) + C(I)

Example 6-2 uses code replication to keep the loop-independent dependence inside each iteration. The temporary variable, TA, must be declared PRIVATE.

Example 6-2 Transformed Loop Using Code Replication

!$OMP PARALLEL PRIVATE (I,TA)
      A(2) = B(2) + C(2)
      D(2) = A(2) + A(1)
!$OMP DO
      DO I = 3,100
        A(I) = B(I) + C(I)
        TA = B(I-1) + C(I-1)
        D(I) = A(I) + TA
      END DO
!$OMP END DO
!$OMP END PARALLEL

Loop Distribution

Loop distribution allows more parallelism when neither loop alignment nor code replication can resolve the dependences. Loop distribution divides the contents of loops into multiple loops so that dependences cross between two separate loops. The loops run serially in relation to each other, even if they both run in parallel.

The following loop contains multiple dependences that cannot be resolved by either loop alignment or code replication:

      DO I = 1,100
S₁      A(I) = A(I-1) + B(I)
S₂      C(I) = B(I) - A(I)
      END DO

Example 6-3 resolves the dependences by distributing the loop. S₂ can run in parallel despite the data recurrence in S₁.

Example 6-3 Distributed Loop

      DO I = 1,100
S₁      A(I) = A(I-1) + B(I)
      END DO
      DO I = 1,100
S₂      C(I) = B(I) - A(I)
      END DO
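As a sketch of how the distributed loops might be run, the following subroutine leaves the recurrence loop serial and applies a PARALLEL DO directive (OpenMP API format) to the second loop only. The subroutine name DISTRIBUTED, the argument N, and the array bounds (A dimensioned from 0 so that A(I-1) is defined for I = 1) are assumptions added for illustration.

```fortran
! Sketch only: DISTRIBUTED, N, and the array bounds are illustrative.
SUBROUTINE DISTRIBUTED(A, B, C, N)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(INOUT) :: A(0:N)
  REAL, INTENT(IN) :: B(N)
  REAL, INTENT(OUT) :: C(N)
  INTEGER :: I

  ! First loop: the recurrence on A, executed serially.
  DO I = 1, N
    A(I) = A(I-1) + B(I)
  END DO

  ! Second loop: every A(I) is final by now, so iterations are independent.
!$OMP PARALLEL DO PRIVATE(I) SHARED(A, B, C, N)
  DO I = 1, N
    C(I) = B(I) - A(I)
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE DISTRIBUTED
```

The dependence from the first loop to the second is satisfied because the first loop completes before the parallel region begins.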

Restructuring a Loop into an Inner and Outer Nest

Restructuring a loop into an inner and outer loop nest can resolve some recurrences that are used as rapid approximations of a function of the loop control variable. For example, the following loop uses sines and cosines:

      THETA = 2.*PI/N
      DO I=0,N-1
        S = SIN(I*THETA)
        C = COS(I*THETA)
        .
        .   ! use S and C
        .
      END DO

Using a recurrence to approximate the sines and cosines can make the serial loop run faster (with some loss of accuracy), but it prevents the loop from running in parallel:

Note

¹ Another method of supporting parallel processing does not involve iterated DO loops. Instead, it allows large amounts of independent code to be run in parallel using the SECTIONS and SECTION directives.
