Compaq Fortran
User Manual for
Tru64 UNIX and Linux Alpha Systems




      THETA = 2.*PI/N 
      STH = SIN(THETA) 
      CTH = COS(THETA) 
      S = 0.0 
      C = 1.0 
      DO I=0,N-1 
        . 
        .           ! use S and C 
        . 
        S = S*CTH + C*STH 
        C = C*CTH - S*STH 
      END DO 

To resolve the dependences, you can replace the recurrence with direct SIN and COS calls (however, this loses the performance improvement gained from using the recurrence). Alternatively, you can restructure the loop into an outer parallel loop and an inner serial loop. Each iteration of the outer loop reinitializes the recurrence, and the inner loop uses the values:


!$OMP PARALLEL SHARED (THETA,STH,CTH,LCHUNK) PRIVATE (ISTART,I,S,C) 
      THETA = 2.*PI/N 
      STH = SIN(THETA) 
      CTH = COS(THETA) 
      LCHUNK = (N + NWORKERS()-1) / NWORKERS() 
!$OMP DO 
      DO ISTART = 0,N-1,LCHUNK 
        S = SIN(ISTART*THETA) 
        C = COS(ISTART*THETA) 
        DO I = ISTART, MIN(N-1,ISTART+LCHUNK-1) 
          . 
          .         ! use S and C 
          . 
          S = S*CTH + C*STH 
          C = C*CTH - S*STH 
        END DO 
      END DO 
!$OMP END DO 
!$OMP END PARALLEL 

Dependences Requiring Locks

When no other method can resolve a dependence, you can put locks around the critical section that contains it. Locks force threads to execute the critical section serially, while allowing the rest of the loop to run in parallel.

However, locks degrade performance because they force the critical section to run serially and increase the overhead. They are best used only when no other technique resolves the dependence, and only in CPU-intensive loops.

To create locks in a loop, enclose the critical section between the CRITICAL and END CRITICAL directives. When a thread executes the CRITICAL directive and the latch variable is open, it takes possession of the latch variable, and other threads must wait to execute the section. The latch variable becomes open when the thread executing the section executes the END CRITICAL directive.

The latch variable is closed when a thread has possession of it and open when the latch variable is free.

In Example 6-4, the statement updating the sum is locked for safe parallel execution of the loop.

Example 6-4 Decomposed Loop Using Locks

      INTEGER(4) LCK 
!$OMP PARALLEL PRIVATE (I,Y) SHARED (LCK,SUM) 
      LCK = 0 
      . 
      . 
      . 
!$OMP DO 
      DO I = 1,1000 
         Y = some_calculation 
!$OMP CRITICAL (LCK) 
         SUM = SUM + Y 
!$OMP END CRITICAL (LCK) 
      END DO 
!$OMP END DO 
!$OMP END PARALLEL 

This particular example is better solved using a reduction clause as shown in Example 6-5.

Example 6-5 Decomposed Loop Using a Reduction Clause

!$OMP PARALLEL PRIVATE (I,Y) SHARED (SUM) 
      . 
      . 
      . 
!$OMP DO REDUCTION (+:SUM) 
      DO I = 1,1000 
         Y = some_calculation 
         SUM = SUM + Y 
      END DO 
!$OMP END DO 
!$OMP END PARALLEL 

6.3.1.2 Coding Restrictions

Because iterations in a parallel DO loop execute in an indeterminate order and in different threads, certain constructs in these loops can cause unpredictable run-time behavior.

The following restrictions are flagged:

The following restrictions are not flagged:

6.3.1.3 Manual Optimization

To manually optimize structures containing parallel loops:

Interchanging Loops

The following example shows a case in which an inner loop can run in parallel and an outer loop cannot, because of a loop-carried dependence. The inner loop also has a more effective memory-referencing pattern for parallel processing than the outer loop. By interchanging the loops, more work executes in parallel and the cache can perform more efficiently.
Original Structure:

      DO I = 1,100 
        DO J = 1,300 
          A(I,J) = A(I+1,J) + 1 
        END DO 
      END DO 

Interchanged Structure:

!$OMP PARALLEL PRIVATE (J,I) SHARED (A) 
!$OMP DO 
      DO J = 1,300 
        DO I = 1,100 
          A(I,J) = A(I+1,J) + 1 
        END DO 
      END DO 
!$OMP END DO 
!$OMP END PARALLEL 

Balancing the Workload

On the DO directive, you can specify the SCHEDULE(GUIDED) clause to use guided self-scheduling in manually decomposed loops, which is effective for most loops. However, when the iterations contain a predictably unbalanced workload, you can obtain better performance by manually balancing the workload. To do this, specify the chunk size in the SCHEDULE clause of the DO directive.

In the following loop, it might be very inefficient to divide the iterations into chunks of 50. A chunk size of 25 would probably be much more efficient on a system with two processors, depending on the amount of work being done by the routine SUB.


DO I = 1,100 
   .
   .
   .
   IF (I .LT. 50) THEN 
       CALL SUB(I) 
   END IF 
   .
   .
   .
END DO 
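Using the SCHEDULE clause, the loop above can be decomposed with an explicit chunk size. The following is a minimal sketch (the routine SUB and the chunk size of 25 are taken from the discussion above; the rest of the loop body is elided):

```fortran
!$OMP PARALLEL PRIVATE (I)
!$OMP DO SCHEDULE(STATIC,25)
      DO I = 1,100
         IF (I .LT. 50) THEN
             CALL SUB(I)     ! only the first half of the iterations do work
         END IF
      END DO
!$OMP END DO
!$OMP END PARALLEL
```

With a chunk size of 25 on a two-processor system, each thread receives both a busy chunk and an idle chunk, so the work is spread more evenly than with the default chunk size of 50.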

6.3.1.4 Adjusting the Run-Time Environment

The OpenMP Fortran API and the Compaq Fortran parallel compiler directive sets also provide environment variables that adjust the run-time environment in unusual situations.

Regardless of whether you used the -omp or the -mp compiler option, when the run-time system needs information supplied by an environment variable, it first looks for an OpenMP Fortran API environment variable and then for a Compaq Fortran parallel compiler environment variable. If neither one is found, a default is used.

The run-time system looks for environment variable information in the following situations:

The OpenMP Fortran API environment variables are listed in Table 6-4.

Table 6-4 OpenMP Fortran API Environment Variables
Environment Variable1 Interpretation
OMP_SCHEDULE
  This variable applies only to DO and PARALLEL DO directives that have the schedule type of RUNTIME. You can set the schedule type and an optional chunk size for these loops at run time. The schedule types are STATIC, DYNAMIC, GUIDED, and RUNTIME.

For directives that have a schedule type other than RUNTIME, this variable is ignored. The compiler default schedule type is STATIC. If the optional chunk size is not set, a chunk size of one is assumed, except for the STATIC schedule type. For this schedule type, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.

OMP_NUM_THREADS
  Use this environment variable to set the number of threads to use during execution. This number applies unless you explicitly change it by calling the OMP_SET_NUM_THREADS run-time library routine.

When you have enabled dynamic thread adjustment, the value assigned to this environment variable represents the maximum number of threads that can be used. The default value is the number of processors in the current system. For more information about dynamic thread adjustment, see the online release notes.

OMP_DYNAMIC
  Use this environment variable to enable or disable dynamic thread adjustment for the execution of parallel regions. When set to TRUE, the number of threads used can be adjusted by the run-time environment to best utilize system resources. When set to FALSE, dynamic adjustment is disabled. The default is FALSE. For more information about dynamic thread adjustment, see the online release notes.
OMP_NESTED
  Use this environment variable to enable or disable nested parallelism. When set to TRUE, nested parallelism is enabled. When set to FALSE, it is disabled. The default is FALSE. For more information about nested parallelism, see the online release notes.


1Environment variable names must be in uppercase; the assigned values are not case-sensitive.
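For example, to select the DYNAMIC schedule with a chunk size of 4 for RUNTIME-scheduled loops and limit execution to four threads, you might set the variables as follows before running the program (the values are illustrative; sh/ksh syntax is shown, use setenv under csh):

```shell
# Schedule type and chunk size for loops compiled with SCHEDULE(RUNTIME)
export OMP_SCHEDULE="DYNAMIC,4"
# Use at most four threads
export OMP_NUM_THREADS=4
# Keep the thread count fixed during execution
export OMP_DYNAMIC=FALSE
```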

The Compaq Fortran parallel compiler environment variables are listed in Table 6-5.

Table 6-5 Compaq Fortran Environment Variables
Environment Variable Interpretation
MP_THREAD_COUNT
  Specifies the number of threads the run-time system is to create. The default is the number of processors available to your process.
MP_CHUNK_SIZE
  Specifies the chunk size the run-time system uses when dispatching loop iterations to threads if the program specified the RUNTIME schedule type or specified another schedule type requiring a chunk size, but omitted the chunk size. The default chunk size is 1.
MP_STACK_SIZE
  Specifies how many bytes of stack space the runtime system allocates for each thread when creating it. If you specify zero, the runtime system uses the default, which is very small. Therefore, if a program declares any large arrays to be PRIVATE, specify a value large enough to allocate them. If you do not use this environment variable at all, the runtime system allocates 5 MB.
MP_SPIN_COUNT
  Specifies how many times the runtime system spins while waiting for a condition to become true. The default is 16,000,000, which is approximately one second of CPU time.
MP_YIELD_COUNT
  Specifies how many times the runtime system alternates between calling sched_yield and testing the condition before going to sleep by waiting for a thread condition variable. The default is 10.
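The Compaq-specific variables are set the same way. As a sketch (the values are illustrative, not recommendations):

```shell
# 8 MB of stack per thread, enough for moderately large PRIVATE arrays
export MP_STACK_SIZE=8000000
# Spin for fewer iterations before sleeping, e.g. on a heavily loaded system
export MP_SPIN_COUNT=1000000
```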

6.4 Calls to Programs Written in Other Languages

Note

The following sections contain information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives.

Only programs written in Compaq Fortran support parallel directives. Any procedures or routines called from within a parallel region in a Compaq Fortran program must consider the following:

6.5 Compiling, Linking, and Running Parallelized Programs

Note

The following sections contain information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives.

Whether you compile and link your program in one step or in separate steps, you must invoke the f90 Compaq Fortran driver with either the -omp option (to use the OpenMP Fortran API directives) or the -mp option (to use the Compaq Fortran parallel compiler directives) on each command line. For example, to compile and link the program prog.f in one step, use the command:


% f90 -omp prog.f -o prog

To separately compile and link the program prog.f, use these commands:


% f90 -c -omp prog.f
 
% f90 -omp prog.o -o prog

To run your program, use the command:


% prog

When you use the -omp (or -mp ) option, the driver sets the -reentrancy threaded and the -automatic options for the compiler if you did not specify them on the command line. The options are not set if you used the negated forms of the options on the command line. The driver also sets the -pthread and -lots3 options for the linker.

6.6 Debugging Parallelized Programs

Note

The following sections contain information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives.

When a Compaq Fortran program uses parallel decomposition directives, there are some special considerations concerning how that program can be debugged. Subsequent sections describe these special considerations and discuss approaches to some of the unique problems of debugging parallel programs.

When a bug occurs in a Compaq Fortran program that uses parallel decomposition directives, it may be caused by incorrect Compaq Fortran statements, or it may be caused by incorrect parallel decomposition directives. In either case, the program to be debugged can be executed by multiple threads simultaneously.

OpenMP Fortran API and Compaq Fortran parallel compiler directives are fully supported by the f90 compiler. However, some of the features used to implement OpenMP are not yet fully supported by the debuggers, so it is important to understand how these features work in order to debug them. The two problem areas are:

Available Debuggers

Debuggers such as the Compaq Ladebug debugger provide features that support the debugging of programs that are executed by multiple threads. However, the currently available versions of Ladebug do not directly support the debugging of parallel decomposition directives, and therefore, there are limitations on the debugging features.

Other debuggers are available for use on UNIX. Before attempting to debug programs containing parallel decomposition directives, determine what level of support the debugger provides for these directives by reading the documentation or by contacting the supplier of the debugger.

6.6.1 Parallel Regions

The compiler implements a parallel region by taking the code in the region and putting it into a separate, compiler-created subroutine. This process is called outlining because it is the inverse of inlining a subroutine into its call site.

In place of the parallel region, the compiler inserts a call to a run-time library routine, which starts up threads and causes them to call the outlined routine. As threads return from the outlined routine, they return to the run-time library, which waits for all threads to finish before returning to the master thread in the original program.

Example 6-6 contains a section of the source listing with machine code (produced using f90 -omp -V -machine_code). Note that the original program unit was named outline_example and the parallel region was at line 2. The compiler created an outlined routine called _2_outline_example_. In general, the outlined routine is named _line-number_original-routine-name_.

Example 6-6 Code Using Parallel Region

OUTLINE_EXAMPLE                 Source Listing 
 
       1  program outline_example 
       2 !$omp parallel 
       3  print *, 'hello world' 
       4 !$omp end parallel 
       5  print *, 'done' 
       6  end 
 
OUTLINE_EXAMPLE                 Machine Code Listing 
 
    .text 
    .ent _2_outline_example_ 
    .eflag 16 
             0000 _2_outline_example_:   
27BB0001     0000  ldah gp, _2_outline_example_ 
23BD8180     0004  lda gp, _2_outline_example_ 
23DEFFA0     0008  lda sp, -96(sp) 
B75E0000     000C  stq r26, (sp) 
    .mask 0x04000000,-96 
    .fmask 0x00000000,0 
    .frame  $sp, 96, $26 
    .prologue 1 
A45D8040     0010  ldq r2, 48(gp)  
A77D8020     0014  ldq r27, for_write_seq_lis 
63FF0000     0018  trapb 
47E17400     001C  mov 11, r0 
265F0385     0020  ldah r18, 901(r31) 
A67D8018     0024  ldq r19, 8(gp) 
B3FE0008     0028  stl r31, var$0001  
221E0008     002C  lda r16, var$0001  
B41E0048     0030  stq r0, 72(sp) 
47E0D411     0034  mov 6, r17 
B45E0050     0038  stq r2, 80(sp) 
2252FF00     003C  lda r18, -256(r18) 
229E0048     0040  lda r20, 72(sp) 
6B5B4000     0044  jsr r26, for_write_seq_lis 
27BA0001     0048  ldah gp, _2_outline_example_ 
23BD8180     004C  lda gp, _2_outline_example_ 
A75E0000     0050  ldq r26, (sp) 
63FF0000     0054  trapb 
23DE0060     0058  lda sp, 96(sp) 
6BFA8001     005C  ret (r26) 
    .end _2_outline_example_ 
 
Routine Size: 96 bytes,    Routine Base: $CODE$ + 0000 
 
    .globl  outline_example_ 
    .ent outline_example_ 
    .eflag 16 
      0060 outline_example_: 
27BB0001     0060  ldah gp, outline_example_ 
23BD8180     0064  lda gp, outline_example_ 
A77D8038     0068  ldq r27, for_set_reentrancy 
23DEFFA0     006C  lda sp, -96(sp) 
A61D8010     0070  ldq r16, (gp) 
B75E0000     0074  stq r26, (sp) 
    .mask 0x04000000,-96 
    .fmask 0x00000000,0 
    .frame  $sp, 96, $26 
    .prologue 1 
6B5B4000     0078  jsr r26, for_set_reentrancy  
27BA0001     007C  ldah gp, outline_example_  
23BD8180     0080  lda gp, outline_example_  
47FE0411     0084  mov sp, r17 
A77D8028     0088  ldq r27, _OtsEnterParallelOpenMP 
A61D8030     008C  ldq r16, _2_outline_example_ 
47FF0412     0090  clr r18 
6B5B4000     0094  jsr r26, _OtsEnterParallelOpenMP 
27BA0001     0098  ldah gp, outline_example_  
47E09401     009C  mov 4, r1    
23BD8180     00A0  lda gp, outline_example_  
265F0385     00A4  ldah r18, 901(r31) 
A47D8018     00A8  ldq r3, 8(gp) 
A77D8020     00AC  ldq r27, for_write_seq_lis  
A67D8018     00B0  ldq r19, 8(gp) 
221E0008     00B4  lda r16, var$0001   
20630008     00B8  lda r3, 8(r3) 
B3FE0008     00BC  stl r31, var$0001   
B43E0048     00C0  stq r1, 72(sp) 
47E0D411     00C4  mov 6, r17 
B47E0050     00C8  stq r3, 80(sp) 
2252FF00     00CC  lda r18, -256(r18) 
229E0048     00D0  lda r20, 72(sp) 
6B5B4000     00D4  jsr r26, for_write_seq_lis  
27BA0001     00D8  ldah gp, outline_example_  
A75E0000     00DC  ldq r26, (sp)   
23BD8180     00E0  lda gp, outline_example_  
47E03400     00E4  mov 1, r0    
23DE0060     00E8  lda sp, 96(sp) 
6BFA8001     00EC  ret (r26) 
    .end   outline_example_ 

In the preceding example, the run-time library routine _OtsEnterParallelOpenMP is responsible for creating threads (if they have not already been created) and causing them to call the outlined routine. The outlined routine is called once by each thread.

Debugging the program at this level is just like debugging a program that uses POSIX threads directly. Breakpoints can be set in the outlined routine just like in any other routine (leave off the trailing underscore; because all Compaq Fortran routine names end with a trailing underscore, the debugger inserts it automatically).

6.6.2 Shared Variables

When a variable appears in a PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION clause on a directive, the variable is made private to the parallel region by redeclaring it in the outlined routine. SHARED data, however, is not declared in the outlined routine; instead, it gets its declaration from the parent routine.

When in a debugger, you can switch from one thread to another. Each thread has its own program counter so each thread can be in a different place in the code. Example 6-7 shows a Ladebug session.

Example 6-7 Code Using Multiple Threads

% ladebug a.out
Welcome to the Ladebug Debugger Version 4.0-xx 
------------------ 
object file name: a.out 
Reading symbolic information ...done 
(ladebug) stop in _2_outline_example
[#1: stop in subroutine _2_outline_example() ] 
(ladebug) run
[1] stopped at [_2_outline_example:2 0x120002d14] 
      2 !$omp parallel 
(ladebug) show thread
 
Thread State      Substate        Policy     Priority Name 
------ ---------- --------------- ---------- -------- ------------- 
>*   1 running                    throughput 11       default thread 
    -1 blocked    kernel          fifo       32       manager thread 
    -2 ready                      idle        0       null thread for VP 0x0 
     2 ready      not started     throughput 11       <anonymous> 
     3 ready      not started     throughput 11       <anonymous> 
     4 ready      not started     throughput 11       <anonymous> 
     5 ready      not started     throughput 11       <anonymous> 
     6 ready      not started     throughput 11       <anonymous> 
(ladebug) 
 

Thread 1 is the master thread. Do not confuse debugger thread numbers with OpenMP thread numbers: the compiler numbers threads beginning at zero, but the debugger numbers threads beginning at 1. There are also two extra threads in the debugging process, numbered -1 and -2, for use by the kernel.

Thread 1 has started running and is currently stopped just inside the outlined routine. The other threads have not started running because the example session was run on a uniprocessor workstation. On a multiprocessor, the other threads can run on different processors; you can switch between threads and examine their stacks as shown in Example 6-8.

Example 6-8 Code Using Multiple Processors

(ladebug) thread 2
 
Thread State      Substate        Policy     Priority Name 
------ ---------- --------------- ---------- -------- ------------- 
>    2 ready      not started     throughput 11       <anonymous> 
 
(ladebug) where
>0  0x3ff805739e0 in thdBase(0x14005d7d0, 0x0, 0x0, 0x120003c20, 0x4, 0x0) 
(ladebug) thread 1
Thread State      Substate        Policy     Priority Name 
------ ---------- --------------- ---------- -------- ------------- 
>*   1 running                    throughput 11       default thread 
(ladebug) where
>0  0x120002d14 in _2_outline_example() omp_hello.f:2 
#1  0x12000495c in _OtsEnterParallelOpenMP() 
#2  0x120002d98 in outline_example() omp_hello.f:1 
#3  0x120002ccc in main() for_main.c:203 
(ladebug) 

Thread 2 has not yet started and is reported as being in thdBase, a POSIX run-time support routine that threads run when they are created. Thread 1 is the master thread and is currently executing the outlined routine, called from the run-time library, which was called from the original program.

Note that only the master thread (thread 1) has a full call tree. The other threads have thdBase(), from which they call the outlined routine. If you want to look at variables higher on the call stack than the parallel region, you must first tell the debugger to switch to thread 1, and then use the up command to climb the call stack.

If SHARED data is in common blocks, the outlined routine accesses it the same way any other routine would. If the SHARED data is automatic storage associated with the routine where the parallel region appears, however, each thread has a pointer to the master thread stack when the parallel region is reached.

Variables on the master stack can be accessed through the pointer. The compiler handles this automatically and describes the access in the symbol table, but Ladebug and TotalView (TM) currently do not support this uplevel access mechanism.

Example 6-9 makes this clearer.

Example 6-9 Code Using Shared Variables

UPLEVEL                         Source Listing                   
 
      1       program uplevel 
      2       implicit none 
      3       integer i 
      4 
      5 !$omp parallel 
      6 !$omp atomic 
      7       i = i + 1 
      8 !$omp end parallel 
      9 
     10       print *, i 
     11       end 
 
UPLEVEL                         Machine Code Listing 
 
    .text 
    .ent _5_uplevel_ 
    .eflag 16 
      0000 _5_uplevel_: 
23DEFFC0     0000  lda  sp, -64(sp) 
    .frame  $sp, 64, $26 
    .prologue 0 
47E10402     0004  mov  r1, __StaticLink.1 # r1, r2 
63FF0000     0008  trapb 
20620010     000C  lda  r3, 16(r2) 
      0010  L$3: 
A8230000     0010  ldl_l r1, (r3) 
40203000     0014  addl r1, 1, r0 
B8030000     0018  stl_c r0, (r3) 
E4000003     001C  beq r0, L$4 
63FF0000     0020  trapb 
23DE0040     0024  lda sp, 64(sp) 
6BFA8001     0028  ret (r26) 
      002C  L$4: 
C3FFFFF8     002C br L$3 
     .end _5_uplevel_ 
 
Routine Size: 48 bytes,    Routine Base: $CODE$ + 0000 
 
     .globl  uplevel_ 
     .ent uplevel_ 
     .eflag 16 
       0030 uplevel_: 
27BB0001     0030  ldah gp, uplevel_ # gp, (r27) 
23BD8130     0034  lda gp, uplevel_ # gp, (gp) 
23DEFFA0     0038  lda sp, -96(sp) 
B75E0000     003C  stq r26, (sp) 
     .mask 0x04000000,-96 
     .fmask 0x00000000,0 
     .frame  $sp, 96, $26 
     .prologue 1 
A61D8010     0040  ldq r16, (gp) 
A77D8038     0044  ldq r27, for_set_reentrancy # r27, 40(gp) 
6B5B4000     0048  jsr r26, for_set_reentrancy # r26, (r27) 
27BA0001     004C  ldah gp, uplevel_ # gp, (r26) 
23BD8130     0050  lda gp, uplevel_ # gp, (gp) 
A61D8030     0054  ldq r16, _5_uplevel_ # r16, 32(gp) 
47FE0411     0058  mov sp, r17 
47FF0412     005C  clr r18 
A77D8028     0060  ldq r27, _OtsEnterParallelOpenMP # r27, 24(gp) 
6B5B4000     0064  jsr r26, _OtsEnterParallelOpenMP # r26, (r27) 
27BA0001     0068  ldah gp, uplevel_ # gp, (r26) 
23BD8130     006C  lda gp, uplevel_ # gp, (gp) 
B3FE0018     0070  stl r31, var$0001 # r31, 24(sp) 
A67D8018     0074  ldq r19, 8(gp) 
203E0010     0078  lda r1, I # r1, 16(sp) 
B43E0058     007C  stq r1, 88(sp) 
221E0018     0080  lda r16, var$0001 # r16, 24(sp) 
47E0D411     0084  mov 6, r17 
265F0385     0088  ldah r18, 901(r31) 
2252FF00     008C  lda r18, -256(r18) 
229E0058     0090  lda r20, 88(sp) 
A77D8020     0094  ldq r27, for_write_seq_lis # r27, 16(gp) 
6B5B4000     0098  jsr r26, for_write_seq_lis # r26, (r27) 
27BA0001     009C  ldah gp, uplevel_ # gp, (r26) 
23BD8130     00A0  lda gp, uplevel_ # gp, (gp) 
47E03400     00A4  mov 1, r0 
A75E0000     00A8  ldq r26, (sp) 
23DE0060     00AC  lda sp, 96(sp) 
6BFA8001     00B0  ret (r26) 
     .end uplevel_ 
 
Routine Size: 132 bytes,    Routine Base: $CODE$ + 0030 

Note that in this example the main routine keeps the variable i at offset 16 from the stack pointer. The stack pointer is passed into _OtsEnterParallelOpenMP, which puts it into register r1 before calling _5_uplevel_. Each thread addresses the shared i indirectly through this pointer.

Because the debuggers have not yet been adjusted to understand uplevel addressing, the variable i does not appear to be declared in the outlined region, only in the original routine. To look at the value of the shared variable, you must switch to the master thread and then get into the appropriate context. This is shown in Example 6-10.

Example 6-10 Code Looking at a Shared Variable Value

% ladebug a.out 
Welcome to the Ladebug Debugger Version 4.0-xx
------------------ 
object file name: a.out 
Reading symbolic information ...done 
(ladebug) stop in _5_uplevel
[#1: stop in subroutine _5_uplevel() ] 
(ladebug) run
[1] stopped at [_5_uplevel:5 0x120002cd8] 
      5 !$omp parallel 
(ladebug) where
>0  0x120002cd8 in _5_uplevel() omp_uplevel.f:5 
#1  0x1200048ec in _OtsEnterParallelOpenMP 
#2  0x120002d34 in uplevel() omp_uplevel.f:1 
#3  0x120002c9c in main() for_main.c:203 
(ladebug) p i
0 
(ladebug) c
[1] stopped at [_5_uplevel:5 0x120002cd8] 
      5 !$omp parallel 
(ladebug) show thread
 
Thread State      Substate        Policy     Priority Name 
------ ---------- --------------- ---------- -------- ------------- 
     1 ready                      throughput 11       default thread 
    -1 blocked    kernel          fifo       32       manager thread 
    -2 ready                      idle        0       null thread for VP 0x0 
>*   2 running                    throughput 11       <anonymous> 
     3 ready      not started     throughput 11       <anonymous> 
     4 ready      not started     throughput 11       <anonymous> 
     5 ready      not started     throughput 11       <anonymous> 
     6 ready      not started     throughput 11       <anonymous> 
 
(ladebug) p i
Error: no value for symbol I 
Error: no value for i 
(ladebug) thread 1
Thread State      Substate        Policy     Priority Name 
------ ---------- --------------- ---------- -------- ------------- 
>    1 ready                      throughput 11       default thread 
 
(ladebug) where
>0  0x12000493c in _OtsEnterParallelOpenMP 
#1  0x120002d34 in uplevel() omp_uplevel.f:1 
#2  0x120002c9c in main() for_main.c:203 
(ladebug) p i
1 
(ladebug) c
[1] stopped at [_5_uplevel:5 0x120002cd8] 
      5 !$omp parallel 
(ladebug) show thread
 
Thread State      Substate        Policy     Priority Name 
------ ---------- --------------- ---------- -------- ------------- 
     1 ready                      throughput 11       default thread 
    -1 blocked    kernel          fifo       32       manager thread 
    -2 ready                      idle        0       null thread for VP 0x0 
     2 ready                      throughput 11       <anonymous> 
>*   3 running                    throughput 11       <anonymous> 
     4 ready      not started     throughput 11       <anonymous> 
     5 ready      not started     throughput 11       <anonymous> 
     6 ready      not started     throughput 11       <anonymous> 
 
(ladebug) where
>0  0x120002cd8 in _5_uplevel() omp_uplevel.f:5 
#1  0x120003d90 in slave_main(arg=2) ots_parallel.bli:859 
#2  0x3ff80573ea4 in thdBase(0x0, 0x0, 0x0, 0x1, 0x45586732, 0x3) 
    DebugInformationStrippedFromFile101 
(ladebug) p i
Error: no value for symbol I 
Error: no value for i 
(ladebug) thread 1
Thread State      Substate        Policy     Priority Name 
------ ---------- --------------- ---------- -------- ------------- 
>    1 ready                      throughput 11       default thread 
 
(ladebug) up
>1  0x120002d34 in uplevel() omp_uplevel.f:1 
      1       program uplevel 
(ladebug) p i
2 
(ladebug) q
% 

