Compaq KAP Fortran/OpenMP
for Tru64 UNIX
User Guide

8.5 Conditions Inhibiting Inlining/IPA

This section lists conditions that inhibit the inlining of subroutines and functions, whether from a library or source file. Many constructs that prevent inlining will also stop or restrict IPA.

See Section 8.4.6 for other notes on the use of the -inline and -ipa command switches and directives.

Conditions that inhibit inlining:

Dummy and actual arguments are mismatched in type or class.
Missing dummy parameter.
Missing actual parameter and the corresponding dummy parameter is an array.
Actual parameter is a nonscalar expression (for example, A+B, where A and B are arrays).
Number of actual parameters differs from the number of dummy parameters.
The size of an array actual parameter differs from the array dummy parameter AND the arrays cannot be linearized.
Subprogram is too long ( > 600 lines).
Subprogram has a SAVE statement.
Subprogram has variables that are live-on-entry, even if they are not in explicit SAVE statements.
Subprogram has a DATA statement and DATA implies SAVE and the variable is live-on-entry.
Calling routine and called routine have mismatched COMMON declarations.
Called routine has EQUIVALENCE statements. (Some of these can be handled.)
Called routine has NAMELIST statements.
Called routine has dynamic arrays.
CALL to be expanded has alternate return parameters.
Subprogram to be inlined contains a CALL with a subroutine or function name as an argument.
Subprogram contains unsubscripted array references in I/O statements.
Subprogram to be inlined contains POINTER statements.

Chapter 9
Transformations

Compaq KAP Fortran/OpenMP for Tru64 UNIX transformations convert standard Fortran 90 constructs into forms that have better serial performance and which make more efficient use of the memory hierarchy of Compaq Alpha computers.

In addition, KAP identifies opportunities for concurrent execution of loops, identifying them with directives for the Compaq Fortran compiler, and transforms vectorizable loops to enhance the vectorization capability of the Compaq Fortran compiler.

This chapter describes some of the possible transformations in this process:

Memory management for data locality
Loop reordering
Serial optimizations
Dusty-deck IF transformations
Loop unrolling
Loop rerolling

Unless otherwise noted, the examples in this chapter were run with the command switches -optimize=5, -roundoff=2, and -scalaropt=2.

9.1 Memory Management

The following sections provide information about memory management.

9.1.1 Command Switches

The following lists the memory management command switches and how they are used:

-aggressive --- Whether or not to pad COMMON blocks and routine-local memory
-cachesize --- To pick block sizes, that is, the sizes of the sections in the cache, when -aggressive=a, to determine when to pad memory
-cacheline --- To pick block sizes, when -aggressive=a, to determine the number of cache lines
-fpregisters, -dpregisters --- To pick unrolling factors, to make sure registers are not overflowed when unrolling
-setassociativity --- When -aggressive=a, to determine the number of cache lines

9.1.2 Memory Management Tactics

One way KAP enhances performance on cache-memory machines is to use a combination of loop blocking, loop interchanging, and loop unrolling to optimize the cache hit ratios of operands in loops. Memory management in KAP is enabled when -scalaropt=3 and -roundoff=3 have been specified. These are the default settings for these switches.

When arrays are being processed in loops, KAP improves memory access patterns by restructuring the code to work on small sections of the arrays that fit in the cache and thereby giving large cache hit ratios. The sizes of these sections are controlled by using the memory management command switches -cacheline, -cachesize, -fpregisters, -dpregisters, and -setassociativity.

The default settings for these are the standard machine characteristics for which the code is being targeted. These default settings will give the best results for most programs. For some special cases, you may want to modify these settings to adjust the sizes of the array sections that are meant to reside in the cache. The algorithm that KAP uses keeps square blocks of data in the cache.

The following matrix multiplication example shows how loop interchanging, loop blocking, scalar temporary variables, and strip mining can be used. Because these techniques introduce some additional overhead, they are not used with small, simple loops.

DIMENSION A2(N,N),B2(N,N),C2(N,N),D2(N,N) DO I = 1, N DO J = 1, N DO K = 1, N C2(I,J) = C2(I,J) + A2(I,K) * B2(K,J) ENDDO ENDDO ENDDO

Becomes:

DIMENSION A2(N,N), B2(N,N), C2(N,N), D2(N,N) INTEGER II1, II2, II3, II4, II7, II8, II9, II10, II13, II14, II15 X, II16, II19, II20, II21, II22, II23, II24, II25, II26, II27, X II28 REAL RR1, RR2, RR3, RR4, RR5, RR6, RR7, RR8, RR9, RR10, RR11, X RR12, RR13, RR14, RR15, RR16, RR17, RR18, RR19 II3 = 1 II1 = MOD (N - 1, 21) + 1 II2 = II1 II7 = MOD (N - 1, 21) + 1 II13 = MOD (N - 1, 21) + 1 DO 12 II4=1,N,21 II9 = 1 II8 = II7 II27 = II3 + II2 - 3 II28 = II3 + II2 - 1 DO II10=1,N,21 II15 = 1 II14 = II13 II19 = II9 + II8 - 1 II20 = II9 + II8 - 1 II24 = II9 + II8 - 1 II23 = II9 + II8 - 1 DO 10 II16=1,N,21 II21 = II15 + II14 - 3 II22 = II15 + II14 -1 DO 5 J=II3,II27,3

DO 3 I=II15,II21,3 RR1 = C2(I,J) RR2 = C2(I+1,J) RR3 = C2(I,J+1) RR4 = C2(I+1,J+1) RR5 = C2(I,J+2) RR6 = C2(I+1,J+2) RR7 = C2(I+2,J) RR8 = C2(I+2,J+1) RR9 = C2(I+2,J+2) DO 2 K=II9,II19,1 RR17 = A2(I,K) * B2(K,J) RR1 = RR1 + RR17 RR17 = A2(I+1,K) * B2(K,J) RR2 = RR2 + RR17 RR17 = A2(I,K) * B2(K,J+1) RR3 = RR3 + RR17 RR17 = A2(I+1,K) * B2(K,J+1) RR4 = RR4 + RR17 RR17 = A2(I,K) * B2(K,J+2) RR5 = RR5 + RR17 RR17 = A2(I+1,K) * B2(K,J+2) RR6 = RR6 + RR17 RR17 = A2(I+2,K) * B2(K,J) RR7 = RR7 + RR17 RR17 = A2(I+2,K) * B2(K,J+1) RR8 = RR8 + RR17 RR17 = A2(I+2,K) * B2(K,J+2) RR9 = RR9 + RR17

2 CONTINUE C2(I,J) = RR1 C2(I+1,J) = RR2 C2(I,J+1) = RR3 C2(I+1,J+1) = RR4 C2(I,J+2) = RR5 C2(I+1,J+2) = RR6 C2(I+2,J) = RR7 C2(I+2,J+1) = RR8 C2(I+2,J+2) = RR9 3 CONTINUE DO 5 I=I,II22,1 RR10 = C2(I,J) RR11 = C2(I,J+1) RR12 = C2(I,J+2) DO 4 K=II9,II20,1 RR18 = A2(I,K) * B2(K,J) RR10 = RR10 + RR18 RR18 = A2(I,K) * B2(K,J+1) RR11 = RR11 + RR18 RR18 = A2(I,K) * B2(K,J+2) RR12 = RR12 + RR18 4 CONTINUE C2(I,J) = RR10 C2(I,J+1) = RR11 C2(I,J+2) = RR12 5 CONTINUE II25 = II15 + II14 - 3 II26 = II15 + II14 - 1 DO 9 J=J,II28,1 DO 7 I=II15,II25,3 RR13 = C2(I,J) RR14 = C2(I+1,J) RR15 = C2(I+2,J) DO 6 K=II9,II23,1 RR19 = A2(I,K) * B2(K,J) RR13 = RR13 + RR19 RR19 = A2(I+1,K) * B2(K,J) RR14 = RR14 + RR19 RR19 = A2(I+2,K) * B2(K,J) RR15 = RR15 + RR19 6 CONTINUE C2(I,J) = RR13 C2(I+1,J) = RR14 C2(I+2,J) = RR15 7 CONTINUE DO 9 I=I,II26,1 RR16 = C2(I,J) DO 8 K=II9,II24,1 RR16 = RR16 + A2(I,K) * B2(K,J) 8 CONTINUE C2(I,J) = RR16 9 CONTINUE II15 = II15 + II14 II14 = 21 10 CONTINUE II9 = II9 + II8 II8 = 21 11 CONTINUE II3 = II3 + II2 II2 = 21 12 CONTINUE

9.2 Serial Optimizations

The following standard optimizations are performed to enhance the serial code and aid uncovering opportunities for parallel execution:

Dead-Code Elimination
Induction Variable Recognition
Global Forward Substitution
Loop Peeling
Lifetime Analysis
Invariant-IF Restructuring
Reciprocal Substitution

9.2.1 Dead-Code Elimination

KAP can delete unused lines or sections of code from the transformed code. This can speed up a program by eliminating unnecessary operations and by not using valuable cache space for unused instructions or data. The dead-code eliminations are as follows:

Unreachable code is removed.
Empty loops and IF blocks are deleted.
IFs with constant conditions are simplified.
Unused definitions are removed.

Examples of dead-code eliminations are as follows:

Unreachable code removal:

GO TO 10 X = 2 <-- unreachable assignment removed 10 Y = 13

Removal of zero trip and empty DO loops:

DO 20 I = 10, 2 <-- zero trip DO ... 20 CONTINUE

Elimination of resolved conditionals:

PARAMETER( N = 12 ) ... IF (N .GT. 10) THEN X = 1. ELSE X = 2. ENDIF

Is transformed to:

PARAMETER( N = 12 ) ... X = 1.

The true branch will always be taken.

The removal of unprofitable code.
KAP performs lifetime analysis to determine the reaching definitions of variables and removes unused definitions, subject to overriding factors such as variables involved in SAVE, EXTERNAL, and COMMON statements, and so on. The -scalaropt > 2 switch is required.
Y = 5 <-- no subsequent use of Y X = 3 <-- redefinition of X X = 4 PRINT*, X END

The full effect of dead-code elimination is realized when combined with other optimizations such as subprogram inlining and forward substitution, which help in exposing useless or unreachable code.

9.2.2 Induction Variable Recognition

Induction variables are variables (INTEGER or REAL) that are incremented or decremented the same amount for each DO loop trip. The -roundoff=3 switch is required for REAL induction variable transformation.

The following loop has two induction variables, J and K. Transforming them into functions of the loop index removes dependencies between iterations and can simplify other optimizations, as shown in the following example:

DO 10 I = 1,200 A(J) = B(K) J = J + 1 K = K - 2 C(J) = D(I) 10 CONTINUE

Becomes:

II2 = K II1 = J DO 2 I=1,200 A(II1+I-1) = B(II2+I*(-2)+2) C(II1+I) = D(I) 2 CONTINUE J = II1 + 200 K = II2 - 400

9.2.3 Global Forward Substitution

A global forward substitution pass finds relations between variables that do not necessarily depend on the loop index variables, such as the following one:

NP1 = N + 1 DO 10 I = 1,M A(I,N) = A(I-1,NP1) 10 CONTINUE

Becomes:

NP1 = N + 1 DO 2 I=1,M A(I,N) = A(I-1,N+1) 2 CONTINUE

Here, the relation between N and NP1 is known and the assumed data dependence relation is broken.

9.2.4 Loop Peeling

In cases where an array is used to simulate a cylindrical coordinate system where the left edge of the array must be adjacent to its right edge, a method known as a wraparound variable is often used:

JM1 = N DO 10 J = 1,N B(J) = (A(J) + A(JM1)) /2 JM1 = J 10 CONTINUE

In the first iteration, JM1 is N. In all iterations except for J=1, the value of JM1 is J-1. Thus, JM1 is an induction variable for the loop after the first iteration. By peeling off the first iteration of the loop, the JM1 induction variable can be recognized, as follows:

JM1 = N IF (N .GE. 1) THEN RR1 = A(1) + A(N) JM1 = 1 II1 = N - 1 B(1) = RR1 / 2 DO 3 J=2,N B(J) = (A(J) + A(J-1)) / 2 3 CONTINUE IF (II1 .GT. 0) JM1 = II1 + 1 ENDIF

KAP may peel off several iterations in cases where multiple wraparound variables exist.

9.2.5 Lifetime Analysis

In general, when KAP inserts an iteration-local temporary, the last value assigned to that temporary must be reassigned to the original scalar. When -optimize is set to 2 or higher, KAP performs lifetime analysis to determine if the value of the original scalar is used outside the loop. When the command switch -save=a is in effect (the default), live variable analysis will be performed to identify scalar local variables in subroutines and functions that require that their values be retained invocations.

KAP determines if it is possible to reuse the result computed in the loop later in the program unit. If so, the variable is assigned the value it would have received in the serial loop, as shown in the following example:

DO 20 I=1,N DO 20 J=1,N X = A(I,J) Y = B(I,J) 20 C(I,J) = X + Y PRINT *, X

Becomes:

LL1 = N .GT. 0 DO 3 I=1,N DO 2 J=1,N C(I,J) = A(I,J) + B(I,J) 2 CONTINUE IF (LL1) X = A(I,N) 3 CONTINUE PRINT *, X

The roundoff=0 switch was used to disable memory management.

9.2.6 Invariant-IF Restructuring

KAP recognizes loops with invariant-IFs. To avoid repeatedly performing the same IF test, KAP generates two versions of the loop and places the IF test outside. This transformation is enabled only when -scalaropt>2. See the command switches -each_invariant_if_growth and -max_invariant_if_growth for ways to restrict this transformation.

DO 1 I=1,N IF (ADD) THEN X(I) = X(I) + Y(I) ELSE X(I) = X(I) - Y(I) ENDIF 1 CONTINUE 10 CONTINUE

Becomes:

IF (ADD) THEN DO 3 I=1,N X(I) = X(I) + Y(I) 3 CONTINUE ELSE DO 4 I=1,N X(I) = X(I) - Y(I) 4 CONTINUE ENDIF

9.2.7 Reciprocal Substitution

When -roundoff>2, KAP performs reciprocal substitution to move an expensive division outside a loop. In the following example, because the variable R is not changed in the loop, the division can be replaced by multiplication by its reciprocal:

DO 1 I=1,N A(I) = B(I)/R 1 CONTINUE

Becomes:

RR1=1.0/R DO 1 I=1,N A(I) = B(I)*RR1 1 CONTINUE

9.3 Scalar (Dusty-Deck) IF Transformations

IF statements are examined carefully by KAP because they affect many of the transformations performed by KAP. The following sections discuss these transformations:

IF to Block IF
IF to DO Loop
Semantic IF Merging
Zero Trip IF Removal

9.3.1 IF to Block IF

Many logical IF-GOTO combinations can be translated into Fortran 77 block IF statements as follows:

IF ( A.GT.B ) GOTO 10 C = D GOTO 20 10 D = C 20 CONTINUE

Becomes:

IF (A .LE. B) THEN C = D ELSE D = C ENDIF

9.3.2 IF to DO Loop

IF branches that can be translated into iterative DO loops are recognized as follows:

I = 1 11 IF ( I .GT. N ) GOTO 10 A(I) = B(I) + C(I) I = I + 1 GOTO 11 10 CONTINUE

Becomes:

I = 1 II2 = 1 DO 3 II1 = II2,N,1 A(I) = B(I) + C(I) I = I + 1 3 CONTINUE

9.3.3 Semantic IF Merging

Adjacent, complementary, IF statements can occasionally be merged into an IF/THEN/ELSE block as follows:

IF (A(I) .GT. 0) B(I) = A(I) IF (A(I) .LE. 0) B(I) = B(I) - A(I)

Becomes:

IF (A(I) .GT. 0) THEN B(I) = A(I) ELSE B(I) = B(I) - A(I) ENDIF

9.3.4 Zero-Trip IF Removal

DO loops in Fortran 77 are not executed if the initial value of the DO control variable is beyond the final value. These are called zero-trip loops. IF statements can be removed when they are used only to ensure that a DO statement is executed only when the upper bound of the loop is greater than or equal to the lower bound, for example:

IF (N .GE. 1) THEN DO 10 I = 1,N A(I) = A(I-1) + B(I) 10 CONTINUE ENDIF

Becomes:

DO 10 II1=1,N A(II1) = A(II1-1) + B(II1) 10 CONTINUE

9.4 Loop Unrolling

DO loop unrolling is a standard "manual" optimization that creates more statements in a small loop by replication of the original statement. This is done automatically by KAP to speed up some scalar loops. Unrolling a loop involves duplicating the loop body one or more times within the loop, adding an increment (or changing the increment that was already on the DO statement), and possibly inserting code before or after the loop to execute the excess iterations of the loop (the cleanup code).

If the loop bounds are constant and the iteration count of the loop is small, the loop may be entirely deleted and replaced by copies of the loop body. If the loop bounds are constant and the iteration count is a multiple of the unrolling factor, the cleanup code can be omitted. If the loop bounds are constant and the iteration count is evenly divisible by a number close to the unrolling factor given by the -unroll switch, KAP may use that number as the unrolling factor to eliminate the cleanup loop. Only numbers within 25% of the given unrolling factor will be considered.

Inner-loop unrolling is controlled by the -unroll[2] command switches and directive. Outer-loop unrolling is part of memory management and is controlled by -roundoff and -scalaropt.

The following examples can be duplicated with either -unroll=8 and -unroll2=1000 on the command line or the directive !*$* unroll (8,1000).

If the loop bounds are variable, a loop might be unrolled as follows:

DO 10 I = 1,N A(I) = B(I)/A(I-1) 10 CONTINUE

Becomes:

DO 24 I=1,N-7,8 A(I) = B(I) / A(I-1) A(I+1) = B(I+1) / A(I) A(I+2) = B(I+2) / A(I+1) A(I+3) = B(I+3) / A(I+2) A(I+4) = B(I+4) / A(I+3) A(I+5) = B(I+5) / A(I+4) A(I+6) = B(I+6) / A(I+5) A(I+7) = B(I+7) / A(I+6) 24 CONTINUE DO 2 I=I,N,1 A(I) = B(I) / A(I-1) 2 CONTINUE

If loop bounds are constant, the unrolled loop might look like the following example. The unrolling factor has been modified for this loop to avoid inserting cleanup code:

DO 20 I=1,63 A(I) = B(I)/A(I-1) 20 CONTINUE

Becomes:

DO 25 I=1,55,9 A(I) = B(I) / A(I-1) A(I+1) = B(I+1) / A(I) A(I+2) = B(I+2) / A(I+1) A(I+3) = B(I+3) / A(I+2) A(I+4) = B(I+4) / A(I+3) A(I+5) = B(I+5) / A(I+4) A(I+6) = B(I+6) / A(I+5) A(I+7) = B(I+7) / A(I+6) A(I+8) = B(I+8) / A(I+7) 25 CONTINUE

Or, if the loop iteration count is constant and small, the loop may be removed altogether, for example:

DO 30 I=1,5 A(I) = B(I)/A(I-1) 30 CONTINUE

Becomes:

A(1) = B(1) / A(0) A(2) = B(2) / A(1) A(3) = B(3) / A(2) A(4) = B(4) / A(3) A(5) = B(5) / A(4)

Contents

Index

Compaq KAP Fortran/OpenMP for Tru64 UNIXUser Guide

8.5 Conditions Inhibiting Inlining/IPA

Chapter 9Transformations

9.1.2 Memory Management Tactics

9.2.1 Dead-Code Elimination

9.3.1 IF to Block IF

Compaq KAP Fortran/OpenMP
for Tru64 UNIX
User Guide

Chapter 9
Transformations