Compaq KAP Fortran/OpenMP
for Tru64 UNIX
User Guide


Previous Contents Index

8.5 Conditions Inhibiting Inlining/IPA

This section lists conditions that inhibit the inlining of subroutines and functions, whether from a library or source file. Many constructs that prevent inlining will also stop or restrict IPA.

See Section 8.4.6 for other notes on the use of the -inline and -ipa command switches and directives.

Conditions that inhibit inlining:


Chapter 9
Transformations

Compaq KAP Fortran/OpenMP for Tru64 UNIX transformations convert standard Fortran 90 constructs into forms that have better serial performance and which make more efficient use of the memory hierarchy of Compaq Alpha computers.

In addition, KAP identifies opportunities for concurrent execution of loops, identifying them with directives for the Compaq Fortran compiler, and transforms vectorizable loops to enhance the vectorization capability of the Compaq Fortran compiler.

This chapter describes some of the possible transformations in this process:

Unless otherwise noted, the examples in this chapter were run with the command switches -optimize=5, -roundoff=2, and -scalaropt=2.

9.1 Memory Management

The following sections provide information about memory management.

9.1.1 Command Switches

The following lists the memory management command switches and how they are used:

9.1.2 Memory Management Tactics

One way KAP enhances performance on cache-memory machines is to use a combination of loop blocking, loop interchanging, and loop unrolling to optimize the cache hit ratios of operands in loops. Memory management in KAP is enabled when -scalaropt=3 and -roundoff=3 have been specified. These are the default settings for these switches.

When arrays are being processed in loops, KAP improves memory access patterns by restructuring the code to work on small sections of the arrays that fit in the cache and thereby giving large cache hit ratios. The sizes of these sections are controlled by using the memory management command switches -cacheline, -cachesize, -fpregisters, -dpregisters, and -setassociativity.

The default settings for these are the standard machine characteristics for which the code is being targeted. These default settings will give the best results for most programs. For some special cases, you may want to modify these settings to adjust the sizes of the array sections that are meant to reside in the cache. The algorithm that KAP uses keeps square blocks of data in the cache.

The following matrix multiplication example shows how loop interchanging, loop blocking, scalar temporary variables, and strip mining can be used. Because these techniques introduce some additional overhead, they are not used with small, simple loops.


DIMENSION A2(N,N),B2(N,N),C2(N,N),D2(N,N) 
 
  DO I = 1, N 
    DO J = 1, N 
      DO K = 1, N 
             C2(I,J) = C2(I,J) + A2(I,K) * B2(K,J) 
      ENDDO 
    ENDDO 
 ENDDO 

Becomes:


DIMENSION A2(N,N), B2(N,N), C2(N,N), D2(N,N) 
INTEGER II1, II2, II3, II4, II7, II8, II9, II10, II13, II14, II15 
X, II16, II19, II20, II21, II22, II23, II24, II25, II26, II27, 
X    II28 
REAL RR1, RR2, RR3, RR4, RR5, RR6, RR7, RR8, RR9, RR10, RR11, 
X    RR12, RR13, RR14, RR15, RR16, RR17, RR18, RR19 
  II3 = 1 
  II1 = MOD (N - 1, 21) + 1 
  II2 = II1 
  II7 = MOD (N - 1, 21) + 1 
  II13 = MOD (N - 1, 21) + 1 
    DO 12 II4=1,N,21 
     II9 = 1 
     II8 = II7 
     II27 = II3 + II2 - 3 
     II28 = II3 + II2 - 1 
     DO  II10=1,N,21 
         II15 = 1 
         II14 = II13 
          II19 = II9 + II8 - 1 
          II20 = II9 + II8 - 1 
          II24 = II9 + II8 - 1 
          II23 = II9 + II8 - 1 
          DO 10 II16=1,N,21 
          II21 = II15 + II14 - 3 
          II22 = II15 + II14 -1 
                   DO 5 J=II3,II27,3 


              DO 3 I=II15,II21,3 
                RR1 = C2(I,J) 
                RR2 = C2(I+1,J) 
                RR3 = C2(I,J+1) 
                RR4 = C2(I+1,J+1) 
                RR5 = C2(I,J+2) 
                RR6 = C2(I+1,J+2) 
                RR7 = C2(I+2,J) 
                RR8 = C2(I+2,J+1) 
                RR9 = C2(I+2,J+2) 
                    DO 2 K=II9,II19,1 
                    RR17 = A2(I,K) * B2(K,J) 
                    RR1 = RR1 + RR17 
                   RR17 = A2(I+1,K) * B2(K,J) 
                   RR2 = RR2 + RR17 
                   RR17 = A2(I,K) * B2(K,J+1) 
                   RR3 = RR3 + RR17 
                   RR17 = A2(I+1,K) * B2(K,J+1) 
                   RR4 = RR4 + RR17 
                   RR17 = A2(I,K) * B2(K,J+2) 
                   RR5 = RR5 + RR17 
                    RR17 = A2(I+1,K) * B2(K,J+2) 
                   RR6 = RR6 + RR17 
                   RR17 = A2(I+2,K) * B2(K,J) 
                   RR7 = RR7 + RR17 
                  RR17 = A2(I+2,K) * B2(K,J+1) 
                  RR8 = RR8 + RR17 
                  RR17 = A2(I+2,K) * B2(K,J+2) 
                  RR9 = RR9 + RR17 


         2   CONTINUE 
               C2(I,J) = RR1 
               C2(I+1,J) = RR2 
                C2(I,J+1) = RR3 
                 C2(I+1,J+1) = RR4 
                 C2(I,J+2) = RR5 
                 C2(I+1,J+2) = RR6 
                 C2(I+2,J) = RR7 
                 C2(I+2,J+1) = RR8 
                 C2(I+2,J+2) = RR9 
             3   CONTINUE 
                  DO 5 I=I,II22,1 
                    RR10 = C2(I,J) 
                    RR11 = C2(I,J+1) 
                    RR12 = C2(I,J+2) 
                     DO 4 K=II9,II20,1 
                     RR18 = A2(I,K) * B2(K,J) 
                     RR10 = RR10 + RR18 
                     RR18 = A2(I,K) * B2(K,J+1) 
                     RR11 = RR11 + RR18 
                     RR18 = A2(I,K) * B2(K,J+2) 
                    RR12 = RR12 + RR18 
             4   CONTINUE 
                   C2(I,J) = RR10 
                     C2(I,J+1) = RR11 
                  C2(I,J+2) = RR12 
              5  CONTINUE 
                  II25 = II15 + II14 - 3 
                  II26 = II15 + II14 - 1 
                    DO 9 J=J,II28,1 
                        DO 7 I=II15,II25,3 
                     RR13 = C2(I,J) 
                    RR14 = C2(I+1,J) 
                   RR15 = C2(I+2,J) 
                     DO 6 K=II9,II23,1 
                      RR19 = A2(I,K) * B2(K,J) 
                      RR13 = RR13 + RR19 
                      RR19 = A2(I+1,K) * B2(K,J) 
                      RR14 = RR14 + RR19 
                       RR19 = A2(I+2,K) * B2(K,J) 
                     RR15 = RR15 + RR19 
              6  CONTINUE 
                    C2(I,J) = RR13 
                    C2(I+1,J) = RR14 
                    C2(I+2,J) = RR15 
                 7 CONTINUE 
 
                     DO 9 I=I,II26,1 
                     RR16 = C2(I,J) 
                      DO 8 K=II9,II24,1 
                       RR16 = RR16 + A2(I,K) * B2(K,J) 
            8  CONTINUE 
                C2(I,J) = RR16 
                  9  CONTINUE 
                    II15 = II15 + II14 
                    II14 = 21 
            10  CONTINUE 
                II9 = II9 + II8 
                II8 = 21 
               11  CONTINUE 
                  II3 = II3 + II2 
                    II2 = 21 
        12  CONTINUE 

9.2 Serial Optimizations

The following standard optimizations are performed to enhance the serial code and aid uncovering opportunities for parallel execution:

9.2.1 Dead-Code Elimination

KAP can delete unused lines or sections of code from the transformed code. This can speed up a program by eliminating unnecessary operations and by not using valuable cache space for unused instructions or data. The dead-code eliminations are as follows:

Examples of dead-code eliminations are as follows:

  1. Unreachable code removal:


    GO TO 10 
    X = 2    <-- unreachable assignment removed 
    10    Y = 13 
    

  2. Removal of zero trip and empty DO loops:


    DO 20 I = 10, 2 <-- zero trip DO 
    ... 
    20    CONTINUE 
    

  3. Elimination of resolved conditionals:


    PARAMETER( N = 12 ) 
    ... 
    IF (N .GT. 10) THEN 
    X = 1. 
    ELSE 
    X = 2. 
    ENDIF 
    

    Is transformed to:


    PARAMETER( N = 12 ) 
    ... 
    X = 1. 
    

    The true branch will always be taken.

  4. The removal of unprofitable code.
    KAP performs lifetime analysis to determine the reaching definitions of variables and removes unused definitions, subject to overriding factors such as variables involved in SAVE, EXTERNAL, and COMMON statements, and so on. The -scalaropt > 2 switch is required.


    Y = 5  <-- no subsequent use of Y 
    X = 3  <-- redefinition of X 
    X = 4 
    PRINT*, X 
    END 
    

The full effect of dead-code elimination is realized when combined with other optimizations such as subprogram inlining and forward substitution, which help in exposing useless or unreachable code.

9.2.2 Induction Variable Recognition

Induction variables are variables (INTEGER or REAL) that are incremented or decremented the same amount for each DO loop trip. The -roundoff=3 switch is required for REAL induction variable transformation.

The following loop has two induction variables, J and K. Transforming them into functions of the loop index removes dependencies between iterations and can simplify other optimizations, as shown in the following example:


DO 10 I = 1,200 
A(J) = B(K) 
J = J + 1 
K = K - 2 
C(J) = D(I) 
10    CONTINUE 

Becomes:


II2 = K 
II1 = J 
DO 2 I=1,200 
A(II1+I-1) = B(II2+I*(-2)+2) 
C(II1+I) = D(I) 
2  CONTINUE 
J = II1 + 200 
K = II2 - 400 

9.2.3 Global Forward Substitution

A global forward substitution pass finds relations between variables that do not necessarily depend on the loop index variables, such as the following one:


NP1 = N + 1 
DO 10 I = 1,M 
A(I,N) = A(I-1,NP1) 
10    CONTINUE 

Becomes:


NP1 = N + 1 
DO 2 I=1,M 
A(I,N) = A(I-1,N+1) 
2   CONTINUE 

Here, the relation between N and NP1 is known and the assumed data dependence relation is broken.

9.2.4 Loop Peeling

In cases where an array is used to simulate a cylindrical coordinate system where the left edge of the array must be adjacent to its right edge, a method known as a wraparound variable is often used:


JM1 = N 
DO 10 J = 1,N 
B(J) = (A(J) + A(JM1)) /2 
JM1 = J 
10    CONTINUE 

In the first iteration, JM1 is N. In all iterations except for J=1, the value of JM1 is J-1. Thus, JM1 is an induction variable for the loop after the first iteration. By peeling off the first iteration of the loop, the JM1 induction variable can be recognized, as follows:


JM1 = N 
IF (N .GE. 1) THEN 
RR1 = A(1) + A(N) 
JM1 = 1 
II1 = N - 1 
B(1) = RR1 / 2 
DO 3 J=2,N 
B(J) = (A(J) + A(J-1)) / 2 
3   CONTINUE 
IF (II1 .GT. 0) JM1 = II1 + 1 
ENDIF 

KAP may peel off several iterations in cases where multiple wraparound variables exist.

9.2.5 Lifetime Analysis

In general, when KAP inserts an iteration-local temporary, the last value assigned to that temporary must be reassigned to the original scalar. When -optimize is set to 2 or higher, KAP performs lifetime analysis to determine if the value of the original scalar is used outside the loop. When the command switch -save=a is in effect (the default), live variable analysis will be performed to identify scalar local variables in subroutines and functions that require that their values be retained invocations.

KAP determines if it is possible to reuse the result computed in the loop later in the program unit. If so, the variable is assigned the value it would have received in the serial loop, as shown in the following example:


DO 20 I=1,N 
DO 20 J=1,N 
X = A(I,J) 
Y = B(I,J) 
20  C(I,J) = X + Y 
PRINT *, X 

Becomes:


LL1 = N .GT. 0 
DO 3 I=1,N 
DO 2 J=1,N 
C(I,J) = A(I,J) + B(I,J) 
2   CONTINUE 
IF (LL1) X = A(I,N) 
3  CONTINUE 
PRINT *, X 

The roundoff=0 switch was used to disable memory management.

9.2.6 Invariant-IF Restructuring

KAP recognizes loops with invariant-IFs. To avoid repeatedly performing the same IF test, KAP generates two versions of the loop and places the IF test outside. This transformation is enabled only when -scalaropt>2. See the command switches -each_invariant_if_growth and -max_invariant_if_growth for ways to restrict this transformation.


 
DO 1 I=1,N 
IF (ADD) THEN 
X(I) = X(I) + Y(I) 
ELSE 
X(I) = X(I) - Y(I) 
ENDIF 
1  CONTINUE 
10 CONTINUE 

Becomes:


IF (ADD) THEN 
DO 3 I=1,N 
X(I) = X(I) + Y(I) 
3   CONTINUE 
ELSE 
DO 4 I=1,N 
X(I) = X(I) - Y(I) 
4    CONTINUE 
ENDIF 

9.2.7 Reciprocal Substitution

When -roundoff>2, KAP performs reciprocal substitution to move an expensive division outside a loop. In the following example, because the variable R is not changed in the loop, the division can be replaced by multiplication by its reciprocal:


DO 1 I=1,N 
A(I) = B(I)/R 
1    CONTINUE 

Becomes:


RR1=1.0/R 
DO 1 I=1,N 
A(I) = B(I)*RR1 
1    CONTINUE 

9.3 Scalar (Dusty-Deck) IF Transformations

IF statements are examined carefully by KAP because they affect many of the transformations performed by KAP. The following sections discuss these transformations:

9.3.1 IF to Block IF

Many logical IF-GOTO combinations can be translated into Fortran 77 block IF statements as follows:


IF ( A.GT.B ) GOTO 10 
C = D 
GOTO 20 
10     D = C 
20     CONTINUE 

Becomes:


IF (A .LE. B) THEN 
C = D 
ELSE 
D = C 
ENDIF 

9.3.2 IF to DO Loop

IF branches that can be translated into iterative DO loops are recognized as follows:


I = 1 
11    IF ( I .GT. N ) GOTO 10 
A(I) = B(I) + C(I) 
I = I + 1 
GOTO 11 
10    CONTINUE 

Becomes:


I = 1 
II2 = 1 
DO 3 II1 = II2,N,1 
A(I) = B(I) + C(I) 
              I = I + 1 
3  CONTINUE 

9.3.3 Semantic IF Merging

Adjacent, complementary, IF statements can occasionally be merged into an IF/THEN/ELSE block as follows:


IF (A(I) .GT. 0) B(I) = A(I) 
IF (A(I) .LE. 0) B(I) = B(I) - A(I) 

Becomes:


IF (A(I) .GT. 0) THEN 
B(I) = A(I) 
ELSE 
B(I) = B(I) - A(I) 
ENDIF 

9.3.4 Zero-Trip IF Removal

DO loops in Fortran 77 are not executed if the initial value of the DO control variable is beyond the final value. These are called zero-trip loops. IF statements can be removed when they are used only to ensure that a DO statement is executed only when the upper bound of the loop is greater than or equal to the lower bound, for example:


IF (N .GE. 1) THEN 
DO 10 I = 1,N 
A(I) = A(I-1) + B(I) 
10   CONTINUE 
ENDIF 

Becomes:


DO 10 II1=1,N 
A(II1) = A(II1-1) + B(II1) 
10    CONTINUE 

9.4 Loop Unrolling

DO loop unrolling is a standard "manual" optimization that creates more statements in a small loop by replication of the original statement. This is done automatically by KAP to speed up some scalar loops. Unrolling a loop involves duplicating the loop body one or more times within the loop, adding an increment (or changing the increment that was already on the DO statement), and possibly inserting code before or after the loop to execute the excess iterations of the loop (the cleanup code).

If the loop bounds are constant and the iteration count of the loop is small, the loop may be entirely deleted and replaced by copies of the loop body. If the loop bounds are constant and the iteration count is a multiple of the unrolling factor, the cleanup code can be omitted. If the loop bounds are constant and the iteration count is evenly divisible by a number close to the unrolling factor given by the -unroll switch, KAP may use that number as the unrolling factor to eliminate the cleanup loop. Only numbers within 25% of the given unrolling factor will be considered.

Inner-loop unrolling is controlled by the -unroll[2] command switches and directive. Outer-loop unrolling is part of memory management and is controlled by -roundoff and -scalaropt.

The following examples can be duplicated with either -unroll=8 and -unroll2=1000 on the command line or the directive !*$* unroll (8,1000).

If the loop bounds are variable, a loop might be unrolled as follows:


DO 10 I = 1,N 
A(I) = B(I)/A(I-1) 
10  CONTINUE 

Becomes:


DO 24 I=1,N-7,8 
A(I) = B(I) / A(I-1) 
A(I+1) = B(I+1) / A(I) 
A(I+2) = B(I+2) / A(I+1) 
A(I+3) = B(I+3) / A(I+2) 
A(I+4) = B(I+4) / A(I+3) 
A(I+5) = B(I+5) / A(I+4) 
A(I+6) = B(I+6) / A(I+5) 
A(I+7) = B(I+7) / A(I+6) 
24  CONTINUE 
DO 2 I=I,N,1 
A(I) = B(I) / A(I-1) 
2  CONTINUE 

If loop bounds are constant, the unrolled loop might look like the following example. The unrolling factor has been modified for this loop to avoid inserting cleanup code:


DO 20 I=1,63 
A(I) = B(I)/A(I-1) 
20  CONTINUE 

Becomes:


DO 25 I=1,55,9 
A(I) = B(I) / A(I-1) 
A(I+1) = B(I+1) / A(I) 
A(I+2) = B(I+2) / A(I+1) 
A(I+3) = B(I+3) / A(I+2) 
A(I+4) = B(I+4) / A(I+3) 
A(I+5) = B(I+5) / A(I+4) 
A(I+6) = B(I+6) / A(I+5) 
A(I+7) = B(I+7) / A(I+6) 
A(I+8) = B(I+8) / A(I+7) 
25  CONTINUE 

Or, if the loop iteration count is constant and small, the loop may be removed altogether, for example:


DO 30 I=1,5 
A(I) = B(I)/A(I-1) 
30  CONTINUE 

Becomes:


A(1) = B(1) / A(0) 
A(2) = B(2) / A(1) 
A(3) = B(3) / A(2) 
A(4) = B(4) / A(3) 
A(5) = B(5) / A(4) 


Previous Next Contents Index