Previous | Contents | Index |
This section lists conditions that inhibit the inlining of subroutines and functions, whether from a library or source file. Many constructs that prevent inlining will also stop or restrict IPA.
See Section 8.4.6 for other notes on the use of the -inline and -ipa command switches and directives.
Conditions that inhibit inlining:
Compaq KAP Fortran/OpenMP for Tru64 UNIX transformations convert standard Fortran 90 constructs into forms that have better serial performance and which make more efficient use of the memory hierarchy of Compaq Alpha computers.
In addition, KAP identifies opportunities for concurrent execution of loops, identifying them with directives for the Compaq Fortran compiler, and transforms vectorizable loops to enhance the vectorization capability of the Compaq Fortran compiler.
This chapter describes some of the possible transformations in this process:
Unless otherwise noted, the examples in this chapter were run with the
command switches -optimize=5, -roundoff=2, and
-scalaropt=2.
9.1 Memory Management
The following sections provide information about memory management.
9.1.1 Command Switches
The following lists the memory management command switches and how they are used:
One way KAP enhances performance on cache-memory machines is to use a combination of loop blocking, loop interchanging, and loop unrolling to optimize the cache hit ratios of operands in loops. Memory management in KAP is enabled when -scalaropt=3 and -roundoff=3 have been specified. These are the default settings for these switches.
When arrays are being processed in loops, KAP improves memory access patterns by restructuring the code to work on small sections of the arrays that fit in the cache and thereby giving large cache hit ratios. The sizes of these sections are controlled by using the memory management command switches -cacheline, -cachesize, -fpregisters, -dpregisters, and -setassociativity.
The default settings for these are the standard machine characteristics for which the code is being targeted. These default settings will give the best results for most programs. For some special cases, you may want to modify these settings to adjust the sizes of the array sections that are meant to reside in the cache. The algorithm that KAP uses keeps square blocks of data in the cache.
The following matrix multiplication example shows how loop interchanging, loop blocking, scalar temporary variables, and strip mining can be used. Because these techniques introduce some additional overhead, they are not used with small, simple loops.
DIMENSION A2(N,N),B2(N,N),C2(N,N),D2(N,N) DO I = 1, N DO J = 1, N DO K = 1, N C2(I,J) = C2(I,J) + A2(I,K) * B2(K,J) ENDDO ENDDO ENDDO |
Becomes:
DIMENSION A2(N,N), B2(N,N), C2(N,N), D2(N,N) INTEGER II1, II2, II3, II4, II7, II8, II9, II10, II13, II14, II15 X, II16, II19, II20, II21, II22, II23, II24, II25, II26, II27, X II28 REAL RR1, RR2, RR3, RR4, RR5, RR6, RR7, RR8, RR9, RR10, RR11, X RR12, RR13, RR14, RR15, RR16, RR17, RR18, RR19 II3 = 1 II1 = MOD (N - 1, 21) + 1 II2 = II1 II7 = MOD (N - 1, 21) + 1 II13 = MOD (N - 1, 21) + 1 DO 12 II4=1,N,21 II9 = 1 II8 = II7 II27 = II3 + II2 - 3 II28 = II3 + II2 - 1 DO II10=1,N,21 II15 = 1 II14 = II13 II19 = II9 + II8 - 1 II20 = II9 + II8 - 1 II24 = II9 + II8 - 1 II23 = II9 + II8 - 1 DO 10 II16=1,N,21 II21 = II15 + II14 - 3 II22 = II15 + II14 -1 DO 5 J=II3,II27,3 |
DO 3 I=II15,II21,3 RR1 = C2(I,J) RR2 = C2(I+1,J) RR3 = C2(I,J+1) RR4 = C2(I+1,J+1) RR5 = C2(I,J+2) RR6 = C2(I+1,J+2) RR7 = C2(I+2,J) RR8 = C2(I+2,J+1) RR9 = C2(I+2,J+2) DO 2 K=II9,II19,1 RR17 = A2(I,K) * B2(K,J) RR1 = RR1 + RR17 RR17 = A2(I+1,K) * B2(K,J) RR2 = RR2 + RR17 RR17 = A2(I,K) * B2(K,J+1) RR3 = RR3 + RR17 RR17 = A2(I+1,K) * B2(K,J+1) RR4 = RR4 + RR17 RR17 = A2(I,K) * B2(K,J+2) RR5 = RR5 + RR17 RR17 = A2(I+1,K) * B2(K,J+2) RR6 = RR6 + RR17 RR17 = A2(I+2,K) * B2(K,J) RR7 = RR7 + RR17 RR17 = A2(I+2,K) * B2(K,J+1) RR8 = RR8 + RR17 RR17 = A2(I+2,K) * B2(K,J+2) RR9 = RR9 + RR17 |
2 CONTINUE C2(I,J) = RR1 C2(I+1,J) = RR2 C2(I,J+1) = RR3 C2(I+1,J+1) = RR4 C2(I,J+2) = RR5 C2(I+1,J+2) = RR6 C2(I+2,J) = RR7 C2(I+2,J+1) = RR8 C2(I+2,J+2) = RR9 3 CONTINUE DO 5 I=I,II22,1 RR10 = C2(I,J) RR11 = C2(I,J+1) RR12 = C2(I,J+2) DO 4 K=II9,II20,1 RR18 = A2(I,K) * B2(K,J) RR10 = RR10 + RR18 RR18 = A2(I,K) * B2(K,J+1) RR11 = RR11 + RR18 RR18 = A2(I,K) * B2(K,J+2) RR12 = RR12 + RR18 4 CONTINUE C2(I,J) = RR10 C2(I,J+1) = RR11 C2(I,J+2) = RR12 5 CONTINUE II25 = II15 + II14 - 3 II26 = II15 + II14 - 1 DO 9 J=J,II28,1 DO 7 I=II15,II25,3 RR13 = C2(I,J) RR14 = C2(I+1,J) RR15 = C2(I+2,J) DO 6 K=II9,II23,1 RR19 = A2(I,K) * B2(K,J) RR13 = RR13 + RR19 RR19 = A2(I+1,K) * B2(K,J) RR14 = RR14 + RR19 RR19 = A2(I+2,K) * B2(K,J) RR15 = RR15 + RR19 6 CONTINUE C2(I,J) = RR13 C2(I+1,J) = RR14 C2(I+2,J) = RR15 7 CONTINUE DO 9 I=I,II26,1 RR16 = C2(I,J) DO 8 K=II9,II24,1 RR16 = RR16 + A2(I,K) * B2(K,J) 8 CONTINUE C2(I,J) = RR16 9 CONTINUE II15 = II15 + II14 II14 = 21 10 CONTINUE II9 = II9 + II8 II8 = 21 11 CONTINUE II3 = II3 + II2 II2 = 21 12 CONTINUE |
The following standard optimizations are performed to enhance the serial code and aid uncovering opportunities for parallel execution:
KAP can delete unused lines or sections of code from the transformed code. This can speed up a program by eliminating unnecessary operations and by not using valuable cache space for unused instructions or data. The dead-code eliminations are as follows:
Examples of dead-code eliminations are as follows:
GO TO 10 X = 2 <-- unreachable assignment removed 10 Y = 13 |
DO 20 I = 10, 2 <-- zero trip DO ... 20 CONTINUE |
PARAMETER( N = 12 ) ... IF (N .GT. 10) THEN X = 1. ELSE X = 2. ENDIF |
PARAMETER( N = 12 ) ... X = 1. |
Y = 5 <-- no subsequent use of Y X = 3 <-- redefinition of X X = 4 PRINT*, X END |
The full effect of dead-code elimination is realized when combined with
other optimizations such as subprogram inlining and forward
substitution, which help in exposing useless or unreachable code.
9.2.2 Induction Variable Recognition
Induction variables are variables (INTEGER or REAL) that are incremented or decremented the same amount for each DO loop trip. The -roundoff=3 switch is required for REAL induction variable transformation.
The following loop has two induction variables, J and K. Transforming them into functions of the loop index removes dependencies between iterations and can simplify other optimizations, as shown in the following example:
DO 10 I = 1,200 A(J) = B(K) J = J + 1 K = K - 2 C(J) = D(I) 10 CONTINUE |
Becomes:
II2 = K II1 = J DO 2 I=1,200 A(II1+I-1) = B(II2+I*(-2)+2) C(II1+I) = D(I) 2 CONTINUE J = II1 + 200 K = II2 - 400 |
A global forward substitution pass finds relations between variables that do not necessarily depend on the loop index variables, such as the following one:
NP1 = N + 1 DO 10 I = 1,M A(I,N) = A(I-1,NP1) 10 CONTINUE |
Becomes:
NP1 = N + 1 DO 2 I=1,M A(I,N) = A(I-1,N+1) 2 CONTINUE |
Here, the relation between N and NP1 is known and the
assumed data dependence relation is broken.
9.2.4 Loop Peeling
In cases where an array is used to simulate a cylindrical coordinate system where the left edge of the array must be adjacent to its right edge, a method known as a wraparound variable is often used:
JM1 = N DO 10 J = 1,N B(J) = (A(J) + A(JM1)) /2 JM1 = J 10 CONTINUE |
In the first iteration, JM1 is N. In all iterations except for J=1, the value of JM1 is J-1. Thus, JM1 is an induction variable for the loop after the first iteration. By peeling off the first iteration of the loop, the JM1 induction variable can be recognized, as follows:
JM1 = N IF (N .GE. 1) THEN RR1 = A(1) + A(N) JM1 = 1 II1 = N - 1 B(1) = RR1 / 2 DO 3 J=2,N B(J) = (A(J) + A(J-1)) / 2 3 CONTINUE IF (II1 .GT. 0) JM1 = II1 + 1 ENDIF |
KAP may peel off several iterations in cases where multiple wraparound
variables exist.
9.2.5 Lifetime Analysis
In general, when KAP inserts an iteration-local temporary, the last value assigned to that temporary must be reassigned to the original scalar. When -optimize is set to 2 or higher, KAP performs lifetime analysis to determine if the value of the original scalar is used outside the loop. When the command switch -save=a is in effect (the default), live variable analysis will be performed to identify scalar local variables in subroutines and functions that require that their values be retained invocations.
KAP determines if it is possible to reuse the result computed in the loop later in the program unit. If so, the variable is assigned the value it would have received in the serial loop, as shown in the following example:
DO 20 I=1,N DO 20 J=1,N X = A(I,J) Y = B(I,J) 20 C(I,J) = X + Y PRINT *, X |
Becomes:
LL1 = N .GT. 0 DO 3 I=1,N DO 2 J=1,N C(I,J) = A(I,J) + B(I,J) 2 CONTINUE IF (LL1) X = A(I,N) 3 CONTINUE PRINT *, X |
The roundoff=0 switch was used to disable memory management.
9.2.6 Invariant-IF Restructuring
KAP recognizes loops with invariant-IFs. To avoid repeatedly performing the same IF test, KAP generates two versions of the loop and places the IF test outside. This transformation is enabled only when -scalaropt>2. See the command switches -each_invariant_if_growth and -max_invariant_if_growth for ways to restrict this transformation.
DO 1 I=1,N IF (ADD) THEN X(I) = X(I) + Y(I) ELSE X(I) = X(I) - Y(I) ENDIF 1 CONTINUE 10 CONTINUE |
Becomes:
IF (ADD) THEN DO 3 I=1,N X(I) = X(I) + Y(I) 3 CONTINUE ELSE DO 4 I=1,N X(I) = X(I) - Y(I) 4 CONTINUE ENDIF |
When -roundoff>2, KAP performs reciprocal substitution to move an expensive division outside a loop. In the following example, because the variable R is not changed in the loop, the division can be replaced by multiplication by its reciprocal:
DO 1 I=1,N A(I) = B(I)/R 1 CONTINUE |
Becomes:
RR1=1.0/R DO 1 I=1,N A(I) = B(I)*RR1 1 CONTINUE |
IF statements are examined carefully by KAP because they affect many of the transformations performed by KAP. The following sections discuss these transformations:
Many logical IF-GOTO combinations can be translated into Fortran 77 block IF statements as follows:
IF ( A.GT.B ) GOTO 10 C = D GOTO 20 10 D = C 20 CONTINUE |
Becomes:
IF (A .LE. B) THEN C = D ELSE D = C ENDIF |
IF branches that can be translated into iterative DO loops are recognized as follows:
I = 1 11 IF ( I .GT. N ) GOTO 10 A(I) = B(I) + C(I) I = I + 1 GOTO 11 10 CONTINUE |
Becomes:
I = 1 II2 = 1 DO 3 II1 = II2,N,1 A(I) = B(I) + C(I) I = I + 1 3 CONTINUE |
Adjacent, complementary, IF statements can occasionally be merged into an IF/THEN/ELSE block as follows:
IF (A(I) .GT. 0) B(I) = A(I) IF (A(I) .LE. 0) B(I) = B(I) - A(I) |
Becomes:
IF (A(I) .GT. 0) THEN B(I) = A(I) ELSE B(I) = B(I) - A(I) ENDIF |
DO loops in Fortran 77 are not executed if the initial value of the DO control variable is beyond the final value. These are called zero-trip loops. IF statements can be removed when they are used only to ensure that a DO statement is executed only when the upper bound of the loop is greater than or equal to the lower bound, for example:
IF (N .GE. 1) THEN DO 10 I = 1,N A(I) = A(I-1) + B(I) 10 CONTINUE ENDIF |
Becomes:
DO 10 II1=1,N A(II1) = A(II1-1) + B(II1) 10 CONTINUE |
DO loop unrolling is a standard "manual" optimization that creates more statements in a small loop by replication of the original statement. This is done automatically by KAP to speed up some scalar loops. Unrolling a loop involves duplicating the loop body one or more times within the loop, adding an increment (or changing the increment that was already on the DO statement), and possibly inserting code before or after the loop to execute the excess iterations of the loop (the cleanup code).
If the loop bounds are constant and the iteration count of the loop is small, the loop may be entirely deleted and replaced by copies of the loop body. If the loop bounds are constant and the iteration count is a multiple of the unrolling factor, the cleanup code can be omitted. If the loop bounds are constant and the iteration count is evenly divisible by a number close to the unrolling factor given by the -unroll switch, KAP may use that number as the unrolling factor to eliminate the cleanup loop. Only numbers within 25% of the given unrolling factor will be considered.
Inner-loop unrolling is controlled by the -unroll[2] command switches and directive. Outer-loop unrolling is part of memory management and is controlled by -roundoff and -scalaropt.
The following examples can be duplicated with either -unroll=8 and -unroll2=1000 on the command line or the directive !*$* unroll (8,1000).
If the loop bounds are variable, a loop might be unrolled as follows:
DO 10 I = 1,N A(I) = B(I)/A(I-1) 10 CONTINUE |
Becomes:
DO 24 I=1,N-7,8 A(I) = B(I) / A(I-1) A(I+1) = B(I+1) / A(I) A(I+2) = B(I+2) / A(I+1) A(I+3) = B(I+3) / A(I+2) A(I+4) = B(I+4) / A(I+3) A(I+5) = B(I+5) / A(I+4) A(I+6) = B(I+6) / A(I+5) A(I+7) = B(I+7) / A(I+6) 24 CONTINUE DO 2 I=I,N,1 A(I) = B(I) / A(I-1) 2 CONTINUE |
If loop bounds are constant, the unrolled loop might look like the following example. The unrolling factor has been modified for this loop to avoid inserting cleanup code:
DO 20 I=1,63 A(I) = B(I)/A(I-1) 20 CONTINUE |
Becomes:
DO 25 I=1,55,9 A(I) = B(I) / A(I-1) A(I+1) = B(I+1) / A(I) A(I+2) = B(I+2) / A(I+1) A(I+3) = B(I+3) / A(I+2) A(I+4) = B(I+4) / A(I+3) A(I+5) = B(I+5) / A(I+4) A(I+6) = B(I+6) / A(I+5) A(I+7) = B(I+7) / A(I+6) A(I+8) = B(I+8) / A(I+7) 25 CONTINUE |
Or, if the loop iteration count is constant and small, the loop may be removed altogether, for example:
DO 30 I=1,5 A(I) = B(I)/A(I-1) 30 CONTINUE |
Becomes:
A(1) = B(1) / A(0) A(2) = B(2) / A(1) A(3) = B(3) / A(2) A(4) = B(4) / A(3) A(5) = B(5) / A(4) |
Previous | Next | Contents | Index |