Compaq KAP C/OpenMP
for Tru64 UNIX
User Guide

7.5 Loop Rerolling

Many codes have loops that were unrolled manually over several iterations to amortize the cost of the test and branch at each iteration of the for loop. Before KAP can examine them, they must be rerolled to a simpler form, as shown in the following code examples:

for (i=0; i<n; i+=2) { a[i] = b[i]+c[i]; a[i+1] = b[i+1]+c[i+1]; }

KAP recognizes the following code as an unrolled loop and rerolls it before applying loop-based optimizations:

for ( i = 0; i<=(n + 1) / 2 * 2 - 1; i++ ) { a[i] = b[i] + c[i]; }

Unrolled summations are also recognized, as follows:

sum = 0.0; for ( i=0; i<n; i+=4 ) { sum += b[i]; sum += b[i+1]; sum += b[i+2]; sum += b[i+3]; }

Becomes:

sum = (float )(0.0); for ( i = 0; i<=(n + 3) / 4 * 4 - 1; i++ ) { sum += b[i]; }

Chapter 8
KAP Listing

This chapter describes the types of information available in the optional KAP listing. To determine KAP status and effectiveness, and to help you locate trouble spots where assertions or special transformations could help, KAP can list the optimizations it performed on each loop in the program. For example, in some cases KAP may communicate to the user that any of three loops could have been concurrentized, but KAP concurrentized only the one it considered most profitable. In other cases, a concurrentizable loop is left serial, because it will execute faster in that form. Because KAP optimizations can produce correct but unexpected code, KAP puts a note in the loop table explaining the action.

KAP does not produce a listing file unless requested. See Chapter 4 for a description of the -list command switch.

The following section describes the listing information that you can specify with the -listoptions command switch. Then, an example of KAP output for syntax errors is presented. The last section of the chapter explains the descriptive messages in the loop table section of the listing.

See the -cmpoptions switch for the optional information available in the transformed code file.

8.1 Listoptions

The -listoptions command switch tells KAP what optional information to include in the listing and error files. The listing file can contain any combination of the following messages about the optimizations performed, indicated by these values:

C --- Calling tree of the entire program
K --- KAP switches used presented at end of listing
L --- Loop table
N --- Print function names in error file
P --- Compilation performance statistics
S --- Summary table

The format of these listings follows. The source code used for the examples in this section was the transformed code from the -inline=setup example in Section 6.5.1. Information pertaining to just one function is for the first function of the source file. Default switches were used, except for -list, -cmp, and -listoptions.

8.1.1 Calling Tree (C)

The calling tree is listed after all functions have been compiled. Each function's calling tree information consists of the functions it calls, the loop nest level calls within that function, the total loop nest depth of the calls in that function (including the nesting level of calls to the unit itself), and the routines where that function itself is called.

When the calling tree is selected, a table of the variables used in each function is printed after that function is processed.

For example, the following table appears after the main function:

CROSS REFERENCE TABLE Name Type Class Storage ------------------------------------------------ signgam s.INT Var Static size_t s.INT Var Automatic wchar_t s.INT Var Automatic div_t Struc Var Automatic ldiv_t Struc Var Automatic optarg Holl Array Static optind s.INT Var Static optopt s.INT Var Static opterr s.INT Var Static ptrdiff_t s.INT Var Automatic . . . _Kii1 s.INT Var Automatic _Kii2 s.INT Var Automatic _Kii3 s.INT Var Automatic _Kii4 s.INT Var Automatic _Kii5 s.INT Var Automatic _Kii6 s.INT Var Automatic _Kii7 s.INT Var Automatic _Kii8 s.INT Var Automatic _Kii9 s.INT Var Automatic _Kii10 s.INT Var Automatic . . . Abbreviations used in Source Program References A = used as actual argument D = Declared or Defined M = Contents may get modified U = Its value is used Calling Tree line# routines at nest max. aggregate nest 1629 program main Calling Tree main Code Modules main called from

8.1.2 KAP Switches (K)

The KAP switches table lists the settings of the command switches related to optimization. When the values are changed inside the function with directives, the values from the directives will not appear in this table.

Some of these switches cannot be set and are fixed for this version of KAP. See Chapter 4 for the list of user command switches.

The following example shows the switch settings KAP used to optimize the source code. In this example, the -tune switch appears as the -architecture switch.

KAP/Tru64_U_C 4.0 k010710 990730 main Transformed 21-Aug-1999 15:16:02 Switches Used for this Program Unit addressresolution=1 limit=50 architecture=EV4 arclimit=5000 lines=55 list=matmulc.out cacheline=32,32 listingwidth=132 cache_prefetch_line_count=0 listoptions=kl cachesize=8 chunk=1 cmp=matmulc.cmp machine=s no cmpoptions miifg=500 no concurrentize minconcurrent=1000 no namepartitioning natural dpregisters=32 optimize=5 eiifg=20 processors=1 fpregisters=32 no fuse recursion fuselevel=0 heaplimit=100 roundoff=3 hli=1 scalaroptimize=3 scheduling=e no inline setassociativity=1 no inline_and_copy signed no inline_create no skip inline_depth=2 no stdio no inline_from_files no suppress no inline_from_libraries no syntax inline_looplevel=2 no inline_manual inline_optimize=0 input=matmulc.i unroll=4 interchange unroll2=160 no ipa unroll3=1 no ipa_create ipa_depth=10 no ipa_from_files no ipa_from_libraries ipa_looplevel=2 no ipa_manual ipa_optimize=0

8.1.3 Loop Table (L)

The loop table is printed after each program unit (function) and shows what KAP did with each for loop in the function. If the loop could not be optimized, a reason is given. The loop table is the primary listing tool for understanding what KAP did with the program. Loop table messages that refer to DO loops apply to C FOR loops.

The line messages are summaries of KAP actions for each loop (both loops in the original source code and loops that KAP generated). The loop to which each refers can be identified from the main loop table. The abbreviations used are explained after the main table.

KAP/Tru64_U_C 4.0 k010710 990730 main Loop Table 21-Aug-1999 15:16:02 -------------------------------- Loop Table -------------------------------- Nest Loop Message Level Contains Lines ============================================================================== for j 1 1637-1644 "matmul.i" 1. Enhanced Scalar 2 1637, 1640, 1644 "matmul.i" for i 2 1639-1642 "matmul.i" 1. Enhanced Scalar 1 1637, 1639-1640, 1642, 1644 "matmul.i" Line:1639 NV Not an inner loop. for j 1 1646-1653 "matmul.i" 1. Enhanced Scalar 2 1646, 1649, 1653 "matmul.i" for i 2 1648-1651 "matmul.i" 1. Enhanced Scalar 1 1646, 1648-1649, 1651, 1653 "matmul.i" Line:1648 NV Not an inner loop. for j 1 1658-1670 "matmul.i" Line:1658 NO Loop was asserted serial by directive. Line:1658 I DO loop was inserted here. Line:1658 SO Strip loop for strip mining with block size 24. Line:1658 SO Block loop for strip mining with block size 24. Line:1658 SO Loop unrolled 3 times to improve scalar performance. Line:1658 SO Cleanup-loop for loop unrolling added. Line:1660 NV Not an inner loop. Line:1660 I DO loop was inserted here. Line:1660 SO Strip loop for strip mining with block size 24. . . . Abbreviations Used NO not optimized DD data dependence SO scalar optimization I inserted INF informational

8.1.4 Name (N)

The function names, as processed, are printed in the error file. The first line gives the name of the source file, as shown in the following example:

KAP/Tru64_U_C 4.0 k010710 990730 21-Aug-1999 15:16:02 matmul.i: main: 0 errors in file matmul.i

8.1.5 Compilation Performance Statistics (P)

The compilation performance statistics list the number of lines in the function, the compilation time in seconds, and the compilation rate in lines per minute. This information is also summarized after all functions have been compiled.

KAP/Tru64_U_C 4.0 k010710 990730 main Compilation Statistics 21-Aug-1999 15:16:02 Compilation Statistics For the Routine main 1670 Lines in Program Unit 1.48 CPU Time 67702 Lines Per Minute 0 Symbol Cache File Writes 0 Symbol Cache File Reads 0 Name Table File Writes 0 Name Table File Reads Cumulative Compilation Statistics 1670 Lines in Source File << includes definitions and blank lines 1 Program Units in Source File 1.48 CPU Time 67702 Lines Per Minute 0 Symbol Cache File Writes << for functions with many symbols 0 Symbol Cache File Reads 0 Name Table File Writes << for functions with many symbols 0 Name Table File Reads 575 Lines in Compile File

8.1.6 Summary Table (S)

The summary table shows how many loops appeared in the function and how many loops were optimized in different ways:

KAP/Tru64_U_C 4.0 k010710 990730 main Optimization Summary 25-Aug-1999 15:18:03 7 loops total 3 loops vectorized 2 with scalar directive 2 with inner loop

8.2 Syntax Error/Warning Messages

KAP tries to match the syntax error and warning messages of the compiler it runs with. A file that will cause the compiler to issue a syntax error will cause KAP to issue a syntax error.

When a syntax error is found, KAP stops reading the input file after the current function definition has been read. The function with the syntax error will not be sent to the output file, so only code without syntax errors is put into the KAP transformed code file.

When illegal syntax or any other error is found, KAP writes a message to the error file on Tru64 UNIX systems, as shown in the following example:

Error: line 3: file 3_crlib.c: KAP-I-ERR_STMT_MISSIN, missing

KAP also may write syntax warning messages to the error file, but optimization proceeds. Syntax warning messages are issued for Tru64 UNIX constructs that are illegal, but whose intent is clear.

8.3 Loop Table Messages

The Loop Table (-listoptions=l) includes an entry for each loop indicating whether it was optimized, or why it was not. This section lists the possible messages and gives a brief explanation for each. The two most common reasons for a loop to be left serial are that the iterations were not independent and that the loop contained I/O statements.

Messsages referring to DO loops apply to C for loops:

bad intrinsic call --- Loop optimization may give unpredictable results.
bad operator --- Loop optimization may give unpredictable results.
branches out of-into loop --- Code may be iteration-order dependent.
cannot allocate temporary array --- Usually, a data dependence could have been lifted by expanding a scalar into an array, but there was a name conflict.
concurrentized --- Should not appear. If it does, indicates miscellaneous optimization.
distributed --- Loop divided into smaller loops, some of which may be further optimized.
DO statement not optimizable --- For example, the iteration count of a while loop is not known.
ELSE IF --- Some loops inside nested IFs require more information to optimize than KAP can deduce.
empty loop --- Loop with no executable statements inside (nothing to optimize), or loop whose body will never be executed.
fused with other loops --- Adjacent loops joined into a larger loop to reduce overhead.
GOTO in loop --- Possible flow patterns are too complicated for KAP.
I/O statement in loop --- Input/Output is assumed to be order-dependent.
left as DO loop --- Usually, a data dependence prevented concurrentization.
left as DOALL --- Loop was already explicitly concurrent in the input dialect of C.
left as a serial loop --- Usually, too much synchronization required.
nested IFs --- The branching pattern was too complicated.
no optimizable statements --- For example, no arithmetic in loop.
not enough work --- Optimization overhead greater than the expected gain.
optimization disabled --- By a directive or assertion.
replaced by calls --- Some loop nests can be replaced by a call to a library routine.
scalar directive --- A directive or assertion turned off optimization.
scalar inner loop --- Inside a parallel loop already.
scalar mode preferable --- Usually, too much synchronization needed.
too many inner loops --- LIMIT > integer VERY complicated loop nest; KAP will try to optimize the inner loops. Try increasing -limit.
too many stmts/DD arcs --- VERY complicated loop nest; try increasing -arclimit.
too much synchronization --- Loop iterations depend on each other in a complicated way; the overhead from untangling them would be greater than expected gain.
unoptimizable call --- Function with unknown side effects.
unoptimizable operation --- Character string manipulation.
unoptimizable statements in loop --- Usually I/O.
unoptimizable symbol use --- Variable used in a complicated way.
unoptimizable variable or operation type --- Only numerical work is easily concurrentized.
unrolled --- Loop could not be concurrentized, so KAP unrolled it (see explanations of unrolling in Chapter 4 and Chapter 7).
unrolled completely, fused or removed --- Loop has been turned into straight-line code, joined to another loop, or dropped entirely.
unrolled; already in a parallel loop --- Loop could be concurrentized, but target machine does not support nested parallelism; KAP unrolled it.
vectorized --- Can be used for optimizations that enhance the vectorization performed by the C compiler.

Appendix A
Data-Dependence Analysis

This appendix describes data-dependence analysis.

A.1 Data-Dependence Definitions

This section provides a brief explanation of data dependence --- the criterion that KAP uses to determine if a given loop can be optimized. KAP determines dependences between variables and arrays in loop iterations and bases decisions on this information automatically, informing you in the listing file only of those dependences that prevent optimization.

The study of data-dependence analysis was originally motivated by parallel (vector and concurrent) computation. The techniques and notations are applicable to advanced serial transformations for uniprocessor computers.

KAP uses a data-dependence graph that shows where data values are generated and where they are used within a loop or loop nest. The data-dependence graph is processed with standard graph traversal techniques to find potential problem areas, which appear as cycles in the graph. Each data-dependence cycle is carefully examined to see if it can be broken and the loop executed in parallel, or if part or all of the loop must be executed serially.

For an in-depth study of data dependence, consult the following:

Michael Wolfe, "Optimizing Supercompilers for Supercomputers," Research Monographs in Parallel and Distributed Processing, MIT Press, Cambridge, Massachusetts, 1989.
Michael Wolfe and Utpal Banerjee, "Data Dependence and Its Application to Parallel Processing," International Journal of Parallel Programming, Vol. 16, No. 2, April, 1988.
David Padua and Michael Wolfe, "Advanced Compiler Optimization for Supercomputers," Communications of the ACM,Vol. 29, No. 12, December, 1986.

A.2 Varieties of Data Dependence

The following describes three kinds of data dependence. The notation S1, S2, and SN is used to denote statement 1, statement 2, and statement N, respectively. In each example, S2 is dependent on S1, due to variable x.

Data dependence from an assignment to a use of a variable is called flow dependence or true dependence.
(S1): x = 3; (S2): y = x;
Data dependence from use of a variable to a later reassignment of that variable is called antidependence.
(S1): y = x; (S2): x = 3;
Data dependence from an assignment of a variable to a later reassignment of that variable is called output dependence.
(S1): x = 3; : (SN): x = 4;

A.3 Input and Output Sets

To determine data dependence, it is necessary to form the input and output sets for the given statements.

IN(S1) denotes the set of input items of S1 (items whose values may be read by S1). OUT(S1) denotes the set of output items (scalar variables or array elements) of statement S1 (items whose values may be changed by S1).

The following example shows the IN and OUT sets for the assignment statement in the loop:

for (i=1; i<=10; i++) S1: x[i] = a[i+1] * b; IN[S1] = {a[2], a[3], a[4], ..., a[11], b} OUT[S1] = {x[1], x[2], x[3], ..., x[10]}

In practice, KAP often approximates the IN and OUT sets because the actual loop bounds are frequently unknown at compile time.

A.4 Data-Dependence Relations

For any two statements, S1 and S2, one of the three types of data-dependence relations may be true, or the statements may be data independent.

If some item X is in OUT(S1) and X is in IN(S2) and S2 is to use the value of X computed in S1, then S2 is flow dependent on S1, as in example 1 in Section A.2.

If some item X is in IN(S1) and X is in OUT(S2), but S1 is to use the value of X before it is changed by S2, then S2 is antidependent on S1, as in example 2 in Section A.2.

If some item X is in OUT(S1) and X is in OUT(S2) and the value computed by S2 is to be stored after the value computed by S1 is stored, S2 is output dependent on S1, as in example 3 in Section A.2.

Antidependence and output dependence relations are sometimes inadvertently caused by programmers' coding practices. These dependences can often be removed by more careful coding.

A.5 Data-Dependence Direction Vectors

The direction vector is defined as a sequence of direction vector elements (one element for each loop enclosing both statements involved in the dependence arc). The following symbols are direction vector elements: <, =, >, <, >, /, or *. Refer to the following loop:

Loops Line +--------- 10 for ( i=1; i<n ; i++) { | 11 a[i] = b[i] + c[i]; | 12 c[i] = a[i-1] - 1; |_________ 13 }

If line 12 in iteration I" depends on line 11 in iteration I', the element of the direction vector for loop I is as follows:

direction vector element when < I' must be < I" = I' must be = I" > I' must be > I" <= I' must be < or = I" >= I' must be > or = I" / I' must not = I" * no relation between I' and I" can be proven

In the previous example, the dependence for variable a has a direction vector of <, because the dependence flows from iteration I' to iteration I'+1 and I' < I'+1. For example, the dependence on a[1] flows from iteration 1 to iteration 2, and 1 < 2. The data dependence for the variable c has a direction vector of = because the dependence stays in the same iteration of the loop (from iteration I' to iteration I').

Contents

Index

Compaq KAP C/OpenMP for Tru64 UNIXUser Guide