Compaq KAP C/OpenMP
for Tru64 UNIX
User Guide



6.5.3 Recursive Inlining Examples

The -inline_depth command switch sets the maximum level of function nesting (calls to functions that themselves call other functions, and so on) that KAP will attempt to trace. Higher values cause KAP to recursively inline more deeply.

When run with -inline_depth=1, meaning that functions are inlined only one level deep, all of the function references written in the main program and in the other functions are expanded, but function references inside the inlined copies are not.

If -inline_depth=-1 were specified, only functions s3 and s4 would be inlined. Repeated runs with -inline_depth=-1 could be performed to trace functions further along the calling tree.
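
As an illustrative sketch (the function names main, s1, and s2 are assumptions, not the example referred to above), the following call chain shows the effect of -inline_depth=1: the call to s1 written in main and the call to s2 written in s1 are each expanded, but the inlined copy of s1 placed in main still contains an unexpanded call to s2.


double s2(double x)        /* leaf function: contains no calls */ 
{ 
    return x * x; 
} 
 
double s1(double x)        /* contains one call, to s2 */ 
{ 
    return s2(x) + 1.0; 
} 
 
int main(void) 
{ 
    double r = s1(2.0);    /* -inline_depth=1: this reference to s1 is expanded, */ 
                           /* but the call to s2 inside the inlined copy of s1   */ 
                           /* is left as a call                                  */ 
    return r > 0.0; 
} 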

6.6 Notes on Inlining and IPA

A function must satisfy all of the criteria (inline, inline_depth, inline_looplevel) before it is inlined. The following directives are exceptions to this rule.

The #pragma _KAP [no]inline and #pragma _KAP [no]ipa directives, when enabled, override the -inline and -ipa command switches, respectively.

A #pragma _KAP inline global directive without a function name list tells KAP to inline every function possible, regardless of the inline, inline_depth, and inline_looplevel settings. A #pragma _KAP noinline global directive tells KAP not to inline anything, regardless of the inline, inline_depth, and inline_looplevel settings.
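
For illustration, a minimal sketch of the global form described above (the function names add1 and sum, and the placement of the directive at the top of the file, are assumptions):


#pragma _KAP inline global      /* request inlining of every call KAP can expand */ 
 
double add1(double x) 
{ 
    return x + 1.0; 
} 
 
double sum(double *a, int n) 
{ 
    double s = 0.0; 
    int i; 
 
    for (i = 0; i < n; i++) 
        s += add1(a[i]);        /* candidate for inlining under the directive above */ 
    return s; 
} 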

When a library is specified with -inline_from_libraries, functions may be taken from that library for inlining into the source code. No attempt is made to inline functions from the source file into functions from the library. For example, if the main program calls function bb, which is in the library, and bb calls function dd, which is in the source file, then bb can be inlined into the main program, but KAP will not attempt to inline dd into the text from library function bb.

A library created with -inline_create works for either inlining or IPA because it is just partially reduced source code. However, a library created with -ipa_create must not appear in an -inline_from_libraries= list; if one does, KAP flags it with a Warning message.

Inlining and IPA are slow, memory-intensive activities. Specifying -inline_depth=10 -inline_looplevel=<big> (that is, inlining all available functions everywhere they are used) on a large source file with many inlinable functions can absorb significant system resources. For most programs, specifying small values for -inline_depth and -inline_looplevel and/or a small number of functions with -inline= provides most of the benefits of inlining. The same applies to the corresponding -ipa command switches.

6.7 Inhibiting Inlining

The following list shows the conditions that inhibit the inlining of functions, whether from a library or source file:

See Section 6.6 for information on the use of the inlining command switches and pragmas.


Chapter 7
Transformations

Transformations applied by KAP convert ordinary C code into forms that have better serial performance and exploit the memory hierarchy of the Tru64 UNIX platform. This chapter describes some of the possible transformations in this process, as well as other optimizations that KAP performs to enhance either parallel or serial execution.

Unless otherwise noted, the examples in this chapter were run on KAP with the switches -optimize=5, -scalaropt=2, and -unroll=1.

7.1 Memory Management

The following sections provide memory management information.

7.1.1 Command Switches

The memory management command switches are -cacheline, -cachesize, -fpregisters, -dpregisters, and -setassociativity. Section 7.1.2 describes how KAP uses them.

7.1.2 Memory Management Techniques

The KAP optimizing preprocessor enhances performance on machines with cache memory by combining memory padding, loop blocking, loop interchanging, and outer loop unrolling to optimize the reuse of operands in the cache. Memory management in KAP is enabled when -scalaropt=3 and -roundoff=3 are set.

When arrays are being processed in loops, KAP improves memory access patterns by working on small sections of the arrays that fit in the cache and give large cache hit ratios. You control the sizes of these sections by using the memory management command switches -cacheline, -cachesize, -fpregisters, -dpregisters, and -setassociativity.

The default settings for these switches correspond to the characteristics of the machine for which the code is being targeted, and they give the best results for most programs. However, for some specialized cases, you may want to modify these settings to adjust the sizes of the array sections that are meant to reside in the cache.
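
As a conceptual illustration of the blocking idea (hand-written code, not KAP output; the array size N, the tile size BS, and the function name are assumptions), the two inner loops below touch only one BS-by-BS tile of each array at a time, so that the tile can stay resident in the cache:


#define N  200 
#define BS 64                   /* assumed tile size; not a KAP default */ 
 
double a[N][N], b[N][N]; 
 
void blocked_copy_transpose(void) 
{ 
    int i, j, ii, jj; 
 
    /* The outer (ii, jj) loops walk over tiles; the inner (i, j) loops 
       work entirely within the current BS-by-BS tile. */ 
    for (ii = 0; ii < N; ii += BS) 
        for (jj = 0; jj < N; jj += BS) 
            for (i = ii; i < ii + BS && i < N; i++) 
                for (j = jj; j < jj + BS && j < N; j++) 
                    a[i][j] = b[j][i]; 
} 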

KAP determines which of two memory management algorithms to use, depending on the Tru64 UNIX settings for -cacheline and -setassociativity. One algorithm, the KAP default, keeps square blocks of data in the cache, while the other keeps long, narrow blocks of data in the cache. The following example shows how loop blocking, loop interchanging, and loop unrolling can be used to improve the performance of this matrix multiplication code. The KAP command switches used were -optimize=3, -scalaropt=3, -roundoff=3, -addressresolution=2, and -arclimit=2. The -arclimit=2 switch was used because the arrays are function arguments and KAP, by default, assumes they might overlap.


double matm( n, a, b, c ) 
int n; 
double a[200][200], b[200][200], c[200][200]; 
{ 
int i,j,k; 
for (i=0;i<n;i++) 
  for (j=0;j<n;j++) 
    { 
    a[i][j]=0.0; 
    for (k=0;k<n;k++) 
      a[i][j]=a[i][j]+b[i][k]*c[k][j]; 
    } 
 
return (a[3][5]); 
} 

The previous code becomes:


double matm( n, a, b, c ) 
int n; 
double  (*a)[200]; 
double  (*b)[200]; 
double  (*c)[200]; 
 
{ 
int i; 
int j; 
int k; 
int _Kii1; 
int _Kii4; 
int _Kii5; 
int _Kii11; 
int _Kii12; 
int _Kii18; 
int _Kii19; 
int _Kii31; 
double _Kdd17; 
double _Kdd18; 
double _Kdd19; 
int _Kii39; 
int _Kii40; 
int _Kii41; 
int _Kii42; 
int _Kii43; 
int _Kii44; 
int _Kii45; 
int _Kii46; 
int _Kii47; 
int _Kii48; 
double _Kdd20; 
int _Kii49; 
double _Kdd21; 
int _Kii50; 
double _Kdd22; 
int _Kii51; 
double _Kdd23; 
int _Kii52; 
int _Kii53; 
double _Kdd24; 
double _Kdd25; 
double _Kdd26; 
int _Kii54; 
double _Kdd27; 
double _Kdd28; 
double _Kdd29; 
int _Kii55; 
double _Kdd30; 
double _Kdd31; 
double _Kdd32; 
int _Kii56; 
double _Kdd33; 
double _Kdd34; 
double _Kdd35; 
int _Kii57; 
int _Kii58; 
int _Kii59; 
int _Kii60; 
int _Kii61; 
 
_Kii18 = n -1; 
_Kii39 = n / 4; 
 
for ( _Kii41= 0; _Kii41<=n - 4; _Kii41+=4 ) { 


for ( _Kii40 = 0; _Kii40<=_Kii18; _Kii40++ ) { 
a[_Kii41][_Kii40] = 0.0; 
a[_Kii41+1][_Kii40] = 0.0; 
a[_Kii41+2][_Kii40] = 0.0; 
a[_Kii41+3][_Kii40] = 0.0; 
  } 
 } 
_Kii1 = _Kii39 * 4; 
_Kii19 = n - 1; 
for ( i = _Kii1; i<=_Kii19; i++ ) { 
for ( j = 0; j<=_Kii19; j++ ) { 
a[i][j] = 0.0; 
    } 
 } 
_Kii4 = n; 
_Kii5 = (_Kii4 - 1)%(15) + 1; 
_Kii11 = n; 
_Kii12 = (_Kii11 - 1)%(15) + 1; 
_Kii31 = n - 1; 
for ( _Kii45 = 0; _Kii45<=_Kii31; _Kii45+=15 ) { 
_Kii61 = ((_Kii31)<(_Kii45 + 14) ? (_Kii31) : (_Kii45 + 14)); 
_Kii60 = 0; 
_Kii59 = _Kii5; 
_Kii46 = _Kii61 - 2; 
for ( _Kii44 = 1; _Kii44<=_Kii4; _Kii44+=15 ) { 
_Kii58 = 0; 
_Kii57 = _Kii12; 
_Kii54 = _Kii60 + _Kii59 - 1; 
_Kii53 = _Kii60 + _Kii59 - 1; 
_Kii49 = _Kii60 + _Kii59 - 1; 
_Kii48 = _Kii60 + _Kii59 - 1; 
for ( _Kii43 = 1; _Kii43<=_Kii11; _Kii43+=15 ) { 
_Kii52 = _Kii58 + _Kii57 - 3; 
for ( _Kii41 = _Kii45; _Kii41<=_Kii46; _Kii41+=3 ) { 
for ( _Kii40 = _Kii58; _Kii40<=_Kii52; _Kii40+=3 ) { 
_Kdd35 = a[_Kii41][_Kii40]; 
_Kdd34 = a[_Kii41+1][_Kii40]; 
_Kdd33 = a[_Kii41+2][_Kii40]; 
_Kii56 = _Kii40 + 1; 
_Kdd32 = a[_Kii41][_Kii56]; 
_Kdd31 = a[_Kii41+1][_Kii56]; 
_Kdd30 = a[_Kii41+2][_Kii56]; 
_Kii55 = _Kii40 + 2; 
_Kdd29 = a[_Kii41][_Kii55]; 
_Kdd28 = a[_Kii41+1][_Kii55]; 
_Kdd27 = a[_Kii41+2][_Kii55]; 
for ( _Kii42 = _Kii60; _Kii42<=_Kii54; _Kii42++ ) { 
_Kdd17 = b[_Kii41][_Kii42] * c[_Kii42][_Kii40]; 
_Kdd35 +=  _Kdd17; 
_Kdd17 = b[_Kii41+1][_Kii42] * c[_Kii42][_Kii40]; 
_Kdd34 +=  _Kdd17; 
_Kdd17 = b[_Kii41+2][_Kii42] * c[_Kii42][_Kii40]; 
_Kdd33 +=  _Kdd17; 
_Kdd17 = b[_Kii41][_Kii42] * c[_Kii42][_Kii56]; 
_Kdd32 +=  _Kdd17; 
_Kdd17 = b[_Kii41+1][_Kii42] * c[_Kii42][_Kii56]; 
_Kdd31 +=  _Kdd17; 
_Kdd17 = b[_Kii41+2][_Kii42] * c[_Kii42][_Kii56]; 
_Kdd30 +=  _Kdd17; 
_Kdd17 = b[_Kii41][_Kii42] * c[_Kii42][_Kii55]; 
_Kdd29 +=  _Kdd17; 
_Kdd17 = b[_Kii41+1][_Kii42] * c[_Kii42][_Kii55]; 
_Kdd28 +=  _Kdd17; 
_Kdd17 = b[_Kii41+2][_Kii42] * c[_Kii42][_Kii55]; 
_Kdd27 +=  _Kdd17; 
              } 
a[_Kii41][_Kii40] = _Kdd35; 
a[_Kii41+1][_Kii40] = _Kdd34; 
a[_Kii41+2][_Kii40] = _Kdd33; 
a[_Kii41][_Kii56] = _Kdd32; 
a[_Kii41+1][_Kii56] = _Kdd31; 
a[_Kii41+2][_Kii56] = _Kdd30; 
a[_Kii41][_Kii55] = _Kdd29; 
a[_Kii41+1][_Kii55] = _Kdd28; 
a[_Kii41+2][_Kii55] = _Kdd27; 
    } 
for ( ; _Kii40<=_Kii58 + _Kii57 - 1; _Kii40++ ) { 
_Kdd26 = a[_Kii41][_Kii40]; 
_Kdd25 = a[_Kii41+1][_Kii40]; 
_Kdd24 = a[_Kii41+2][_Kii40]; 
for ( _Kii42 = _Kii60; _Kii42<=_Kii53; _Kii42++ ) { 
_Kdd18 = b[_Kii41][_Kii42] * c[_Kii42][_Kii40]; 
_Kdd26 +=  _Kdd18; 
_Kdd18 = b[_Kii41+1][_Kii42] * c[_Kii42][_Kii40]; 
_Kdd25 +=  _Kdd18; 
_Kdd18 = b[_Kii41+2][_Kii42] * c[_Kii42][_Kii40]; 
_Kdd24 +=  _Kdd18; 
     } 
a[_Kii41][_Kii40] = _Kdd26; 
a[_Kii41+1][_Kii40] = _Kdd25; 
a[_Kii41+2][_Kii40] = _Kdd24; 


    } 
          } 
_Kii47 = _Kii58 + _Kii57 - 3; 
for ( ; _Kii41<=_Kii61; _Kii41++ ) { 
for ( _Kii40 = _Kii58; _Kii40<=_Kii47; _Kii40+=3 ) { 
_Kdd23 = a[_Kii41][_Kii40]; 
_Kii51 = _Kii40 + 1; 
_Kdd22 = a[_Kii41][_Kii51]; 
_Kii50 = _Kii40 + 2; 
_Kdd21 = a[_Kii41][_Kii50]; 
for ( _Kii42 = _Kii60; _Kii42<=_Kii49; _Kii42++ ) { 
_Kdd19 = b[_Kii41][_Kii42] * c[_Kii42][_Kii40]; 
_Kdd23 +=  _Kdd19; 
_Kdd19 = b[_Kii41][_Kii42] * c[_Kii42][_Kii51]; 
_Kdd22 +=  _Kdd19; 
_Kdd19 = b[_Kii41][_Kii42] * c[_Kii42][_Kii50]; 
_Kdd21 +=  _Kdd19; 
             } 
a[_Kii41][_Kii40] = _Kdd23; 
a[_Kii41][_Kii51] = _Kdd22; 
a[_Kii41][_Kii50] = _Kdd21; 


    } 
for ( ; _Kii40<=_Kii58 + _Kii57 - 1; _Kii40++ ) { 
_Kdd20 = a[_Kii41][_Kii40]; 
for ( _Kii42 = _Kii60; _Kii42<=_Kii48; _Kii42++ ) { 
_Kdd20 +=  b[_Kii41][_Kii42] * c[_Kii42][_Kii40]; 
      } 
a[_Kii41][_Kii40] = _Kdd20; 
            } 
            } 
_Kii58 +=  _Kii57; 
_Kii57 = 15; 
           } 
_Kii60 +=  _Kii59; 
_Kii59 = 15; 
         } 
      } 
return a[3][5]; 
    } 

7.2 Loop Reordering

KAP may find that memory access or other optimizations would be more efficient if an inner loop were moved outside. In these cases, KAP attempts to interchange the loops in the for nest to get more efficient operation, as shown in the following example:


for ( j=0; j<n; j++ ) 
  for ( i=0; i<n; i++ ) 
    a[j][i] = a[j-1][i] + b[j][i]; 

Becomes:


for ( i = 0; i<n; i++ ) { 
  for ( j = 0; j<n; j++ ) { 
     a[j][i] = a[j-1][i] + b[j][i]; 
    } 
} 

See the -machine command switch.

7.3 Scalar Optimizations

The following standard optimizations are performed to enhance serial execution:

7.3.1 Induction Variable Recognition

Induction variables are integer or floating point variables that are incremented or decremented by the same amount on each for-loop iteration. Recognizing their relationship to the loop control variables often lifts data dependence restrictions and permits further optimization, as shown in the following example:


for ( i=0; i<n; i++ ) { 
    a[j] = b[k]; 
    j++; 
    k+=2; 
    } 

Becomes:


_Kii2 = k; 
    _Kii1 = j; 
      for ( i = 0; i<n; i++ ) { 
      a[_Kii1+i] = b[_Kii2+i*2]; 
   } 

7.3.2 Global Forward Substitution

A global forward substitution pass finds relations between variables, such as n and np1, that do not necessarily depend on the loop index variables, as shown in the following example:


np1 = n + 1; 
   for ( i=0; i<m; i++ ) { 
   a[i][n] = a[i-1][np1]; 
   } 

Becomes:


    for ( i = 0; i<m; i++ ) { 
     a[i][n] = a[i-1][n+1]; 
    } 

Here, the relation between n and np1 is known and the assumed data dependence can be broken.

7.3.3 Loop Peeling

In cases where an array is used to simulate a cylindrical coordinate system, where the left edge of the array must be adjacent to its right edge, a "wraparound variable" is often used, as follows:


jm1 = n; 
   for ( j=0; j<n; j++ ) { 
   b[j] = (a[j] + a[jm1]) / 2; 
   jm1 = j; 
 } 
 

In the first iteration, jm1 is n. In all iterations except for j=0, the value of jm1 is j-1. Thus, jm1 is an induction variable for the loop after the first iteration.

By peeling off the first iteration of the loop, the jm1 induction variable can be exploited, as follows:


if (n >= 1) { 
   b[0] = (a[0] + a[n]) / 2; 
   for ( j = 1; j<n; j++ ) { 
     b[j] = (a[j] + a[j-1]) / 2; 
       } 
   } 

KAP may peel off several iterations in cases where multiple wraparound variables exist.
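
For example, a loop of the following hypothetical form (the names are assumptions) uses two wraparound variables; peeling its first two iterations would leave both jm1 and jm2 behaving as induction variables, with jm1 equal to j-1 and jm2 equal to j-2 from the third iteration on:


jm2 = n - 1;                       /* wraparound value used in the first iteration */ 
jm1 = n;                           /* wraparound value used in the first iteration */ 
for ( j = 0; j < n; j++ ) { 
    b[j] = (a[j] + a[jm1] + a[jm2]) / 3; 
    jm2 = jm1;                     /* equals j-2 in the next iteration (once j >= 1) */ 
    jm1 = j;                       /* equals j-1 in the next iteration               */ 
} 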

7.3.4 Lifetime Analysis

In general, when KAP inserts an iteration-local temporary, the last value assigned to that temporary must be reassigned to the original scalar. When -optimize is set to 2 or higher, KAP performs lifetime analysis to determine if the value of the original scalar is used outside the loop.

If the variable is reused within the compilation unit, the last value of the scalar is assigned, as shown in the following example:


for ( i=0; i<n; i++ ) { 
    x = a[i]; 
    y = b[i]; 
    c[i] = x + y; 
    } 
    printf(" x=%f \n",x); 

Becomes:


_Kii1 = n; 
    for ( i = 0; i<n; i++ ) { 
      c[i] = a[i] + b[i]; 
        } 
      if (_Kii1 > 0) 
      x = a[_Kii1-1]; 
      printf( " x=%f \n",x); 

7.3.5 Invariant If Floating

When an if condition is invariant within a loop, KAP can float the if statement out of the loop, as shown in the following example:


for ( i=0; i<n; i++ ) { 
      a[i] = b[i] + c[i]; 
      if ( x > 0.0 ) 
      a[i] *= x; 
   } 

Becomes:


if (x > 0.0) { 
          for ( i = 0; i<n; i++ ) { 
          a[i] = b[i] + c[i]; 
          a[i] *=  x; 
             } 
           } else { 
           for ( i = 0; i<n; i++ ) { 
           a[i] = b[i] + c[i]; 
      } 
 } 

See Chapter 4, the -eiifg and -miifg switches, for information about controlling this transformation.

7.3.6 Derived Assertions

KAP can derive some information about the relative values of scalar integers from the if statements in a program, as shown in the following example:


if ( m > n ) { 
     for ( i=0; i<n; i++ ) { 
     a[i] = a[i+m] + b[i]; 
   } 
} 

The transformations that KAP performs on this loop take into account that it can be executed only when the value of m is greater than n. In particular, because m > n, the stores into a[i] (for 0 <= i < n) and the loads from a[i+m] can never reference the same element, so the assumed dependence between them can be discarded.

7.3.7 Dead-Code Elimination

When the -scalaropt level is 1 or higher, KAP performs dead-code elimination. Each of the following code examples illustrates one of the optimizations performed:

  1. Unreachable code removal, as shown in the following example:


     goto hop; 
       x=2.0; 
       hop: 
        y=13.0; 
    

    Becomes:


       hop:; 
        y = 13.0; 
    

  2. Removal of zero-trip and empty for loops. For example, the following would be deleted completely:


        for (i=10;i<2;i++) 
         x = x+y; 
    

  3. Elimination of resolved conditionals, as shown in the following example:


        #define NNN 12 
 
        if ( NNN > 10 ) 
            x = 1.0; 
        else 
            x = 2.0; 
    

    Becomes:


                 x = 1.0; 
    

    The true branch will always be taken.
    With the -scalaropt switch set to 2 and -optimize set to a value greater than 2, the additional optimization shown in the next example is performed.

  4. Removal of unprofitable code.
    KAP performs lifetime analysis to determine the reaching definitions of variables and removes unused definitions, as shown in the following example:


     y = 5.0;            /* no subsequent use */ 
     x = 3.0;            /* variable redefined */ 
     x = 4.0; 
       printf("%g \n",x); 
    

    Becomes:


     x = 4.0; 
       printf("%g \n",x); 
    

    The full effect of dead-code elimination is realized when combined with other optimizations, such as subprogram inlining and forward substitution, which help in exposing useless or unreachable code.

7.3.8 Loop Fusion

Loop fusion is a conventional compiler optimization that transforms two adjacent for loops into a single for loop. KAP's data-dependence tests allow it to fuse more loops than standard techniques can. The -scalaropt=2 switch or the -optimize=5 switch is required to enable loop fusion.

In the following example, the first two loops can be fused and optimized together. Fusing these loops reduces for-loop overhead and the amount of synchronization required, thereby improving efficiency and speed. KAP recognizes that the third loop must be executed after the first two and does not fuse it with the others:


for ( i=0; i<n; i++ ) { 
    a[i] = b[i] + c[i]; 
    } 
 
for ( i=0; i<n; i++ ) { 
    a[i] = a[i] + d[i]; 
    } 
 
for ( i=0; i<n; i++ ) { 
    d[i] = a[i+1]; 
    } 

Becomes:


for ( i1 = 0; i1<n; i1++ ) { 
    a[i1] = b[i1] + c[i1]; 
    a[i1] +=  d[i1]; 
   } 
for ( i = 0; i<n; i++ ) { 
   d[i] = a[i+1]; 
   } 

7.4 Loop Unrolling

Loop unrolling is a standard optimization, traditionally applied by hand, that creates larger for loops by replicating the original loop body. KAP unrolls loops automatically to speed up some loops by reducing the number of times the loop-control overhead is incurred. Inner loop unrolling is controlled by the -unroll and -unroll2 switches. Outer loop unrolling is part of memory management and is controlled by the -roundoff and -scalaropt switches.

Unrolling a loop involves duplicating the loop body one or more times within the loop, adding a loop increment or changing the increment that was already in the loop, and possibly inserting a cleanup loop to execute any leftover iterations. If the loop bounds are constant and the iteration count of the loop is small, the loop may be deleted entirely and replaced by copies of the loop body.

If the loop bounds are constant, KAP may use an unrolling factor near, but above, the unroll value if that will exactly divide the loop iteration count.

The -scalaropt command switch must be set to at least 2 to enable loop unrolling.

The following examples were run with -unroll=8 and -unroll2=1000. See Chapter 4 for more information about these command switches.

If the loop bounds are unknown at compilation time, a loop may be unrolled, as shown in the following example:


for (i=1; i<n ; i++) 
    a[i] = b[i]/a[i-1] ; 

Becomes:


for ( i = 1; i<=n - 8; i+=8 ) { 
     a[i] = b[i] / a[i-1]; 
     a[i+1] = b[i+1] / a[i]; 
     a[i+2] = b[i+2] / a[i+1]; 
     a[i+3] = b[i+3] / a[i+2]; 
     a[i+4] = b[i+4] / a[i+3]; 
     a[i+5] = b[i+5] / a[i+4]; 
     a[i+6] = b[i+6] / a[i+5]; 
     a[i+7] = b[i+7] / a[i+6]; 
 } 
for ( ; i<n; i++ ) { 
     a[i] = b[i] / a[i-1]; 
 } 

If the loop bounds are constant, the unrolled loop may look like the following example. Notice that KAP deviates slightly from the unroll value (using 9 rather than the requested 8) so that the iteration count is an exact multiple of the unrolling factor, which eliminates the need for a cleanup loop:


for (i=1; i<100; i++) 
    a[i] = b[i]/a[i-1] ; 

Becomes:


for ( i = 1; i<=91; i+=9 ) { 
    a[i] = b[i] / a[i-1]; 
    a[i+1] = b[i+1] / a[i]; 
    a[i+2] = b[i+2] / a[i+1]; 
    a[i+3] = b[i+3] / a[i+2]; 
    a[i+4] = b[i+4] / a[i+3]; 
    a[i+5] = b[i+5] / a[i+4]; 
    a[i+6] = b[i+6] / a[i+5]; 
    a[i+7] = b[i+7] / a[i+6]; 
    a[i+8] = b[i+8] / a[i+7]; 
   } 

Or, if the loop iteration count is constant and small, the loop control may be removed altogether, as shown in the following example:


for (i=1; i<5 ; i++) 
    a[i] = b[i]/a[i-1] ; 

Becomes:


   a[1] = b[1] / a[0]; 
   a[2] = b[2] / a[1]; 
   a[3] = b[3] / a[2]; 
   a[4] = b[4] / a[3]; 

