Compaq KAP C/OpenMP
for Tru64 UNIX
User Guide

2.10.2 Optimizing Large Programs with KAP

Follow these guidelines to optimize large programs:

1. Compile the program without KAP, with minimum compiler optimization, and with all compiler run-time checks enabled. Note the execution time and verify the results. If the program fails at this step, there is not much optimization you can do.
Some older programs use standard-violating techniques that KAP will not transform safely. If KAP fails because of this problem, there is little optimization you can do.
If you have the time and you know what the program is supposed to do, you can try to isolate the incorrect code, correct it, and proceed. This action is feasible for large programs only if the problems are easily understood and isolated or if you have enough time to find more intractable problems.
If the problem code is isolated and runs without KAP optimization, you may be able to run KAP on the rest of the program and leave out any problematic sections.
You can also refer to Section 2.13. You may be able to diagnose and correct some problems, and then run KAP on your program successfully.
2. Compile without KAP but with maximum compiler optimization, note the execution time, and verify the results. If the program fails, reduce compiler optimization and try again.
3. Take the fastest build that does not use KAP, recompile it with profiling enabled (for example, for gprof), and run it again to identify the program units that take the most time to run.
If some time-intensive units have many iterative loops and arrays, then those units are good candidates for KAP loop optimizations. Go to step 4. If not, then the lower-payoff optimizations, such as inlining, may provide some performance improvement, especially if there are places where inlining inside loop nests may also allow KAP to perform vectorization optimizations. Go to step 6.
4. If time-intensive routines were identified as good candidates above, run KAP on them with modest KAP optimization (-optimize=2), compile the whole program with the other switches used in the best run from step 2, note the execution time, and verify the results.
If the program fails, try again with the KAP switch -roundoff=0; if that works, the failure is probably due to a roundoff-sensitive operation. If it still fails with -roundoff=0, try -scalaropt=1.
5. If step 4 works, repeat with full KAP optimization, with full compiler optimization, and with -roundoff=0 or -scalaropt=1, if needed. If the program fails, reduce the setting to a lower KAP optimization level or a lower compiler optimization level, and try again.
If things are still going well after this step, try the suggestions in Section 2.12.
6. If there are no routines with arrays and loops, run the whole program with -optimize=0 and -inline_and_copy=aaa,bbb,ccc,.., where aaa, bbb, and so on, are the most frequently called routines from the profiling run in step 3.
If this action succeeds, repeat with -optimize=4 and -inline_and_copy=... If this action fails, try rerunning with -roundoff=0 or -scalaropt=1 or with fewer routines inlined. See Section 2.13 for an explanation of "binary chop."
If things are still going well after this step, try the suggestions in Section 2.12.
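
Each step of the procedure above asks you to note the execution time and verify the results. The following is a minimal sketch of a harness for doing that from inside a C program. The routine kernel is a hypothetical stand-in for one of your time-intensive routines, and checksum and time_kernel are illustrative names; none of them is part of KAP. Timing uses the standard C clock() function.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a time-intensive routine: y[i] = a*x[i] + y[i]. */
void kernel(double a, const double *x, double *y, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Sum of all elements; compare this value between builds to verify results. */
double checksum(const double *v, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Runs the kernel once and returns the CPU time it used, in seconds. */
double time_kernel(double a, const double *x, double *y, size_t n)
{
    clock_t start = clock();

    kernel(a, x, y, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Record the value returned by time_kernel and the checksum for each build. When comparing checksums between builds that use different -roundoff settings, allow a small tolerance rather than requiring exact equality.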

2.10.3 General Optimization Tips

Use the -v switch on the kcc command line to view the switches the KAP preprocessor passes to the compiler and the linker.
Use the -ipa switch to cause KAP to give information in the annotated listing about appropriate settings for the -ipall switch on a loop-by-loop basis.
Avoid writing code that accesses an array outside of the array bounds, because this necessitates that you use the -assume=b switch. Setting -assume=b prevents KAP from performing many of its optimizations.
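
As an illustration of the last tip, one common out-of-bounds idiom is walking an entire two-dimensional array through its first row. The sketch below shows that pattern only in a comment and gives an in-bounds equivalent. The names ROWS, COLS, and sum_grid are illustrative.

```c
#define ROWS 3
#define COLS 4

/*
 * Out-of-bounds idiom (avoid): indexes past the end of row 0 to reach
 * the rest of the array.
 *
 *     for (i = 0; i < ROWS * COLS; i++)
 *         sum += grid[0][i];
 */

/* In-bounds equivalent: every subscript stays within its declared range. */
double sum_grid(double grid[ROWS][COLS])
{
    double sum = 0.0;
    int i, j;

    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            sum += grid[i][j];
    return sum;
}
```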

2.11 Improving and Customizing KAP Performance

After you have used the KAP protocol for either small or large programs, you can find ways to fine-tune KAP to fit your application.

This section helps you discover which KAP command-line switches, directives, or assertions can improve KAP performance for a particular application program. It lists common goals and program situations and suggests possible improvements for each.

Remember that KAP is a tool to optimize Fortran and C code. Like any tool, it performs best when you are familiar with the details of how it works and are able to use its switches correctly and advantageously.

Although KAP default switch settings will achieve performance improvement, you can often achieve greater improvement if you understand and use alternate switch settings. Moreover, you can often insert directives or assertions to improve performance further.

See Table 2-1 for details about goals and user actions.

Table 2-1 User Actions for Specific Goals
Goal	User Action
Have a more informative listing to help answer your questions	Use -lo=kl or other listing switches under -listoptions command-line switch.
Recognize more reductions	Increase -roundoff switch setting.
Spend less time optimizing deeply nested loops	Reduce -limit and -arclimit or their directives.
Disable inner FOR loop unrolling	Use -unroll=1 or -scalaropt<2.
Disable outer FOR loop unrolling	Use -roundoff<3 or -scalaropt<3.
Expand (inline) function calls within FOR loops	Use -inline, -inline_from_files, or -inline_from_libraries. Or, if the goal is to execute the function body concurrently, try -ipa or #pragma _KAP concurrent call.
Inline more routines	Increase -inline_depth and -inline_looplevel. (See also the #pragma _KAP inline directive.)
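
The connection in Table 2-1 between reductions and the -roundoff switch comes from the fact that floating-point addition is not associative: treating a sum as a reduction lets the additions be reordered, and reordering can change the rounded result. The sketch below is plain C, not KAP output. The function sum_two_partials only imitates one possible reordering, and both function names are illustrative.

```c
#include <stddef.h>

/* Adds the elements strictly left to right, as the source loop is written. */
double sum_serial(const double *v, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Adds even- and odd-indexed elements separately, then combines them.
   This imitates the kind of reordering a reduction transformation may do. */
double sum_two_partials(const double *v, size_t n)
{
    double even = 0.0, odd = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i % 2 == 0)
            even += v[i];
        else
            odd += v[i];
    }
    return even + odd;
}
```

For the input {1e16, 1.0, -1e16, 1.0}, evaluated in IEEE 754 double precision, sum_serial returns 1.0 and sum_two_partials returns 2.0. A program whose results are sensitive to differences of this kind is a candidate for a lower -roundoff setting.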

2.12 Using Additional Performance Improvement Techniques

After you have successfully run KAP on a working program by using either the protocol for small programs or the protocol for large programs, you can try the following procedures to find additional opportunities for optimization within your program:

If you have successfully run KAP on some routines in a large program, then try running KAP on the whole program with the same switches.
Try lowering the settings on the Invariant-IF switches -eiifg and -miifg. These actions may reduce the total code space enough to make paging or caching the program code work better.
You can try brute-force inlining. Set -inline_and_copy and -inll=2. Inlining is usually more effective if you inline only a few carefully chosen routines rather than inlining everything and cluttering up the code with too much low-payoff inlining. However, the shotgun approach can sometimes produce good results.
Experiment with each of the following switches to determine if they improve the run-time of your program.
However, the above switches may increase the amount of time and memory KAP needs to process your source files.
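
As background for the Invariant-IF switches mentioned in this list, the transformation moves a test whose outcome does not change inside a loop to a point outside the loop, at the cost of duplicating the loop body. The hand-written sketch below shows the idea. It is not KAP output, and the function names are illustrative.

```c
/* Before: the test on 'flag' is repeated on every iteration even though
   its outcome never changes inside the loop. */
void update_original(double *a, const double *b, int n, int flag)
{
    int i;

    for (i = 0; i < n; i++) {
        if (flag)
            a[i] = a[i] + b[i];
        else
            a[i] = a[i] - b[i];
    }
}

/* After: the invariant test is made once, outside the loop. Each branch
   gets its own copy of the loop, so the code is larger but the loop
   bodies are branch-free. */
void update_floated(double *a, const double *b, int n, int flag)
{
    int i;

    if (flag) {
        for (i = 0; i < n; i++)
            a[i] = a[i] + b[i];
    } else {
        for (i = 0; i < n; i++)
            a[i] = a[i] - b[i];
    }
}
```

Because every floated test duplicates a loop, limiting this transformation is one way to trade some loop speed for smaller code.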

2.13 Correcting KAP Problems

The following are some problems you may encounter when using KAP and possible fixes and workarounds:

KAP works best on programs that are CPU-intensive, that spend a great deal of time doing floating-point calculations, and that have large loop bounds.
The two most common reasons KAP is unable to achieve performance improvement in applications code are the following:
- A program with small loop limits or too few loops to work with causes the KAP vectorization setup overhead to outweigh the speedup.
- A program that is I/O bound is not likely to achieve much performance improvement because no amount of improvement to the computation sections will change execution time significantly. However, in the case of a C program, I/O strength reduction can improve I/O performance.
Profiling information may provide clues to either problem. You may need to insert additional print statements to verify loop limits.
If the program is correct but the output is significantly different when KAP is run on the program, try reducing the setting on the -optimize switch.
Nonsensical or nonrepeatable values in the output may be the result of the program violating declared array bounds. Nonsensical or nonrepeatable values in the output can also be the result of unstable algorithms. Try setting -roundoff=0.
If you get incorrect results from a large program with a few routines run through KAP, or a small program run through KAP, or a program with a few routines inlined by KAP, you may be able to determine the source of the problem by means of binary chop.
For example, suppose you have five routines, a, b, c, d, and e. When all five are processed with KAP, the program produces incorrect results or dies. Try running KAP again, but only on routines a and b. If they succeed, then the problem is in c, d, or e. If they fail, try with just routine a and so on. By breaking the list of suspects into approximate halves for each test, you can fairly quickly identify which routine or routines cause the failure. Leave the problematic routines out of future KAP runs.
If you have link errors, ensure that the link step loaded all the libraries needed for all parts of the program.
A link failure may also occur because KAP failed while processing a file, and the routines that came after the point of failure in that file were not copied to the compile file. Determine the reason that KAP failed, and try relinking.
If the compiler issues a syntax error on a transformed program, compare the source code with the transformed code. KAP detects and flags some run-time errors, especially in I/O statements, at compilation time.
Insufficient memory for KAP to run can sometimes be fixed by placing fewer names on the -inline_and_copy switch or by reducing the -eiifg and -miifg settings. To compensate for insufficient memory, you can also break up a source file into smaller logical units and run KAP on the separate units.
If you receive the messages "Preprocessor Failed" or "Translator Error," try lowering switch values, especially -scalaropt.
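
The binary chop described earlier in this section can be sketched as a search over an ordered list of routines. The sketch assumes that exactly one routine causes the failure. The callback type fails_fn is hypothetical: it stands for rebuilding and rerunning the program with KAP applied only to the routines in a given index range. Neither it nor find_bad_routine is part of KAP, and demo_fails exists only to make the example complete.

```c
/* Returns nonzero if the program fails when KAP processes only the
   routines with indexes in [lo, hi). Supplied by the caller (hypothetical). */
typedef int (*fails_fn)(int lo, int hi);

/* Narrows [0, n) to the single routine whose KAP transformation breaks
   the program. Assumes exactly one such routine exists. */
int find_bad_routine(int n, fails_fn fails_with)
{
    int lo = 0, hi = n;

    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;

        if (fails_with(lo, mid))
            hi = mid;   /* culprit is in the first half */
        else
            lo = mid;   /* first half is clean; look in the second */
    }
    return lo;
}

/* Demonstration predicate only: pretends that routine 3 is the culprit. */
int demo_fails(int lo, int hi)
{
    return lo <= 3 && 3 < hi;
}
```

With five routines indexed 0 through 4, find_bad_routine(5, demo_fails) returns 3 after three trial runs. If several routines are at fault, repeat the search after excluding each routine it finds.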

Chapter 3
KAP Parallel Processing

KAP does parallel decomposition of programs so they run on symmetric multiprocessor (SMP) systems. This chapter describes how to compile and run a program for parallel execution using the kcc driver and kapc. Review Chapter 2 for general information on KAP syntax, file naming conventions, and optimizing programs.

3.1 Compaq KAP Parallel Processing

Compaq KAP transforms C source programs so that, when compiled and linked, they execute as multithreaded processes. These threads can run simultaneously (in parallel) on symmetric multiprocessor systems. The result is a program whose start-to-finish time is less than that of a C program that does not execute as a multithreaded process. More specifically, at run time the instructions from FOR loops in a transformed C program execute in parallel mode. Parallelization is the process that transforms FOR loops into instructions in an executable file that execute as multithreaded processes.

Compaq KAP considers all FOR loops in a program as candidates for parallelization. Each loop is or is not parallelized according to:

Parallel processing directives that you have inserted
KAP determination of data dependencies among the loop's iterations
KAP determination of the amount of runtime work (the overhead of parallelization might require too much time); the -minconcurrent switch affects this determination
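
The data-dependence criterion can be illustrated with two loops. In the first, each iteration touches only its own elements, so the iterations may run in any order. In the second, each iteration reads a value written by the previous one, so running iterations simultaneously would change the result. The function names are illustrative, and whether KAP parallelizes a given loop also depends on the other criteria in the list above.

```c
/* Independent iterations: a[i] depends only on b[i], assuming a and b do
   not overlap. A candidate for parallelization. */
void scale(double *a, const double *b, double factor, int n)
{
    int i;

    for (i = 0; i < n; i++)
        a[i] = factor * b[i];
}

/* Loop-carried dependence: a[i] needs the a[i-1] computed by the
   previous iteration, so the iterations must run in order. */
void running_sum(double *a, int n)
{
    int i;

    for (i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```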

This chapter describes the three basic methods of controlling parallel processing (automatic, directed, and combination). It explains, for each method, how to:

Change a source program for parallel processing
Give commands, in the form of command-line switches and values, to Compaq KAP that transform the source programs
Direct the compilation, linking, and execution of the parallel-mode program

3.1.1 Parallel Processing Methods

Compaq KAP provides three methods for programmers to control parallel processing. Their summaries follow:

Note

KAP/C will not perform automatic parallel decomposition or serial optimization on files that contain OpenMP directives.

Automatic Detection --- Use this method for programs that do not contain OpenMP (#pragma omp) directives.
Compaq KAP automatically looks at these programs' FOR loops. If these loops are good candidates for parallelization, then Compaq KAP transforms them so that they will be executed by multiple threads. This is the recommended method for initial experiences with parallelization, because the other methods require detailed knowledge of parallel programming concepts and implementation statements. Also, Compaq KAP sets the compiler and linker switches correctly.
Section 3.2 shows how to direct Compaq KAP to perform automatic parallelization of your program. For example, the following command line applies KAP automatic detection, selection, and transformation of loops to the C source program my_prog.c:
kcc -ckapargs='-concurrent' my_prog.c
The results include a transformed source program and its processing by the compiler and linker to create executable file a.out. The transformed source file will contain OpenMP directives for the loops that Compaq KAP has automatically decided to parallelize when the switch is set. The OpenMP directives are passed on to the compiler for processing.
Directed --- Use this method for programs that contain parallel directives and for which you want only the loops surrounded by parallel directives to be parallelized. These directives explicitly control where and when Compaq KAP performs parallelization inside your program.
Section 3.3 shows how to use KAP to perform directed parallelization of your program. For example, the following command line applies KAP directed detection and transformation of loops to the C source program my_prog.c, which contains OpenMP directives:
kcc -ckapargs='-noconc' my_prog.c \
    -omp -pthread -call_shared
The results include a transformed source program and its processing by the compiler and linker to create executable file a.out.
Combination --- Use this method for programs where you control, with OpenMP directives, the parallelization of selected source files in a program application and you want KAP to perform automatic detection, transformation, and parallelization of the remaining source files. Section 3.4 shows how to direct KAP to perform combined parallelization of your program. A possible command line for the two C source files openmp.c and no_openmp.c is:
kcc -ckapargs='-concurrentize' openmp.c no_openmp.c
where openmp.c contains OpenMP directives and no_openmp.c does not. The results include a non-KAP-transformed source for openmp.c and a KAP-transformed source file for no_openmp.c and its processing by the compiler and linker to create executable file a.out. KAP automatically parallelizes loops by inserting OpenMP directives. OpenMP directives inserted automatically by KAP in no_openmp.c and manually by the programmer within openmp.c are then processed by the compiler. The compiler switch -omp tells the compiler to recognize the OpenMP directives.
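
For the directed method, the source file carries the OpenMP directives itself. A minimal sketch of such a loop follows. The function dot is illustrative; it uses the standard OpenMP parallel for construct with a reduction clause. A compiler that is not told to recognize OpenMP ignores the pragma and runs the loop serially.

```c
/* Dot product with an explicit OpenMP directive. The reduction clause
   gives each thread a private partial sum and combines them at the end. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;

#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```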

When using any of these three methods you must be aware of the values of environment variables, because they affect the run-time behavior of your program.

Environment Variables

- OMP_SCHEDULE (static, dynamic, guided, runtime)
- OMP_DYNAMIC (true, false); the default is false.
- OMP_NESTED (true, false); the default is false.
- OMP_NUM_THREADS (number); the default value is the number of processors on the current system.

For further information on environment variables read by the C compiler see your Compaq C user's guide.
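
In standard OpenMP, OMP_SCHEDULE takes effect only for loops whose schedule is deferred to run time. The following sketch shows such a loop, assuming an OpenMP-aware compiler; fill_squares is an illustrative name.

```c
/* With schedule(runtime), the iteration schedule is taken from the
   OMP_SCHEDULE environment variable when the program runs. */
void fill_squares(double *out, int n)
{
    int i;

#pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        out[i] = (double)i * (double)i;
}
```

Setting OMP_SCHEDULE to dynamic before starting the program, for example, selects dynamic scheduling for this loop without recompiling.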

3.1.2 Parallel Processing Controls --- Summary

KAP provides the following parallel command switches, directives, and assertions for use with automatic parallel processing. Refer to the appropriate sections for explanations and code examples as follows:

Parallel Processing Switches
Parallel Processing Directives and Parallel Processing Assertions, which are used to assist automatic parallelization:
- #pragma _KAP concurrent
- #pragma _KAP concurrent call
- #pragma _KAP concurrent ignore call
- #pragma _KAP serial
OpenMP Pragmas, which describe parallel processing pragmas used for directed parallelization:
- Parallel pragmas
- Worksharing pragmas
- Workqueuing pragmas
- Combined Parallel and Worksharing/Workqueuing pragmas
- Synchronization pragmas
- Privatization of Global Variables pragmas
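
The following sketch shows how an assertion can assist automatic parallelization. A loop that calls a function is hard to analyze automatically, and #pragma _KAP concurrent call asserts that the calls are safe to execute in parallel. Placing the pragma immediately before the loop is an assumption made for this sketch; consult the reference section for the pragma for its exact syntax and scope. The function names are illustrative.

```c
/* Has no side effects: the result depends only on the argument. */
double smooth(double v)
{
    return 0.5 * v + 1.0;
}

void smooth_all(double *a, int n)
{
    int i;

    /* Assertion to KAP (placement assumed): the call to smooth() in the
       loop below may safely run concurrently across iterations. Other
       compilers ignore this pragma. */
#pragma _KAP concurrent call
    for (i = 0; i < n; i++)
        a[i] = smooth(a[i]);
}
```

Make this assertion only for functions that neither modify shared data nor depend on the order of the calls, because KAP takes the assertion on trust.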

Two types of command lines, kcc and kapc, invoke Compaq KAP software:

Invoking the kcc command results in these events:
1. Transformation of the source program using software known as the kapc preprocessor.
2. Compilation of the transformed source program using the C compiler.
3. Linking of the object program from the previous event with runtime libraries.
Compaq recommends using the kcc command to process your files.
Invoking the kapc command only transforms the source program. No compilation and no linking occur.

3.1.3 Parallel Processing Controls --- Interaction

As a programmer, you should always remember that you implement a parallel processing method (automatic, directed, or combination) by making choices from the previous command line options, directives, and assertions. Your choices affect the following actions:

FOR loop detections made during Compaq KAP preprocessing
FOR loop transformations made during Compaq KAP preprocessing
Compiler and linker output after Compaq KAP preprocessing
Runtime behavior of the executable file after preprocessing, compiling, and linking

For example, suppose you choose combination detection and parallelization for source programs openmp.c and no_openmp.c. These programs contain some or none of the parallel processing directives, parallel processing assertions, and OpenMP directives. Consider the following command:

kcc -ckapargs='-concurrent -minconcurrent=1000' \
    openmp.c no_openmp.c

This command tells Compaq KAP to:

Automatically detect (-concurrent) and transform loops with at least 1000 units of work (-minconcurrent=1000)
Respond to parallel processing directives and assertions, such as #pragma _KAP concurrent call.
Give the transformed source program files to the compiler and linker for the creation of executable program file a.out (the default behavior of the kcc command).

Compaq KAP parallel processing options, such as -concurrent, are enclosed in single quotation marks and are values of the -ckapargs option. The kcc driver responds to the options enclosed in these single quotation marks by passing them as arguments to the kapc preprocessor (which actually transforms the source program file).

The default values of the parallel processing options also control Compaq KAP loop detections, loop transformations, calling of the compiler and linker, and runtime scheduling. They are:

-minconcurrent=1000 -scheduling=e -chunk=1

Read the explanations of each of the three methods of parallelization in light of how your choices of options, directives, and assertions affect Compaq KAP detection of loops, changes to loops, compiler and linker behavior, and runtime behavior of executable file a.out.

3.2 Automatic Parallelization Using the kcc Driver

To compile and run your program with parallel processing, use the -concurrentize switch, abbreviated -conc, as follows:

kcc -ckapargs='-conc' myprog.c

For information on running a parallel program, see Section 3.6.

3.2.1 Preprocessing a Program for Parallel Execution Using kapc

To execute KAP as a standalone preprocessor, use the following commands, depending on your version of UNIX:

DIGITAL UNIX Version 3.2:
cc -P -D__KAP -U_INLINE_INTRINSICS myprog.c
kapc -conc -cmp=myprog_mp.c myprog.i
cc -migrate myprog_mp.c -tune host -call_shared -lkio -O4 \
    -lkmp_osf -threads
- -threads---causes KAP to link to POSIX 1003.4a/d4-compliant DECthreads library libpthreads.so.
- -lkmp_osf---causes KAP to link to the parallel processing library lkmp_osf.a.
Tru64 UNIX, and DIGITAL UNIX Versions 4.0 and above:
cc -P -D__KAP -U_INLINE_INTRINSICS myprog.c
kapc -conc -cmp=myprog_mp.c myprog.i
cc -migrate myprog_mp.c -tune host -call_shared -lkio -fast \
    -lkmp_osfp10 -pthread
- -pthread---causes KAP to link to POSIX 1003.1c-compliant DECthreads library libpthread.so.
- -lkmp_osfp10---causes KAP to link to the parallel processing library lkmp_osfp10.a.

An explanation of the remaining switches follows:

The -conc switch causes KAP to restructure the source code for parallel processing.
-cmp causes KAP to save the optimized source program under the file name of your choice. The kapc default is to name the optimized source file_name.cmp. Because the Compaq C compiler will not process a file with the default extension of .cmp, you must override the default by using -cmp to rename the optimized source file_name.cmp.c.
-fast provides a single method for turning on a collection of compiler optimizations on Tru64 UNIX, and DIGITAL UNIX Versions 4.0 and above. (See the cc manpage for a detailed description.)
-lkio tells the linker to use the KAP library, lkio, which contains routines to support KAP optimizations.
-lkmp_osf tells the linker to use the KAP parallel processing library for DIGITAL UNIX Version 3.2. The KAP parallel processing library provides the interface to DECthreads.
-lkmp_osfp10 tells the linker to use the KAP parallel processing library for Tru64 UNIX, and DIGITAL UNIX Versions 4.0 and above. The parallel processing library provides an interface to DECthreads.
-migrate calls the Compaq C compiler. For more information about -migrate, see your Compaq C user's guide.
-call_shared tells the linker to link against shared libraries. Compaq recommends that you use the -call_shared default.
-pthread tells the linker to use threadsafe versions of libraries, where they exist, and to include libpthread when linking the program on Tru64 UNIX, and DIGITAL UNIX Versions 4.0 and above.
-threads tells the linker to use threadsafe versions of libraries if they exist, and to include DIGITAL UNIX Version 3.2 libpthreads when linking the program.
-tune host tells the compiler to optimize for the architecture of the host processor. The Compaq C compiler switch -tune host and the KAP C switch -tune=<architecture> work independently and perform different optimizations. For information about the KAP -tune switch, see Section 4.2.9.
-U_INLINE_INTRINSICS --- stops the compiler from inlining intrinsic functions. KAP currently does not support the inlining of intrinsic functions by the compiler.