Compaq KAP C/OpenMP for Tru64 UNIX User Guide
2.10.2 Optimizing Large Programs with KAP
Follow these guidelines to optimize large programs:
1. Compile the program without KAP, with minimum compiler
optimization, and with all compiler run-time checks enabled. Note the
execution time and verify the results. If the program fails at this
step, there is not much optimization you can do.
Some older programs use standard-violating techniques that KAP will not
transform safely. If KAP fails because of this problem, there is little
optimization you can do.
If you have the time and you know what the program is supposed to do,
you can try to isolate the incorrect code, correct it, and proceed. This
action is feasible for large programs only if the problems are easily
understood and isolated or if you have enough time to find more
intractable problems.
If the problem code is isolated and runs without KAP optimization, you
may be able to run KAP on the rest of the program and leave out any
problematic sections.
You can also refer to Section 2.13. You may be able to diagnose and
correct some problems, and then run KAP on your program successfully.
2. Compile without KAP but with maximum compiler optimization, note
the execution time, and verify the results. If the program fails,
reduce compiler optimization and try again.
3. Take the fastest correct build made without KAP and run it again with
profiling enabled (for example, with gprof) to identify the program
units that take the most time to run.
If some time-intensive units have many iterative loops and arrays, then
those units are good candidates for KAP loop optimizations. Go to step 4.
If not, then the lower-payoff optimizations, such as inlining, may
provide some performance improvement, especially where inlining inside
loop nests also allows KAP to perform vectorization optimizations. Go to
step 6.
4. If time-intensive routines were identified as good candidates
above, run KAP on them with modest KAP optimization
(-optimize=2), compile the whole program with the other
switches used in the best run from step 2, note the execution time, and
verify the results.
If the program fails, try again with the KAP switch -roundoff=0; if that
works, the failure is probably due to a roundoff-sensitive operation. If
it still fails with -roundoff=0, try -scalaropt=1.
5. If step 4 works, repeat with full KAP optimization, with full
compiler optimization, and with -roundoff=0 or
-scalaropt=1, if needed. If the program fails, reduce the
setting to a lower KAP optimization level or a lower compiler
optimization level, and try again.
If things are still going well after this step, try the suggestions in
Section 2.12.
6. If there are no routines with arrays and loops, run the whole
program with -optimize=0 and
-inline_and_copy=aaa,bbb,ccc,.., where aaa, bbb, and so on,
are the most frequently called routines from the profiling run in step
3.
If this action succeeds, repeat with -optimize=4 and
-inline_and_copy=... If this action fails, try rerunning with
-roundoff=0 or -scalaropt=1 or with fewer routines
inlined. See Section 2.13 for an explanation of "binary
chop."
If things are still going well after this step, try the suggestions in
Section 2.12.
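As a rough illustration of what the profiling step is looking for, the following C sketch shows a routine dominated by a loop nest over arrays, with a small helper called inside the inner loop. The names scale_add and scale_rows are hypothetical and are not taken from KAP or its documentation. Whether KAP would transform or inline code like this depends on its own analysis and on the switches in effect.

```c
#include <assert.h>
#include <stddef.h>

/* Small helper called from the inner loop: the kind of routine that a
 * profile might show as frequently called (hypothetical example). */
double scale_add(double a, double x, double y)
{
    return a * x + y;
}

/* Loop- and array-heavy routine: y = a*x + y over a rows-by-cols grid
 * stored row by row in one-dimensional arrays. */
void scale_rows(size_t rows, size_t cols, double a,
                const double *x, double *y)
{
    size_t i, j;

    assert(x != NULL && y != NULL);
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            y[i * cols + j] = scale_add(a, x[i * cols + j], y[i * cols + j]);
}
```

A profile that shows most of the run time in a routine shaped like scale_rows suggests the loop-optimization path; a profile dominated by many calls to small helpers such as scale_add suggests the inlining path.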
2.10.3 General Optimization Tips
- Use the -v switch on the kcc command line to view
the switches the KAP preprocessor passes to the compiler and the linker.
- Use the -ipa switch to cause KAP to give information in
the annotated listing about appropriate settings for the
-ipall switch on a loop-by-loop basis.
- Avoid writing code that accesses an array outside of its declared
bounds, because such code requires the -assume=b
switch. Setting -assume=b prevents KAP from performing many of
its optimizations.
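The array-bounds tip can be made concrete with a small C sketch. One common way to step outside declared bounds is to treat a two-dimensional array as one long row; that pattern appears below only as a comment, and the function shows the in-bounds alternative. ROWS, COLS, and clear_grid are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4
#define COLS 8

/* Pattern to avoid (shown only as a comment): indexing one row past its
 * declared extent in order to sweep the whole grid.
 *
 *     for (j = 0; j < ROWS * COLS; j++)
 *         a[0][j] = 0.0;     -- j runs beyond COLS - 1
 */

/* In-bounds version: every subscript stays inside its declared range. */
void clear_grid(double a[ROWS][COLS])
{
    int i, j;

    assert(a != NULL);
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            a[i][j] = 0.0;
}
```

Code written like clear_grid keeps each subscript within its declared range, so there is no need to relax the bounds assumption.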
2.11 Improving and Customizing KAP Performance
After you have used the KAP protocol for either small or large
programs, you can find ways to fine-tune KAP to fit your application.
This section helps you discover which KAP command-line switches,
directives, or assertions can be used to try to improve KAP performance
for a particular application program. The list that follows pairs
common goals and program situations with suggestions for possible
improvements.
Remember that KAP is a tool to optimize Fortran and C code. Like any
tool, it performs best when you are familiar with the details of how it
works and are able to use its switches correctly and advantageously.
Although KAP default switch settings will achieve performance
improvement, you can often achieve greater improvement if you
understand and use alternate switch settings. Moreover, you can often
insert directives or assertions to improve performance further.
See Table 2-1 for details about goals and user actions.
Table 2-1 User Actions for Specific Goals
Goal | User Action
Have a more informative listing to help answer your questions | Use -lo=kl or other listing switches under the -listoptions command-line switch.
Recognize more reductions | Increase the -roundoff switch setting.
Spend less time optimizing deeply nested loops | Reduce -limit and -arclimit or their directives.
Disable inner FOR loop unrolling | Use -unroll=1 or -scalaropt<2.
Disable outer FOR loop unrolling | Use -roundoff<3 or -scalaropt<3.
Expand (inline) function calls within FOR loops | Use -inline, -inline_from_files, or -inline_from_libraries. Or, if the goal is to execute the function body concurrently, try -ipa or #pragma _KAP concurrent call.
Inline more routines | Increase -inline_depth and -inline_looplevel. (See also the #pragma _KAP inline directive.)
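The link between reductions and the -roundoff setting reflects a general property of floating-point arithmetic: adding the same values in a different order can give a different result. The C sketch below shows this by hand. It is not KAP output, and sum_forward and sum_two_partials are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum strictly left to right, as the source loop is written. */
double sum_forward(const double *x, size_t n)
{
    double s = 0.0;
    size_t i;

    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Sum with two interleaved partial sums, the way a reordered reduction
 * might accumulate even-indexed and odd-indexed elements separately. */
double sum_two_partials(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i;

    assert(x != NULL || n == 0);
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n)          /* leftover element when n is odd */
        s0 += x[i];
    return s0 + s1;
}
```

For the input {1e20, 1.0, -1e20, 1.0} in IEEE double precision, sum_forward returns 1.0 because the first 1.0 is absorbed by 1e20, while sum_two_partials returns 2.0. A program whose results must match a particular summation order is therefore a candidate for a lower roundoff setting.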
2.12 Using Additional Performance Improvement Techniques
After you have successfully run KAP on a working program by using
either the protocol for small programs or the protocol for large
programs, you can try the following procedures to find additional
opportunities for optimization within your program:
- If you have successfully run KAP on some routines in a large
program, then try running KAP on the whole program with the same
switches.
- Try lowering the settings on the Invariant-IF switches
-eiifg and -miifg. These actions may reduce the total
code space enough to make paging or caching the program code work
better.
- You can try brute-force inlining. Set -inline_and_copy and
-inll=2. Inlining is usually more effective if you inline only
a few carefully chosen routines rather than inlining everything and
cluttering up the code with too much low-payoff inlining. However, the
shotgun approach can sometimes produce good results.
- Experiment with each of the following switches to determine if they
improve the run-time of your program.
However, the above switches may increase the amount of time and
memory KAP needs to process your source files.
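The code-space concern behind the Invariant-IF item in the list above is easier to see with a hand-written example of moving a loop-invariant test out of a loop. This is a generic sketch of that kind of restructuring, not KAP's actual output, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: the test on `scale` has the same outcome on every
 * iteration, yet it is evaluated inside the loop. */
void update_original(double *y, const double *x, int n, int scale, double a)
{
    int i;

    assert(x != NULL && y != NULL);
    for (i = 0; i < n; i++) {
        if (scale)
            y[i] = a * x[i];
        else
            y[i] = x[i];
    }
}

/* Restructured form: the test is made once, at the cost of writing the
 * loop twice. More loops restructured this way means more code. */
void update_hoisted(double *y, const double *x, int n, int scale, double a)
{
    int i;

    assert(x != NULL && y != NULL);
    if (scale) {
        for (i = 0; i < n; i++)
            y[i] = a * x[i];
    } else {
        for (i = 0; i < n; i++)
            y[i] = x[i];
    }
}
```

Both functions compute the same values. The second trades a per-iteration branch for a duplicated loop body, which is why limiting this restructuring can reduce total code size.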
2.13 Correcting KAP Problems
The following are some problems you may encounter when using KAP and
possible fixes and workarounds:
- KAP works best on programs that are CPU-intensive, that spend a
great deal of time doing floating-point calculations, and that have
large loop bounds.
The two most common reasons KAP is unable to
achieve performance improvement in application code are the following:
- A program with small loop limits or too few loops to work with
causes the KAP vectorization setup overhead to outweigh the speedup.
- A program that is I/O bound is not likely to achieve much
performance improvement because no amount of improvement to the
computation sections will change execution time significantly. However,
in the case of a C program, I/O strength reduction can improve I/O
performance.
Profiling information may provide clues to either problem.
You may need to insert additional print statements to verify loop
limits.
- If the program is correct but the output is significantly different
when KAP is run on the program, try reducing the setting on the
-optimize switch.
- Nonsensical or nonrepeatable values in the output may be the result
of the program violating declared array bounds. Nonsensical or
nonrepeatable values in the output can also be the result of unstable
algorithms. Try setting -roundoff=0.
- If you get incorrect results from a large program with a few
routines run through KAP, or a small program run through KAP, or a
program with a few routines inlined by KAP, you may be able to
determine the source of the problem by means of binary chop.
For
example, suppose you have five routines, a, b, c, d, and e. When all
five are processed with KAP, the program produces incorrect results or
dies. Try running KAP again, but only on routines a and b. If they
succeed, then the problem is in c, d, or e. If they fail, try with just
routine a and so on. By breaking the list of suspects into approximate
halves for each test, you can fairly quickly identify which routine or
routines cause the failure. Leave the problematic routines out of
future KAP runs.
- If you have link errors, ensure that the link step loaded all the
libraries needed for all parts of the program.
A link failure may
also occur because KAP failed while processing a file, and the routines
that came after the point of failure in that file were not copied to
the compile file. Determine the reason that KAP failed, and try
relinking.
- If the compiler issues a syntax error on a transformed program,
compare the source code with the transformed code. KAP detects and
flags some run-time errors, especially in I/O statements, at
compilation time.
- Insufficient memory for KAP to run can sometimes be fixed by
placing fewer names on the -inline_and_copy switch or by
reducing the -eiifg and -miifg settings. To
compensate for insufficient memory, you can also break up a source file
into smaller logical units and run KAP on the separate units.
- If you receive the messages "Preprocessor Failed" or
"Translator Error," try lowering switch values, especially
-scalaropt.
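The binary chop procedure described in this section can be sketched in C as a search over routine indices. The sketch assumes that exactly one routine causes the failure. The callback type fails_fn is hypothetical and stands for "apply KAP only to routines lo through hi-1, rebuild, run, and report whether the results are wrong". The demo_fails function merely simulates that check so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check: nonzero if the program fails when KAP is applied
 * only to routines with indices lo .. hi-1. */
typedef int (*fails_fn)(int lo, int hi);

/* Return the index of the failing routine among 0 .. n-1, assuming
 * exactly one routine is responsible for the failure. */
int binary_chop(int n, fails_fn fails)
{
    int lo = 0, hi = n;

    assert(n > 0 && fails != NULL);
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;

        if (fails(lo, mid))
            hi = mid;   /* the culprit is in the first half */
        else
            lo = mid;   /* the first half is clean; look in the second */
    }
    return lo;
}

/* Stand-in for a real build-and-run test, for demonstration only. */
int demo_culprit = 3;

int demo_fails(int lo, int hi)
{
    return lo <= demo_culprit && demo_culprit < hi;
}
```

With the five routines a through e numbered 0 through 4, binary_chop(5, demo_fails) returns 3, which corresponds to routine d. When several routines fail independently, the search has to be repeated after the first culprit is excluded.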
Chapter 3
KAP Parallel Processing
KAP does parallel decomposition of programs so they run on symmetric
multiprocessor (SMP) systems. This chapter describes how to compile and
run a program for parallel execution using the kcc driver and
kapc. Review Chapter 2 for general information on KAP
syntax, file naming conventions, and optimizing programs.
3.1 Compaq KAP Parallel Processing
Compaq KAP transforms C source programs so that, when compiled and
linked, they execute as multithreaded processes. These threads can run
simultaneously --- that is, in parallel --- on symmetric multiprocessor
systems. The result is a program whose start-to-finish time is less
than that of a C program that does not execute as a multithreaded process. More
specifically, at run time the instructions from FOR loops in a
transformed C program execute in parallel mode. Parallelization is the
process that transforms FOR loops into instructions in an executable
file that execute as multithreaded processes.
Compaq KAP considers all FOR loops in a program as candidates for
parallelization. Whether a given loop is parallelized depends on:
- Parallel processing directives that you have inserted
- KAP determination of data dependencies among the loop's iterations
- KAP determination of the amount of runtime work (the overhead of
parallelization might require too much time); the
-minconcurrent switch affects this determination
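The data dependence criterion in the list above can be illustrated with two small loops. This is a generic C sketch with hypothetical function names. Whether KAP parallelizes a particular loop depends on its own analysis, including what it can determine about pointer aliasing, and on the switch settings.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: each a[i] is computed only from b[i] and c[i],
 * so the iterations could run in any order (assuming a, b, and c do not
 * overlap in memory). */
void add_arrays(double *a, const double *b, const double *c, int n)
{
    int i;

    assert(a != NULL && b != NULL && c != NULL);
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: iteration i reads a[i-1], which iteration
 * i-1 has just written, so the iterations must run in order. */
void running_total(double *a, int n)
{
    int i;

    assert(a != NULL);
    for (i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

The first loop is the kind that dependence analysis can clear for parallel execution; the second cannot be split across threads as written.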
This chapter describes the three basic methods of controlling parallel
processing (automatic, directed, and combination). It explains, for
each method, how to:
- Change a source program for parallel processing
- Give commands, in the form of command-line switches and values, to
Compaq KAP that transform the source programs
- Direct the compilation, linking, and execution of the parallel-mode
program
3.1.1 Parallel Processing Methods
Compaq KAP provides three methods for programmers to control parallel
processing. Their summaries follow:
Note
KAP/C will not perform automatic parallel decomposition or serial
optimization on files that contain OpenMP directives.
- Automatic Detection --- Use this method for programs
that do not contain OpenMP (#pragma omp) directives.
Compaq KAP automatically examines the FOR loops in these programs. If
the loops are good candidates for parallelization, Compaq KAP transforms
them so that they will be executed by multiple threads. This is the
recommended method for initial experiences with parallelization, because
the other methods require detailed knowledge of parallel programming
concepts and implementation statements. Also, Compaq KAP sets the
compiler and linker switches correctly.
Section 3.2 shows how to direct Compaq KAP to perform automatic
parallelization of your program. For example, the following command line
applies KAP automatic detection, selection, and transformation of loops
to the C source program my_prog.c:
kcc -ckapargs='-concurrent' my_prog.c
The results include a transformed source program and its processing by
the compiler and linker to create executable file a.out. The transformed
source file contains OpenMP directives for the loops that Compaq KAP has
automatically decided to parallelize when the switch is set. The OpenMP
directives are passed on to the compiler for processing.
- Directed --- Use this method for programs that
contain parallel directives and for which you want only the loops
surrounded by parallel directives to be parallelized. These directives
explicitly control where and when Compaq KAP performs parallelization
inside your program.
Section 3.3 shows how to use KAP to perform
directed parallelization of your program. For example, the following
command line applies KAP directed detection and transformation of loops
to the C source program my_prog.c, which contains OpenMP directives:
kcc -ckapargs='-noconc' my_prog.c \
-omp -pthread -call_shared
The results include a transformed source program and its processing
by the compiler and linker to create executable file a.out.
- Combination --- Use this method for programs where
you control, with OpenMP directives, the parallelization of selected
source files in a program application and you want KAP to perform
automatic detection, transformation, and parallelization of the
remaining source files. Section 3.4 shows how to direct KAP to perform
combined parallelization of your program. A possible command line for
the two C source files openmp.c and no_openmp.c is:
kcc -ckapargs='-concurrentize' openmp.c no_openmp.c
where openmp.c contains OpenMP directives and
no_openmp.c does not. KAP leaves openmp.c untransformed and produces a
transformed source file for no_openmp.c; the compiler and linker then
process both files to create executable file a.out. KAP automatically
parallelizes loops in no_openmp.c by inserting OpenMP directives. The
directives inserted by KAP in no_openmp.c and those written by
the programmer in openmp.c are then processed by the
compiler. The compiler switch -omp tells the compiler to
recognize the OpenMP directives.
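For readers new to OpenMP, a loop annotated by hand for the directed method might look like the following minimal sketch. It uses the standard OpenMP parallel for directive; the function scale_vector is a hypothetical example and is not KAP output. When the code is built without OpenMP support, the compiler ignores the pragma and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Scale x by a into y. The iterations are independent, so the loop is
 * marked for parallel execution with a standard OpenMP directive. */
void scale_vector(double *y, const double *x, int n, double a)
{
    int i;

    assert(x != NULL && y != NULL);
#pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i];
}
```

The loop variable i is private to each thread under the OpenMP rules for parallel for, which is why no further clauses are needed here.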
When using any of these three methods you must be aware of the values
of environment variables, because they affect the run-time behavior of
your program.
Environment Variables
OMP_SCHEDULE (static,dynamic,guided,runtime)
OMP_DYNAMIC (true,false) default is false.
OMP_NESTED (true,false) default is false.
OMP_NUM_THREADS (number) default value is the number of
processors on the current system.
For further information on environment variables read by the C compiler
see your Compaq C user's guide.
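One way to observe the effect of OMP_NUM_THREADS from inside a program is to ask the OpenMP runtime how many threads it would use. The sketch below relies on the standard _OPENMP macro and the standard omp_get_max_threads function; the wrapper name max_parallel_threads is hypothetical. It falls back to 1 when the code is compiled without OpenMP support.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Report the number of threads a parallel region could use. Under
 * OpenMP this reflects OMP_NUM_THREADS when that variable is set. */
int max_parallel_threads(void)
{
    int n = 1;      /* serial fallback when OpenMP is not enabled */

#ifdef _OPENMP
    n = omp_get_max_threads();
#endif
    assert(n >= 1);
    return n;
}
```

Printing this value at program start is a simple check that the environment seen by the program matches what you intended to set.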
3.1.2 Parallel Processing Controls --- Summary
KAP provides the following parallel command switches, directives, and
assertions for use with automatic parallel processing. Refer to the
appropriate sections for explanations and code examples as follows:
- Parallel Processing Switches
- Parallel Processing Directives and Parallel Processing Assertions,
which are used to assist automatic parallelization:
#pragma _KAP concurrent
#pragma _KAP concurrent call
#pragma _KAP concurrent ignore call
#pragma _KAP serial
- OpenMP Pragmas, which describe parallel processing pragmas used for
directed parallelization:
- Parallel pragmas
- Worksharing pragmas
- Workqueuing pragmas
- Combined Parallel and Worksharing/Workqueuing pragmas
- Synchronization pragmas
- Privatization of Global Variables pragmas
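As a sketch of how one of the KAP assertions listed above might appear in source code, the following example places #pragma _KAP concurrent call ahead of a loop that calls a function with no side effects. The placement immediately before the loop is an assumption made for this illustration, and smooth and smooth_all are hypothetical names. Consult the directive reference sections for the exact syntax and scope. Compilers that do not recognize the pragma ignore it.

```c
#include <assert.h>
#include <stddef.h>

/* A function with no side effects: its result depends only on v. */
double smooth(double v)
{
    return 0.5 * v;
}

/* The assertion is intended to tell KAP that the calls in the loop are
 * safe to execute concurrently (placement assumed, see the lead-in). */
void smooth_all(double *a, int n)
{
    int i;

    assert(a != NULL);
#pragma _KAP concurrent call
    for (i = 0; i < n; i++)
        a[i] = smooth(a[i]);
}
```

An assertion of this kind is a promise made by the programmer. If the called function did modify shared data, the parallel result could differ from the serial one.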
Two types of command lines, kcc and kapc, invoke
Compaq KAP software:
- Invoking the kcc command results in these events:
- Transformation of the source program using software known as the
kapc preprocessor.
- Compilation of the transformed source program using the C compiler.
- Linking of the object program from the previous event with runtime
libraries.
Compaq recommends using the kcc command to process your
files.
- Invoking the kapc command only transforms the source program. No
compilation and no linking occur.
3.1.3 Parallel Processing Controls --- Interaction
As a programmer, you should always remember that you implement a
parallel processing method (automatic, directed, or combination) by
making choices from the previous command line options, directives, and
assertions. Your choices affect the following actions:
- FOR loop detections made during Compaq KAP preprocessing
- FOR loop transformations made during Compaq KAP preprocessing
- Compiler and linker output after Compaq KAP preprocessing
- Runtime behavior of the executable file after preprocessing,
compiling, and linking
For example, suppose you choose combination detection and
parallelization for source programs openmp.c and
no_openmp.c. These programs contain some or none of the
parallel processing directives, parallel processing assertions, and
OpenMP directives. Consider the following command:
kcc -ckapargs='-concurrent -minconcurrent=1000' \
openmp.c no_openmp.c
This command tells Compaq KAP to:
- Automatically detect (-concurrent) and transform loops
with at least 1000 units of work (-minconcurrent=1000)
- Respond to parallel processing directives and parallel processing
assertions, such as #pragma _KAP concurrent call.
- Give the transformed source program files to the compiler and
linker for the creation of executable program file a.out (the
default behavior of the kcc command).
Compaq KAP parallel processing options, such as -concurrent,
are enclosed in single quotation marks and are values of the
-ckapargs option. The kcc driver responds to the
options enclosed in these single quotation marks by passing them as
arguments to the kapc preprocessor (which actually transforms
the source program file).
The default values of the parallel processing options also control
Compaq KAP loop detections, loop transformations, calling of the
compiler and linker, and runtime scheduling. They are:
-minconcurrent=1000
-scheduling=e
-chunk=1
Read the explanations of each of the three methods of parallelization
in light of how your choices of options, directives, and assertions
affect Compaq KAP detection of loops, changes to loops, compiler and
linker behavior, and runtime behavior of executable file a.out.
3.2 Automatic Parallelization Using the kcc Driver
To compile and run your program with parallel processing, use the
-concurrentize switch, abbreviated -conc, as follows:
kcc -ckapargs='-conc' myprog.c
For information on running a parallel program, see Section 3.6.
3.2.1 Preprocessing a Program for Parallel Execution Using kapc
To execute KAP as a standalone preprocessor, use the commands that
correspond to your version of UNIX:
- DIGITAL UNIX Version 3.2:
cc -P -D__KAP -U_INLINE_INTRINSICS myprog.c
kapc -conc -cmp=myprog_mp.c myprog.i
cc -migrate myprog_mp.c -tune host -call_shared -lkio -O4 \
-lkmp_osf -threads
- -threads---causes KAP to link to POSIX
1003.4a/d4-compliant DECthreads library libpthreads.so.
- -lkmp_osf---causes KAP to link to the parallel processing
library lkmp_osf.a.
- Tru64 UNIX, and DIGITAL UNIX Versions 4.0 and above:
cc -P -D__KAP -U_INLINE_INTRINSICS myprog.c
kapc -conc -cmp=myprog_mp.c myprog.i
cc -migrate myprog_mp.c -tune host -call_shared -lkio -fast \
-lkmp_osfp10 -pthread
- -pthread---causes KAP to link to POSIX 1003.1c-compliant
DECthreads library libpthread.so.
- -lkmp_osfp10---causes KAP to link to the parallel
processing library lkmp_osfp10.a.
An explanation of the remaining switches follows:
- The -conc switch causes KAP to restructure the source
code for parallel processing.
- -cmp causes KAP to save the optimized source program
under the file name of your choice. By default, kapc
names the optimized source file_name.cmp. Because the
Compaq C compiler will not process a file with the
extension .cmp, you must use -cmp to give the optimized
source a name that ends in .c, such as file_name.cmp.c.
- -fast provides a single method for turning on a collection
of compiler optimizations on Tru64 UNIX, and DIGITAL UNIX
Versions 4.0 and above. (See the cc manpage for a detailed
description.)
- -lkio tells the linker to use the KAP library,
lkio, which contains routines to support KAP optimizations.
- -lkmp_osf tells the linker to use the KAP parallel
processing library for DIGITAL UNIX Version 3.2. The KAP parallel
processing library provides the interface to DECthreads.
- -lkmp_osfp10 tells the linker to use the KAP parallel
processing library for Tru64 UNIX, and DIGITAL UNIX Versions
4.0 and above. The parallel processing library provides an interface to
DECthreads.
- -migrate calls the Compaq C compiler. For more
information about -migrate, see your Compaq C user's
guide.
- -call_shared tells the linker to link against shared
libraries. Compaq recommends that you use the -call_shared
default.
- -pthread tells the linker to use threadsafe versions of
libraries, where they exist, and to include the libpthread library of
Tru64 UNIX and DIGITAL UNIX Versions 4.0 and above when linking
the program.
- -threads tells the linker to use threadsafe versions of
libraries if they exist, and to include DIGITAL UNIX Version 3.2
libpthreads when linking the program.
- -tune host tells the compiler to optimize for the architecture
of the host processor. The Compaq C compiler switch -tune
host and the KAP C switch -tune=<architecture> work
independently and perform different optimizations. For information
about the KAP -tune switch, see Section 4.2.9.
- -U_INLINE_INTRINSICS --- stops the compiler from inlining
intrinsic functions. KAP currently does not support the inlining of
intrinsic functions by the compiler.
Note
When you use kapc to preprocess a file, you must set the
Compaq C compiler and linker switches appropriately. For this
reason, Compaq recommends that you use kcc whenever possible,
because kcc automatically sets the compiler and linker
switches correctly.