Compaq KAP Fortran/OpenMP
for Tru64 UNIX
User Guide




Appendix B
Data Dependence Analysis

This appendix provides a brief explanation of data dependence.

B.1 Data Dependence Definitions

Data dependence is the criterion that KAP uses to determine if a given loop can be optimized. KAP tries to ensure that the optimized program produces the same results as the scalar input. For some loops, certain optimizations can result in incorrect answers; in these cases KAP detects and informs you of the dependencies that can interfere with KAP transformations.

Data dependence analysis was originally motivated by vector and parallel computation. The basic tools and notations apply to advanced serial optimizations on uniprocessor computers.

KAP determines data dependence and bases decisions on such information automatically, informing you only of those dependencies preventing optimization.

KAP uses a data dependence graph that shows where data values are generated and where they are used within a DO loop or DO loop nest. The data dependence graph is processed (with standard graph traversal techniques) to find potential problem areas, which appear as cycles in the graph. Each data dependence cycle is carefully examined to see if it can be broken and the loop transformed, or if part or all of the loop must be executed in its original form.
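
As a minimal illustration (this example is not from the original text), the following recurrence contains a flow dependence cycle: statement S1 in iteration I reads the value of A(I-1) that S1 itself stored in iteration I-1. The cycle cannot be broken, so the loop must be executed in its original serial order:


       DO 10 I = 2,N 
S1:  A(I) = A(I-1) + B(I) 
10   CONTINUE 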


B.2 Varieties of Data Dependence

The three kinds of data dependence are defined as follows. The notation S1, S2, and SN denotes statement 1, statement 2, and statement N, respectively. In each example, the second statement depends on the first because of the variable X.

1. Data dependence from an assignment to a use of a variable is called flow dependence (or true dependence):


(S1):  X = 3 
(S2):  Y = X 

2. Data dependence from a use of a variable to a later reassignment of that variable is called antidependence:


(S1):  Y = X 
(S2):  X = 3 

3. Data dependence from an assignment of a variable to a later reassignment of that variable is called output dependence:


(S1):   X = 3 
: 
(SN):   X = 4 

B.3 Input and Output Sets

To determine data dependence, it is necessary to form the input and output sets for the given statements.

IN(S1) denotes the set of input items of S1 (items whose values may be read by S1). OUT(S1) denotes the set of output items (scalar variables or array elements) of statement S1 (items whose values may be changed by S1).

The IN and OUT sets for the assignment statement in the loop are shown in the following example:


       DO 10 I = 1,10 
S1:  X(I) = A(I+1) * B 
10   CONTINUE 
 
  IN(S1) = {A(2), A(3), A(4), ..., A(11), B} 
 
  OUT(S1) = {X(1), X(2), X(3), ..., X(10)} 

In practice, KAP often approximates the IN and OUT sets because the actual loop bounds are frequently unknown at compile time.
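
For example (an illustrative sketch, not taken from the original text), if the same loop ran to an unknown upper bound N, the sets would be described symbolically:


       DO 10 I = 1,N 
S1:  X(I) = A(I+1) * B 
10   CONTINUE 
 
  IN(S1) = {A(2), A(3), ..., A(N+1), B} 
 
  OUT(S1) = {X(1), X(2), ..., X(N)} 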

B.4 Data Dependence Relations

For any two statements, S1 and S2, one of the three types of data dependence relations may be true, or the statements may be data independent.

If some item X is in OUT(S1) and X is in IN(S2) and S2 is to use the value of X computed in S1, S2 is flow dependent on S1, as in example 1 in Section B.2.

If some item X is in IN(S1) and X is in OUT(S2), but S1 is to use the value of X before it is changed by S2, S2 is antidependent on S1, as in example 2 of Section B.2.

If some item X is in OUT(S1) and X is in OUT(S2) and the value computed by S2 is to be stored after the value computed by S1 is stored, S2 is output dependent on S1, as in example 3 of Section B.2.

Antidependence and output dependence relations are sometimes inadvertently caused by programmers' coding practices. These dependencies can often be removed by more careful coding.
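
For example (an illustrative sketch, not from the original text), the antidependence in example 2 of Section B.2 can be removed by renaming: store the new value into a fresh variable, here the hypothetical XNEW, and refer to XNEW in all later statements:


(S1):  Y = X 
(S2):  XNEW = 3 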

B.5 Data Dependence Direction Vectors

The direction vector is defined as a sequence of direction vector elements, one element for each DO loop enclosing both statements involved in the dependence arc. The direction vector elements are <, =, >, <=, >=, /, and *.

If line 12 in iteration I" depends on line 11 in iteration I', the element of the direction vector for loop I is as follows:


Direction Vector Element    When 
------------------------    ------------------------------------------ 
<                           I' must be < I" 
=                           I' must be = I" 
>                           I' must be > I" 
<=                          I' must be < or = I" 
>=                          I' must be > or = I" 
/                           I' must not = I" 
*                           No relation between I' and I" can be proven 
 

Consider the following loop:


 DO Loops   Line 
 +--------- 10  DO 10 I = 1,N-1 
 |          11  A(I) = B(I) + C(I) 
 |          12  C(I) = A(I-1) - 1 
 |_________ 13  10  CONTINUE 

The dependence from A(I) (assigned at line 11) to A(I-1) (used at line 12) has a direction vector of <, because the dependence flows from iteration I' to iteration I'+1, and I' < I'+1. The antidependence from C(I) (used at line 11) to C(I) (assigned at line 12) has a direction vector of =, because the dependence stays within the same iteration of the loop (from iteration I' to iteration I').

B.6 Loop-Carried Dependence

A dependence is said to be carried by a loop if the corresponding direction vector element for that loop has a directional component (<, <=, /, or *). Loop-carried dependence is an important concept for discovering when the iterations of a loop can be executed concurrently: if a loop carries no dependence, all of its iterations can be executed in parallel without synchronization.

B.7 Data Dependence Examples

This loop can easily be parallelized, because there are no loop-carried dependencies:


DO 10 I = 1,N 
A(I) = B(I) + 2 
C(I) = A(I) + D(I) 
10  CONTINUE 

The next loop cannot be parallelized directly: an antidependence on A exists from the second assignment statement to the first, so synchronization must be added to parallelize the loop (which, in turn, may slow execution). A way to remove the antidependence instead is sketched after the loop.


DO 10 I = 1,N 
A(I) = B(I) + 2 
C(I) = A(I+1) + D(I) 
10  CONTINUE 
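
One way to remove the antidependence, shown here as an illustrative sketch rather than as part of the original text, is to copy the needed elements of A into a scratch array before the loop. The hypothetical array ATMP preserves the old values of A(I+1), so no iteration of the main loop reads a location that another iteration writes, and the loop parallelizes without synchronization (at the cost of extra memory and a copy loop):


DO 5 I = 1,N 
ATMP(I) = A(I+1)        ! ATMP is a hypothetical scratch array 
5   CONTINUE 
DO 10 I = 1,N 
A(I) = B(I) + 2 
C(I) = ATMP(I) + D(I)   ! reads the saved old value of A(I+1) 
10  CONTINUE 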


Appendix C
OpenMP Examples

For your convenience, the following examples have been adapted from the ANSI X3H5 Parallel Extensions for FORTRAN document.

C.1 DO: A Simple Difference Operator

This example shows a simple parallel loop where the amount of work in each iteration differs; dynamic scheduling is used to get good load balancing. The end do has a nowait because there is an implicit barrier at the end parallel. Alternatively, using the option -optimize=1 would also have eliminated the barrier.


           subroutine do_1 (a,b,n) 
           real a(n,n), b(n,n) 
 
   c$omp parallel 
   c$omp&   shared(a,b,n) 
   c$omp&   private(i,j) 
   c$omp do schedule(dynamic,1) 
           do i = 2, n 
               do j = 1, i 
                  b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 
               enddo 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
           end 

C.2 DO: Two Difference Operators

This example shows two loops fused into a single parallel region to reduce fork/join overhead. The first end do has a nowait because the data used in the second do is disjoint from the data used in the first do.


           subroutine do_2 (a,b,c,d,m,n) 
           real a(n,n), b(n,n), c(m,m), d(m,m) 
 
   c$omp parallel 
   c$omp&   shared(a,b,c,d,m,n) 
   c$omp&   private(i,j) 
   c$omp do schedule(dynamic,1) 
           do i = 2, n 
               do j = 1, i 
                   b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 
               enddo 
           enddo 
   c$omp end do nowait 
   c$omp do schedule(dynamic,1) 
           do i = 2, m 
               do j = 1, i 
                   d(j,i) = ( c(j,i) + c(j,i-1) ) / 2 
               enddo 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
           end 

C.3 DO: Reduce Fork/Join Overhead

Routines do_3a and do_3b perform numerically equivalent computations, but because the parallel directive in routine do_3b is outside the do j loop, routine do_3b probably forms teams less often, and thus reduces overhead.


           subroutine do_3a (a,b,m,n) 
           real a(n,m), b(n,m) 
 
           do j = 2, m 
   c$omp parallel 
   c$omp&   shared(a,b,n,j) 
   c$omp&   private(i) 
   c$omp do 
               do i = 1, n 
                   a(i,j) = b(i,j) / a(i,j-1) 
               enddo 
   c$omp end do nowait 
   c$omp end parallel 
           enddo 
           end 
 
           subroutine do_3b (a,b,m,n) 
           real a(n,m), b(n,m) 
 
   c$omp parallel 
   c$omp&   shared(a,b,m,n) 
   c$omp&   private(i,j) 
           do j = 2, m 
   c$omp do 
               do i = 1, n 
                   a(i,j) = b(i,j) / a(i,j-1) 
               enddo 
   c$omp end do nowait 
           enddo 
   c$omp end parallel 
           end 

C.4 SECTIONS: Two Difference Operators

Identical to Section C.2 but uses sections instead of do. Here the speedup is limited to 2 because there are only 2 units of work, whereas in Section C.2 there are (n-1) + (m-1) units of work.


           subroutine sections_1 (a,b,c,d,m,n) 
           real a(n,n), b(n,n), c(m,m), d(m,m) 
 
   c$omp parallel 
   c$omp&   shared(a,b,c,d,m,n) 
   c$omp&   private(i,j) 
   c$omp sections 
   c$omp section 
           do i = 2, n 
               do j = 1, i 
                   b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 
               enddo 
           enddo 
   c$omp section 
           do i = 2, m 
               do j = 1, i 
                   d(j,i) = ( c(j,i) + c(j,i-1) ) / 2 
               enddo 
           enddo 
   c$omp end sections nowait 
   c$omp end parallel 
           end 

C.5 SINGLE: Updating a Shared Scalar

This example demonstrates how to use a single construct to update an element of the shared array a. The optional end do nowait after the first do is omitted because the threads must wait at the end of the do before proceeding into the single.


           subroutine sp_1a (a,b,n) 
           real a(n), b(n) 
 
   c$omp parallel 
   c$omp&   shared(a,b,n) 
   c$omp&   private(i) 
   c$omp do 
           do i = 1, n 
               a(i) = 1.0 / a(i) 
           enddo 
   c$omp single 
           a(1) = min( a(1), 1.0 ) 
   c$omp end single 
   c$omp do 
           do i = 1, n 
               b(i) = b(i) / a(i) 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
           end 

C.6 SECTIONS: Updating a Shared Scalar

Identical to Section C.5 but using different directives.


           subroutine sections_sp_1 (a,b,n) 
           real a(n), b(n) 
 
   c$omp parallel 
   c$omp&   shared(a,b,n) 
   c$omp&   private(i) 
   c$omp do 
           do i = 1, n 
               a(i) = 1.0 / a(i) 
           enddo 
   c$omp sections 
           a(1) = min( a(1), 1.0 ) 
   c$omp end sections 
   c$omp do 
           do i = 1, n 
               b(i) = b(i) / a(i) 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
           end 

C.7 DO: Updating a Shared Scalar

Identical to Section C.5 but using different directives.


           subroutine do_sp_1 (a,b,n) 
           real a(n), b(n) 
 
   c$omp parallel 
   c$omp&   shared(a,b,n) 
   c$omp&   private(i) 
   c$omp do 
           do i = 1, n 
               a(i) = 1.0 / a(i) 
           enddo 
   c$omp end do 
   c$omp do 
           do i = 1, 1 
               a(1) = min( a(1), 1.0 ) 
           enddo 
   c$omp end do 
   c$omp do 
           do i = 1, n 
               b(i) = b(i) / a(i) 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
           end 

C.8 PARALLEL DO: A Simple Difference Operator

Identical to Section C.1 but using different directives.


           subroutine paralleldo_1 (a,b,n) 
           real a(n,n), b(n,n) 
 
   c$omp parallel do 
   c$omp&   shared(a,b,n) 
   c$omp&   private(i,j) 
   c$omp&   schedule(dynamic,1) 
           do i = 2, n 
               do j = 1, i 
                   b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 
               enddo 
           enddo 
           end 

C.9 PARALLEL SECTIONS: Two Difference Operators

Identical to Section C.4 but using different directives. The maximum performance improvement is limited to the number of sections run in parallel, so this example has a maximum parallelism of 2.


           subroutine sections_2 (a,b,c,d,m,n) 
           real a(n,n), b(n,n), c(m,m), d(m,m) 
 
   c$omp parallel sections 
   c$omp&   shared(a,b,c,d,m,n) 
   c$omp&   private(i,j) 
   c$omp section 
           do i = 2, n 
               do j = 1, i 
                  b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 
               enddo 
           enddo 
   c$omp section 
           do i = 2, m 
               do j = 1, i 
                  d(j,i) = ( c(j,i) + c(j,i-1) ) / 2 
               enddo 
           enddo 
   c$omp end parallel sections 
           end 

C.10 Simple Reduction

This demonstrates how to perform a reduction using partial sums while avoiding synchronization in the loop body.


           subroutine reduction_1 (a,m,n,sum) 
           real a(m,n) 
 
   c$omp parallel 
   c$omp&   shared(a,m,n,sum) 
   c$omp&   private(i,j,local_sum) 
           local_sum = 0.0 
   c$omp do 
           do i = 1, n 
               do j = 1, m 
                   local_sum = local_sum + a(j,i) 
               enddo 
           enddo 
   c$omp end do nowait 
   c$omp critical 
           sum = sum + local_sum 
   c$omp end critical 
   c$omp end parallel 
           end 

The above reduction could also be written with the reduction() clause, as follows:


           subroutine reduction_2 (a,m,n,sum) 
           real a(m,n) 
 
   c$omp parallel do 
   c$omp&   shared(a,m,n) 
   c$omp&   private(i,j) 
   c$omp&   reduction(+:sum) 
           do i = 1, n 
               do j = 1, m 
                    sum = sum + a(j,i) 
               enddo 
           enddo 
           end 

C.11 TASKCOMMON: Private Common

This example demonstrates the use of taskcommon privatizable common blocks.


           subroutine tc_1 (n) 
           common /shared/ a 
           real a(100,100) 
           common /private/ work 
           real work(10000) 
   c$omp threadprivate (/private/)  ! this privatizes the 
                                    ! common /private/ 
   c$omp parallel 
   c$omp&   shared(n) 
   c$omp&   private(i) 
   c$omp do 
           do i = 1, n 
               call construct_data() ! fills in array work() 
               call use_data()       ! uses array work() 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
           end 

C.12 THREADPRIVATE: Private Common and Master Thread

In this example, the value 2 is printed: within a master section, as in serial code outside the parallel region, it is the master thread's copy of a variable in a threadprivate common block that is accessed. If a single were used in place of the master section, some single thread, but not necessarily the master thread, would set j to 2, and the printed result would be indeterminate.


           subroutine tc_2 
           common /blk/ j 
   c$omp threadprivate (/blk/) 
 
           j = 1 
   c$omp parallel 
   c$omp master 
           j = 2 
   c$omp end master 
   c$omp end parallel 
 
           print *, j 
           end 

C.13 INSTANCE PARALLEL: As a Private Common

This demonstrates the use of instance parallel privatizable common blocks.


           subroutine ip_1 (n) 
           common /shared/ a 
           real a(100,100) 
           common /private/ work 
           real work(10000) 
   c$omp instance parallel (/private/) 
 
   c$omp parallel 
   c$omp&   shared(n) 
   c$omp&   private(i) 
   c$omp new (/private/)            ! this privatizes the 
   c$omp do                         ! common /private/ 
           do i = 1, n 
               call construct_data()! fills in array work() 
               call use_data()      ! uses array work() 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
           end 

C.14 INSTANCE PARALLEL: As a Shared and then a Private Common

This demonstrates the use of an instance parallel common block first as a shared common block and then as a private common block. This would not be possible with taskcommon blocks because taskcommon blocks are always private.


           subroutine ip_2 (n,m) 
           common /shared/ a,b 
           real a(100,100), b(100,100) 
           common /private/ work 
           real work(10000) 
   c$omp instance parallel (/private/) 
 
   c$omp parallel                    ! common /private/ is 
   c$omp&   shared(a,b,n)            ! shared here since 
   c$omp&   private(i)               ! no new appears 
   c$omp do 
           do i = 1, n 
               work(i) = b(i,i) / 4.0 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
 
           do i = 1, n 
               do j = 1, m 
                   a(j,i) = work(i) * ( a(j-1,i) + a(j+1,i) 
        x                   + a(j,i-1) + a(j,i+1) ) 
               enddo 
           enddo 
 
   c$omp parallel 
   c$omp&   shared(m) 
   c$omp&   private(i) 
   c$omp new (/private/)             ! this privatizes the 
   c$omp do                          ! common /private/ 
           do i = 1, m 
               call construct_data() ! fills in array work() 
               call use_data()       ! uses array work() 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
           end 

C.15 Avoiding External Routines: Reduction

This example demonstrates two coding styles for reductions, one using the external routines omp_get_max_threads() and omp_get_thread_num() and the other using only OpenMP directives.


           subroutine reduction_3a (n) 
           real gx( 0:7 )   ! assume 8 processors 
 
           do i = 0, omp_get_max_threads()-1 
               gx(i) = 0 
           enddo 
 
   c$omp parallel 
   c$omp&   shared(a) 
   c$omp&   private(i,lx) 
           lx = 0 
   c$omp do 
           do i = 1, n 
               lx = lx + a(i) 
           enddo 
   c$omp end do nowait 
           gx( omp_get_thread_num() ) = lx 
   c$omp end parallel 
 
           x = 0 
           do i = 0, omp_get_max_threads()-1 
               x = x + gx(i) 
           enddo 
 
           print *, x 
           end 

As shown below, this example could have been written without the external routines.


           subroutine reduction_3b (n) 
 
           x = 0 
   c$omp parallel 
   c$omp&   shared(a,x) 
   c$omp&   private(i,lx) 
           lx = 0 
   c$omp do 
           do i = 1, n 
               lx = lx + a(i) 
           enddo 
   c$omp end do nowait 
   c$omp critical 
           x = x + lx 
   c$omp end critical 
   c$omp end parallel 
 
           print *, x 
           end 

This example could also have been written more simply using the reduction() clause, as follows:


           subroutine reduction_3c (n) 
 
           x = 0 
   c$omp parallel 
   c$omp&   shared(a) 
   c$omp&   private(i) 
   c$omp do reduction(+:x) 
           do i = 1, n 
               x = x + a(i) 
           enddo 
   c$omp end do nowait 
   c$omp end parallel 
 
           print *, x 
           end 

