As is known, when a program written in a data-parallel language, such as Fortran 90, is compiled to run on a computer having a distributed memory parallel processor architecture, aggregate data objects, such as arrays, are distributed across the processor memories. This mapping determines the amount of residual communication that is required to bring operands of parallel operations into alignment with each other.
Languages such as Fortran D and High Performance Fortran permit programs to be annotated with alignment and distribution statements that specify the desired decomposition of arrays across the individual processors of computers of the foregoing type. Thus, arrays and other aggregate data objects commonly are mapped onto the processors of these computers in two stages: first, an "alignment" phase that maps all the array objects to an abstract template in a Cartesian index space, and then a "distribution" phase that maps the template to the processors. The alignment phase positions all array objects in the program with respect to each other so as to reduce realignment communication cost. The distribution phase, in turn, distributes the template-aligned objects to the processors. This two-phase mapping effectively separates the language issues from the machine issues because the alignment of the data objects is machine independent. A copending United States patent application of Gilbert et al., which was filed Aug. 11, 1993 under U.S. Ser. No. 08/104,755 on "Mobile and Replicated Alignments of Arrays in Data Parallel Programs," deals with the alignment issues and is hereby incorporated by reference.
Compilers for distributed memory applications of programs written in these data parallel languages are required to partition arrays and to produce "node code" that is suitable for execution on a computer having a distributed memory, parallel processor architecture. For example, consider an array A of size N that is aligned to a template T such that array element A(i) is aligned to template cell ai+b. The notational shorthand that this example brings into play is as follows:
A    a distributed array
N    size of array A
a    stride of alignment of A to the template
b    offset of alignment of A to the template
p    number of processors to which the template is distributed
k    block size of distribution of the template
l    lower bound of regular section of A
h    upper bound of regular section of A
s    stride of regular section of A
m    processor number
Now, let template T be distributed across p processors using a cyclic(k) distribution, so template cell i is held by processor (i div k) mod p. This means that array A is split into p local arrays residing in the local memories of the respective processors. Consequently, if a zero-based indexing scheme is used so that array elements, template cells, and processors are numbered starting from zero, a computation involving the regular section A(l:h:s) can be viewed as involving the set of array elements {A(l+js) : 0 ≤ j ≤ (h-l)/s}, where s > 0.
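The mapping just described can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the node code generation process itself: the function names are hypothetical, while the symbols a, b, k, p, l, h, and s follow the notational shorthand given above.

```python
def template_cell(i, a, b):
    """Template cell to which array element A(i) is aligned (a*i + b)."""
    return a * i + b


def owner_of_cell(cell, k, p):
    """Processor holding a template cell under a cyclic(k) distribution."""
    return (cell // k) % p


def section_indices(l, h, s):
    """Array indices l + j*s for 0 <= j <= (h - l)/s, with s > 0."""
    if s <= 0:
        raise ValueError("the section stride s must be positive")
    return [l + j * s for j in range((h - l) // s + 1)]


def section_owners(l, h, s, a, b, k, p):
    """Owning processor of each element of the regular section A(l:h:s)."""
    return [owner_of_cell(template_cell(i, a, b), k, p)
            for i in section_indices(l, h, s)]
```

For instance, with a=1, b=0, k=4, and p=3, the section A(0:20:3) comprises the elements with indices 0, 3, 6, 9, 12, 15, and 18, which under this sketch are held by processors 0, 0, 1, 2, 0, 0, and 1, respectively.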
Among the issues that must be addressed for effective node code generation by compilers for such data parallel programs are: (1) the sequence of local memory addresses that a given processor m must access while performing its share of the computation, and (2) the set of messages that a given processor m must send while performing its share of the computation.
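The first of these issues can be made concrete with a brute-force Python sketch that scans the whole regular section and keeps only the elements held by processor m. Several assumptions are made purely for illustration: the alignment is taken to be the identity (a=1, b=0), each processor is assumed to store its blocks contiguously in order of increasing template cell, and the function names are hypothetical. The scan visits every element of the section on every processor, which is exactly the kind of overhead an efficient process would need to avoid.

```python
def local_address(i, k, p):
    """Local memory offset of template cell i on its owning processor.

    Assumes identity alignment and a packed layout in which the blocks
    held by a processor are stored one after another.
    """
    course = i // (p * k)      # how many full rounds of p blocks precede i
    offset = i % k             # position of i inside its block
    return course * k + offset


def local_access_sequence(l, h, s, k, p, m):
    """(global index, local address) pairs that processor m touches
    for the regular section A(l:h:s), found by exhaustive scanning."""
    sequence = []
    for i in range(l, h + 1, s):
        if (i // k) % p == m:  # cyclic(k) ownership test
            sequence.append((i, local_address(i, k, p)))
    return sequence
```

With k=4, p=3, and the section A(0:20:3), processor 0 would access global indices 0, 3, 12, and 15 at local addresses 0, 3, 4, and 7 under these assumptions.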
As will be apparent, cyclic(k) distributions of templates generalize the familiar block and cyclic distributions, which are cyclic(n/p) and cyclic(1), respectively. There is reason to believe that this generalization is important. For instance, it has been suggested that generalized cyclic(k) distributions may be essential for efficient dense matrix algorithms on distributed-memory machines (see Dongarra et al., "A Look at Scalable Dense Linear Algebra Libraries," Proc. of the Scalable High Performance Computer Conference, IEEE Computer Society Press, April 1992, pp. 372-379; also available as technical report ORNL/TM-12126 from Oak Ridge National Laboratory). Indeed, the importance of these generalized cyclic(k) distributions is recognized by High Performance Fortran, which includes directives for multidimensional cyclic(k) distribution of templates.
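The relationship between the three distributions can be sketched in Python as follows. The function names are hypothetical, n is taken here to denote the number of template cells, and the use of a ceiling for the block size when p does not divide n evenly is an assumption of this sketch rather than something stated above.

```python
def cyclic_k_owners(n, k, p):
    """Owning processor of each of n template cells under cyclic(k)."""
    return [(cell // k) % p for cell in range(n)]


def block_owners(n, p):
    """Block distribution expressed as cyclic(ceil(n/p))."""
    k = -(-n // p)  # integer ceiling of n/p (an assumption when p does not divide n)
    return cyclic_k_owners(n, k, p)


def cyclic_owners(n, p):
    """Cyclic distribution expressed as cyclic(1)."""
    return cyclic_k_owners(n, 1, p)
```

For n=12 and p=3, the block distribution assigns cells 0 through 3 to processor 0, cells 4 through 7 to processor 1, and cells 8 through 11 to processor 2, whereas cyclic(2) deals out blocks of two cells to processors 0, 1, 2, 0, 1, 2 in turn.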
Others have proposed techniques for generating node code for the block and cyclic mappings of regular sections of arrays. A solution for the case where the regular section has unit stride was proposed by Koelbel and Mehrotra, "Compiling Global Name-Space Parallel Loops for Distributed Execution," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 4, Oct. 1991, pp. 440-451. Koelbel, "Compile Time Generation of Regular Communication Patterns," in Proceedings of Supercomputing '91, Albuquerque, NM, Nov. 1991, pp. 101-110, deals with regular sections having non-unit stride. MacDonald et al., "Addressing in Cray Research's MPP Fortran," in Proceedings of the Third Workshop on Compilers for Parallel Computers (Vienna, Austria, July 1992), Austrian Center for Parallel Computation, pp. 161-172, discuss cyclic(k) distributions in the context of Cray Research's MPP Fortran, but they concluded that a general solution requires unacceptable address computation overhead in inner loops. Accordingly, they restrict block sizes and the number of processors to powers of two, thereby allowing the use of bit manipulation in address computation. Thus, there still is a need for an efficient process for generating the local memory addresses and the communication sets (i.e., the node code) that enable computers having distributed memory, parallel processor architectures to handle generalized cyclic(k) distributions of arrays.
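The appeal of the power-of-two restriction can be seen from a short Python sketch. It is not taken from MacDonald et al.; it merely illustrates the standard identity that, for non-negative integers, division by a power of two is a right shift and reduction modulo a power of two is a bit mask. The function names and the parameters log2_k and log2_p are hypothetical.

```python
def owner_general(i, k, p):
    """Owner of template cell i under cyclic(k): one division and one modulo."""
    return (i // k) % p


def owner_pow2(i, log2_k, log2_p):
    """Same result when k = 2**log2_k and p = 2**log2_p, for i >= 0,
    using only a shift and a mask."""
    return (i >> log2_k) & ((1 << log2_p) - 1)
```

With k=8 and p=4, for example, owner_pow2(i, 3, 2) agrees with owner_general(i, 8, 4) for every non-negative cell i, but the shortcut is unavailable once k or p is not a power of two.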