Computer programs for scientific or engineering applications typically contain many loop-controlled operations on array data.
Since the performance of a single computer often does not suffice for such applications, and since there is a demand for high performance, much research is devoted to parallel computing. In parallel computing, the computation-intensive parts of an application are distributed among a plurality of processing elements which, if necessary, communicate with each other.
One example of such a parallel computer architecture is a Single Instruction Multiple Data (SIMD) architecture, which comprises a plurality of parallel processing elements which get their instructions from a single instruction memory.
For a computer system comprising a parallel architecture, compilers are necessary which can transform computer programs into a form which can be interpreted by the computer system. There exist variants of programming languages, like the derivative Data Parallel C Extension (DPCE) of the well-known programming language C (see Numerical C Extensions Group of X3J11, Data Parallel C Extensions, Technical Report, Version 1.6, 1994), which allow the programmer to express the parallelism of the application in the formulation in the computer language. In contrast to a formulation using systems of affine recurrence equations (see e.g. Catherine Mongenet, Affine Dependence Classification for Communications Minimization, International Journal of Parallel Programming, vol. 25, number 6, 1997), which are single assignment and which in general operate on parallel variables of the same dimensionality as the iteration space, DPCE contains both parallel statements and sequential loops and thus forces (and enables) the programmer to explicitly express the relation between parallelism and memory usage which he or she wishes to establish.
In parallel computing for scientific and engineering applications, particular attention is paid to loop nests operating on multi-dimensional arrays with affine index functions. Such loop nests can be analysed statically by a compiler, and since they predominate in scientific and engineering applications, there is great interest in developing compilation methods which achieve high performance for programs including them.
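As an illustrative sketch (the function name and array sizes are invented for illustration), the following plain-C loop nest operates on two-dimensional arrays, and every array subscript is an affine function, i.e. a linear combination plus a constant, of the loop counters, so a compiler can analyse the access pattern statically:

```c
#include <stddef.h>

#define N 64

/* A loop nest with affine index functions: every subscript (i-1, i+1,
 * j-1, j+1) is an affine expression of the loop counters i and j, so
 * the accessed array regions can be determined at compile time. */
void affine_stencil(double a[N][N], double b[N][N])
{
    for (size_t i = 1; i < N - 1; i++)
        for (size_t j = 1; j < N - 1; j++)
            a[i][j] = 0.25 * (b[i - 1][j] + b[i + 1][j]
                            + b[i][j - 1] + b[i][j + 1]);
}
```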
The well-known obstacles to achieving high performance for the named applications on parallel architectures are communication overhead and/or the I/O bottleneck.
In a SIMD architecture, the data to be processed are typically stored in a shared memory and have to be transferred to local distributed memories, where they are processed by the parallel processing elements.
Consider, by way of example, a SIMD architecture in which there exist a general-purpose processor used for non-parallel data processing and control-flow management, a shared memory, and an array of parallel processing elements, each equipped with a local memory and a number of functional units. The data to be processed are distributed to the local memories according to which parts of the data are processed by the different parallel processing elements.
This data transfer between the shared memory and the local memories typically has high bandwidth requirements, produces communication overhead and thus leads to low performance for the application.
To achieve high performance, a compiler must map the parallelism of the program, for instance parallelism adapted to a SIMD architecture, which is expressed in the program in the corresponding programming language, such as DPCE, onto the computer system in a way that minimizes communication and data transfer overheads.
Typically, data transfers occur in blocks of contiguous bytes. This means that if some data in the shared memory are needed in a local memory, for example a data word corresponding to some floating-point number, then a whole block of data containing this word is transferred to the local memory. The size of this block typically depends on the width of the bus used for the data transfer.
Transferring more data than needed at the moment burdens the limited local memory. But it may happen that some of these incidentally transferred data are in fact used later. This phenomenon is called spatial reuse, i.e. reuse of data located nearby in memory.
Transfer performance would be increased if the data transferred in one block to a local memory were still resident in the local memory when they are later used by the respective processing element, so that they would not have to be transferred again.
Because of the limited size of local memories, this is only possible if spatial reuse occurs early enough in time, since otherwise the data which are used later are evicted from the local memory.
If spatial reuse occurs early enough in time, such that data which have been transferred in a block along with explicitly accessed data do not have to be transferred again, it is said in the following that spatial locality is established for the data transfer.
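To sketch the effect of access order on spatial reuse (the functions and sizes below are illustrative, not taken from the source), the two C loops compute the same sum over a row-major array. In the first, consecutive iterations access adjacent memory locations, so the remaining words of each transferred block are used immediately; the second strides across rows and returns to each block only much later, if at all:

```c
#include <stddef.h>

#define N 64

/* Row-major traversal: consecutive iterations access adjacent memory
 * locations, so spatial reuse of each transferred block occurs
 * immediately and spatial locality is established. */
double sum_row_major(double a[N][N])
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal of the same row-major array: consecutive
 * iterations are N * sizeof(double) bytes apart, so the other words of
 * a transferred block are needed only much later, when the block may
 * already have been evicted from the local memory. */
double sum_col_major(double a[N][N])
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Both loops perform the same computation; only the order of accesses, and hence the exploitability of spatial reuse, differs.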
Another type of reuse which can be exploited is so-called temporal reuse.
This means that data which are used at one point in a program are used again later, i.e. at a later stage of processing.
In a SIMD architecture as described above, for example, where a program including a loop nest is executed, there is temporal reuse if the same data are accessed multiple times.
As in the case of spatial reuse, performance can be increased if data which are accessed multiple times do not have to be transferred to a local memory each time they are accessed.
If temporal reuse occurs early enough in time, such that data which had to be transferred because they were accessed do not have to be transferred again when they are accessed again, it is said in the following that temporal locality is established for the data transfer.
Because of the restricted size of local memories, temporal reuse must happen early enough, before the reused data which have been transferred to a local memory have to be replaced there. This is a so-called scheduling problem: the compiler has to (re)organize the order of computations by applying loop transformations such that temporal locality of data is achieved.
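A minimal illustration of temporal reuse (hypothetical code, not taken from the source): in the loop nest below, every element of b is read once per iteration of the outer loop, i.e. N times in total. If these repeated accesses lie close enough together in time, b can stay resident in a local memory between them and temporal locality is established:

```c
#include <stddef.h>

#define N 8

/* Each element b[j] is read in every iteration of the outer loop over
 * i, i.e. N times in total: this repeated access to the same data is
 * temporal reuse. */
void add_row_to_all(double a[N][N], double b[N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] += b[j];   /* b[j] is reused for every value of i */
}
```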
In the following, the expression locality means spatial locality as well as temporal locality.
Some techniques are commonly used to achieve locality.
Firstly, loop transformations are used to change the order of computation in order to augment both temporal and spatial locality at the same time (see Michael E. Wolf, Monica S. Lam, A Data Locality Optimizing Algorithm, Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, 30-44, 1991).
While there is no alternative to loop transformations, especially to tiling (see Wolf et al. (cited above)), for achieving temporal locality, their applicability is restricted by loop dependencies.
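Tiling can be sketched as follows (the transposition, the tile size T and the assumption that T divides N are illustrative choices, not taken from the source). Both C loop nests perform the same computation, but the tiled version partitions the iteration space into T x T blocks, so that all reuse within a block occurs while the block's data can still be held locally:

```c
#include <stddef.h>

#define N 64
#define T 8   /* tile size; assumed to divide N for simplicity */

/* Original loop nest: reads b column-wise, so for large N each block
 * of b has been evicted before its neighbouring words are needed. */
void transpose_naive(double a[N][N], double b[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = b[j][i];
}

/* Tiled version: the iteration space is partitioned into T x T tiles,
 * so each tile touches only a small working set of a and b. The
 * computation performed is unchanged; only its order differs. */
void transpose_tiled(double a[N][N], double b[N][N])
{
    for (size_t ii = 0; ii < N; ii += T)
        for (size_t jj = 0; jj < N; jj += T)
            for (size_t i = ii; i < ii + T; i++)
                for (size_t j = jj; j < jj + T; j++)
                    a[i][j] = b[j][i];
}
```

Note that such a reordering is only legal if it respects the loop dependences, which is what restricts the applicability of tiling in general.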
Therefore, additional data transformations are suggested in order to change the layout of data in memory. Arrays are to be re-stored in memory in the order in which they are used (see Philippe Clauss, Benoit Meister, Automatic Memory Layout Transformations to Optimize Spatial Locality in Parameterized Loop Nests, ACM SIGARCH Computer Architecture News, Vol. 28, No. 1, March 2000, and Adrian Slowik, Volume Driven Selection of Loop and Data Transformations for Cache-Coherent Parallel Processors, Dissertation, University of Paderborn, 1999).
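The idea of re-storing an array in access order can be sketched in C as follows (the functions and sizes are invented for illustration). The first loop reads a column of a row-major array with stride N; after a layout transformation that stores the array transposed, the same logical accesses become a unit-stride walk through memory:

```c
#include <stddef.h>

#define N 16

/* Reads column k of a row-major array, i.e. with stride N elements:
 * poor spatial locality. */
double column_sum(double a[N][N], size_t k)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += a[i][k];
    return s;
}

/* Data (layout) transformation: re-store the array transposed so the
 * column accesses above become contiguous. The new layout must be
 * propagated to every reference to the array in the program. */
void transpose_layout(double dst[N][N], double src[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* The same logical computation on the transformed layout: unit-stride
 * accesses, hence good spatial locality. */
double column_sum_transposed(double t[N][N], size_t k)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += t[k][i];
    return s;
}
```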
However, such data transformations are global in nature, i.e. they must be propagated to all references to an array in the program, possibly leading to a bad layout for a different reference. This effect must be counterbalanced by a careful combination of loop and data transformations, where loop transformations can be used to locally repair effects of data transformations (see M. F. P. O'Boyle, P. M. W. Knijnenburg, Nonsingular Data Transformations: Definition, Validity and Applications, Int. J. of Parallel Programming, 27(3), 131-159, 1999).
This additionally narrows down the degrees of freedom in the choice of loop transformations and may not be feasible together with conflicting optimisation goals. The globally best layout may be only a compromise, which is not optimal for an individual loop nest.
Also, achieving a new memory layout requires copying operations and thus costs run time.
Finally, most programming languages allow only arrays of rectangular shape. Since a transformed array initially has a polyhedral shape, a bounding box must be constructed, which may lead to arrays much larger than the original ones.
A different means of reducing the negative effect of input and output on performance is the collapsing of multidimensional arrays (see M. M. Strout et al. (cited above), W. Thies et al. (cited above), and F. Quilleré et al. (cited above)).
This is a technique used in the context of automatic parallelization, where arrays are expanded to bring programs into single-assignment form in order to get rid of storage-related dependences.
Later, one wishes to collapse into the same memory locations array dimensions which were just used for temporary storage.
This idea has not yet been applied to the optimisation of locality, but is applicable in principle: expanded arrays correspond to iteration space tiles, and one can “collapse” a tile such that all of its indices refer to different array elements.
The collapsed tile defines local memory addresses and also defines the array elements to be transferred. Thus, multiply indexed data are transferred only once.
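The collapsing of an expanded array can be sketched as follows (the smoothing computation and all names are invented for illustration). In fully expanded, single-assignment form, the computation below would write one array row per time step; because step i only reads the values produced by step i-1, the first dimension can be collapsed to two rows with modulo addressing:

```c
#include <stddef.h>

#define N 10

/* Collapsed form of a time-stepped smoothing: instead of an expanded
 * (steps+1) x N single-assignment array, only two rows are kept, and
 * modulo addressing maps step i to row i % 2. Values that are indexed
 * multiply thus occupy (and are transferred into) only two rows. */
void smooth_collapsed(double out[N], double in[N], int steps)
{
    double buf[2][N];
    for (size_t j = 0; j < N; j++)
        buf[0][j] = in[j];
    for (int i = 1; i <= steps; i++) {
        double *prev = buf[(i - 1) % 2];  /* values of step i-1 */
        double *cur  = buf[i % 2];        /* overwrites step i-2 */
        cur[0] = prev[0];
        cur[N - 1] = prev[N - 1];
        for (size_t j = 1; j < N - 1; j++)
            cur[j] = 0.5 * prev[j] + 0.25 * (prev[j - 1] + prev[j + 1]);
    }
    for (size_t j = 0; j < N; j++)
        out[j] = buf[steps % 2][j];
}
```

The collapse is valid here because no value is read after the row holding it has been overwritten, which is exactly the condition the cited techniques check.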
The most general of these techniques is that of F. Quilleré et al. (cited above), which considers collapsing in multiple directions simultaneously, whereas M. M. Strout et al. (cited above) and W. Thies et al. (cited above) only consider collapsing into a single direction and thus only by one dimension.
A different approach is that suggested in A. W. Lim, S.-W. Liao, M. S. Lam, Blocking and Array Contraction across Arbitrarily Nested Loops using Affine Partitioning, Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 103-112, 2001; this approach is, however, restricted in the choice of directions into which an array can be collapsed.
However, the named techniques apply to the problem of reducing temporary memory usage arising in the model of computation defined by affine recurrence equations, which is different from the problem of establishing locality for data transfers.
The purpose of optimising locality is not to minimize memory usage, but to minimize the negative effect of input and output on performance, i.e. to minimize the amount of data transfers between shared and local memory, since this costs run time.