In many modern high-performance computer systems, increased bandwidth to memory and I/O devices is obtained by the use of multiple interleaved devices. Interleaving permits many accesses to proceed at about the same time. Consider n = 2^d devices, D_0, D_1, . . . , D_(n-1). Using interleaving, the contents of address m are stored in D_q, where q = mod(m, n). By interleaving in this fashion, up to n references can be satisfied at the same time, particularly if those references are to nearby memory addresses. This is of substantial benefit in highly parallel shared memory systems if many processors are simultaneously working on consecutive addresses. Problems occur if the addresses are not consecutive but occur with a stride t such that t and n have a common factor, i.e., gcd(t, n) > 1. Consider, for example, the sequence of addresses of stride kn (where k ≥ 1 and k ∈ I) given by a, a+kn, a+2kn, a+3kn, . . . , a+(n-1)kn, for some starting address a. If the interleaving above is used, all of these references will be addressed to the same device D_mod(a,n). Such stride accesses occur frequently in application programs, for example in accesses to rows or columns of arrays. The performance impairment that results from such stride accesses becomes worse with very large numbers of processors, and can be a major source of serialization in such hardware.
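The stride problem described above can be illustrated with a minimal sketch. The function names `device_for` and `devices_touched` are hypothetical and chosen only for this illustration; the sketch assumes the simple mapping q = mod(m, n) and counts how many distinct devices a run of n strided references reaches.

```python
from math import gcd

def device_for(address: int, n: int) -> int:
    """Simple interleaving: address m is held by device q = m mod n."""
    return address % n

def devices_touched(start: int, stride: int, count: int, n: int) -> set:
    """Set of device indices hit by `count` references beginning at
    `start` and separated by `stride` (hypothetical helper)."""
    return {device_for(start + i * stride, n) for i in range(count)}

n = 8   # n = 2^d devices with d = 3 (illustrative value)
a = 5   # arbitrary starting address (illustrative value)
for t in (1, 2, 3, 4, 2 * n):
    used = devices_touched(a, t, n, n)
    # Report how many of the n devices share the n references.
    print(f"stride {t}: {len(used)} device(s) used, gcd(t, n) = {gcd(t, n)}")
```

Under this mapping, n references of stride t reach n/gcd(t, n) distinct devices, so a stride of kn sends every reference to the single device holding the starting address.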
In the following discussion references are made to prior art publications via brackets [] in the conventional manner. A list of these publications immediately follows this section.
Nonuniformity of memory access is a serious problem in highly parallel systems, because the resulting memory "hot spots" can cause "tree-blockage" [1]: network as well as memory contention can limit the performance of the entire system to a rate determined by the device in contention. Such systems are particularly vulnerable to power-of-two stride access contention, because these references are usually interleaved among the devices and routed through the interconnection network by fields in the binary representation of their physical addresses.
In an SIMD parallel system, such as the ILLIAC IV [2], memory access conflicts can cause all processors to wait for the last memory access in a parallel operation. For that reason, much effort has been devoted to schemes for eliminating or reducing contention associated with stride access.
Memory organizations that allow conflict-free access to any row, column, forward diagonal, and backward diagonal of an application's matrix array have been explored for the ILLIAC IV [2], the STARAN [3], and the BSP [4] computers. In most of these papers, arrays are accessed in a deterministic, conflict-free manner for a synchronized SIMD machine.
Budnik and Kuck [2] and Lawrie and Vora [4] proposed hardware and software solutions that require a prime number of memory modules. Lawrie [6] proposed a system with M memory modules, where M=2N and N is the number of processing nodes. All of these solutions are intended to cause M and the access stride to be relatively prime. Batcher [3] and Frailong, Jalby, and Lenfant [7] used skewing schemes that perform XOR operations on the indices of an array to map its elements to individual memory units. Wijshoff and Leeuwen [8] and Shapiro [9] investigated the mathematical and theoretical limitations of these skewing schemes.
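The two ideas mentioned above, a module count relatively prime to the stride and XOR-based skewing, can be sketched as follows. This is an illustrative sketch only and does not reproduce the specific schemes of the cited publications; the function names `distinct_modules` and `xor_skew_module` and the particular XOR mapping are assumptions made for the example.

```python
def distinct_modules(stride: int, count: int, modules: int) -> int:
    """Number of distinct modules reached by `count` references of the
    given stride when address m maps to module m mod `modules`."""
    return len({(i * stride) % modules for i in range(count)})

def xor_skew_module(row: int, col: int, modules: int) -> int:
    """Assumed skewing rule for illustration: element (row, col) is
    placed in module (row XOR col) mod `modules`."""
    return (row ^ col) % modules

# A stride of 16 collides on 16 modules but spreads over 17 (a prime).
print(distinct_modules(16, 16, 16), distinct_modules(16, 16, 17))

# With 8 modules and an 8 x 8 array, each row and each column of the
# XOR-skewed layout reaches all 8 modules.
size = 8
row_ok = all(len({xor_skew_module(r, c, size) for c in range(size)}) == size
             for r in range(size))
col_ok = all(len({xor_skew_module(r, c, size) for r in range(size)}) == size
             for c in range(size))
print(row_ok, col_ok)
```

In this simplified XOR mapping the main diagonal still collides, since row XOR row is zero for every row, which hints at the kind of limitation that such skewing schemes face.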
Alignment networks were further studied by Lawrie [6] to provide an alternative solution, based on Stone's [10] shuffle-exchange operation, to that of building expensive N×M crossbar switches for the access and storage of properly aligned data. Others, such as Lenfant [11], designed matrices of control patterns for an interconnection network that allows the dynamic permutation of data.
There are several major drawbacks to these schemes. Since they were primarily designed for special purposes and have a built-in dependence on array size and the number of memory modules, they are not suitable for general purpose computing environments that must satisfy more varied constraints. In addition, some of these designs require expensive and complicated addressing and alignment hardware for modulo operations and integer division. Finally, under-utilization of memory can result from "holes" in the address space created by these methods.