High performance computers require both fast CPUs and high bandwidth storage sub-systems. A high performance CPU design is achieved by using a combination of device technology and a highly pipelined machine organization. Better device technology permits higher circuit density and shorter gate propagation delays. `Pipelining` allows an architect to partition machine functionality into a set of short, sequential activities. By keeping each activity small, the architect can minimize the number of gates required for an action and thus keep the fundamental machine cycle time short.
Continual reductions in machine cycle times have aggravated the disparity between CPU speed and the bandwidth capability of the storage sub-system. While some aspects of operand referencing can be pipelined (e.g. delivering addresses to storage or retrieving values from storage), the actual reference to a storage cell is not pipelined. Because each storage cell request requires several machine cycles to complete, storage sub-systems generally run far slower than the CPU.
There are two common methods that are employed to increase the bandwidth capability of the storage subsystem: caches and interleaved memory banks. A cache is a small, high speed memory positioned close to the CPU. The small size of the cache permits operand referencing in a single machine cycle. A machine's cache structure can be hierarchically organized, where more distant caches are larger but slower. Hierarchical memory organization provides a tradeoff between low latency and large storage capability. A cache can be of limited utility for programs iterating over very large data structures (as might be encountered during vector processing). If an operand is used only once (or the times between uses are far apart), loading it through a cache will unnecessarily displace another operand which may be subsequently used; this is referred to in the art as cache pollution. Clearly this has a negative effect upon performance.
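By way of illustration only, the cache pollution effect described above can be sketched with a small least-recently-used (LRU) cache model. The LRU policy, the function names `access` and `hot_operand_hits`, and the address layout are assumptions made for this sketch and are not taken from any particular machine.

```python
from collections import OrderedDict


def access(cache, addr, capacity):
    """Reference addr through an LRU cache; return True on a hit."""
    hit = addr in cache
    if hit:
        cache.move_to_end(addr)        # mark as most recently used
    else:
        cache[addr] = True
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used line
    return hit


def hot_operand_hits(capacity, hot_count, stream_per_round, rounds, cache_stream):
    """Count hits on a small set of reused operands.

    Each round references hot_count reused operands and then
    stream_per_round operands that are used only once. When cache_stream
    is False, the single-use operands bypass the cache (an assumed
    alternative path to storage).
    """
    cache = OrderedDict()
    hits = 0
    stream_addr = 1_000_000  # hypothetical base address of the streamed data
    for _ in range(rounds):
        for addr in range(hot_count):
            hits += access(cache, addr, capacity)
        for _ in range(stream_per_round):
            if cache_stream:
                access(cache, stream_addr, capacity)
            stream_addr += 1
    return hits
```

In this model, with an 8-line cache, 4 reused operands, 8 single-use operands per round, and 10 rounds, the reused operands hit 36 times when the stream bypasses the cache and never hit when the stream is loaded through it.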
A second implementation strategy for increasing a storage system's bandwidth is to partition the address space into `N` independent banks, each of which can be accessed concurrently. This has the potential for a speedup on the order of:
$$\min(\text{bankCycleTime}, \text{bankCount})$$
where the bank cycle time is expressed as an integral multiple of the CPU cycle time. This speedup can generally be achieved only in special cases where the stride of an operand reference stream is coprime with the interleave factor of the storage system. All other instances cause delays in referencing due to bank conflicts. A bank conflict (or hazard) occurs when two operand references address the same memory bank within the bank cycle time of the storage system. When this happens, the second reference must be delayed until the first request has completed.
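The effect of stride on bank conflicts can be sketched with a simplified timing model. The model is an assumption made for illustration: one reference may issue per CPU cycle, the bank is selected by the address modulo the bank count, and a bank remains busy for the full bank cycle time after accepting a reference. The function name `cycles_for_stream` is hypothetical.

```python
def cycles_for_stream(stride, n_refs, bank_count, bank_cycle_time):
    """Count CPU cycles needed to issue n_refs references at a constant stride.

    Assumed model: at most one reference issues per CPU cycle, the bank is
    (address mod bank_count), and a bank is busy for bank_cycle_time cycles
    after it accepts a reference.
    """
    bank_free_at = [0] * bank_count  # first cycle at which each bank is idle
    cycle = 0
    for i in range(n_refs):
        bank = (i * stride) % bank_count
        # A bank conflict delays the reference until the bank is idle again.
        issue = max(cycle, bank_free_at[bank])
        bank_free_at[bank] = issue + bank_cycle_time
        cycle = issue + 1
    return cycle
```

Under these assumptions, with 8 banks and a bank cycle time of 4 CPU cycles, 64 references at stride 1 or stride 3 (both coprime with 8) issue in 64 cycles, whereas stride 8 sends every reference to the same bank and takes 253 cycles, roughly the factor of 4 given by the expression above.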
It would at first appear desirable to maximize the interleave factor. However, hardware implementation problems prohibit this approach. While a high interleave factor might provide better throughput, the engineering problems for implementation also increase. Consider, for example, the fan-in and fan-out of addresses and data from the CPU to the banks. Doubling the interleave factor also doubles the number of wires required for interconnect. With 32 (or more) address bits and 64 data bits, board interconnection rapidly becomes unmanageable.
Substantial effort has been devoted in the past to interleaved memory organizations incorporating bank bypass circuitry and bank remapping schemes. An early high performance machine which employed an interleaved bank system was the IBM 360/91. As a complement to the scoreboarding system in the CPU, operand references were assigned target register numbers which accompanied them through their traversal of the storage system.
The storage system for the IBM 360/91 provided a mechanism for short-circuit pass-through of FLOW and INPUT operand dependencies. When the storage system dynamically detected these dependency relationships, it would write multiple copies of the data value into different registers in the CPU. The target register for the dependent operand would be updated without requiring a read from the storage cell.
Some of the recent work associated with interleaved bank implementations has concentrated upon deriving methods for remapping address streams into pseudo-random bank reference streams. In these schemes, bijective functions which permute the address space are used to remap operand addresses. When applied to constant striding array references, the result is a nearly uniform distribution of bank references. The work presented in this paper complements these pseudo-random memory referencing strategies.
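As a minimal sketch of such a remapping, the following uses XOR folding of the upper address bits into the bank index. XOR folding is an assumption chosen here because it is simple and bijective; it is not necessarily the function used in the work referred to above, and all function names are hypothetical.

```python
from collections import Counter


def _fold(value, bank_bits):
    """XOR together successive bank_bits-wide fields of value."""
    mask = (1 << bank_bits) - 1
    folded = 0
    while value:
        folded ^= value & mask
        value >>= bank_bits
    return folded


def xor_fold_bank(address, bank_bits=3):
    """Map an address to (bank, offset within bank)."""
    mask = (1 << bank_bits) - 1
    offset = address >> bank_bits
    bank = (address & mask) ^ _fold(offset, bank_bits)
    return bank, offset


def xor_fold_address(bank, offset, bank_bits=3):
    """Inverse mapping, showing that the remapping is bijective."""
    low = bank ^ _fold(offset, bank_bits)
    return (offset << bank_bits) | low


def bank_histogram(stride, n_refs, bank_bits=3):
    """Count references per bank for a constant-stride address stream."""
    counts = Counter(xor_fold_bank(i * stride, bank_bits)[0] for i in range(n_refs))
    return [counts[b] for b in range(1 << bank_bits)]
```

With 8 banks, a stride of 8 would place every reference in bank 0 under plain modulo interleaving; under this remapping, 64 references at stride 8 are spread evenly at 8 references per bank.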
One strategy for reducing bank conflicts is to include a small address buffer at the front of each bank. This implementation has been employed in one simulation study of a bank remapping scheme. The use of bank buffers complicates the synchronization requirements for address and operand transfers within the storage system.
It is interesting to note the role of address decoupling in operand referencing. As shown below, decoupled access/execute (DAE) machines use an asynchronous processor to generate addresses for operand references. Data loaded from storage is buffered in a queue in the CPU. Queue references in the CPU are interleaved with other computations. The combination of these properties helps to reduce the instantaneous bandwidth demands on the machine's storage system. Note that a vector load is the degenerate case of address decoupling where the number of operands referenced is of fixed length (the size of the vector registers) and the referencing pattern has a fixed stride.
Decoupled address references can degrade performance because of pathological interactions between streams. One study showed, through simulation and empirical measurement, that performance on a range of Cray XMPs⁴ could degrade by as much as a factor of 2 because of pairwise interactions between vector loads with specific strides and specific starting bank locations. We experienced a similar effect; this is discussed hereinbelow.
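The sensitivity to stride and starting bank can be illustrated with a toy model of two concurrent reference streams. This is a sketch under stated assumptions, not a model of any Cray machine: each stream may issue at most one reference per cycle, a bank is busy for the bank cycle time after accepting a reference, and the first stream always has priority. The function name `two_stream_cycles` is hypothetical.

```python
def two_stream_cycles(start_a, stride_a, start_b, stride_b,
                      n_refs, bank_count, bank_cycle_time):
    """Count cycles until two constant-stride streams have each issued n_refs.

    Assumed arbitration: in every cycle stream A is considered before
    stream B, and a stream whose target bank is busy simply retries in
    the next cycle.
    """
    bank_free_at = [0] * bank_count
    next_ref = [0, 0]  # index of the next reference for each stream
    streams = [(start_a, stride_a), (start_b, stride_b)]
    cycle = 0
    while min(next_ref) < n_refs:
        for s, (start, stride) in enumerate(streams):
            if next_ref[s] == n_refs:
                continue  # this stream has finished
            bank = (start + next_ref[s] * stride) % bank_count
            if bank_free_at[bank] <= cycle:
                bank_free_at[bank] = cycle + bank_cycle_time
                next_ref[s] += 1
        cycle += 1
    return cycle
```

With 8 banks, a bank cycle time of 4, and 64 references per stream at stride 2, the two streams complete in 64 cycles when one starts on an even bank and the other on an odd bank, but take 128 cycles when both start on bank 0, since they then compete for the same four banks.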