1. Field of the Invention
The present invention generally relates to high performance computer memory organization and operation and, more particularly, to a new basic storage module design that provides high storage bandwidth on strides greater than one and minimizes requests to the memory.
2. Description of the Prior Art
Computer system performance is extremely dependent on the average time to access storage. For several generations of machines, cache memory systems have been used to decrease the average memory latency to an acceptable level. In cache systems, the average memory latency can be described as the cache access time multiplied by the percentage of accesses found in the cache (hits) plus the percentage of accesses not found in the cache (misses) times the "out-of-cache" access time. Due to the large discrepancy between the access times for a hit and for a miss, which is sometimes more than a factor of ten, even a small percentage of accesses being misses can result in the effects of the "out-of-cache" access time dominating the average memory latency.
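The latency relation described above can be sketched as a short calculation. The function name and the sample timings (a 1-cycle hit and a 10-cycle miss) are illustrative assumptions chosen to reflect the "factor of ten" discrepancy; they are not figures from any particular machine.

```python
def average_memory_latency(hit_fraction, cache_time, out_of_cache_time):
    """Weighted average of hit and miss access times.

    hit_fraction is the share of accesses found in the cache (0.0 to 1.0).
    """
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    miss_fraction = 1.0 - hit_fraction
    return hit_fraction * cache_time + miss_fraction * out_of_cache_time


# Hypothetical timings: 1-cycle hit, 10-cycle out-of-cache access.
for hit_fraction in (1.0, 0.95, 0.90):
    latency = average_memory_latency(hit_fraction, 1.0, 10.0)
    print(f"hit fraction {hit_fraction:.2f}: {latency:.2f} cycles on average")
```

With these assumed timings, a miss rate of only 5% raises the average latency from 1 cycle to about 1.45 cycles, which illustrates how quickly the out-of-cache time comes to dominate.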
In an effort to increase the hit percentage, many different approaches have been described which attempt to prefetch cache lines on the basis of previous hit/miss information, accessing patterns, and so forth. Since the cache is often completely transparent to the user, hardware must make prefetching predictions with no knowledge of the type of program, whether the current instructions were generated for code in a loop (which would have a bearing on whether a particular access pattern was likely to be repeated), or whether future instructions would reference data in a given cache line. As the code is being executed, it is difficult for hardware to reconstruct loops, especially iteration counts, until the loop is finished.
Still, attempts to accurately prefetch data can be profitable. Through trace driven simulation, A. J. Smith reported in "Sequential program prefetching in memory hierarchies", IEEE Computer, 11, 12 (December 1978), pp. 7-21, finding that "Prefetching all memory references in very fast computers can increase effective CPU speed by 10 to 25 percent." Smith, however, was only concerned with prefetching the line with the "next sequential (virtual) address". J. D. Gindele in "Buffer block prefetching method", IBM Technical Disclosure Bulletin, 20, 2 (July 1977), pp. 696-697, states "With prefetching, equivalent hit ratios can be attained with a cache buffer of only 1/2 to 1/4 capacity of a cache buffer without prefetching." Gindele's method worked well in cases where the next sequential cache line was the correct line to prefetch.
When successive elements are quite distant (in linear address space), sequential address prefetch not only pollutes the cache with data the processor may never reference, but also never prefetches the line which the processor will require. Almost every prefetch scheme assumes that the correct line to prefetch is simply the next sequential line. One exception is reported by J. H. Pomerene et al. in "Displacement lookahead buffer", IBM Technical Disclosure Bulletin, 22, 11 (April 1980), p. 5182.
In many scientific/engineering applications, most of the time is spent in loops. Much of the loop time is often spent in nested loops, and many nested loops make use of multi-dimensional arrays. For the internal storage representation of multi-dimensional arrays, a column-wise mapping is assumed, as is used in FORTRAN. In the case that the inner loop steps down columns, "stride-1" accesses (adjacent elements in storage) result. Most cache designs perform well in this case since, when one element is fetched into the cache, a line (or group of contiguous elements) is fetched. A miss might occur for the first access to the line, but hits are assumed for the next several accesses.
When the inner loop moves across rows, stride-N accessing occurs, where the distance between consecutively referenced addresses is N words. Generally, N is larger than the number of elements fetched in the line; therefore, unless the data remains in the cache long enough to be used on the next row (a future iteration of an outer loop), misses will probably occur for each request, degrading performance. Some numerical solution methods used in scientific and engineering programs, such as Alternating Direction Implicit, sweep the data in several directions. Without careful coding, large arrays will "flush" the cache and no reuse will occur. Each access generates a miss which in turn increases the amount of time the processor sits idle waiting for data. The amount of degradation can be diminished if the cache lines can be prefetched so that the line fetch can be overlapped with other calculations in the loop.
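The difference between stepping down a column and moving across a row can be illustrated with a small address calculation. This is a minimal sketch: it assumes the column-wise mapping described above, 64-bit DW elements, and a 128-byte line, and the array shape (64 by 64) and helper names are hypothetical.

```python
LINE_BYTES = 128  # assumed cache line size
DW_BYTES = 8      # one double word (DW) = 64 bits


def element_address(i, j, rows, base=0):
    """Byte address of element (i, j), zero-based, in a column-major array."""
    return base + (j * rows + i) * DW_BYTES


def lines_touched(addresses):
    """Number of distinct cache lines covered by a sequence of byte addresses."""
    return len({addr // LINE_BYTES for addr in addresses})


rows, cols = 64, 64  # hypothetical array shape

# Inner loop steps down column 0: stride-1 accesses.
down_column = [element_address(i, 0, rows) for i in range(rows)]
# Inner loop moves across row 0: stride-N accesses with N = rows.
across_row = [element_address(0, j, rows) for j in range(cols)]

print("lines touched stepping down a column:", lines_touched(down_column))
print("lines touched moving across a row:  ", lines_touched(across_row))
```

Under these assumptions, the 64 column accesses share a handful of lines, while each of the 64 row accesses lands in a different line, so every one of them is a potential miss.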
Stride two is particularly important (after stride one) due to complex number representation using two contiguous DWs (real and imaginary components) in scientific applications. However, while the "stride" is important for scientific applications, this invention is aimed at solving a problem which is characterized by storage referencing patterns rather than computational attributes. For example, other potential candidates which might benefit from this invention include portions of applications in the areas of database and payroll processing which access a given field in each of a set of fixed-length records. These would result in accesses with a stride which is the same as the record length.
High performance computer systems frequently involve the use of multiple central processing units (CPUs), each operating independently, but occasionally communicating with one another or with basic storage modules (BSMs) which comprise the main memory when data needs to be exchanged. A storage control element (SCE) which operates a switching system, such as a crossbar switch, is used to interconnect CPUs and BSMs. This type of system is illustrated in FIG. 1 which shows a large number of CPUs 10.sub.0 to 10.sub.M, each operating independently and in parallel with each other. Each of the CPUs 10.sub.0 to 10.sub.M occasionally requires access to one of several BSMs 12.sub.0 to 12.sub.N. Note that the number of BSMs is not necessarily the same as the number of CPUs (i.e., N.noteq.M). Typically, N>M. Each CPU has an input/output (I/O) path 14, and each memory device has an I/O path 16. The paths 14 and 16 can be buses and may be duplicated to provide full-duplex communication. Selective connection of an I/O path 14 to an I/O path 16 is performed by the SCE and switch 18.
In high performance computer systems of the type shown in FIG. 1, each CPU includes a cache complex (CP) which communicates to the BSMs via the SCE. A typical CPU is shown in FIG. 2 and comprises both scalar execution elements (SXE) 20 and a vector execution element (VXE) 22. The SXE 20 and the VXE 22 are both controlled by the instruction decode element 24. Both make memory requests to the SCE 18 (shown in FIG. 1). In the case of the SXE 20, these requests are made via the data cache 26 and the data steering logic 28; however, in the case of the VXE 22, the requests bypass the cache and are routed to the memory directly by the steering logic 28. Although the VXE 22 can process two DW fetches per cycle, it can only generate one address per cycle. Therefore, unless the memory subsystem can fetch two DWs per cycle based on the original address plus the stride value, the VXE will operate at only 50% efficiency.
For such a high performance computer system, what is needed is a memory system that solves two major requirements:
(a) The memory system is required to do data cache line fetches as a single operation. The CPU SXE 20 executes through the data cache 26 (store through) with a fixed line size 27 of 128 bytes (i.e., sixteen double words (DWs) where a DW is 64 bits). In addition, since the cache 26 is store through, the memory system must also accommodate DW stores, but not line stores.
(b) The VXE 22 does not use the data cache 26 but can process two DWs per cycle (at any "stride") as one operation. The ideal memory design, then, would be one that could do cache line fetches with only one fetch request at a data rate of one quadword (QW), or 128 bits, per cycle, could do fetches or stores of two stride N DWs per request, and finally could do random DW scalar stores at one per cycle per BSM.
Since on vector operations requests of multiple DWs of data are made and these DWs are contiguous for stride one operations, the requests impose a high storage bandwidth requirement. To satisfy this requirement, a design using multiple BSMs, each with a wide data interface to and from the SCE, could be used. This assumes that quadword (QW) zero is on BSM.sub.0, QW.sub.1 is on BSM.sub.1, etc., such that the first set of QWs is spread evenly across all the BSMs. Similarly, second and remaining sets of contiguous QWs are likewise spread equally across all the BSMs, as generally illustrated in FIG. 3.
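The QW interleaving described above can be sketched as a simple address mapping. This is an illustration only: the number of BSMs (eight here) and the function name are assumptions, and the text does not state how a real SCE derives the BSM selection from the address.

```python
QW_BYTES = 16  # one quadword (QW) = 128 bits


def bsm_for_address(byte_address, num_bsms):
    """Map a byte address to (BSM index, QW slot within that BSM).

    Consecutive QWs rotate across the BSMs, so QW 0 is on BSM 0,
    QW 1 is on BSM 1, and so on, wrapping after num_bsms QWs.
    """
    qw_index = byte_address // QW_BYTES
    return qw_index % num_bsms, qw_index // num_bsms


NUM_BSMS = 8  # hypothetical configuration

for qw in range(10):
    bsm, slot = bsm_for_address(qw * QW_BYTES, NUM_BSMS)
    print(f"QW {qw}: BSM {bsm}, slot {slot}")
```

In this sketch the first set of QWs fills slot 0 of every BSM before the second set starts at slot 1, which matches the even spreading of contiguous QWs across the modules.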
The rationale for this type of design is to provide a very high storage bandwidth for stride one operations. With this design, every CP storage fetch request, for a QW, is on a QW boundary and, therefore, for stride N requests (N.noteq.1), the bandwidth is half that of stride one.
While the design shown in FIG. 3 can do stride one vector fetches at two DWs per request (the ideal case), it can do requests other than stride one at only one DW per request. In addition, cache line fetches cannot be done as a single BSM request but have to be broken up into eight requests all going to separate BSMs. The resulting additional complexity in the cache 26 (to generate the eight requests and resequence the returning data that is potentially out of sequential order) presents an unacceptable design and performance hit for scalar requests.
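The request counts quoted above can be reproduced with a small model. This is a simplified sketch, not a description of the actual hardware: it assumes that each request returns exactly one QW-aligned quadword, and the helper names are hypothetical.

```python
QW_BYTES = 16     # one quadword (QW) = 128 bits
DW_BYTES = 8      # one double word (DW) = 64 bits
LINE_BYTES = 128  # cache line size


def strided_dws(start, stride_dws, count):
    """Byte addresses of `count` DWs, `stride_dws` DWs apart, from `start`."""
    return [start + k * stride_dws * DW_BYTES for k in range(count)]


def qw_requests(dw_addresses):
    """Requests needed when each request returns one QW-aligned quadword."""
    return len({addr // QW_BYTES for addr in dw_addresses})


dws_per_line = LINE_BYTES // DW_BYTES  # sixteen DWs in a line

for stride in (1, 2, 5):
    addresses = strided_dws(0, stride, dws_per_line)
    requests = qw_requests(addresses)
    print(f"stride {stride}: {len(addresses)} DWs in {requests} requests")
```

Under this model a QW-aligned line fetch (stride one, sixteen DWs) takes eight requests at two DWs each, while any stride of two or more places every DW in a different QW and so yields only one useful DW per request.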