1. Field of the Invention
The present invention generally relates to data processing systems employing cache memories to improve the performance of the central processing unit (CPU) and, more particularly, to the use of "stride register(s)" to assist in prefetching data, especially for program loops.
2. Description of the Prior Art
Computer system performance depends heavily on the average time to access storage. For several generations of machines, cache memory systems have been used to reduce the average memory latency to an acceptable level. In cache systems, the average memory latency can be described as the cache access time multiplied by the fraction of accesses found in the cache (hits), plus the "out-of-cache" access time multiplied by the fraction of accesses not found in the cache (misses). Because the access times for a hit and for a miss differ greatly, sometimes by more than a factor of ten, even a small percentage of misses can cause the "out-of-cache" access time to dominate the average memory latency. Increasing the cache hit ratio from 97% to 99% can yield a substantial performance improvement, on the order of 20% to 40%. In an effort to increase the hit percentage, many different approaches have been described which attempt to prefetch cache lines on the basis of previous hit/miss information, accessing patterns, and so forth.
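The latency relationship described above can be illustrated with a short calculation. The following is only a sketch: the function name and the timings (1 cycle for a hit, 20 cycles for a miss) are assumptions chosen for illustration and are not taken from any particular machine.

```python
def average_latency(hit_ratio, hit_time, miss_time):
    """Average access time: hits at cache speed plus misses at out-of-cache speed."""
    return hit_ratio * hit_time + (1.0 - hit_ratio) * miss_time

# Assumed timings (hypothetical): 1 cycle for a hit, 20 cycles for a miss.
HIT_TIME = 1.0
MISS_TIME = 20.0

before = average_latency(0.97, HIT_TIME, MISS_TIME)  # 97% hit ratio
after = average_latency(0.99, HIT_TIME, MISS_TIME)   # 99% hit ratio
ratio = before / after                               # relative latency reduction

print(f"97% hits: {before:.2f} cycles, 99% hits: {after:.2f} cycles, ratio {ratio:.2f}")
```

Under these assumed timings the average latency falls from about 1.57 to about 1.19 cycles. This is a ratio of memory latencies only; the effect on overall performance depends on how much of the execution time is spent waiting on storage.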
Since the cache is often completely transparent to the user, hardware must make prefetching predictions with no knowledge of the type of program, whether the current instructions were generated for code in a loop (which would have a bearing on whether a particular access pattern was likely to be repeated), or whether future instructions would reference data in a given cache line. As the code is being executed, it is difficult for hardware to reconstruct loops, especially iteration counts, until the loop is finished.
Still, attempts to accurately prefetch data can be profitable. Through trace driven simulation, A. J. Smith reported in "Sequential program prefetching in memory hierarchies", IEEE Computer, 11, 12 (December 1978), pp. 7-21, finding that "Prefetching all memory references in very fast computers can increase effective CPU speed by 10 to 25 percent." Smith, however, was only concerned with prefetching the line with the "next sequential (virtual) address". J. D. Gindele in "Buffer block prefetching method", IBM Tech. Disclosure Bull., 20, 2 (July 1977), pp. 696-697, states "With prefetching, equivalent hit ratios can be attained with a cache buffer of only 1/2 to 1/4 capacity of a cache buffer without prefetching." Gindele's method worked well in cases where the next sequential cache line was the correct line to prefetch. When successive elements are quite distant (in linear address space), sequential address prefetch not only pollutes the cache with data the processor may never reference, but also never prefetches the line which the processor will actually require. Almost every prefetch scheme assumes that the correct line to prefetch is simply the next sequential line. One exception is reported by J. H. Pomerene et al. in "Displacement lookahead buffer", IBM Tech. Disclosure Bull., 22, 11 (April 1980), p. 5182.
In many scientific/engineering applications, most of the time is spent in loops. Much of the loop time is often spent in nested loops, and many nested loops make use of multi-dimensional arrays. For the internal storage representation of multi-dimensional arrays, a column-wise mapping is assumed, as is used in FORTRAN. When the inner loop steps down columns, "stride-1" accesses (to adjacent elements in storage) result. Most cache designs perform well in this case because, when one element is fetched into the cache, a whole line (a group of contiguous elements) is fetched. A miss might occur for the first access to the line, but hits are expected for the next several accesses.
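The stride-1 case can be sketched as follows. The array shape, the line size of 8 words, and the helper name are hypothetical and serve only to illustrate the column-wise mapping; indices are 0-based here, unlike FORTRAN.

```python
def column_major_offset(row, col, num_rows):
    """Word offset of element (row, col), 0-based, under column-wise storage."""
    return col * num_rows + row

NUM_ROWS = 64    # hypothetical number of rows in the array
LINE_WORDS = 8   # hypothetical cache line size, in words

# Inner loop steps down column 2: consecutive rows, fixed column.
offsets = [column_major_offset(r, 2, NUM_ROWS) for r in range(16)]

# Distance between consecutively referenced elements (the stride).
strides = [b - a for a, b in zip(offsets, offsets[1:])]

# Each distinct line costs at most one miss; the rest of its words are hits.
lines_touched = sorted({off // LINE_WORDS for off in offsets})

print(strides)        # every stride is 1
print(lines_touched)  # 16 accesses fall in only 2 lines
```

With these assumed sizes, sixteen consecutive accesses touch only two cache lines, so at most two of them can miss.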
When the inner loop moves across rows, stride-N accessing occurs, where the distance between consecutively referenced addresses is N words. Generally, N is larger than the number of elements fetched in a line; therefore, unless the data remains in the cache long enough to be used on the next row (a future iteration of an outer loop), a miss will probably occur for each request, degrading performance. Some numerical solution methods used in scientific and engineering programs, such as Alternating Direction Implicit (ADI), sweep the data in several directions. Without careful coding, large arrays will "flush" the cache and no reuse will occur. Each access generates a miss, which in turn increases the amount of time the processor sits idle waiting for data. The amount of degradation can be diminished if the cache lines can be prefetched so that the line fetch can be overlapped with other calculations in the loop.
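The difference between the two traversal orders can be illustrated with a small simulation. This is a sketch under stated assumptions: a fully associative cache with least-recently-used replacement, 32 lines of 8 words each, and a 64 by 64 array stored column-wise. None of these parameters comes from the text, and real caches differ in organization.

```python
from collections import OrderedDict

def count_misses(offsets, line_words, cache_lines):
    """Count misses for a fully associative LRU cache (an assumed organization)."""
    cache = OrderedDict()              # keys are line numbers, oldest first
    misses = 0
    for off in offsets:
        line = off // line_words
        if line in cache:
            cache.move_to_end(line)    # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

ROWS, COLS = 64, 64   # hypothetical array shape, column-wise storage
LINE_WORDS = 8        # hypothetical line size in words
CACHE_LINES = 32      # hypothetical cache capacity in lines

# Inner loop steps down columns: stride-1 accesses.
down_columns = [c * ROWS + r for c in range(COLS) for r in range(ROWS)]
# Inner loop moves across rows: stride-N accesses with N = ROWS words.
across_rows = [c * ROWS + r for r in range(ROWS) for c in range(COLS)]

stride_1_misses = count_misses(down_columns, LINE_WORDS, CACHE_LINES)
stride_n_misses = count_misses(across_rows, LINE_WORDS, CACHE_LINES)
print(stride_1_misses, stride_n_misses)
```

In this configuration each row sweep touches 64 distinct lines, more than the assumed cache can hold, so every line is evicted before the next row can reuse it and every stride-N access misses, whereas the stride-1 traversal misses only once per line.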
While the term "stride" is described above in terms of scientific applications, this invention is aimed at solving a problem which is characterized by storage referencing patterns rather than computational attributes. For example, other potential candidates which might benefit from this invention include portions of database and payroll processing applications which access a given field in each of a set of fixed-length records. Such accesses have a stride equal to the record length.
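The record-access case can be sketched in the same way. The record layout below (a 4-byte identifier, a 20-byte name, and an 8-byte salary) is purely hypothetical and is packed without padding for simplicity.

```python
import struct

# Hypothetical fixed-length payroll record: 4-byte id, 20-byte name, 8-byte salary.
RECORD = struct.Struct("<i20sd")   # "<" selects packed layout, no padding
SALARY_OFFSET = 4 + 20             # byte offset of the salary field in a record

def field_addresses(base, count):
    """Byte addresses of the salary field in `count` consecutive records."""
    return [base + i * RECORD.size + SALARY_OFFSET for i in range(count)]

addresses = field_addresses(0, 4)
record_strides = [b - a for a, b in zip(addresses, addresses[1:])]

print(RECORD.size)      # record length in bytes
print(record_strides)   # each stride equals the record length
```

Reading the same field from successive records therefore produces a constant stride equal to the record length, here 32 bytes under the assumed layout.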