1. Field of the Invention
The instant invention relates to hardware-assisted access ordering to increase memory system performance for commercially available high-performance processors.
2. Brief Description of the Prior Art
Processor speeds are increasing much faster than memory speeds. Microprocessor performance has increased by 50% to 100% per year in the last decade, while DRAM performance has risen only 10-15% per year. Memory bandwidth is, therefore, rapidly becoming the performance bottleneck in the application of high-performance microprocessors to vector-like algorithms, including many of the "grand challenge" scientific problems. Currently, it may take as much as 50 times as long to access a memory element as to perform an arithmetic operation on that element once it has been accessed. Alleviating the growing disparity between processor and memory speeds is the subject of much current research.
Prior art has centered on a mechanism called a "cache", which automatically stores the most frequently used data in a higher-speed, smaller, and much more costly memory. The success of cache technology hinges on a property called "locality", which is the tendency for a program to repeatedly access data that is "close". Assuming locality, a cache can reasonably predict future memory accesses based on recent past references.
Although the addition of cache memory is often a sufficient solution to the memory latency and bandwidth problems in general purpose scalar computing, the vectors used in scientific computations are normally too large to cache, and many are not reused soon enough to benefit from caching. Furthermore, vectors leave large footprints in the cache. For computations in which vectors are reused, iteration space tiling can partition the problems into cache-size blocks, but this can create cache conflicts for some block sizes and vector strides, and the technique is difficult to automate. Caching non-unit stride vectors leaves even larger footprints, and may actually reduce a computation's effective memory bandwidth by fetching extraneous data. ". . . while data caches have been demonstrated to be effective for general-purpose applications . . . , their effectiveness for numerical code has not been established". Lam, Monica, et al., "The Cache Performance and Optimizations of Blocked Algorithms", Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.
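By way of illustration only, the effect of non-unit stride on cache traffic can be estimated with a short sketch. The function name useful_fraction, the 8-byte element size, and the 32-byte line size are assumptions chosen for the example and are not taken from any particular machine; elements are assumed to be aligned so that none straddles a line boundary.

```python
def useful_fraction(n_elems, stride, elem_bytes=8, line_bytes=32, base=0):
    """Fraction of the bytes fetched into the cache that the computation uses.

    Hypothetical model: every touched cache line is fetched in full, and
    elements are aligned so that no element straddles a line boundary.
    """
    lines = set()
    for i in range(n_elems):
        addr = base + i * stride * elem_bytes
        lines.add(addr // line_bytes)      # cache line holding this element
    fetched = len(lines) * line_bytes      # bytes moved from memory
    used = n_elems * elem_bytes            # bytes the computation needs
    return used / fetched


for stride in (1, 2, 4, 8):
    print(stride, useful_fraction(1024, stride))
# 1 1.0
# 2 0.5
# 4 0.25
# 8 0.25
```

Under these assumptions, a stride of four or more elements causes three quarters of the fetched data to be extraneous, which is the bandwidth loss referred to above.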
Software techniques such as reordering and "vectorization" via library routines can improve bandwidth by reordering requests at compile time. Such techniques cannot exploit run-time information and are limited by processor register resources.
The traditional scalar processor concern has been to minimize memory latency in order to maximize processor performance. For scientific applications, however, the processor is not the bottleneck. Bridging this performance gap requires changing the approach to the problem and concentrating on minimizing average latency over a coherent set of accesses in order to maximize the bandwidth for scientific applications.
While many scientific computations are limited by memory bandwidth, they are by no means the only such computations. Any computation involving linear traversals of vector-like data, where each element is typically visited only once during lengthy portions of the computation, can suffer. Examples of this include string processing, image processing and other DSP applications, some database queries, some graphics applications, and DNA sequence matching.
The assumptions made by most memory architectures simply don't match the physical characteristics of the devices used to build them. Memory components are usually assumed to require about the same amount of time to access any random location; indeed, it was this uniform access time that gave rise to the term RAM, or Random Access Memory. Many computer architecture textbooks specifically cultivate this view. Others skirt the issue entirely.
Somewhat ironically, this assumption no longer applies to modern memory devices as most components manufactured in the last ten to fifteen years provide special capabilities that make it possible to perform some access sequences faster than others. For instance, nearly all current DRAMs implement a form of page-mode operation. These devices behave as if implemented with a single on-chip cache line, or page (this should not be confused with a virtual memory page). A memory access falling outside the address range of the current DRAM page forces a new page to be accessed. The overhead time required to set up the new page makes servicing such an access significantly slower than one that hits the current page.
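The page-mode behavior described above may be illustrated by a minimal cost model. This is a sketch under stated assumptions, not a description of any specific device: the function name page_mode_cost, the 512-word page size, and the cycle counts for page hits and page misses are all hypothetical.

```python
def page_mode_cost(addresses, page_words=512, hit_cycles=1, miss_cycles=4):
    """Total cycles to service word addresses on a device with one open page.

    Hypothetical timings: an access to the currently open page costs
    hit_cycles; any other access costs miss_cycles and opens a new page.
    """
    open_page = None
    total = 0
    for addr in addresses:
        page = addr // page_words
        if page == open_page:
            total += hit_cycles
        else:
            total += miss_cycles
            open_page = page
    return total


sequential = list(range(1024))              # unit stride: stays within a page
strided = [i * 512 for i in range(1024)]    # every access lands on a new page

print(page_mode_cost(sequential))  # 1030
print(page_mode_cost(strided))     # 4096
```

The same number of words is transferred in both cases; only the order and placement of the requests differ.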
Other common devices offer similar features, such as nibble-mode, static column mode, or a small amount of SRAM cache on chip. This sensitivity to the order of requests is exacerbated in emerging technologies. For instance, Rambus, Ramlink, and the new DRAM designs with high-speed sequential interfaces provide high bandwidth for large transfers, but offer little performance benefit for single-word accesses.
For multiple-module memory systems, the order of requests is important on yet another level: successive accesses to the same memory bank cannot be performed as quickly as accesses to different banks. To get the best performance out of such a system, advantage must be taken of the architecture's available concurrency.
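The dependence on bank ordering may be illustrated with a simple model of a word-interleaved memory. The function name interleaved_cycles, the number of banks, and the bank busy time are assumptions made for the example; requests are assumed to issue in order, at most one per cycle, and to stall when the target bank is still busy.

```python
def interleaved_cycles(addresses, num_banks=4, busy_cycles=4):
    """Cycles needed to complete in-order requests to an interleaved memory.

    Hypothetical model: word address a maps to bank a % num_banks, and a
    bank stays busy for busy_cycles after it starts servicing a request.
    """
    bank_free_at = [0] * num_banks
    now = 0
    for addr in addresses:
        bank = addr % num_banks
        start = max(now, bank_free_at[bank])   # stall if the bank is busy
        bank_free_at[bank] = start + busy_cycles
        now = start + 1                        # at most one issue per cycle
    return max(bank_free_at)


print(interleaved_cycles(range(16)))         # 19: requests rotate over banks
print(interleaved_cycles(range(0, 64, 4)))   # 64: every request hits bank 0
```

In this model, sixteen requests that rotate across the banks complete in 19 cycles, whereas sixteen requests to a single bank are fully serialized and take 64 cycles.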
Most computers already have memory systems whose peak bandwidth is matched to the peak processor bus rate. But the nature of an algorithm, its data sizes, and placement all strongly affect memory performance. An example of this is in the optimization of numerical libraries for the iPSC/860. On some applications, even with painstakingly handcrafted code, processor performance was limited to 20% of peak by inadequate memory bandwidth.
A comprehensive, successful solution to the memory bandwidth problem must therefore exploit the richness of the full memory hierarchy, both its architecture and its component characteristics. One way to do this is via access ordering, which herein is defined as any technique for changing the order of memory requests to increase bandwidth. The instant disclosure is especially concerned with ordering a set of vector-like "stream" accesses.
There are a number of other hardware and software techniques that can help manage the imbalance between processor and memory speeds. These include altering the placement of data to exploit concurrency; reordering the computation to increase locality, as in "blocking"; address transformations for conflict-free access to interleaved memory; software prefetching of data to the cache; and hardware prefetching of vector data to cache.
Memory performance is determined by the interaction of its architecture and the order of requests. Prior attempts to optimize bandwidth have focused on the placement of data as a way of affecting the order of requests. Some architectures include instructions to prefetch data from main memory into cache, referred to as software prefetching. Using these instructions to load data for a future iteration of a loop can improve processor performance by overlapping memory latency with computation, but prefetching does nothing to actually improve memory performance.
Moreover, the nature of memories themselves has changed. Achieving greater bandwidth requires exploiting the characteristics of the entire memory hierarchy; it cannot be treated as though it were uniform access-time RAM. Furthermore, exploiting the memory's properties will have to be done dynamically--essential information (such as alignment) will generally not be available at compile time.
The difference between the foregoing prior art techniques and the instant disclosure is the reordering of stream accesses to exploit the architectural and component features that make memory systems sensitive to the sequence of requests.
Reordering can optimize accesses to exploit the underlying memory architecture. By combining compile-time detection of streams with execution-time selection of the access order and issue, the instant disclosure achieves near-optimal bandwidth for vector-like accesses relatively inexpensively. This complements more traditional cache-based schemes, so that overall effective memory performance need not be a bottleneck.
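The benefit of reordering stream accesses may be illustrated, purely as a software model, by counting DRAM page misses for two orderings of the same requests. This sketch is not the hardware of the instant disclosure; the function names, the base addresses, the 512-word page size, and the burst length `depth` are hypothetical values chosen for the example.

```python
def count_page_misses(addresses, page_words=512):
    """Number of accesses that fall outside the currently open DRAM page."""
    open_page, misses = None, 0
    for addr in addresses:
        page = addr // page_words
        if page != open_page:
            misses += 1
            open_page = page
    return misses


def natural_order(bases, n):
    """x[0], y[0], x[1], y[1], ... as a simple loop would issue them."""
    return [base + i for i in range(n) for base in bases]


def burst_order(bases, n, depth):
    """The same requests, grouped into bursts of up to `depth` per stream."""
    order = []
    for start in range(0, n, depth):
        for base in bases:
            order.extend(base + i for i in range(start, min(start + depth, n)))
    return order


bases = [0, 4096]   # hypothetical base word addresses of two unit-stride vectors
n = 256

print(count_page_misses(natural_order(bases, n)))     # 512
print(count_page_misses(burst_order(bases, n, 32)))   # 16
```

Both orderings issue exactly the same 512 requests. In the natural order every request changes the open page, whereas grouping the requests into bursts of 32 per stream reduces the page misses to 16 under the assumed page size.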