1. Field of the Invention
The present invention generally relates to improved data cache performance in processors and more particularly to prefetching of data into a cache under control of a prefetch instruction.
2. Description of the Related Art
Computer systems typically include cache memories to reduce the cost of providing a processor with a high speed memory subsystem. A cache memory is a high-speed memory which acts as a buffer between the processor and main memory. Although smaller than the main memory, the cache memory is usually appreciably faster than the main memory. Memory subsystem performance can be increased by storing the most commonly used data in smaller but faster cache memory (or cache memories). Because most programs exhibit temporal and spatial locality, historical access patterns are reasonably predictive of future access patterns.
When the processor accesses a memory address, the cache memory determines if the data associated with the memory address is stored in the cache memory. If the data is stored in the cache memory, a cache hit results and the data is provided to the processor from the cache memory. If the data is not in the cache memory, a cache miss results and a lower level in the memory hierarchy must be accessed. A cache memory that maintains a cached set of memory locations approximating the set of most recently accessed memory locations is a historical cache memory.
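The hit/miss decision described above can be sketched in C. This is a minimal illustrative model, not a description of any particular processor: the direct-mapped organization, the 64-byte line size, the 256-line capacity, and the names `cache_t` and `cache_access` are all assumptions introduced here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u   /* assumed cache line size in bytes */
#define NUM_LINES  256u  /* assumed number of lines (16 KiB total) */

typedef struct {
    bool     valid;  /* line holds data */
    uint64_t tag;    /* upper address bits identifying the cached line */
} cache_line_t;

typedef struct {
    cache_line_t  lines[NUM_LINES];
    unsigned long hits;
    unsigned long misses;
} cache_t;

/* Look up addr in a direct-mapped cache (hypothetical model).
 * Returns true on a cache hit. On a miss, the line is installed,
 * standing in for the fill from a lower level of the hierarchy. */
static bool cache_access(cache_t *c, uint64_t addr)
{
    assert(c != NULL);

    uint64_t line_no = addr / LINE_BYTES;   /* which memory line */
    uint64_t index   = line_no % NUM_LINES; /* which cache slot */
    uint64_t tag     = line_no / NUM_LINES; /* disambiguates lines sharing a slot */
    cache_line_t *l  = &c->lines[index];

    if (l->valid && l->tag == tag) {
        c->hits++;
        return true;
    }

    c->misses++;
    l->valid = true;  /* replace whatever was in the slot */
    l->tag   = tag;
    return false;
}
```

Because the model installs a line only after the processor has accessed it, its contents track recently accessed locations, which is the sense in which such a cache is historical.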
Data cache misses can account for a significant portion of an application program's execution time. This is particularly true of data-bandwidth-intensive scientific, graphics, or multimedia applications running on high-frequency processors with comparatively long memory latencies. With an increasing mismatch between processor and memory speeds, the penalty for cache misses has become a dominant limiting factor in system performance.
Various techniques have been employed to improve cache performance, i.e., to reduce cache miss ratios, over that provided by historical cache memory. One such technique is prefetching, i.e., fetching data (or instructions) into the cache before they are needed by the program. Prefetching involves fetching data not yet accessed by the processor from lower levels in the memory hierarchy into the cache memory, with the expectation that the processor will access the data and will make better use of the prefetched data than of the data replaced to make room for it. A cache memory which prefetches data is a predictive cache memory. By anticipating processor access patterns, prefetching helps to reduce cache miss rates.
The effectiveness of prefetching is limited by the ability of a particular prefetching method to predict addresses from which the processor will need to access data. Hardware prefetching methods typically seek to take advantage of patterns in memory accesses by observing all, or a particular subset of, memory transactions and prefetching as yet unaccessed data for anticipated memory accesses. Memory transactions observed can include read and/or write accesses or cache miss transactions.
An example of a family of prefetching methods is the stream buffer predictor family. A stream buffer predictor generally uses hardware to observe memory transactions associated with cache misses and prefetches data (typically on a cache line basis) based on the observed pattern of cache misses. For example, while the missed cache line is fetched, the stream buffer predictor will prefetch data for the next sequential cache line or set of cache lines because memory is often accessed in address order. When a processor accesses data in the stream and there is a cache miss, the cache memory will first check the stream buffer before going to a lower level cache or main memory. Another example of a family of prefetching methods is the load stride predictor family. A load stride predictor generally uses hardware to observe processor memory accesses to the cache and looks for patterns upon which to base predictions of addresses from which the processor will need data. For example, based on a memory read access by the processor and on a set of past memory accesses, a load stride predictor predicts the address(es) of additional data to be accessed by a current program sequence executing on the processor. Such a prediction triggers a prefetch of associated data, which is then available to satisfy a cache miss if the prediction was correct.
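The load stride idea can be illustrated with a short C sketch. It is a deliberately simplified, hypothetical model: it tracks a single access stream and issues a prediction once the same nonzero stride has been observed twice in a row. The names `stride_predictor_t` and `stride_observe` are assumptions made for illustration, and practical predictors typically track many streams at once, a detail omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t last_addr;    /* most recently observed address */
    int64_t  last_stride;  /* difference between the last two addresses */
    bool     have_addr;    /* at least one access observed */
    bool     have_stride;  /* at least two accesses observed */
} stride_predictor_t;

/* Observe one memory access. If the stride from the previous access
 * matches the stride seen before it, write the predicted next address
 * to *prefetch_addr and return true; otherwise return false. */
static bool stride_observe(stride_predictor_t *p, uint64_t addr,
                           uint64_t *prefetch_addr)
{
    assert(p != NULL && prefetch_addr != NULL);

    bool predict = false;
    if (p->have_addr) {
        int64_t stride = (int64_t)(addr - p->last_addr);
        if (p->have_stride && stride != 0 && stride == p->last_stride) {
            *prefetch_addr = addr + (uint64_t)stride;  /* next in pattern */
            predict = true;
        }
        p->last_stride = stride;
        p->have_stride = true;
    }
    p->last_addr = addr;
    p->have_addr = true;
    return predict;
}
```

In this sketch, a prediction is useful only if the access pattern continues at the same offset; an access that breaks the pattern resets the confirmation, so two further matching accesses are needed before the next prediction.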
Unfortunately, hardware prefetch techniques may not be particularly predictive of memory access patterns. This is particularly true of applications that manipulate large data structures for which the next data access may not be to a cache line adjacent to the one previously accessed and for which memory accesses do not stride through the memory address space at a fixed, regular offset. Large linked lists and other dynamically allocated data structures are particularly problematic for such predictive methods.
An alternative to hardware prefetching is to let a compiler generate "prefetch instructions" to request data before it is needed by the main computation. In theory, software prefetching techniques allow predictions of future memory access patterns to be informed by actual execution of the software. For example, an executing program will often have, or be able to compute, a pointer to subsequently addressed data structure elements. Compiler techniques for generating prefetch instructions are extensively detailed in Todd C. Mowry, Tolerating Latency Through Software-Controlled Data Prefetching, Ph.D. dissertation, Dept. of Electrical Engineering, Stanford University, March 1994. Mowry provides an extensive analysis of compiler generated prefetch algorithms and simulation results. See also U.S. Pat. No. 5,704,053 to Santhanam, issued Dec. 30, 1997. In addition, Mowry describes basic architectural support for prefetching including the definition of a non-binding, non-blocking, non-excepting prefetch instruction. See Mowry, Architectural Issues, pp. 121-190.
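As an illustration of a program that already holds a pointer to the data it will touch next, the following C sketch walks a linked list and issues a prefetch hint for the following node while processing the current one. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` builtin is used here as a stand-in for a compiler-generated prefetch instruction; the `node_t` type and `sum_list` function are hypothetical, and none of this is drawn from the cited references.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    long         value;
} node_t;

/* Assumption for this sketch: a node fits within one 64-byte cache line,
 * so a single prefetch hint covers the whole node. */
static_assert(sizeof(node_t) <= 64, "node assumed to fit in one cache line");

/* Sum the values in a list, hinting the next node into the cache while
 * the current node is processed. */
static long sum_list(const node_t *head)
{
    long sum = 0;
    for (const node_t *n = head; n != NULL; n = n->next) {
        /* Arguments: address, 0 = prefetch for read, 3 = high temporal
         * locality. The hint does not fault on an invalid address, so a
         * NULL next pointer at the end of the list is harmless. */
        __builtin_prefetch(n->next, 0, 3);
        sum += n->value;
    }
    return sum;
}
```

The hint is purely advisory: the result is the same whether or not the target processor acts on it. In this sketch the work per node is small, so little latency is actually hidden; the benefit grows when more computation separates the hint from the later access.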
Unfortunately, while the software prefetching literature generally assumes the existence of a non-blocking, non-faulting prefetch or load instruction, descriptions of actual processor implementations thereof are lacking. Several modern processor architectures, e.g., the PowerPC™ Microprocessor (available from Motorola and International Business Machines Corp.), the Alpha AXP™ Microprocessor (available from Digital Equipment Corporation), and the MIPS R10000 Microprocessor (available from Silicon Graphics), define non-faulting prefetch instructions.
For example, the PowerPC™ architecture defines dcbt and dcbtst instructions to act as software initiated prefetch hints. See generally Motorola Inc., PowerPC™ Microprocessor Family: The Programming Environments, pp. 4-64 to 4-68, 5-8 to 5-11, and 8-50 to 8-51, Motorola Inc., G522-0290-00, Rev. 1, January 1997. Prefetch mechanisms to support the instructions are not described.
The Alpha AXP™ architecture defines FETCH and FETCH_M instructions that are also prefetch hints and which may or may not be supported by a particular implementation. See generally, Digital Equipment Corporation, Alpha Architecture Handbook, EC-QD2KB-TE, pp. 4-138 to 4-139 and A-10 to A-11, October 1996. Again, underlying prefetch mechanisms to support the instructions are not described.
Finally, the MIPS IV architecture defines PREF and PREFX instructions that are likewise advisory and which, again, may or may not be supported by a particular implementation. See generally, MIPS Technologies, Inc., MIPS IV Instruction Set, Revision 3.2, pp. A-10 to A-11 and A-116 to A-118 (September 1995). Again, underlying prefetch mechanisms to support the instructions are not described.
While neither the processors nor their associated instruction sets are necessarily prior art with respect to the present invention, underlying prefetch mechanisms are needed to allow processors to actually implement the prefetch instructions assumed by the software prefetching literature and defined in the various instruction sets described above.