1. Field of the Invention
This invention relates to the field of computer systems and, more particularly, to prefetching mechanisms for reducing effective memory latency within computer systems.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. On the other hand, superpipelined microprocessor designs divide instruction execution into a large number of subtasks which can be performed quickly, and assign pipeline stages to each subtask. By overlapping the execution of many instructions within the pipeline, superpipelined microprocessors attempt to achieve high performance.
Superscalar microprocessors demand low memory latency due to the number of instructions attempting concurrent execution and due to the increasing clock frequency (i.e. shortening clock cycle) employed by the superscalar microprocessors. Many of the instructions include memory operations to fetch (read) and update (write) memory operands. The memory operands must be fetched from or conveyed to memory, and each instruction must originally be fetched from memory as well. Similarly, superpipelined microprocessors demand low memory latency because of the high clock frequency employed by these microprocessors and the attempt to begin execution of a new instruction each clock cycle. It is noted that a given microprocessor design may employ both superscalar and superpipelined techniques in an attempt to achieve the highest possible performance characteristics.
Microprocessors are often configured into computer systems which have a relatively large, relatively slow main memory. Typically, multiple dynamic random access memory (DRAM) modules comprise the main memory system. The large main memory provides storage for a large number of instructions and/or a large amount of data for use by the microprocessor, providing faster access to the instructions and/or data than may be achieved from disk storage, for example. However, the access times of modern DRAMs are significantly longer than the clock cycle length of modern microprocessors. The memory access time for each set of bytes being transferred to the microprocessor is therefore long. Accordingly, the main memory system is not a high bandwidth, low latency system, and microprocessor performance may suffer due to the high memory latency.
In order to allow low latency memory access (thereby increasing the instruction execution efficiency and ultimately microprocessor performance), computer systems typically employ one or more caches to store the most recently accessed data and instructions. Additionally, the microprocessor may employ caches internally. A relatively small number of clock cycles may be required to access data stored in a cache, as opposed to a relatively larger number of clock cycles required to access the main memory.
Low effective memory latency may be achieved in a computer system if the cache hit rates of the caches employed therein are high. An access is a hit in a cache if the requested data is present within the cache when the access is attempted. On the other hand, an access is a miss in a cache if the requested data is absent from the cache when the access is attempted. Cache hits are provided to the microprocessor in a small number of clock cycles, allowing subsequent accesses to occur more quickly as well and thereby decreasing the memory latency. Cache misses require the access to receive data from the main memory, thereby increasing the memory latency.
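The relationship between cache hit rate and effective memory latency described above may be illustrated with a simple weighted average. The following is a minimal sketch only. The function name effective_latency and its parameters are hypothetical, and the model assumes a single cache level in which every access either hits (costing hit_cycles) or misses (costing miss_cycles in total).

```c
#include <assert.h>

/* Hypothetical single-level model: average clock cycles per memory access.
 * hit_rate    - fraction of accesses that hit in the cache (0.0 to 1.0)
 * hit_cycles  - cycles needed to service an access that hits
 * miss_cycles - total cycles needed to service an access that misses
 *               (i.e. the access is completed from main memory) */
double effective_latency(double hit_rate, double hit_cycles, double miss_cycles)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}
```

Under these assumptions, a hit rate of 0.9 with a 2 cycle hit and a 100 cycle miss yields an average of approximately 11.8 cycles per access, which shows how even a modest miss rate may dominate the effective latency.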
In order to increase cache hit rates, computer systems may employ prefetching to "guess" which data will be requested by the microprocessor in the future. The term prefetch, as used herein, refers to transferring data (e.g. a cache line) into a cache prior to a request for the data being generated via instruction execution. A "cache line" is a contiguous block of data which is the smallest unit for which a cache allocates and deallocates storage. If the prefetched data is later accessed by the microprocessor, then the cache hit rate may be increased due to transferring the prefetched data into the cache before the data is requested.
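Because a cache allocates storage in units of cache lines, a prefetch of any byte address effectively transfers the entire line containing that address. The sketch below shows one way the containing line may be computed. The 64 byte line size and the names CACHE_LINE_BYTES, cache_line_base, and same_cache_line are assumptions chosen for illustration and are not taken from any particular design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed line size for illustration; must be a power of two. */
#define CACHE_LINE_BYTES 64u

/* Address of the first byte of the cache line containing addr. */
uint64_t cache_line_base(uint64_t addr)
{
    return addr & ~(uint64_t)(CACHE_LINE_BYTES - 1u);
}

/* True if both addresses fall within the same cache line, in which case
 * a prefetch of one address also brings the other into the cache. */
bool same_cache_line(uint64_t a, uint64_t b)
{
    return cache_line_base(a) == cache_line_base(b);
}
```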
Unfortunately, prefetch algorithms employed by microprocessors are generally very simple algorithms which observe the pattern of memory accesses during execution of a program and attempt to prefetch addresses during that execution based on the observed pattern. For example, stride-based prefetch algorithms have been employed in which the difference between the addresses of consecutive memory accesses (the "stride") is calculated and used to generate prefetch addresses. These simple prefetch algorithms may not handle a large portion of the memory access patterns which may be exhibited by programs. In particular, data memory access patterns may not be handled well by simple prefetch algorithms. Generally, only data memory access patterns having a highly regular pattern which can be characterized by one or a small number of values (e.g. strides) are prefetched accurately, and other patterns exhibit varying degrees of prefetch inaccuracy. Inaccurate prefetching consumes memory bandwidth which may be needed by other memory operations. Inaccurate prefetching may also increase cache miss rates, because data which may later be accessed by the program is dislodged from the cache in order to store prefetched data which may never be accessed by the program.
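A stride-based prefetch algorithm of the kind described above may be sketched in software as follows. This is an illustrative model only, not a description of any particular hardware implementation. The type stride_predictor, the function stride_observe, and the policy of predicting only after the same nonzero stride has been observed twice in succession are all assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predictor state for a single stream of memory accesses. */
typedef struct {
    uint64_t last_addr;   /* address of the most recent access */
    int64_t  stride;      /* difference between the last two accesses */
    bool     have_last;   /* at least one access has been observed */
    bool     have_stride; /* at least two accesses have been observed */
} stride_predictor;

/* Records an access to addr. Returns true and writes *prefetch_addr when the
 * stride just observed matches the previous stride and is nonzero; otherwise
 * returns false and leaves *prefetch_addr unchanged. */
bool stride_observe(stride_predictor *p, uint64_t addr, uint64_t *prefetch_addr)
{
    bool predict = false;

    if (p->have_last) {
        int64_t stride = (int64_t)(addr - p->last_addr);
        if (p->have_stride && stride == p->stride && stride != 0) {
            *prefetch_addr = addr + (uint64_t)stride;
            predict = true;
        }
        p->stride = stride;
        p->have_stride = true;
    }
    p->last_addr = addr;
    p->have_last = true;
    return predict;
}
```

Starting from a zero-initialized predictor, accesses to 0x1000, 0x1040, and 0x1080 cause the third call to generate a prefetch address of 0x10C0. A subsequent access to an unrelated address such as 0x5000 generates no prediction, which illustrates why irregular patterns are not handled by this class of algorithm.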
One data memory access pattern which is particularly difficult to prefetch accurately is a pattern in which the data read in response to a first memory operation specifies the address of a subsequent access. A variety of data structures employed by programmers may exhibit this behavior. For example, a linked list data structure comprises elements including a data storage (which stores the data assigned to the element) and a next element pointer storage (which stores a pointer, i.e. an address, to the next element in the list). Traversing the list therefore comprises reading each element in the list to obtain the pointer to the next element. A data structure involving a variety of data storage elements (e.g. a "struct" data structure as defined in the "C" programming language) may include pointers to areas of memory (e.g. a pointer to an array). Accessing the array therefore includes accessing the data structure to obtain the pointer to the array, and then accessing the pointer address (or an offset therefrom). Data access patterns in which the data read specifies the next data to be read are referred to herein as "pointer chasing" patterns.
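The two pointer chasing examples described above may be sketched in the "C" programming language as follows. The names node, sum_list, table, and table_get are hypothetical and are chosen only for illustration. In each function, the address of a later access is not known until an earlier read has returned its data.

```c
#include <stddef.h>

/* Linked list element: data storage plus next element pointer storage. */
struct node {
    int data;
    struct node *next;
};

/* Traverses the list. The address of each element is obtained only by
 * reading the next pointer stored in the previous element. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        sum += n->data;
    }
    return sum;
}

/* Data structure which includes a pointer to an array held elsewhere. */
struct table {
    size_t len;
    int *values;
};

/* Reads the pointer from the structure, then accesses an offset from it.
 * The caller is assumed to pass an index i which is less than t->len. */
int table_get(const struct table *t, size_t i)
{
    return t->values[i];
}
```

Because list elements may be allocated at arbitrary addresses, the sequence of addresses visited by sum_list need not exhibit any fixed stride.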
Prefetching pointer chasing patterns accurately is difficult because the pointers can specify addresses which have no simple mathematical relationship to each other and because a prior memory operation must be completed in order to generate the address for the next memory operation. A method which prefetches pointer chasing patterns accurately, and which thereby reduces effective memory latency, is therefore desirable.