Most computer systems employ a multilevel memory hierarchy, with relatively fast, expensive, limited-capacity memory at the highest level and progressively slower, lower-cost, higher-capacity memory at the lower levels. The goal of a memory hierarchy is to reduce the average memory access time. Typically, the hierarchy includes a small, fast memory called a cache, either physically integrated within a processor integrated circuit or mounted physically close to the processor for speed. A memory hierarchy is cost effective only if a high percentage of items requested from memory are present in the highest levels of the hierarchy (the levels with the shortest latency) when requested. If a processor requests an item and the item is present in the cache, the event is called a cache hit. If a processor requests an item and the item is not present in the cache, the event is called a cache miss. In the event of a cache miss, the requested item is retrieved from a lower level (a level with longer latency) of the memory hierarchy, which may have a significant impact on performance. In general, processor speed is increasing faster than memory speed, so that the relative performance penalty for a cache miss is increasing over time.
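By way of illustration only, the following C sketch shows the hit/miss decision for a hypothetical direct-mapped cache; the geometry and all identifiers (NUM_SETS, LINE_BYTES, cache_lookup) are assumptions chosen for clarity rather than features of any particular design.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SETS   256        /* hypothetical: 256 direct-mapped sets */
    #define LINE_BYTES 64         /* hypothetical: 64-byte lines          */

    struct cache_line {
        bool     valid;           /* line holds usable data               */
        uint32_t tag;             /* upper address bits naming the line   */
    };

    static struct cache_line cache[NUM_SETS];

    /* Returns true on a cache hit, false on a cache miss. On a miss, a
     * real controller would fetch the line from a lower (longer-latency)
     * level of the hierarchy before allocating it here. */
    static bool cache_lookup(uint32_t addr)
    {
        uint32_t set = (addr / LINE_BYTES) % NUM_SETS;  /* index bits */
        uint32_t tag = (addr / LINE_BYTES) / NUM_SETS;  /* tag bits   */

        if (cache[set].valid && cache[set].tag == tag)
            return true;          /* cache hit                           */

        cache[set].valid = true;  /* cache miss: allocate after fetch    */
        cache[set].tag   = tag;
        return false;
    }

    int main(void)
    {
        printf("first access:  %s\n", cache_lookup(0x1234) ? "hit" : "miss");
        printf("second access: %s\n", cache_lookup(0x1234) ? "hit" : "miss");
        return 0;
    }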
The average memory access time may be reduced by increasing the cache hit ratio, reducing the time penalty for a miss, and reducing the time required for a hit. The present patent document is primarily concerned with reducing the time penalty for a cache miss.
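By way of a numeric illustration, the average memory access time equals the hit time plus the miss rate multiplied by the miss penalty, so reducing any of the three terms reduces the average. The cycle counts in the following short C program are hypothetical values chosen only to make the arithmetic concrete.

    #include <stdio.h>

    /* Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
     * All figures below are assumed for illustration. */
    int main(void)
    {
        double hit_time     = 1.0;    /* cycles for a cache hit            */
        double miss_penalty = 100.0;  /* extra cycles to reach lower level */
        double miss_rate    = 0.02;   /* fraction of accesses that miss    */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);  /* 1.0 + 0.02 * 100 = 3.0   */
        return 0;
    }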
In many computer systems, multiple instructions overlap in execution in a technique called pipelining. In pipelining, instruction execution is broken down into small parts, called stages or phases, each of which takes a fraction of the overall time required to complete an entire instruction. After a first instruction has completed a first stage or phase and has entered the second stage or phase, a second instruction starts the first stage. At any given time, many instructions may be overlapped in execution, each in a different phase of completion. The effective instruction rate then becomes the rate at which instructions exit the pipeline. Alternatively, computer systems may issue multiple instructions simultaneously; these systems are called superscalar machines. A variation is very long instruction word (VLIW) machines, in which a single instruction includes multiple operations. Finally, there are systems with multiple processors that may share memory. Of course, there are combinations of all of these, and in particular there are superscalar pipelined machines. Simultaneous execution and overlapping execution assume independent instructions or operations. In contrast, if one operation requires a computational result from another operation, the two operations must be executed sequentially. Typically, the burden is placed on the compiler to present independent operations to the hardware. In an environment of simultaneous and overlapping instruction execution, a cache miss can create a substantial problem, possibly stalling many instructions.
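The overlap described above may be visualized with the following C sketch, which prints the stage occupied by each instruction of a hypothetical four-stage pipeline in each cycle; the stage names and counts are assumptions for illustration only.

    #include <stdio.h>

    #define STAGES 4   /* hypothetical four-stage pipeline           */
    #define INSNS  6   /* number of independent instructions issued  */

    int main(void)
    {
        const char *stage_name[STAGES] = { "FETCH", "DECODE", "EXECUTE", "WRITE" };

        /* With one instruction entering per cycle, instruction i occupies
         * stage (cycle - i) in a given cycle, so several instructions are
         * in flight at once, each in a different stage. */
        for (int cycle = 0; cycle < INSNS + STAGES - 1; cycle++) {
            printf("cycle %d:", cycle);
            for (int i = 0; i < INSNS; i++) {
                int stage = cycle - i;
                if (stage >= 0 && stage < STAGES)
                    printf("  I%d:%s", i, stage_name[stage]);
            }
            printf("\n");
        }
        return 0;  /* once full, one instruction exits the pipeline per cycle */
    }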
The minimum amount of memory that can be transferred into or out of a cache is called a line, or sometimes a block. Memory is typically organized into words (for example, 32 bits per word), and a line is typically multiple words (for example, 16 words per line). Memory may also be divided into pages, with many lines per page.
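For illustration, using the example figures above (32-bit words and 16 words per line) together with an assumed 64 lines per page, the following C fragment shows how a byte address decomposes into a word number, a line number, and a page number.

    #include <stdint.h>
    #include <stdio.h>

    #define BYTES_PER_WORD 4    /* 32-bit words, as in the example above */
    #define WORDS_PER_LINE 16   /* 16 words per line, as in the example  */
    #define LINES_PER_PAGE 64   /* assumed page size for illustration    */

    int main(void)
    {
        uint32_t addr = 0x00012345;                /* arbitrary byte address */
        uint32_t word = addr / BYTES_PER_WORD;     /* word number            */
        uint32_t line = word / WORDS_PER_LINE;     /* line: the minimum unit */
                                                   /* transferred to a cache */
        uint32_t page = line / LINES_PER_PAGE;     /* page number            */

        printf("address 0x%08x -> word %u, line %u, page %u\n",
               addr, word, line, page);
        return 0;
    }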
Various strategies may be employed to minimize the effects of cache misses. For example, buffers are sometimes placed between a cache and other lower level memory. These buffers typically fetch a block or line of sequential addresses including the miss address, on the assumption that addresses immediately following the miss address will also be needed. In U.S. Pat. No. 5,317,718 (Jouppi), a buffer called a stream buffer is placed between a cache and lower level memory. In Jouppi, items remain in the buffer until another cache miss occurs (if ever), and items then pass from the buffer into the cache, not directly to the processor. The stream buffer described in Jouppi reduces the impact of a cache miss by rapidly loading a block of items that are likely to be needed by the processor in addition to the specific item whose request resulted in the cache miss. Effectively, the stream buffer increases the block size. For interleaved processes, Jouppi proposes multiple stream buffers, each with a different starting address, replaced on a least-recently-used basis. In U.S. Pat. No. 5,423,016 (Tsuchiya et al.), a buffer is provided that holds a single block of data. In Tsuchiya, items in the block are available to the processor directly from the buffer, without first having to be placed into the cache: if the block is accessed again before being transferred to the cache, the access request is serviced directly from the block buffer. For one block, the buffer described in Tsuchiya et al. therefore enhances performance relative to Jouppi by making items directly available to the processor.
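The following C sketch illustrates, in greatly simplified form, the general single-block buffering idea described above: on a miss, the entire line of sequential addresses is fetched into a buffer, and subsequent requests to the same line are serviced directly from the buffer. It is an illustrative abstraction rather than a rendering of either patented design, and the identifiers (including the fetch_word_from_memory placeholder for lower level memory) are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WORDS_PER_LINE 16   /* hypothetical line size */

    struct line_buffer {
        bool     valid;
        uint32_t line_number;             /* which memory line is buffered */
        uint32_t data[WORDS_PER_LINE];
    };

    /* Placeholder standing in for a fetch from lower level memory. */
    static uint32_t fetch_word_from_memory(uint32_t word_addr)
    {
        return word_addr;
    }

    static uint32_t read_word(struct line_buffer *buf, uint32_t word_addr)
    {
        uint32_t line   = word_addr / WORDS_PER_LINE;
        uint32_t offset = word_addr % WORDS_PER_LINE;

        if (!(buf->valid && buf->line_number == line)) {
            /* Miss: load the whole line of sequential addresses, on the
             * assumption that neighboring words will be needed soon. */
            for (uint32_t w = 0; w < WORDS_PER_LINE; w++)
                buf->data[w] = fetch_word_from_memory(line * WORDS_PER_LINE + w);
            buf->valid       = true;
            buf->line_number = line;
        }
        return buf->data[offset];  /* serviced directly from the buffer */
    }

    int main(void)
    {
        struct line_buffer buf = { 0 };
        printf("%u\n", read_word(&buf, 35));  /* miss: line 2 is fetched     */
        printf("%u\n", read_word(&buf, 36));  /* hit: same line, from buffer */
        return 0;
    }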
There is a need for further reduction of the cache miss penalty, particularly for multiple misses during out-of-order execution and multiple misses to the same line.