In computer systems, processor performance has improved faster than the performance of hierarchical memory systems. A hierarchical memory system includes a plurality of memories, such as disk, main memory, off-chip and on-chip caches, and registers. Typically, memories that are "farther" from the processor have longer access latencies and larger capacities than memories "closer" to the processor, which have shorter latencies and smaller capacities.
During operation of a hierarchical memory system, more frequently used data are stored in memories that are closer to the processor; this is known as caching. Other data can be moved into closer memories before the processor requests them, in anticipation of their use; this is called prefetching. Prefetched data can be stored in a structure known as a stream buffer.
A stream buffer can be a small memory organized as a first-in first-out (FIFO) queue for storing a plurality of cache lines. Stream buffers typically hold four or eight cache lines together with the memory addresses of those lines. Cache lines often contain 128 or 256 bits of data, which are transferred between caches and stream buffers as a unit. A stream buffer is commonly placed between one or more first level, on-chip instruction and data caches and a second level, off-chip cache that is direct-mapped or multi-way set-associative.
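The FIFO organization described above can be modeled in a few lines of Python. This is a minimal, hypothetical sketch, not a hardware design: it assumes 256-bit (32-byte) cache lines, and the names `CacheLine`, `StreamBuffer`, `push`, and `pop_head` are illustrative only. Each entry keeps the line's address next to its data, and lines enter at the tail and leave from the head.

```python
from collections import deque
from dataclasses import dataclass

LINE_BYTES = 32  # assumption: 256-bit cache lines; use 16 for 128-bit lines


@dataclass
class CacheLine:
    address: int  # line-aligned memory address stored alongside the data
    data: bytes   # LINE_BYTES bytes transferred as a unit


class StreamBuffer:
    """Hypothetical model: a fixed-depth FIFO of prefetched cache lines."""

    def __init__(self, depth: int = 4):
        self.depth = depth      # typically four or eight lines
        self.entries = deque()  # head = entries[0], tail = entries[-1]

    def is_full(self) -> bool:
        return len(self.entries) >= self.depth

    def push(self, line: CacheLine) -> None:
        """Append a prefetched line at the tail of the queue."""
        if self.is_full():
            raise OverflowError("stream buffer is full")
        self.entries.append(line)

    def pop_head(self) -> CacheLine:
        """Remove and return the oldest line from the head of the queue."""
        return self.entries.popleft()
```

A depth of 4 mirrors the four-line buffers mentioned in this section; an eight-line buffer only changes the `depth` argument.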
Stream buffers exploit the fact that processors frequently execute instructions and access data sequentially. When a processor requests data at an address that is missing from the first level caches, the request is passed on to farther levels of the memory system. A memory level at which the address is present responds by returning an entire cache line, which includes the data at the requested address, to the first level cache.
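The miss path can be sketched as follows. This is a hypothetical software model, not any particular hardware: each farther memory level is represented as a dictionary keyed by line address, and a 32-byte (256-bit) line size is assumed. The key step is aligning the requested byte address down to the start of its cache line, so that the responding level returns the whole line rather than a single word. The function names `line_address` and `handle_l1_miss` are illustrative.

```python
LINE_BYTES = 32  # assumption: 256-bit lines; use 16 for 128-bit lines


def line_address(addr: int, line_bytes: int = LINE_BYTES) -> int:
    """Return the address of the cache line that contains byte `addr`."""
    # Line sizes are powers of two, so clearing the low bits aligns the address.
    assert line_bytes & (line_bytes - 1) == 0, "line size must be a power of two"
    return addr & ~(line_bytes - 1)


def handle_l1_miss(addr, memory_levels):
    """Search farther levels in order; return (level_name, whole_line).

    `memory_levels` is a hypothetical list of (name, dict) pairs, nearest
    first, where each dict maps a line address to that line's data.
    """
    base = line_address(addr)
    for name, level in memory_levels:
        if base in level:
            return name, level[base]  # the entire line, not just one word
    raise KeyError(f"address {addr:#x} not present in any level")
```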
In addition, a stream buffer is allocated to receive and store additional cache lines that are likely to be needed next by the processor. These cache lines may be prefetched from sequential addresses in the memory system, or from addresses that are spaced apart at non-unit intervals (strides) according to techniques that are well known in the art. These prefetched cache lines are stored in the stream buffer and not in the first level cache. Because it is not known whether the prefetched data will ever be used, storing the prefetched lines directly in the first level cache could replace useful data with useless data.
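Generating the addresses to prefetch can be illustrated with a small helper. This sketch assumes the stride is already known (how a stride is detected is not shown) and is expressed in bytes as a multiple of an assumed 32-byte line size; `prefetch_addresses` is a hypothetical name. A stride equal to one line gives sequential prefetching, and any larger multiple gives non-unit-stride prefetching. The returned addresses would be fetched into a stream buffer, not into the first level cache.

```python
LINE_BYTES = 32  # assumption: 256-bit cache lines


def prefetch_addresses(miss_line: int, count: int, stride: int = LINE_BYTES) -> list:
    """Line addresses to prefetch after a miss at line address `miss_line`.

    stride == LINE_BYTES gives sequential (unit-stride) prefetching;
    any other multiple of LINE_BYTES gives a non-unit stride.
    """
    if stride % LINE_BYTES != 0:
        raise ValueError("stride must be a multiple of the line size")
    return [miss_line + stride * k for k in range(1, count + 1)]
```

For example, `prefetch_addresses(0x1000, 4)` yields the four lines following 0x1000, while a stride of three lines skips two lines between each prefetch.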
When a subsequently requested address then misses in the first level cache, there is a good chance that the address is in the stream buffer. Data missing from the first level cache can thus be supplied quickly from the prefetched cache lines stored in the stream buffer.
A prefetched cache line moves from the head of the FIFO queue in the stream buffer to the first level cache when that prefetched cache line is needed. The other cache lines in the queue move up accordingly, making room in the FIFO queue for the prefetching of additional cache lines.
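The hit path described in the two paragraphs above can be sketched as a sequential stream buffer that is checked after a first level miss. Several details are assumptions made for illustration: only the head entry is compared against the missing line address (other designs may compare every entry), the stream is strictly sequential with an assumed 32-byte line size, and `fetch_line` is a hypothetical callable standing in for the farther memory levels. On a hit, the head line moves into the first level cache, the remaining lines move up, and one new line is prefetched at the tail.

```python
from collections import deque

LINE_BYTES = 32  # assumption: 256-bit cache lines


class SequentialStreamBuffer:
    """Hypothetical FIFO of (line_address, data) pairs for one sequential stream."""

    def __init__(self, fetch_line, depth=4):
        self.fetch_line = fetch_line  # callable: line address -> line data
        self.depth = depth
        self.queue = deque()
        self.next_addr = None         # next line address to prefetch

    def allocate(self, miss_line):
        """Start a new stream just past the missed line and fill the FIFO."""
        self.queue.clear()
        self.next_addr = miss_line + LINE_BYTES
        self._refill()

    def _refill(self):
        # Prefetch at the tail until the queue is full again.
        while len(self.queue) < self.depth:
            self.queue.append((self.next_addr, self.fetch_line(self.next_addr)))
            self.next_addr += LINE_BYTES

    def lookup(self, line_addr, l1_cache):
        """On an L1 miss, move the head line into L1 if its address matches."""
        if self.queue and self.queue[0][0] == line_addr:
            addr, data = self.queue.popleft()  # remaining lines move up
            l1_cache[addr] = data
            self._refill()                     # make use of the freed slot
            return True
        return False
```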
Computer systems that use prefetching techniques often have more than one stream buffer. Various methods can be used to determine which stream buffer to allocate for receiving prefetched cache lines; one example is a least recently used (LRU) algorithm.
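Least recently used allocation among several stream buffers can be sketched with an ordered dictionary that keeps buffer IDs from least to most recently used. This is a minimal illustration under assumptions: buffers are identified by integer IDs, their contents are not modeled, and `StreamBufferPool`, `touch`, and `allocate` are hypothetical names. A hit in a buffer or a fresh allocation marks that buffer as most recently used, and the victim for a new stream is always the buffer at the front of the order.

```python
from collections import OrderedDict


class StreamBufferPool:
    """Hypothetical LRU chooser for which stream buffer to (re)allocate."""

    def __init__(self, num_buffers=4):
        # Keys are buffer IDs, ordered from least to most recently used.
        self.lru = OrderedDict((i, None) for i in range(num_buffers))

    def touch(self, buf_id):
        """Record a use (a hit or an allocation) of this buffer."""
        self.lru.move_to_end(buf_id)

    def allocate(self):
        """Pick the least recently used buffer and mark it most recent."""
        victim = next(iter(self.lru))
        self.touch(victim)
        return victim
```

An ordered dictionary makes both operations constant time; hardware would more likely track a small set of age bits per buffer.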
For a detailed description of prior art stream buffers, see U.S. Pat. No. 5,317,718, Data Processing and Method for Prefetch Buffers, issued to Jouppi on May 31, 1994, incorporated by reference herein. That patent describes prior art stream buffers that prefetch four cache lines. Such prefetching can improve the performance of a computer system when the information needed by the processor is accurately anticipated: having the data ready in a closer memory gives the processor much faster access than retrieving it from a memory that is farther away.
However, for processes that access information stored at dispersed addresses, stream buffers may improve performance only marginally. If the processor does not use the prefetched data, prefetching can actually degrade system performance: the extra data traffic generated by useless prefetching consumes memory bandwidth for no useful purpose and can overload the memory system.
It is therefore desirable to improve the interaction between processors and memory systems, not only when information is processed sequentially, but also when processors access information in a non-sequential manner.