An important factor limiting a processor's performance is the memory system. To increase the processor's performance, the memory system preferably has low latency and high bandwidth. Cost-per-bit is also a factor in determining the practical memory system size and performance. Some previous techniques attempt to reduce the effective memory system access time by adding queues, FIFOs, buffers, and special memory structures. Such previous techniques can result in significantly increased cost, size, or complexity.
To reduce the average memory system access time, access times are reduced at one or more levels in the memory hierarchy. The memory subsystem closest to the processor has the largest effect on the average access time and is usually constrained to work in an integer number of processor cycles. This first level in the memory hierarchy is usually a first-level cache.
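The dominance of the first level can be illustrated with the standard textbook relation for average access time (hit time plus miss rate times miss penalty, applied recursively from the outermost level inward). The following is a minimal sketch only: the function name, the hit times, the miss rates, and the main-memory latency are hypothetical values chosen for illustration and do not come from this document.

```python
def average_access_time(levels, memory_latency):
    """Average access time, in processor cycles, seen by the processor.

    levels: list of (hit_time, miss_rate) tuples ordered from the level
            closest to the processor outward (illustrative values only).
    memory_latency: access time of main memory behind the last level.
    """
    # Start at main memory and fold inward: the miss penalty of each
    # level is the average access time of everything behind it.
    amat = memory_latency
    for hit_time, miss_rate in reversed(levels):
        amat = hit_time + miss_rate * amat
    return amat


# Hypothetical two-level hierarchy: (hit time in cycles, miss rate).
baseline = average_access_time([(1, 0.05), (10, 0.2)], memory_latency=100)
slower_first_level = average_access_time([(2, 0.05), (10, 0.2)], memory_latency=100)
slower_second_level = average_access_time([(1, 0.05), (11, 0.2)], memory_latency=100)

print(baseline, slower_first_level, slower_second_level)
```

Under these assumed numbers, one extra cycle at the first level adds a full cycle to the average, while one extra cycle at the second level adds only the first-level miss rate (0.05 cycle), because only first-level misses ever pay it.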
Processors can include internal first-level caches whose sizes are functions of the available semiconductor area. The speed-size-miss ratio tradeoff for very small caches (less than 1K bytes) can result in inefficient operation due to high miss-rates and total miss-penalties. Accordingly, the first-level caches normally are as large as possible and use the fastest available RAMs. Internal first-level caches normally are at least 16K bytes in total size, and external first-level caches can be as large as 2M bytes.
Many processors are sequenced at a rate comparable to the access time of the first-level cache. This streamlines the cache access and reduces the complexity of information control within the pipeline. A disadvantage of large first-level caches is that their access times are greater than the optimum processing rates of the other pipelined functional units. Accordingly, one technique for decreasing a memory system's average access time involves pipelining the first-level caches.
Pipelining the first-level caches can double their average throughput and cut their effective access time in half. Nevertheless, cache pipelining adds complexity to the cache structures and increases the pipeline depth of the processor. Increasing the pipeline depth can increase the processing penalty for a mispredicted branch instruction and for data conflicts following a load operation. If branch and load delay slots are scheduled by a compiler, then increasing the pipeline depth can increase the number of delay slots to be filled with useful instructions. Previous techniques fail to achieve the average access time of pipelined caches without increasing pipeline depth and cycles per instruction (CPI).
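The tradeoff described above can be sketched numerically. The model below is a rough illustration under stated assumptions, not a description of any particular processor: it assumes the cache splits into equal stages with no latch overhead, and it uses a simple additive stall model for CPI. All function names, parameter names, and numeric values are hypothetical.

```python
def pipelined_cache(access_time_ns, stages):
    """Return (cycle_time_ns, accesses_per_ns) for a cache split into
    `stages` equal pipeline stages (idealized: no latch overhead)."""
    cycle_time = access_time_ns / stages
    return cycle_time, 1.0 / cycle_time


def cycles_per_instruction(base_cpi, branch_freq, mispredict_rate,
                           branch_penalty, load_freq, load_use_rate,
                           load_delay):
    """Additive stall model: base CPI plus average stall cycles per
    instruction from mispredicted branches and load-use conflicts."""
    branch_stalls = branch_freq * mispredict_rate * branch_penalty
    load_stalls = load_freq * load_use_rate * load_delay
    return base_cpi + branch_stalls + load_stalls


# Two-stage pipelining of a hypothetical 10 ns cache: the cache accepts a
# new access every 5 ns, so throughput doubles.
print(pipelined_cache(10.0, 1), pipelined_cache(10.0, 2))

# Assumed effect of the deeper pipeline: one more cycle of branch
# misprediction penalty and one more cycle of load delay.
shallow = cycles_per_instruction(1.0, 0.2, 0.1, 3, 0.25, 0.4, 1)
deeper = cycles_per_instruction(1.0, 0.2, 0.1, 4, 0.25, 0.4, 2)
print(shallow, deeper)
```

In this sketch the cache throughput gain comes at the price of a higher CPI, since each extra pipeline stage is assumed to lengthen both the branch misprediction penalty and the load delay by one cycle.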
Thus, a need has arisen for a method and system for prefetching information in a processing system, in which a memory system has relatively low latency and high bandwidth. Also, a need has arisen for a method and system for prefetching information in a processing system, in which a memory system's performance is increased without significantly increasing cost, size and complexity. Further, a need has arisen for a method and system for prefetching information in a processing system, in which the speed-size-miss ratio tradeoff for very small caches does not result in inefficient operation due to high miss-rates and total miss-penalties. Moreover, a need has arisen for a method and system for prefetching information in a processing system, in which access times of a memory system are not greater than the optimum processing rates of other pipelined functional units. Finally, a need has arisen for a method and system for prefetching information in a processing system, in which a memory system's average access time is decreased without significantly increasing pipeline depth and CPI.