The present invention relates generally to superscalar processors and more particularly to a method and system for prefetching instructions in such a processor.
Instruction prefetching has been analyzed in great detail over the years. Many of the proposed approaches require maintaining a large table that indicates which cache line to prefetch when a particular address is being fetched. In highly speculative superscalar processors, instructions are prefetched from a path predicted by a branch prediction algorithm.
To reduce memory access time, a memory subsystem is usually organized within the processor with multiple cache levels. In the memory hierarchy, the first level cache is the fastest but it is also the smallest in size. For instruction accesses, most microprocessors have a dedicated first level cache, called an instruction cache (IL1 cache). During execution, the IL1 cache is usually accessed at every cycle with a very short access time (1 cycle in most processors).
Furthermore, optimization tools such as Feedback Directed Program Restructuring (FDPR) restructure programs so that the most frequent paths of execution are laid out in memory in sequential cache lines. This gives rise to the successful use of a simple instruction prefetching algorithm called Next Sequential Address (NSA). In this algorithm, on an IL1 miss, the demand line is fetched with high priority and the next one (or more) sequential lines are “prefetched” with lower priority. Also, on a hit in the prefetch buffer, the next sequential line is prefetched. To prevent pollution of the IL1 cache with prefetched lines (since the prefetched lines may not actually be needed), the prefetched lines are stored in a separate area, called the “prefetch buffer”. Furthermore, to reduce memory traffic, before a prefetch request is sent to the memory subsystem below IL1, the IL1 cache directory and the prefetch buffer are checked to see whether the cache line already exists.
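The NSA behavior described above can be illustrated with a minimal software sketch. It is a simplified model and not the hardware implementation. The class name `NSAPrefetcher`, the line size, the cache and buffer capacities, the least-recently-used replacement, and the promotion of a line into the IL1 cache on a prefetch buffer hit are all assumptions made for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 128  # bytes per cache line (assumed value)


class NSAPrefetcher:
    """Toy model of Next Sequential Address prefetching (hypothetical names)."""

    def __init__(self, il1_lines=4, buffer_lines=2, depth=1):
        self.il1 = OrderedDict()     # models the IL1 cache directory
        self.buffer = OrderedDict()  # models the separate prefetch buffer
        self.il1_lines = il1_lines
        self.buffer_lines = buffer_lines
        self.depth = depth           # sequential lines prefetched on a miss
        self.requests = []           # requests sent to memory below IL1

    def _insert(self, store, capacity, line):
        # Assumed replacement policy: evict the least recently used line.
        store[line] = True
        store.move_to_end(line)
        while len(store) > capacity:
            store.popitem(last=False)

    def _prefetch(self, line):
        # Check the IL1 directory and the prefetch buffer before sending
        # a request, to reduce memory traffic.
        if line in self.il1 or line in self.buffer:
            return
        self.requests.append(("prefetch", line))
        self._insert(self.buffer, self.buffer_lines, line)

    def fetch(self, address):
        line = address // LINE_SIZE
        if line in self.il1:
            self.il1.move_to_end(line)
            return "il1_hit"
        if line in self.buffer:
            # Hit in the prefetch buffer: prefetch the next sequential line.
            # Moving the hit line into IL1 is an assumption of this sketch.
            del self.buffer[line]
            self._insert(self.il1, self.il1_lines, line)
            self._prefetch(line + 1)
            return "buffer_hit"
        # IL1 miss: demand fetch first, then lower-priority prefetches.
        self.requests.append(("demand", line))
        self._insert(self.il1, self.il1_lines, line)
        for offset in range(1, self.depth + 1):
            self._prefetch(line + offset)
        return "miss"
```

In this model, fetching sequential addresses after an initial miss produces prefetch buffer hits instead of further misses, and keeping prefetched lines in their own buffer means an unneeded prefetch never evicts a line from the IL1 cache.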
Since the IL1 cache is usually small (often no more than 64 KB), a significant number of IL1 cache misses occur for most workloads. On an IL1 cache miss, the execution pipeline usually runs dry and the line is brought in from a lower level of the memory hierarchy with a much longer access time (for example, if the line is found in a lower level cache, the access time may be about 10 cycles). Consequently, IL1 cache misses are undesirable due to cache miss latency, that is, the amount of time required to bring the line in from a lower level of the memory hierarchy.
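The cost of IL1 misses can be made concrete with a short calculation. The sketch below uses the 1-cycle hit time and the roughly 10-cycle lower level access time mentioned above. The function name and the 5% miss rate are hypothetical values chosen only for illustration.

```python
def average_fetch_cycles(miss_rate, hit_cycles=1, miss_cycles=10):
    """Average instruction fetch time in cycles.

    miss_cycles is treated as the total access time when the line
    must come from a lower level cache (an assumption of this sketch).
    """
    return (1 - miss_rate) * hit_cycles + miss_rate * miss_cycles


# Hypothetical workload in which 5% of IL1 accesses miss.
print(average_fetch_cycles(0.05))
```

Under these assumptions the average fetch time is about 1.45 cycles, so even a small miss rate adds nearly half a cycle to every instruction fetch.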
Accordingly, what is needed is an improved method and system for prefetching instructions in a superscalar processor. The method and system should be simple, cost effective and capable of being easily adapted to current technology. The present invention addresses such a need.
In a first aspect of the present invention, a method for prefetching instructions in a superscalar processor is disclosed. The method comprises the steps of fetching a set of instructions along a predicted path, prefetching a predetermined number of instructions if a low confidence branch is fetched, and storing the predetermined number of instructions in a prefetch buffer.
In a second aspect of the present invention, a system for prefetching instructions in a superscalar processor is disclosed. The system comprises a cache for fetching a set of instructions along a predicted path, a prefetching mechanism coupled to the cache for prefetching a predetermined number of instructions if a low confidence branch is fetched, and a prefetch buffer coupled to the prefetching mechanism for storing the predetermined number of instructions.
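The method and system summarized above can be sketched in software as follows. This is only an illustration under stated assumptions and not the disclosed hardware. The summary does not say which instructions are prefetched or how confidence is measured. The sketch assumes that the prefetch targets the path not chosen by the branch predictor and that confidence is an integer score compared against a threshold. All names and constants (`Branch`, `PREFETCH_COUNT`, `CONFIDENCE_THRESHOLD`, `INSTR_SIZE`) are hypothetical.

```python
from dataclasses import dataclass

PREFETCH_COUNT = 4        # the "predetermined number" of instructions (assumed)
CONFIDENCE_THRESHOLD = 2  # scores below this count as low confidence (assumed)
INSTR_SIZE = 4            # bytes per instruction (assumed)


@dataclass
class Branch:
    address: int           # address of the branch instruction
    target: int            # address of the taken target
    predicted_taken: bool  # direction chosen by the branch predictor
    confidence: int        # hypothetical confidence score for the prediction


def alternate_path_start(branch):
    # Assumption: prefetch from the path the predictor did NOT choose.
    if branch.predicted_taken:
        return branch.address + INSTR_SIZE  # fall-through path
    return branch.target                    # taken path


def on_branch_fetched(branch, prefetch_buffer):
    """Return the addresses prefetched when this branch is fetched."""
    if branch.confidence >= CONFIDENCE_THRESHOLD:
        return []  # high confidence: keep fetching along the predicted path only
    start = alternate_path_start(branch)
    addresses = [start + i * INSTR_SIZE for i in range(PREFETCH_COUNT)]
    for address in addresses:
        # Store the prefetched instructions in the prefetch buffer.
        if address not in prefetch_buffer:
            prefetch_buffer.append(address)
    return addresses
```

A branch predicted taken with a low score triggers a prefetch of the fall-through instructions into the buffer, while a high confidence branch triggers no prefetch, so the extra memory traffic is limited to the branches most likely to be mispredicted.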
Through the use of the method and system in accordance with the present invention, existing prefetching algorithms are improved with minimal additional hardware cost.