As the operating frequencies of microprocessors continue to rise, performance often depends upon providing a continual stream of instructions and data in accordance with the computer program that is running. As such, many processors include branch prediction circuitry that is used to predict branch addresses and to cause the prefetching of instructions in the instruction stream before they are needed. For example, U.S. Pat. No. 5,469,551 discloses a branch prediction circuit along with a subroutine stack used to predict branch addresses and prefetch the instruction stream.
As application programs get larger, instruction fetch penalty has become one of the major bottlenecks in system performance. Instruction fetch penalty refers to the number of cycles spent in fetching instructions from different levels of cache memories and main memory. Instruction prefetch is an effective way to reduce the instruction fetch penalty by prefetching instructions from long-latency cache memories or main memory to short-latency caches. The basic idea of any instruction prefetching scheme is to pre-load instructions from external memory or a higher-level cache into the local instruction cache that is most closely associated with the execution unit of the processor. As a result, when instructions are actually demanded, the fetch penalty of the instructions is small.
When instructions are immediately available in the local cache memory of the processor, program execution proceeds smoothly and rapidly. However, if an instruction is not resident in the on-chip instruction cache, the processor must request the instruction from a higher-level cache or from external memory. If the instruction is present in the higher-level cache (e.g., the L1 cache), the delay may only be around eight clock cycles of the processor. The cost of generating a bus cycle to access external memory is much greater: on the order of a hundred clock cycles or more. This means that program execution must halt or be postponed until the required instruction returns from memory. Hence, prefetching is aimed at bringing instructions into a cache local to the processor prior to the time the instruction is actually needed in the programmed sequence of instructions.
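By way of illustration only, the latencies mentioned above can be combined into a rough estimate of the average fetch penalty. The following is a minimal sketch, not a model of any particular processor: the function name, the two miss-rate parameters, and the simple two-level hierarchy are assumptions introduced for this example, and the default latencies of eight and one hundred cycles are the approximate figures given in the text.

```python
def average_fetch_penalty(local_miss_rate, higher_level_miss_rate,
                          higher_level_cycles=8, memory_cycles=100):
    """Expected stall cycles per instruction fetch (hypothetical model).

    local_miss_rate:        fraction of fetches that miss the local cache
    higher_level_miss_rate: fraction of those misses that also miss the
                            higher-level cache and go to external memory
    """
    # Fetches served by the higher-level cache after a local miss.
    served_by_higher_level = local_miss_rate * (1.0 - higher_level_miss_rate)
    # Fetches that must generate a bus cycle to external memory.
    served_by_memory = local_miss_rate * higher_level_miss_rate
    return (served_by_higher_level * higher_level_cycles
            + served_by_memory * memory_cycles)


if __name__ == "__main__":
    # Assumed rates: 5% local misses, 20% of which go to external memory.
    print(average_fetch_penalty(0.05, 0.20))
```

Under these assumed rates, the external-memory term dominates the result even though only one local miss in five reaches external memory, which is why moving instructions into the local cache ahead of demand is attractive.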
It is equally important for the instruction prefetch mechanism to acquire the correct instructions. Because a prefetch needs to be performed before the program actually reaches the prefetch target, the prefetch target is often chosen based on a prediction of the branch. When a branch is predicted correctly, the demanded instructions are prefetched into short-latency caches, thus reducing the fetch penalty. However, when a branch is predicted incorrectly, the prefetched instructions are not useful. In some cases, incorrectly prefetched instructions can actually be harmful to the program flow because they cause cache pollution. In addition, prefetching incorrect branch targets often results in a "stall" condition in which the processor is idle while the main memory or long-latency cache memories are busy acquiring the critically-demanded instructions.
To overcome these difficulties, designers have developed a variety of different systems and methods for avoiding stalls. By way of example, U.S. Pat. No. 5,396,604 teaches an approach that obviates the need for defining a new instruction in the instruction set architecture of the processor.
Yet another approach for reducing the cache miss penalty is to maintain a scoreboard bit for each word in a cache line in order to prevent the writing over of words previously written by a store instruction. U.S. Pat. No. 5,471,602 teaches a method for improving performance by allowing stores which miss the cache to complete in advance of the miss copy-in.
While each of these systems and methods provides improvement in processor performance, there still exists a need to increase the overall speed of prefetching operations. By its nature, prefetching is a speculative operation. In previous architectures, hardware implementations have been used to bring instructions into the machine prior to execution. Past approaches, however, have failed to make the best use of memory bandwidth because they do not specify how much to prefetch, in addition to where and when to prefetch instructions. That is, many machines simply prefetch the next sequential line of instructions following a cache miss because most instructions exhibit sequential behavior.
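The next-sequential-line policy described above can be sketched in software as follows. This is a simplified illustration under stated assumptions, not a description of any cited design: the 32-byte line size, the four-line capacity, the least-recently-used replacement, and the function name are all hypothetical choices made for this example.

```python
from collections import OrderedDict

LINE_SIZE = 32  # bytes per cache line (assumed for illustration)


def count_demand_misses(addresses, capacity_lines=4, next_line_prefetch=True):
    """Count demand misses for a trace of instruction fetch addresses."""
    cache = OrderedDict()  # line number -> None, oldest entry first (LRU)

    def install(line):
        # Bring a line into the cache, evicting the least recently used one.
        if line in cache:
            cache.move_to_end(line)
            return
        cache[line] = None
        if len(cache) > capacity_lines:
            cache.popitem(last=False)

    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line in cache:
            cache.move_to_end(line)  # hit: refresh LRU position
            continue
        misses += 1
        install(line)
        if next_line_prefetch:
            # Fixed policy: always fetch exactly one following line.
            install(line + 1)
    return misses


if __name__ == "__main__":
    sequential = list(range(0, 256, 4))   # straight-line code, 8 lines
    branchy = [0, 1024, 2048, 4096]       # each fetch jumps to a far target
    for name, trace in (("sequential", sequential), ("branchy", branchy)):
        print(name,
              count_demand_misses(trace, next_line_prefetch=False),
              count_demand_misses(trace, next_line_prefetch=True))
```

On the sequential trace, the fixed policy halves the number of demand misses. On the trace of far jumps it removes none, while each useless prefetch still consumes memory bandwidth and occupies a cache line.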
To minimize the instruction fetch penalty and to increase the accuracy of prefetching operations, it is therefore desirable to provide a new mechanism for instruction prefetching.