This invention relates generally to pipelined processors and, more particularly, to methods, systems, and computer program products for recovering from branch prediction latency.
Modern processors use pipelining techniques to execute instructions at very high speeds. A pipeline is roughly analogous to an assembly line. On an automobile assembly line, many interrelated steps need to be performed in order to construct a new car. These steps are typically performed in parallel, such that a given step is performed on a plurality of different cars at substantially the same time. In a processor pipeline, each step completes a part of an instruction. Like the assembly line, different steps are completing different parts of different instructions in parallel. Each of these steps is called a pipe stage. The stages are connected, one to the next, to form a pipe where instructions enter at one end, progress through the stages, and exit at the other end. A pipeline is most effective if it can process a steady stream of instructions in a sequential manner.
As part of continuing efforts to increase the performance of central processing units (CPUs), instruction-level parallelism has been increasingly employed, in part, by deepening instruction pipelines. However, one consequence of a deeper pipeline is greater susceptibility to losses in performance from having to flush instructions being processed in the pipeline (i.e., instructions that are “in flight” in the pipeline), as occurs when a branch instruction redirects instruction flow. To counter this deleterious effect of branch instructions on deeper pipelines, branch prediction algorithms are used to predict whether or not a branch will be taken and, in response to this prediction, to initiate pre-fetching of an appropriate set of instructions into the pipeline. However, as pipelines become ever deeper, the stakes of lost performance due to an incorrect prediction become ever greater, and so the accuracy of branch prediction becomes ever more important.
More specifically, when a branch is executed, the value of an instruction pointer may be changed to something other than the current value of the pointer plus a predetermined fixed increment. If a branch changes the instruction pointer to an address of a branch target given by the branch instruction, the branch is considered to be a “taken” branch. On the other hand, if a branch does not change the value of the instruction pointer to the address of the branch target, then this branch is not taken. Knowledge of whether or not a branch will be taken, as well as the address of the branch target, typically becomes available when the instruction has reached the last or next to last stage of the pipe. Thus, if the branch is taken, all instructions that issued later than the branch, and hence are not as far along in the pipe as the branch, are invalid. These later issued instructions are invalid in the sense that they should not be executed, because the next instruction to be executed following a taken branch is the one at the target address. All of the time spent by the pipeline on these later issued instructions is wasted, thus significantly reducing the overall speed that can be achieved by the pipeline.
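By way of illustration only, the instruction pointer update and the resulting waste described above may be sketched as follows. The function names, the 1-based stage numbering, and the assumption that exactly one instruction enters the pipe per cycle and that fetch continues sequentially behind the branch are hypothetical choices made for this sketch and are not taken from any particular processor.

```python
def next_instruction_pointer(ip, increment, taken, target):
    """Return the branch target for a taken branch, else ip + increment."""
    return target if taken else ip + increment


def wasted_instructions(resolve_stage, taken):
    """Count later-issued instructions that must be discarded.

    Assumption (illustrative): one instruction enters the pipe per cycle,
    and the branch outcome becomes known when the branch reaches stage
    `resolve_stage` (1-based). At that point resolve_stage - 1 younger
    instructions are behind the branch in the pipe.
    """
    return resolve_stage - 1 if taken else 0
```

Under these assumptions, a branch resolved in the fifth stage of the pipe leaves four later issued instructions to be discarded when the branch is taken, and none when it is not taken.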
One existing method for dealing with branches is to use prediction logic, hardware within a processor, or both, to predict whether an address will result in a branch instruction being taken or not taken. Examples of such hardware include a 2-bit saturating counter predictor (see “Computer Architecture: A Quantitative Approach”, John L. Hennessy and David A. Patterson, 2nd Edition, Morgan Kaufmann Publishers, pp. 262-271), as well as a local history predictor which uses the past behavior (taken/not-taken) of a particular branch instruction to predict future behavior of the instruction. Another existing technique selects a final prediction at the output of a multiplexer from among a first prediction provided using a branch past history table and a second prediction provided using a global branch history table.
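By way of illustration only, a 2-bit saturating counter predictor of the kind referenced above may be sketched as follows. The class name, the helper function, and the choice of initial state are hypothetical and made for this sketch. States 0 and 1 predict not taken, and states 2 and 3 predict taken.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state=1):
        self.state = state  # assumed initial state: weakly not taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes, state=1):
    """Return how many outcomes in the sequence are predicted correctly."""
    counter = TwoBitCounter(state)
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct
```

For a loop-closing branch that is taken nine times and then not taken once, this sketch mispredicts the first iteration (while the counter warms up) and the loop exit, giving 8 correct predictions out of 10. Because a single not-taken outcome only moves the counter from state 3 to state 2, the branch is again predicted taken on the next entry to the loop.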
A shortcoming of existing branch prediction schemes is that a start-up penalty for the prediction logic is longer than the amount of time it takes for instructions to be fetched from an instruction cache. One consequence of this start-up penalty, also termed a latency penalty, is that from a fresh start, instruction fetch may get ahead of prediction and never allow prediction to catch up. This occurs in designs where the branch prediction logic acts in parallel with instruction fetch. If the proper branch prediction is not made in time, instruction fetch may proceed down the wrong path which, in turn, may lead to further fetch restarts. As a result, a single late prediction may start a train of incorrect predictions and severely degrade overall performance.
One known solution to prevent instruction fetch from proceeding down the wrong path is to stall fetch on a fresh start condition to allow branch prediction to catch up with the new fetch. This approach is detrimental to performance due to the added latency in the instruction fetch. Such an approach should only be utilized if performance analysis reveals that the performance gain in allowing the branch prediction to catch up with the fetch more than offsets this fetch delay. Accordingly, it would be advantageous to provide an enhanced branch prediction technique that overcomes the foregoing deficiencies.
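By way of illustration only, the trade-off noted above between stalling fetch and risking a wrong-path fetch may be sketched as a simple expected-cost comparison. The function name, its parameters, and the cost model itself (a fixed stall cost per fresh start weighed against a probability-weighted restart penalty) are assumptions made for this sketch, not a model given in the foregoing description.

```python
def stall_is_worthwhile(stall_cycles, p_wrong_path, restart_penalty_cycles):
    """Return True if stalling fetch on a fresh start is expected to pay off.

    Assumed model (illustrative):
      stall_cycles           -- cycles fetch waits for prediction to catch up
      p_wrong_path           -- probability that unstalled fetch proceeds
                                down the wrong path after a fresh start
      restart_penalty_cycles -- cycles lost to each resulting fetch restart
    """
    expected_loss_without_stall = p_wrong_path * restart_penalty_cycles
    return expected_loss_without_stall > stall_cycles
```

Under this assumed model, a 2-cycle stall is justified when unstalled fetch goes down the wrong path 30% of the time at a cost of 10 cycles per restart, but not when it does so only 10% of the time.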