1. Field of the Invention
This invention relates to microprocessors, and more particularly, to branch prediction mechanisms.
2. Description of the Relevant Art
Modern microprocessors may include one or more processor cores, or processors, wherein each processor is capable of executing instructions of a software application. These processors are typically pipelined, wherein the processors include one or more data processing stages connected in series with storage elements (e.g. registers and arrays) placed between the stages. The output of one stage is made the input of the next stage during a transition of a clock signal that defines a clock cycle or a phase, which may be a fraction of a clock cycle. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
Ideally, every clock cycle produces useful execution of an instruction for each stage of a pipeline. However, a stall in a pipeline may cause no useful work to be performed during that particular pipeline stage. Some stalls may last several clock cycles and significantly decrease processor performance. Some examples of a stall include a data-cache or instruction-cache miss, data dependency between instructions, and control flow misprediction, such as a mispredicted branch instruction.
The negative effect of stalls on processor performance may be reduced by overlapping pipeline stages. A further technique is to allow out-of-order execution of instructions, which helps reduce data dependent stalls. In addition, a core with a superscalar architecture issues a varying number of instructions per clock cycle based on dynamic scheduling. However, a stall of several clock cycles still reduces the performance of the processor due to in-order retirement that may prevent hiding of all the stall cycles. Therefore, another method to reduce performance loss is to reduce the occurrence of multi-cycle stalls. One such multi-cycle stall is a misprediction of a control flow instruction, such as a branch instruction.
Branch instructions comprise many types such as conditional or unconditional and direct or indirect. A conditional branch instruction performs a determination of which path to take in an instruction stream. If the branch instruction determines a specified condition, which may be encoded within the instruction, is not satisfied, then the branch instruction is considered to be not-taken and the next sequential instruction in a program order is executed. However, if the branch instruction determines a specified condition is satisfied, then the branch instruction is considered to be taken. Accordingly, a subsequent instruction which is not the next sequential instruction in program order, but rather is an instruction located at a branch target address, is executed. An unconditional branch instruction is considered an always-taken conditional branch instruction. There is no specified condition within the instruction to test, and execution of subsequent instructions always occurs in a different sequence than sequential order.
In addition, a branch target address may be specified by an offset, which may be stored in the branch instruction itself, relative to the linear address value stored in the program counter (PC) register. This type of branch target address is referred to as direct. A branch target address may also be specified by a value in a register or memory, wherein an identifier of the register or memory location may be encoded in the branch instruction. This type of branch target address is referred to as indirect. Further, in an indirect branch instruction, the register specifying the branch target address may be loaded with different values, so the same branch instruction may have different targets on different executions.
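The distinction between direct and indirect targets, and between taken and not-taken outcomes, may be illustrated with a minimal sketch. The function names, the register file modeled as a dictionary, and the choice to add the offset to the address of the branch itself are assumptions made for illustration only; actual instruction sets differ on these details.

```python
def direct_target(pc, offset):
    """Direct branch: the target is the PC plus an offset encoded in the instruction.

    Assumption: the offset is relative to the branch's own address.
    """
    return pc + offset


def indirect_target(registers, reg_name):
    """Indirect branch: the target is whatever the named register currently holds."""
    return registers[reg_name]


def next_fetch_address(pc, instr_len, taken, target):
    """A taken branch redirects fetch to the target; otherwise fetch falls through."""
    return target if taken else pc + instr_len


# Hypothetical register file used only for this example.
registers = {"r1": 0x4000}

# A direct target is fixed by the instruction encoding.
fixed = direct_target(0x1000, 0x20)

# An indirect target changes when the register is reloaded.
first = indirect_target(registers, "r1")
registers["r1"] = 0x5000
second = indirect_target(registers, "r1")
```

Because the indirect target depends on register contents at execution time, it generally cannot be computed from the instruction bytes alone, which is why it is a candidate for prediction.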
Examples of unconditional indirect branch instructions include procedure calls and returns that may be used for implementing subroutines in program code, and that may use a Return Address Stack (RAS) to supply the branch target address. Another example is an indirect jump instruction that may be used to implement a switch-case statement, which is popular in programs written in object-oriented languages such as C++ and Java.
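As a rough illustration of how a Return Address Stack may supply a return target, consider the following sketch. The class name, the default depth of 16 entries, and the policy of discarding the oldest entry on overflow are assumptions, not details taken from any particular design.

```python
class ReturnAddressStack:
    """Minimal RAS sketch: calls push a return address, returns pop a prediction."""

    def __init__(self, depth=16):
        self.depth = depth      # assumed capacity
        self.entries = []

    def on_call(self, return_address):
        # On overflow, drop the oldest entry (one possible policy among several).
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def predict_return(self):
        # Predict the most recently pushed address; None if the stack is empty.
        return self.entries.pop() if self.entries else None


ras = ReturnAddressStack()
ras.on_call(0x1004)   # outer call, returns to 0x1004
ras.on_call(0x2008)   # nested call, returns to 0x2008
inner = ras.predict_return()
outer = ras.predict_return()
```

Nested calls return in last-in, first-out order, so the stack structure matches the call pattern as long as its depth is not exceeded.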
An example of a conditional branch instruction is a branch instruction that may be used to implement loops in program code (e.g. “for” and “while” loop constructs). Conditional branch instructions must satisfy a specified condition to be considered taken. An example of a satisfied condition may be that a specified register now holds a stored value of zero. The specified register is encoded in the conditional branch instruction. This specified register may have its stored value decremented in a loop by instructions within software application code. The output of the specified register may be input to dedicated zero-detect combinational logic.
In addition, conditional branch instructions may have some dependency on one another. For example, a program may have a simple case such as:
    if (value == 0) value = 1;
    if (value == 1)
The outcomes of the conditional branch instructions used to implement the above case are correlated: if the first condition is satisfied, the second condition will be satisfied as well. A global history of recent branch outcomes may therefore be used to improve the accuracy of predicting the conditions. In one embodiment, the prediction may be implemented by 2-bit counters. Branch prediction is described in more detail next.
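A 2-bit saturating counter of the kind mentioned above may be sketched as follows. The state encoding (0 for strongly not-taken through 3 for strongly taken), the initial state, and the helper names are assumptions chosen for illustration.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""

    def __init__(self, state=1):
        self.state = state  # assumed initial state: weakly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move toward 3 on a taken branch and toward 0 otherwise, saturating.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(counter, outcomes):
    """Predict each outcome, then train the counter with the actual result."""
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch that is taken nine times and then falls through once.
loop_outcomes = [True] * 9 + [False]
counter = TwoBitCounter()
first_pass = count_mispredictions(counter, loop_outcomes)
second_pass = count_mispredictions(counter, loop_outcomes)
```

In this sketch the single not-taken outcome at loop exit only weakens the counter rather than flipping its prediction, so the next pass through the loop mispredicts only at the exit.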
Modern microprocessors may need multiple clock cycles to both determine the outcome of the condition of a branch instruction and to determine the branch target address. For a particular thread being executed in a particular pipeline, no useful work may be performed by the branch instruction or subsequent instructions until the branch instruction is decoded and later both the condition outcome is known and the branch target address is known. These stall cycles decrease the processor's performance.
Rather than stall, predictions may be made of the conditional branch condition and the branch target address shortly after the instruction is fetched. The exact stage at which the prediction is ready depends on the pipeline implementation. In order to predict a branch condition, the PC used to fetch the instruction from memory, such as from an instruction cache (i-cache), may be used to index branch prediction logic. One example of an early combined prediction scheme that uses the PC is the gselect branch prediction method described in Scott McFarling's 1993 paper, “Combining Branch Predictors”, Digital Western Research Laboratory Technical Note TN-36, incorporated herein by reference in its entirety. The linear address stored in the PC may be combined with values stored in a global history register in a hashing function. The output of the hashing function and the PC may be used to index prediction tables such as a pattern history table (PHT), a branch target buffer (BTB), or otherwise. The update of the global history register with branch target address information of a current branch instruction, rather than a taken or not-taken prediction, may increase the prediction accuracy of both conditional branch direction predictions (i.e. taken and not-taken outcome predictions) and indirect branch target address predictions, such as a BTB prediction or an indirect target array prediction. Many different schemes may be included in various embodiments of branch prediction mechanisms.
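One way to picture the combination of the PC with the global history is a gselect-style index, in which low-order PC bits are concatenated with global history bits to select a PHT entry. The sketch below is illustrative only: the bit widths, the function names, and the use of a simple taken/not-taken shift register for the history are assumptions, and a real design might first drop instruction-alignment bits from the PC or fold target address bits into the history.

```python
def gselect_index(pc, history, pc_bits=4, history_bits=4):
    """Concatenate low-order PC bits with global history bits to form a PHT index."""
    pc_part = pc & ((1 << pc_bits) - 1)
    hist_part = history & ((1 << history_bits) - 1)
    return (pc_part << history_bits) | hist_part


def update_history(history, taken, history_bits=4):
    """Shift the newest outcome into the global history register (assumed policy)."""
    return ((history << 1) | int(taken)) & ((1 << history_bits) - 1)


# Hypothetical values: the same branch PC under two different histories.
pc = 0x1234
history = 0b1010
index_before = gselect_index(pc, history)
history = update_history(history, taken=True)
index_after = gselect_index(pc, history)
```

The same branch maps to different PHT entries under different histories, which allows the predictor to learn a separate outcome for each recent branch pattern.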
High branch prediction accuracy contributes to more power-efficient and higher performance microprocessors. Instructions from the predicted instruction stream may be speculatively executed prior to execution of the branch instruction, and in any case are placed into a processor's pipeline prior to execution of the branch instruction. If the predicted instruction stream is correct, then the number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect (i.e. one or more branch instructions are predicted incorrectly), then the instructions from the incorrectly predicted instruction stream are discarded from the pipeline and the number of instructions executed per clock cycle is decreased.
Frequently, a branch prediction mechanism maintains a history of prior executions of a branch instruction in order to form a more accurate prediction of the behavior of that particular branch instruction. Such a branch prediction history typically requires maintaining data corresponding to the branch instruction in a storage. Also, a branch target buffer (BTB) may be used to store whole or partial branch target addresses used in target address predictions. In the event the branch prediction data comprising history and address information are evicted from the storage, or otherwise lost, it may be necessary to recreate the data for the branch instruction at a later time.
One solution to the above problem may be to increase the size of the branch prediction storage. However, increasing the size of branch prediction storage may require a significant increase in gate area and in the overall size of the branch prediction mechanism. Conversely, if the size of the branch prediction storage is reduced in order to reduce gate area and power consumption, valuable data regarding the behavior of a branch may be evicted and must later be recreated.
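The trade-off between storage size and eviction may be illustrated with a small direct-mapped BTB model. This is a sketch under stated assumptions: the class name, the modulo indexing, and the use of the full PC as a tag are simplifications for illustration and do not describe any particular implementation.

```python
class DirectMappedBTB:
    """Direct-mapped BTB sketch: one (tag, target) entry per index."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, pc):
        return pc % self.num_entries  # assumed indexing function

    def lookup(self, pc):
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None  # miss: no stored target for this branch

    def update(self, pc, target):
        # Overwrites (evicts) whatever branch previously occupied this index.
        self.entries[self._index(pc)] = (pc, target)


# Two hypothetical branches whose addresses collide in a 4-entry table.
small = DirectMappedBTB(4)
small.update(0x10, 0x100)
small.update(0x14, 0x200)        # evicts the entry for 0x10
small_hit = small.lookup(0x10)

# The same branches coexist in an 8-entry table.
large = DirectMappedBTB(8)
large.update(0x10, 0x100)
large.update(0x14, 0x200)
large_hit = large.lookup(0x10)
```

In the smaller table the first branch loses its stored target and would have to be retrained after its next execution, while the larger table retains both entries at the cost of twice the storage.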
In view of the above, efficient methods and mechanisms for improving branch prediction capability that do not require a significant increase in the gate count or size of the branch prediction mechanism are desired.