1. Field of the Invention
This invention relates to microprocessors, and more particularly, to branch prediction mechanisms.
2. Description of the Relevant Art
Modern microprocessors may include one or more processor cores, or processors, wherein each processor is capable of executing instructions of a software application. These processors are typically pipelined, wherein the processors include one or more data processing stages connected in series with storage elements placed between the stages. The output of one stage is made the input of the next stage during each transition of a clock signal. Some processors may have multiple pipelines. Ideally, every clock cycle produces useful execution of an instruction for each stage of a pipeline. However, a stall in a pipeline may cause no useful work to be performed during that particular pipeline stage. Some stalls may last several clock cycles and significantly decrease processor performance. Some examples of a stall include a data-cache or instruction-cache miss, data dependency between instructions, and control flow misprediction, such as a mispredicted branch instruction.
The negative effect of stalls on processor performance may be reduced by overlapping pipeline stages. A further technique is to allow out-of-order execution of instructions, which helps reduce data dependent stalls. However, a stall of several clock cycles still reduces the performance of the processor due to in-order retirement that may prevent hiding of all the stall cycles. Therefore, another method to reduce performance loss is to reduce the occurrence of the multi-cycle stalls. One such multi-cycle stall is a misprediction of a control flow instruction, such as a branch instruction.
Control flow instructions comprise many types, such as conditional or unconditional, direct or indirect, and monomorphic, duomorphic, or polymorphic. A conditional control flow instruction determines which path to take in an instruction stream. If the control flow instruction determines a condition is not satisfied, then the control flow instruction is considered to be not-taken and the next sequential instruction in program order is executed. However, if the control flow instruction determines a condition is satisfied, then the control flow instruction is considered to be taken, and an instruction which is not the next sequential instruction in program order, but rather is located at the branch target address, is executed. An unconditional control flow instruction is considered an always-taken conditional control flow instruction. There is no condition to test, and execution of instructions always occurs in a different sequence than sequential order.
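The taken and not-taken cases described above may be illustrated by a minimal sketch of next-address selection. The function name next_pc and the fixed 4-byte instruction size are assumptions made for illustration only and are not taken from any particular instruction set.

```python
INSTRUCTION_BYTES = 4  # assumption: fixed-length 4-byte instructions


def next_pc(pc, branch_target, is_conditional, condition_satisfied):
    """Return the address of the next instruction to execute.

    An unconditional control flow instruction is treated as always taken;
    a conditional one is taken only when its condition is satisfied.
    """
    taken = (not is_conditional) or condition_satisfied
    if taken:
        return branch_target           # continue at the branch target address
    return pc + INSTRUCTION_BYTES      # continue with the next sequential instruction
```

For example, under these assumptions a not-taken conditional branch at address 0x100 continues at 0x104, while the same branch, when taken, continues at its branch target address.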
In addition, a branch target address may be specified by an offset, which may be stored in the control flow instruction itself, relative to the program counter (PC) register value. This type of branch target address is referred to as direct. A branch target address may also be specified by a value in a register or memory, wherein the register or memory location may be stored in the control flow instruction. This type of branch target address is referred to as indirect. Further, in an indirect control flow instruction, the register specifying the branch target address may be loaded with different values. If the register specifying the branch target address only stores one value for the corresponding indirect control flow instruction, then the indirect control flow instruction is referred to as monomorphic. If this register may store 2 values for the corresponding indirect control flow instruction, then the indirect control flow instruction is referred to as duomorphic. And if this register may store more than 2 values for the corresponding indirect control flow instruction, then the indirect control flow instruction may be referred to as polymorphic.
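As a hedged illustration of the monomorphic, duomorphic, and polymorphic categories, the sketch below labels each indirect control flow instruction by the number of distinct branch target addresses observed for it in an execution trace. The trace format, a sequence of (branch PC, target address) pairs, and the function name are assumptions. Note that the sketch classifies by observed targets, which may undercount the targets an instruction could take.

```python
from collections import defaultdict


def classify_indirect_branches(trace):
    """Label each indirect branch PC by its number of distinct observed targets.

    trace: iterable of (branch_pc, target_address) pairs (hypothetical format).
    """
    targets = defaultdict(set)
    for pc, target in trace:
        targets[pc].add(target)

    labels = {}
    for pc, seen in targets.items():
        if len(seen) == 1:
            labels[pc] = "monomorphic"
        elif len(seen) == 2:
            labels[pc] = "duomorphic"
        else:
            labels[pc] = "polymorphic"   # more than 2 distinct targets
    return labels
```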
Examples of unconditional indirect control flow instructions include procedure calls and returns that may be used for implementing subroutines in program code, and that may use a Return Address Stack (RAS) to supply the branch target address. Another example is a jump instruction that may be used for case and switch statements in program code. An example of a conditional control flow instruction is a branch instruction that may be used to implement loops in program code.
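One possible behavior of a Return Address Stack may be sketched as follows. The class name, the default depth of 16 entries, and the policy of discarding the oldest entry on overflow are assumptions for illustration; actual implementations may differ.

```python
class ReturnAddressStack:
    """Minimal sketch of a RAS: calls push, returns pop."""

    def __init__(self, depth=16):      # depth is an assumed parameter
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # On a procedure call, record where the matching return should go.
        if len(self.entries) == self.depth:
            self.entries.pop(0)        # assumed policy: drop the oldest entry
        self.entries.append(return_address)

    def predict_return(self):
        # On a return, the most recent call supplies the predicted target.
        # None means the stack is empty and no prediction is available.
        return self.entries.pop() if self.entries else None
```

Because calls and returns nest, the most recently pushed address is the predicted branch target address of the next return.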
A conditional branch instruction is considered taken only if its condition is satisfied. An example of a condition is whether a specified register holds a value of zero. The specified register is encoded in the conditional branch instruction. The value stored in this register may be decremented in a loop by instructions within software application code. The output of the specified register may be input to dedicated zero-detect combinatorial logic. An example of a loop in code may be as follows:
        LW    R5, 0(R3)
loop:   ADD   R8, R8, R2
        SUB   R5, R5, R6
        BNEZ  R5, loop
        LW    R7, 0(R4)
The above loop may implement a FOR loop construct in code where the register R5 holds an index value. The register R5 has its value decremented by a value stored in register R6 during each iteration of the loop. The branch instruction, BNEZ, determines whether the value stored in register R5 is not equal to zero. If the condition is satisfied, that is, R5 holds a non-zero value, the branch is taken and the instruction sequence continues with the instruction at the branch target address designated by “loop” in the branch instruction. Here, the branch target address is a PC-relative address. An immediate field, such as “loop” above, that holds a displacement value may be encoded in the direct branch instruction. The above branch instruction is a conditional direct control flow instruction.
In the taken case, rather than continue a sequential order of the instructions within the application, the taken conditional branch instruction causes execution to occur in a different sequence, such as with the ADD instruction designated with the “loop” label.
If the condition is not satisfied, such as when register R5 above holds a zero value, then the conditional direct branch instruction is considered not-taken. In this case, the instructions within the application continue execution in sequential order, and the load word instruction, LW, loads a value into register R7.
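The taken and not-taken behavior of the BNEZ instruction in the loop above may be traced with a small sketch. The initial register values and the iteration guard are assumptions chosen for illustration; the sketch models only the ADD, SUB, and BNEZ instructions inside the loop.

```python
def run_loop(initial_r5, r6, r2=1, max_iterations=1000):
    """Trace the loop body and record each BNEZ outcome (True = taken).

    initial_r5 stands in for the value loaded by LW R5, 0(R3).
    max_iterations is an assumed guard in case R5 never reaches zero.
    """
    r5, r8 = initial_r5, 0
    outcomes = []
    for _ in range(max_iterations):
        r8 += r2                 # ADD  R8, R8, R2
        r5 -= r6                 # SUB  R5, R5, R6
        taken = (r5 != 0)        # BNEZ R5, loop
        outcomes.append(taken)
        if not taken:
            break                # fall through to LW R7, 0(R4)
    return outcomes, r8
```

With an assumed initial value of 3 in R5 and 1 in R6, the branch is taken on the first two iterations and not-taken on the third.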
Conditional branch instructions may have some dependency on one another. For example, a program may have a simple case such as:
if (value == 0) value = 1;
if (value == 1)
The conditional branch instructions that will be used to implement the above case will have global history that may be used to improve the accuracy of predicting the conditions. The prediction may be implemented by 2-bit counters and is described in more detail later.
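A 2-bit saturating counter of the kind mentioned above may be sketched as follows. The state encoding (0 and 1 predict not-taken, 2 and 3 predict taken) and the initial state are common conventions assumed here for illustration.

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self, state=1):       # assumed start: weakly not-taken
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, counter=None):
    """Predict each outcome in order, then train on it; return the miss count."""
    counter = counter or TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses
```

Because two consecutive mispredictions are needed to flip a strongly biased prediction, a single loop exit does not cause the next entry into the loop to be mispredicted.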
An indirect jump instruction may be used to implement a switch-case statement, which is common in programs written in object-oriented languages such as C++ and Java. An example of a switch-case statement is as follows:
switch (menu) {            // indirect jump
  case 1:                  // branch target address 1
    // 12 instructions
    break;
  case 2:                  // branch target address 2
    // 8 instructions
    break;
  case 3:                  // branch target address 3
    // 4 instructions
    break;
  default:                 // branch target address 4
    break;
}
if (somevalue == 2) {      // merge point of indirect jump
In the above example, the indirect jump instruction has 4 static branch target addresses, which may not be taken in an evenly distributed manner (i.e. each branch target address may not be taken 25% of the time). The above indirect jump may be referred to as a polymorphic indirect unconditional branch instruction. In the above example, the indirect jump only has 4 branch target addresses, but the number of branch target addresses may reach as high as a few dozen.
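The uneven use of branch target addresses may be illustrated with a sketch that models the indirect jump as a table lookup and measures how often each target is selected. The table contents, target names, and sample input values are hypothetical.

```python
from collections import Counter

# Hypothetical jump table for the switch statement above.
JUMP_TABLE = {1: "target_1", 2: "target_2", 3: "target_3"}
DEFAULT_TARGET = "target_4"


def indirect_jump_target(menu):
    """Select the branch target address for a given switch value."""
    return JUMP_TABLE.get(menu, DEFAULT_TARGET)


def target_distribution(menu_values):
    """Return the fraction of jumps that went to each target."""
    counts = Counter(indirect_jump_target(m) for m in menu_values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {target: n / total for target, n in counts.items()}
```

For an assumed input stream in which the value 1 dominates, most jumps go to the first target rather than 25% to each.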
Modern microprocessors may need multiple clock cycles to both determine the outcome of the condition of a branch instruction and to determine the branch target address. For a particular thread being executed in a particular pipeline, no useful work may be performed by the branch instruction or subsequent instructions until the branch instruction is decoded and later both the condition outcome is known and the branch target address is known. These stall cycles decrease the processor's performance.
Rather than stall, predictions may be made of the conditional branch condition and the branch target address shortly after the instruction is fetched. The exact stage as to when the prediction is ready is dependent on the pipeline implementation. In order to predict a branch condition, the PC used to fetch the instruction from memory, such as an instruction cache (i-cache), may be used to index branch prediction logic. One example of a combined prediction scheme that uses the PC is the gselect branch prediction method described in Scott McFarling's 1993 paper, “Combining Branch Predictors”, Digital Western Research Laboratory Technical Note TN-36, incorporated herein by reference in its entirety.
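A gselect-style predictor may be sketched as follows: low-order PC bits are concatenated with global branch history bits to index a table of 2-bit counters. The bit widths, the 4-byte instruction alignment, and the initial counter value are assumptions for illustration and are not parameters taken from the referenced paper.

```python
class GselectPredictor:
    """Table of 2-bit counters indexed by {PC bits, global history bits}."""

    def __init__(self, pc_bits=4, history_bits=4):   # assumed widths
        self.pc_bits = pc_bits
        self.history_bits = history_bits
        self.history = 0                              # global history register
        self.counters = [1] * (1 << (pc_bits + history_bits))

    def index(self, pc):
        # Assumption: 4-byte instructions, so the two low PC bits are dropped.
        pc_part = (pc >> 2) & ((1 << self.pc_bits) - 1)
        return (pc_part << self.history_bits) | self.history

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2     # True means taken

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the resolved outcome into the global history.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

Because the history bits select different counters for the same PC, a branch that alternates between taken and not-taken may be predicted correctly after a short warm-up, which a single 2-bit counter cannot do.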
High branch prediction accuracy contributes to more power-efficient and higher performance microprocessors. Polymorphic indirect unconditional branch (PIUB) instructions are occurring more frequently in application programs due to an increase in object-oriented programs that commonly use this type of instruction. An indirect target buffer is commonly used to make predictions for duomorphic and polymorphic indirect branch instructions, including PIUB instructions. The overhead for an indirect target buffer, similar to a branch target buffer (BTB) used to make predictions for monomorphic branch instructions, is higher than that of predictors for conditional branches. This greater overhead for an indirect target buffer entry or a BTB entry is generally due to the fact that the entry stores a full 32-bit or 64-bit branch target address compared to, for example, a 2-bit counter and a taken or not-taken bit for a conditional branch predictor. Also, the prediction rates for PIUB instructions are lower than the prediction rates for conditional branches. Further, although PIUB instructions contribute to the dynamic code path, their outcomes are not used in branch prediction mechanisms.
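The storage cost and the lower prediction rate for polymorphic targets may be illustrated with a simplified target buffer that predicts the last target seen for each branch. The last-target policy, the entry count, the indexing scheme, and the omission of tag bits are all assumptions made for illustration; they do not describe any particular indirect target buffer design.

```python
class LastTargetBuffer:
    """Simplified target buffer: each entry holds one full target address."""

    def __init__(self, entries=256, address_bits=64):  # assumed sizes
        self.entries = entries
        self.address_bits = address_bits
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries    # assumed 4-byte instructions

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, actual_target):
        self.table[self._index(pc)] = actual_target

    def storage_bits(self):
        # Target storage only; tag and valid bits are ignored in this sketch.
        return self.entries * self.address_bits


def count_hits(buffer, trace):
    """Predict, compare, then train for each (pc, target) pair in the trace."""
    hits = 0
    for pc, target in trace:
        if buffer.predict(pc) == target:
            hits += 1
        buffer.update(pc, target)
    return hits
```

Under these assumptions, 256 entries of 64-bit targets occupy 16,384 bits, compared with 512 bits for 256 2-bit counters, and a branch that cycles among several targets defeats the last-target prediction on every execution.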
In view of the above, efficient methods and mechanisms for using PIUB instructions to improve control flow prediction are desired.