1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a system and method for optimizing the branch logic of a processor to improve handling of hard-to-predict indirect branches.
2. Description of Related Art
In modern superscalar processors, branch predictors are crucial to achieving high performance when executing program code. A branch predictor is a functional unit of a processor that determines whether a conditional branch in the instruction flow of a program is likely to be taken or not. Branch predictors allow processors to fetch and execute instructions without waiting for a branch to be resolved.
There are a number of different types of branch predictors utilized in various microprocessor designs. One such branch predictor is a branch target predictor. A branch target predictor is a functional unit of a processor that predicts the target of a conditional branch, or unconditional jump instruction, before that instruction has been fetched from the instruction cache. Branch target prediction is not the same as branch prediction. Branch prediction attempts to guess whether the branch will be taken or not. Branch target prediction attempts to guess the target of the branch or unconditional jump before the target is computed by decoding the instruction itself. Essentially, the branch target predictor uses a branch target address cache to predict the target of a branch given the address of that branch.
Many modern processors invest heavily in branch prediction mechanisms such as those discussed above to help mitigate the effects of the long instruction execution pipelines required by high frequency processor designs. For example, in the PowerPC family of processors, available from International Business Machines Corporation of Armonk, N.Y., such as the Power4 processor, up to eight instructions may be fetched from the instruction cache in each processor cycle, with branch prediction logic scanning the fetched instructions for up to two branches per cycle (see “Power4 System Microarchitecture,” Tendler et al., Technical White Paper, October 2001, available at www.ibm.com). Depending upon the branch type found, various branch prediction mechanisms engage to help predict the branch direction, the target address of the branch, or both. Branch direction for unconditional branches is not predicted. All conditional branches are predicted, even if the condition register bits upon which they depend are known at instruction fetch time.
As branch instructions flow through the pipeline of the processor, and ultimately execute in the branch execution unit of the processor, the actual outcome of the branches is determined. At that point, if the predictions were found to be correct, the branch instructions are simply completed like all other instructions. In the event that a prediction is found to be incorrect, the instruction fetch logic of the processor causes the mispredicted instructions to be discarded and starts re-fetching instructions along the corrected path.
The Power4 processor uses a set of three branch history tables to predict the direction of branch instructions. The first table, referred to as the local predictor, is similar to a traditional branch history table (BHT). The local predictor is a 16K entry array indexed by the branch instruction address producing a 1-bit predictor that indicates whether the branch direction should be taken or not.
The second table, referred to as the global predictor, predicts the branch direction based on the actual path of execution to reach the branch. The path of execution is identified by an 11-bit vector, one bit per group of instructions fetched from the instruction cache for each of the previous eleven fetch groups. This vector is referred to as the global history vector. Each bit in the global history vector indicates whether the next group of instructions fetched is from a sequential cache sector or not. The global history vector captures this information for the actual path of execution through these sectors. That is, if there is a redirection of instruction fetching, some of the fetched groups of instructions are discarded and the global history vector is immediately corrected. The global history vector is hashed with the address of the branch instruction using a bitwise exclusive OR. The result indexes into a 16K entry global history table to produce another 1-bit branch direction predictor. Similar to the local predictor, this 1-bit global predictor indicates whether the branch should be predicted to be taken or not.
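The indexing scheme described above can be illustrated with a minimal Python sketch. The 11-bit vector length, the exclusive OR hash, and the 16K entry table size come from the description; the bit polarity (1 meaning sequential), the choice of address bits, and the function names `update_ghv` and `global_index` are assumptions made only for illustration.

```python
GHV_BITS = 11               # length of the global history vector (from the text)
TABLE_ENTRIES = 16 * 1024   # 16K entry global history table (from the text)


def update_ghv(ghv, sequential):
    """Shift in one bit for the newest fetch group.

    Assumed polarity: 1 = next group came from a sequential cache sector.
    The oldest bit falls off the top so the vector stays 11 bits wide.
    """
    bit = 1 if sequential else 0
    return ((ghv << 1) | bit) & ((1 << GHV_BITS) - 1)


def global_index(branch_addr, ghv):
    """Hash the branch address with the history vector using XOR.

    Assumption: the two low-order address bits are dropped (4-byte
    instructions) and the next 14 bits are used, giving a 16K range.
    """
    addr_bits = (branch_addr >> 2) & (TABLE_ENTRIES - 1)
    return addr_bits ^ ghv


ghv = 0
ghv = update_ghv(ghv, True)    # sequential fetch group
ghv = update_ghv(ghv, False)   # redirected fetch group
index = global_index(0x1000, ghv)
```

Because the history vector participates in the index, the same branch address can select different table entries depending on the path taken to reach it.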
Finally, a third table, referred to as the selector table, keeps track of which of the two prediction schemes works better for a given branch and is used to select between the local and global predictions. The 16K entry selector table is indexed exactly the same way as the global history table to produce the 1-bit selector. This combination of branch prediction tables has been shown to produce very accurate predictions across a wide range of workload types.
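The interaction of the three tables can be sketched as follows. This is an illustrative model only, not the Power4 implementation: the table sizes, the 1-bit entries, and the shared indexing of the selector and global tables follow the description, while the update policy, the selector encoding, the address bits used, and the name `CombinedPredictor` are assumptions.

```python
ENTRIES = 16 * 1024  # each table has 16K one-bit entries (from the text)


class CombinedPredictor:
    """Toy model of a local/global/selector direction predictor."""

    def __init__(self):
        self.local = [0] * ENTRIES     # 1 = predict taken
        self.glob = [0] * ENTRIES      # 1 = predict taken
        self.selector = [0] * ENTRIES  # assumed encoding: 1 = trust global

    def _indices(self, addr, ghv):
        # Assumption: drop the two low address bits, keep 14 index bits.
        local_idx = (addr >> 2) % ENTRIES
        # Selector and global table share the XOR-hashed index.
        global_idx = local_idx ^ (ghv & 0x7FF)
        return local_idx, global_idx

    def predict(self, addr, ghv):
        li, gi = self._indices(addr, ghv)
        if self.selector[gi]:
            return bool(self.glob[gi])
        return bool(self.local[li])

    def update(self, addr, ghv, taken):
        li, gi = self._indices(addr, ghv)
        local_ok = bool(self.local[li]) == taken
        global_ok = bool(self.glob[gi]) == taken
        # Assumed policy: move the selector only when exactly one
        # of the two component predictions was correct.
        if local_ok != global_ok:
            self.selector[gi] = 1 if global_ok else 0
        self.local[li] = int(taken)
        self.glob[gi] = int(taken)


predictor = CombinedPredictor()
for _ in range(3):
    predictor.update(0x2000, 0b1, True)    # taken after history 0b1
    predictor.update(0x2000, 0b0, False)   # not taken after history 0b0
```

In this hypothetical trace the branch direction depends on the preceding path, so the local table alternates and mispredicts while the selector learns to prefer the global table for both history values.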
As branch instructions are executed and resolved, the branch history tables and other predictors are updated to reflect the latest and most accurate information. Dynamic branch prediction can be overridden by software, such as in cases where software can predict better than the hardware which branches will be taken. Such overriding of the hardware may be accomplished by setting two bits in conditional branch instructions, one to indicate a software override and the other to predict the direction. When these two bits are zero, hardware branch prediction is utilized.
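The override decision described above may be sketched as a small Python function. The two bits and their roles come from the description; the function name `effective_direction` and the handling of the case where the override bit is clear but the direction bit is set (which the description does not address) are assumptions.

```python
def effective_direction(override_bit, direction_bit, hardware_prediction):
    """Return True if the branch should be predicted taken.

    override_bit:        1 when software overrides the hardware predictor
    direction_bit:       software's predicted direction (1 = taken)
    hardware_prediction: direction chosen by the dynamic predictor
    """
    if override_bit == 0:
        # Assumption: whenever the override bit is clear, defer to the
        # hardware, regardless of the direction bit.
        return hardware_prediction
    return direction_bit == 1
```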
The Power4 processor microarchitecture supports a number of different types of branch instructions including the branch to link register (bclr) and branch to count register (bcctr) instructions. The bcctr instruction, for one, is an instruction for conditionally branching to an instruction specified by the branch target address contained within a count register of the processor. The count register is a special purpose register (SPR) of the processor that can be used to hold a loop count that can be decremented during execution of branch instructions and can also be used to provide a branch target address for the bcctr instructions. Branch target addresses for the bclr and bcctr instructions can be predicted using a hardware implemented link stack and count cache mechanism, respectively. Target addresses for absolute and relative branches are computed directly as part of a branch scan function.
As mentioned above, the Power4 processor uses a link stack to predict the target address for a branch to link instruction that it believes corresponds to a subroutine return. By setting hint bits in a branch to link register (bclr) instruction, software communicates to the processor whether a branch to link register (bclr) instruction represents a subroutine return, a target address that is likely to repeat, or neither.
When the instruction fetch logic of the processor fetches a branch and link instruction (either conditional or unconditional) predicted as taken, it pushes the address of the next instruction onto the link stack. When it fetches a bclr instruction with a “taken” prediction and with hint bits indicating a subroutine return, the link stack is popped and instruction fetching starts from the popped address. In order to preserve the integrity of the link stack in the face of mispredicted branch target link instructions, the Power4 processor employs extensive speculation tolerance mechanisms in its link stack implementation to allow recovery of the link stack under most circumstances.
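The push and pop behavior of a link stack can be sketched as follows, assuming that the push is triggered by a branch that sets the link register (a subroutine call) and the pop by a bclr instruction hinted as a subroutine return. The stack depth, the overflow behavior, the 4-byte instruction size, and all names are hypothetical; the speculation recovery mechanisms mentioned above are not modeled.

```python
from collections import deque


class LinkStack:
    """Toy return-address predictor."""

    def __init__(self, depth=16):
        # Assumption: fixed depth; the oldest entry is dropped on overflow.
        self._stack = deque(maxlen=depth)

    def on_branch_and_link(self, branch_addr, predicted_taken):
        """A call predicted taken pushes the address of the next instruction."""
        if predicted_taken:
            self._stack.append(branch_addr + 4)  # assumes 4-byte instructions

    def on_bclr(self, predicted_taken, hint_is_return):
        """Return the predicted fetch address, or None if no prediction."""
        if predicted_taken and hint_is_return and self._stack:
            return self._stack.pop()
        return None


stack = LinkStack()
stack.on_branch_and_link(0x1000, True)   # outer call
stack.on_branch_and_link(0x2000, True)   # nested call
inner_return = stack.on_bclr(True, True)
outer_return = stack.on_bclr(True, True)
```

Nested calls return in last-in, first-out order, which is why a stack rather than a single register is used for the prediction.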
The target address of a branch to count register (bcctr) instruction is often repetitive. This is also true for some of the bclr instructions that are not predictable through the use of the link stack (because they do not correspond to a subroutine return). By setting the hint bits appropriately, software communicates to the hardware whether the target address for such a branch is repetitive. In these cases, the Power4 processor uses a 32 entry, tagless, direct mapped cache, referred to as the count cache, to predict the repetitive targets, as indicated by the software hints. Each entry in the count cache can hold a 62-bit address. When a bclr or bcctr instruction is executed for which the software indicates that the target is repetitive, and therefore predictable, the target address is written into the count cache. When such an instruction is fetched, the target address is predicted using the count cache. That is, the count cache stores the target address for previously encountered bcctr instructions so that if the same indirect branch instruction is encountered later, the prediction is that the indirect branch instruction will branch to the same target address.
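A tagless, direct mapped count cache of this kind can be modeled with a short Python sketch. The 32 entry size, the write on execution, and the lookup on fetch follow the description; the index function (low-order instruction address bits) and the names used are assumptions.

```python
class CountCache:
    """Toy model of a 32 entry, tagless, direct mapped target cache."""

    ENTRIES = 32  # from the text

    def __init__(self):
        self.targets = [None] * self.ENTRIES

    def _index(self, branch_addr):
        # Assumption: index with the address bits just above the
        # 4-byte instruction offset.
        return (branch_addr >> 2) % self.ENTRIES

    def predict(self, branch_addr):
        """At fetch: return the stored target, or None if never written."""
        return self.targets[self._index(branch_addr)]

    def update(self, branch_addr, actual_target, hint_repetitive):
        """At execute: record the target only if software hinted it repeats."""
        if hint_repetitive:
            self.targets[self._index(branch_addr)] = actual_target


cache = CountCache()
cache.update(0x100, 0x5000, hint_repetitive=True)
predicted = cache.predict(0x100)
aliased = cache.predict(0x100 + 4 * CountCache.ENTRIES)
```

Because the entries carry no tag, a different branch that maps to the same entry (here the hypothetical address 0x180) receives the same predicted target, whether or not it is correct for that branch.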
In known PowerPC microarchitectures, the count cache is used as the sole mechanism to predict bcctr instructions. However, there are significant cases where count cache based prediction does not generally result in a correct prediction. Examples include computed branches (function pointers), which are used most frequently in object oriented code, and case or switch statements, which use a branch table to jump to a desired code section. Such branches are hard to predict, i.e. the target address of such a branch is typically either not found in the count cache or the target address held in the count cache is incorrect.
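The weakness of a last-target scheme for such branches can be illustrated with a small Python sketch that models a single count cache entry serving a single indirect branch. The target addresses and the function name `last_target_hit_rate` are hypothetical and chosen only for illustration.

```python
def last_target_hit_rate(targets):
    """Fraction of correct predictions when always predicting the last target."""
    predicted = None   # entry is empty before the first execution
    hits = 0
    for actual in targets:
        if predicted == actual:
            hits += 1
        predicted = actual  # entry is rewritten with the resolved target
    return hits / len(targets)


# A branch that always goes to the same place, e.g. one call target.
repetitive = [0x5000] * 8
# A branch whose target alternates, e.g. a virtual call on mixed object types.
alternating = [0x5000, 0x6000] * 4

repetitive_rate = last_target_hit_rate(repetitive)
alternating_rate = last_target_hit_rate(alternating)
```

In this model the repetitive branch mispredicts only on its first execution, while the alternating branch is never predicted correctly because the stored target is always the previous one.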
In addition, with known PowerPC microarchitectures, the processor design requires a “bubble” of a predetermined number of cycles, such as a 4 cycle “bubble,” between dispatching the move to count register (mtctr) instruction and its dependent bcctr instruction. That is, as mentioned above, the count register stores the branch target address for the bcctr instructions. The target address must be loaded into the count register from the general purpose registers for use when executing the bcctr instruction. The mtctr instruction is used to move the branch target address from a general purpose register to the count register for use in executing the bcctr instruction. The 4 cycle “bubble” is used to ensure that the data representing the branch target address, which is moved by the mtctr instruction, is in the count register before the bcctr instruction executes. This requirement for a 4 cycle bubble between the mtctr instruction and the bcctr instruction causes additional execution latency.