A computer program executing on a computer system usually includes branch instructions, or branches. A conditional branch directs that one of two or more instructions or sets of instructions be executed dependent upon some condition or conditions being met. An unconditional branch directs that a certain instruction or set of instructions always be executed whenever the branch is encountered. Because time and hardware resources are consumed in resolving a branch, that is, in determining which of the available paths will be taken, it is known practice to attempt to predict branch behavior so that the overhead associated with resolving branches can be reduced.
Current branch prediction methods do not always predict the actual branch behaviors, so some mispredictions inevitably occur. Typically, the performance penalty for a branch misprediction is greater than the overhead associated with executing the branch without attempting to predict branch behavior. This is particularly true in modern, pipelined processors. As such processors become faster, pipelines become correspondingly deeper so that a greater number of instructions are in flight at any given time. In the case of a branch misprediction, the deep pipeline, which is filled with incorrect instructions, typically must be completely flushed. Given a pipeline with a depth of seven cycles (where depth is the number of cycles from the start of an instruction execution to the end of instruction execution) a penalty of fifteen cycles could be incurred for a misprediction. This includes a minimum penalty of seven cycles for the depth of the pipeline with additional cycles necessary to save and restore relevant processor states.
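By way of illustration only, the penalty arithmetic described above can be sketched in C. The function names are hypothetical. The eight cycles attributed to saving and restoring processor state are an inference (the fifteen-cycle total less the seven-cycle pipeline depth); the text gives only the total.

```c
#include <assert.h>

/* Cycles lost to one misprediction: the pipeline is flushed (one cycle
 * per stage of depth) and processor state is saved and restored.
 * Both inputs are illustrative; real values are processor specific. */
int misprediction_penalty(int pipeline_depth, int state_cycles)
{
    assert(pipeline_depth >= 0 && state_cycles >= 0);
    return pipeline_depth + state_cycles;
}

/* Total cycles lost when `mispredicted` branches each pay `penalty`. */
long total_penalty_cycles(long mispredicted, int penalty)
{
    assert(mispredicted >= 0 && penalty >= 0);
    return mispredicted * penalty;
}
```

Under these assumptions, misprediction_penalty(7, 8) gives the fifteen cycles mentioned above, and ten such mispredictions cost one hundred fifty cycles.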
Methods that do not involve branch prediction have also been used to attempt to reduce the overhead associated with branch instructions. One such method is scheduling branch delays. In this method the delays associated with executing a branch instruction are simply filled with other instructions to be executed in the interim. As processors become faster and pipelines deeper, however, branch delays become longer and the amount of code required to schedule sufficient activity to fill the delay becomes prohibitive.
Another method used to attempt to reduce branch delay is that of annotating instructions in the instruction cache. In this method, information constituting hints about branch behavior is annotated in a cache that stores instructions. Such hints include: successor block and/or line information; whether or not there is a branch in the line; any potential for misprediction; and if there is a branch in the line, where the branch went on the previous execution. The instruction cache annotation method creates a direct record of branch behavior and is costly in terms of hardware. Another disadvantage of this method is that, although it is fairly successful in the case of rhythmic behavior, it is not appreciably successful when application program behavior is relatively unpredictable.
Branch target buffers (BTBs) are small storage devices used in another technique of predicting branch behavior. BTBs are essentially small storage tables that store full branch addresses and, for each full branch address, a history of associated branch behavior that is collected over time where history indicates whether a branch was taken or not taken in the past. In the ideal case each time a branch is encountered in the execution of a program the full branch address would be stored in the BTB along with the associated history. Ideally, a BTB of unlimited size stores an unlimited number of branch addresses and upon each subsequent encounter with a particular branch the branch history is used to predict branch behavior. In reality, however, it is typically not economically feasible to devote enough hardware to a BTB to achieve this ideal case. Therefore, typical BTBs store approximately sixty-four (64) branch addresses with their associated histories. Because a finite number of branch addresses are stored at one time, branch addresses are displaced over time and may not always be found in the BTB.
An index into a BTB is typically generated by using a certain number of bits of the branch address as the index. When a branch is encountered in execution of the application program, the index is used as a lookup into the BTB in an attempt to find the branch address corresponding to the application program branch. Because only some of the address bits form the index, there is a many-to-one relationship between branch addresses and BTB indices. For this reason, one of the problems experienced with typical BTBs is that of address collision. Address collision occurs when a branch address encountered in the application program has the same index as a different branch address previously stored in the BTB. In the case of a collision, although the lookup operation is successful, the subsequent comparison of the actual program branch address with the branch address stored in the BTB reveals that a collision has occurred and that the associated history stored in the BTB is not the desired history. Another problem encountered in the use of typical BTBs is that of branch context collision. Context as used herein means the address from which the current function was called. In the case of context collision, although the comparison of the branch address of the program with the branch address in the BTB reveals a match, the history in the BTB is inappropriate. Context collision can occur because more than one line of the application program may reach a single branch address, with the branch behaving differently in each context. Table 1 illustrates this case.
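The index generation, lookup, and address comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular hardware: the table holds the sixty-four entries mentioned above, the index is taken from the low-order six bits of the branch address, and the history is a single byte. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BTB_ENTRIES    64u                  /* table size cited in the text */
#define BTB_INDEX_MASK (BTB_ENTRIES - 1u)   /* assumed: low-order 6 bits    */

typedef enum { BTB_MISS, BTB_HIT, BTB_ADDRESS_COLLISION } btb_result;

typedef struct {
    bool     valid;
    uint32_t branch_addr;   /* full branch address stored with the entry */
    uint8_t  history;       /* assumed encoding: 1 = taken, 0 = not taken */
} btb_entry;

typedef struct { btb_entry entries[BTB_ENTRIES]; } btb;

/* Many branch addresses share one index: only some address bits are used. */
unsigned btb_index(uint32_t branch_addr)
{
    return branch_addr & BTB_INDEX_MASK;
}

/* Store a branch, displacing whatever previously occupied the slot. */
void btb_store(btb *t, uint32_t branch_addr, uint8_t history)
{
    assert(t != NULL);
    btb_entry *e = &t->entries[btb_index(branch_addr)];
    e->valid = true;
    e->branch_addr = branch_addr;
    e->history = history;
}

/* Look up by index, then compare the stored full address with the
 * program's branch address to detect an address collision. */
btb_result btb_lookup(const btb *t, uint32_t branch_addr, uint8_t *history_out)
{
    assert(t != NULL && history_out != NULL);
    const btb_entry *e = &t->entries[btb_index(branch_addr)];
    if (!e->valid)
        return BTB_MISS;
    if (e->branch_addr != branch_addr)
        return BTB_ADDRESS_COLLISION;   /* slot holds another branch's history */
    *history_out = e->history;
    return BTB_HIT;
}
```

With this choice of index bits, addresses 0x4560 and 0x45A0 both map to index 32, so a lookup of 0x45A0 after 0x4560 has been stored reports an address collision. Note that the address comparison cannot detect a context collision, because in that case the stored address matches.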
TABLE 1
______________________________________
foo(){
          """
          """
line 23   goo(value 1);   /* this call returns to next line, address 0x1234 */
          """
          """
line 33   goo(value 2);   /* this call returns to next line, address 0x2348 */
          """
          """
}
goo(VARIABLE){
          """
          """
          if(value 1){    /* mispredicted branch */
                          /* address 0x4560 */
          """
          """
          } else {
          """
          """
          }
}
______________________________________
Table 1 shows pseudocode describing the function foo(). In the example of Table 1, function foo() calls function goo() from lines 23 and 33 with return addresses 0x1234 and 0x2348, respectively, where "0x" denotes a hexadecimal number. The call that returns to 0x1234 accounts for thirty percent of all calls to goo(). The call that returns to 0x2348 accounts for seventy percent of all calls to goo(). The call at line 23 passes arguments/context that require the first branch in goo(), at 0x4560, to be taken. The call at line 33 passes arguments/context that require the first branch in goo() to be not taken. If the goo() branch is predicted based solely upon bits from the branch address, as in the usual method, this branch is likely to be mispredicted, because both of the differing behaviors have to be recorded and reconciled in the history portion of the BTB.
As shown in Table 1, seventy percent of calls to goo() will pass value 2 as VARIABLE. As shown, if VARIABLE is not value 1 the branch is mispredicted. In this example, context collision will occur seventy percent of the time.
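The effect of a single shared history can be illustrated with a short simulation. The sketch below rests on several assumptions that the text does not specify: the history is modeled as a 2-bit saturating counter, the calls arrive in a fixed repeating order with three of every ten from line 23 and seven from line 33, and all names are hypothetical. The count it produces depends on those assumptions and is not intended to reproduce the seventy percent figure given above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed history scheme: 2-bit saturating counter with values 0..3,
 * where 2 or 3 predicts "taken". */
typedef struct { int counter; } history;

bool predict(const history *h)
{
    return h->counter >= 2;
}

void update(history *h, bool taken)
{
    if (taken && h->counter < 3)
        h->counter++;
    if (!taken && h->counter > 0)
        h->counter--;
}

/* Simulate `calls` calls to goo(). Every call executes the same branch
 * address (0x4560), so every call reads and updates one shared history. */
int shared_history_mispredictions(int calls)
{
    /* Assumed call order: 1 = called from line 23, 0 = from line 33. */
    static const bool from_line_23[10] = {1, 0, 0, 1, 0, 0, 1, 0, 0, 0};
    history h = {0};
    int misses = 0;

    assert(calls >= 0);
    for (int i = 0; i < calls; i++) {
        bool taken = from_line_23[i % 10];   /* line 23 => branch taken */
        if (predict(&h) != taken)
            misses++;
        update(&h, taken);
    }
    return misses;
}
```

In this model the two contexts continually overwrite each other's history, so the shared entry never settles on a prediction that is correct for both callers.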
Another possible situation that occurs with the use of a BTB is that of a lookup miss. In the case of a lookup miss, the index does not look up any entry in the BTB. Various alternate schemes are typically used in such a case, for example, always assuming that the branch will be taken. Lookup misses and context collisions both cause performance degradation. For most processors, however, context collisions are significantly more costly in terms of performance degradation than are lookup misses. This is because in the case of a lookup miss, the instruction prompting the lookup is typically not executed until the data sought in the lookup is actually found. Therefore, in the case of a lookup miss, the pipeline is not filled with data that later needs to be flushed. In the case of a misprediction, on the other hand, instructions are executed with inappropriate data, making it necessary to flush a potentially deep pipeline and restore previous processor states before reexecuting with appropriate data.
Another conventional method that can be used in combination with a BTB is inlining of functions. In the case of inlining, complete copies of the function code, for instance, the code for function foo() as shown in Table 1, are copied into the main body of the application program at each place the function is called. Therefore, the overhead associated with calling a function is reduced. Another effect of inlining is that one occurrence of a function in a particular section of code will have a different branch address from another occurrence of the function in another section of the code. By distributing copies of the function (and thus, copies of the branch) to the different contexts from which it was called, the branches can be predicted separately and in many cases more accurately. For this reason, context collision as described above with respect to Table 1 may be reduced. Commonly, however, inlining is prohibitively costly because of the growth of the application code that results from inserting complete functions in place of function calls. Another disadvantage of inlining is that it requires special compilation or source manipulation by a user program. Also, if a poorly predicted branch resides in a precompiled library, this approach may be impossible.
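The benefit of giving each inlined copy its own branch address can be illustrated as follows. This is a sketch under assumptions only: each copy of the branch is given its own history, modeled as a 2-bit saturating counter; the calls from lines 23 and 33 of Table 1 arrive in a fixed repeating order, three and seven of every ten respectively; and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed history scheme: 2-bit saturating counter with values 0..3,
 * where 2 or 3 predicts "taken". */
typedef struct { int counter; } history;

bool predict(const history *h)
{
    return h->counter >= 2;
}

void update(history *h, bool taken)
{
    if (taken && h->counter < 3)
        h->counter++;
    if (!taken && h->counter > 0)
        h->counter--;
}

/* Simulate `calls` executions of the branch after inlining. The copy
 * inlined at line 23 and the copy inlined at line 33 have different
 * branch addresses, so each copy keeps its own history. */
int per_copy_mispredictions(int calls)
{
    /* Assumed call order: 1 = copy at line 23, 0 = copy at line 33. */
    static const bool at_line_23[10] = {1, 0, 0, 1, 0, 0, 1, 0, 0, 0};
    history copy_23 = {0};
    history copy_33 = {0};
    int misses = 0;

    assert(calls >= 0);
    for (int i = 0; i < calls; i++) {
        bool line_23 = at_line_23[i % 10];
        history *h = line_23 ? &copy_23 : &copy_33;
        bool taken = line_23;                /* line 23 => branch taken */
        if (predict(h) != taken)
            misses++;
        update(h, taken);
    }
    return misses;
}
```

In this model the only mispredictions occur while the counter for the line 23 copy warms up; thereafter each copy is predicted correctly because its history reflects a single context. The sketch does not account for the code growth that inlining causes.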