1. Technical Field
The present invention relates generally to computer processing systems and, in particular, to a method and apparatus for predicting the target of a subroutine return branch in a computer processing system. The present invention may be employed in the case of conventional subroutines, nested subroutines, foliated subroutines, and in the case of subroutine invocations through stubs (such as, for example, in the cases of virtual method invocation or dynamic library procedure invocation).
2. Background Description
Early microprocessors generally processed instructions one at a time. Each instruction was processed using four sequential stages: instruction fetch; instruction decode; instruction execute; and result writeback. Within such microprocessors, different dedicated logic blocks performed each different processing stage. Each logic block waited until all the preceding logic blocks completed operations before beginning its operation.
Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. One form of parallel processing relates to the recent introduction of microprocessors of the "superscalar" type, which can effect parallel instruction computation. Typically, superscalar microprocessors have multiple execution units (e.g., multiple integer arithmetic logic units (ALUs)) for executing instructions and, thus, have multiple "pipelines". As such, multiple machine instructions may be executed simultaneously in a superscalar microprocessor, providing obvious benefits in the overall performance of the device and its system application.
For the purposes of this discussion, latency is defined as the delay between the fetch stage of an instruction and the execution stage of the instruction. Consider an instruction which references data stored in a specified register. Such an instruction requires at least four machine cycles to complete. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the instruction is executed and, in the fourth cycle, data is written back to the appropriate location.
To improve efficiency and reduce instruction latency, microprocessor designers overlapped the operations of the fetch, decode, execute, and writeback logic stages such that the microprocessor operated on several instructions simultaneously. In operation, the fetch, decode, execute, and writeback logic stages concurrently process different instructions. At each clock pulse the result of each processing stage is passed to the subsequent processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and writeback stages are known as "pipelined" microprocessors. In principle, a pipelined microprocessor can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Thus, it is evident that the effects of the latency time are reduced in pipelined microprocessors by initiating the processing of a second instruction before the actual execution of the first instruction is completed.
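By way of illustration only, the cycle counts implied by the four-stage model above can be worked out with a short sketch. The function names are hypothetical, and the pipelined count assumes an ideal pipeline with no branches or stalls, which is the "known sequence of instructions" case described above.

```python
STAGES = 4  # fetch, decode, execute, writeback, as listed in the text


def cycles_sequential(n_instructions, stages=STAGES):
    """Non-overlapped processing: each instruction uses every stage alone."""
    return n_instructions * stages


def cycles_pipelined(n_instructions, stages=STAGES):
    """Ideal pipeline: the first instruction needs `stages` cycles, and each
    later instruction completes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


# For 100 instructions the overlapped design approaches one instruction
# per cycle: 103 cycles in total instead of 400.
sequential_total = cycles_sequential(100)
pipelined_total = cycles_pipelined(100)
```

The per-instruction latency is still four cycles in both cases; only the throughput changes.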
In general, instruction flow in a microprocessor requires that instructions be fetched and decoded from sequential locations in a memory. Unfortunately, computer programs also include branch instructions. A branch instruction is an instruction that causes a disruption in this flow; for example, a taken branch causes decoding to be discontinued along the sequential path and resumed at a new location in memory. The new location in memory may be referred to as a target address of the branch. Such an interruption in pipelined instruction flow results in a substantial degradation in pipeline performance.
There are various types of branch instructions. One type of branch instruction is known as an unconditional branch in that it unconditionally transfers control from the branch instruction to the target instruction. That is, at the time that the branch instruction is decoded, it is known that the transfer of control to the target instruction will take place. Examples of unconditional branches include subroutine CALL/RETURN and GOTO. In terms of performance, a more costly branch instruction is known as a conditional branch. This instruction specifies that control is to be transferred to the target instruction only if some condition, as determined by the outcome of a previous instruction, is met. Examples of conditional branch constructs include the DO LOOP and the IF/THEN/ELSE.
Subroutine linkage typically involves a call to a subroutine and a return from the subroutine back to the instruction immediately following the call. Usually, the call is done through a branch instruction which saves the address to return to in a register, while the return is done by branching indirectly through the contents of this register. For example, in the PowerPC, the branch-and-link instruction (BL) is used for the call. This instruction saves the address of the immediately following instruction in a special register referred to as the link register. The branch-using-link-register instruction (BCLR) is used to return from the subroutine through the contents of the link register. In the System/390, the corresponding instructions are BAL or BALR for the call, and BR for the return. In this case, the link information is kept in a general purpose register that is specified with the instruction, instead of in the link register.
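The link-register style of linkage described above may be modeled, purely as a sketch, as follows. The class and method names are hypothetical and do not correspond to any real simulator; a fixed 4-byte instruction size is assumed, as in the PowerPC.

```python
INSTRUCTION_SIZE = 4  # assumption: fixed-width 4-byte instructions


class LinkRegisterMachine:
    """Toy model with only a program counter and a link register."""

    def __init__(self, pc=0):
        self.pc = pc
        self.link_register = None

    def branch_and_link(self, target):
        # Call: save the address of the immediately following instruction,
        # then transfer control to the subroutine.
        self.link_register = self.pc + INSTRUCTION_SIZE
        self.pc = target

    def branch_to_link_register(self):
        # Return: branch indirectly through the saved address.
        self.pc = self.link_register


machine = LinkRegisterMachine(pc=0x100)
machine.branch_and_link(0x2000)    # call the subroutine at 0x2000
machine.branch_to_link_register()  # return to 0x104
```

Note that the return carries no target of its own; the target exists only as the contents of the register, which is why it is not available early in the pipeline.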
Subroutines pose a problem for heavily pipelined computers (those with many stages in the pipeline). Although the instruction which calls a subroutine will contain enough information to determine which is the next instruction to enter the pipeline (i.e., the first instruction in the called subroutine), the return instruction in the subroutine will not contain such information. Instead, a return instruction must pass through all of the stages of the pipeline before the return address will be known from the return instruction. If the computer waited for the return instruction to pass through the pipeline before entering another instruction, there would then be a "bubble" in the pipeline behind the return instruction in which there would be no instructions, thereby lowering the performance of the computer.
To help alleviate the penalty due to the latency of pipelines, many pipelined microprocessors use branch prediction mechanisms that predict the existence and the outcome (i.e., taken or not taken) of branch instructions within an instruction stream. The instruction fetch unit uses the branch predictions to fetch subsequent instructions.
When a branch prediction mechanism predicts the outcome of a branch instruction and the microprocessor executes subsequent instructions along the predicted path, the microprocessor is said to have "speculatively executed" along the predicted instruction path. During speculative execution the microprocessor is performing useful processing if the branch instruction was predicted correctly.
However, if the branch prediction mechanism mispredicted the branch instruction, the microprocessor is executing instructions down the wrong path and therefore accomplishes nothing. When the microprocessor eventually detects the mispredicted branch, the microprocessor must flush the instructions that were speculatively fetched from the instruction pipeline and restart execution at the correct address. The effect of the above-described non-sequential operation, and of the resultant flushing of the pipeline, is exacerbated in the case of superscalar pipelined microprocessors. For example, if a branch or other interruption in the sequential instruction flow of the microprocessor occurs, the number of lost pipeline slots, or lost execution opportunities, is multiplied by the number of parallel execution units (i.e., parallel pipelines). The performance degradation due to branches and corresponding non-sequential program execution is therefore amplified in superscalar pipelined microprocessors.
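The multiplication of lost execution opportunities described above can be illustrated with a short calculation. The function name and the numbers used (a 5-cycle flush and a 4-wide machine) are assumptions chosen only for the example.

```python
def lost_issue_slots(flush_cycles, parallel_pipelines):
    """Execution opportunities lost when the pipeline is flushed.

    Every flushed cycle wastes one slot in each parallel pipeline.
    """
    return flush_cycles * parallel_pipelines


# Hypothetical numbers: the same 5-cycle flush on a scalar machine
# and on a machine with 4 parallel pipelines.
scalar_loss = lost_issue_slots(5, 1)
superscalar_loss = lost_issue_slots(5, 4)
```

Under these assumptions the same misprediction costs 5 slots on the scalar machine and 20 slots on the superscalar machine.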
Prediction of subroutine return branches is usually more difficult than the prediction of most other branches, because the same branch instruction could have different targets corresponding to the different points of subroutine invocation. The instructions used for calls and returns are not unique; different instructions are used in different instances to perform these functions. Moreover, these instructions may be used for purposes other than subroutine calls and returns. This makes it difficult to use simple stack-based schemes for predicting returns.
Prediction techniques have included the use of Branch History Tables (BHTs), Branch Target Buffers (BTBs), and return address stacks. In its simplest form, a BHT maintains the outcomes of previously executed branches. The table is accessed by the instruction prefetch unit, which uses it to decide whether prefetching should be redirected or not. The table is searched for a valid entry, just as a cache is searched. The table is typically set-associative, as is the case with many cache organizations. An entry is only added to the table when a taken branch is executed by the processor. On each BHT hit, the historical information in that entry is used by the prediction algorithm. The algorithm redirects prefetching for a taken prediction, or continues with the next sequential instruction for a not-taken prediction. Some implementations invalidate the entry when the branch changes to not taken. In this case, a BHT miss will occur subsequently, and next-sequential prefetching will ensue. If the prediction is wrong, the processor must be equipped with a back-out strategy to restore the necessary state.
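A minimal sketch of the simplest BHT behavior described above is given below. It is not a description of any particular implementation: the class name is hypothetical, a dictionary stands in for the set-associative lookup, each entry holds only the last taken target, and the variant that invalidates an entry on a not-taken outcome is assumed.

```python
class SimpleBranchHistoryTable:
    """Toy BHT: holds entries only for branches last seen taken."""

    def __init__(self):
        self._entries = {}  # branch address -> target of the taken branch

    def predict(self, branch_addr, next_sequential):
        """Return the address from which prefetching should continue.

        A hit redirects prefetching to the stored target; a miss falls
        through to the next sequential instruction.
        """
        return self._entries.get(branch_addr, next_sequential)

    def update(self, branch_addr, taken, target=None):
        """Record the executed outcome of a branch."""
        if taken:
            # Entries are added only when a taken branch is executed.
            self._entries[branch_addr] = target
        else:
            # Assumed variant: invalidate when the branch becomes not taken.
            self._entries.pop(branch_addr, None)


bht = SimpleBranchHistoryTable()
first = bht.predict(0x40, next_sequential=0x44)   # miss: sequential path
bht.update(0x40, taken=True, target=0x80)
second = bht.predict(0x40, next_sequential=0x44)  # hit: redirect to 0x80
```

The back-out strategy needed after a wrong prediction is outside the scope of this sketch.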
Thus, stated generally, a BHT stores past actions and targets of branches, and predicts that future behavior will repeat. However, while past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target difficult.
Conventional BTBs are cache-like buffers that are used in the fetch units of microprocessors to store an identifier of a previously performed branch instruction as a tag, along with the target address (i.e., the address to which the branch points in its predicted state) and an indication of the branch's history. Upon subsequent fetches of the branch, the target address is used (depending on the branch history) as the next address to fetch in the pipeline. Upon execution of the branch instruction itself, the target address is compared against the actual next instruction address determined by the execution unit to verify whether the speculative execution was valid. However, the use of BTBs is not without deficiency. For example, as with the BHT, a BTB indexed using the address of the branch is able to provide the address of the target only when the branch is decoded in the instruction stream.
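The following sketch, offered under stated assumptions, shows why a BTB indexed by branch address handles subroutine returns poorly. The names are hypothetical, the branch-history indication is omitted, and each entry simply remembers the last observed target of the branch.

```python
class BranchTargetBuffer:
    """Toy BTB: the branch address is the tag, the last target the payload."""

    def __init__(self):
        self._entries = {}  # branch address -> last observed target

    def predict(self, branch_addr):
        """Return the stored target, or None on a miss."""
        return self._entries.get(branch_addr)

    def verify_and_update(self, branch_addr, actual_target):
        """Compare the prediction with the executed target, then store the
        executed target. Returns True if the prediction was correct."""
        predicted = self._entries.get(branch_addr)
        self._entries[branch_addr] = actual_target
        return predicted == actual_target


def count_mispredictions(btb, branch_addr, actual_targets):
    """Run one branch through a sequence of executed targets."""
    return sum(
        0 if btb.verify_and_update(branch_addr, target) else 1
        for target in actual_targets
    )


RETURN_ADDR = 0x2010  # hypothetical address of a subroutine's return branch

# One call site: only the first (cold) execution is mispredicted.
one_caller = count_mispredictions(BranchTargetBuffer(), RETURN_ADDR, [0x104] * 6)

# Two alternating call sites: the stored target is always the other
# caller's return point, so every execution is mispredicted.
two_callers = count_mispredictions(
    BranchTargetBuffer(), RETURN_ADDR, [0x104, 0x204] * 3
)
```

In this sketch `one_caller` is 1 and `two_callers` is 6, illustrating that the same return branch can have a different target at each invocation.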
Return address stacks store the next sequential instruction address to be executed after return from the subroutine (i.e., the next instruction in the calling program after a subroutine), in a fashion similar to the way the actual return address is stored in a logical stack upon execution of the call. The instruction address stored in the return address stack is used to speculatively fetch the next instruction after the return. Upon execution of the return, this value from the return address stack is compared against the actual return address popped from the logical stack to verify whether the speculative pipeline operation was valid.
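The operation of a return address stack may be sketched as follows. The class name, the fixed depth of eight entries, and the policy of discarding the oldest entry on overflow are all assumptions made for illustration.

```python
class ReturnAddressStack:
    """Toy return address stack used only for prediction."""

    def __init__(self, depth=8):  # depth is an assumed, hypothetical size
        self._depth = depth
        self._stack = []

    def on_call(self, return_addr):
        """Push the address of the instruction following the call."""
        if len(self._stack) == self._depth:
            self._stack.pop(0)  # assumed policy: discard the oldest entry
        self._stack.append(return_addr)

    def predict_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self._stack.pop() if self._stack else None

    @staticmethod
    def verify(predicted, actual):
        """Check the speculative target against the executed return address."""
        return predicted == actual


ras = ReturnAddressStack()
ras.on_call(0x104)             # outer call
ras.on_call(0x2008)            # nested call
inner = ras.predict_return()   # 0x2008: the inner return comes first
outer = ras.predict_return()   # 0x104
```

Because calls and returns nest, the last-in, first-out order yields the correct target for each return, provided that only true calls push and only true returns pop.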
IBM Technical Disclosure Bulletin Vol. 30, No. 11, April 1988, pp. 221-225, "Subroutine Call/Return Stack" by Webb describes a pair of stacks for saving subroutine addresses. This mechanism also uses a branch history table with an extra bit in each entry to identify Return instructions. A Return is identified when the branch prediction is verified for the Return. If the stack prediction was correct, a potential return instruction must have functioned as a Return. Consequently, the first time a particular Return is encountered, it is not handled as a Return. On subsequent executions of the instruction, the branch history table identifies the instruction as a Return and it is predicted using the stack. This mechanism requires two stacks, which are used in associative searches to find the prediction of a Return and to identify Returns.
Unfortunately, a problem with the stack mechanism is that the instructions used for calls and returns may also be used for other purposes. For example, the return instruction in the PowerPC, branch-using-register (br), is also used for implementing the C-language "switch" statement, which determines the target of a branch based on a variable which could take on one of several values unknown at compile time. The occurrence of such a branch could make the stack get out-of-sync and reduce the effectiveness of the prediction. The problem is even worse when a given instruction in the instruction set is used to implement both a call and a return, as in some System/390 implementations.
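The out-of-sync condition described above can be demonstrated with a small sketch. It assumes a naive predictor that treats every branch-through-register as a return; the event encoding and function name are hypothetical.

```python
def run_naive_stack(events):
    """Count mispredictions of a stack that pops on every register branch.

    `events` is a list of (kind, address) pairs. For "call", the address is
    the return point pushed on the stack. For "branch_reg", the address is
    the actual target of the branch, whether it is a return or a switch.
    """
    stack = []
    mispredictions = 0
    for kind, addr in events:
        if kind == "call":
            stack.append(addr)
        elif kind == "branch_reg":
            predicted = stack.pop() if stack else None
            if predicted != addr:
                mispredictions += 1
    return mispredictions


# Two nested calls followed by their two returns: all predicted correctly.
clean = [
    ("call", 0x104), ("call", 0x2008),
    ("branch_reg", 0x2008), ("branch_reg", 0x104),
]

# The same sequence with one "switch" branch inside the inner subroutine.
# The switch pops an entry it did not push, so both later returns are
# mispredicted as well.
with_switch = [
    ("call", 0x104), ("call", 0x2008),
    ("branch_reg", 0x3000),  # switch dispatch, not a return
    ("branch_reg", 0x2008), ("branch_reg", 0x104),
]

clean_misses = run_naive_stack(clean)
switch_misses = run_naive_stack(with_switch)
```

In this sketch a single non-return use of the instruction produces three mispredictions, since the stack remains misaligned for every return that follows.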
Thus, it would be desirable and highly advantageous to have a method and apparatus for accurately predicting the target of a subroutine return branch. It would also be desirable and highly advantageous to have a method and apparatus for prefetching and processing target instructions before execution of the return.