This invention is in the field of microprocessors, and is more specifically directed to branch prediction techniques in pipelined microprocessors.
In the field of microprocessors and other programmable logic devices, many improvements have been made in recent years which have resulted in significant performance improvements. One such improvement is the implementation of pipelined architectures, in which multiple microprocessor instructions are processed simultaneously along various stages of execution, so that the processing of subsequent instructions begins prior to the completion of earlier instructions. Because of pipelining, the effective rate at which instructions are executed by a microprocessor can approach one instruction per machine cycle in a single pipeline microprocessor, even though the processing of each individual instruction may require multiple machine cycles from fetch through execution. So-called superscalar architectures effectively have multiple pipelines operating in parallel, providing even higher theoretical performance levels.
Of course, as is well known in the art, branching instructions are commonplace in most conventional computer and microprocessor programs. Branching instructions are instructions that alter the program flow, such that the next instruction to be executed after the branching instruction is not necessarily the next instruction in program order. Branching instructions may be unconditional, such as JUMP instructions, subroutine calls, and subroutine returns. Some branching instructions are conditional, as the branch depends upon the results of a previous logical or arithmetic instruction.
Conditional branching instructions present complexity in microprocessors of pipelined architecture, because the condition upon which the branch depends is not known until execution, which may be several cycles after fetch. In these situations, the microprocessor must either cease fetching instructions after the branch until the condition is resolved, introducing a "bubble" of empty stages (i.e., potential instruction processing slots) into the pipeline, or must instead speculatively fetch an instruction (in effect guessing the condition) in order to keep the pipeline full, at a risk of having to "flush" the pipeline of its current instructions if the speculation is determined to be incorrect.
The benefit of speculative execution of instructions in keeping the pipeline full, particularly in architectures with long or multiple pipelines, typically outweighs the performance degradation of pipeline flushes, so long as the success rate of the speculative execution is sufficient to achieve the desired performance benefit. Many modern microprocessors therefore follow some type of branch prediction technique, by way of which the behavior of conditional branching instructions may be predicted with some accuracy. One type of branch prediction is referred to as "static" prediction, as the prediction does not change over time or history. A simple static prediction approach merely predicts all conditional branches to be "taken". An improved static branch prediction approach predicts according to branch direction, for example by predicting all conditional branches in the forward direction to be "not taken" and predicting all conditional backward branches (e.g., LOOP instructions in DO loops) to be "taken". Of course, unconditional branches may always be statically predicted as "taken".
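By way of illustration only, the direction-based static prediction described above may be sketched as follows. The function name, its parameters, and the treatment of a branch to its own address as a backward branch are assumptions made for this sketch and are not taken from any particular microprocessor.

```python
def predict_static(branch_addr: int, target_addr: int, conditional: bool = True) -> bool:
    """Return True for a "taken" prediction and False for "not taken"."""
    if not conditional:
        # Unconditional branches (jumps, calls, returns) are always taken.
        return True
    # Backward branches (target at or before the branch, e.g. loop-closing
    # branches) are predicted taken; forward branches are predicted not taken.
    return target_addr <= branch_addr
```

For example, a conditional branch at address 0x1000 targeting 0x0F00 would be predicted taken, while the same branch targeting 0x1040 would be predicted not taken.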
Dynamic branch prediction refers to a known technique of branch prediction that uses the results of past branches to predict the result of the next branch. A simple well-known dynamic prediction technique merely uses the results of the most recent one or two conditional branching instructions to predict the direction of a current branching instruction.
A more accurate dynamic branch prediction approach predicts the direction of a branching instruction by its own branching history, as opposed to the branch results of other instructions. This approach is generally incorporated into modern microprocessors by way of a branch target buffer. A conventional branch target buffer, or BTB, is a cache-like table of entries that each store an identifier (a "tag") for recently-encountered branching instructions, a branch history-related code upon which prediction is made, and a target address of the next instruction to be fetched if the branch is predicted as taken (the next sequential address being the address to be fetched for a "not taken" prediction). When a branching instruction is fetched, its address is matched against the tags in the BTB to determine if this instruction has been previously encountered; if so, the next instruction is fetched according to the prediction code indicated in the BTB for that instruction. Newly-encountered branching instructions are statically predicted, as no history is present in the BTB. Upon execution and completion of the instruction, the BTB entry is created (typically, for taken branches only) or modified (for branches already having a BTB entry) to reflect the actual result of the branching instruction, for use in the next occurrence of the instruction.
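The lookup and update behavior of a conventional BTB, as described above, may be sketched as follows. This is a simplified model under stated assumptions: the full branch address serves as the tag, the table is unbounded (no replacement policy), and a single bit recording the last outcome stands in for the branch history-related code. All class, field, and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BTBEntry:
    target: int          # address to fetch next when predicted taken
    predict_taken: bool  # simplified stand-in for the history-related code


class BranchTargetBuffer:
    def __init__(self) -> None:
        # Keyed by tag; this sketch assumes the full branch address is the tag.
        self.entries: Dict[int, BTBEntry] = {}

    def lookup(self, branch_addr: int) -> Optional[BTBEntry]:
        """Return the entry on a tag match, or None for a new branch."""
        return self.entries.get(branch_addr)

    def next_fetch(self, branch_addr: int, next_sequential: int) -> Optional[int]:
        """Predicted fetch address on a hit; None means use static prediction."""
        entry = self.lookup(branch_addr)
        if entry is None:
            return None
        return entry.target if entry.predict_taken else next_sequential

    def update(self, branch_addr: int, taken: bool, target: int) -> None:
        """Record the resolved outcome of an executed branch."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            if taken:  # entries are typically created for taken branches only
                self.entries[branch_addr] = BTBEntry(target, True)
        else:
            entry.predict_taken = taken
            if taken:
                entry.target = target
```

In this sketch, a miss returns None so that the caller may fall back on static prediction, consistent with the handling of newly-encountered branching instructions noted above.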
Various conventional prediction algorithms, which predict branches based upon the most recently executed branches or upon the branching history of the same instruction, are known in the art. A well-known simple prediction algorithm follows a four-state state machine model, and uses the two most recent branch events to predict whether the next occurrence will be taken or not taken. The four states are referred to as "strongly taken", "taken", "not taken", and "strongly not taken". A "strongly" state corresponds to at least the last two branches (either generally or for the particular instruction, depending upon the implementation) having been taken or not taken, as the case may be. The taken and not taken states (i.e., not a "strongly" state) correspond to the last two branches having differing results, with the next branch result either changing the prediction to the other result, or maintaining the prediction but in a "strongly" state.
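The four-state model described above is commonly realized as a two-bit saturating counter, and may be sketched as follows under that reading. The numeric encoding of the states (0 through 3) and the function names are assumptions made for illustration.

```python
# One possible encoding of the four states as a 2-bit saturating counter.
STRONGLY_NOT_TAKEN, NOT_TAKEN, TAKEN, STRONGLY_TAKEN = range(4)


def predict(state: int) -> bool:
    """Predict taken in the "taken" and "strongly taken" states."""
    return state >= TAKEN


def update(state: int, taken: bool) -> int:
    """Move one step toward the actual outcome, saturating at either end."""
    if taken:
        return min(state + 1, STRONGLY_TAKEN)
    return max(state - 1, STRONGLY_NOT_TAKEN)
```

From the "taken" state, for example, a taken branch maintains the prediction in the "strongly taken" state, while a not-taken branch changes the prediction to "not taken".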
A recent advance in branch prediction algorithms uses not only branch history results, but also branch pattern information, in generating a prediction of branch behavior. For example, a certain branch instruction may be a loop of three passes, such that its branch history will repetitively follow a pattern of taken-taken-not taken. Use of a simple two-bit, or four-state, prediction mechanism will not correctly predict the branching of this instruction, even though its behavior is entirely predictable. The well-known two-level adaptive branch prediction mechanism, described in Yeh & Patt, "Two-Level Adaptive Branch Prediction", Proceedings of the 24th International Symposium on Microarchitecture, (ACM/IEEE, November 1991), pp. 51-61, uses both branch history and branch pattern information to predict the results of a branching instruction. Branch prediction using the Yeh & Patt approach has been applied to microprocessor architectures using BTBs, as described in U.K. Patent Application 2 285 526, published Jul. 12, 1995. Attention is also directed, in this regard, to U.S. Pat. No. 5,574,871.
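The limitation of the four-state mechanism on the three-pass loop example may be illustrated by the following sketch, which assumes a two-bit saturating counter (states 0 through 3, with states 2 and 3 predicting taken) that begins in the "taken" state. The function name and the initial state are assumptions made for illustration.

```python
def count_mispredictions(outcomes, state: int = 2) -> int:
    """Count mispredictions of a 2-bit saturating counter over a branch trace."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# Ten repetitions of the taken-taken-not taken pattern.
trace = [True, True, False] * 10
print(count_mispredictions(trace))  # prints 10
```

Under these assumptions, the counter never falls below the "taken" state, so that every not-taken outcome (one branch in three) is mispredicted even though the pattern repeats exactly.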
According to the approach described in the above-referenced Yeh and Patt paper and U.K. Patent Application 2 285 526, a pattern history is maintained and updated for each unique branch pattern. In this approach, the pattern history consists of the four-state state machine model described above, in which the two most recent branch events for each branch pattern predict whether the next occurrence of a branch having the same branch pattern will be taken or not taken (along with its "strongly" attribute). In operation, upon detection of a branching instruction having an entry in the BTB, the branch pattern contained in the branch history field for that instruction indexes into the pattern history table, from which the prediction is obtained. Upon resolution of the branch, both the branch history field for the particular instruction and the pattern history for its previous pattern (i.e., the branch pattern used in the prediction) are updated. The updated pattern history is then available for use in predicting the outcome of the next branch instruction having its associated branch pattern in its branch history field of the BTB. The pattern history table according to this approach is thus "global", in the sense that the branch prediction is generated for any branch instruction having the same branch history pattern, regardless of the identity of the instruction. Accordingly, the pattern history for a particular branch pattern will be defined and updated based upon the branch prediction results for any branching instruction having that branch history. The branch prediction for any given instruction will thus be determined based upon the branch results of other, dissimilar, instructions, according to this basic two-level technique.
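The two-level operation described above, with a per-instruction branch history field indexing a single global pattern history table, may be sketched as follows. This is a minimal model and not the implementation of the referenced paper or application: the four-bit history length, the initial counter value, the use of a dictionary in place of the BTB history field, and all names are assumptions made for illustration.

```python
HISTORY_BITS = 4  # assumed history length for this sketch


class TwoLevelPredictor:
    def __init__(self, history_bits: int = HISTORY_BITS) -> None:
        self.mask = (1 << history_bits) - 1
        self.history = {}  # branch address -> history bits (the BTB history field)
        # Global pattern history table: one 2-bit counter per branch pattern.
        self.pht = [2] * (1 << history_bits)

    def predict(self, addr: int) -> bool:
        pattern = self.history.get(addr, 0)
        return self.pht[pattern] >= 2

    def update(self, addr: int, taken: bool) -> None:
        pattern = self.history.get(addr, 0)
        # Update the pattern history for the pattern used in the prediction.
        count = self.pht[pattern]
        self.pht[pattern] = min(count + 1, 3) if taken else max(count - 1, 0)
        # Shift the actual outcome into this instruction's branch history.
        self.history[addr] = ((pattern << 1) | int(taken)) & self.mask


def run_branch(predictor: TwoLevelPredictor, addr: int, outcomes) -> list:
    """Return, for each outcome, whether the prediction was correct."""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(addr) == taken)
        predictor.update(addr, taken)
    return results


results = run_branch(TwoLevelPredictor(), 0x400, [True, True, False] * 10)
print(results.count(False))  # prints 2 (both during warm-up)
```

In this sketch, a branch following the repeating taken-taken-not taken pattern is mispredicted only twice during warm-up and is predicted correctly thereafter, since each four-bit pattern in that sequence is always followed by the same outcome. Because the table is shared, any other branch reaching the same pattern would read and modify the same counters.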
As described in Yeh and Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction", Conference Proceedings of the 19th Annual International Symposium on Computer Architecture, (ACM, May 1992), pp. 124-134, an alternative implementation of two-level branch prediction addresses this limitation. This alternative implementation provides address-specific pattern history tables, such that each entry in the BTB has its own pattern history table, as shown in FIG. 3 of this paper. Accordingly, the branch prediction for a branching instruction is made based upon the pattern history as generated and modified by its own past history, and is not dependent upon the branch results for other branching instructions having similar branch patterns.
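The address-specific alternative may be sketched in the same manner, with one pattern history table per branch address in place of the single global table. As before, this is an illustrative model only: the four-bit history length, the initial counter value, and all names are assumptions, and a dictionary keyed by branch address stands in for the per-entry tables of the BTB.

```python
from collections import defaultdict

HISTORY_BITS = 4  # assumed history length for this sketch
MASK = (1 << HISTORY_BITS) - 1


class PerAddressPredictor:
    def __init__(self) -> None:
        # Branch history bits for each branch address.
        self.history = defaultdict(int)
        # One pattern history table of 2-bit counters per branch address.
        self.pht = defaultdict(lambda: [2] * (1 << HISTORY_BITS))

    def predict(self, addr: int) -> bool:
        return self.pht[addr][self.history[addr]] >= 2

    def update(self, addr: int, taken: bool) -> None:
        pattern = self.history[addr]
        table = self.pht[addr]
        # Only this instruction's own table is modified.
        table[pattern] = min(table[pattern] + 1, 3) if taken else max(table[pattern] - 1, 0)
        self.history[addr] = ((pattern << 1) | int(taken)) & MASK
```

Under these assumptions, two branching instructions that arrive at the same branch pattern consult different counters, so that the outcome learned for one does not disturb the prediction for the other.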
While the use of address-specific pattern history tables eliminates interference in the branch prediction from other branching instructions having the same branch patterns, the cost of implementation can be quite substantial. For example, modern microprocessors may have BTBs with as many as 4k entries. The use of an index of four bits of branch history into address-specific pattern history tables thus requires 4k pattern history tables, each with sixteen entries that are two bits in width, resulting in 128 kbits of storage. The chip area required for implementation of this approach is thus quite substantial. This cost increases rapidly as branch prediction is sought to be improved through the use of additional branch history bits as the index to the pattern history tables; for example, the use of six branch history bits would require 512 kbits of pattern history storage. As microprocessors continue to have more pipelines, each deeper in stages, resulting in more severe penalties for branch misprediction and thus a higher premium on accurate branch prediction, the cost of implementing address-specific pattern history tables becomes even greater.
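The storage figures above follow from multiplying the number of BTB entries by the number of entries in each pattern history table (two raised to the number of history bits) and by the width of each entry. The arithmetic may be checked with the following sketch, in which the function name and the default two-bit entry width are assumptions made for illustration.

```python
def pht_storage_bits(btb_entries: int, history_bits: int, counter_bits: int = 2) -> int:
    """Total storage, in bits, for one pattern history table per BTB entry."""
    entries_per_table = 1 << history_bits  # one entry per possible pattern
    return btb_entries * entries_per_table * counter_bits


print(pht_storage_bits(4096, 4) // 1024)  # prints 128 (kbits)
print(pht_storage_bits(4096, 6) // 1024)  # prints 512 (kbits)
```

Each additional history bit doubles the storage requirement, which accounts for the fourfold increase from four history bits to six.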
By way of further background, it has been observed that microprocessor programs of different types have similarities in branch behavior within the type, and dissimilarities across types. For example, as described in Calder and Grunwald, "The Predictability of Branches in Libraries", Proceedings of the 28th International Symposium on Microarchitecture (ACM/IEEE, November 1995), pp. 24-34, commonly used UNIX library subroutines tend to have predictable branching behavior and, as a class or type, different branching behavior from non-library programs.
By way of further background, indexing into a global pattern history table using both branch history and a portion of the tag field of the BTB is known.
By way of further background, modern microprocessors are now capable of supporting multitasking operating systems, in which the microprocessor sequentially switches its operation among several tasks to give the appearance of the parallel operation of multiple tasks. Typically, for example in microprocessors constructed according to the well-known x86 architecture, each task is carried out for a short time and is then interrupted by an event commonly referred to as a task switch, after which a different task is started or restarted and then executed for a short time, with the sequence continuing with additional task switches. In order to carry out such multitasking operation, the system context for each task must be saved upon interruption of the task, and restored upon restarting of the task. Portions of memory are typically reserved and used for storage and recall of the system context for each task. According to the x86 architecture, a system segment referred to as the task state segment (TSS) is assigned to each task, for storage of its condition when interrupted by a task switch.