The present invention relates to a computer processor having one or more branch target buffers for improving the performance of branch instruction execution. The invention is useful for pipelined and non-pipelined architectures. For pipelined architectures, the invention is useful for both single-pipeline architectures and superscalar architectures that have two or more pipelines for processing instructions.
The original computers were designed to completely process one instruction before beginning the next instruction in the sequence. Major architectural advances that have increased performance include the use of pipelined and superscalar architectures. These architectures introduce higher levels of design complexity and cost in computer processors; however, this additional cost is more than offset by the increased performance of pipelined and superscalar computers.
Performance can also be increased by use of caches in computer architectures. Caches are utilized to store and supply often used information such as data and instructions. Within one clock cycle, a cache can supply the needed information without the memory access that could consume several cycles. One example of a cache that increases performance during a branch instruction is termed a branch target buffer ("BTB").
As mentioned briefly above, the speed of computers is increased by pipelining instructions. A pipelined computer divides instruction processing into a series of steps, or stages, each of which is preferably executable in a single clock cycle. In a non-pipelined computer, each instruction is processed until it is complete and only then does processing begin on the next instruction. In a pipelined computer, several sequential instructions are processed simultaneously in different stages of the pipeline. Processing in the different processing stages may proceed simultaneously in one clock period in separate portions of the computer.
For example, in a computer processor running pipelined instructions, each stage of the operation is handled in one clock period. The stages into which instruction processing for the processor is divided include an instruction cache fetch stage for fetching the instruction from wherever it is stored, an instruction decode stage for decoding the instruction, an address generation stage for generating the operand address(es), an operand fetch stage for fetching the operands, an execution stage for executing the instruction, and a writeback stage for writing the results of the execution to the registers and memory for later use. Each of the stages is designed to occupy one clock period. Thus, during the first clock period, the instruction fetch portion of the computer fetches an instruction from storage and aligns it so that it is ready for decoding. During the second clock period, the instruction fetch portion of the computer fetches the next instruction from storage and aligns it, while the instruction decoder portion of the computer decodes the first instruction fetched. During the third clock period, the first instruction fetched is moved into the address generation stage while the second instruction fetched is moved into the instruction decode stage, and another instruction is moved into the instruction fetch stage. Pipelining continues through each of the stages, including the execution stage and the writeback stage, and thus the overall speed of computer processing is significantly increased over that of a non-pipelined computer.
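The overlap of instructions described above can be illustrated with a small simulation. The following is a minimal sketch only, assuming an ideal pipeline with no stalls or hazards; the stage abbreviations and the function names `pipeline_schedule` and `non_pipelined_cycles` are hypothetical and are introduced here purely for illustration.

```python
# Illustrative model of an ideal six-stage pipeline (no stalls, no hazards).
# The abbreviations below are hypothetical labels for the six stages named
# in the text: fetch, decode, address generation, operand fetch, execute,
# and writeback.
STAGES = ["IF", "ID", "AG", "OF", "EX", "WB"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each clock period (1-based) to {stage name: instruction index}."""
    schedule = {}
    total_cycles = num_instructions + len(stages) - 1
    for cycle in range(1, total_cycles + 1):
        occupancy = {}
        for stage_number, name in enumerate(stages):
            # Instruction i occupies stage s during clock period i + s + 1.
            instruction = cycle - 1 - stage_number
            if 0 <= instruction < num_instructions:
                occupancy[name] = instruction
        schedule[cycle] = occupancy
    return schedule


def non_pipelined_cycles(num_instructions, stages=STAGES):
    """Clock periods needed if each instruction completes before the next starts."""
    return num_instructions * len(stages)
```

Under these assumptions, three instructions complete in eight clock periods in the pipelined model, whereas the non-pipelined model requires eighteen.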
In a superscalar architecture, two or more instructions may be processed simultaneously in one stage. A superscalar computer has two or more processing paths that are capable of simultaneously executing instructions in parallel. In a scalar computer, the same type of instructions would be run serially. It should be apparent that if two or more instructions are run simultaneously, then the computer can process instructions faster.
If a branch instruction, such as a jump, return, or conditional branch, is in the series of instructions, a pipelined computer will suffer a substantial performance penalty on any taken branch unless there is some form of branch prediction. The penalty is caused on a taken branch because the next instructions following in the pipeline must be thrown away, or "flushed." For example, if the microarchitecture has three stages preceding an execution stage, then the penalty will be at least three clock cycles when a branch is taken and not predicted, assuming the branch is resolved in the execution stage. This penalty is paid when the incorrect instructions are flushed from the pipeline and the correct instruction at the actual target address is inserted into the pipeline.
One way to increase the performance of executing a branch instruction is to predict the outcome of the branch instruction, and insert the predicted instruction into the pipeline immediately following the branch instruction. If such a branch prediction mechanism is implemented in a microprocessor, then the penalty is incurred only if the branch is mispredicted. It has been found that a large number of branches actually do follow the predictions, as the prevalence of repetitive loops illustrates. For example, it may be found that 80% of the branch predictions are correct.
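The benefit of prediction can be made concrete with simple arithmetic. The following is a minimal sketch, assuming that each mispredicted branch costs a fixed number of clock cycles and that correctly predicted branches cost nothing extra; the function name `average_branch_penalty` is hypothetical. The three-cycle penalty and the 80% accuracy figure are the example values given in the text.

```python
def average_branch_penalty(penalty_cycles, prediction_accuracy):
    """Expected clock cycles lost per branch when only mispredictions pay.

    Simplifying assumption: every misprediction costs exactly
    `penalty_cycles`, and a correct prediction costs no extra cycles.
    """
    if not 0.0 <= prediction_accuracy <= 1.0:
        raise ValueError("prediction_accuracy must be between 0 and 1")
    return penalty_cycles * (1.0 - prediction_accuracy)
```

With a three-cycle penalty, an accuracy of 0.8 gives an average loss of about 0.6 cycles per branch, compared with the full three cycles when every such branch incurs the penalty.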
Several types of branch prediction mechanisms have been developed. One type of branch prediction mechanism uses a branch target buffer ("BTB") that stores a plurality of entries, each including an index to a branch instruction. In addition to the index, each entry of the BTB may include an instruction address, an instruction opcode, history information, and possibly other data. In a microprocessor utilizing a branch target buffer, the branch prediction mechanism monitors each instruction as it enters the pipeline. Specifically, each instruction address is monitored, and when the address matches an entry in the branch target buffer, it is determined that the instruction is a branch instruction that has been taken before. After the entry has been located, the history information is tested to determine whether or not the branch will be predicted to be taken. Typically, the history is determined by a state machine which monitors each branch in the branch target buffer, and allocates bits depending upon whether or not a branch has been taken in the preceding cycles. If the branch is predicted to be taken, then the predicted instructions are inserted into the pipeline. Typically, the branch target entry will have opcodes associated with it for the target instruction, and these instructions are inserted directly into the pipeline. Also associated with the branch target buffer entry is an address that points to the predicted target instruction of the branch. This address is used to fetch additional instructions.
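The lookup described above can be sketched in software. This is a minimal illustration, not the mechanism of any particular processor: the class and field names are hypothetical, the buffer is modeled as an unbounded dictionary rather than a fixed-size hardware table, the history state machine is assumed to be a two-bit saturating counter, and stored opcodes are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BTBEntry:
    target_address: int  # address of the predicted target instruction
    history: int = 3     # assumed 2-bit state: 0..1 predict not taken, 2..3 predict taken


class BranchTargetBuffer:
    """Hypothetical BTB keyed by branch instruction address."""

    def __init__(self) -> None:
        self.entries: Dict[int, BTBEntry] = {}

    def predict(self, instruction_address: int) -> Optional[int]:
        """Return the predicted target address, or None if no taken prediction."""
        entry = self.entries.get(instruction_address)
        if entry is None:
            return None  # no match: not known to be a previously taken branch
        if entry.history >= 2:
            return entry.target_address
        return None

    def update(self, instruction_address: int, taken: bool,
               target_address: Optional[int] = None) -> None:
        """Record the resolved outcome of a branch."""
        entry = self.entries.get(instruction_address)
        if entry is None:
            # Assumption: an entry is allocated only when a branch is first taken.
            if taken and target_address is not None:
                self.entries[instruction_address] = BTBEntry(target_address)
            return
        if taken:
            entry.history = min(entry.history + 1, 3)
            if target_address is not None:
                entry.target_address = target_address
        else:
            entry.history = max(entry.history - 1, 0)
```

In this sketch, a branch must be resolved as not taken twice in a row after allocation before the buffer stops predicting it as taken, which is one common way to tolerate a single loop exit.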
Processing the branch instruction and each following instruction then proceeds down the pipeline for several clock cycles until the branch instruction has completed the execution stage, after which the "takeness" of the branch is known. If the branch is taken, the actual branch target address of the branch will be known. If the branch has been correctly predicted, then execution will continue in accordance with the prediction. However, if the branch has been mispredicted, then the pipeline is flushed and the correct instruction is inserted into the pipeline. In a superscalar computer, which has two or more pipelines through which instructions flow side-by-side, the performance penalty on a misprediction is even greater because, in most cases, at least twice the number of instructions may need to be flushed.
As the instruction issue rate and pipeline depth of processors increase, the accuracy of branch prediction becomes an increasingly significant factor in performance. Many schemes have been developed for improving the accuracy of branch predictions. These schemes may be classified broadly as either static or dynamic. Static schemes use branch opcode information and profiling statistics from executions of the program to make predictions. Static prediction schemes may be as simple as predicting that all branches are Not Taken or predicting that all branches are Taken. Predicting that all branches are Taken can achieve approximately 68 percent prediction accuracy, as reported by Lee and Smith (J. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design", IEEE Computer, January 1984, pp. 6-22). Another static scheme predicts that certain types of branches (for example, jump-on-zero instructions) will always be Taken or Not Taken. Static schemes may also be based upon the direction of the branch, as in "if the branch is backward, predict Taken; if forward, predict Not Taken". This latter scheme is effective for loop-intensive code, but does not work well for programs where the branch behavior is irregular.
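The direction-based static scheme can be expressed in a few lines. The following is a minimal sketch, assuming that a branch is "backward" when its target address does not exceed its own address; the function name is hypothetical, and treating a branch to itself as backward is an assumption.

```python
def predict_by_direction(branch_address, target_address):
    """Static prediction: backward branches Taken, forward branches Not Taken.

    Assumption: a branch whose target equals its own address is treated
    as a backward (loop) branch.
    """
    return target_address <= branch_address
```

For a loop-closing branch at address 0x120 that jumps back to 0x100, the sketch predicts Taken on every execution, so it is wrong only on the final iteration when the loop exits.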
One method of static prediction involves storing a "branch bias" bit with each branch instruction. When the instruction is decoded, the "branch bias" bit is used to predict whether the branch is Taken or Not Taken. The bias bit is usually determined statistically by profiling the program with sample data sets prior to execution. A profiling method is used to generate the branch bias bit. First, the program is loaded into the computer memory. Starting with the first instruction in the program, each branch instruction is located, and instructions are added to the program to record the branch decisions for that instruction. The program is then executed with a number of sample data sets. Execution is stopped and, beginning with the first instruction in the program, each branch instruction is located again. The profiling instructions are removed from the program, and if the probability that the branch will be Taken exceeds 50%, then the branch bias bit is set in the branch instruction and saved with the program. When the program is next executed, the bias bit is examined. If set, the branch is always predicted as Taken during execution of the program. Otherwise, the branch is always predicted as Not Taken.
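The derivation of the bias bit from profile data can be sketched as follows. This is an illustration only: it assumes the profiling runs have already produced a log of (branch address, taken) records, and it does not model the insertion or removal of profiling instructions. The function names and the log format are hypothetical, and predicting Not Taken for a branch that was never profiled is an assumption.

```python
from collections import defaultdict


def compute_bias_bits(profile_runs):
    """Derive one bias bit per branch from profiling logs.

    `profile_runs` is an iterable of runs; each run is an iterable of
    (branch_address, was_taken) records (a hypothetical log format).
    The bit is set only if the branch was Taken more than 50% of the time.
    """
    taken = defaultdict(int)
    total = defaultdict(int)
    for run in profile_runs:
        for address, was_taken in run:
            total[address] += 1
            if was_taken:
                taken[address] += 1
    # Integer comparison avoids floating-point rounding at exactly 50%.
    return {address: 2 * taken[address] > total[address] for address in total}


def predict_with_bias(bias_bits, branch_address):
    """Predict Taken if the bias bit is set; unprofiled branches default to Not Taken."""
    return bias_bits.get(branch_address, False)
```

Because the threshold is strictly greater than 50%, a branch that was Taken in exactly half of its profiled executions has its bias bit left clear in this sketch.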
A disadvantage of all static prediction schemes is that they ignore branch behavior in the currently executing program. By contrast, dynamic prediction schemes examine the current execution history of one or more branch instructions when making predictions. Dynamic prediction can be as simple as recording the last execution of a branch instruction and predicting the branch will behave the same way the next time. More sophisticated dynamic predictors examine the execution history of a plurality of branch instructions. Dynamic prediction typically requires more hardware than static prediction because of the additional run-time computation required.
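The simplest dynamic scheme mentioned above, predicting that a branch will repeat its last outcome, can be sketched as follows. The class name is hypothetical, the per-branch record is modeled as an unbounded dictionary rather than a hardware table, and the default prediction for a branch that has not yet executed is an assumption.

```python
class LastOutcomePredictor:
    """Predict that each branch behaves as it did on its previous execution."""

    def __init__(self, default=False):
        self.last_outcome = {}   # branch address -> last resolved outcome
        self.default = default   # assumed prediction for an unseen branch

    def predict(self, branch_address):
        return self.last_outcome.get(branch_address, self.default)

    def update(self, branch_address, taken):
        self.last_outcome[branch_address] = bool(taken)
```

Such a predictor mispredicts twice per loop: once on the exit and once on re-entry, which is part of the motivation for the longer histories discussed next.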
In dynamic prediction, branch history information is applied to a heuristic algorithm. The heuristic algorithm takes the branch execution history as input and outputs an indication of whether the branch will be Taken or Not Taken the next time it is executed. An example of a heuristic algorithm is one which counts the number of Taken and Not Taken decisions in the last M branch decisions. If the number of Taken decisions equals or exceeds the number of Not Taken decisions, the branch is predicted as Taken.
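The counting heuristic over the last M decisions can be sketched as follows. This is a minimal illustration for a single branch; the class name is hypothetical, ties are resolved as Taken (reading the rule as "equals or exceeds"), and a branch with no recorded history is therefore predicted Taken, which is an assumption.

```python
from collections import deque


class MajorityPredictor:
    """Predict Taken when Taken decisions are at least as numerous as
    Not Taken decisions among the last M recorded decisions."""

    def __init__(self, m):
        if m <= 0:
            raise ValueError("m must be positive")
        self.history = deque(maxlen=m)  # oldest decision is dropped automatically

    def predict(self):
        taken = sum(self.history)             # True counts as 1
        not_taken = len(self.history) - taken
        return taken >= not_taken

    def update(self, taken):
        self.history.append(bool(taken))
```

A larger M makes the prediction more stable against isolated deviations but slower to follow a genuine change in branch behavior.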
Dynamic prediction schemes may be further classified into local and global prediction schemes.
One method of local branch prediction uses a history table to record history information for a branch instruction. N bits of the instruction address are used to index an entry in the history table, where N is typically less than the number of bits in the branch instruction address. Because N is less than the number of bits in the branch instruction address, the history table serves as a hash table for all possible branch instructions in a program. Each entry of the history table stores the address of the branch for which the information in the entry is current. Storing the branch address in the entry makes it possible to detect hash-collisions when the address of a branch instruction does not match the address of the instruction for which the history information in an entry is current.
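The hashed history table with collision detection can be sketched as follows. This is an illustration only: the class and method names are hypothetical, the N index bits are assumed to be the low-order bits of the branch address, and a colliding branch is assumed simply to overwrite the existing entry.

```python
class HistoryTable:
    """History table indexed by N bits of the branch instruction address."""

    def __init__(self, index_bits):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _index(self, address):
        # Assumption: the N low-order address bits form the index.
        return address & ((1 << self.index_bits) - 1)

    def lookup(self, address):
        """Return the stored history, or None on an empty slot or a hash collision."""
        entry = self.entries[self._index(address)]
        if entry is None or entry["address"] != address:
            return None  # the stored branch address does not match
        return entry["history"]

    def store(self, address, history):
        """Store history for a branch, replacing any branch that hashed to the same slot."""
        self.entries[self._index(address)] = {"address": address, "history": history}
```

With four index bits, branches at addresses 0x123 and 0x133 share a slot; comparing the stored address against the looked-up address is what prevents one branch from using the other's history.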
For the global prediction schemes, for example, each entry of the history table also contains an L-bit branch sequence for a branch instruction, where L is the number of prior branch decisions to record for the branch. The L-bit branch sequence records whether the last L executions of the branch instruction resulted in the branch being Taken or Not-Taken. For example, if L=2 and the last two executions of the branch resulted in a Taken decision followed by a Not-Taken decision, then the branch sequence is 10, where logical one (1) represents the Taken decision and logical zero (0) represents the Not-Taken decision. Each entry in the table also contains an array of 2^L saturating up-down counters. For L=2, each entry therefore contains four saturating up-down counters, one counter for each of the four possible branch sequences. The possible sequences are: <Not-Taken, Not-Taken>, <Not-Taken, Taken>, <Taken, Not-Taken>, and <Taken, Taken>. In binary, these sequences are 00, 01, 10, and 11. Each counter counts the number of times a particular branch sequence results in a Taken decision when the branch is next executed. For example, counter 0 records the number of times the sequence 00 results in a branch decision of Taken when the branch instruction is next executed.
To predict whether a branch will be Taken or Not Taken upon the next execution of the branch instruction, the count associated with the branch sequence for the instruction is examined by the prediction heuristic logic. A typical heuristic works as follows: if the count is greater than or equal to a predetermined threshold value, the branch is predicted Taken; otherwise the branch is predicted Not Taken. If the count has P bits of resolution, a typical threshold value is 2^(P-1), which is the midpoint of the range of a P-bit counter. Once the branch is executed, resulting in a branch decision, the branch decision is input to the history update logic. If the branch is Taken, the count for the branch sequence is incremented by one. Otherwise the count is decremented by one. If the count reaches 2^P - 1 (i.e. the counter is saturated), the count remains at that value as long as the branch is Taken on subsequent executions for the same history sequence. If the count reaches 0, it remains at zero as long as the branch is Not Taken on subsequent executions for the same history sequence. Once the count is updated, the branch sequence is updated with the result of the branch decision. The high-order bit is shifted out of the branch sequence, and the result of the branch decision is shifted in. If the branch is Taken, a 1 is shifted in; otherwise a 0 is shifted in.
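The prediction and update steps described above, for a single history table entry, can be sketched as follows. This is a minimal illustration under stated assumptions: the class and attribute names are hypothetical, the branch sequence is assumed to start at all zeros, and each counter is assumed to start at the threshold value.

```python
class BranchHistoryEntry:
    """One history table entry: an L-bit branch sequence that selects
    one of 2**L saturating up-down counters of P bits each."""

    def __init__(self, history_bits=2, counter_bits=2):
        self.history_bits = history_bits
        self.sequence = 0                              # assumed initial sequence
        self.threshold = 1 << (counter_bits - 1)       # midpoint of a P-bit counter
        self.max_count = (1 << counter_bits) - 1       # saturation value
        # Assumption: counters start at the threshold (initially predict Taken).
        self.counters = [self.threshold] * (1 << history_bits)

    def predict(self):
        """Predict Taken if the counter for the current sequence meets the threshold."""
        return self.counters[self.sequence] >= self.threshold

    def update(self, taken):
        """Update the selected counter, then shift the decision into the sequence."""
        count = self.counters[self.sequence]
        if taken:
            count = min(count + 1, self.max_count)     # saturate at the maximum
        else:
            count = max(count - 1, 0)                  # saturate at zero
        self.counters[self.sequence] = count
        mask = (1 << self.history_bits) - 1            # drops the high-order bit
        self.sequence = ((self.sequence << 1) | int(bool(taken))) & mask
```

For a branch that strictly alternates between Taken and Not Taken, the counters for sequences 01 and 10 move in opposite directions, so after a few executions this sketch predicts the alternation correctly, which a single counter per branch cannot do.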
As mentioned in the previous paragraphs, some conventional branch prediction mechanisms create stalls in the processing pipeline when the branch prediction is needed, because of the complexity of determining the outcome of the branch instructions. These pipeline stalls greatly decrease the pipeline efficiency of the processing system. An advanced branch prediction mechanism is therefore desirable to solve this problem.