Conventional microprocessors which do not use a superscalar or multipipelined architecture accept instructions from a serial instruction stream, and process those instructions sequentially, in a logical order allowing jumps and branches. When a conditional branch instruction is encountered, the microprocessor tests certain flags which have been set by instructions previously executed by the microprocessor, and either resumes executing at the instruction which followed the conditional branch instruction in the serial instruction stream, or resumes execution at an instruction stored at a location described by the conditional branch instruction.
Superscalar microprocessors can accept a serial instruction stream and produce the same results as a non-superscalar microprocessor. However, superscalar microprocessors may internally process multiple instructions simultaneously, which may cause instructions to be executed out of their logical order, that is, the order intended by the original creator of the instructions.
Referring now to FIG. 1, a conventional superscalar microprocessor 102 and memory 104 are shown. Fetch circuitry 106 directs memory 104 to transfer blocks of instructions 110, 112, one block at a time and starting with the memory address contained in the fetch program counter 108, into memory area 114 for simultaneous processing by execution units 116, 118, 120. Although the size of the blocks 110, 112 and memory area 114 shown in FIG. 1 is four words, and the number of execution units 116, 118, 120 shown in FIG. 1 is three, conventional superscalar microprocessors may have blocks 110, 112 and memory areas 114 of any size, and any number of execution units 116, 118, 120.
Producing results in a superscalar microprocessor which are identical to the results which would be produced by a conventional non-superscalar microprocessor poses certain problems. One such problem arises in the processing of a conditional branch instruction. Because the instructions which set the flags may not have been processed at the time the branch instruction is ready for execution by the superscalar microprocessor, it is impossible to determine with certainty which instruction the non-superscalar microprocessor would have executed after the conditional branch instruction without waiting for all instructions which logically precede the conditional branch instruction to execute. Waiting for all such preceding instructions to execute would introduce undesirable delays.
One approach to avoiding these delays has been to attempt to predict the result of the conditional branch instruction without waiting for the logically preceding instructions to execute, and to continue processing instructions as if the prediction were accurate. When the instructions which logically precede the conditional branch have all completed execution, the prediction may be tested for accuracy. If the branch prediction is indeed accurate, processing continues and the undesirable delays are avoided. If the branch prediction is inaccurate, processing stops and resumes at the instruction which should have been executed after the conditional branch, with a delay no greater than if processing had been suspended to wait for execution of the instructions which logically preceded the conditional branch instruction.
Various conventional approaches exist for predicting which branch direction to take. One approach is to always predict that the branch described in the branch instruction will be taken. Such a prediction can often be correct more than fifty percent of the time, as many programs contain loop instructions that result in the branch described in the branch instruction being taken more often than not. For example, the PASCAL instructions:
    for i := 1 to 100 do
    begin
      . . .
    end;
cause the branch described in the branch instruction to be taken 99 percent of the time. Of course, other instructions, such as if..then, while..do, and repeat..until, may not yield the same prediction accuracy, but the scheme is relatively simple to implement, saving valuable area in a superscalar microprocessor 102.
When an instruction described in the conditional branch instruction is executed following the conditional branch instruction, the action is described as "taking the branch" and thus, the branch or "direction" of the branch is "taken". When the instruction which physically follows the branch instruction is executed because the conditions of the conditional branch instruction were not true, the action is described as "not taking the branch" and the branch or "direction" of the branch is described as "not taken."
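The accuracy of the always-taken scheme for the loop above can be checked with a short sketch. It assumes, purely for illustration, that the loop is compiled with a single conditional branch at the bottom of the loop body that is taken to repeat the loop and not taken on exit; the function names are hypothetical.

```python
def loop_branch_outcomes(iterations):
    """Outcomes of an assumed loop-closing branch: taken to repeat, not taken on exit."""
    return [True] * (iterations - 1) + [False]


def always_taken_accuracy(outcomes):
    """Fraction of branches for which a fixed 'taken' prediction is correct."""
    correct = sum(1 for taken in outcomes if taken)
    return correct / len(outcomes)


outcomes = loop_branch_outcomes(100)        # the 'for i := 1 to 100' loop
accuracy = always_taken_accuracy(outcomes)  # 99 correct predictions out of 100
print(f"always-taken accuracy: {accuracy:.2f}")
```

A compiler that places the test at the top of the loop would produce a different pattern of taken and not-taken outcomes, so the figure depends on the assumed code layout.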
One idea which can improve the accuracy of branch prediction is known as "bimodal" branch prediction and involves the use of a two-bit saturating counter as a prediction indicator to indicate whether a branch should be taken. A two-bit saturating counter makes use of the assumption that branches tend to be taken in groups, and so whether a branch or group of branches should be taken may be predicted by reference to whether the last branch or branches were taken. Referring now to FIGS. 2A and 2B, an illustration of a state table of a two-bit saturating counter is shown. State 210 represents a strong indication that the branch should not be taken. State 212 represents a weak indication that the branch should not be taken. State 214 represents a weak indication that the branch should be taken. State 216 represents a strong indication that the branch should be taken. The state of the prediction may be initialized to any state 210, 212, 214, 216. The branch is predicted taken if the most significant bit of the current prediction state has a value of "1", as in states 214, 216, and the branch is predicted not taken if the most significant bit of the current prediction state has a value of "0", as in states 210, 212. When the prediction is tested after the instructions logically preceding the branch have been executed, the state of the prediction is changed according to table 218. Column 220 represents the current state, column 222 represents the new state, and column 224 represents the actual branch action: taken, meaning the branch was actually taken, or not taken, meaning the branch was not actually taken. From a strong indication, two actual branches opposite the indication are required before a change is made to the branch prediction. Other arrangements of counters, including those with more than two bits, may be utilized to vary the number of actual branches opposite the strong indication required to change the prediction.
The states of FIG. 2A may also have the values opposite those shown: strong taken, weak taken, weak not taken and strong not taken for states 210, 212, 214, 216, respectively. In this case, the most significant bit having a value of "1" indicates the branch should be predicted not taken, "0" indicates the branch should be predicted taken. Table 218 of FIG. 2B is used as described above, with the opposite actual actions in column 224.
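The two-bit saturating counter with the first encoding described above (most significant bit "1" meaning predict taken) may be sketched as follows. Because table 218 is not reproduced in this text, the transitions are assumed to be those of an ordinary saturating counter: one step toward strong taken on each taken branch and one step toward strong not taken on each not-taken branch. The specific bit values assigned to states 210 through 216, and all names, are illustrative assumptions.

```python
# Assumed encodings: state 210 = 00, state 212 = 01, state 214 = 10, state 216 = 11.
STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN = 0b00, 0b01, 0b10, 0b11


def predict_taken(state):
    """Predict taken when the most significant bit of the two-bit state is 1."""
    return bool((state >> 1) & 1)


def update(state, taken):
    """Move one step toward the actual outcome, saturating at the strong states."""
    if taken:
        return min(state + 1, STRONG_TAKEN)
    return max(state - 1, STRONG_NOT_TAKEN)


# From a strong indication, two opposite outcomes are needed to flip the prediction.
state = STRONG_TAKEN
state = update(state, taken=False)   # now weak taken, still predicts taken
still_taken = predict_taken(state)
state = update(state, taken=False)   # now weak not taken, prediction flips
flipped = not predict_taken(state)
print(still_taken, flipped)
```

Under these assumptions a single not-taken outcome, such as a loop exit, does not disturb a strongly taken prediction.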
The accuracy of bimodal branch prediction may be enhanced through the use of a history register, which records the history of the actual branch action taken. The use of a history register assumes that conditional branches are taken according to repeating patterns. For example, in the following PASCAL program:
    for i := 1 to 100 do
      for j := 1 to 3 do
      begin
        . . .
      end;
The inner branch will be taken two times, but not the third, after which the outer branch will be taken; this behavior will be repeated ninety-eight times because of the outer loop. Knowledge of the outcomes of the last four branches executed, counting both the inner and outer branches, allows the behavior of the next branch to be predicted with higher accuracy than bimodal branch prediction alone. A shift register may be used as a history register to keep track of the behavior of the branches by shifting bits one position in a single direction (right or left) for each branch encountered, shifting in a "1" for each branch that is actually taken, and shifting in a "0" for each branch that is not actually taken. For example, a left shift register would read 1101 after the outer branch was taken, with the zero in the second least significant position showing that the end of the inner loop had been reached. The next branch should be predicted taken, as it will be the first branch in the next iteration of the inner loop.
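The shift-register behavior described above can be illustrated with a minimal sketch. It assumes a four-bit left shift register initialized to zero and a branch sequence in which each inner-loop branch is taken on the first two passes and not taken on the third, followed by the outer-loop branch; the function names are hypothetical.

```python
def shift_in(history, taken, width=4):
    """Shift the register left by one bit and insert 1 (taken) or 0 (not taken)."""
    mask = (1 << width) - 1
    return ((history << 1) | int(taken)) & mask


def nested_loop_outcomes(outer=100, inner=3):
    """Assumed branch outcomes for the nested loops, in execution order."""
    outcomes = []
    for i in range(outer):
        for j in range(inner):
            outcomes.append(j < inner - 1)  # inner branch: not taken on the last pass
        outcomes.append(i < outer - 1)      # outer branch: not taken on final exit
    return outcomes


history = 0
for taken in nested_loop_outcomes()[:4]:    # taken, taken, not taken, taken
    history = shift_in(history, taken)
print(format(history, "04b"))               # register contents after the outer branch
```

After the first outer branch the register holds 1101, matching the example in the text.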
The history register is used with a history table and the two-bit saturating counters of bimodal branch prediction to complete the prediction. Referring now to FIG. 3, the contents of a history register 308 as described above are used as an index into a history table 312. The pointer 316 having the same index 314 as the contents of history register 308 points to a two-bit saturating counter 318, 320, 322, 324, 326 having a state table as described above with reference to FIGS. 2A and 2B, which is used to determine the branch prediction as described above. The entire history register 308 may be used as an index into the table 312, or a certain number of bits including and adjacent to the bit most recently shifted in to the history register 308 may be used as an index into the table 312.
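A minimal sketch of the history-indexed arrangement follows. As a simplification, the table holds the two-bit counters directly rather than pointers to them, the counters are assumed to start in the weak taken state, and the counter transitions are assumed to be those of an ordinary saturating counter. The class and function names are hypothetical. Setting `history_bits=0` degenerates to a single shared counter, which is used here only as a rough stand-in for prediction without history.

```python
class HistoryTablePredictor:
    """Two-bit counters selected by the contents of a branch history register."""

    def __init__(self, history_bits=4, initial_state=2):
        self.history_bits = history_bits
        self.history = 0
        self.counters = [initial_state] * (1 << history_bits)

    def predict(self):
        # States 2 and 3 have the most significant bit set: predict taken.
        return self.counters[self.history] >= 2

    def update(self, taken):
        # Adjust the counter that made the prediction, then record the outcome.
        state = self.counters[self.history]
        self.counters[self.history] = min(state + 1, 3) if taken else max(state - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a sequence of actual outcomes and count misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Assumed outcome sequence for the nested loops: taken, taken, not taken, taken,
# repeated, with the outer branch not taken on the final exit.
outcomes = [True, True, False, True] * 99 + [True, True, False, False]
history_misses = count_mispredictions(HistoryTablePredictor(history_bits=4), outcomes)
single_counter_misses = count_mispredictions(HistoryTablePredictor(history_bits=0), outcomes)
print(history_misses, single_counter_misses)
```

On this sequence each four-bit history value is followed by the same outcome every time, so the history-indexed counters settle after a short warm-up, whereas the single counter mispredicts every exit of the inner loop.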
Another method is similar to the history table method described above, except that all, or a certain number of the least significant bits, of the address of the conditional branch instruction are used in place of the history register 308 as the index to the table 312.
Still other methods combine the low order bits of the address of the conditional branch instruction and some or all of the branch history, for example by concatenation or exclusive-OR-ing, to create an index to the table 312, in place of the history register 308 alone.
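The alternative ways of forming the table index described above may be sketched as follows. The bit widths, the ordering of address and history bits in the concatenated index, and the function names are illustrative assumptions rather than details taken from any particular design.

```python
def index_from_address(branch_address, index_bits):
    """Use the least significant bits of the branch address as the index."""
    return branch_address & ((1 << index_bits) - 1)


def index_by_concatenation(branch_address, history, address_bits, history_bits):
    """Place low-order address bits above the history bits (assumed ordering)."""
    low_address = branch_address & ((1 << address_bits) - 1)
    low_history = history & ((1 << history_bits) - 1)
    return (low_address << history_bits) | low_history


def index_by_xor(branch_address, history, index_bits):
    """Exclusive-OR the low-order address bits with the history bits."""
    mask = (1 << index_bits) - 1
    return (branch_address ^ history) & mask


address = 0x1236   # hypothetical branch address
history = 0b1101   # history register contents from the nested-loop example
print(index_from_address(address, 4))
print(index_by_concatenation(address, history, address_bits=2, history_bits=4))
print(index_by_xor(address, history, 4))
```

Concatenation needs a table with one entry for every combination of address and history bits, while exclusive-OR keeps the table at the size set by the index width at the cost of letting different address and history pairs share an entry.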
Referring again to FIG. 1, if the address of the conditional branch instruction is used to create the index, the address must be computed from the fetch program counter register 108 and the position of the conditional branch instruction in the memory area 114, adding complexity to the microprocessor 102 and introducing computational delay. If the history is used to create the index, it must be updated for each conditional branch instruction executed, resulting in additional complexity in the design of the microprocessor 102.
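The address computation mentioned above amounts to an addition performed for each conditional branch in a fetched block. The following sketch assumes byte addressing and fixed four-byte instruction words, neither of which is stated in the text; the names are hypothetical.

```python
WORD_BYTES = 4  # assumed instruction word size in bytes


def branch_address(fetch_pc, position_in_block, word_bytes=WORD_BYTES):
    """Address of the instruction at a given word position within the fetched block."""
    return fetch_pc + position_in_block * word_bytes


# A branch in the third word (position 2) of a block fetched from 0x1000.
print(hex(branch_address(0x1000, 2)))
```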