The present invention relates to the field of processor design in digital computer systems. More specifically, the present invention relates to the prediction of directions for branch instructions executed by such a computer system.
Recent advances in reduced instruction set computing (RISC) architectures and VLSI technologies allow computer designers to exploit more instruction level parallelism by designing deeper pipelines and adding more functional units to increase the scalar performance of high speed processors. As more sophisticated hardware is built to exploit the available instruction level parallelism, more attention must be paid to the disruption of the pipeline caused by branches. Since branches constitute anywhere from 15% to 30% of all executed instructions, the efficiency of handling branches is important. A significant improvement in processor performance can be achieved through the use of specially designed methods and hardware to reduce the cost incurred as a consequence of branching.
Many hardware implementations that reduce the cost of branches have been reported in the prior art. Among them, the branch target cache (BTC) is most commonly used. A BTC is a small cache memory used to keep information related to branches. Typical information kept in the BTC includes the address of a previously executed branch instruction, the target address for that branch, branch prediction bits, and sometimes, the instruction at the target address. The BTC may be organized as a direct-mapped, set-associative, or fully-associative cache. The organization of the BTC and the information kept in it are implementation dependent, but the general operation of the BTC is the same. During the instruction fetch stage, the instruction address is compared against the branch addresses in the BTC. If there is a match, then a branch is found and a prediction is made based on the stored prediction information associated with that branch. If the prediction is that the branch will be taken, the target instruction, if there is any in the BTC, will be dispatched and the target address will be used for the subsequent instruction fetches. When the branch is actually resolved, the branch information associated with that branch in the BTC is updated based upon the outcome of that branch.
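The general BTC operation described above can be sketched in software as follows. This is only an illustrative model, not the organization of any particular machine: it assumes a direct-mapped BTC, a single prediction bit per entry, and addresses treated as plain integers whose low-order bits form the index. The names `BTCEntry`, `BranchTargetCache`, `lookup`, and `update` are hypothetical and chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BTCEntry:
    branch_addr: int     # address of a previously executed branch (used as the tag)
    target_addr: int     # target address recorded for that branch
    predict_taken: bool  # a single prediction bit (assumption; designs may keep more)


class BranchTargetCache:
    """Illustrative direct-mapped BTC indexed by low-order address bits."""

    def __init__(self, num_entries: int = 16) -> None:
        # Assumption: num_entries is a power of two so a bit mask can form the index.
        assert num_entries > 0 and num_entries & (num_entries - 1) == 0
        self.mask = num_entries - 1
        self.entries: List[Optional[BTCEntry]] = [None] * num_entries

    def lookup(self, fetch_addr: int) -> Optional[int]:
        """Fetch stage: return the target address on a hit that is predicted taken."""
        entry = self.entries[fetch_addr & self.mask]
        if entry is not None and entry.branch_addr == fetch_addr and entry.predict_taken:
            return entry.target_addr
        return None  # miss, or hit predicted not taken: fetch continues sequentially

    def update(self, branch_addr: int, target_addr: int, taken: bool) -> None:
        """Branch resolution: record the actual outcome for this branch."""
        self.entries[branch_addr & self.mask] = BTCEntry(branch_addr, target_addr, taken)
```

In this sketch a call to `lookup` stands for the comparison made during instruction fetch, and `update` stands for the bookkeeping done when the branch is resolved. Two branches whose addresses share the same low-order bits compete for one entry, which is a property of the assumed direct-mapped organization.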
Another hardware approach which has been implemented in a real machine is known as branch target prefetch. In this approach, only a short instruction buffer is used to temporarily hold the target instructions that have been prefetched. However, a mechanism for anticipating a branch in the upcoming instruction stream is required to initiate the target fetch in advance. If a branch instruction can be detected early and the target address can be calculated in advance, the target instruction can be prefetched with zero delay. In this kind of design, the branch prediction is implemented either in a dedicated table or in the instruction cache directory.
In both implementations the overall reduction of the branch cost for pipeline processing depends on the accuracy of the prediction. When a branch is mispredicted, incorrectly dispatched instructions must be flushed from the pipeline, causing extra delays in the pipeline. Given that the cost of pipeline flushing is high, a small increase in the accuracy of the prediction has a large impact on the pipeline performance.
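The sensitivity of pipeline performance to prediction accuracy can be illustrated with a simple first-order model. The model is an assumption made for this illustration and does not come from any particular machine: every instruction costs a base number of cycles, and each mispredicted branch adds a fixed flush penalty. The function name `effective_cpi` and the parameter values in the note below are hypothetical.

```python
def effective_cpi(branch_fraction: float,
                  accuracy: float,
                  flush_penalty: float,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction under a fixed misprediction penalty.

    branch_fraction: fraction of executed instructions that are branches
    accuracy:        fraction of branches predicted correctly
    flush_penalty:   extra cycles lost per mispredicted branch (assumed constant)
    """
    mispredictions_per_instruction = branch_fraction * (1.0 - accuracy)
    return base_cpi + mispredictions_per_instruction * flush_penalty
```

With 20% branches and an assumed 10-cycle flush penalty, raising the accuracy from 0.85 to 0.90 lowers the modeled cycles per instruction from about 1.30 to about 1.20. Under these assumptions, the benefit of each additional point of accuracy grows in proportion to the flush penalty.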
A commonly used branch prediction scheme which is not overly complex yet reasonably effective is the N-bit counter based branch prediction technique. The basic idea of N-bit counter based branch prediction is the use of an N-bit up/down counter for prediction. In the ideal case, an N-bit counter is assigned to each one of the unique branches (branches with different addresses). When a branch is about to be executed, the counter value C associated with that branch is used for prediction. If the value C is greater than or equal to a predetermined threshold value L, the branch is predicted to be taken; otherwise it is predicted to be not taken. A typical candidate for L is 2^(N-1). The counter value C for a branch is updated whenever that branch is resolved. If the branch is taken, C is incremented by one; otherwise it is decremented by one. If C = 2^N - 1, it remains at that value as long as the branch is taken. If C = 0, it remains at zero as long as the branch is not taken. In actual implementations, a small cache memory is used as a table to store the N-bit counter values. Usually the lower order bits of the branch address are used to access the table, yielding the N-bit counter value C. These N bits are used as the prediction bits for all branches whose addresses are mapped into the same memory entry. See U.S. Pat. No. 4,370,711 for additional background information.
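The counter scheme just described can be sketched as a small table of saturating counters. The following is a minimal illustration under stated assumptions, not the circuit of the cited patent: counters start at zero (the text does not specify an initial value), the table size is a power of two, and the class and method names are hypothetical.

```python
class CounterPredictor:
    """Table of N-bit saturating up/down counters indexed by low-order address bits."""

    def __init__(self, n_bits: int = 2, table_bits: int = 4) -> None:
        self.max_count = (1 << n_bits) - 1      # saturation value 2^N - 1
        self.threshold = 1 << (n_bits - 1)      # threshold L = 2^(N-1)
        self.index_mask = (1 << table_bits) - 1
        # Assumption: every counter starts at 0, i.e. "not taken".
        self.table = [0] * (1 << table_bits)

    def predict(self, branch_addr: int) -> bool:
        """Predict taken when the counter value C is at least the threshold L."""
        return self.table[branch_addr & self.index_mask] >= self.threshold

    def update(self, branch_addr: int, taken: bool) -> None:
        """Increment on a taken branch, decrement otherwise, saturating at both ends."""
        i = branch_addr & self.index_mask
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
```

With N = 2, a branch must be taken twice from the assumed initial state before it is predicted taken, and a single not-taken outcome at the saturated value 3 does not flip the prediction. Branches whose addresses share the same low-order bits read and update the same counter, as noted above.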
Generally speaking, counter based branch prediction is a history based scheme, meaning that it predicts the outcome of a branch based merely on the past history of that branch. The prediction information represented by the counter value for each branch is independent of that for every other branch. Such a prediction scheme typically works relatively well for scientific workloads where programming structures are dominated by inner loops. In these applications the outcome of a branch tends to be more related to its own past history and less related to the outcomes of other branches. Such self-related history provides the predictability.
In many integer workloads where control transfers are intensive, the relation between branches is not as simple as it is in scientific workloads. The outcome of a branch in such applications is usually affected not only by its own past history but also by the outcomes of other preceding branch operations. In other words, branches are usually correlated with, or interdependent upon, one another. Because of this correlation or interdependence, the history of each individual branch looks chaotic, which in turn reduces the accuracy of any prediction scheme based merely on the history of that branch. Thus there exists a need for prediction apparatus and methods that increase the accuracy of counter based branch prediction for workloads where branches are correlated.
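The effect of correlation on a per-branch counter can be illustrated with a small, artificial simulation. The workload here is an assumption constructed for this illustration and is not taken from any measured program: two preceding branches have random, independent outcomes, and a third branch is taken exactly when those two outcomes differ. The function name `own_history_accuracy` is hypothetical, and the 2-bit counter is assumed to start at zero.

```python
import random


def own_history_accuracy(num_iterations: int = 10_000, seed: int = 0) -> float:
    """Accuracy of a 2-bit counter on a branch determined by two preceding branches."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    counter = 0                 # 2-bit counter for the third branch (assumed start: 0)
    correct = 0
    for _ in range(num_iterations):
        first_taken = rng.random() < 0.5
        second_taken = rng.random() < 0.5
        # The third branch is fully determined by the two preceding outcomes.
        third_taken = first_taken != second_taken
        predicted_taken = counter >= 2   # threshold L = 2^(N-1) with N = 2
        if predicted_taken == third_taken:
            correct += 1
        # Saturating update using only the third branch's own outcome.
        counter = min(counter + 1, 3) if third_taken else max(counter - 1, 0)
    return correct / num_iterations
```

In this constructed case the third branch is perfectly predictable from the two branches that precede it, yet its own history looks like a sequence of coin flips, so the counter is right only about half of the time.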