1. Field of the Invention
The present invention relates to techniques for performing branch prediction in a data processing apparatus.
2. Description of the Prior Art
A data processing apparatus will typically include one or more execution units which are operable to execute a sequence of instructions. A prefetch unit will typically be provided for prefetching instructions from a memory, for example an instruction cache, prior to those instructions being passed to the required execution unit. The prefetch unit will typically include an incrementer for incrementing a current program counter (PC) value to produce an instruction address for a next instruction or set of instructions to be prefetched. However, when a branch instruction is encountered within the instruction flow, this may cause a discontinuous jump in the PC value if the branch specified by that branch instruction is taken. As an example of such a branch instruction, a branch instruction may be used at the end of an instruction loop to indicate that the program should return to the beginning of that instruction loop if certain conditions are met (for example the required number of iterations has not been performed). Clearly, if the prefetch unit continues merely to fetch instructions based on the incremented PC value, then the execution unit will not have the required instructions once the branch instruction has been executed, and the branch has been taken.
Accordingly, it is common to provide such a prefetch unit with prediction logic aimed at predicting for a prefetched branch instruction whether the branch will be taken or not taken, and to cause the prefetch unit to prefetch further instructions based on that prediction. Such prediction as to whether a branch instruction will be taken or not taken is often referred to as “direction” prediction.
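By way of illustration only, the effect of a direction prediction on the prefetch address can be sketched as follows. The function name, its parameters, and the fixed 4-byte instruction width are assumptions introduced for this sketch and are not taken from any particular apparatus.

```python
INSTRUCTION_BYTES = 4  # assumed fixed instruction width; not specified in the text


def next_fetch_address(pc, is_branch, predicted_taken, branch_target):
    """Return the address the prefetch unit would fetch from next.

    Hypothetical helper: a predicted-taken branch redirects the fetch to
    the branch target, otherwise the incrementer output (pc + width) is used.
    """
    if is_branch and predicted_taken:
        return branch_target
    return pc + INSTRUCTION_BYTES
```

For a loop-closing branch at address 0x1010 that targets 0x1000, a "taken" prediction causes prefetching to resume at 0x1000, whereas a "not taken" prediction lets the incrementer continue to 0x1014.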
The article “The YAGS Branch Prediction Scheme” by A N Eden and T Mudge, in Proceedings of the 31st ACM/IEEE International Symposium on Microarchitecture, pages 69-77, 1998, describes a number of known branch prediction schemes, and introduces another scheme referred to as “Yet Another Global Scheme (YAGS)”. As acknowledged in the introduction of that article, a significant problem which reduces prediction rate in known global schemes is aliasing between two indices (an index is typically formed from history information and sometimes combined with certain address bits of an instruction) that map to the same entry in a Pattern History Table (PHT). The PHT will store information identifying for a particular index whether that index relates to a branch instruction which should be predicted as “taken” or “not taken”. Accordingly, two aliased indices that relate to two different branch instructions whose corresponding behaviour (i.e. whether they are to be predicted as taken or not taken) is the same will not result in mispredictions when referencing the PHT. This situation is defined as neutral aliasing in the article. On the other hand, two aliased indices relating to two different branch instructions with different behaviour are likely to give rise to mispredictions when referencing the PHT. This situation is defined as destructive aliasing in the article. The YAGS technique is aimed at reducing the likelihood of such destructive aliasing.
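The distinction between neutral and destructive aliasing can be illustrated with a minimal sketch. The 4-bit table size, the XOR-based index, and the helper names below are assumptions chosen for illustration; the article does not prescribe them at this point.

```python
PHT_BITS = 4  # assumed: a 16-entry Pattern History Table


def pht_index(address, history):
    """Form a PHT index from address bits and history (XOR is an assumption)."""
    return (address ^ history) & ((1 << PHT_BITS) - 1)


def classify_aliasing(branch_a, branch_b):
    """Classify two branches, each given as (address, history, usually_taken).

    Returns "none" when the branches use different PHT entries, "neutral"
    when they share an entry and behave alike, and "destructive" when they
    share an entry but behave differently.
    """
    addr_a, hist_a, taken_a = branch_a
    addr_b, hist_b, taken_b = branch_b
    if pht_index(addr_a, hist_a) != pht_index(addr_b, hist_b):
        return "none"
    return "neutral" if taken_a == taken_b else "destructive"
```

Under these assumptions, branches at addresses 0x13 and 0x23 with the same history share one entry, so they alias destructively if one is usually taken and the other is usually not taken.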
FIG. 1 is a block diagram schematically illustrating the YAGS technique. History information is stored within a history buffer 30, identifying for a number of preceding branch instructions, whether those branch instructions were taken or not taken. This history information is input to index generation logic 10, which is also arranged to receive the address 20 of an instruction whose branch behaviour is to be predicted. The index generation logic 10 is arranged to perform some logical combination of the address 20 and the history information from the history buffer 30 in order to generate an index output over path 12 to the two caches 50, 60. Typically, the index generation logic is arranged to perform an “exclusive OR” function upon the input address and history information in order to generate the index output over path 12.
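The exclusive OR combination performed by the index generation logic 10 can be sketched as below. The 10-bit index width, the discarding of the two lowest address bits (assuming 4-byte aligned instructions), and the shift-register form of the history buffer are all assumptions made for this sketch.

```python
INDEX_BITS = 10  # assumed width; the text does not give a table size


def make_index(address, history, index_bits=INDEX_BITS):
    """XOR low-order address bits with the history bits to form an index."""
    mask = (1 << index_bits) - 1
    # Assumption: instructions are 4-byte aligned, so the two lowest
    # address bits carry no information and are dropped first.
    return ((address >> 2) ^ history) & mask


def update_history(history, taken, history_bits=INDEX_BITS):
    """Shift the newest taken/not-taken outcome into the history value."""
    mask = (1 << history_bits) - 1
    return ((history << 1) | int(taken)) & mask
```

Because the history value changes after every branch, the same branch address can select different cache entries depending on the outcomes of the branches that preceded it.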
In addition, a portion of the address 20 is used to index the choice PHT 40 over path 42, which results in the contents of the entry identified by that index being output over path 44 to the multiplexer 90. Each entry in the choice PHT stores data indicating whether an instruction whose address portion indexes that entry should be predicted as taken or not taken. In the example illustrated in FIG. 1, this taken/not taken information takes the form of a two-bit counter (2bc).
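A two-bit counter of this kind is commonly implemented as a saturating counter. The sketch below assumes the usual encoding in which values 0 and 1 are read as "not taken" and values 2 and 3 as "taken"; the encoding and the function names are assumptions, since the text does not define them.

```python
def predict_taken(counter):
    """Read a two-bit counter (0..3) as a taken/not-taken prediction."""
    return counter >= 2


def update_counter(counter, taken):
    """Move the counter towards the actual outcome, saturating at 0 and 3."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)
```

With this encoding a strongly "taken" counter (value 3) still predicts "taken" after a single not-taken outcome, and only changes its prediction after a second one.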
The output from the choice PHT 40 is not only routed to the multiplexer 90, but is also input to the multiplexers 70 and 80 to control the selection of their inputs, in the manner described in more detail below.
Each of the caches 50, 60 includes a plurality of entries, with each entry containing both a TAG portion (a number of address bits) and a two-bit counter value. The cache 50 is referred to as the Taken cache, or T cache, whilst the cache 60 is referred to as the Not Taken cache or NT cache. Both the T cache 50 and the NT cache 60 are used to identify exceptions to the predictions produced by the choice PHT 40. More particularly, if the choice PHT 40 produces for a particular index a “not taken” prediction, then the T cache 50 is referenced to determine whether the actual instruction address being evaluated corresponds to a branch instruction which is an exception to that prediction, i.e. is an instruction which should be predicted as “taken”. Similarly, if the choice PHT 40 outputs for a particular index a “taken” prediction, then the NT cache 60 is referenced to determine whether the instruction address being evaluated corresponds to a branch instruction which should actually be predicted as “not taken”. This processing is performed as follows.
As can be seen from FIG. 1, both the T cache 50 and the NT cache 60 are referenced by the index signal output over path 12 from the index generation logic 10, this causing each cache to output the contents of the entry referenced by that index. The two-bit counter value from the indexed entry of the T cache and the two-bit counter value from the indexed entry of the NT cache are routed to multiplexer 80, whilst the TAG portion of the indexed entry of the T cache is routed to the comparison logic 55 and the TAG portion of the indexed entry of the NT cache 60 is routed to the comparison logic 65. Both the comparison logic 55 and the comparison logic 65 receive a predetermined address portion of the address 20 over path 57. Each comparison logic 55, 65 then outputs a signal to the multiplexer 70 indicative of whether there is a match between its two inputs.
Which of the two inputs to the multiplexer 70 is selected for output as a control signal to the multiplexer 90 will depend upon the output from the choice PHT 40 over path 44. Hence, as an example, if the signal output over path 44 from the choice PHT 40 indicates that the branch should be predicted as taken, then the multiplexer 70 is arranged to output as a control signal to the multiplexer 90 the signal received from the comparison logic 65, this signal indicating whether there has been a hit detected for the entry read out of the NT cache 60. Similarly, in this instance, the multiplexer 80 will output as one of the inputs to the multiplexer 90 the two-bit counter value from the entry read out of the NT cache 60. If the signal output by the multiplexer 70 indicates that there has been a hit in the NT cache, then this signal will cause the multiplexer 90 to select as the prediction output therefrom the two-bit counter value provided from the multiplexer 80, i.e. the two-bit counter value from the relevant entry of the NT cache 60.
It can be seen that an analogous process takes place if the signal on path 44 indicates that the branch should be predicted as not taken, but this time the T cache 50 and comparison logic 55 are referenced instead of the NT cache 60 and comparison logic 65. Hence if the comparison logic 55 indicates that a hit has occurred in the T cache 50 this will cause the multiplexer 90 to output the two-bit counter value from the T cache 50 entry in preference to the signal output by the choice PHT 40.
However, in the absence of a hit in the relevant cache 50, 60, then the multiplexer 90 is arranged to output as the prediction signal the output signal from the choice PHT 40.
Hence, it can be seen that the YAGS technique stores for each branch a bias, i.e. the signal output by the choice PHT 40, and the instances when that branch does not agree with the bias, i.e. the signal output from the T cache 50 or NT cache 60 as appropriate. It has been found that the use of the T cache 50 and NT cache 60, and the TAG values stored therein (which typically will consist of a certain number of least significant bits of the branch address), reduces the above-described destructive aliasing between two consecutive branches.
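The lookup path of FIG. 1 described above can be sketched in software as follows. This is an illustrative model only: the class and field names, the table sizes, the use of -1 to mark an unused entry, and the counter encoding (2 and 3 read as "taken") are assumptions, and no update policy is modelled because none is described above.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    tag: int = -1      # -1 marks an unused entry (assumption)
    counter: int = 0   # two-bit counter, 0..3; 2 and 3 are read as "taken"


class YagsLookup:
    """Hypothetical model of the FIG. 1 prediction path (lookup only)."""

    def __init__(self, choice_bits=4, cache_bits=3, tag_bits=6):
        self.choice_mask = (1 << choice_bits) - 1
        self.cache_mask = (1 << cache_bits) - 1
        self.tag_mask = (1 << tag_bits) - 1
        self.choice_pht = [1] * (1 << choice_bits)   # choice PHT 40
        self.t_cache = [CacheEntry() for _ in range(1 << cache_bits)]   # T cache 50
        self.nt_cache = [CacheEntry() for _ in range(1 << cache_bits)]  # NT cache 60
        self.history = 0                             # history buffer 30

    def cache_index(self, address):
        # Index generation logic 10: XOR of address and history (path 12).
        return (address ^ self.history) & self.cache_mask

    def predict(self, address):
        # Choice PHT 40 is indexed by an address portion (paths 42 and 44).
        choice = self.choice_pht[address & self.choice_mask]
        choice_taken = choice >= 2
        # Multiplexers 70 and 80: the choice output selects the cache that
        # holds exceptions to the bias (NT cache if the bias is "taken").
        exception_cache = self.nt_cache if choice_taken else self.t_cache
        entry = exception_cache[self.cache_index(address)]
        # Comparison logic 55/65: compare the stored TAG with the low-order
        # address bits supplied over path 57.
        hit = entry.tag == (address & self.tag_mask)
        # Multiplexer 90: on a hit use the cache counter, else the choice PHT.
        counter = entry.counter if hit else choice
        return counter >= 2
```

In this model, a branch whose choice PHT entry indicates "taken" is predicted as taken unless the NT cache holds an entry with a matching TAG, in which case the counter of that entry determines the prediction.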