1. Field of the Invention
This invention relates to the field of data processing systems. More particularly, this invention relates to the field of predicting branch instructions in data processing.
2. Description of the Prior Art
A data processing apparatus will typically include a processor core for executing instructions. Typically, a prefetch unit will be provided for prefetching instructions from memory that are required by the processor core, with the aim of ensuring that the processor core has a steady stream of instructions to execute, thereby aiming to maximise the performance of the processor core.
To assist the prefetch unit in its task of retrieving instructions for the processor core, prediction logic is often provided for predicting which instruction should be prefetched by the prefetch unit. Such prediction logic is useful because the instructions to be executed are often not stored in memory one after another: software execution frequently involves changes in instruction flow that cause the processor core to move between different sections of code depending on the task being executed.
When executing software, a change in instruction flow typically occurs as a result of a “branch”, which causes the instruction flow to jump to a particular section of code specified by the branch's target address. The branch can optionally specify a return address to be used once the section of code branched to has been executed.
Accordingly, the prediction logic can take the form of a branch prediction unit, which predicts whether a branch will be taken. If the branch prediction unit predicts that a branch will be taken, it instructs the prefetch unit to retrieve the instruction specified by the branch's target address. If the prediction is accurate, this increases the performance of the processor core, since the core does not need to stall its execution flow while that instruction is retrieved from memory. Typically, a record is kept of the address of the instruction that would be required if the prediction were wrong, so that if the processor core subsequently determines that the prediction was wrong, the prefetch unit can then retrieve the required instruction.
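The predict-then-recover behaviour described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the mechanism of any particular processor: the names `Prediction`, `predict` and `resolve` are hypothetical, and fixed-length 4-byte instructions are assumed only so that the fall-through address can be computed.

```python
from dataclasses import dataclass
from typing import Optional

INSTR_SIZE = 4  # assumption: fixed-length 4-byte instructions


@dataclass
class Prediction:
    predicted_taken: bool
    next_fetch: int     # address the prefetch unit retrieves next
    recovery_addr: int  # address recorded in case the prediction is wrong


def predict(pc: int, target: int, predicted_taken: bool) -> Prediction:
    """Pick the next fetch address and remember the path not followed."""
    fall_through = pc + INSTR_SIZE
    if predicted_taken:
        return Prediction(True, next_fetch=target, recovery_addr=fall_through)
    return Prediction(False, next_fetch=fall_through, recovery_addr=target)


def resolve(pred: Prediction, actually_taken: bool) -> Optional[int]:
    """Called once the core resolves the branch.

    Returns None if the prediction was right, otherwise the recorded
    address from which the prefetch unit must refetch.
    """
    if actually_taken == pred.predicted_taken:
        return None
    return pred.recovery_addr
```

For a branch at 0x100 predicted taken to 0x200, the prefetch unit fetches from 0x200 and records 0x104; if the core later finds the branch was not taken, `resolve` returns 0x104 as the refetch address.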
Branch prediction logic has been used in conjunction with branch target address caches (BTACs). To improve branch prediction success rates, dynamic branch prediction can be performed, which uses historical information about the outcomes of previous branch instructions to predict the outcome of the current one. This historical information is typically stored in a BTAC, which the prediction logic accesses to determine whether or not a branch should be taken.
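As one illustration of dynamic prediction from a BTAC, the sketch below keeps a tagged, direct-mapped table in which each entry holds a branch target and a 2-bit saturating counter as its history. The 2-bit counter scheme, the table size, the indexing and the names `BTAC`, `BTACEntry`, `lookup` and `update` are all assumptions chosen for illustration; the text above does not specify what historical information the BTAC stores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BTACEntry:
    tag: int      # full branch address, used to confirm a hit
    target: int   # branch target address
    counter: int  # 2-bit saturating history: 0-1 predict not taken, 2-3 taken


class BTAC:
    def __init__(self, num_entries: int = 64) -> None:
        self.num_entries = num_entries
        self.entries: List[Optional[BTACEntry]] = [None] * num_entries

    def _index(self, pc: int) -> int:
        # Assumes 4-byte aligned instructions, so the low two bits are dropped.
        return (pc // 4) % self.num_entries

    def lookup(self, pc: int) -> Optional[int]:
        """Return the predicted target if the branch is predicted taken, else None."""
        entry = self.entries[self._index(pc)]
        if entry is None or entry.tag != pc:
            return None  # no history for this address: fetch sequentially
        return entry.target if entry.counter >= 2 else None

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Record the resolved outcome of the branch at pc."""
        idx = self._index(pc)
        entry = self.entries[idx]
        if entry is None or entry.tag != pc:
            if taken:  # allocate (or replace) an entry only for taken branches
                self.entries[idx] = BTACEntry(tag=pc, target=target, counter=2)
            return
        entry.target = target
        if taken:
            entry.counter = min(3, entry.counter + 1)
        else:
            entry.counter = max(0, entry.counter - 1)
```

The counter gives the prediction some hysteresis: once a branch has been taken repeatedly, a single not-taken outcome does not immediately flip the prediction.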
Typically in such systems, the program fetch unit (PFU) uses the program counter to access the instruction in the instruction cache (Icache) and, at the same time, accesses the BTAC to see whether there is an entry corresponding to that instruction. If the fetched instruction is a branch instruction, the processor awaits the result of the BTAC lookup to predict whether or not to branch. Such systems have some latency, as these lookups take a finite amount of time. Typical systems have a two-cycle latency, so two cycles elapse before the information from the BTAC is available and branch prediction for the retrieved instruction can be performed. In some systems, buffers have been used to store fetched instructions and their branch predictions so that this wait does not manifest as bubbles in the pipeline. In this way the bubbles can be hidden and a continuous flow of instructions can be provided to the pipeline.
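The effect of such a buffer can be modelled as a FIFO between the fetch unit and the pipeline. In this hypothetical sketch (`issue_from_buffer` is an illustrative name), the pipeline takes one instruction per cycle and an empty delivery represents a cycle lost to the BTAC latency. The fetch width used in the example is an assumption, since the text does not specify one, and the branch predictions that would travel through the buffer alongside each instruction are omitted for brevity.

```python
from collections import deque
from typing import Deque, List, Optional


def issue_from_buffer(deliveries: List[List[str]]) -> List[Optional[str]]:
    """Model a fetch buffer between the fetch unit and the pipeline.

    deliveries[c] lists the useful instructions the fetch unit writes into
    the buffer in cycle c; an empty list is a fetch bubble. The pipeline
    takes one instruction per cycle, and None in the result marks a bubble
    that actually reaches the pipeline.
    """
    buf: Deque[str] = deque()
    issued: List[Optional[str]] = []
    for fetched in deliveries:
        buf.extend(fetched)
        issued.append(buf.popleft() if buf else None)
    while buf:  # drain whatever is left once fetching stops
        issued.append(buf.popleft())
    return issued
```

Assuming two-wide delivery, `issue_from_buffer([["I0", "I1"], ["I2", "I3"], [], ["I4", "I5"]])` keeps enough instructions in reserve to cover the empty cycle, so the pipeline receives I0 to I5 without a gap. With single-wide delivery, `[["I0"], ["I1"], [], ["I2"]]`, the buffer is empty when the bubble arrives and the pipeline sees `None` in the third cycle: in this model a buffer only hides bubbles if it already holds instructions.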
FIG. 1 shows a timing diagram of the instruction accesses of a system with a two-cycle latency. As shown, a bubble is introduced due to the latency, although this can be removed by using buffers to store instructions before sending them to the pipeline. A further disadvantage of such a system is that, because it is not known that instruction A+1 branches to B until two cycles after the access to instruction A+1 is initiated, the access to instruction A+2 has already been initiated before it is known that A+2 is not required. Thus an unnecessary Icache access is made, which wastes power. In systems with a latency of more than two cycles, further unnecessary accesses will be made.
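The wasted access in FIG. 1 can be reproduced with a small cycle-level model. The sketch below is an illustration under stated assumptions, not a model of any specific design: one Icache access and the matching BTAC lookup are assumed to be initiated every cycle, addresses are counted in instructions (so A+1 is simply A plus one), and the function name `simulate_fetch` is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def simulate_fetch(start: int, taken: Dict[int, int], latency: int,
                   num_cycles: int) -> List[Tuple[int, int, bool]]:
    """Cycle-level sketch of Icache accesses with a BTAC of the given latency.

    One access (and the matching BTAC lookup) is initiated per cycle.
    `taken` maps the address of each taken branch to its target.
    Returns (cycle, address, needed) for every access made.
    """
    accesses: List[Tuple[int, int]] = []  # (cycle, address); index == cycle
    wrong_path: Set[int] = set()          # indices of unneeded accesses
    pc = start
    for cycle in range(num_cycles):
        issued = cycle - latency          # access whose BTAC result arrives now
        if issued >= 0 and issued not in wrong_path:
            _, addr = accesses[issued]
            if addr in taken:
                # Every access initiated after the branch was on the wrong path.
                wrong_path.update(range(issued + 1, cycle))
                pc = taken[addr]
        accesses.append((cycle, pc))
        pc += 1
    return [(c, a, i not in wrong_path) for i, (c, a) in enumerate(accesses)]
```

With A = 100 and a taken branch from A+1 = 101 to B = 200, `simulate_fetch(100, {101: 200}, latency=2, num_cycles=5)` marks only the access to 102 (A+2) as unneeded. Raising `latency` to 3 (over six cycles) marks both 102 and 103 as unneeded; in this model each taken branch costs `latency - 1` wasted Icache accesses.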