1. Technical Field
This invention relates to microprocessor architecture, and in particular to methods for processing branch instructions.
2. Background Art
Advanced processors employ pipelining techniques to execute instructions at very high speeds. On such processors, the overall machine is organized as a pipeline consisting of several cascaded stages of hardware. Instruction processing is divided into a sequence of operations, and each operation is performed by hardware in a corresponding pipeline stage ("pipe stage"). Independent operations from several instructions may be processed simultaneously by different pipe stages, increasing the instruction throughput of the pipeline. Where a pipelined processor includes multiple execution resources in each pipe stage, the throughput of the processor can exceed one instruction per clock cycle.
Contemporary superscalar, deeply pipelined processors may have anywhere from 5 to 15 pipe stages and may execute operations from as many as 4 to 8 instructions simultaneously in each pipe stage. In order to make full use of a processor's instruction execution capability, the execution resources of the processor must be provided with sufficient instructions from the correct execution path. This keeps the pipeline filled with instructions that contribute to the forward progress of the program.
The presence of branch instructions poses major challenges to filling the pipeline with instructions from the correct execution path. When a branch instruction is executed and the branch condition is met, control flow of the processor is resteered to a new code sequence and the pipeline is refilled with instructions from the new code sequence. Since branch execution occurs at the back end of the pipeline, and instructions are fetched at the front end of the pipeline, several pipe stages' worth of instructions may be fetched from the wrong execution path by the time the branch is resolved. These instructions need to be flushed, causing bubbles (idle stages) in the pipeline. The processor then begins fetching instructions at the target address indicated by the branch instruction. The intervening stages of the pipeline remain empty until they are filled by instructions from the new execution path.
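The cost of resolving a branch late in the pipeline can be illustrated with a simplified model. The following sketch assumes that one group of instructions enters the front end on every clock cycle and that every pipe stage between fetch and branch resolution holds wrong-path instructions when the branch is resolved taken. The function name and its parameters are hypothetical and are used for illustration only.

```python
def wrong_path_slots(resolve_stage: int, issue_width: int = 1) -> int:
    """Estimate the instruction slots that must be flushed for one taken branch.

    Simplified model (an assumption, not a description of any particular
    processor): fetch is stage 0, the branch resolves in `resolve_stage`,
    and `issue_width` instructions enter the pipeline on every clock cycle.
    """
    if resolve_stage < 0 or issue_width < 1:
        raise ValueError("resolve_stage must be >= 0 and issue_width >= 1")
    # Each stage between fetch and resolution holds one fetch group
    # taken from the wrong execution path.
    return resolve_stage * issue_width


if __name__ == "__main__":
    # Hypothetical machine: branch resolves 8 stages after fetch, 4 wide.
    print(wrong_path_slots(resolve_stage=8, issue_width=4))
```

Under this model, the number of flushed slots grows with both pipeline depth and issue width, which is why deeper and wider machines are more sensitive to late branch resolution.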
To reduce the number of pipeline bubbles, processors incorporate branch prediction modules at the front ends of their pipelines. When a branch instruction enters the front end of the pipeline, the branch prediction module forecasts whether the branch instruction will be taken when it is executed at the back end of the pipeline. If the branch is predicted taken, the branch prediction module communicates a target address to the fetch module at the front end of the pipeline. The fetch module begins fetching instructions at the target address.
Conventional branch prediction modules employ branch target buffers (BTBs) to track the history (target address, branch direction) of branch instructions. Target addresses and branch directions (taken/not taken) are collected in the BTB as the branch instructions are processed. If a branch is resolved taken when it is first encountered, instructions beginning at its branch target address (branch target instructions) may be stored in an instruction cache for subsequent encounters. Dynamic branch prediction algorithms use the stored branch history information to predict branch outcomes on subsequent encounters. Dynamic branch prediction schemes range from relatively simple algorithms, e.g. the predicted outcome is the same as the last outcome, to complex algorithms that require substantial time and resources to execute. When the branch is subsequently encountered, the dynamic branch prediction algorithm predicts the branch direction. If the predicted branch direction is "taken", the branch target address is used to access branch target instructions in the cache, if they have not been displaced.
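The simplest scheme mentioned above, in which the predicted outcome is the same as the last outcome, may be sketched as follows. This is a minimal software model under stated assumptions, not a description of any particular hardware BTB: the class and method names are hypothetical, the buffer is fully associative, and entries are displaced in first-in, first-out order when the buffer is full.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BTBEntry:
    target: int       # branch target address recorded at resolution
    last_taken: bool  # direction of the most recent resolution


class BranchTargetBuffer:
    """Toy BTB using a last-outcome predictor (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: Dict[int, BTBEntry] = {}

    def predict(self, branch_addr: int) -> Optional[int]:
        """Return the predicted target if predicted taken, else None."""
        entry = self.entries.get(branch_addr)
        if entry is not None and entry.last_taken:
            return entry.target
        return None  # no history, or last outcome was not taken

    def update(self, branch_addr: int, taken: bool, target: int) -> None:
        """Record the resolved outcome of a branch."""
        if branch_addr not in self.entries and len(self.entries) >= self.capacity:
            # Displace the oldest entry (dicts preserve insertion order).
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[branch_addr] = BTBEntry(target=target, last_taken=taken)
```

Because `update` records every resolved branch regardless of its importance, a small buffer of this kind readily displaces history for one branch with history for another.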
There are a number of problems with the conventional approach to branch prediction. For example, the BTB typically accumulates branch history/prediction information indiscriminately for all branch instructions that are processed. A relatively large BTB is required to reduce the risk of overwriting branch history information for important branch instructions, i.e. those critical to program performance, with information for less important branch instructions. The greater size of the BTB makes it correspondingly slower, reducing the performance of branch processing operations.
The dynamic branch prediction algorithms employed by the BTB can also impact system performance. More accurate dynamic prediction algorithms tend to be more complex. They require more die area to implement, further increasing the size of the BTB, and they require more time to provide a prediction. Dynamic branch prediction algorithms also make no use of branch information available from the compiler, e.g. static prediction information. This reduces their prediction accuracy for branches that are not encountered frequently, i.e. branches that lack temporal locality. Branch history information for these branches is more likely to be displaced from the BTB before it is used.
Another problem is created by the limited availability of cache space. Branch target instructions saved to a cache for a branch that is resolved taken may be evicted before they are used if the branch is not encountered frequently. Some processors support prefetching to improve the availability of branch target instructions for important branch instructions. A prefetch instruction may be scheduled ahead of the branch instruction. The prefetch instruction triggers the processor to fetch the branch target instructions and return them to an instruction cache. When the branch instruction is subsequently encountered, the branch target instructions can be accessed from the cache using the target address provided by the BTB or the decoder. Provided the prefetch instruction is properly scheduled, it can deliver the branch target instructions to the cache before they are needed. This can improve the speed with which the processor pipeline is resteered, but it increases traffic on the processor-memory channel, and use of prefetching may be limited for this reason. Prefetching alone also does nothing to reduce the size/speed/accuracy constraints of the BTB.
The present invention addresses these and other problems associated with conventional branch processing systems.
The present invention supports efficient processing of branch operations by providing early, intelligent branch prediction information to the branch prediction system.
In accordance with the present invention, a branch operation is processed through a branch predict instruction and an associated branch instruction. The branch predict instruction indicates a target address and an instruction address for the associated branch instruction. When the branch predict instruction is detected, the target address is stored at an entry indicated by the associated branch instruction address.
For one embodiment of the invention, the branch predict instruction triggers a prefetch of the branch target instructions into an instruction cache or buffer. When the associated branch instruction is subsequently detected, the target address is read from the entry and instructions indicated by the target address are retrieved from the instruction cache.
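The interaction between the branch predict instruction and its associated branch instruction may be sketched in software as follows. This is an illustrative model only: the class name, the method names, the dictionary-based memory, the unit address stride, and the prefetch length are all assumptions made for the example and are not taken from the disclosed embodiments.

```python
from typing import Dict, List, Optional


class FrontEndModel:
    """Toy model of a front end that handles branch predict instructions."""

    def __init__(self, memory: Dict[int, str]) -> None:
        self.memory = memory                      # address -> instruction
        self.target_table: Dict[int, int] = {}    # branch address -> target address
        self.icache: Dict[int, List[str]] = {}    # target address -> prefetched instructions

    def on_branch_predict(self, branch_addr: int, target_addr: int,
                          prefetch_len: int = 4) -> None:
        """Store the target at the entry indicated by the branch address,
        then prefetch the branch target instructions."""
        self.target_table[branch_addr] = target_addr
        self.icache[target_addr] = [
            self.memory[addr]
            for addr in range(target_addr, target_addr + prefetch_len)
            if addr in self.memory
        ]

    def on_branch(self, branch_addr: int) -> Optional[List[str]]:
        """Read the stored target and return the cached target instructions,
        or None if no branch predict instruction was seen for this branch."""
        target_addr = self.target_table.get(branch_addr)
        if target_addr is None:
            return None
        return self.icache.get(target_addr)
```

In this sketch, the table is written only when a branch predict instruction is detected, so entries exist only for the branches that were explicitly identified ahead of time.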
The branch predict instruction may also include hint information for managing branch prediction information. For a hierarchical branch prediction system, hint information may indicate in which structure the information is to be stored. Hint information may also indicate whether static or dynamic information is used to predict the branch direction. In the latter case, the hint may trigger the dynamic prediction algorithm, allowing more time for the dynamic prediction algorithm to complete.
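One way to picture the hint information is as a small record carried by the branch predict instruction. The sketch below is hypothetical throughout: the field names, the two-level hierarchy, and the list used to record early-started dynamic predictions are assumptions chosen for illustration and do not describe an actual instruction encoding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Structure(Enum):
    SMALL_FAST = "small_fast"   # hypothetical first-level structure
    LARGE_SLOW = "large_slow"   # hypothetical second-level structure


class Direction(Enum):
    STATIC = "static"    # use compiler-supplied direction
    DYNAMIC = "dynamic"  # use the dynamic prediction algorithm


@dataclass(frozen=True)
class Hint:
    structure: Structure
    direction: Direction
    static_taken: bool = True  # only meaningful when direction is STATIC


class HierarchicalPredictor:
    """Toy two-level prediction system steered by hint information."""

    def __init__(self) -> None:
        self.tables: Dict[Structure, Dict[int, Tuple[int, Hint]]] = {
            Structure.SMALL_FAST: {},
            Structure.LARGE_SLOW: {},
        }
        self.dynamic_started: List[int] = []  # branches whose dynamic prediction began early

    def on_branch_predict(self, branch_addr: int, target_addr: int, hint: Hint) -> None:
        # The hint selects the structure that receives the entry.
        self.tables[hint.structure][branch_addr] = (target_addr, hint)
        if hint.direction is Direction.DYNAMIC:
            # Start the dynamic algorithm now, ahead of the branch itself.
            self.dynamic_started.append(branch_addr)
```

Starting the dynamic algorithm when the branch predict instruction is detected, rather than when the branch is fetched, is what gives a slower algorithm additional time to complete.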