Improving computer architecture performance is a difficult task. Improvements have been sought through frequency scaling, Single Instruction Multiple Data (SIMD), Very Long Instruction Word (VLIW), multi-threading, and multiple-processor techniques. These approaches mainly target improvements in the throughput of program execution. Many of the techniques require software to explicitly expose parallelism. In contrast, frequency scaling improves both throughput and latency without requiring software to annotate parallelism explicitly. Recently, frequency scaling has hit a power wall, so further improvements through frequency scaling are difficult. Thus, it is difficult to increase throughput unless massive parallelism is explicitly expressed in software.
With respect to single-threaded program execution, execution is controlled by branch instructions that dictate the program control flow. Program instruction sequences are dynamic when the branch instructions are conditional or the branch target is indirect. In such cases, it is essential for the fetch logic of the processor to determine, for conditional branches, whether the branch is taken or not taken. This enables the fetch logic to bring in either the sequence of instructions that follows the target of the branch or the sequence that follows the branch instruction itself. There exists a problem, however, in that at the fetch stage the outcome of the branch condition is not known until the branch itself executes.
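The choice facing the fetch logic can be illustrated with a minimal sketch. The names `Branch`, `next_fetch_address`, and `INSTRUCTION_SIZE` are hypothetical, and the fixed 4-byte instruction width is an assumption made only for illustration; none of them is taken from a particular design.

```python
from dataclasses import dataclass
from typing import Optional

INSTRUCTION_SIZE = 4  # assumption: fixed-width instructions, in bytes


@dataclass
class Branch:
    pc: int                # address of the branch instruction
    target: Optional[int]  # statically known target, or None if indirect
    conditional: bool      # True if the branch depends on a condition


def next_fetch_address(branch: Branch, taken: bool,
                       indirect_target: Optional[int] = None) -> int:
    """Return the address to fetch from after `branch`.

    `taken` and `indirect_target` are only known once the branch executes,
    which is exactly the information the fetch stage lacks.
    """
    if branch.conditional and not taken:
        # Not taken: continue with the instruction that follows the branch.
        return branch.pc + INSTRUCTION_SIZE
    if branch.target is None:
        # Indirect branch: the target comes from a register or memory.
        if indirect_target is None:
            raise ValueError("indirect branch target not yet resolved")
        return indirect_target
    # Taken direct branch: continue at the branch target.
    return branch.target
```

In this sketch, both `taken` and `indirect_target` are inputs that become available only at execution time, so a fetch stage that waited for them would stall on every such branch.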
In an attempt to overcome this problem, prior art designs have implemented branch prediction logic to predict the outcome of a branch. At the fetch stage of the microprocessor, the predicted outcome enables the fetch logic to anticipate where to bring the next sequence of instructions from. Problems still exist, however, since this processing needs to be sequential in nature. The current branch needs to be processed first in order to know where to bring the next instruction sequence from. Accordingly, the sequential nature of processing branches in the fetch stage imposes a performance bottleneck on the single-threaded execution speed of a microprocessor. Penalties for an incorrect branch prediction typically involve flushing the whole pipeline of a microprocessor, accessing caches, and reloading with a new instruction sequence. These penalties greatly reduce the incentives for predicting more than one branch at a time.
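As an illustration of sequential branch prediction, the following sketch uses a table of two-bit saturating counters, a common textbook scheme. The text above does not name a specific predictor, so this choice, the table size, and the names `TwoBitPredictor` and `count_mispredictions` are assumptions made for illustration only.

```python
from typing import Iterable, Tuple


class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, table_size: int = 16) -> None:
        self.table_size = table_size  # assumption: small direct-mapped table
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * table_size  # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        return pc % self.table_size

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor: TwoBitPredictor,
                         trace: Iterable[Tuple[int, bool]]) -> int:
    """Process (pc, taken) pairs strictly one at a time.

    Each branch is predicted and resolved before the next one is
    considered, mirroring the sequential processing described above.
    """
    mispredictions = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            # In hardware, this is where the pipeline would be flushed
            # and a new instruction sequence fetched.
            mispredictions += 1
        predictor.update(pc, taken)
    return mispredictions
```

For example, a loop branch that is taken five times and then falls through could be passed as `count_mispredictions(TwoBitPredictor(), [(8, True)] * 5 + [(8, False)])`. The loop handles one branch per iteration, and each misprediction stands in for a full flush and refetch, which is why predicting several branches at once is unattractive under this model.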