Microprocessors often employ pipelining to enhance performance. Within a pipelined microprocessor, the functional units necessary for executing different stages of an instruction operate simultaneously on multiple instructions, achieving a degree of parallelism that increases performance over non-pipelined microprocessors. As an example, an instruction fetch unit, an instruction decode unit and an instruction execution unit may operate simultaneously. During one clock cycle, the instruction execution unit executes a first instruction while the instruction decode unit decodes a second instruction and the fetch unit fetches a third instruction. During a next clock cycle, the execution unit executes the newly decoded instruction while the decode unit decodes the newly fetched instruction and the fetch unit fetches yet another instruction. In this manner, neither the fetch unit nor the decode unit needs to wait for the instruction execution unit to execute the last instruction before processing new instructions. In state-of-the-art microprocessors, the steps necessary to fetch and execute an instruction are sub-divided into a larger number of stages to achieve a deeper degree of pipelining.
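The overlap described above can be illustrated with a minimal sketch of an idealized three-stage pipeline. The function name `simulate_pipeline` and the instruction labels are hypothetical, and the model assumes one cycle per stage with no stalls or branches.

```python
def simulate_pipeline(instructions):
    """Return a per-cycle trace of an idealized fetch/decode/execute pipeline.

    Assumption: every stage takes exactly one clock cycle and nothing stalls.
    """
    stages = ["fetch", "decode", "execute"]
    # The last instruction needs len(stages) - 1 extra cycles to drain.
    total_cycles = len(instructions) + len(stages) - 1
    trace = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            # The instruction in a given stage entered the pipeline `depth` cycles ago.
            idx = cycle - depth
            row[stage] = instructions[idx] if 0 <= idx < len(instructions) else None
        trace.append(row)
    return trace


trace = simulate_pipeline(["I1", "I2", "I3", "I4"])
for cycle, row in enumerate(trace, start=1):
    print(cycle, row)
```

In the third cycle of this trace, I1 is executed while I2 is decoded and I3 is fetched. Under these assumptions the four instructions complete in six cycles, whereas running each instruction through all three stages before starting the next would take twelve.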
A pipelined CPU operates most efficiently when the instructions are executed in the sequence in which they appear in memory. Unfortunately, this is typically not the case. Rather, computer programs typically include a large number of branch instructions, which, upon execution, may cause instructions to be executed in a sequence other than as set forth in memory. More specifically, when a branch instruction is encountered in the program flow, execution either continues with the next sequential instruction from memory or jumps to an instruction specified at a "branch target" address. Typically, the branch specified by the instruction is said to be "Taken" if execution jumps and "Not Taken" if execution continues with the next sequential instruction from memory.
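The Taken and Not Taken outcomes can be expressed as a choice of the next instruction address. The following is a minimal sketch; the function name, its parameters and the fixed instruction size are assumptions made for illustration only.

```python
def next_instruction_address(pc, instruction_size, is_branch=False,
                             taken=False, branch_target=None):
    """Return the address of the next instruction to execute.

    Assumption: fixed-size instructions, so the sequential successor
    is always pc + instruction_size.
    """
    if is_branch and taken:
        # Taken: execution jumps to the branch target address.
        return branch_target
    # Non-branch instruction, or branch Not Taken: fall through sequentially.
    return pc + instruction_size


sequential = next_instruction_address(0x1000, 4)
not_taken = next_instruction_address(0x1000, 4, is_branch=True, taken=False,
                                     branch_target=0x2000)
taken = next_instruction_address(0x1000, 4, is_branch=True, taken=True,
                                 branch_target=0x2000)
```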
Branch instructions are either unconditional, meaning the branch is taken every time the instruction is executed, or conditional, meaning the branch is taken or not depending upon a condition. Instructions to be executed following a conditional branch are not known with certainty until the condition upon which the branch depends is resolved. However, rather than wait until the condition is resolved, state-of-the-art microprocessors may perform a branch prediction, whereby the microprocessor tries to determine whether the branch will be Taken or Not Taken and, if Taken, to predict or otherwise determine the target address for the branch. If the branch instruction is predicted to be Taken, the microprocessor fetches and speculatively executes the instruction found at the predicted branch target address. The instructions executed following the branch prediction are "speculative" because the microprocessor does not yet know whether the prediction will prove correct. Accordingly, any operations performed by the speculative instructions cannot be fully completed. For example, if a memory write operation is performed speculatively, the write operation cannot be forwarded to external memory until all previous branch conditions are resolved; otherwise, the instruction may improperly alter the contents of the memory based on a mispredicted branch. If the branch prediction is ultimately determined to be correct, the speculatively executed instructions are retired or otherwise committed to a permanent architectural state. In the case of a memory write, the write operation is then forwarded to external memory. If the branch prediction is ultimately found to be incorrect, then any speculatively executed instructions following the mispredicted branch are typically flushed from the system. For the memory write example, the write is not forwarded to external memory, but instead is discarded.
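The treatment of speculative memory writes described above can be sketched as a buffer that holds writes until the branch is resolved. This is a simplified illustration, not a description of any particular microprocessor: the class `SpeculativeStoreBuffer` and its methods are hypothetical, a dictionary stands in for external memory, and only a single outstanding branch is modeled.

```python
class SpeculativeStoreBuffer:
    """Holds speculative writes until the pending branch is resolved.

    Assumption: at most one unresolved branch at a time.
    """

    def __init__(self):
        self.memory = {}   # stands in for external memory
        self.pending = []  # speculative writes, oldest first

    def speculative_write(self, address, value):
        # The write is recorded but not yet visible in external memory.
        self.pending.append((address, value))

    def resolve_branch(self, prediction_correct):
        if prediction_correct:
            # Commit: forward the buffered writes to memory in program order.
            for address, value in self.pending:
                self.memory[address] = value
        # On a misprediction the buffered writes are simply discarded.
        self.pending.clear()


buf = SpeculativeStoreBuffer()
buf.speculative_write(0x100, 7)
buf.resolve_branch(prediction_correct=False)  # mispredicted: write is flushed
after_flush = dict(buf.memory)

buf.speculative_write(0x100, 9)
buf.resolve_branch(prediction_correct=True)   # correct: write is committed
after_commit = dict(buf.memory)
```

After the mispredicted branch, memory is unchanged. After the correctly predicted branch, address 0x100 holds the value 9.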
As can be appreciated, when a branch prediction is correct, a considerable improvement in processor performance is gained. If the branch prediction is incorrect, the microprocessor is no worse off than if it had initially waited until resolution of the branch condition.
A wide variety of techniques have been developed for performing branch prediction. Typically, various tables are provided for storing a history of previous branch executions or branch predictions along with indications of whether the branch predictions were proven to be correct or not. Predictions are made for newly encountered branches by evaluating the history of prior branch executions or prior branch predictions. In some microprocessors, the logic for performing branch predictions is quite complex and time-consuming. It is, however, desirable to perform the branch prediction as quickly as possible. Ideally, a branch prediction is performed using the same number of clock cycles as it takes to fetch an instruction, thereby ensuring that the instruction cache need not be stalled while waiting for instruction pointers corresponding to predicted branches.
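One commonly used form of history table keeps a two-bit saturating counter per entry. The text above does not specify any particular scheme, so the following is only an illustrative sketch: the class `TwoBitPredictor`, the table size, the modulo indexing and the initial counter value are all assumptions.

```python
class TwoBitPredictor:
    """History table of two-bit saturating counters indexed by branch address.

    Counter values 0 and 1 predict Not Taken; values 2 and 3 predict Taken.
    """

    def __init__(self, table_size=1024):
        self.table_size = table_size
        # Assumption: every entry starts in the weakly Not Taken state.
        self.counters = [1] * table_size

    def _index(self, branch_address):
        # Simple modulo indexing; distinct branches may share an entry.
        return branch_address % self.table_size

    def predict(self, branch_address):
        """Return True for a Taken prediction, False for Not Taken."""
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        """Record the actual outcome once the branch condition is resolved."""
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


# A loop branch that is Taken nine times and then falls through once.
predictor = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for actual in outcomes:
    if predictor.predict(0x40) == actual:
        correct += 1
    predictor.update(0x40, actual)
print(correct)  # 8 of the 10 outcomes are predicted correctly
```

In this run the predictor misses only the first iteration, before any history exists, and the final loop exit. A table lookup of this kind is simple enough to be a plausible candidate for completing within the instruction fetch time, although actual timing depends on the implementation.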
Advanced compiler techniques now in use reorder the instructions of a program to move non-branch instructions forward in the execution so that instructions can be executed without instruction flow changes. This reordering results in multiple branch instructions being scheduled together at the bottom of an instruction scheduling block. Prior art branch prediction systems require each branch instruction to be separately predicted and fetched, which requires multiple clock cycles. Therefore, to improve performance, it is desirable for the system to be able to predict the outcome of multiple branch instructions simultaneously, instead of sequentially.