Superscalar microprocessors include a plurality of execution units that execute the microinstruction set of the microprocessor, so that multiple instructions can execute in parallel per clock cycle. A key to realizing the potential performance gain is to keep the execution units supplied with instructions to execute; otherwise, superscalar performance is no better than scalar, yet it incurs a much greater hardware cost. The execution units, for example, load and store instruction operands, calculate addresses, perform logical and arithmetic operations, and resolve branch instructions. The greater the number and variety of execution units, the farther into the program instruction stream the processor must be able to look to find an instruction for each execution unit to execute each clock cycle. This is commonly referred to as the lookahead capability of the processor.
One way to improve the lookahead capability is to allow instructions to execute out of their program order; a microprocessor that does so is commonly referred to as an out-of-order execution microprocessor. Although instructions can execute out of order, the architecture of most microprocessors requires that instructions be retired in program order. That is, the architectural state of the microprocessor affected by an instruction result may be updated only in program order.
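The distinction between out-of-order completion and in-order retirement can be illustrated with a minimal sketch. It assumes a reorder-buffer-style queue, which is one common mechanism but is not named in the text above; the class `ReorderBuffer` and its methods are hypothetical names chosen for illustration only.

```python
from collections import deque


class ReorderBuffer:
    """Toy model (an assumption, not a described design): instructions may
    complete in any order but retire strictly in program order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction sits at the left end
        self.retired = []       # instruction names in retirement order

    def dispatch(self, name):
        # Allocate an entry in program order; its result is not yet available.
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        # An execution unit finished this instruction, possibly out of order.
        entry["done"] = True

    def retire(self):
        # Retire only from the head, and only while the oldest entry is done,
        # so architectural state is updated in program order.
        while self.entries and self.entries[0]["done"]:
            self.retired.append(self.entries.popleft()["name"])


rob = ReorderBuffer()
a, b, c = (rob.dispatch(n) for n in ("A", "B", "C"))

rob.complete(c)                       # the youngest instruction finishes first
rob.retire()
retired_after_c = list(rob.retired)   # nothing retires: A is still pending

rob.complete(a)
rob.retire()
retired_after_a = list(rob.retired)   # only A retires: B blocks C

rob.complete(b)
rob.retire()
retired_final = list(rob.retired)     # B and then C retire, in program order
```

In this sketch, instruction C finishes executing first but cannot retire until A and B, which precede it in program order, have retired.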
Out-of-order execution in-order retire microprocessors typically include a relatively large number of pipeline stages, sometimes referred to as super-pipelining. One reason a microprocessor may have a relatively large number of pipeline stages is that its instruction set architecture allows instructions to be variable length. Variable length instructions typically require a relatively large number of pipeline stages at the top of the pipeline to parse the stream of undifferentiated instruction bytes into distinct instructions and, commonly, to translate the parsed instructions into microinstructions.
The detrimental impact of taken branch instructions on performance in a super-pipelined microprocessor is well known in the art of microprocessor design, as are the performance benefits of branch prediction. More specifically, the larger the number of pipeline stages between the stage that fetches instructions (in response to a branch predictor providing a predicted branch target address) and the stage that causes the fetcher to begin fetching at a resolved target address different from the predicted target address, the larger the penalty associated with branch misprediction.
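The relationship between pipeline depth and misprediction penalty can be made concrete with a simplified first-order model. The sketch below assumes that the penalty in clock cycles equals the number of stages between the fetch stage and the stage that redirects the fetcher, and that each misprediction costs that full penalty. The function names, stage numbers, and rates are illustrative assumptions and do not describe any particular microprocessor.

```python
def misprediction_penalty(fetch_stage, resolve_stage):
    # Assumed model: every stage between fetch and resolution holds work
    # that must be discarded when the predicted target proves wrong.
    return resolve_stage - fetch_stage


def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    # Average cycles per instruction, adding the expected stall cycles
    # contributed by mispredicted branches.
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Illustrative (hypothetical) numbers: 20% of instructions are branches,
# 5% of those are mispredicted, and the base CPI is 0.5.
short_penalty = misprediction_penalty(fetch_stage=1, resolve_stage=11)
long_penalty = misprediction_penalty(fetch_stage=1, resolve_stage=21)

cpi_short = effective_cpi(0.5, 0.20, 0.05, short_penalty)
cpi_long = effective_cpi(0.5, 0.20, 0.05, long_penalty)
```

Under these assumed numbers, doubling the distance between the fetch stage and the resolving stage doubles the stall cycles attributable to mispredicted branches, which is why resolving branches earlier in the pipeline is attractive.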
Therefore, what is needed is a high-performance method of executing branch instructions within an out-of-order execution in-order retire microprocessor.