1. Field of the Invention
The present invention relates in general to microprocessors and, more particularly, to a system, method, and microprocessor architecture that avoids mis-steering of instruction fetches resulting from mis-speculation in an out-of-order machine.
2. Relevant Background
Basic computer processors such as microprocessors, whether complex instruction set computers (CISC), reduced instruction set computers (RISC), or hybrids, generally include a central processing unit or instruction execution unit that executes a single instruction at a time. Processors have evolved to attain improved performance, extending capabilities of the basic processors by various techniques including pipelining, superpipelining, superscaling, speculative instruction execution, and out-of-order instruction execution.
Pipelined processor architectures divide execution of a single instruction into multiple stages, corresponding to execution steps. Pipelined designs increase instruction execution rate by beginning instruction execution before a previous instruction finishes execution. Superpipelined and extended pipeline architectures further increase performance by dividing each execution pipeline into smaller stages, increasing microinstruction granularity. Superpipelining increases the number of instructions that can execute in the pipeline at one time.
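The throughput benefit of overlapping instructions can be illustrated with simple cycle arithmetic. The following is a minimal sketch that assumes an idealized pipeline in which every stage takes one clock cycle and no stalls or flushes occur; the function names are hypothetical and chosen only for illustration.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction passes through every stage before the next one starts.
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    # The first instruction needs n_stages cycles to drain through the pipeline;
    # under the no-stall assumption, each later instruction finishes one cycle
    # after its predecessor.
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


if __name__ == "__main__":
    n, k = 100, 5
    print(unpipelined_cycles(n, k), pipelined_cycles(n, k))
```

Under these assumptions the pipelined cycle count approaches one cycle per instruction as the instruction count grows, which is why stalls caused by branches and data dependencies matter so much in practice.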
Superscalar processor architectures include multiple pipelines that process instructions in parallel. Superscalar processors typically execute more than one instruction per clock cycle by executing instructions in two or more instruction execution pipelines in parallel. Each of the execution pipelines may have a different number of stages. Some pipelines may be optimized for specialized functions such as integer operations or floating point operations. Other execution pipelines are optimized for processing graphic, multimedia, or complex math instructions.
Superscalar and superpipelined processors increase performance by executing multiple instructions per cycle (IPC). Software programs can be created that exploit instruction-level parallelism (ILP) to increase IPC performance if instructions can be dispatched for execution at a sufficient rate. Unfortunately, some types of instructions inherently limit the rate of instruction dispatch. For example, branch instructions hinder instruction fetching since the branch outcome and the target address are not known with certainty. In the event of a conditional branch, both the outcome, whether taken or not taken, and the target address of the instructions following the branch must be predicted to supply those instructions for execution. In the event of an unconditional register-indirect branch, the target address of the instructions following the branch must be predicted to supply those instructions for execution.
Various branch prediction techniques have been developed that predict, with various degrees of accuracy, the outcome of branch instructions, allowing instruction fetching of subsequent instructions based on a predicted outcome. Branch prediction techniques are known that can predict branch outcomes with greater than 95% accuracy. Instructions are “speculatively executed” to allow the processor to proceed while branch resolution is pending. For a correct prediction, speculative execution results are correct results, greatly improving processor speed and efficiency. For an incorrect prediction, completed or partially completed speculative instructions are flushed from the execution pathways and execution of the correct stream of instructions is initiated.
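The background above does not name a particular prediction technique. As one well-known illustration of outcome prediction, the following minimal sketch models a table of two-bit saturating counters indexed by the low-order bits of the branch address. The class name, the table size, and the initial counter value are assumptions made for this example only.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0..3); values 2 and 3 predict taken."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # Assumption: every counter starts at 1 ("weakly not taken").
        self.counters = [1] * (1 << index_bits)

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, pc, outcomes):
    # Predict each outcome first, then train the predictor on the real result.
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # A loop-closing branch: taken nine times, then not taken once, repeated.
    outcomes = ([True] * 9 + [False]) * 10
    print(accuracy(TwoBitPredictor(), 0x40, outcomes))
```

A counter of this kind tolerates a single loop exit without flipping its prediction, so it mispredicts roughly once per loop execution rather than twice. It predicts only the branch direction; target address prediction, which the text also requires, is not modeled.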
Basic processors are generally “in-order” or “sequential” processors and execute instructions in an order determined by the compiled machine-language program running on the processor. Superscalar processors have multiple pipelines that can simultaneously process instructions but only when no data dependencies exist between the instructions in each pipeline. Data dependencies cause one or more pipelines to stall while waiting for the dependent data to become available. Superpipelined processors have additional complications because many instructions exist simultaneously in each pipeline so that the potential quantity of data dependencies is large. Out-of-order processors include multiple pipelines that process instructions in parallel and can attain greater parallelism and higher performance. Out-of-order processing generally supports instruction execution in any efficient order that exploits opportunities for parallel processing that may be provided by the instruction code.
Out-of-order processing greatly improves throughput but at the expense of increased complexity in comparison to simple sequential processors. For example, an out-of-order processor must address the complexity of recovering the processing state following an unpredicted change in instruction flow. At any time during execution many instructions may be in the execution stage, some awaiting scheduling, some executing, and some having completed execution but awaiting retirement. Processor state at the time of the change in instruction flow is to be recovered for execution to continue properly. Specifically, if a change in instruction flow occurs during execution of an instruction, preceding instructions are to proceed to retirement and following instructions are to be discarded. State recovery involves restoring the pipeline to a state that would have existed had the mispredicted instructions not been processed. A challenge for superscalar processors is state recovery following an unexpected change of instruction flow caused by internal or external events such as interrupts, exceptions, and branch instructions.
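The recovery rule described above, in which preceding instructions proceed to retirement and following instructions are discarded, can be sketched with a small program-ordered buffer. This is an illustrative model only, not the mechanism of any particular processor; the `ReorderBuffer` class, its method names, and the use of integer sequence numbers to represent program order are all assumptions made for this example.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.arch_regs = {}     # committed (architectural) register values

    def issue(self, seq, dest):
        # Instructions enter in program order; dest is None for a branch.
        self.entries.append({"seq": seq, "dest": dest, "value": None, "done": False})

    def complete(self, seq, value):
        # Execution may finish in any order; the result stays speculative.
        for entry in self.entries:
            if entry["seq"] == seq:
                entry["value"], entry["done"] = value, True

    def squash_after(self, seq):
        # Discard every instruction younger than the one that redirected flow.
        while self.entries and self.entries[-1]["seq"] > seq:
            self.entries.pop()

    def retire(self):
        # Commit completed instructions strictly in program order.
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.arch_regs[entry["dest"]] = entry["value"]
            retired.append(entry["seq"])
        return retired


if __name__ == "__main__":
    rob = ReorderBuffer()
    rob.issue(1, "r1")
    rob.issue(2, None)    # branch
    rob.issue(3, "r2")    # fetched down the predicted path
    rob.complete(3, 99)   # finishes early, out of order
    rob.complete(1, 10)
    rob.complete(2, None) # branch resolves as mispredicted
    rob.squash_after(2)
    print(rob.retire(), rob.arch_regs)
```

In this sketch the result of instruction 3 never reaches the architectural registers even though it finished executing first, because it is squashed before it can retire.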
Out-of-order execution can result in conflicts between instructions attempting to use the same registers, even for instructions that are otherwise independent. Instructions may produce two general types of actions when executed: (1) storing results that are directed to an architectural register location, and (2) setting condition codes (CCs) that are directed to one or more architectural condition code registers (CCRs). Results and CCs for an instruction that is speculatively executed cannot be stored in the architectural registers until all conditions existing prior to the instruction are resolved. Temporary storage of speculative results has previously been addressed by a technique called “register renaming” through usage of rename registers, register locations allocated for new results while the registers remain speculative. A similar technique stores the CC set by a speculatively executed instruction. One difficulty with register renaming of condition codes is that the speculative CC is stored separately from the speculative result, typically resulting in cumbersome operation and slow processor throughput to handle results and set condition codes with precision.
In register renaming, an instruction that attempts to read a value from the original register instead obtains the value of a newly allocated rename register. Hardware renames the original register identifier in the instruction to identify the new register and the correct stored value. The same register identifier in several different instructions may access different hardware registers depending on the locations of the renamed register references with respect to the register assignments. Register renaming typically uses a tracking table having entries for each register in the processor that indicate, among other things, the instruction identification and the particular instruction assigned to the register. The described register renaming method becomes unwieldy for large designs with hundreds or thousands of registers.
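The renaming step described above can be sketched as a mapping from architectural register numbers to physical (rename) registers. The following is a simplified illustration, not the tracking table of any specific design: the `rename` function, the free list, and the tuple encoding of instructions as (destination, sources) are assumptions made for this example, and the sketch never reclaims physical registers.

```python
def rename(instructions, num_arch_regs, num_phys_regs):
    """Rename (dest, srcs) instructions given in program order.

    Returns the renamed instructions and the final rename map.
    """
    # Assumption: architectural register i initially maps to physical register i.
    rename_map = list(range(num_arch_regs))
    free_list = list(range(num_arch_regs, num_phys_regs))
    renamed = []
    for dest, srcs in instructions:
        # Sources read the most recent mapping of each architectural register.
        phys_srcs = tuple(rename_map[s] for s in srcs)
        if not free_list:
            raise RuntimeError("no free rename register; a real pipeline would stall")
        # Every new result receives a freshly allocated physical register.
        phys_dest = free_list.pop(0)
        rename_map[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed, rename_map


if __name__ == "__main__":
    program = [
        (1, (2, 3)),  # r1 = r2 op r3
        (4, (1, 2)),  # r4 = r1 op r2   (reads the first r1)
        (1, (5, 6)),  # r1 = r5 op r6   (reuses r1 for an unrelated value)
    ]
    renamed, final_map = rename(program, num_arch_regs=8, num_phys_regs=16)
    for instr in renamed:
        print(instr)
```

In this example the two writes to r1 land in different physical registers, so the third instruction can execute before the second without overwriting the value the second still needs. The same register identifier therefore refers to different hardware registers depending on its position in the instruction stream.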
Processors with pipelined architectures fetch instructions far in advance of instruction execution. Control transfer instructions alter the sequence of instruction fetches. Since execution of control transfer instructions is downstream of the target instruction fetch, various techniques have been devised to predict the instruction execution path to prevent the pipeline from stalling. The predicted path, also known as the speculative path, is either committed to an architectural state or flushed, depending on the result of branch execution, also known as branch resolution.