1. Field of the Invention
This invention relates to microprocessors, and more particularly, to efficient reduction in branch misprediction penalty.
2. Description of the Relevant Art
Modern processor cores, or processors, are pipelined in order to increase throughput of instructions per clock cycle. However, the throughput may be reduced by pipeline stalls, which may be caused by a branch misprediction, a cache miss, a data dependency, or another condition in which no useful work may be performed for a particular instruction during a clock cycle. Different techniques are used to fill these unproductive cycles in a pipeline with useful work. Some examples include loop unrolling by a compiler, branch prediction mechanisms within a core, and out-of-order execution within a core.
An operating system may divide a software application into processes and further divide processes into threads. A thread is a sequence of instructions that may share memory and other resources with other threads and may execute in parallel with other threads. A processor core may be constructed to execute more than one thread per clock cycle in order to increase efficient use of the hardware resources and reduce the effect of stalls on overall throughput. A microprocessor may include multiple processor cores to further increase parallel execution of multiple instructions per clock cycle.
As stated above, a processor core may comprise a branch prediction mechanism in order to continue fetching and executing subsequent instructions when the outcome of a branch instruction is not yet known. When the branch is predicted correctly, the processor core benefits from the early fetch and execution of the subsequent instructions. No corrective action is required. However, when a branch instruction is mispredicted, recovery needs to be performed. The cost, or the penalty, for this recovery may be high for modern processors.
Branch misprediction recovery comprises restoring the architectural state (i.e., internal core register state and memory state) of the processor to the architectural state at the point of the completed branch instruction. In other words, the effects of incorrectly executing the instructions subsequent to the mispredicted branch instruction need to be undone. Then instruction fetch is restarted at the correct branch target address.
The penalty for the branch misprediction, or simply the branch misprediction penalty, includes two components. The first component is the time, or the number of clock cycles, spent on speculative execution of fetched instructions within the same thread or process subsequent to the branch instruction until the branch misprediction is detected. The second component is the time, or the number of clock cycles, to restart the pipeline with the correct instructions once the branch misprediction is detected. Modern processor core designs increase both of these components with deep pipelines and with large instruction fetch, dispatch, and issue windows.
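The two components described above may be illustrated with a minimal arithmetic sketch. The function name and the cycle counts are hypothetical values chosen only for illustration; they are not measurements of any particular processor.

```python
def misprediction_penalty(detect_cycles: int, restart_cycles: int) -> int:
    """Return the total penalty in clock cycles.

    detect_cycles:  cycles spent on speculative execution until the
                    misprediction is detected (first component).
    restart_cycles: cycles needed to restart the pipeline with the
                    correct instructions (second component).
    """
    if detect_cycles < 0 or restart_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    return detect_cycles + restart_cycles


# Hypothetical numbers: a deeper pipeline lengthens both components.
shallow_penalty = misprediction_penalty(detect_cycles=5, restart_cycles=3)
deep_penalty = misprediction_penalty(detect_cycles=14, restart_cycles=8)
print(shallow_penalty, deep_penalty)
```

Under these assumed values, the deeper design pays a larger penalty per mispredicted branch because both terms of the sum grow.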
To support out-of-order execution and completion of instructions while maintaining precise interrupts, modern processors typically buffer the data results of executed instructions in a working register file (WRF). Different implementations of a WRF may include a reservation station, a future file, a reorder buffer, or another structure. When an instruction retires, that is, when it has become the oldest instruction in the processor and its execution did not result in any exceptions, its corresponding data results are transferred from the WRF to the architectural register file (ARF). For such processors, the simplest branch misprediction recovery mechanism is to wait for the mispredicted branch instruction to retire, and then flush, or clear, both the entire processor pipeline and the WRF. Afterwards, instruction fetch restarts at the correct branch target address.
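The simplest recovery mechanism described above may be sketched in software as follows. This is a simplified model under stated assumptions, not a hardware description: the WRF is modeled as a queue of entries in program order, the ARF as a dictionary, exceptions are not modeled, and the names `Entry` and `retire_step` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    dest: Optional[str]         # architectural register written, if any
    value: int = 0              # buffered data result
    done: bool = False          # execution has completed
    mispredicted: bool = False  # set for a branch resolved as mispredicted
    target: int = 0             # correct branch target address


def retire_step(wrf: deque, arf: dict) -> Optional[int]:
    """Retire the oldest entry if it has completed.

    Returns the correct fetch address when the retired entry is a
    mispredicted branch (the whole WRF is flushed), otherwise None.
    """
    if not wrf or not wrf[0].done:
        return None               # the oldest instruction is still executing
    head = wrf.popleft()
    if head.dest is not None:
        arf[head.dest] = head.value   # transfer the result from WRF to ARF
    if head.mispredicted:
        wrf.clear()               # flush every younger, speculative entry
        return head.target
    return None


wrf = deque([
    Entry("r1", 7, done=True),
    Entry(None, done=True, mispredicted=True, target=0x40),
    Entry("r2", 99, done=True),   # wrong-path instruction
])
arf = {}
first = retire_step(wrf, arf)     # retires the write to r1
second = retire_step(wrf, arf)    # retires the branch and flushes the WRF
```

In this sketch the wrong-path result for `r2` never reaches the ARF, because the flush occurs before that entry can become the oldest instruction.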
A disadvantage of the above approach is that the branch misprediction penalty may be high. A relatively large number of clock cycles may elapse before the mispredicted branch instruction is able to retire. For example, an older (earlier in program order than the mispredicted branch instruction) load instruction may require a long latency main memory access due to a cache miss, and, therefore, cause a long wait before both the load instruction and subsequently the mispredicted branch instruction are able to retire. As a result, processing of instructions beginning at the correct branch target address is delayed.
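The delay in the load example above may be illustrated numerically. The sketch below assumes in-order retirement of at most one instruction per clock cycle; the function name and the cycle values (a load completing at cycle 300, a branch resolved at cycle 12) are hypothetical.

```python
def in_order_retire_cycles(completion_cycles):
    """Return the cycle at which each instruction retires.

    completion_cycles lists, in program order, the cycle at which each
    instruction finishes execution. An instruction cannot retire before
    it completes, nor before every older instruction has retired.
    """
    retire_cycles = []
    previous = 0
    for done in completion_cycles:
        previous = max(done, previous + 1)
        retire_cycles.append(previous)
    return retire_cycles


load_done = 300      # older load misses the cache (assumed latency)
branch_done = 12     # misprediction is detected early (assumed)
retire_cycles = in_order_retire_cycles([load_done, branch_done])
extra_wait = retire_cycles[1] - branch_done
print(retire_cycles, extra_wait)
```

With these assumed values, the branch is known to be mispredicted at cycle 12 but cannot retire until cycle 301, so recovery that waits for retirement begins 289 cycles later than detection.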
Another, more complex approach is a branch misprediction recovery mechanism that selectively flushes both the processor pipeline and the WRF as soon as a branch misprediction is detected, rather than when the mispredicted branch instruction retires. Specifically, only the instructions that are younger (later in program order) than the mispredicted branch instruction are flushed from both the pipeline and the WRF. Then the mechanism restarts instruction fetch at the correct branch target address. This alternative mechanism allows the instructions at the branch target address to be processed sooner. However, this mechanism is significantly more complex. Maintaining precise interrupts in modern processors is already expensive due to deep pipelining. A large amount of hardware is typically required. Handling a branch misprediction with the above complex mechanism may further increase the amount of needed hardware, which increases both on-die area and wire route lengths. Longer wire routes in turn increase on-die noise effects and signal transmission delays, and as a result may diminish overall performance.
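The selective flush described above may be sketched as follows. This is a simplified software model, assuming the in-flight instructions are held in a list in program order (oldest first); the function name `selective_flush` and the instruction labels are hypothetical. Real hardware would additionally need to identify and clear the matching pipeline stages, which is the source of the added complexity.

```python
def selective_flush(in_flight, branch_index):
    """Split in-flight instructions around a mispredicted branch.

    in_flight:    instructions in program order, oldest first.
    branch_index: position of the mispredicted branch in that list.

    Returns (kept, flushed): the branch and all older instructions are
    kept, and only the younger instructions are flushed.
    """
    if not 0 <= branch_index < len(in_flight):
        raise IndexError("branch_index is outside the in-flight window")
    kept = in_flight[:branch_index + 1]
    flushed = in_flight[branch_index + 1:]
    return kept, flushed


in_flight = ["load", "add", "branch", "mul", "store"]
kept, flushed = selective_flush(in_flight, branch_index=2)
print(kept, flushed)
```

Here the older load and add remain in flight and may still retire normally, while the younger multiply and store, fetched down the wrong path, are discarded immediately.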
In view of the above, efficient methods and mechanisms for reducing branch misprediction penalty are desired.