1. Field of the Invention
This invention relates to the field of superscalar microprocessors and, more particularly, to branch misprediction recovery and load/store retirement.
2. Description of the Relevant Art
Superscalar microprocessors achieve high performance by executing multiple instructions simultaneously and by using the shortest possible clock cycle. As used herein, the term "clock cycle" refers to the interval of time that a superscalar microprocessor requires to complete the various tasks employed within its pipeline (for example, the processing of instructions). Two features important to high-performing superscalar microprocessors are branch prediction and out-of-order execution.
Branch prediction is the process of speculatively selecting the direction that a branch instruction will select before the branch instruction is executed. Microprocessors execute instructions sequentially: when a first instruction is executed, the second instruction to be executed is the instruction stored in memory adjacent to the first instruction. Branch instructions, however, may cause the next instruction to be executed to be either the next sequential instruction or an instruction which resides in another memory location that is specified by the branch instruction. The memory location specified by the branch instruction is typically referred to as the "target" of the branch. Which of the instructions is selected for execution typically depends on a condition that the branch instruction tests. An exemplary tested condition is the value stored in a register, wherein the branch target is selected if the register contains zero and the next sequential instruction is selected if the register does not contain zero. It is noted that some branch instructions do not test a condition. Such unconditional branches always select the target path, and typically no instructions are specifically encoded into the next sequential memory locations.
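The selection between the next sequential instruction and the branch target may be sketched as follows. This is only an illustrative model: the function name next_pc, the dictionary encoding of an instruction, and the fixed four-byte instruction size are assumptions made for the example and are not taken from any particular instruction set.

```python
def next_pc(pc, instr, regs, instr_size=4):
    """Return the address of the next instruction to execute.

    instr is a hypothetical dictionary encoding; instr_size is assumed fixed.
    """
    kind = instr["kind"]
    if kind == "jump":
        # Unconditional branch: always selects the target path.
        return instr["target"]
    if kind == "branch_if_zero":
        # Conditional branch: the tested condition is a register value.
        if regs[instr["reg"]] == 0:
            return instr["target"]
        return pc + instr_size
    # Non-branch instruction: next sequential instruction.
    return pc + instr_size


regs = {"r1": 0, "r2": 7}
beqz_r1 = {"kind": "branch_if_zero", "reg": "r1", "target": 0x200}
beqz_r2 = {"kind": "branch_if_zero", "reg": "r2", "target": 0x200}
jump = {"kind": "jump", "target": 0x300}
```

With these values, the branch testing r1 selects the target 0x200, the branch testing r2 selects the next sequential address, and the jump selects 0x300 regardless of register contents.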
Branches occur relatively frequently in computer programs. In order to continue executing large numbers of instructions simultaneously, superscalar microprocessors predict which direction (or which "path") each branch instruction will select: next sequential or target. The microprocessor then speculatively executes instructions residing on the predicted path. If the superscalar microprocessor "mispredicts" the path that a branch instruction selects, the speculatively executed results are discarded and the correct path is fetched and executed. Various branch prediction mechanisms are well known.
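As one example of a well-known prediction mechanism, a two-bit saturating counter per branch may be sketched as follows. The text above does not specify any particular predictor; the class name TwoBitPredictor, the initial counter value of 1, and the helper count_mispredictions are assumptions chosen for illustration.

```python
class TwoBitPredictor:
    """Two-bit saturating counter per branch address (illustrative only)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value in 0..3

    def predict(self, pc):
        # Counter values 2 and 3 predict the target path (True);
        # values 0 and 1 predict the next sequential path (False).
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def count_mispredictions(predictor, pc, outcomes):
    """Count how often the prediction differs from the actual outcome."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong
```

For a loop branch that selects its target nine times and then falls through once, this sketch mispredicts only the first and the last iteration.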
Because branch instruction mispredictions occur, a misprediction recovery mechanism is necessary. A misprediction recovery mechanism causes the instructions at the corrected fetch address to be fetched from the cache and dispatched to the instruction processing pipelines. The corrected fetch address is the address generated by the branch instruction for locating the next instruction to be executed. The misprediction recovery mechanism is required to complete in relatively few clock cycles, so that correct instructions are executed soon after the misprediction is determined. Typically, the clock cycles between the clock cycle in which the misprediction is discovered and the clock cycle in which the corrected instructions begin execution are idle cycles. Overall performance of the superscalar microprocessor degrades as the number of idle cycles increases.
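The effect of a recovery on the instructions in flight may be sketched as follows. This is a simplified model under stated assumptions: the function name recover and the representation of in-flight instructions as a list of tags in program order are hypothetical, and no timing is modeled.

```python
def recover(in_flight, branch_tag, corrected_address):
    """Model a misprediction recovery.

    in_flight: instruction tags in program order, oldest first (assumed).
    Returns (kept, discarded, next_fetch_address).
    """
    idx = in_flight.index(branch_tag)
    kept = in_flight[: idx + 1]        # the branch and all older instructions
    discarded = in_flight[idx + 1:]    # speculative work on the wrong path
    return kept, discarded, corrected_address


kept, discarded, fetch = recover(["i0", "b1", "i2", "i3"], "b1", 0x400)
```

Here i2 and i3 are discarded and fetching resumes at 0x400. The cycles spent before instructions from 0x400 begin executing correspond to the idle cycles described above.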
Superscalar microprocessors are evolving such that they execute larger and larger numbers of instructions simultaneously. However, branch instructions continue to occur in programs with the same frequency. Therefore, superscalar microprocessors are implementing branch prediction schemes in which multiple branch predictions may be outstanding in a given clock cycle (i.e., multiple branch paths have been predicted but have not been validated by the execution of the associated branch). With the possibility of multiple branch instructions executing in a given clock cycle, and therefore multiple mispredictions being detected, the misprediction recovery mechanism becomes more complex. However, the importance of the misprediction recovery mechanism completing in relatively few clock cycles is not diminished. A misprediction recovery mechanism that requires relatively few clock cycles to complete and that can correctly resolve multiple branch mispredictions is desired.
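When several mispredictions are detected in the same clock cycle, the one that is earliest in program order governs the recovery, because every later branch was itself fetched along a wrong path. A minimal sketch of that selection follows; the function name select_recovery and the tuple layout are assumptions for illustration and do not describe any particular hardware.

```python
def select_recovery(mispredicted):
    """Pick the misprediction that determines the corrected fetch address.

    mispredicted: list of (program_order_position, corrected_address)
    tuples detected in one clock cycle (hypothetical representation).
    Returns the tuple for the oldest mispredicted branch, or None.
    """
    if not mispredicted:
        return None
    # The smallest program-order position is the oldest branch.
    return min(mispredicted, key=lambda m: m[0])


detected = [(12, 0x800), (7, 0x440), (9, 0x600)]
winner = select_recovery(detected)
```

In this example the branch at position 7 is selected, so fetching is redirected to 0x440 and the mispredictions at positions 9 and 12 are simply discarded along with the rest of the wrong path.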
Along with branch prediction, another feature intended to improve the performance of superscalar microprocessors is out-of-order execution. Out-of-order execution is the process of executing a particular instruction in a clock cycle earlier than the clock cycle in which instructions that precede it in program order are executed. An instruction which does not depend on the results generated by the instructions before it in program order need not delay its execution until the instructions before it execute. Because the instruction must be executed at some time, performance is advantageously increased by executing the instruction in a pipeline stage that would otherwise be idle in a clock cycle.
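The idea can be sketched with a toy scheduler that, in each cycle, issues any instruction whose source registers are not produced by an older instruction still waiting to issue. The sketch rests on several assumptions: every instruction completes in one cycle, only true (read-after-write) dependencies are tracked, and the function name schedule and the (name, destination, sources) tuple format are hypothetical.

```python
def schedule(program, width=2):
    """Return a list of cycles, each a list of instruction names issued.

    program: (name, dest_register, source_registers) tuples in program order.
    width: maximum number of instructions issued per cycle (assumed).
    """
    pending = list(program)
    cycles = []
    while pending:
        issued = []
        for idx, (name, dest, srcs) in enumerate(pending):
            if len(issued) == width:
                break
            # Registers still to be written by older, not-yet-issued instructions.
            older_dests = {d for _, d, _ in pending[:idx]}
            if not (set(srcs) & older_dests):
                issued.append((name, dest, srcs))
        for instr in issued:
            pending.remove(instr)
        cycles.append([name for name, _, _ in issued])
    return cycles


program = [("A", "r1", ()), ("B", "r2", ("r1",)), ("C", "r3", ())]
order = schedule(program)
```

Instruction C does not depend on A or B, so it issues alongside A in the first cycle, ahead of B, which must wait for the result of A.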
Unfortunately, certain instructions cannot be executed out-of-order. Programs assume that instructions are executed in-order, and therefore out-of-order execution must be employed in a manner which is transparent to programs. Exemplary instructions that cannot be executed out-of-order are store instructions and load instructions that miss the data cache. Store instructions modify memory, as opposed to other instructions which modify registers. If a store instruction were allowed to modify the data cache out-of-order and were then cancelled due to the misprediction of a previous branch or due to an interrupt, the data cache would contain corrupted data. Therefore, the store instruction must not be allowed to modify the data cache or main memory until previous instructions have executed, at which point it is certain that the store instruction will not be cancelled. Load instructions that miss the data cache cannot be executed out-of-order either, as will be discussed below.
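One way to picture the constraint on stores is a buffer that holds each store until all previous instructions have executed, and that drops stores cancelled by a misprediction. This is a sketch under stated assumptions, not a description of any particular mechanism: the class name StoreBuffer, its method names, and the use of integer program-order tags are hypothetical, and stores are assumed to enter the buffer in program order.

```python
class StoreBuffer:
    """Holds stores until they can no longer be cancelled (illustrative)."""

    def __init__(self):
        self.memory = {}    # stands in for the data cache or main memory
        self.pending = []   # (tag, address, value), oldest first

    def execute_store(self, tag, address, value):
        # The store is recorded, but memory is not yet modified.
        self.pending.append((tag, address, value))

    def retire_up_to(self, tag):
        # All instructions with tags <= tag have executed: commit their stores.
        while self.pending and self.pending[0][0] <= tag:
            _, address, value = self.pending.pop(0)
            self.memory[address] = value

    def cancel_after(self, tag):
        # A misprediction or interrupt at `tag` cancels all younger stores.
        self.pending = [s for s in self.pending if s[0] <= tag]


sb = StoreBuffer()
sb.execute_store(1, 0x10, 5)
sb.execute_store(3, 0x20, 9)
sb.cancel_after(2)      # branch with tag 2 was mispredicted
sb.retire_up_to(3)
```

After this sequence only the store with tag 1 has reached memory. The store with tag 3 was on the mispredicted path and never modified memory, so no corrupted data results.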
Data caches are implemented either on the same silicon substrate as a superscalar microprocessor, or are coupled nearby. The data cache is a high speed memory which is configured to store copies of data stored in a main system memory (when employed in a computer system). When a load or store instruction accesses the data cache, the access is found to be either a "hit" or a "miss". If an access is a hit, then the associated data is currently stored in the data cache. If the access is a miss, the associated data is not stored in the data cache and must be transferred from main memory. Load instructions are allowed to execute out-of-order when they hit in the data cache. However, when load instructions miss the data cache they are required to execute in order. Otherwise, a load miss may begin a transfer from main memory and then be cancelled. The external bus bandwidth used by the access would then be wasted. Furthermore, the data being transferred may cause a line to be removed from the cache. If that removed line were later needed, it would have to be transferred in from main memory and more external bus bandwidth would be wasted. Therefore, load instructions that are data cache misses should not execute out of order. A mechanism is needed to correctly order load instructions that are data cache misses and store instructions.
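The ordering rule for loads may be sketched as follows. This is a simplified model with assumed names: the function try_load, the is_oldest flag (meaning that all previous instructions have executed), and the use of dictionaries for the cache and main memory are illustrative only, and cache line replacement is not modeled.

```python
def try_load(cache, memory, address, is_oldest):
    """Attempt a load; return (value, status).

    status is "hit", "miss" (line transferred from main memory), or
    "wait" (miss while the load could still be cancelled).
    """
    if address in cache:
        # Hits may be serviced out of order.
        return cache[address], "hit"
    if not is_oldest:
        # A speculative miss must not start an external bus transfer.
        return None, "wait"
    # Non-speculative miss: transfer the data from main memory.
    cache[address] = memory[address]
    return cache[address], "miss"


cache = {0x10: 1}
memory = {0x10: 1, 0x20: 2}
first = try_load(cache, memory, 0x20, is_oldest=False)
second = try_load(cache, memory, 0x20, is_oldest=True)
```

The first attempt waits without using the external bus, and the second performs the transfer once the load can no longer be cancelled.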