In an effort to speed processor execution, prefetchers were incorporated in the Intel x86 processor family (e.g., 8086, 80186, 80286, 80386, 80486). The prefetcher works on the assumption that the execution unit will want the next instruction from the next sequential memory location. While the execution unit is executing an instruction, the prefetcher takes advantage of idle bus time to fetch the next sequential instruction from memory and place this instruction in the processor's prefetch queue. This technique works well for non-branch instructions. However, whenever a jump or branch instruction is encountered, the entire prefetch queue must be flushed and reloaded with the correct instructions, thereby slowing the processor.
The Pentium and Pentium Pro processors from Intel Corp. added branch prediction logic to permit the prefetcher to make more intelligent decisions regarding what information to prefetch from memory. The instruction pipeline 10 of the Pentium Pro processor is illustrated in FIG. 1 and includes eleven stages: the Instruction Fetch Units 12 (IFU1, IFU2, IFU3), the Decode stages 14 (DEC1, DEC2), the Register Alias Table and Allocator stage 16 (RAT), the ReOrder Buffer 18 (ROB) (also known as the instruction pool), the Dispatch stage 20 (DIS), the Execution stage 22 (EX), and the Retirement stages 24 (RET1, RET2).
The first seven stages of the instruction pipeline 10 (fetch 12, decode 14, RAT 16 and ROB 18 stages) are known as the In-Order Front End section 30 of the processor because the instructions are kept in strict program order. The Dispatch stage 20 and the Execution stage 22 are known as the Middle Out-of-Order section 32 of the processor because micro-ops can be executed in any order. The Retirement stages 24 are also known as the In-Order Rear End section 34 of the processor because micro-ops are retired in program order.
FIG. 2 illustrates the operation of the instruction pipeline of the Pentium Pro processor. The instructions are fetched by the Instruction Fetch Units 12 (FIG. 1) and placed into a prefetch streaming buffer 40. The instructions are decoded into micro-ops by the Decode stages 14 and stored in the instruction decode queue (ID Queue) 42. The micro-ops are then moved into the ROB (or instruction pool) 18 in strict program order to await execution.
The ROB 18 is a circular buffer with 40 entries and includes a start-of-buffer pointer and an end-of-buffer pointer. The start-of-buffer pointer points to the oldest (unretired) micro-op in the ROB 18. The end-of-buffer pointer points to where the next micro-op will be stored in the ROB 18. Initially the ROB 18 is empty, and both the start-of-buffer pointer and the end-of-buffer pointer point to the first ROB entry (entry 0). As instructions are decoded into micro-ops (up to three per clock), the micro-ops are placed in the ROB 18 in strict program order, starting at entry 0, and the end-of-buffer pointer is incremented once for each micro-op.
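The pointer behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the processor's actual logic: the class and field names (`ReorderBuffer`, `MicroOp`, `count`) are hypothetical, and the occupancy counter is an assumption added here so that a full buffer can be told apart from an empty one.

```python
from dataclasses import dataclass
from typing import List, Optional

ROB_SIZE = 40  # entry count stated in the text


@dataclass
class MicroOp:
    name: str
    executed: bool = False          # set when execution completes
    result: Optional[int] = None    # held in the ROB until retirement


class ReorderBuffer:
    """Hypothetical circular-buffer model of the ROB."""

    def __init__(self, size: int = ROB_SIZE) -> None:
        self.entries: List[Optional[MicroOp]] = [None] * size
        self.start = 0   # start-of-buffer pointer: oldest unretired micro-op
        self.end = 0     # end-of-buffer pointer: where the next micro-op goes
        self.count = 0   # assumed occupancy counter (not in the text)

    def is_full(self) -> bool:
        return self.count == len(self.entries)

    def allocate(self, uop: MicroOp) -> int:
        """Store one micro-op in program order and return its entry index."""
        if self.is_full():
            raise RuntimeError("ROB full: the front end must wait")
        index = self.end
        self.entries[index] = uop
        self.end = (self.end + 1) % len(self.entries)  # wrap around
        self.count += 1
        return index


rob = ReorderBuffer()
indices = [rob.allocate(MicroOp(f"uop{i}")) for i in range(3)]
```

After three allocations the micro-ops occupy entries 0 through 2, the start-of-buffer pointer is still at entry 0, and the end-of-buffer pointer has advanced to entry 3.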
If any instructions are branches, the fetched memory addresses are provided to a Branch Target Buffer (BTB) for branch prediction. Whenever a branch instruction enters the instruction pipeline, the prediction logic predicts whether the branch will be taken (by examining the execution history of the instruction). The branch prediction is used to load instructions into the In-Order Front End section 30 corresponding to the predicted path.
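The text says only that the prediction is made by examining the execution history of the branch; it does not say how. As a rough illustration, the sketch below uses a two-bit saturating counter per branch address, a common textbook scheme that is assumed here and is not claimed to be the mechanism of the Pentium Pro. The class name, the table layout and the initial counter value are all hypothetical.

```python
class BranchTargetBuffer:
    """Hypothetical BTB: branch address -> [2-bit counter, target address]."""

    def __init__(self):
        self.table = {}

    def predict(self, address):
        """Return (taken, target); unseen branches are predicted not taken."""
        entry = self.table.get(address)
        if entry is None:
            return False, None
        counter, target = entry
        return counter >= 2, target  # counter values 2 and 3 mean "taken"

    def update(self, address, taken, target):
        """Record the actual outcome once the branch has executed."""
        # Assumed starting point for a new branch: 1 ("weakly not taken").
        counter, _ = self.table.get(address, [1, target])
        if taken:
            counter = min(counter + 1, 3)  # saturate at 3
        else:
            counter = max(counter - 1, 0)  # saturate at 0
        self.table[address] = [counter, target]


btb = BranchTargetBuffer()
first_guess = btb.predict(0x100)             # branch never seen before
btb.update(0x100, taken=True, target=0x200)  # it was actually taken
second_guess = btb.predict(0x100)            # history now says "taken"
```

With a counter of this kind, a single outcome that goes against an established pattern does not immediately reverse the prediction.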
Once the micro-ops are placed in the ROB (or instruction pool) 18 in strict program order, they are executed one or more at a time, as data and execution units for each micro-op become available. In the Dispatch stage 20 (FIG. 1), each micro-op is copied from the ROB 18 to a Reservation Station if all the data required by the micro-op is available. The micro-op is then dispatched by Dispatch stage 20 to an execution unit when one becomes available. The micro-ops can be executed out of order. Once executed, the results of the micro-ops' execution are stored in the ROB 18 at the ROB entry occupied by the micro-op. As each micro-op in the ROB completes execution, it is marked as ready for retirement.
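The following sketch illustrates how micro-ops can leave the ROB for execution out of program order once their data is available. It is a simplified model under stated assumptions: the dictionary layout, the `dispatch_ready` function and the oldest-first selection among ready micro-ops are hypothetical and are not taken from the text.

```python
def dispatch_ready(rob, ready_values, free_units):
    """Select micro-ops whose source operands are all available.

    rob          -- list of dicts in program order (oldest first), each with
                    'name', 'sources' and 'dispatched' keys (assumed layout)
    ready_values -- set of operand names whose data is available
    free_units   -- number of execution units free this cycle
    Returns the names dispatched this cycle.
    """
    dispatched = []
    for uop in rob:  # scan oldest first (an assumption of this sketch)
        if len(dispatched) == free_units:
            break
        if not uop["dispatched"] and set(uop["sources"]) <= ready_values:
            uop["dispatched"] = True
            dispatched.append(uop["name"])
    return dispatched


rob = [
    {"name": "load", "sources": ["addr"], "dispatched": False},
    {"name": "add", "sources": ["load_result", "r1"], "dispatched": False},
    {"name": "mul", "sources": ["r2", "r3"], "dispatched": False},
]
ready = {"addr", "r1", "r2", "r3"}
first_cycle = dispatch_ready(rob, ready, free_units=2)

ready.add("load_result")  # the load has finished executing
second_cycle = dispatch_ready(rob, ready, free_units=2)
```

Here `mul` is dispatched in the first cycle ahead of the older `add`, which must wait for the result of `load`.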
The Retirement stages 24 constantly check the status of the three oldest micro-ops in the ROB 18 to determine when all three of them have been executed and marked as ready for retirement. The micro-ops are then retired (by RET2) in original program order. As each micro-op is retired, its execution result is committed to architectural state by copying the result from the ROB entry into the processor's real register set. The respective ROB entry is then deleted or flushed and the start-of-buffer pointer is incremented to point to the oldest unretired micro-op in the ROB 18.
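In-order retirement can be sketched as follows. The function name, the dictionary layout and the register names are hypothetical. The sketch also makes one simplifying assumption: it retires the oldest micro-ops one at a time, up to three per call, and stops at the first one that has not executed, whereas the text describes the three oldest being checked together.

```python
ROB_SIZE = 40
RETIRE_WIDTH = 3  # the text says the three oldest micro-ops are checked


def retire(rob, start, count, registers):
    """Retire up to RETIRE_WIDTH executed micro-ops in program order.

    rob       -- list of ROB_SIZE entries; each is None or a dict with
                 'dest', 'result' and 'executed' keys (assumed layout)
    start     -- start-of-buffer pointer (oldest unretired micro-op)
    count     -- number of occupied entries
    registers -- dict standing in for the processor's real register set
    Returns the updated (start, count).
    """
    for _ in range(RETIRE_WIDTH):
        if count == 0:
            break
        uop = rob[start]
        if not uop["executed"]:
            break  # an older micro-op is unfinished, so younger ones wait
        registers[uop["dest"]] = uop["result"]  # commit to architectural state
        rob[start] = None                       # free the ROB entry
        start = (start + 1) % len(rob)          # advance start-of-buffer
        count -= 1
    return start, count


rob = [None] * ROB_SIZE
rob[0] = {"dest": "eax", "result": 7, "executed": True}
rob[1] = {"dest": "ebx", "result": None, "executed": False}
rob[2] = {"dest": "ecx", "result": 9, "executed": True}  # ran out of order
registers = {}
start, count = retire(rob, 0, 3, registers)
```

In this example only entry 0 retires. Entry 2 has already executed, but it cannot retire ahead of the unfinished entry 1, which is what keeps the architectural state in program order.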
FIG. 3 illustrates the operation of the instruction pipeline 10 of FIG. 1. Instructions are fetched, decoded into micro-ops, and stored in the ROB 18 to await execution. Steps A-C illustrate some examples of the type of operations performed in the instruction pipeline 10. At step A, a micro-op is stored at sequence number 4 in the ROB 18 (ROB entry 4) and the End-of-Buffer pointer 48 is incremented from ROB entry 4 to ROB entry 5 (where the next decoded micro-op will be stored in the ROB 18).
After the micro-op of sequence number 1 (ROB entry 1) has completed execution, the execution result is stored in ROB entry 1 and ROB entry 1 is marked as executed (a 1 is marked in the Executed bit for ROB entry 1). At step B, the micro-op of ROB entry 1 is retired by copying the micro-op's execution result from ROB entry 1 into the processor's real register set. The micro-op and other information in ROB entry 1 are then deleted and the start-of-buffer pointer 46 is incremented to point to ROB entry 2, which holds the oldest (unretired) micro-op in ROB 18. ROB entry 1 is now available to receive a new micro-op.
At step C, the branch micro-op at sequence number 2 is executed and validated. In this example, validation determines that the branch of sequence number 2 was incorrectly predicted. The instructions that were prefetched and stored in the ROB after the branch at ROB entry 2 (the micro-ops for ROB entries 3 and 4) are therefore incorrect; they correspond to the mispredicted path 75. In addition, the instructions in the pipeline stages earlier than the ROB 18 (e.g., the Instruction Fetch Units 12, the prefetch streaming buffer 40, the Decode stages 14, the ID Queue 42, the RAT 16) also correspond to the incorrect path and must be flushed. After the micro-ops in the pipeline stages earlier than the ROB are flushed, the instructions corresponding to the correct path are fetched and decoded. However, the ROB 18 cannot be flushed, and the new micro-ops corresponding to the correct path cannot be loaded from the Front End section 30 into the ROB 18, until all instructions prior to and including the mispredicted branch (at ROB entry 2) have been executed and retired. When the mispredicted branch has been retired, the ROB 18 is flushed or cleared and the micro-ops corresponding to the correct path can be loaded into the ROB 18.
In the example of FIG. 3, there are no unexecuted instructions before the mispredicted branch. Generally, however, because the branch operation may have been executed out of order (before one or more older instructions), the Front End section 30 can stall (wait) at the out-of-order boundary and cannot load the new (correct path) micro-ops from the Front End section 30 (e.g., Decode stages 14) into the ROB 18 until the mispredicted branch instruction has been retired and the ROB 18 flushed. (The out-of-order boundary refers to where the in-order Front End section 30 meets the out-of-order Middle section 32.) In other words, the Pentium Pro processor does not mix correct and incorrect micro-ops in the ROB 18.
To determine when the correct path can be loaded into ROB 18, the Retirement stages 24 check the status of the three oldest micro-ops in the ROB 18. When the Retirement stages 24 find that the oldest executed micro-op in the ROB 18 is the mispredicted branch, this indicates that all previous (older) micro-ops have been executed and retired. The mispredicted branch can then be committed to architectural state and retired. The ROB 18 is then completely flushed or cleared and the stall is released. The micro-ops of the correct path are then loaded from the Decode stages 14 into the ROB 18 beginning at the first ROB entry.
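The ordering of events in this recovery sequence can be summarized with a deliberately simple sketch. It models only the order in which micro-ops are retired, flushed and reloaded, not the timing; the function name and the micro-op labels are hypothetical.

```python
def recover_from_mispredict(rob, branch_pos, correct_path):
    """Model the order of events after a mispredicted branch.

    rob          -- micro-op names in program order, oldest first
    branch_pos   -- position of the mispredicted branch within rob
    correct_path -- decoded correct-path micro-op names waiting in the
                    front end (stalled at the out-of-order boundary)
    Returns (retired, flushed, new_rob).
    """
    # 1. Everything up to and including the branch retires in program order.
    retired = rob[:branch_pos + 1]
    # 2. Only then is the rest of the ROB (the mispredicted path) flushed.
    flushed = rob[branch_pos + 1:]
    # 3. The stall is released; the correct path loads from the first entry.
    new_rob = list(correct_path)
    return retired, flushed, new_rob


# FIG. 3 at step C: entry 1 has already retired and entry 2 is the branch.
retired, flushed, new_rob = recover_from_mispredict(
    ["branch@2", "uop@3", "uop@4"], 0, ["correct0", "correct1"])
```

In this model the correct-path micro-ops never share the ROB with the mispredicted ones: `new_rob` is built only after `flushed` has been discarded.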
In the Pentium Pro processor, the time required to flush and fully reload the Front End section 30 with the correct path was typically greater than the time required for the mispredicted branch to be retired. Therefore, in the Pentium Pro processor, the Front End section 30 rarely stalled or waited for the mispredicted branch to be retired before reloading the ROB 18 with micro-ops of the correct path.
However, improved techniques have increased the speed of the Instruction Fetch Units 12 and Decode stages 14, thereby greatly increasing the Front End section 30 stall time after a mispredicted branch. As a result, a need exists for a more efficient technique to recover from a mispredicted branch in order to reduce the Front End stall time.
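The relationship between front-end speed and stall time can be expressed as simple arithmetic: the Front End waits only for the part of the branch's retirement time that is not already hidden by its own flush-and-reload time. The sketch below assumes both times are measured in clock cycles from the moment the misprediction is detected. The function name and the cycle counts are invented purely for illustration and are not measurements of any processor.

```python
def front_end_stall(retire_cycles, refill_cycles):
    """Cycles the front end waits at the out-of-order boundary.

    retire_cycles -- cycles until the mispredicted branch retires
    refill_cycles -- cycles to flush and reload the front end
    """
    return max(0, retire_cycles - refill_cycles)


# Illustrative numbers only (assumptions, not taken from the text).
slower_front_end = front_end_stall(retire_cycles=10, refill_cycles=12)
faster_front_end = front_end_stall(retire_cycles=10, refill_cycles=4)
```

With the slower front end the reload takes longer than retirement, so the stall is 0 cycles. With the faster front end the same retirement time leaves 6 cycles of stall, which is the effect described above.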