This invention relates generally to an apparatus and a method for improving processor microarchitecture in superscalar microprocessors. In particular, the invention relates to an apparatus and a method for a modified reorder buffer and a distributed instruction queue that increases the efficiency by reducing the hardware complexity, execution time, and the number of global wires in superscalar microprocessors that support multi-instruction issue, decoupled dataflow scheduling, out-of-order execution, register renaming, multi-level speculative execution, load bypassing, and precise interrupts.
2. Background of the Related Art
The main driving force in the research and development of microprocessor architectures is improving performance/unit cost. The true measure of performance is the time (seconds) required to execute a program. The execution time of a program is basically determined by three factors (see Patterson and Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1990): the number of instructions executed in the program (dynamic Inst_Count), the average number of clock cycles per instruction (CPI), and the processing cycle time (Clock_Period), or

$$T_{\text{program}} = \text{Inst\_Count} \times \text{CPI} \times \text{Clock\_Period}. \qquad (1)$$
To improve performance (reduce execution time), it is necessary to reduce one or more factors. The obvious one to reduce is Clock_Period, by means of semiconductor/VLSI technology improvements such as device scaling, faster circuit structures, better routing techniques, etc. A second approach to performance improvement is architecture design. CISC and VLIW architectures take the approach of reducing Inst_Count. RISC and superscalar architectures attempt to reduce the CPI. Superpipelined architectures increase the degree of pipelining to reduce the Clock_Period.
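As a worked illustration of Equation (1), the following minimal sketch compares two design points. All numbers (instruction count, CPI values, clock period) are hypothetical and chosen only to show how reducing one factor changes the execution time.

```python
def execution_time(inst_count: int, cpi: float, clock_period_s: float) -> float:
    """Return program execution time in seconds, per Equation (1)."""
    return inst_count * cpi * clock_period_s


# Hypothetical baseline: 1 million dynamic instructions, CPI of 2.0, 10 ns clock.
baseline = execution_time(1_000_000, 2.0, 10e-9)

# Hypothetical superscalar design: same program and clock, CPI reduced to 0.8.
superscalar = execution_time(1_000_000, 0.8, 10e-9)

# Ratio of the two execution times; only the CPI factor differs.
speedup = baseline / superscalar
print(f"baseline={baseline:.4f}s superscalar={superscalar:.4f}s speedup={speedup:.2f}x")
```

Because the three factors multiply, a reduction in any one of them translates directly into a proportional reduction in execution time, provided the other two are not made worse.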
The true measure of cost is dollars/unit to implement and manufacture a microprocessor design in silicon. This hardware cost is driven by many factors such as die size, die yield, wafer cost, die testing cost, packaging cost, etc. The architectural choices made in a microprocessor design affect all these factors.
It is desirable to focus on finding microarchitecture techniques/alternatives to improve the design of superscalar microprocessors. The term microprocessor refers to a processor or CPU that is implemented in one or a small number of semiconductor chips. The term superscalar refers to a microprocessor implementation that increases performance by concurrent execution of scalar instructions, the type of instructions typically found in general-purpose microprocessors. It should be understood that hereinafter, the term "processor" also means "microprocessor".
A superscalar architecture can be generalized as a processor architecture that fetches and decodes multiple scalar instructions from a sequential, single-flow instruction stream, and executes them concurrently on different functional units. In general, there are seven basic processing steps in superscalar architectures: fetch, decode, dispatch, issue, execute, writeback, and retire. FIG. 1 illustrates these basic steps.
First, multiple scalar instructions are fetched simultaneously from an instruction cache/memory or other storage unit. Current state-of-the-art superscalar microprocessors fetch two or four instructions simultaneously. Valid fetched instructions (the ones that are not after a branch-taken instruction) are decoded concurrently, and dispatched into a central instruction window (FIG. 1a) or distributed instruction queues or windows (FIG. 1b). Shelving of these instructions is necessary because some instructions cannot execute immediately, and must wait until their data dependencies and/or resource conflicts are resolved. After an instruction is ready it is issued to the appropriate functional unit. Multiple ready instructions are issued simultaneously, achieving parallel execution within the processor. Execution results are written back to a result buffer first. Because instructions can complete out-of-order and speculatively, results must be retired to register file(s) in the original, sequential program order. An instruction and its result can retire safely if it completes without an exception and there are no exceptions or unresolved conditional branches in the preceding instructions. Memory stores wait at a store buffer until they can commit safely.
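The in-order retire rule described above can be sketched as follows. This is a simplified software model, not the circuit of any particular processor; the `Entry` fields, the `retire` function, and the `max_per_cycle` limit are hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    done: bool = False               # result has been written back
    exception: bool = False          # instruction completed with an exception
    unresolved_branch: bool = False  # conditional branch with unknown outcome


def retire(buffer: deque, max_per_cycle: int = 4) -> list:
    """Retire completed entries from the head of the buffer, in program order."""
    retired = []
    while buffer and len(retired) < max_per_cycle:
        head = buffer[0]
        # Stop at the first entry that is incomplete, faulting, or an
        # unresolved branch; nothing younger may retire past it.
        if not head.done or head.exception or head.unresolved_branch:
            break
        retired.append(buffer.popleft().name)
    return retired


# i2 finished before i1 (out-of-order completion), but cannot retire first.
buffer = deque([Entry("i0", done=True), Entry("i1"), Entry("i2", done=True)])
first = retire(buffer)   # only i0 can retire
buffer[0].done = True    # i1 now completes
second = retire(buffer)  # i1 and i2 retire together, in order
```

Because entries leave only from the head, every instruction that retires is guaranteed to have no exception or unresolved branch among its predecessors.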
The parallel executions in superscalar processors demand high memory bandwidth for instructions and data. Efficient instruction bandwidth can be achieved by aligning and merging the decode group. Branching causes wasted decoder slots on the left side (due to unaligned branch target addresses) and on the right side (due to a branch-taken instruction that is not at the end slot). Aligning shifts branch target instructions to the leftmost slot to utilize all decoder slots. Merging fills the slots to the right of a branch-taken instruction with the branch target instructions, combining different instruction runs into one dynamic instruction stream. Efficient data bandwidth can be achieved by load bypassing and load forwarding (M. Johnson, Superscalar Microprocessor Design, Prentice-Hall, 1991), a relaxed or weak-memory ordering model. Relaxed ordering allows an out-of-order sequence of reads and writes, to optimize the use of the data bus. Stores to memory cannot commit until they are safe (retire step). Forcing loads and stores to commence in order would delay the loads significantly and stall other instructions that wait on the load data. Load bypassing allows a load to bypass stores in front of it (out-of-order execution), provided there is no read-after-write hazard. Load forwarding allows a load to be satisfied directly from the store buffer when there is a read-after-write dependency. Executing loads early is safe because load data is not written directly to the register file.
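Load bypassing and load forwarding can be illustrated with a small store-buffer model. This is a sketch under simplifying assumptions: all store addresses are already known, accesses are whole words at matching addresses, and the `StoreBuffer` class and its method names are hypothetical.

```python
class StoreBuffer:
    """Holds stores that have executed but not yet committed to memory."""

    def __init__(self):
        self.pending = []  # (address, value) pairs, oldest first

    def store(self, address, value):
        self.pending.append((address, value))

    def load(self, address, memory):
        """Return (value, source) for a load that executes ahead of pending stores."""
        # Search youngest-first so the load sees the most recent matching store.
        for addr, value in reversed(self.pending):
            if addr == address:
                return value, "store_buffer"  # load forwarding (RAW dependency)
        return memory.get(address, 0), "memory"  # load bypassing (no RAW hazard)

    def commit_oldest(self, memory):
        """Write the oldest pending store to memory once it is safe to do so."""
        if self.pending:
            addr, value = self.pending.pop(0)
            memory[addr] = value


memory = {0x10: 1, 0x20: 2}
sb = StoreBuffer()
sb.store(0x10, 99)                # store waits in the buffer
bypassed = sb.load(0x20, memory)  # different address: bypasses the store
forwarded = sb.load(0x10, memory) # same address: forwarded from the buffer
sb.commit_oldest(memory)          # store finally commits
```

In this model memory is never modified speculatively; only `commit_oldest` writes to it, which mirrors the requirement that stores wait until they are safe.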
Classic superscalar architectures accomplish fine-grain parallel processing at the instruction level, which is limited to a single flow of control. They cannot execute independent regions of code concurrently (multiple flows of control). An instruction stream external to superscalar processors appears the same as in CISC or RISC uniprocessors: a sequential, single-flow instruction stream. Only internally are instructions distributed to multiple processing units. There are complexities and limitations involved in parallelizing a sequential, single-flow instruction stream. The following six superscalar features--multi-instruction issue, decoupled dataflow scheduling, out-of-order execution, register renaming, speculative execution, and precise interrupts--are key in achieving this goal. They help improve performance and ensure correctness in superscalar processors.
Multi-instruction issue is made possible by widening a conventional, serial processing pipeline in the "horizontal" direction to have multiple pipeline streams. In this manner multiple instructions can be issued simultaneously per clock cycle. Thus, superscalar microprocessors must have multiple execution/functional units with independent pipeline streams. Also, to be able to sustain multi-instruction issue at every cycle, superscalar microprocessors fetch and decode multiple instructions at a time.
Decoupled dataflow scheduling is supported by buffering all decoded instructions into an instruction window(s), before they are scheduled for execution. The instruction window(s) essentially "decouples" the decode and execute stages. There are two primary objectives. The first is to maintain the flow of instruction fetching and decoding by not forcing a schedule of the decoded instructions right away. This reduces unnecessary stalls. Instructions are allowed to take time to resolve data dependencies and/or resource conflicts. The second is to improve the look-ahead capability of the processor. With the instruction window, a processor is now able to look ahead beyond the stalled instructions to discover others that are ready to execute. The issue logic includes a dependency check to allow an instruction to "fire" or execute as soon as its operands are available and its resource conflicts are resolved. Unlike sequential von Neumann machines, the control hardware does not have to sequence each instruction and decide explicitly when it can execute. This is the essence of dataflow scheduling.
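The "fire when ready" rule can be sketched as a selection over an instruction window. This is an illustrative model only; the `Instr` fields, the `select_ready` function, and the unit names are hypothetical, and oldest-first selection is one possible policy among several.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    unit: str       # type of functional unit required
    sources: tuple  # source register names
    dest: str       # destination register name


def select_ready(window, ready_regs, free_units):
    """Issue instructions whose operands are ready and whose unit is free."""
    free = dict(free_units)  # copy so the caller's counts are untouched
    issued = []
    for instr in window:     # window is ordered oldest first
        operands_ready = all(src in ready_regs for src in instr.sources)
        if operands_ready and free.get(instr.unit, 0) > 0:
            free[instr.unit] -= 1
            issued.append(instr)
    for instr in issued:
        window.remove(instr)  # stalled instructions stay in the window
    return issued


window = [
    Instr("i0", "mul", ("r1", "r2"), "r3"),  # stalled: r1 not ready
    Instr("i1", "alu", ("r4", "r5"), "r6"),  # ready
    Instr("i2", "alu", ("r4", "r6"), "r7"),  # stalled: waits on i1's result
    Instr("i3", "alu", ("r4", "r4"), "r8"),  # ready: found by looking ahead
]
issued = select_ready(window, {"r2", "r4", "r5"}, {"alu": 2, "mul": 1})
issued_names = [i.name for i in issued]
waiting_names = [i.name for i in window]
```

Here i3 issues ahead of the stalled i0 and i2, which is the look-ahead benefit that the instruction window provides.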
Out-of-order execution helps reduce instruction stalls due to data dependencies, bypassing the stalled or incomplete instructions. There are three types of out-of-order execution, categorized by their aggressiveness: (a) in-order issue with out-of-order completion, (b) partial out-of-order issue with out-of-order completion, and (c) full out-of-order issue with out-of-order completion. The first type always issues instructions sequentially, in the original program order, but they can complete out-of-order due to different latencies or stages in some functional units' pipelines. The second type restricts instruction issue to be in order only within a functional unit, but can be out of order amongst multiple functional units. The third type allows full out-of-order issue within a functional unit as well as amongst multiple functional units.
Register renaming is necessary to eliminate the side effects of out-of-order execution, i.e., artificial dependencies on registers--those dependencies other than true data dependency (read-after-write hazard). There are two types of artificial dependencies, anti dependency (write-after-read hazard) and output dependency (write-after-write hazard) (M. Johnson, Superscalar Microprocessor Design, Prentice-Hall, 1991). They are caused by register-set limitations. The compiler's register allocation process minimizes the register usage by reusing registers as much as possible. This action blurs the distinction between register and value. Register renaming effectively reintroduces the distinction by renaming the registers in hardware, creating a new instance of a register for each new register assignment.
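A minimal sketch of renaming follows, assuming an unlimited supply of rename tags (real hardware has a finite pool and must reclaim them). The `rename` function and the `p0`, `p1`, ... tag names are hypothetical.

```python
import itertools


def rename(program):
    """Give every register assignment a fresh tag; sources read the latest tag.

    program is a list of (dest, sources) tuples in program order.
    """
    mapping = {}                # architectural register -> current tag
    counter = itertools.count()
    renamed = []
    for dest, sources in program:
        # Sources are looked up before the destination is remapped, so an
        # instruction such as r1 = r1 + r2 reads the previous instance of r1.
        new_sources = tuple(mapping.get(src, src) for src in sources)
        tag = f"p{next(counter)}"
        mapping[dest] = tag
        renamed.append((tag, new_sources))
    return renamed


program = [
    ("r1", ("r2", "r3")),  # i0: writes r1
    ("r4", ("r1", "r5")),  # i1: reads r1 (true dependency on i0)
    ("r1", ("r6", "r7")),  # i2: rewrites r1 (WAW with i0, WAR with i1)
    ("r8", ("r1", "r4")),  # i3: must read i2's r1, not i0's
]
renamed = rename(program)
```

After renaming, i2 writes a tag distinct from the one i0 wrote and i1 reads, so the output and anti dependencies on r1 disappear, while the true dependencies (i1 on i0, i3 on i1 and i2) are preserved through the tags.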
Speculative execution avoids stalls and reduces the penalty due to control dependencies. For every conditional branch, a superscalar processor predicts the likely branch direction, with help from software (static branch prediction) or hardware (dynamic branch prediction). Instructions from the predicted path are fetched and executed speculatively, without waiting for the outcome of the branch test. By scheduling instructions across multiple, unresolved conditional branches (multi-level speculative execution), more instruction parallelism is potentially extracted, improving the processor's performance. Due to the speculative nature, some conditional branches may be incorrectly predicted. A mechanism to recover and restart must be provided so that correct results can still be produced in the event of mispredicted branches. Recovery cancels the effect of instructions processed under false predictions, and restart reestablishes the correct instruction sequence.
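As one concrete instance of dynamic branch prediction, the sketch below models a two-bit saturating counter. This scheme is a commonly used predictor brought in purely for illustration; the text above does not prescribe a specific predictor, and the class name and initial state are assumptions.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.state = 1  # assumed starting point: weakly not taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


# Hypothetical outcome history of a loop branch: taken four times, then exit, twice.
outcomes = [True, True, True, True, False] * 2
predictor = TwoBitPredictor()
mispredictions = 0
for taken in outcomes:
    if predictor.predict() != taken:
        mispredictions += 1  # in hardware, this is where recovery and restart occur
    predictor.update(taken)
```

Each misprediction in this model corresponds to a point where the processor would have to cancel the speculatively processed instructions and restart fetching from the correct path.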
Precise interrupts are supported to guarantee the correct processor state before servicing the interrupt. Out-of-order execution complicates the restarting of an interrupted program. At the time an exception is detected, some instructions beyond the exception instruction might have been executed, as a result of allowing out-of-order execution. The effects on registers and memory by any instructions beyond the precise-repair point must be nullified or repaired before going to the interrupt handler routine. The hardware support for precise interrupts should not be too costly if there is already hardware support for speculative execution.
There are two key microarchitecture elements in superscalar hardware that determine the success in achieving the above goal: result shelving and instruction shelving. Result shelving is the key to supporting register renaming, out-of-order execution, speculative execution, and precise interrupts. Instruction shelving is the key to supporting multi-instruction issue, decoupled dataflow scheduling, and out-of-order execution. Review of the literature suggests that the reorder buffer (RB) is the most complete result shelving technique (see, for example, U.S. Pat. No. 5,136,697 to Johnson and U.S. Pat. No. 5,345,569 to Tran for discussions of conventional reorder buffers), and the reservation station (RS) is the best instruction shelving technique to give maximum machine parallelism. However, these two techniques have implementation drawbacks. The RB requires an associative lookup that must be prioritized during each operand read. This results in a relatively complex and slow circuit implementation. Also, the RB requires substantial shared, global buses for its operands and results, and it needs dummy branch entries to support speculative execution, which increases the RB entry usage. The RS requires a large number of shared (heavily loaded), global (chip-wide) wires to support its operand value copying and result value forwarding. With increasingly smaller transistor sizes, the dominant factor in determining silicon area and propagation delay is not the transistor but the metal wire, especially wires that run across or all over the chip.
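The prioritized associative lookup that a conventional RB performs on each operand read can be modeled in software as follows. This is a sketch of the general idea, not the circuit of the cited patents; the dictionary-based entry layout and the `lookup_operand` name are hypothetical. In hardware every entry is compared in parallel and a priority encoder picks the newest match, whereas this model scans sequentially.

```python
def lookup_operand(reorder_buffer, register_file, reg):
    """Find the newest instance of reg.

    Returns ("value", v) if the value is available, or ("tag", index) if the
    newest writer has not completed and the reader must wait on that entry.
    """
    # The buffer is ordered oldest first; scanning from the end gives the
    # most recent writer priority over older writers of the same register.
    for index in range(len(reorder_buffer) - 1, -1, -1):
        entry = reorder_buffer[index]
        if entry["dest"] == reg:
            if entry["done"]:
                return ("value", entry["value"])
            return ("tag", index)
    # No pending writer: the register file holds the current value.
    return ("value", register_file[reg])


register_file = {"r1": 1, "r2": 2, "r3": 3}
reorder_buffer = [
    {"dest": "r1", "done": True, "value": 10},     # older writer of r1
    {"dest": "r2", "done": False, "value": None},
    {"dest": "r1", "done": False, "value": None},  # newest writer of r1
]
r1_result = lookup_operand(reorder_buffer, register_file, "r1")
r2_result = lookup_operand(reorder_buffer, register_file, "r2")
r3_result = lookup_operand(reorder_buffer, register_file, "r3")
```

The r1 read shows why prioritization is needed: two entries match, and the completed older value must be ignored in favor of the tag of the newer, still-pending writer.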
With the many promises that lie ahead, the research challenges in superscalar architecture design are to find efficient ways to utilize the vast chip real estate, the high-speed transistors, and the available instruction parallelism. The hardware improvements that lead to enhanced performance must, however, be coupled with compiler/software scheduling improvements. There is a need for these improvements to be cost effective or, at best, to actually reduce the cost of a superscalar microprocessor while increasing efficiency. Accordingly, we should avoid the tendency to design an overly complex superscalar architecture that produces mediocre gains that could have been easily achieved by compiler optimizations, or that is cost limiting.