1. Field of the Invention
This invention relates generally to an apparatus and a method for improving processor microarchitecture in superscalar microprocessors. In particular, the invention relates to an apparatus and a method for a modified reorder buffer and a distributed instruction queue that increase efficiency by reducing the hardware complexity, execution time, and the number of global wires in superscalar microprocessors that support multi-instruction issue, decoupled dataflow scheduling, out-of-order execution, register renaming, multi-level speculative execution, load bypassing, and precise interrupts.
2. Background of the Related Art
The main driving force in the research and development of microprocessor architectures is improving performance per unit cost. The true measure of performance is the time (in seconds) required to execute a program. The execution time of a program is basically determined by three factors (see Patterson and Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1990): the number of instructions executed in the program (dynamic Inst_Count), the average number of clock cycles per instruction (CPI), and the processing cycle time (Clock_Period), or

$$T_{\text{program}} = \text{Inst\_Count} \times \text{CPI} \times \text{Clock\_Period} \qquad (1)$$
To improve performance (reduce execution time), it is necessary to reduce one or more factors. The obvious one to reduce is Clock_Period, by means of semiconductor/VLSI technology improvements such as device scaling, faster circuit structures, better routing techniques, etc. A second approach to performance improvement is architecture design. CISC and VLIW architectures take the approach of reducing Inst_Count. RISC and superscalar architectures attempt to reduce the CPI. Superpipelined architectures increase the degree of pipelining to reduce the Clock_Period.
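As a worked illustration of equation (1), the following minimal sketch evaluates the execution time for three cases. The instruction count, CPI values, and clock periods are hypothetical numbers chosen only to show how each factor affects the product; they are not measurements from any processor.

```python
def execution_time(inst_count: int, cpi: float, clock_period_s: float) -> float:
    """Equation (1): T_program = Inst_Count * CPI * Clock_Period."""
    return inst_count * cpi * clock_period_s


# Hypothetical values, for illustration only.
baseline = execution_time(1_000_000, 2.0, 10e-9)      # reference machine
lower_cpi = execution_time(1_000_000, 0.8, 10e-9)     # superscalar approach: reduce CPI
faster_clock = execution_time(1_000_000, 2.0, 5e-9)   # superpipelined approach: reduce Clock_Period

for label, t in [("baseline", baseline),
                 ("lower CPI", lower_cpi),
                 ("faster clock", faster_clock)]:
    print(f"{label:>12}: {t * 1e3:.1f} ms")
```

With these assumed numbers, reducing the CPI from 2.0 to 0.8 shortens the execution time more than halving the clock period does, although in practice the factors are not independent of one another.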
The true measure of cost is dollars/unit to implement and manufacture a microprocessor design in silicon. This hardware cost is driven by many factors such as die size, die yield, wafer cost, die testing cost, packaging cost, etc. The architectural choices made in a microprocessor design affect all these factors.
It is desirable to focus on finding microarchitecture techniques/alternatives to improve the design of superscalar microprocessors. The term microprocessor refers to a processor or CPU that is implemented in one or a small number of semiconductor chips. The term superscalar refers to a microprocessor implementation that increases performance by concurrent execution of scalar instructions, the type of instructions typically found in general-purpose microprocessors. It should be understood that hereinafter, the term "processor" also means "microprocessor".
A superscalar architecture can be generalized as a processor architecture that fetches and decodes multiple scalar instructions from a sequential, single-flow instruction stream, and executes them concurrently on different functional units. In general, there are seven basic processing steps in superscalar architectures: fetch, decode, dispatch, issue, execute, writeback, and retire. FIG. 1 illustrates these basic steps.
First, multiple scalar instructions are fetched simultaneously from an instruction cache/memory or other storage unit. Current state-of-the-art superscalar microprocessors fetch two or four instructions simultaneously. Valid fetched instructions (the ones that are not after a branch-taken instruction) are decoded concurrently, and dispatched into a central instruction window (FIG. 1a) or distributed instruction queues or windows (FIG. 1b). Shelving of these instructions is necessary because some instructions cannot execute immediately, and must wait until their data dependencies and/or resource conflicts are resolved. After an instruction is ready it is issued to the appropriate functional unit. Multiple ready instructions are issued simultaneously, achieving parallel execution within the processor. Execution results are written back to a result buffer first. Because instructions can complete out-of-order and speculatively, results must be retired to register file(s) in the original, sequential program order. An instruction and its result can retire safely if it completes without an exception and there are no exceptions or unresolved conditional branches in the preceding instructions. Memory stores wait at a store buffer until they can commit safely.
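The retire step described above can be sketched as follows. This is a simplified model under stated assumptions, not the circuit of any particular processor: the names `Entry` and `retire` are hypothetical, the result buffer is modeled as a queue in program order, and unresolved conditional branches are omitted for brevity (only completion and exceptions are checked).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    dest: str                # architectural destination register
    value: int = 0           # result, valid once done is True
    done: bool = False       # result has been written back to the buffer
    exception: bool = False  # instruction raised an exception


def retire(buffer: deque, regfile: dict) -> int:
    """Retire completed entries from the head of the buffer, in program order.

    Stops at the first entry that is incomplete or has an exception, so a
    younger completed result never reaches the register file early.
    """
    retired = 0
    while buffer and buffer[0].done and not buffer[0].exception:
        entry = buffer.popleft()
        regfile[entry.dest] = entry.value
        retired += 1
    return retired


# The third instruction completed out of order, but it must wait for the second.
rb = deque([Entry("r1", 10, done=True), Entry("r2"), Entry("r3", 30, done=True)])
regs = {}
first = retire(rb, regs)   # only r1 retires
```

Once the second entry completes, a later call to `retire` releases both remaining results in sequence.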
Parallel execution in superscalar processors demands high memory bandwidth for instructions and data. Efficient instruction bandwidth can be achieved by aligning and merging the decode group. Branching causes wasted decoder slots on the left side (due to unaligned branch target addresses) and on the right side (due to a branch-taken instruction that is not at the end slot). Aligning shifts branch target instructions to the leftmost slot to utilize all decoder slots. Merging fills the slots to the right of a branch-taken instruction with the branch target instructions, combining different instruction runs into one dynamic instruction stream. Efficient data bandwidth can be achieved by load bypassing and load forwarding (M. Johnson, Superscalar Microprocessor Design, Prentice-Hall, 1991), a relaxed or weak-memory ordering model. Relaxed ordering allows an out-of-order sequence of reads and writes, to optimize the use of the data bus. Stores to memory cannot commit until they are safe (retire step). Forcing loads and stores to commence in order would delay the loads significantly and stall other instructions that wait on the load data. Load bypassing allows a load to bypass stores in front of it (out-of-order execution), provided there is no read-after-write hazard. Load forwarding allows a load to be satisfied directly from the store buffer when there is a read-after-write dependency. Executing loads early is safe because load data is not written directly to the register file.
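Load bypassing and load forwarding can be illustrated with the following minimal sketch. It assumes that the addresses of all pending stores are already known and that accesses are to whole, identically sized words; the function name `execute_load` and the list-based store buffer are hypothetical simplifications.

```python
def execute_load(addr, store_buffer, memory):
    """Execute a load ahead of the uncommitted stores that precede it.

    store_buffer: list of (address, value) pairs for pending stores,
                  oldest first, all earlier in program order than the load.
    Returns (value, how), where how is "forwarded" or "bypassed".
    """
    # Read-after-write dependency: take the data from the youngest
    # matching store in the store buffer (load forwarding).
    for st_addr, st_value in reversed(store_buffer):
        if st_addr == addr:
            return st_value, "forwarded"
    # No address match, hence no hazard: the load may pass the
    # pending stores and read memory directly (load bypassing).
    return memory[addr], "bypassed"


memory = {0x10: 1, 0x20: 2}
pending_stores = [(0x10, 7), (0x10, 9)]   # neither store has committed yet

print(execute_load(0x10, pending_stores, memory))  # (9, 'forwarded')
print(execute_load(0x20, pending_stores, memory))  # (2, 'bypassed')
```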
Classic superscalar architectures accomplish fine-grain parallel processing at the instruction level, which is limited to a single flow of control. They cannot execute independent regions of code concurrently (multiple flows of control). Externally, the instruction stream of a superscalar processor appears the same as in CISC or RISC uniprocessors: a sequential, single-flow instruction stream. Only internally are instructions distributed to multiple processing units. There are complexities and limitations involved in parallelizing a sequential, single-flow instruction stream. The following six superscalar features--multi-instruction issue, decoupled dataflow scheduling, out-of-order execution, register renaming, speculative execution, and precise interrupts--are key in achieving this goal. They help improve performance and ensure correctness in superscalar processors.
Multi-instruction issue is made possible by widening a conventional, serial processing pipeline in the "horizontal" direction to have multiple pipeline streams. In this manner multiple instructions can be issued simultaneously per clock cycle. Thus, superscalar microprocessors must have multiple execution/functional units with independent pipeline streams. Also, to be able to sustain multi-instruction issue at every cycle, superscalar microprocessors fetch and decode multiple instructions at a time.
Decoupled dataflow scheduling is supported by buffering all decoded instructions into an instruction window(s), before they are scheduled for execution. The instruction window(s) essentially "decouples" the decode and execute stage. There are two primary objectives. The first is to maintain the flow of instruction fetching and decoding by not forcing a schedule of the decoded instructions right away. This reduces unnecessary stalls. Instructions are allowed to take time to resolve data dependencies and/or resource conflicts. The second is to improve the look-ahead capability of the processor. With the instruction window, a processor is now able to look ahead beyond the stalled instructions to discover others that are ready to execute. The issue logic includes a dependency check to allow an instruction to "fire" or execute as soon as its operands are available and its resource conflicts are resolved. Unlike sequential Von Neumann machines, the control hardware does not have to sequence each instruction and decide explicitly when it can execute. This is the essence of dataflow scheduling.
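The look-ahead and "fire when ready" behavior described above can be sketched as a per-cycle selection over the instruction window. This is a simplified model: the names `Inst` and `select_ready`, the use of register names as readiness tokens, and the limit of one instruction per functional unit per cycle are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    name: str
    srcs: tuple   # source registers that must be available
    unit: str     # functional unit type required


def select_ready(window, ready_regs, free_units):
    """Return the instructions that can fire this cycle, oldest first.

    An instruction fires when all of its operands are available (no data
    dependency outstanding) and its functional unit is free (no resource
    conflict). Stalled instructions are skipped, not waited on.
    """
    issued = []
    free = set(free_units)
    for inst in window:                    # window is kept in program order
        operands_ready = all(s in ready_regs for s in inst.srcs)
        if operands_ready and inst.unit in free:
            issued.append(inst)
            free.remove(inst.unit)         # assumed: one issue per unit per cycle
    return issued


window = [
    Inst("load", ("r1",), "mem"),        # stalled: r1 is not yet available
    Inst("add", ("r2", "r3"), "alu"),    # ready
    Inst("sub", ("r4",), "alu"),         # ready, but the ALU is taken by add
    Inst("mul", ("r5",), "mul"),         # ready
]
fired = select_ready(window, {"r2", "r3", "r4", "r5"}, {"mem", "alu", "mul"})
print([i.name for i in fired])  # ['add', 'mul']
```

The stalled load at the front of the window does not prevent the younger add and mul from executing, which is the look-ahead benefit the window provides.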
Out-of-order execution helps reduce instruction stalls due to data dependencies, bypassing the stalled or incomplete instructions. There are three types of out-of-order execution, categorized by their aggressiveness: (a) in-order issue with out-of-order completion, (b) partial out-of-order issue with out-of-order completion, and (c) full out-of-order issue with out-of-order completion. The first type always issues instructions sequentially, in the original program order, but they can complete out-of-order due to different latencies or stages in some functional units' pipelines. The second type restricts instruction issue to be in order only within a functional unit, but can be out of order amongst multiple functional units. The third type allows full out-of-order issue within a functional unit as well as amongst multiple functional units.
Register renaming is necessary to eliminate the side effects of out-of-order execution, i.e., artificial dependencies on registers--those dependencies other than true data dependency (read-after-write hazard). There are two types of artificial dependencies, anti dependency (write-after-read hazard) and output dependency (write-after-write hazard) (M. Johnson, Superscalar Microprocessor Design, Prentice-Hall, 1991). They are caused by register-set limitations. The compiler's register allocation process minimizes the register usage by reusing registers as much as possible. This action blurs the distinction between register and value. Register renaming effectively reintroduces the distinction by renaming the registers in hardware, creating a new instance of a register for each new register assignment.
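The effect of creating a new register instance for each assignment can be illustrated with the following sketch. It models renaming with a simple mapping table and an unbounded supply of tags `p0`, `p1`, ...; both the function name `rename` and the unbounded tag supply are assumptions made for illustration, since real hardware has a finite number of instances.

```python
from itertools import count


def rename(program):
    """Rename destination registers so that only true dependencies remain.

    program: list of (dest, src1, src2, ...) architectural register names,
             in program order.
    """
    mapping = {}      # architectural register -> most recent instance
    fresh = count()   # assumed unbounded supply of new instance tags
    renamed = []
    for dest, *srcs in program:
        # Sources read the most recent instance (preserves read-after-write).
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        # Each assignment gets a new instance (removes WAR and WAW hazards).
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], *new_srcs))
    return renamed


program = [
    ("r1", "r2", "r3"),   # I1: r1 = r2 op r3
    ("r2", "r4", "r5"),   # I2: r2 = r4 op r5  (write-after-read on r2 vs. I1)
    ("r1", "r2", "r6"),   # I3: r1 = r2 op r6  (write-after-write on r1 vs. I1)
]
for inst in rename(program):
    print(inst)
# ('p0', 'r2', 'r3')
# ('p1', 'r4', 'r5')
# ('p2', 'p1', 'r6')
```

After renaming, I2 no longer writes a register that I1 reads, and I3 no longer shares a destination with I1; the only remaining dependency is the true one from I2 to I3 through `p1`.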
Speculative execution avoids stalls and reduces the penalty due to control dependencies. For every conditional branch, a superscalar processor predicts the likely branch direction, with help from software (static branch prediction) or hardware (dynamic branch prediction). Instructions from the predicted path are fetched and executed speculatively, without waiting for the outcome of the branch test. By scheduling instructions across multiple, unresolved conditional branches (multi-level speculative execution), more instruction parallelism is potentially extracted, improving the processor's performance. Due to the speculative nature, some conditional branches may be incorrectly predicted. A mechanism to recover and restart must be provided so that correct results can still be produced in the event of mispredicted branches. Recovery cancels the effect of instructions processed under false predictions, and restart reestablishes the correct instruction sequence.
Precise interrupts are supported to guarantee the correct processor state before servicing the interrupt. Out-of-order execution complicates the restarting of an interrupted program. At the time an exception is detected, some instructions beyond the exception instruction might have been executed, as a result of allowing out-of-order execution. The effects on registers and memory by any instructions beyond the precise-repair point must be nullified or repaired before going to the interrupt handler routine. The hardware support for precise interrupts should not be too costly if there is already hardware support for speculative execution.
There are two key microarchitecture elements in superscalar hardware that determine the success in achieving the above goal: result shelving and instruction shelving. Result shelving is the key to supporting register renaming, out-of-order execution, speculative execution, and precise interrupts. Instruction shelving is the key to supporting multi-instruction issue, decoupled dataflow scheduling, and out-of-order execution. Review of the literature suggests that the reorder buffer (RB) is the most complete result shelving technique (see, for example, U.S. Pat. Nos. 5,136,697 to Johnson and 5,345,569 to Tran for discussions of conventional reorder buffers), and the reservation station (RS) is the best instruction shelving technique to give maximum machine parallelism. However, these two techniques have implementation drawbacks. The RB requires associative lookup that must be prioritized during each operand read. This results in a relatively complex and slow circuit implementation. Also, the RB requires substantial shared-global buses for its operand and result buses, and it needs dummy branch entries to support speculative execution, which increases RB entry usage. The RS requires tremendous amounts of shared (heavily-loaded), global (chip-wide) wires to support its operand value copying and result value forwarding. With increasingly smaller transistor sizes, the dominant factor in determining silicon area and propagation delay is not the transistor but the metal wire, especially wires that run across or all over the chip.
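The prioritized associative lookup attributed to the RB above can be sketched in software as follows. This is an illustrative model of a conventional reorder buffer operand read, not the circuit of the cited patents; the function name `read_operand` and the list-of-pairs representation are hypothetical. In hardware, every entry is compared in parallel and a priority encoder selects the most recent match, which is the source of the complexity noted above.

```python
def read_operand(reg, reorder_buffer, regfile):
    """Read a source operand in the presence of a reorder buffer.

    reorder_buffer: list of (dest_reg, value_or_None), oldest first;
                    None means the result has not been written back yet.
    Returns ("value", v) when the operand is available, or ("tag", index)
    naming the RB entry that will eventually produce it.
    """
    # Associative lookup: every entry's destination is compared with reg.
    matches = [i for i, (dest, _) in enumerate(reorder_buffer) if dest == reg]
    if not matches:
        return ("value", regfile[reg])    # no pending writer: use register file
    # Prioritization: several entries may match; the most recent one wins.
    newest = matches[-1]
    value = reorder_buffer[newest][1]
    return ("value", value) if value is not None else ("tag", newest)


regfile = {"r1": 0, "r2": 1, "r3": 2}
rb = [("r1", 5), ("r2", None), ("r1", None)]

print(read_operand("r1", rb, regfile))  # ('tag', 2)
print(read_operand("r3", rb, regfile))  # ('value', 2)
```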
With the many promises that lie ahead, the research challenge in superscalar architecture design is to find an efficient utilization of the vast chip real estate, the high-speed transistors, and the available instruction parallelism. The hardware improvements that lead to enhanced performance must, however, be coupled with compiler/software scheduling improvements. There is a need for these improvements to be cost effective or, at best, to actually reduce the cost of a superscalar microprocessor while increasing efficiency. In accordance with the above, we should avoid the tendency to design an overly complex superscalar architecture that produces mediocre gains which could have been easily achieved by compiler optimizations or that are cost limiting.
The present invention is discussed at length in the doctoral dissertation entitled "Microarchitecture Techniques to Improve Design of Superscalar Microprocessors," Copyright © 1995, Georgia Institute of Technology, of one of the co-inventors, Joseph I. Chamdani, the subject matter of which is incorporated herein by reference. Hereinafter, the above dissertation will be referred to as Chamdani's dissertation.
This invention addresses architectural improvements to the design of superscalar processors that support the six key superscalar features. The primary objective of the invention was to find a better design alternative to the reservation station technique (considered the best known distributed instruction shelving technique to give maximum machine parallelism). The superscalar technique invented is the Distributed Instruction Queue (DIQ). The DIQ is a new distributed instruction shelving technique that offers a significantly more efficient (i.e., better performance/cost) implementation than the reservation station (RS) technique by eliminating operand value copying and result value forwarding.
The DIQ shelving technique offers a more efficient (i.e., better performance/cost) implementation of distributed instruction windows by eliminating the two major implementation drawbacks of the RS technique, operand value copying and result value forwarding. The DIQ can support in-order issue as well as out-of-order issue within its functional unit. The cost analysis suggests an improvement in almost every hardware component, with major reductions in the use of global wires, comparators, and multiplexers (see Chamdani's dissertation). The expensive shared-global wires are mostly replaced by private-local wires that are easier to route, have less propagation delay, and occupy much smaller silicon area. The DIQ's number of global wires remains constant as the number of DIQ entries and the data size increase. A performance analysis using cycle-by-cycle simulators confirms that the good characteristics of the RS technique in achieving maximum machine parallelism have been maintained in the DIQ technique (see Chamdani's dissertation). The out-of-order DIQ technique is on par with the RS technique in terms of cycle-count performance, but higher in terms of overall performance if the improved clock frequency is factored in. The in-order issue DIQ sacrifices slightly on cycle-count performance, which can easily be recovered through a faster and simpler circuit implementation. In the end, the actual speed or performance of a processor using the DIQ technique is higher due to reduced cycle time or more operations executed per cycle.
One object of the invention is to provide an improved superscalar processor.
Another object of the invention is to provide a distributed instruction queue that does not store register values.
A further object of the invention is to eliminate the need for operand value copying in a superscalar microprocessor.
Yet another object of the invention is to eliminate the need for result value forwarding in a superscalar processor.
One other object of the invention is to provide a processor having reduced global buses.
One advantage of the invention is that it can improve the speed of a superscalar processor.
Another advantage of the invention is that it can reduce the amount of global buses required in a superscalar processor.
A further advantage of the invention is that it can allow for issuing of instructions in any order.
Still another advantage of the invention is that it can support multi-level speculative execution.
One feature of the invention is that it includes local bus architecture between register units and functional units.
These and other objects, advantages, and features are provided by a distributed instruction queue, comprising: at least one entry cell having at least one entry field; at least one allocate port, each of the at least one allocate port connected to each of the at least one entry cell for allocation of a decoded instruction to the at least one entry cell; an issue port connected to a predetermined one of the at least one entry cell, wherein instructions are issued through the issue port under logic control in any order from one of the at least one entry cell and the distributed instruction queue stores no register value.
Implementations of the invention may include one or more of the following features: a tail pointer logic unit to determine the correct tail pointer position of each of the at least one allocate port; a head pointer logic unit to adjust a head pointer to point to a predetermined one of the at least one entry cell; an issue pointer logic unit to adjust an issue pointer to point to the one of the at least one entry cell for issuing of the instructions; the distributed instruction queue eliminates operand value copying and result value forwarding; the distributed instruction queue is operated independently of any other distributed instruction queue; the instructions are issued in-order; the instructions are issued out-of-order; and the instructions are issued in some form of limited out-of-order issue.
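For orientation only, the following software sketch models a queue with the elements named above: entry cells, allocation at a tail pointer, a head pointer, and an issue pointer that selects an entry for issue. It is not the claimed hardware. The class name, the entry layout (an opcode plus register tags), the `ready_tags` set, and the single allocate port are all assumptions made for illustration. The one property taken directly from the description is that entries hold no register values; here they hold only register identifiers.

```python
class DistributedInstructionQueue:
    """Illustrative circular queue whose entries store tags, never values."""

    def __init__(self, size):
        self.cells = [None] * size
        self.head = 0   # oldest cell still occupied
        self.tail = 0   # next cell to allocate
        self.span = 0   # cells between head and tail, including issued holes

    def allocate(self, opcode, src_tags, dest_tag):
        """Place a decoded instruction at the tail; False means queue full."""
        if self.span == len(self.cells):
            return False
        self.cells[self.tail] = (opcode, tuple(src_tags), dest_tag)
        self.tail = (self.tail + 1) % len(self.cells)
        self.span += 1
        return True

    def issue(self, ready_tags, in_order=True):
        """Issue one entry whose source tags are all ready, or return None."""
        n = len(self.cells)
        for offset in range(self.span):          # issue pointer scans from head
            idx = (self.head + offset) % n
            entry = self.cells[idx]
            if entry is not None and all(t in ready_tags for t in entry[1]):
                self.cells[idx] = None
                self._advance_head()
                return entry
            if in_order and entry is not None:
                break                            # oldest entry not ready: stall
        return None

    def _advance_head(self):
        """Move the head pointer past cells that have already issued."""
        while self.span > 0 and self.cells[self.head] is None:
            self.head = (self.head + 1) % len(self.cells)
            self.span -= 1


diq = DistributedInstructionQueue(4)
diq.allocate("add", ("p1", "p2"), "p3")
diq.allocate("sub", ("p4",), "p5")

print(diq.issue({"p4"}, in_order=True))    # None: add is not ready, queue stalls
print(diq.issue({"p4"}, in_order=False))   # ('sub', ('p4',), 'p5')
```

In this model the operand values themselves would be obtained only when an entry issues, so nothing in the queue needs to be updated when a result is produced; that design choice is an assumption of the sketch, consistent with the statement that the queue stores no register value.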
These above and other objects, advantages, and features of the invention will become more apparent from the following description thereof taken in conjunction with the accompanying drawings.