1. Field of the Invention
The present invention relates to the field of computer systems, in particular, computer systems having pipelined processors. More specifically, the present invention relates to grouping multiple instructions, issuing them at the same time, and executing them in a pipelined processor.
2. Background
The execution time required for a given computer program is the product of three factors: the dynamic assembly language instruction count of the program, the number of cycles per instruction, and the basic clock cycle time of the processor (the reciprocal of its clock rate). The first factor is determined by compiler technology in reducing the total dynamic instruction count for the program of interest. The last factor is determined by the speed limits of integrated circuit fabrication technology in creating low capacitance interconnect between fast transistors. The second factor is determined by the architecture of the processor, in particular, the architecture for instruction issuance and execution, which is the focus of the present invention.
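Written out, with the clock factor expressed as a cycle time (the reciprocal of the clock rate), the relationship is as follows. The symbol names are chosen here for illustration and do not appear in the original text.

```latex
T_{\mathrm{exec}} = N_{\mathrm{instr}} \times \mathrm{CPI} \times t_{\mathrm{cycle}},
\qquad t_{\mathrm{cycle}} = \frac{1}{f_{\mathrm{clock}}}
```

As a purely hypothetical example, a program executing 1,000,000 dynamic instructions at 1.5 cycles per instruction on a 50 MHz processor (a cycle time of 20 ns) would take 1,000,000 x 1.5 x 20 ns = 30 ms. Reducing the cycles per instruction to 1.0 with the other two factors unchanged would reduce this to 20 ms.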
Today, to improve execution performance, many computer systems offer pipelined processors. Pipelining is a processor implementation technique whereby the execution of multiple instructions is overlapped. Pipelining increases the processor instruction execution throughput, that is, the rate of instructions exiting the pipeline, notwithstanding the slight increase in the execution time of an individual instruction due to the added pipeline control.
A pipelined processor is analogous to an assembly line having multiple stages, with each pipe stage completing a part of an instruction being executed. Typically, the pipeline can be broken down into six stages: instruction fetching, instruction decoding, data memory address generation, processor resident operand fetching, instruction execution, and results writing. Multiple instructions are moved through the pipe stages in an overlapping manner.
Traditionally, all pipe stages must be ready and proceed at the same time. As a result, the machine cycle (the time required to move an instruction one step down the pipeline), and therefore the throughput of a pipelined processor, were determined and limited by the slowest pipe stage. Thus, besides the obvious approach of using faster pipe stages, many modern pipelined processors are designed to allow the function units to proceed independently at their own pace.
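The effect of the slowest pipe stage on a lock-step pipeline can be illustrated with a small timing model. The following is only a sketch: the function names, the six stage latencies, and the per-stage control overhead are hypothetical values chosen for illustration, not figures from any particular processor.

```python
def unpipelined_time(stage_latencies, n_instructions):
    """Total time when each instruction finishes before the next one starts."""
    return n_instructions * sum(stage_latencies)


def pipelined_time(stage_latencies, n_instructions, control_overhead=0):
    """Total time for a lock-step pipeline.

    The machine cycle is set by the slowest stage plus the added pipeline
    control overhead. The first instruction needs one cycle per stage;
    each later instruction exits one cycle after its predecessor.
    """
    cycle = max(stage_latencies) + control_overhead
    n_stages = len(stage_latencies)
    return (n_stages + n_instructions - 1) * cycle


# Hypothetical latencies (in ns) for the six stages listed above.
stages = [10, 8, 12, 9, 15, 10]

print(unpipelined_time(stages, 100))                    # 6400
print(pipelined_time(stages, 100, control_overhead=1))  # 1680
# A single instruction is slightly slower when pipelined (96 versus 64):
print(pipelined_time(stages, 1, control_overhead=1))    # 96
```

In this model the 15 ns stage alone fixes the 16 ns machine cycle, so speeding up any other stage does not improve throughput.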
However, allowing the function units to proceed independently at their own pace introduces various pipeline hazards. When a hazard is encountered, the executing instruction and subsequent instructions will have to be stalled. There are three classes of pipeline hazards:
1. Structural hazards due to resource conflicts when the processor is not fully pipelined to support all possible combinations of instructions in simultaneous overlapped execution, e.g. two simultaneous register writes on a pipelined processor having only one register file write port;
2. Data hazards due to an instruction's dependency on the result of an earlier instruction which is not yet available, e.g. a subsequent ADD instruction depending on the result of an earlier SUBTRACT instruction which is not yet available; and
3. Control hazards due to pipelining of branches and other instructions that change the program counter.
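The three hazard classes can be illustrated with a simplified check on a pair of instructions. This is a minimal sketch under stated assumptions: the `Instr` record, the `BRANCH_OPS` set, and the `same_cycle_writeback` flag are hypothetical names introduced here, and a real processor detects these conditions in hardware across all instructions in flight.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")

# Hypothetical set of opcodes that change the program counter.
BRANCH_OPS = {"BRANCH", "JUMP", "CALL"}


def hazards(earlier, later, write_ports=1, same_cycle_writeback=False):
    """Return the hazard classes raised by `later` following `earlier`."""
    found = []
    # Structural: both instructions write a register in the same cycle,
    # but the register file has fewer than two write ports.
    if (same_cycle_writeback and earlier.dest is not None
            and later.dest is not None and write_ports < 2):
        found.append("structural")
    # Data: the later instruction reads a result not yet produced.
    if earlier.dest is not None and earlier.dest in later.srcs:
        found.append("data")
    # Control: the earlier instruction changes the program counter.
    if earlier.op in BRANCH_OPS:
        found.append("control")
    return found


sub = Instr("SUBTRACT", "r1", ("r2", "r3"))
add = Instr("ADD", "r4", ("r1", "r5"))
print(hazards(sub, add))  # ['data']
```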
Thus, modern pipelined processors are typically optimized to reduce the likelihood of pipeline hazard occurrence. Additionally, one of various synchronization approaches is employed to deal with pipeline hazards. Particular examples of synchronization approaches include:
1) Forwarding or bypassing, which is a simple hardware technique where the arithmetic logic unit (ALU) result is always fed back to the ALU input latches. If an earlier ALU operation has written to the register corresponding to a source for the current ALU operation, the forwarded result is selected as the ALU input instead of the value read from the register file.
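The operand selection performed by forwarding can be modeled in a few lines. This is an illustrative software sketch of what is, in practice, a multiplexer in front of the ALU input latches; the function name, its parameters, and the register values are hypothetical.

```python
def select_alu_input(src_reg, register_file, prev_dest, prev_result):
    """Choose the ALU input for one source operand.

    If the previous ALU operation wrote the register this operand names,
    take the forwarded result; otherwise use the register file value.
    """
    if prev_dest is not None and src_reg == prev_dest:
        return prev_result
    return register_file[src_reg]


# Hypothetical register contents; r1 still holds a stale value.
regs = {"r1": 0, "r2": 10, "r3": 4, "r5": 1}

# SUBTRACT r1 = r2 - r3 has executed but not yet written back.
sub_result = regs["r2"] - regs["r3"]

# ADD r4 = r1 + r5 receives the forwarded r1 instead of the stale one.
a = select_alu_input("r1", regs, prev_dest="r1", prev_result=sub_result)
b = select_alu_input("r5", regs, prev_dest="r1", prev_result=sub_result)
print(a + b)  # 7
```

Without the forwarding path, the ADD would read the stale r1 value of 0 from the register file and have to stall until the SUBTRACT result was written back.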
2) Scoreboard, which is a more sophisticated hardware technique, where centralized information is maintained to facilitate dynamic scheduling around data hazards, that is, to allow out-of-order execution of instructions that have sufficient resources and no data dependencies. Typically, the scoreboard includes an instruction status table that tracks the current status of each issued or pending instruction, a function unit status table that tracks the current status of each function unit, and a register result status table that tracks which function unit will write to a register.
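The three scoreboard tables can be sketched as follows. This is a much simplified model, not a complete scoreboard: the class and method names are hypothetical, each function unit holds at most one instruction, and only two checks are modeled (a free function unit with no pending write to the destination at issue, and no pending producer for the sources before operand read).

```python
class Scoreboard:
    def __init__(self, unit_names):
        self.instr_status = {}                          # instruction status table
        self.unit_busy = {u: False for u in unit_names}  # function unit status table
        self.reg_result = {}                            # register -> unit that will write it

    def issue(self, tag, unit, dest):
        """Issue only if the unit is free and no other write to dest is pending."""
        if self.unit_busy[unit] or dest in self.reg_result:
            return False
        self.unit_busy[unit] = True
        self.reg_result[dest] = unit
        self.instr_status[tag] = "issued"
        return True

    def operands_ready(self, srcs, unit):
        """True when no other function unit still has to write a source."""
        return all(self.reg_result.get(s) in (None, unit) for s in srcs)

    def complete(self, tag, unit, dest):
        """Record the result write and free the function unit."""
        self.unit_busy[unit] = False
        if self.reg_result.get(dest) == unit:
            del self.reg_result[dest]
        self.instr_status[tag] = "done"


sb = Scoreboard(["alu0", "alu1"])
sb.issue("SUBTRACT", "alu0", "r1")
sb.issue("ADD", "alu1", "r4")
print(sb.operands_ready(("r1", "r5"), "alu1"))  # False
sb.complete("SUBTRACT", "alu0", "r1")
print(sb.operands_ready(("r1", "r5"), "alu1"))  # True
```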
3) The Tomasulo algorithm, which is a decentralized variant of the scoreboard approach that differs from it in two significant ways:
a. Hazard detection and execution control are distributed to function units by the use of reservation stations in the function units where dispatched instructions are queued for execution pending resolution of all dependencies and availability of the function units;
b. Execution results are passed directly from the function units rather than going through the registers.
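The two differences can be illustrated with a toy reservation station. This is a sketch under stated assumptions rather than the full Tomasulo algorithm: the class, its method names, and the tag strings are hypothetical, and a pending operand is represented as a `("tag", producer)` pair that is replaced by a `("value", v)` pair when the producing function unit delivers its result directly.

```python
class ReservationStation:
    def __init__(self):
        self.entries = []  # queued instructions awaiting operands

    def dispatch(self, op, operands):
        """Queue an instruction; each operand is ("value", v) or ("tag", producer)."""
        self.entries.append({"op": op, "operands": list(operands)})

    def capture(self, tag, value):
        """Accept a result passed directly from a function unit,
        filling every operand that was waiting on that producer."""
        for entry in self.entries:
            entry["operands"] = [
                ("value", value) if operand == ("tag", tag) else operand
                for operand in entry["operands"]
            ]

    def ready(self):
        """Entries whose dependencies have all been resolved."""
        return [e for e in self.entries
                if all(kind == "value" for kind, _ in e["operands"])]


rs = ReservationStation()
# ADD waits on the result of a SUBTRACT identified by the tag "SUB1".
rs.dispatch("ADD", [("tag", "SUB1"), ("value", 1)])
print(len(rs.ready()))  # 0
rs.capture("SUB1", 6)   # result arrives without going through the registers
print(rs.ready()[0]["operands"])  # [('value', 6), ('value', 1)]
```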
Traditionally, instructions were variable in length and they were fetched and dispatched into the pipeline one at a time. Instructions were made variable in length because a high premium was placed on efficient compaction of instructions in memory due to the relatively high cost of memory. As a result, the instruction decoding stage required a lot of speculative hardware. The instructions were decoded sequentially because, until the first instruction was decoded, the starting byte position of the next instruction could not be determined. Furthermore, each instruction took multiple clock cycles to decode.
However, since memory has become relatively inexpensive, many modern pipelined processors, particularly reduced instruction set based pipelined processors, now offer fixed length instructions. As a result, the instruction decoding stage no longer requires a lot of speculative hardware. Multiple instructions may be decoded at the same time, since the starting position of each instruction is determinable. Furthermore, with sufficient resources, multiple instructions may be decoded in one clock cycle.
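The difference in locating instruction boundaries can be sketched as follows. The 4-byte instruction width and the variable length encoding (the low two bits of the first byte, plus one, give the instruction length) are hypothetical choices made only for this illustration.

```python
def fixed_length_starts(code_size, width=4):
    """Every start position is known in advance, so the instructions
    could be handed to parallel decoders at the same time."""
    return list(range(0, code_size, width))


def variable_length_starts(code, length_of):
    """Each start position depends on decoding the previous instruction,
    so the boundaries must be found sequentially."""
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        pos += length_of(code[pos])
    return starts


def toy_length(first_byte):
    # Hypothetical encoding: low two bits + 1 = instruction length in bytes.
    return (first_byte & 0x03) + 1


print(fixed_length_starts(16))  # [0, 4, 8, 12]

code = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC])
print(variable_length_starts(code, toy_length))  # [0, 2, 3]
```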
Thus, fixed length instructions offer a new opportunity to increase the execution rate, and therefore the throughput, of pipelined processors. It is therefore desirable to be able to fetch, decode and issue multiple instructions to independent function units in one clock cycle. Furthermore, it is particularly desirable if the multiple instructions are issued with a minimal increase in hardware for either centralized or decentralized synchronization.
As will be obvious from the disclosure to follow, these objects and desired results are among the objects and desired results of the present invention, which provides a new approach to allow multiple instructions to be fetched, decoded and issued at the same time, thereby increasing the execution rate and throughput of a pipelined processor.
For further description of pipelining, see J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc., 1990.