Processor designers have long sought to reduce the amount of time a processor takes to execute a program. One technique for increasing a processor's performance is to overlap the steps involved in executing multiple instructions. This technique is known as pipelining. Each step in the pipeline, called a pipe stage, completes a part of an instruction. The pipe stages are connected one to the next to form a pipe, where instructions enter at one end, are processed through the stages, and exit at the other end. Because pipelining exploits parallelism among the instructions in a sequential instruction stream, it reduces the average execution time per instruction.
Because instruction execution proceeds at different rates in different pipe stages, one method of keeping a pipeline performing well is to insert queues that decouple these pipe stages. For example, when a queue is inserted between a fetch stage and a decode stage of a pipeline, a processor can retrieve instructions during the fetch stage and store them in the inserted queue. Instead of waiting for the output of the fetch stage, the processor can obtain instructions directly from the queue and proceed with its decode stage. As a result, the execution of the fetch stage and that of the decode stage are decoupled. In other words, the two stages can carry out their own tasks independently.
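The effect of such a queue can be illustrated with a minimal software sketch. The model below is an assumption made purely for illustration and does not describe any particular processor: the function name run_pipeline, the per-cycle fetch and decode rates, and the queue capacities are all hypothetical. In each simulated cycle the decode stage reads only from the queue, and the fetch stage writes into the queue until it is full.

```python
from collections import deque


def run_pipeline(fetch_per_cycle, decode_per_cycle, queue_capacity):
    """Simulate a fetch stage feeding a decode stage through a bounded queue.

    fetch_per_cycle[i]  -- instructions the fetch stage can deliver in cycle i
    decode_per_cycle[i] -- instructions the decode stage can accept in cycle i
    Returns the total number of instructions decoded.
    (Hypothetical model: all rates and capacities are illustrative.)
    """
    queue = deque()
    decoded = 0
    next_instruction = 0
    for fetched, wanted in zip(fetch_per_cycle, decode_per_cycle):
        # The decode stage takes instructions from the queue, not from
        # the fetch stage's output for this cycle.
        for _ in range(min(wanted, len(queue))):
            queue.popleft()
            decoded += 1
        # The fetch stage stores instructions until the queue is full;
        # anything that does not fit is simply not fetched this cycle.
        free_slots = queue_capacity - len(queue)
        for _ in range(min(fetched, free_slots)):
            queue.append(next_instruction)
            next_instruction += 1
    return decoded


# A bursty fetch stage (2 instructions every other cycle) feeding a
# steady decode stage (1 instruction per cycle).
bursty_fetch = [2, 0, 2, 0, 2, 0]
steady_decode = [1, 1, 1, 1, 1, 1]

shallow = run_pipeline(bursty_fetch, steady_decode, queue_capacity=1)
deep = run_pipeline(bursty_fetch, steady_decode, queue_capacity=4)
print(shallow, deep)
```

In this sketch the deeper queue absorbs each fetch burst, so the decode stage stays busy in cycles where nothing is fetched, whereas the one-entry queue leaves the decode stage idle every other cycle.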
Although the described decoupling queue mechanism enhances the performance of a pipeline, its limitations may still hamper continued efforts to speed up processors. As a processor further splits up its pipe stages and increases its clock speed, the decoupling queue's setup time and its delay in performing either a read or a write operation can become prohibitively long relative to a diminishing clock cycle. When such overhead equals the processor clock cycle, no further pipelining is useful in enhancing a processor's performance.
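The relationship between queue overhead and clock cycle can be made concrete with a short calculation. The sketch below assumes a simplified model in which a fixed queue overhead (setup time plus read or write delay) is subtracted from each clock period; the function name and all picosecond values are hypothetical and chosen only to show the trend.

```python
def useful_fraction(clock_period_ps, queue_overhead_ps):
    """Fraction of a clock cycle left for real work after queue overhead.

    Simplified, hypothetical model: the overhead is a fixed cost per cycle.
    """
    remaining = max(0, clock_period_ps - queue_overhead_ps)
    return remaining / clock_period_ps


QUEUE_OVERHEAD_PS = 200  # assumed fixed overhead, for illustration only

# Shorter clock periods correspond to deeper pipelining and higher clock speed.
for period_ps in (1000, 500, 250, 200):
    fraction = useful_fraction(period_ps, QUEUE_OVERHEAD_PS)
    print(f"period {period_ps} ps: {fraction:.0%} of the cycle does useful work")
```

Under these assumed numbers the useful share of each cycle shrinks as the period shortens and reaches zero when the period equals the overhead, which is the point at which further pipelining yields no benefit.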
As has been demonstrated, an improved method and apparatus are needed to meet aggressive timing goals and to design a processor at a target frequency without impacting its performance.