1. Technical Field
The present invention relates generally to superscalar processors and, in particular, to completing more instructions per clock cycle. Still more particularly, the invention relates to improving the instruction completion scheme and the completion unit of a superscalar processor.
2. Description of the Related Art
Microprocessors are classified as superscalar if the microprocessor ("processor") is capable of completing multiple instructions per clock cycle. The architecture of a superscalar processor utilizes multiple parallel processing units within the processor to allow completion of multiple instructions per clock cycle. These processing units generally include multiple execution units operating in parallel, a dispatch unit for sending instructions and data to the execution units, a completion unit containing a "completion table" for tracking and retiring the instructions, and rename buffers (rename registers) for preloading instructions for the execution units. The tracking and retiring feature of the completion table provides for executing instructions out of order.
Utilizing multiple parallel processing units requires, for efficiency and speed, that instructions be "pipelined." Pipelining is a method of fetching and decoding instructions so that an execution unit does not have to wait for instructions; the execution unit begins executing a second instruction before the first has been completed. Additionally, current architecture uses speculative execution (executing instructions from different branch paths) and branch prediction to increase performance of the processor. Branch prediction is utilized to predict the way an instruction will branch the next time it is executed and is generally correct 90 percent of the time.
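As an illustration of how a prediction for a branch can be derived from its earlier outcomes, the following is a minimal sketch of a 2-bit saturating counter predictor. This is one well-known textbook scheme and is not stated to be the scheme used by any processor described here. The class name, the starting counter value, the branch address and the loop pattern are all assumptions chosen for the example, and the measured accuracy depends entirely on that pattern.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter (illustrative scheme only)."""

    def __init__(self):
        # Hypothetical table: branch address -> counter value in 0..3.
        self.counters = {}

    def predict(self, addr):
        # Counter values 2 and 3 predict "taken"; unseen branches start at 1.
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)


def accuracy(outcomes, addr=0x100):
    """Fraction of outcomes predicted correctly for a single branch."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)
    return correct / len(outcomes)


# Assumed workload: a loop branch taken nine times and then not taken,
# with the whole loop repeated ten times.
loop_branch = ([True] * 9 + [False]) * 10
loop_accuracy = accuracy(loop_branch)
```

With this assumed pattern, the predictor mispredicts once on warm-up and once at each loop exit.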
FIG. 4 is a simplified block diagram illustrating instruction flow in a superscalar processor. Multiple instructions are retrieved from the instruction cache by the fetcher 402 and placed in either the branch processing unit 404 or the instruction queue 406. After entering the instruction queue 406, instructions are issued to the various execution units by the dispatch unit 408. The dispatch rate is contingent on, among other things, the execution unit busy status, rename buffer availability (not shown) and completion table buffer availability. In current processors, instruction completion logic is performed in a single unit, the completion unit, within the processor. Completion unit 418 tracks instructions from dispatch through execution and finish, allowing for out of order execution of instructions. Status of an instruction is transmitted by execution unit 410, 412, 414 or 416 to completion unit 418 when that execution unit finishes with the instruction.
The completion unit then completes, or retires, the instruction and sends a completion signal to the remaining execution units, allowing write-back of finished data into architected registers.
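The dispatch conditions and status tracking described for FIG. 4 can be sketched in software as follows. This is a simplified model under stated assumptions, not the hardware logic: the names `CompletionTable`, `can_dispatch`, `allocate` and `mark_finished`, the integer instruction tags, and the table capacity are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionTable:
    """Hypothetical model of a completion table with a fixed number of entries."""
    capacity: int
    entries: list = field(default_factory=list)  # kept in program order

    def has_room(self):
        return len(self.entries) < self.capacity

    def allocate(self, tag):
        # Called at dispatch: the instruction is tracked from this point on.
        self.entries.append({"tag": tag, "finished": False})

    def mark_finished(self, tag):
        # Called when an execution unit reports that it has finished the instruction.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["finished"] = True


def can_dispatch(unit_busy, free_rename_buffers, table):
    """Dispatch only if the target unit is idle and both resources have room."""
    return (not unit_busy) and free_rename_buffers > 0 and table.has_room()


table = CompletionTable(capacity=2)
ok_empty = can_dispatch(False, 4, table)      # idle unit, free buffers, room in table
table.allocate(0)
table.allocate(1)
ok_full = can_dispatch(False, 4, table)       # table is full, so dispatch is held
ok_no_rename = can_dispatch(False, 0, CompletionTable(capacity=2))
table.mark_finished(1)                        # instruction 1 finishes before instruction 0
status = [(e["tag"], e["finished"]) for e in table.entries]
```

Here instruction 1 is marked finished while instruction 0 is still outstanding, which is the out of order finish that the table is meant to track.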
However, there is a fixed number of instructions that may be completed or retired per cycle by the completion unit, a limiting factor that may lead to a bottleneck in the instruction dispatch logic. All instructions are tracked, completed and committed to specific architected registers by completion unit 418. One reason that the processor 400 may drop in efficiency and speed when the completion unit 418 stalls is that execution units 410, 412, 414, and 416 cannot send a finish signal to update a corresponding instruction status in completion unit 418. The execution units 410, 412, 414, and 416 stall because finish signals cannot be transmitted to a full completion unit 418 queue. The backup continues because the rename buffers (not shown) can no longer transfer instructions to execution units 410, 412, 414, and 416. The dispatch unit 408 then determines that the rename buffers are full and that there is no room for additional instructions. Because the dispatch unit 408 will not dispatch an instruction unless space is available in the rename buffers, the processor stalls.
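The effect of a fixed per-cycle retirement limit can be illustrated with a small cycle-by-cycle model. This is a deliberately simplified sketch: it tracks only completion table occupancy (rename buffers and execution latency are not modeled), it assumes every tracked instruction is ready to retire, and it assumes retirement frees entries before dispatch in the same cycle. The function name and all parameter values are hypothetical.

```python
def simulate(cycles, table_capacity, dispatch_width, retire_width):
    """Return (dispatched, retired, stalled_slots) after the given number of cycles."""
    occupancy = 0
    dispatched = retired = stalled_slots = 0
    for _ in range(cycles):
        # Retirement: at most retire_width entries leave the table per cycle.
        leaving = min(retire_width, occupancy)
        occupancy -= leaving
        retired += leaving
        # Dispatch: limited by the number of free table entries.
        free = table_capacity - occupancy
        entering = min(dispatch_width, free)
        occupancy += entering
        dispatched += entering
        # Dispatch slots that went unused because the table was full.
        stalled_slots += dispatch_width - entering
    return dispatched, retired, stalled_slots


# Assumed configuration: 8-entry table, dispatch up to 4 and retire up to 2 per cycle.
result = simulate(cycles=10, table_capacity=8, dispatch_width=4, retire_width=2)
```

In this model the table fills within a few cycles, after which the dispatch rate is held to the retirement rate and the remaining dispatch slots are counted as stalled.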
A second reason for a completion unit 418 bottleneck is that the completion unit 418, though capable of retiring multiple instructions at the same time, must retire instructions in program order. If the instruction in entry 0 of completion unit 418 (the first instruction retiring position) cannot be retired because it is still being executed, instruction completion stalls, even if later instructions have already finished.
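The in-order retirement constraint can be sketched as a queue from which only the oldest finished entries may be removed. This is an illustrative model, not the completion unit's actual logic; the `Entry` type, the `retire_in_order` function, and the limit of two retirements per cycle are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    tag: int                # hypothetical program-order identifier
    finished: bool = False  # set when an execution unit reports finish


def retire_in_order(table, max_per_cycle):
    """Retire finished entries from the head of the table, oldest first."""
    retired = []
    # Stop at the first unfinished entry, even if younger entries are finished.
    while table and len(retired) < max_per_cycle and table[0].finished:
        retired.append(table.popleft().tag)
    return retired


table = deque(Entry(tag=i) for i in range(4))
for entry in list(table)[1:]:
    entry.finished = True                # instructions 1 to 3 finish early

blocked = retire_in_order(table, max_per_cycle=2)       # entry 0 is still executing
table[0].finished = True
first_cycle = retire_in_order(table, max_per_cycle=2)
second_cycle = retire_in_order(table, max_per_cycle=2)
```

While entry 0 is unfinished, nothing retires even though three younger instructions are done; once it finishes, retirement proceeds at the assumed limit of two per cycle.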
The speed of processors capable of executing multiple instructions per clock cycle is limited by the ability of the processor's completion table to complete or retire instructions.
It would be desirable, therefore, to provide a method and apparatus for completing instructions in a manner that would eliminate the bottleneck posed by conventional completion units.
It would also be desirable to complete more instructions per cycle. It would further be desirable to improve the efficiency of the completion table in retiring instructions.