The present invention relates to a decoder for use in a processing system. In particular, the present invention provides a low cost four-instruction decoder for a super-scalar processor.
The time taken by a computing system to perform a particular application is determined by three basic factors, namely, the processor cycle time, the number of processor instructions required to perform the application, and the average number of processor cycles required to execute an instruction. Overall system performance can be improved by reducing one or more of these factors. For example, the average number of cycles required to perform an application can be significantly reduced by employing a multi-processor architecture, i.e., providing more than one processor to execute separate instructions concurrently.
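The relationship among the three factors can be sketched as a simple product. The function name and the workload figures below are hypothetical and chosen only for illustration; they are not taken from any measured system.

```python
def execution_time_ns(cycle_time_ns, instruction_count, cycles_per_instruction):
    """Total run time as the product of the three basic factors."""
    return cycle_time_ns * instruction_count * cycles_per_instruction


# Hypothetical workload: one million instructions on a 50 ns processor cycle.
baseline = execution_time_ns(50, 1_000_000, 1.0)

# Halving the average cycles per instruction halves the run time,
# provided the other two factors stay unchanged.
improved = execution_time_ns(50, 1_000_000, 0.5)

print(baseline / improved)
```

Because the factors multiply, a reduction in any one of them improves the total only to the extent that the other two are not made worse by the change.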
There are disadvantages, however, associated with the implementation of a multi-processor architecture. In order to be effective, multi-processing requires an application that can be easily segmented into independent tasks to be performed concurrently by the different processors. The requirement for a readily segmented task limits the effective applicability of multi-processing. Further, the increase in processing performance attained via multi-processing in many circumstances may not offset the additional expense incurred by requiring multiple processors.
Single-processor hardware architectures that avoid the disadvantages associated with multi-processing have been proposed. These so-called "super-scalar" processors permit a sustained execution rate of more than one instruction per processor cycle. Conventional scalar processors, by contrast, can handle multiple instructions in different pipeline stages in one cycle but are limited to a maximum pipeline capacity of one instruction per cycle. A super-scalar pipeline architecture achieves concurrency between instructions both in different pipeline stages and within the same pipeline stage.
A super-scalar processor that executes more than one instruction per cycle, however, can only be effective when instructions can be supplied at a sufficient rate. It is readily apparent that instruction fetching can be a limiting factor in overall system performance if the average rate of instruction fetching is less than the average rate of instruction execution. Although the amount of instruction-level concurrency in most applications is sufficient to support an execution rate of two instructions per cycle, it is difficult to provide the required instruction bandwidth. For example, branches disrupt the sequentiality of instruction addressing, causing instructions to be misaligned with respect to an instruction decoder. This in turn causes some otherwise valid fetch and decode cycles to be only partially effective in supplying the processor with instructions, because the entire width of the instruction fetcher is not occupied by valid instructions.
The sequentially fetched instructions between branches are called a run, and the number of instructions fetched sequentially is called the run length. FIG. 1 illustrates two instruction runs consisting of a number of instructions occupying four instruction-cache blocks (assuming a four-word cache block) in an instruction cache memory. The first instruction run consists of instructions S1-S5, which contain a branch to a second instruction run T1-T4. FIG. 2 illustrates how these instruction runs are sequenced through a four-instruction decoder and a two-instruction decoder, assuming for purposes of illustration that two cycles are required to determine the outcome of a branch.
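The effect of misalignment on decoder occupancy can be sketched as follows. This is a simplified model, not a reproduction of FIG. 2: it assumes the decoder receives aligned groups of instructions, it ignores the two-cycle branch delay, and the starting offsets of the two runs are hypothetical, since the exact positions of S1-S5 and T1-T4 within their cache blocks are not stated in the text.

```python
import math


def decode_cycles(start_offset, run_length, decoder_width):
    """Cycles needed to decode one run.

    start_offset is the slot, within an aligned fetch group, at which the
    run begins. Slots before the run start and after the run end are wasted.
    """
    if not 0 <= start_offset < decoder_width:
        raise ValueError("start_offset must lie inside one fetch group")
    return math.ceil((start_offset + run_length) / decoder_width)


def fetch_efficiency(runs, decoder_width):
    """Average instructions decoded per cycle over (word_offset, length) runs."""
    instructions = sum(length for _, length in runs)
    cycles = sum(
        decode_cycles(offset % decoder_width, length, decoder_width)
        for offset, length in runs
    )
    return instructions / cycles


# Assumed placement: the five-instruction run starts at word 1 of its block
# and the four-instruction run starts at word 2 of its block.
runs = [(1, 5), (2, 4)]

print(fetch_efficiency(runs, 4))  # four-instruction decoder
print(fetch_efficiency(runs, 2))  # two-instruction decoder
```

Under these assumptions neither decoder reaches its peak width, because the first and last fetch groups of each run are only partly filled with valid instructions.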
FIG. 3 demonstrates the benefit of a four-instruction decoder measured during the execution of selected sample applications. For these programs, the average fetch efficiency is 1.72 instructions per cycle for a two-instruction decoder and 2.75 instructions per cycle for a four-instruction decoder. As would be expected, the four-instruction decoder always outperforms the two-instruction decoder, as the four-instruction decoder has twice the potential instruction bandwidth of the two-instruction decoder.
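The two averages quoted above can be compared directly. The short calculation below uses only those two figures; the function name is hypothetical. It shows that the measured gain is roughly 1.6 times rather than the factor of two suggested by the peak bandwidth, and that the decoders are on average about 86 percent and 69 percent occupied, respectively.

```python
def decoder_utilization(fetch_efficiency, decoder_width):
    """Fraction of decoder slots filled with valid instructions on average."""
    return fetch_efficiency / decoder_width


TWO_WIDE_EFFICIENCY = 1.72   # instructions per cycle, from the text
FOUR_WIDE_EFFICIENCY = 2.75  # instructions per cycle, from the text

speedup = FOUR_WIDE_EFFICIENCY / TWO_WIDE_EFFICIENCY

print(round(speedup, 2))
print(round(decoder_utilization(TWO_WIDE_EFFICIENCY, 2), 2))
print(round(decoder_utilization(FOUR_WIDE_EFFICIENCY, 4), 2))
```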
In an actual implementation, however, directly modeling a four-instruction decoder after a single-instruction decoder is not cost-effective. In a straightforward implementation, decoding four instructions per cycle would require eight read ports on both the processor's register file and result buffer (two source operands for each of the four instructions), and eight buses for distributing operands. Other problems include the requirement for a tremendous number of comparators in the processor execution hardware for dependency analysis. Thus, the cost of the increased hardware required by a four-instruction decoder generally outweighs the performance benefits gained by its implementation.
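The port and bus counts above follow from a simple scaling rule, sketched below. The sketch assumes two source operands per instruction, which is consistent with the figure of eight for four instructions but is not stated explicitly in the text; the function and key names are hypothetical. Comparator counts are not modeled, since the text gives no figure for them.

```python
def straightforward_decoder_cost(decode_width, source_operands_per_instruction=2):
    """Operand-path hardware for a decoder that scales a one-wide design directly.

    Every source operand of every instruction decoded in a cycle gets its own
    read port on each storage structure and its own operand bus.
    """
    operands_per_cycle = decode_width * source_operands_per_instruction
    return {
        "register_file_read_ports": operands_per_cycle,
        "result_buffer_read_ports": operands_per_cycle,
        "operand_buses": operands_per_cycle,
    }


for width in (1, 2, 4):
    print(width, straightforward_decoder_cost(width))
```

The counts grow linearly with decoder width, so each doubling of the width doubles the operand-path hardware under this model.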