The concept of parallel execution of instructions has helped to increase the performance of computer systems. Parallel execution is based on having separate functional units which can execute two or more of the same or different instructions simultaneously.
Another technique used to increase the performance of computer systems is pipelining. In general, pipelining is achieved by partitioning a function to be performed by a computer into independent subfunctions and allocating a separate piece of hardware, or stage, to perform each subfunction. Each stage is defined to occupy one basic machine cycle in time. Pipelining does provide a form of parallel processing since it is possible to execute multiple instructions concurrently. Ideally, one new instruction can be fed into the pipeline per cycle, with each instruction in the pipeline being in a different stage of execution. The operation is analogous to a manufacturing assembly line, with a number of instances of the manufactured product in varying stages of completion.
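The assembly-line behavior described above can be illustrated with a small timing model. This is only a sketch of an ideal pipeline with no stalls; the function names and the idea of tabulating stage occupancy per cycle are illustrative assumptions, not part of any particular machine.

```python
from typing import List, Optional


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Without pipelining, each instruction passes through every stage
    before the next instruction starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Ideal pipeline: the first instruction needs num_stages cycles,
    after which one instruction completes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def stage_occupancy(num_instructions: int, num_stages: int) -> List[List[Optional[int]]]:
    """Row c lists, for each stage, the index of the instruction occupying
    that stage in cycle c (None when the stage is empty)."""
    table = []
    for cycle in range(pipelined_cycles(num_instructions, num_stages)):
        row: List[Optional[int]] = []
        for stage in range(num_stages):
            # An instruction fed in at cycle k is in stage (cycle - k).
            instr = cycle - stage
            row.append(instr if 0 <= instr < num_instructions else None)
        table.append(row)
    return table
```

Under these assumptions, five instructions on a four-stage pipeline take 8 cycles instead of 20, and every row of the occupancy table shows each instruction in a different stage.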
However, the benefits of parallel execution and/or pipelining are often not achieved because of delays such as those caused by data-dependent interlocks and hardware-dependent interlocks. An example of a data-dependent interlock is a so-called write-read interlock, where a first instruction must write its result before a second instruction can read and subsequently use it. An example of a hardware-dependent interlock is where a first instruction must use a particular hardware component and a second instruction must also use the same hardware component.
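The two kinds of interlock can be expressed as simple checks on a pair of instructions. The following is a minimal sketch; the `Instr` record, its fields, the register names, and the assumption that each hardware component exists only once are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]    # registers read
    unit: str                # hardware component used (assumed single-instance)


def write_read_interlock(first: Instr, second: Instr) -> bool:
    """Data-dependent interlock: second reads what first writes."""
    return first.dest is not None and first.dest in second.srcs


def hardware_interlock(first: Instr, second: Instr) -> bool:
    """Hardware-dependent interlock: both need the same component."""
    return first.unit == second.unit


add = Instr("ADD", "R1", ("R2", "R3"), "alu")
sub = Instr("SUB", "R4", ("R1", "R5"), "alu")   # reads R1 written by ADD
load = Instr("LOAD", "R6", ("R7",), "mem")
```

Here `ADD` followed by `SUB` exhibits both interlocks, while `ADD` followed by `LOAD` exhibits neither.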
One of the techniques previously employed to avoid interlocks (sometimes called pipeline hazards) is called dynamic scheduling. Dynamic scheduling is based on the fact that with the inclusion of specialized hardware, it is possible to reorder instruction sequences after they have been issued into the pipeline for execution.
There have also been some attempts to improve performance through so-called static scheduling which is done before the instruction stream is fetched from storage for execution. Static scheduling is achieved by moving code and thereby reordering the instruction sequence before execution. This reordering produces an equivalent instruction stream that will more fully utilize the hardware through parallel processing. Such static scheduling is typically done at compile time. However, the reordered instructions remain in their original form and conventional parallel processing still requires some form of dynamic determination just prior to execution of the instructions in order to decide whether to execute the next two instructions serially or in parallel.
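A compile-time reordering of this kind can be sketched as a greedy scheduler that preserves every register dependency but avoids placing an instruction directly after the one that produces its input. This is an illustrative simplification, not the method of any particular compiler; the `Instr` record, the register names, and the greedy selection rule are assumptions.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def depends_on(later: Instr, earlier: Instr) -> bool:
    """True if 'later' must stay after 'earlier' to keep the result equivalent."""
    raw = earlier.dest is not None and earlier.dest in later.srcs   # write then read
    war = later.dest is not None and later.dest in earlier.srcs     # read then write
    waw = earlier.dest is not None and earlier.dest == later.dest   # write then write
    return raw or war or waw


def predecessors(program: List[Instr]) -> List[Set[int]]:
    return [{i for i in range(j) if depends_on(program[j], program[i])}
            for j in range(len(program))]


def static_schedule(program: List[Instr]) -> List[int]:
    """Return a new order (as original indices) that respects all dependencies."""
    preds = predecessors(program)
    order: List[int] = []
    done: Set[int] = set()
    while len(order) < len(program):
        ready = [j for j in range(len(program)) if j not in done and preds[j] <= done]
        last = order[-1] if order else None
        # Prefer an instruction that does not depend on the one just emitted.
        independent = [j for j in ready if last is None or last not in preds[j]]
        choice = (independent or ready)[0]
        order.append(choice)
        done.add(choice)
    return order


def adjacent_interlocks(program: List[Instr], order: List[int]) -> int:
    """Count neighbouring pairs in which the second depends on the first."""
    preds = predecessors(program)
    return sum(1 for a, b in zip(order, order[1:]) if a in preds[b])


program = [
    Instr("LOAD", "R1", ("R2",)),
    Instr("ADD", "R3", ("R1", "R4")),
    Instr("LOAD", "R5", ("R6",)),
    Instr("SUB", "R7", ("R5", "R8")),
]
```

For this four-instruction program, the sketch moves the second load ahead of the add, so no instruction immediately follows the instruction it depends on. The reordered instructions themselves are unchanged, which is why a separate run-time decision about serial or parallel execution is still required.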
Such scheduling techniques can improve the overall performance of a pipelined computer, but cannot alone satisfy the ever-present demands for increased performance. In that regard, many of the recent proposals for general purpose computing are related to the exploitation of parallelism at the instruction level beyond that attained by pipelining. For example, further instruction level parallelism has been achieved explicitly by issuing multiple instructions per cycle with so-called superscalar machines, rather than implicitly as with dynamic scheduling of single instructions or with vector machines. The name superscalar distinguishes machines that issue multiple instructions per cycle from scalar machines that issue one instruction per cycle.
In a typical superscalar machine, the opcodes in a fetched instruction stream are decoded and analyzed dynamically by instruction issue logic in order to determine whether the instructions can be executed in parallel. The criteria for such last-minute dynamic scheduling are unique to each instruction set architecture, as well as to the underlying implementation of that architecture in any given instruction processing unit. Its effectiveness is therefore limited by the complexity of the logic needed to determine which combinations of instructions can be executed in parallel, and the cycle time of the instruction processing unit is likely to be increased. The increased hardware and cycle time for such superscalar machines become an even bigger problem in architectures which have hundreds of different instructions.
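The issue-time decision described above can be modeled, in a highly simplified way, as a lookup on the opcodes of two adjacent instructions followed by a register check. The opcode classes, the pairing table, and the two-wide issue limit below are hypothetical; real issue logic is specific to each architecture and implementation.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Op(NamedTuple):
    opcode: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


# Hypothetical opcode classes and the class pairs allowed to issue together.
OPCODE_CLASS: Dict[str, str] = {"LOAD": "mem", "STORE": "mem", "ADD": "alu", "SUB": "alu"}
PAIRABLE: Set[Tuple[str, str]] = {("mem", "alu"), ("alu", "mem")}


def can_issue_together(first: Op, second: Op) -> bool:
    """Decide from the opcodes and registers whether two adjacent
    instructions may execute in the same cycle."""
    classes = (OPCODE_CLASS[first.opcode], OPCODE_CLASS[second.opcode])
    if classes not in PAIRABLE:
        return False
    if first.dest is None:
        return True
    # Block the pair if second reads or overwrites first's result.
    return first.dest not in second.srcs and first.dest != second.dest


def issue_in_order(stream: List[Op]) -> List[List[int]]:
    """Group instruction indices into issue cycles, strictly in fetch order."""
    groups: List[List[int]] = []
    i = 0
    while i < len(stream):
        if i + 1 < len(stream) and can_issue_together(stream[i], stream[i + 1]):
            groups.append([i, i + 1])
            i += 2
        else:
            groups.append([i])
            i += 1
    return groups


stream = [
    Op("LOAD", "R1", ("R2",)),
    Op("ADD", "R3", ("R1", "R4")),
    Op("LOAD", "R5", ("R6",)),
    Op("SUB", "R7", ("R8", "R9")),
]
```

With four opcodes the pairing table is trivial, but it must be consulted on every fetch, and the number of opcode combinations to cover grows roughly with the square of the instruction count, which suggests why such logic becomes costly for instruction sets with hundreds of opcodes.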
There are other deficiencies with dynamic scheduling, static scheduling, or combinations thereof. For example, it is necessary to review each scalar instruction anew every time it is fetched for execution to determine its capability for parallel execution. There has been no way provided to identify and flag ahead of time those scalar instructions which have parallel execution capabilities.
Another deficiency with dynamic scheduling of the type implemented in superscalar machines is the manner in which scalar instructions are checked for possible parallel processing. Superscalar machines check scalar instructions based on their opcode descriptions, and no way is provided to take into account hardware utilization. Also, instructions are issued in FIFO fashion, thereby eliminating the possibility of selective grouping to avoid or minimize the occurrence of interlocks.
There are some existing techniques which do seek to consider the hardware requirements for parallel instruction processing. One such system is a form of static scheduling called the Very Long Instruction Word (VLIW) machine, in which a sophisticated compiler rearranges instructions so that hardware instruction scheduling is simplified. In this approach the compiler must be more complex than standard compilers so that a bigger window can be used for purposes of finding more parallelism in an instruction stream. However, the resulting instructions may not be object code compatible with the pre-existing architecture, so the approach solves one problem while creating new ones. Also, substantial additional problems arise from frequent branching, which limits the parallelism such a machine can exploit.
Therefore, none of these prior art approaches to parallel processing have been sufficiently comprehensive to minimize all possible interlocks, while at the same time avoiding major redesign of the architected instruction set and avoiding complex logic circuits for dynamic decoding of fetched instructions.
Accordingly, what is needed is an improvement in digital data processing which facilitates the execution of existing machine instructions in parallel in order to increase processor performance. Since the number of instructions executed per second is the reciprocal of the product of the basic cycle time of the processor and the average number of cycles required per instruction completion, what is needed is a solution which takes both of these parameters into consideration. More specifically, a mechanism is needed that reduces the number of cycles required for the execution of an instruction for a given architecture. In addition, an improvement is needed which reduces the complexity of the hardware necessary to support parallel instruction execution, thus minimizing any possible increase in cycle time. Additionally, it would be highly desirable for the proposed improvement to provide compatibility of the implementation with an already defined system architecture while introducing parallelism at the instruction level of both new and existing machine code.
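The relationship between throughput and these two parameters can be written out explicitly. The symbols $t_{\text{cycle}}$ (basic cycle time in seconds) and $\text{CPI}$ (average cycles per instruction completion) are notation assumed here for illustration, as are the numbers in the worked example.

```latex
\[
\text{time per instruction} = t_{\text{cycle}} \times \text{CPI},
\qquad
\text{instructions per second} = \frac{1}{t_{\text{cycle}} \times \text{CPI}}
\]

% Illustrative numbers only: a 10 ns cycle time.
\[
\text{CPI} = 2:\ \frac{1}{(10 \times 10^{-9}\,\text{s}) \times 2} = 5 \times 10^{7}\ \text{instructions/s},
\qquad
\text{CPI} = 1:\ \frac{1}{(10 \times 10^{-9}\,\text{s}) \times 1} = 1 \times 10^{8}\ \text{instructions/s}
\]
```

Halving the cycles per instruction doubles throughput only if the cycle time does not grow, which is why both parameters must be considered together.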