The concept of parallel execution of instructions has helped to increase the performance of computer systems. Parallel execution is based on having separate functional units which can execute two or more of the same or different instructions simultaneously.
Another technique used to increase the performance of computer systems is pipelining. Pipelining does provide a form of parallel processing since it is possible to execute multiple instructions concurrently.
However, many times the benefits of parallel execution and/or pipelining are not achieved because of delays like those caused by data dependent interlocks and hardware dependent interlocks. An example of a data dependent interlock is a so-called write-read interlock where a first instruction must write its result before the second instruction can read and subsequently use it. An example of hardware dependent interlock is where a first instruction must use a particular hardware component and a second instruction must also use the same particular hardware component.
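The two kinds of interlock described above can be illustrated with a minimal sketch. The `Instr` record, its field names, the register names, and the functional-unit labels are all hypothetical and chosen only for illustration. The sketch also assumes that the hardware has a single copy of each functional unit, so that two instructions needing the same unit always conflict.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str          # mnemonic, used only for display
    writes: frozenset  # registers the instruction writes
    reads: frozenset   # registers the instruction reads
    unit: str          # functional unit it needs (hypothetical label)

def write_read_interlock(first, second):
    """Data dependent: second reads a register that first must write."""
    return bool(first.writes & second.reads)

def hardware_interlock(first, second):
    """Hardware dependent: both need the same unit (one copy assumed)."""
    return first.unit == second.unit

add = Instr("ADD", frozenset({"R1"}), frozenset({"R2", "R3"}), "alu")
sub = Instr("SUB", frozenset({"R4"}), frozenset({"R1", "R5"}), "alu")
load = Instr("LOAD", frozenset({"R6"}), frozenset({"R7"}), "mem")

# ADD then SUB: SUB reads R1, which ADD writes, and both need the ALU.
print(write_read_interlock(add, sub), hardware_interlock(add, sub))
# ADD then LOAD: no shared register and different units.
print(write_read_interlock(add, load), hardware_interlock(add, load))
```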
One of the techniques previously employed to avoid interlocks (sometimes called pipeline hazards) is called dynamic scheduling. Dynamic scheduling means that shortly before execution, the opcodes in an instruction stream are decoded to determine whether the instructions can be executed in parallel. Computers which practice one type of such dynamic scheduling are sometimes called superscalar machines. The criteria for dynamic scheduling are unique to each instruction set architecture, as well as to the underlying implementation of that architecture in any given instruction processing unit. The effectiveness of dynamic scheduling is therefore limited by the complexity of the architecture, which leads to extensive logic to determine which combinations of instructions can be executed in parallel, and thus may increase the cycle time of the instruction processing unit. The increased hardware and cycle time for such dynamic scheduling become an even bigger problem in architectures which have hundreds of different instructions.
There have also been some attempts to improve performance through so-called static scheduling which is done before the instruction stream is fetched from storage for execution. Static scheduling is achieved by moving code and thereby reordering the instruction sequence before execution. This reordering produces an equivalent instruction stream that will more fully utilize the hardware through parallel processing. Such static scheduling is typically done at compile time. However, the reordered instructions remain in their original form and conventional parallel processing still requires some form of dynamic determination just prior to execution of the instructions in order to decide whether to execute the next two instructions serially or in parallel.
There are other deficiencies with dynamic scheduling, static scheduling, or combinations thereof. For example, it is necessary to review each scalar instruction anew every time it is fetched for execution to determine its capability for parallel execution. There has been no way provided to identify and flag ahead of time those scalar instructions which have parallel execution capabilities.
Another deficiency with dynamic scheduling of the type implemented in superscalar machines is the manner in which scalar instructions are checked for possible parallel processing. Superscalar machines check scalar instructions based on their opcode descriptions, and no way is provided to take into account hardware utilization. Also, instructions are issued in FIFO fashion, thereby eliminating the possibility of selective grouping to avoid or minimize the occurrence of interlocks.
There are some existing techniques which do seek to consider the hardware requirements for parallel instruction processing. One such system is called the Very Long Instruction Word (VLIW) machine, in which a sophisticated compiler rearranges instructions so that hardware instruction scheduling is simplified. In this approach the compiler must be more complex than standard compilers so that a bigger window can be used for purposes of finding more parallelism in an instruction stream. But the resulting instructions may not necessarily be object code compatible with the pre-existing architecture, thereby solving one problem while creating additional new problems. Also, substantial additional problems arise due to frequent branching, which limits the parallelism such a machine can achieve.
A recent innovation which seeks to more fully exploit parallel execution of instructions is called Scalable Compound Instruction Set Machines (SCISM). A compound instruction is created by pre-processing an instruction stream in order to look for sets of two or more adjacent scalar instructions that can be executed in parallel. In some instances certain types of interlocked instructions can be compounded for parallel execution where the interlocks are collapsible in a particular hardware configuration. In other configurations where the interlocks are non-collapsible, the instructions having data dependent or hardware dependent interlocks are excluded from groups forming compound instructions. Each compound instruction is identified by control information such as tags associated with the compound instruction, and the length of a compound instruction is scalable over a range beginning with a set of two scalar instructions up to whatever maximum number of individual scalar instructions can be processed together by the specific hardware implementation.
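The pre-processing step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual compounding rules of any particular machine: the `Instr` record, the `interlocked` test, the `compound` function, and the one-bit tag encoding (1 meaning "the next instruction belongs to the same compound instruction", 0 meaning "this instruction ends the group") are all hypothetical. Only the non-collapsible case is modeled, so any write-read or same-unit conflict excludes an instruction from the current group, and other hazards are ignored for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    writes: frozenset  # registers written
    reads: frozenset   # registers read
    unit: str          # functional unit needed (hypothetical label)

def interlocked(earlier, later):
    """Non-collapsible interlock: data dependent or hardware dependent."""
    return bool(earlier.writes & later.reads) or earlier.unit == later.unit

def compound(stream, max_len=2):
    """Greedily group adjacent instructions and return one tag per instruction.

    max_len stands in for the largest group the hardware can process
    together, which is what makes the compound length scalable.
    """
    tags = [0] * len(stream)
    group = []
    for i, ins in enumerate(stream):
        fits = group and len(group) < max_len
        if fits and not any(interlocked(g, ins) for g in group):
            tags[i - 1] = 1      # previous instruction now continues the group
            group.append(ins)
        else:
            group = [ins]        # start a new group at this instruction
    return tags

stream = [
    Instr("LOAD", frozenset({"R1"}), frozenset({"R9"}), "mem"),
    Instr("ADD", frozenset({"R3"}), frozenset({"R4", "R5"}), "alu"),
    Instr("SUB", frozenset({"R6"}), frozenset({"R3"}), "alu"),
    Instr("STORE", frozenset(), frozenset({"R6"}), "mem"),
]
print(compound(stream))  # LOAD and ADD are compounded; SUB and STORE stay scalar
```

Because the tags are computed once, ahead of execution, the fetch logic in this sketch would only need to read them rather than re-examine opcodes each time an instruction is fetched.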
When an instruction is fetched for execution, the instruction boundaries must be known in order to allow proper execution. However, where an instruction stream is pre-processed for purposes of creating compound instructions, the instruction boundaries are often not evident merely by examining a byte string. This is particularly true with architectures which allow variable length instructions. Further complications arise when the architecture allows data and instructions to be intermixed.
For example, in the IBM System/370 architecture, both of these difficulties make the pre-processing of an instruction stream to locate suitable scalar instruction groupings a very complex problem. First, the instructions have three possible lengths: two, four, or six bytes. Even though the actual length of a particular instruction is indicated in the first two bits of the opcode of the instruction, the beginning of an instruction in a string of bytes cannot be readily identified by mere inspection. Second, instructions and data can be intermixed. Accordingly, the existence or non-existence of a reference point in an instruction byte stream is of critical importance for this invention. A reference point is defined as the knowledge of where instructions begin or where instruction boundaries are. Unless additional information has been added to the instruction stream, instruction boundaries are usually known only at compile time or at execution time when the instructions are fetched by a CPU.
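The need for a reference point can be illustrated with a short sketch. The length rule follows the System/370 convention that the first two bits of the opcode give the instruction length (00 for two bytes, 01 or 10 for four bytes, 11 for six bytes). The function names and the sample byte string are hypothetical and serve only as an illustration.

```python
def instr_length(first_byte):
    """Instruction length in bytes, from the top two bits of the opcode."""
    top = first_byte >> 6
    if top == 0b00:
        return 2
    if top == 0b11:
        return 6
    return 4  # 0b01 or 0b10

def boundaries(data, start=0):
    """Offsets at which instructions begin, assuming `start` is a true boundary."""
    offsets = []
    pos = start
    while pos < len(data):
        offsets.append(pos)
        pos += instr_length(data[pos])
    return offsets

# Hypothetical stream: a 2-byte, a 4-byte, and two more 2-byte instructions.
data = bytes.fromhex("18125a30d20018341b56")

print(boundaries(data))           # parse from the known reference point
print(boundaries(data, start=4))  # parse from inside the 4-byte instruction
```

Starting at offset 0 yields four instructions, while starting at offset 4 treats the byte 0xD2, which is really part of an operand field, as the opcode of a single six-byte instruction. Both parses look valid when only the bytes are inspected, which is why the boundaries cannot be recovered without a reference point.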