Field of the Invention
The present invention generally relates to stored program digital computers and, more particularly, to a computer system capable of exploiting concurrent processing techniques when executing a sequential program.
Parallel processing has become increasingly popular in recent years as a means of achieving high performance in digital computers. Attention has been focused primarily on finding large and distinct instruction sequences in a user program which can be executed simultaneously while requiring minimal inter-sequence communication or synchronization. Most multiprocessor and parallel processor designs which have been proposed in the past can be differentiated by the choices that were made for these sequences and the communication and synchronization facilities that were provided. As this parallel processing approach is pushed further, the advantages of adding extra processors to a high performance parallel processor diminish because of the increased communication and synchronization requirements between the concurrently executing instruction sequences.
In general, the execution of an instruction can be partitioned into two separate actions. The first action is to select the instruction and dispatch it to an ALU for execution. The second action is to actually execute the selected instruction. According to established conventions, an instruction should not be selected for execution until all instructions that either provide an intermediate result used by this instruction as an input operand or that reference or update a value which is also updated by this instruction, have completed execution. In early processor designs, these constraints (also known as data dependency constraints) were met by issuing an instruction only when all preceding instructions had completed execution. This approach is restrictive because several types of instructions, such as floating-point operations, may be executed over several cycles by a dedicated floating-point processor. Delaying the issuance of a non-floating-point instruction until after the floating-point instruction has completed may leave portions of the processor needlessly idle.
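By way of illustration only, the data dependency constraints described above can be sketched as a test applied to a pair of instructions. The names Instr and must_wait_for, and the register names in the example, are hypothetical and are not taken from any of the designs discussed here; each instruction is assumed to update a single value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one updated value, any number of inputs."""
    name: str
    dest: str
    srcs: tuple = ()


def must_wait_for(later, earlier):
    """Return True if `later` may not be selected until `earlier` has completed."""
    # `earlier` provides an intermediate result that `later` uses as an input operand.
    uses_result = earlier.dest in later.srcs
    # `earlier` references a value which `later` updates.
    overwrites_input = later.dest in earlier.srcs
    # `earlier` updates a value which `later` also updates.
    overwrites_output = later.dest == earlier.dest
    return uses_result or overwrites_input or overwrites_output


fmul = Instr("fmul", "f1", ("f2", "f3"))   # multi-cycle floating-point operation
add = Instr("add", "r1", ("r2", "r3"))     # unrelated non-floating-point operation
fadd = Instr("fadd", "f4", ("f1", "f5"))   # consumes the result of fmul

print(must_wait_for(add, fmul))    # False
print(must_wait_for(fadd, fmul))   # True
```

Under these assumptions, only the fadd instruction is constrained by the floating-point multiply; holding back the add instruction as well corresponds to the restrictive early approach.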
In current processor designs which use pipelined Execution Units, an instruction can be dispatched before the preceding instructions have completed execution. If there is a data dependency between such an instruction and another instruction being processed in the execution pipeline, hardware interlock mechanisms are used to block the actual execution of this dispatched instruction until the data dependencies have been resolved. In this instance, however, instructions that follow this blocked instruction are also blocked even though they may not be dependent on any of the currently executing or blocked instructions. Because of the above mentioned blocking phenomenon, the instruction dispatch rate in pipelined processors is often less than one instruction per machine cycle. Moreover, the complexity of the hardware interlock mechanism interferes with the extension of this design to a design that allows the dispatch of several instructions concurrently.
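The blocking phenomenon can be illustrated with a small timing model. This is a simplified sketch under stated assumptions, not a description of any particular interlock mechanism: at most one instruction begins execution per machine cycle, instructions begin in program order, and only dependencies on input operands are modeled. The function name and the latencies used are hypothetical.

```python
def in_order_start_cycles(program):
    """Return the cycle in which each instruction begins execution.

    `program` is a list of (dest, srcs, latency) tuples in program order.
    """
    ready_at = {}    # value name -> cycle at which the value becomes available
    starts = []
    next_slot = 0    # earliest cycle in which the next instruction may begin
    for dest, srcs, latency in program:
        operands_ready = max((ready_at.get(s, 0) for s in srcs), default=0)
        # The interlock holds the instruction until its operands are available;
        # every later instruction must also wait behind it.
        start = max(next_slot, operands_ready)
        starts.append(start)
        ready_at[dest] = start + latency
        next_slot = start + 1
    return starts


program = [
    ("f1", ("f2", "f3"), 4),   # long-latency operation
    ("f4", ("f1",), 4),        # depends on f1, so it is blocked
    ("r1", ("r2", "r3"), 1),   # independent, yet it waits behind the blocked instruction
]
print(in_order_start_cycles(program))   # [0, 4, 5]
```

In this example three instructions begin over six cycles, so the dispatch rate falls below one instruction per machine cycle even though the third instruction has no dependency on the other two.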
In Very Long Instruction Word architectures, several instructions (up to a fixed maximum) may be dispatched in each machine cycle. In these machines, compile time analysis is used to combine successive instructions into groups. All instructions in a group may be dispatched simultaneously. For this type of grouping to work properly, a result produced by an instruction in a group may not be used by subsequent instructions in the same group. This restriction limits the number of instructions that can be dispatched simultaneously. Another consideration which may limit the number of instructions that can be dispatched simultaneously is conditional branch instructions. These instructions may change the sequence of instructions executed by the processor based on a logical condition. The instructions following a conditional branch should not be executed until the conditional branch has been evaluated.
In addition to the considerations set forth above, one group of instructions to be executed on a Very Long Instruction Word machine should not be dispatched until all the instructions in the previous group have completed execution. This restriction exists because of data dependency constraints and because the pipelining of an entire instruction group may require prohibitively expensive processing hardware.
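The grouping restriction for Very Long Instruction Word machines can be sketched as a greedy compile-time pass over successive instructions. This is an illustrative assumption about how such a pass might look, not the method of any specific compiler; the name pack_groups is hypothetical, and conditional branches and other constraints are deliberately not modeled.

```python
def pack_groups(program, width):
    """Combine successive (dest, srcs) instructions into groups of at most `width`.

    A group is closed when it is full or when the next instruction would use
    a result produced by an instruction already in the group.
    """
    groups = []
    current = []
    produced = set()   # results produced within the current group
    for dest, srcs in program:
        uses_group_result = any(s in produced for s in srcs)
        if current and (len(current) == width or uses_group_result):
            groups.append(current)
            current, produced = [], set()
        current.append((dest, srcs))
        produced.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("a", ("x", "y")),
    ("b", ("x", "z")),
    ("c", ("a", "b")),   # uses a and b, so it must start a new group
    ("d", ("c",)),       # uses c, so it must start another group
]
print([len(g) for g in pack_groups(program, width=4)])   # [2, 1, 1]
```

Although the machine in this example could dispatch four instructions per cycle, the dependencies between successive instructions leave most of that width unused.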
A third type of processor design is the data flow computer. In a computer of this type, instructions which manipulate data are allowed to execute concurrently. An instruction cannot be executed, however, until all of its input operands are available. Since the output operand of one instruction is an input operand for a subsequent instruction, sequencing of the instructions is automatically controlled. This type of computer is inefficient in handling control flow instructions, such as conditional branch operations, which generate unstructured control flow graphs. For these machines, the most effective way to handle such control flow instructions is to switch to a serial processing mode until the control flow problem has been resolved.
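The operand-availability rule of a data flow computer can be illustrated as follows. The sketch assumes, purely for illustration, that every instruction whose input operands are available executes in the same step and that its output operand becomes available for the next step; the name dataflow_schedule and the instruction labels are hypothetical, and control flow instructions are not modeled.

```python
def dataflow_schedule(program, initial):
    """Return, step by step, the names of the instructions that execute together.

    `program` maps an instruction name to (dest, srcs); `initial` holds the
    values that are available before any instruction executes.
    """
    available = set(initial)
    pending = dict(program)
    steps = []
    while pending:
        ready = sorted(
            name for name, (dest, srcs) in pending.items()
            if all(s in available for s in srcs)
        )
        if not ready:
            raise ValueError("no pending instruction has all of its input operands")
        steps.append(ready)
        for name in ready:
            dest, _ = pending.pop(name)
            available.add(dest)   # the output operand enables later instructions
    return steps


program = {
    "i1": ("a", ("x", "y")),
    "i2": ("b", ("x",)),
    "i3": ("c", ("a", "b")),   # waits for the outputs of i1 and i2
}
print(dataflow_schedule(program, {"x", "y"}))   # [['i1', 'i2'], ['i3']]
```

No explicit ordering is given for the three instructions; the sequencing of i3 after i1 and i2 follows solely from the availability of its input operands.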
A paper by W. Hwu et al., entitled "HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality", Proc. 13th Annual International Symp. on Computer Architecture, 1986, pp. 297-306, relates to a system in which both control information and data to be manipulated are stored in a single memory. A single instruction decoder processes control flow instructions, using branch prediction, and generates data flow instructions for each data manipulation instruction it encounters. These instructions are merged into a data flow graph which includes existing data flow instructions in a centralized node table. The instructions in the node table are awaiting execution by a group of parallel data driven processors. When the input operands of an instruction in the node table are available, the instruction is selected for execution by one of the data flow processors.
U.S. Pat. No. 4,476,525 to Ishii concerns a pipeline-controlled data processing system in which instructions are fetched and decoded prior to execution. As instructions are decoded and evaluated, arithmetic operations and memory storage instructions are combined for simultaneous execution, thus decreasing total execution time.
U.S. Pat. No. 4,295,193 to Pomerene, assigned to the assignee of the present invention, relates to a processor that is designed to simultaneously execute two or more instructions. The instructions to be executed are divided into groups having, at most, N instructions each. This may be done, for example, during compilation. Each group may have only a predetermined number of data accesses (less than the number of accesses used to execute N instructions), and furthermore, each data access is to a different data value. Each instruction in a group uses separate instruction execution hardware.
U.S. Pat. No. 3,573,854 to Watson et al. relates to a pipelined architecture in which operands are fetched from memory before an arithmetic expression which uses the operands is evaluated by the ALU. Because the operands are prefetched, time spent by the ALU waiting for operands to be retrieved from memory is greatly reduced. This prefetching feature is useful for evaluating branch conditions before they are encountered to increase program execution speed.