The present invention relates generally to methods of synchronizing multiprocessors having parallel processors for execution of parallel related instruction streams and also to compiling methods for generating the streams. In its particular aspects, the present invention relates to techniques for substantially minimizing cross-stream event or data dependencies and to the enforcement of event or data dependencies between processors utilizing channels for communicating between the processors in a synchronizing fashion.
The increase in execution speed over that of a uniprocessor by utilizing parallel processors to cooperatively perform sequential programs depends on the effective exploitation of the program's fine-grained parallelism to enable parallel operations to be performed contemporaneously on different processors. While loop-level parallelism is generally exploited effectively in commercially available multiprocessor systems, the extra-loop (or non-loop) parallelism present in the sequential parts of the program is more difficult to exploit effectively.
The Very Long Instruction Word (VLIW) family of architectures can exploit the fine-grained parallelism present in the sequential parts of a program. Known machines of this type utilize a compiler based upon trace scheduling (see, for example, J. A. Fisher, "TRACE SCHEDULING: A TECHNIQUE FOR GLOBAL MICROCODE COMPACTION", IEEE Trans. on Computers, vol. C-30, no. 7, pp. 478-490, July 1981) to detect and schedule extra-loop parallelism in sequential parts of the program. Such a compiler also exploits loop-level parallelism by unrolling the loops, thereby converting loop-level parallelism into extra-loop parallelism.
However, a VLIW machine consists of multiple processors that operate in lockstep, executing instructions fetched from a single instruction stream. The long instruction word allows several fine-grained operations to be initiated in each instruction, so that operations can be scheduled for parallel execution by different processors. While the lockstep operation of the processors implicitly guarantees that the processors are synchronized, the speed of a VLIW machine is severely compromised by run-time events which are unpredictable at compile-time. For example, memory bank access conflicts cannot always be avoided, as the operands required for an operation may not be known at compile-time. Such run-time events can delay completion of one of the operations in a long instruction, which in turn delays completion of the entire instruction.
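The cost of lockstep operation can be illustrated with a small timing model. The following is only a sketch under stated assumptions: the function names, the three-processor schedule, and the latency figures (1 cycle normally, 4 cycles for an operation that meets a bank conflict) are hypothetical and are not taken from any particular machine.

```python
from typing import List


def lockstep_cycles(latencies: List[List[int]]) -> int:
    """Cycles needed when every long instruction waits for its slowest operation.

    latencies[i][p] is the (hypothetical) number of cycles taken by the
    operation that processor p executes in long instruction i.
    """
    return sum(max(instruction) for instruction in latencies)


def independent_cycles(latencies: List[List[int]]) -> int:
    """Lower bound if each processor ran its column as a separate stream.

    This ignores any cross-stream dependencies, so it is a bound only,
    not a prediction for a real multiple-instruction-stream machine.
    """
    num_processors = len(latencies[0])
    return max(
        sum(instruction[p] for instruction in latencies)
        for p in range(num_processors)
    )


# Assumed schedule: in each long instruction a different processor
# suffers a bank conflict (4 cycles instead of 1).
latencies = [
    [4, 1, 1],
    [1, 4, 1],
    [1, 1, 4],
]

print(lockstep_cycles(latencies))     # 12: every instruction takes 4 cycles
print(independent_cycles(latencies))  # 6: each stream absorbs only its own delay
```

In this model each delay is charged to all processors under lockstep, whereas with separate streams a delay is charged only to the processor that incurs it, unless another stream depends on the delayed result.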
The extension of VLIW architecture to multiple instruction stream architecture requires means to synchronize the parallel processors to assure that data developed by one processor and needed by another has already been written by the one processor to shared storage means when the other processor attempts to read the storage means. The use of shared storage means for enabling data to be passed between processors also requires additional synchronization to assure that the data in a memory location has been read by all processors needing it before other data is written thereto.
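The two requirements just stated (a value must be written before it is read, and it must be read by every processor needing it before it is overwritten) can be sketched in software. The example below is an illustration only, using Python threads in place of processors; the class `SharedCell`, its methods, and `run_demo` are hypothetical names, and the condition-variable mechanism is an assumption of this sketch, not the means described in this document.

```python
import threading


class SharedCell:
    """One shared location guarded by both synchronization requirements."""

    def __init__(self, num_readers: int):
        self._cond = threading.Condition()
        self._num_readers = num_readers
        self._value = None
        self._version = 0                # number of writes performed so far
        self._pending = 0                # readers yet to read the current value
        self._seen = [0] * num_readers   # last version consumed by each reader

    def write(self, value):
        with self._cond:
            # Requirement 2: do not overwrite until all readers have read.
            self._cond.wait_for(lambda: self._pending == 0)
            self._value = value
            self._version += 1
            self._pending = self._num_readers
            self._cond.notify_all()

    def read(self, reader_id: int):
        with self._cond:
            # Requirement 1: do not read until a new value has been written.
            self._cond.wait_for(lambda: self._seen[reader_id] < self._version)
            self._seen[reader_id] = self._version
            self._pending -= 1
            value = self._value
            self._cond.notify_all()
            return value


def run_demo(values, num_readers: int = 2):
    """One writer passes each value to every reader through a single cell."""
    cell = SharedCell(num_readers)
    received = [[] for _ in range(num_readers)]

    def reader(reader_id: int):
        for _ in values:
            received[reader_id].append(cell.read(reader_id))

    threads = [
        threading.Thread(target=reader, args=(reader_id,))
        for reader_id in range(num_readers)
    ]
    for thread in threads:
        thread.start()
    for value in values:
        cell.write(value)
    for thread in threads:
        thread.join()
    return received


print(run_demo([10, 20, 30]))  # each reader receives every value once, in order
```

Without the wait in `write`, a fast writer could replace a value before a slow reader had consumed it; without the wait in `read`, a reader could consume a stale or unwritten value.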
One method of synchronizing processors is by the provision of barriers in the instruction streams, as explained in my co-pending application Ser. No. 227,276, filed Aug. 2, 1988, to assure the temporal order of cross-processor events such as the writing to and reading from shared memory. In the present invention, parallel processors are synchronized in a different way, namely, by communicating cross-stream dependent data between processors in a synchronizing fashion.
The HEP multiprocessor (see B. J. Smith, "ARCHITECTURE AND APPLICATIONS OF THE HEP MULTIPROCESSOR COMPUTER SYSTEM", REAL-TIME SIGNAL PROCESSING, vol. 298, pp. 241-248, August, 1981 and J. S. Kowalik, Editor, "PARALLEL MIMD COMPUTATION: HEP SUPERCOMPUTER AND ITS APPLICATIONS", MIT Press, 1985) implements a large number of channels capable of synchronizing communication of data between processors by adding a synchronization bit to every location in shared memory and in the register set. Control bits in each instruction indicate whether a read operation is unconditional or must wait until the location is "full". However, in the HEP multiprocessor, the synchronization bit does not generally cause a processor to stall if the bit is not in the proper state. Rather, the process stalls, leaving its program counter and the instruction unchanged. Execution shifts to another process and the unexecuted instruction is reattempted only when its process makes the next trip through the pipeline. Meanwhile, instructions from other streams are issued to keep the pipeline full. Consequently, the HEP approach is not particularly useful unless the number of streams exceeds the number of processors in the system. Also, the channels implemented in shared memory, though potentially infinite in number, are not useful for communicating processor synchronizing events or data because of their relative slowness compared to channels implemented in registers. On the other hand, channels implemented in registers, though useful for communicating synchronizing events, are inherently limited in number so that they can be addressed in typical instructions referring to one or more registers or functions. Consequently, the number of cross-processor dependent events or data requiring synchronization as a result of the application of known compilation techniques for VLIW architecture for typical sequential programs may exceed the number of useful channels available to enforce such dependencies.
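The reattempt behavior described above for the HEP multiprocessor can be sketched as a small simulation. This is a simplified illustration and not a model of the actual HEP hardware: the instruction names `produce`, `consume`, and `nop`, the function `run_pipeline`, and the rule that a write waits for an "empty" location are assumptions of this sketch. Processes take turns in round-robin order; an instruction that finds the synchronization bit in the wrong state leaves the program counter unchanged and is retried on that process's next trip.

```python
class Location:
    """A memory location with a full/empty synchronization bit."""

    def __init__(self):
        self.full = False
        self.value = None


def run_pipeline(programs, max_trips=100):
    """Issue one instruction per process per trip, reattempting on failure.

    programs: one instruction list per process. Hypothetical instructions:
      ("produce", name, value)  write when empty, then mark full
      ("consume", name)         read when full, then mark empty
      ("nop",)                  always completes
    Returns (values consumed by each process, number of reattempts).
    """
    memory = {}
    counters = [0] * len(programs)       # one program counter per process
    consumed = [[] for _ in programs]
    reattempts = 0

    for _ in range(max_trips):
        if all(pc == len(prog) for pc, prog in zip(counters, programs)):
            break
        for pid, prog in enumerate(programs):
            if counters[pid] == len(prog):
                continue                 # this process has finished
            instruction = prog[counters[pid]]
            op = instruction[0]
            if op == "produce":
                loc = memory.setdefault(instruction[1], Location())
                if loc.full:
                    reattempts += 1      # counter unchanged; retry next trip
                    continue
                loc.value, loc.full = instruction[2], True
            elif op == "consume":
                loc = memory.setdefault(instruction[1], Location())
                if not loc.full:
                    reattempts += 1      # counter unchanged; retry next trip
                    continue
                consumed[pid].append(loc.value)
                loc.full = False
            counters[pid] += 1           # instruction completed

    return consumed, reattempts


programs = [
    [("consume", "x"), ("consume", "x")],                  # process 0
    [("nop",), ("produce", "x", 1), ("produce", "x", 2)],  # process 1
]
consumed, reattempts = run_pipeline(programs)
print(consumed)    # [[1, 2], []]
print(reattempts)  # 2
```

In this run process 0 attempts its first `consume` twice before the value arrives. Those issue slots are wasted unless other streams have instructions ready to fill them, which is consistent with the observation that the approach is of limited use unless streams outnumber processors.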