1. Field of the Invention
This invention relates to data processing and, more particularly, to architectural improvements in multiple instruction stream, multiple data stream computers for directly and efficiently executing highly parallel user programs or a plurality of user programs.
2. Background of Related Art
The current state of the art in high-performance parallel processing is generally limited to supercomputers and array processors optimized to execute single instruction stream, multiple data stream (SIMD) vectorized programs. These computers are further optimized to execute code vectorized to specific lengths and are not designed to execute multiple instruction stream, multiple data stream (MIMD) programs.
Multiple instruction stream, multiple data stream computers have been proposed and in a few cases implemented. U.S. Pat. Nos. 4,153,932 and 4,145,733 describe data flow computers. Practical implementations of data flow computers are difficult to achieve, in part, because of the difficulty in designing a computer organization which efficiently executes a data flow language.
Multiple instruction stream, multiple data stream computers have been implemented, an example being Denelcor's heterogeneous element processor, an embodiment of which is described in U.S. Pat. No. 4,229,790 to Gilliand et al. The heterogeneous element processor is an interconnected multi-processor computer in which each processor utilizes pipelined control and function units. Penalties due to precedence constraints are reduced by switching instruction context among parallel processes at the pipeline cycle rate. The heterogeneous element processor architecture is neither sufficiently cost-efficient nor sufficiently extensible for many applications, largely because of a complex processor organization and low function unit utilization. The use of multiple data streams, as proposed by Gilliand et al., connected to the processor pipeline via task snapshot registers, still results in processing speeds which are limited by the availability of the resources of a single processor.
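The effect of switching instruction context at the pipeline cycle rate can be illustrated with a minimal sketch. The model below is an assumption made for illustration only and is not taken from the cited patent: each instruction is taken to depend on the previous instruction of the same process, its result is taken to become available `pipeline_depth` cycles after issue, and the function name `simulate_utilization` is hypothetical.

```python
def simulate_utilization(num_processes, pipeline_depth, cycles):
    """Return the fraction of cycles in which an instruction is issued.

    Assumed model (illustrative only): every instruction depends on the
    previous instruction of the same process, and a result is ready
    pipeline_depth cycles after the instruction issues.
    """
    ready_at = [0] * num_processes  # earliest cycle each process may issue again
    issued = 0
    next_process = 0                # round-robin pointer

    for cycle in range(cycles):
        # Scan the processes round-robin and issue from the first one
        # whose precedence constraint has been satisfied.
        for offset in range(num_processes):
            p = (next_process + offset) % num_processes
            if ready_at[p] <= cycle:
                ready_at[p] = cycle + pipeline_depth
                issued += 1
                next_process = (p + 1) % num_processes
                break
        # If no process is ready, the issue slot goes unused this cycle.

    return issued / cycles


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n, simulate_utilization(n, pipeline_depth=8, cycles=800))
```

Under this model a single process issues in one cycle out of eight, whereas eight interleaved processes keep an eight-stage pipeline full. Even so, the throughput is bounded by the one pipeline, which is the single-processor resource limit noted above.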
Other, earlier multiprocessor computers involved two types of implementation. The first type uses separate processing units, one for each data stream. The second type uses one central processing unit which is, in effect, multiplexed among the several data streams. The use of separate processing units is costly and results in a single instruction stream, single data stream architecture which is subject to precedence problems. In the second type of implementation, the central processing unit may be multiplexed among the several data streams in a way that reduces precedence constraints.
Still other attempts to improve processing speeds and to minimize contention problems are described in "Parallel Processor Systems, Technologies, and Applications," Symposium, June 25-27, 1969, Chapter 13 entitled "A Multiple Instruction Stream Processor with Shared Resources," M. J. Flynn, A. Podvin and K. Shimizu. Here, a parallel computer system organization is described using individual computers, each of which contains its own data and control registers but lacks the more substantial execution facilities which, in turn, are shared by all machines. Sharing is accomplished by closely synchronized time-phased switching. Heavy pipelining of the execution resources is used in an effort to provide maximum operational bandwidths. The pipelining factor for each of the execution functions of the execution resources is necessarily closely related to the synchronizing factor of the individual computers. The system based upon this organizational concept is claimed to avoid many of the contention problems associated with shared resource systems.
In the paper by Flynn et al., the individual computers are described as processors, each of which is responsible for fetching its own operands and preparing its own instructions. A processor does not execute its instruction, but rather requests the execution resource or unit to do so. The execution unit is shared by 4 time-phased arrays, or rings, of processors, each ring containing 8 processors. The arrangement requires close time synchronization of the processors, and no two processors within a ring are in the same phase of instruction preparation or execution. Since individual processors from different rings can contend for execution resources at a particular time slot, it is necessary that the contention be time overlapped with operand fetch and so forth. When two or more processors request a resource which can accept only one operation, a priority system is used to resolve the conflict.
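The time-phased ring arrangement can be sketched as follows. The ring and processor counts come from the description above; everything else is an assumption made for illustration. In particular, the phase assignment, the fixed lowest-ring-first priority rule, and the names `requesters` and `arbitrate` are hypothetical, since the text does not state how phases are assigned or how priority is determined.

```python
NUM_RINGS = 4            # from the text: four time-phased rings
PROCESSORS_PER_RING = 8  # from the text: eight processors per ring


def requesters(time_slot, request_phase=0):
    """Return the (ring, processor) pairs at the request phase in this slot.

    Assumption: processor p of each ring is in phase (time_slot - p) mod 8,
    so no two processors within a ring are ever in the same phase.
    """
    result = []
    for ring in range(NUM_RINGS):
        for proc in range(PROCESSORS_PER_RING):
            phase = (time_slot - proc) % PROCESSORS_PER_RING
            if phase == request_phase:
                result.append((ring, proc))
    return result


def arbitrate(requests, wants_unit):
    """Grant a unit that accepts one operation to a single requester.

    wants_unit is a hypothetical set of (ring, processor) pairs that need
    this unit in the current slot. Assumption: fixed priority, with the
    lowest ring index winning; the losers must retry later.
    """
    contenders = sorted(r for r in requests if r in wants_unit)
    if not contenders:
        return None, []
    return contenders[0], contenders[1:]


if __name__ == "__main__":
    slot = 3
    reqs = requesters(slot)
    winner, losers = arbitrate(reqs, wants_unit={(1, 3), (2, 3)})
    print("requesting:", reqs)
    print("granted:", winner, "deferred:", losers)
```

In this sketch exactly one processor per ring reaches the request phase in any time slot, so contention can arise only between rings, and at most four requests need to be resolved per slot.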