A parallel computer conventionally includes a large number of processors together with information-transfer channels to exchange information among the processors, and to pass information between the processors and input/output (I/O) ports of the computer. Each processor of such a computer has associated with it dedicated local storage which the processor can access in response to instructions executed by the processor.
In a parallel computer of the dedicated local storage type, accessing the dedicated local storage associated with a given processor involves execution of a local instruction by the given processor. Execution of such an instruction can delay the running of a program by the given processor. In addition, other information resources, such as a host processor, also require access to local storage, which can likewise delay execution by a given processor. Such delays to accommodate access to the dedicated local storage can represent a significant computational burden in a parallel computer and tend to reduce its computing efficiency. In addition, synchronization of the running of programs by the various processors is often a critical requirement for achieving computing efficiency in the machine. Programming the machine to achieve the desired synchronization between the various processors, while accommodating access to the dedicated local storage, is in general a time-consuming and tedious task. In some cases, this is practically impossible, since the time to run the programs on the various processors can depend upon the data being processed, which may be unknown at the time the computer is programmed.
The problem of efficiently accessing dedicated local storage by a plurality of parallel processors has been addressed in certain prior art patents.
For example, U.S. Pat. No. 4,837,676 describes a computer architecture which attempts to achieve highly parallel execution of programs in instruction flow form. In this patent, individual units, such as process control units, programmable function units, and memory units, are coupled together by an interconnection network as self-contained units. Each process control unit initiates its assigned processes in sequence, routing instruction packets for each process through the computer network and an address network in order to provide time sharing of a single communications highway by multiple instruction packet streams.
Similarly, U.S. Pat. No. 4,344,134 describes a parallel processing array in which each processor issues a ready signal to signify that it is ready to begin a parallel processing task, and initiates the task upon receipt of an initiate signal. Parallel processing is enhanced by partitioning processing functions into plural process sub-arrays via a control node tree, the node tree having a plurality of control nodes connected to the plurality of processors.
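By way of illustration only, the ready-signal handshake described above can be sketched in Python as follows. This is a simplified model and not the circuit of the cited patent: the function name `sub_arrays_to_start`, the dictionary representation of ready signals, and the flat (single-level) grouping of processors into sub-arrays in place of a control node tree are all assumptions made for the example.

```python
def sub_arrays_to_start(ready, sub_arrays):
    """Return the names of sub-arrays in which every processor is ready.

    ready:      dict mapping processor id -> bool (hypothetical ready signal)
    sub_arrays: dict mapping sub-array name -> list of processor ids
                (hypothetical partitioning; a real control node tree
                would combine the signals hierarchically)
    """
    # A sub-array may be started only when all of its members are ready.
    return [
        name
        for name, members in sub_arrays.items()
        if all(ready[p] for p in members)
    ]


# Example partition: processors 0-1 form sub-array "A", 2-3 form "B".
ready_signals = {0: True, 1: True, 2: False, 3: True}
partition = {"A": [0, 1], "B": [2, 3]}
started = sub_arrays_to_start(ready_signals, partition)
```

In this example only sub-array "A" would be started, since processor 2 in sub-array "B" has not yet raised its ready signal.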
U.S. Pat. No. 5,121,502 describes a computer which includes a processing unit, an instruction unit, and means for communicating instructions from the instruction unit to the processing unit, wherein the processing unit includes a plurality of processors. The instruction unit includes a plurality of storage locations, and means are provided which include a first connection circuit for providing a plurality of parallel connection circuits between the storage locations and the processors, and a second connection circuit for providing a single serial communication channel between the storage locations and the processors. The system further includes a control circuit for selecting between the first and second connection circuits, where the first connection circuit is selected when a multi-operation instruction is to be executed, and the second connection circuit is selected if such an instruction is not present.
Similar systems are shown in U.S. Pat. Nos. 4,965,718 and 5,113,523.
Although the foregoing patents have attempted to increase the operating efficiency of massively parallel computing machines, they have not accomplished the increased efficiency anticipated by the instant invention. More particularly, the instant invention provides predetermined additional hardware to monitor the memory ports of a plurality of processors to determine instruction cycles during which the memory ports are inactive, and to use those cycles to store asynchronously arriving data into local memory. In so doing, the instant invention provides additional efficiency in a parallel computer by providing: (1) an arbitration method for selecting an active data source; (2) an asynchronous and autonomous method for transferring data between a data source and memory while a program is being simultaneously executed; (3) conditional means of data transfer; and (4) selective hardware to transfer MIMD programs from local memory to on-chip instruction.
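By way of illustration only, the use of inactive memory-port cycles to store asynchronously arriving data can be modeled in software as sketched below. This is a minimal simulation under stated assumptions and not the hardware of the instant invention: the function name `run_with_idle_cycle_stores`, the representation of the program as a list of port-busy flags, the first-in first-out ordering of pending words, and the limit of one store per idle cycle are all hypothetical choices made for the example. Arbitration among multiple data sources and conditional transfer are not modeled.

```python
from collections import deque


def run_with_idle_cycle_stores(port_busy_per_cycle, arrivals):
    """Model one processor's memory port over a sequence of cycles.

    port_busy_per_cycle: list of bools, True where the executing program
                         uses the memory port in that cycle (assumed input)
    arrivals:            dict mapping cycle number -> list of
                         (address, value) words arriving in that cycle
    Returns (memory, leftover), where leftover holds words that never
    found an idle cycle.
    """
    memory = {}
    pending = deque()  # buffered words awaiting an idle port cycle
    for cycle, port_busy in enumerate(port_busy_per_cycle):
        # Data arrives asynchronously with respect to the program.
        pending.extend(arrivals.get(cycle, []))
        # The program always has priority; a store is performed only
        # when the program leaves the port idle, so it is never stalled.
        if not port_busy and pending:
            address, value = pending.popleft()
            memory[address] = value
    return memory, list(pending)


# The program uses the port in cycles 0 and 2; cycles 1, 3, 4 are idle.
busy = [True, False, True, False, False]
incoming = {0: [(10, "a")], 1: [(11, "b")], 2: [(12, "c")]}
memory, leftover = run_with_idle_cycle_stores(busy, incoming)
```

In this example the three arriving words are written in idle cycles 1, 3, and 4, and the program's own accesses in cycles 0 and 2 proceed without delay.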