A computer system generally includes one or more processors, a memory and an input/output system. The memory stores data and instructions for processing the data. The processor(s) process the data in accordance with the instructions, and store the processed data in the memory. The input/output system facilitates loading of data and instructions into the system, and obtaining processed data from the system.
Most modern computer systems have been designed around a "von Neumann" paradigm, under which each processor has a program counter that identifies the location in the memory which contains its (that is, the processor's) next instruction. During execution of an instruction, the processor increments the program counter to identify the location of the next instruction to be processed. Processors in such a system may share data and instructions; however, to avoid interfering with each other in an undesirable manner, such systems are typically configured so that the processors process separate instruction streams, that is, separate series of instructions, and sometimes complex procedures are provided to ensure that processors' access to the data is orderly. Instruction sequences may also be shared among processors, which may require similar procedures to regulate use among the processors.
In von Neumann machines, instructions in one instruction stream are used to process data in a single data stream. Such machines are typically referred to as SISD (single instruction/single data) machines if they have one processor, or MIMD (multiple instruction/multiple data) machines if they have multiple processors. In a number of types of computations, such as processing of arrays of data, the same instruction stream may be used to process data in a number of data streams. For these computations, SISD machines would iteratively perform the same operation or series of operations on the data in each data stream. Recently, single instruction/multiple data (SIMD) machines have been developed which process the data in all of the data streams in parallel. Since SIMD machines process all of the data streams in parallel, such computations can be performed much more quickly than on SISD machines, and generally at lower cost than on MIMD machines providing the same degree of parallelism.
The aforementioned Hillis patents and Hillis, et al., patent application disclose an SIMD machine which includes a host computer, a sequencer and an array of processing elements, each including a bit-serial processor and a memory. The host computer, inter alia, generates commands which are transmitted to the sequencer. In response to a command, the sequencer transmits one or more SIMD instructions to the array and global router. In response to the SIMD instructions, the processing elements perform the same operation in connection with data stored in their respective memories.
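The control flow just described can be pictured with a minimal sketch. It is illustrative only and is not taken from the Hillis patents: the names ProcessingElement and broadcast, and the (operation, source, source, destination) instruction format, are assumptions made for the example, and the loop is a sequential stand-in for processing elements that would operate simultaneously.

```python
class ProcessingElement:
    """Hypothetical processing element: a processor with its own private memory."""

    def __init__(self, memory):
        self.memory = list(memory)

    def execute(self, op, src_a, src_b, dst):
        # Apply the broadcast operation to this element's own data only.
        self.memory[dst] = op(self.memory[src_a], self.memory[src_b])


def broadcast(array, op, src_a, src_b, dst):
    """Stand-in for the sequencer: send one instruction to every element.

    The loop runs sequentially here; in a SIMD array all elements would
    perform the operation in the same step.
    """
    for element in array:
        element.execute(op, src_a, src_b, dst)


if __name__ == "__main__":
    array = [ProcessingElement([1, 2, 0]), ProcessingElement([10, 20, 0])]
    # One instruction, many data streams: add locations 0 and 1 into location 2.
    broadcast(array, lambda a, b: a + b, 0, 1, 2)
    print([element.memory for element in array])
```

Every element receives the same instruction and the same memory offsets; only the contents of the memories differ.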
The array disclosed in the Hillis patents and Hillis, et al., patent application also includes two communications mechanisms which facilitate transfer of data among the processing elements. In one mechanism, the processing elements are interconnected in a two-dimensional mesh which enables each processing element to selectively transmit data to one of its nearest-neighbor processing elements. In this mechanism, termed "NEWS" (for the North, East, West, and South directions in which a processing element may transmit data), the sequencer enables all of the processing elements to transmit bit-serial data to, and to receive bit-serial data from, the selected neighbors in unison.
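A NEWS transfer can be modeled as a uniform shift of a two-dimensional grid. The sketch below is an illustration under stated assumptions, not the disclosed hardware: it moves whole values rather than bit-serial data, it assumes each element sends in the selected direction and receives from the opposite neighbor, and it assumes wraparound at the edges of the mesh, which the text above does not specify. The function name news_shift is hypothetical.

```python
def news_shift(grid, direction):
    """Return a new grid in which every element's value has moved one step
    in `direction` ("N", "E", "W" or "S"), all elements moving in unison."""
    offsets = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
    d_row, d_col = offsets[direction]
    rows, cols = len(grid), len(grid[0])
    result = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Edge wraparound (modulo) is an assumption of this sketch.
            result[(r + d_row) % rows][(c + d_col) % cols] = grid[r][c]
    return result


if __name__ == "__main__":
    grid = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]
    for row in news_shift(grid, "E"):
        print(row)
```

Because every element uses the same direction, no addresses are needed; the direction alone determines each sender and receiver.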
The second mechanism is a global router, comprising a plurality of router nodes interconnected by communications links in the form of an N-dimensional hypercube. Each router node is connected to one or more of the processing elements. The global router transmits data in the form of messages provided by the processing elements. In one form of communication, each message contains an address that identifies the processing element that is to receive the message. The sequencer enables the processing elements to transmit messages, in bit serial format, from particular source locations in their respective memories to the router nodes. Each router node, also under control of the sequencer, upon receipt of a message, examines the address and determines therefrom whether the destination of the message is a processing element connected thereto, or a processing element connected to another router node. If the message is intended for a processing element connected to the router node, it delivers it to the processing element. If not, the router node determines from the address an appropriate communications link connected thereto over which it can transmit the message to a router node closer to the destination.
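The addressed routing decision can be sketched as follows. The sketch makes several assumptions that are not stated above: router nodes are labeled with N-bit integers so that linked nodes differ in exactly one bit, the message address is reduced to the destination router node's label, and a node always forwards along the lowest-numbered dimension in which its label differs from the destination. Contention for links is ignored. The names next_hop and route are hypothetical.

```python
def next_hop(node, dest):
    """Return the neighboring node to forward to, or None to deliver locally."""
    diff = node ^ dest            # set bits mark dimensions still to be crossed
    if diff == 0:
        return None               # destination is attached to this router node
    lowest = diff & -diff         # lowest differing dimension (arbitrary choice)
    return node ^ lowest          # neighbor across that dimension's link


def route(source, dest):
    """List the router nodes a message visits from source to dest."""
    path = [source]
    node = source
    while True:
        hop = next_hop(node, dest)
        if hop is None:
            return path
        node = hop
        path.append(node)


if __name__ == "__main__":
    # Three-dimensional hypercube: node 0b000 sends to node 0b101.
    print([format(n, "03b") for n in route(0b000, 0b101)])
```

Each hop clears one differing bit, so under these assumptions the number of hops equals the number of bit positions in which the source and destination labels differ.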
The global router can also transfer messages between router nodes without the use of addresses. This can permit the global router to emulate a mesh interconnection pattern of any selected number of dimensions, as described in U.S. patent application Ser. No. 07/042,761, filed Apr. 27, 1987, by W. Daniel Hillis, et al., and entitled "Method And Apparatus For Simulating M-Dimensional Connection Network In An N-Dimensional Network, Where M Is Less Than N" and assigned to the assignee of the present application. In such an emulation, for any mesh interconnection pattern having a particular number of dimensions, some router nodes connected to each router node, as selected according to a pattern described in the aforementioned application, are identified as "neighboring" router nodes in the mesh, with each of the identified router nodes being associated with a particular dimension of the mesh.
In addition, the global router can be used more generally to transfer messages among router nodes without the use of addresses. In this operation, which is generally described in the aforementioned Bromley patent application, each router node, or the processing elements connected thereto, stores tables associating incoming messages with particular outgoing communications links. Using the tables, the router nodes pass messages, from node to node, until they reach the intended destinations.
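Table-driven forwarding of this kind might look like the sketch below. The table layout is an assumption made for illustration and is not drawn from the Bromley application: each node's table maps the link on which a message arrived to the link on which it should leave, links are identified by hypercube dimension number, and the hypothetical marker LOCAL stands for a message entering from, or being delivered to, an attached processing element.

```python
LOCAL = "local"  # hypothetical marker for injection from or delivery to a processing element


def forward(tables, start_node):
    """Follow per-node tables from start_node; return the list of nodes visited."""
    node, incoming = start_node, LOCAL
    path = [node]
    while True:
        outgoing = tables[node][incoming]   # table lookup replaces address decoding
        if outgoing == LOCAL:
            return path                     # message has reached its destination
        node ^= 1 << outgoing               # cross the link for that dimension
        incoming = outgoing                 # it arrives on the same dimension's link
        path.append(node)


if __name__ == "__main__":
    tables = {
        0b00: {LOCAL: 0},   # inject at node 00, leave on dimension 0
        0b01: {0: 1},       # arrived on dimension 0, leave on dimension 1
        0b11: {1: LOCAL},   # arrived on dimension 1, deliver locally
    }
    print(forward(tables, 0b00))
```

The message itself carries no address; the path is fixed entirely by the tables loaded into the nodes beforehand.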
SIMD machines are often used to perform mathematical operations on vectors or matrices of data values. In many algorithms involving matrices, it is typically advantageous to have each processing element process data items representing a column of a matrix, with successive processing elements in the processing array processing the successive columns of the matrix. That is, if "a_{i,j}" represents a location of a data item in a matrix, with "i" and "j" comprising row and column indices, respectively, then processing element "X" of the processing array processes all of the data items "a_{X,j}" of the matrix. Typically, each processing element will have a memory, with the data items "a_{X,0}" through "a_{X,Y}" of the successive rows zero through "Y" in the column "X" it is processing being stored in successive locations in its memory.
In matrix algorithms, it is often necessary to perform a transpose operation, in which the data items of the columns are reorganized into rows. Otherwise stated, in a transpose operation the data items in matrix locations "a_{i,j}" are transferred to matrix locations "a_{j,i}"; that is, the data item in the "j-th" memory location of the "i-th" processing element is moved to the "i-th" memory location of the "j-th" processing element. If a matrix is large, the time required to determine an optimal sequence for moving the data items among processing elements can be quite large.
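The data movement that a transpose requires can be stated compactly. The sketch below shows only where each item must end up; it does not address the scheduling problem noted above, and it assumes a square matrix with one processing element per memory list. The name transpose_across_elements is hypothetical.

```python
def transpose_across_elements(memories):
    """memories[i][j] is the item in location j of processing element i.

    Returns new memories in which that item sits in location i of
    processing element j. Assumes a square matrix.
    """
    n = len(memories)
    result = [[None] * n for _ in range(n)]
    for i, memory in enumerate(memories):
        for j, item in enumerate(memory):
            # Every item with i != j must travel to a different element.
            result[j][i] = item
    return result


if __name__ == "__main__":
    memories = [["a00", "a01", "a02"],
                ["a10", "a11", "a12"],
                ["a20", "a21", "a22"]]
    for memory in transpose_across_elements(memories):
        print(memory)
```

Only the diagonal items stay in place; for an n-by-n matrix the remaining n(n-1) items each cross to another processing element, which is why the order of the transfers matters on a shared communications network.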
Similar problems arise in other types of computations, such as Fast Fourier Transform (FFT) computations. In performing an FFT, the data items are stored in vectors, which are divided among the processing elements in a similar manner. At various points in an FFT computation, the data items are transferred among the processing elements in a similar manner.