1. Field of the Invention
The invention relates generally to the field of digital data processing systems, and more particularly to parallel processing systems which incorporate a large number of processors that are interconnected in a regular connection structure and in which all of the processors receive the same instruction from a common control structure.
2. Description of the Prior Art
A digital data processing system comprises three basic elements, namely, a memory element, an input/output element and a processor element. The memory element stores information in addressable storage locations. This information includes data and instructions for processing the data. The processor element fetches information from the memory element, interprets the information as either an instruction or data, processes the data in accordance with the instructions, and returns the processed data to the memory element. The input/output element, under control of the processor element, also communicates with the memory element to transfer information, including instructions and data to be processed, to the memory, and to obtain processed data from the memory.
Most modern data processing systems are considered "von Neumann" machines, since they are constructed according to a paradigm attributed to John von Neumann. Von Neumann machines are characterized by having a processing element, a global memory which stores all information in the system, and a program counter which identifies the location in the global memory of the instruction being executed. The processing element is executing one instruction at a time, that is, the instruction that is identified by the program counter. When that instruction has been executed, the program counter is advanced to identify the location of the next instruction to be executed. (In most modern systems, the program counter is actually advanced before the processor has finished processing the current instruction).
Von Neumann systems are conceptually uncomplicated to design and program, since they do only one operation at a time, but they are also relatively slow. A number of advancements have been made to the original von Neumann paradigm to permit the various parts of the system, most particularly the various components of the processor, to operate relatively independently and achieve a significant increase in processing speed. The first such advancement was pipelining of the various steps in executing an instruction, including instruction fetch, operation code decode (a typical instruction includes an operation code which identifies the operation to be performed and in most cases one or more operand specifiers which identify the operands, or data, to be used in executing the instruction), operand fetch, execution (that is, performing the operation set forth in the operation code on the fetched operands), and storing of processed data, which are performed relatively independently by separate hardware in the processor. In a pipelined processor, the processor's instruction fetch hardware may be fetching one instruction while other hardware is decoding the operation code of another, fetching the operands of another, executing yet another instruction and storing the processed data of a fifth instruction. Pipelining does not speed up processing of an individual instruction, but since the processor begins processing a second instruction before it has finished processing the first, it does speed up processing a series of instructions.
Pipelining has also been used within several of the circuits comprising the processor, most notably the circuits which perform certain arithmetic operations, to speed processing of a series of calculations. Like pipelining of instruction processing, pipelining arithmetic operations does not speed up an individual calculation, but it does speed up processing of a series of calculations.
A pipelined processor is obviously much more complicated than a non-pipelined processor in a von Neumann system, as it requires not only the various circuits to perform each of the operations (in a simple von Neumann processor, many circuits could be used to perform several operations), but also control circuits to coordinate the activities of the various circuits. However, the speedup of the system can be dramatic.
More recently, some processors have been provided with execution hardware which include multiple functional units each being designed to perform a certain type of mathematical operation. For example, some processors have separate functional units for performing integer arithmetic and floating point arithmetic, since floating point arithmetic requires handling two parts of a floating point number, namely the fraction and the exponent, while numbers in integer arithmetic have only one part. Some processors, for example the CDC 6600 manufactured by Control Data Corporation, included a number of separate hardware functional units each of which performs one or only several types of mathematical operations, including addition, multiplication, division, branch, and logical operations, all of which may be executing at once. This can be helpful in speeding up certain calculations, most particularly those in which several functional units may be used at one time for performing part of the calculation.
In a processor which incorporates pipelining or multiple functional units (or both, since both may be incorporated into a processor), a single instruction stream operates on a single data stream. That is, each instruction operates on data to produce one calculation at a time. Such processors have been termed "SISD" for "single instruction-single data". However, if a program requires a segment of a program to be used to operate on a number of diverse elements of data to produce a number of calculations, the program causes the processor to loop through that segment for each calculation. In some cases, in which the program segment is short or there are only a few data elements, the time required to perform the calculations on the data is not unduly long.
However, for many types of such programs, SISD processors would require a very long time to perform all of the calculations that are required. Accordingly, processors have been developed which incorporate a large number of processing elements, all operating concurrently on the same instruction, with each processing element processing a separate data stream. These processors have been termed "SIMD" processors, for "single instruction-multiple data".
SIMD processors are useful in a number of applications, including image processing, signal processing, artificial intelligence, database operations and computer simulation of a number of things such as electronic circuits and fluid dynamics. In image processing, each processor performs processing on a pixel ("picture element") to enhance the overall image. In signal processing, the processors concurrently perform a number of the calculations required to produce the Fast Fourier Transform of the signal. In artificial intelligence, the processors perform searches on extensive databases representing the stored knowledge of the application. In database operations, the processors perform searches, as in the artificial intelligence applications, and they also perform sorting operations. In computer simulation of, for example, electronic circuits, each processor represents one part of the circuit, and the processor's calculations indicates the response of the part to signals from other parts of the circuit. Similarly, in simulating fluid dynamics, which can be useful in a number of applications such as weather prediction and the design of airplanes, each processor is associated with one point in space, and the calculations performed provide information about various factors such as fluid flow, temperature, pressure, and so forth, occurring at that point in space.
Typical SIMD processors include two primary components, namely an array of processor elements and a routing network over which the processor elements may communicate the results of a calculation to other processor elements for use in future calculations. In addition, SIMD processors include a control processor for controlling the operations of the processor elements and routing network in response to instructions and data from a host computer system.
Another system architecture has been proposed, namely a multiple instruction-multiple data architecture. A MIMD system is similar to SIMD systems in that it has multiple processor, but it differs from them in that each processor is free to operate on a different program from the others. Under some circumstances, it may be desirable to allow a parallel processing array to operate in both a SIMD mode and a MIMD mode. This may be used, for example, in matrix arithmetic to allow the different processing elements to calculate inner products, which calculations will differ depending on the inner products being calculated by the processing element. In H. J. Siegel, et al., PASM: A Partitionable SIMD/MIMD System For Image Processing And Pattern Recognition, IEEE Transactions On Computers, Vol. C-30, No. 12, Dec. 1981 at pages 934-947, a system is described in which the processing array may execute MIMD programs in response to a SIMD instruction from the control processor. The control processor is signalled when all of the processing elements have completed their MIMD operations to allow the control processor to issue a new SIMD instruction.
In prior highly-parallel array processing systems, the processors in the processing array have been interconnected by a communications network which allows them to transmit data, in the form of messages, among themselves. A number of interconnection patterns, or network topologies, have been used, and others have been proposed. For example, in the MPP by the Goodyear Aerospace Corporation, the processing elements are interconnected in a mesh pattern of a plurality of rows and columns. Each of the processing elements may transmit data only to their four nearest neighbors in the mesh pattern. In the connection machine from the Thinking Machines Corporation, the processing elements are interconnected in a hypercube, or Boolean N-cube pattern to twelve other processing elements. If a message is destined for a processing element to which the transmitting element is not connected, it is transmitted to a processing element which acts as an intermediary, passing it toward the intended recipient. The message may pass through a number of processing elements before it reaches the intended recipient.