1. Field of the Invention
The invention relates generally to the field of digital data processing systems, and more particularly to array processing systems which incorporate a large number of processors that are interconnected in a regular connection structure and in which all of the processors receive the same instruction from a common control structure.
2. Description of the Prior Art
A digital data processing system comprises three basic elements, namely, a memory element, an input/output element and a processor element. The memory element stores information in addressable storage locations. This information includes data and instructions for processing the data. The processor element fetches information from the memory element, interprets the information as either an instruction or data, processes the data in accordance with the instructions, and returns the processed data to the memory element. The input/output element, under control of the processor element, also communicates with the memory element to transfer information, including instructions and data to be processed, to the memory, and to obtain processed data from the memory.
Most modern data processing systems are considered "von Neumann" machines, since they are constructed according to a paradigm attributed to John von Neumann. Von Neumann machines are characterized by having a processing element, a global memory which stores all information in the system, and a program counter which identifies the location in the global memory of the instruction being executed. The processing element executes one instruction at a time, namely the instruction that is identified by the program counter. When that instruction has been executed, the program counter is advanced to identify the location of the next instruction to be executed. (In most modern systems, the program counter is actually advanced before the processor has finished processing the current instruction.)
Von Neumann systems are conceptually uncomplicated to design and program, since they do only one operation at a time, but they are also relatively slow. A number of advancements have been made to the original von Neumann paradigm to permit the various parts of the system, most particularly the various components of the processor, to operate relatively independently and achieve a significant increase in processing speed. The first such advancement was pipelining of the various steps in executing an instruction. A typical instruction includes an operation code, which identifies the operation to be performed, and in most cases one or more operand specifiers, which identify the operands, or data, to be used in executing the instruction. The steps in executing an instruction include instruction fetch, operation code decode, operand fetch, execution (that is, performing the operation set forth in the operation code on the fetched operands), and storing of processed data, and these steps are performed relatively independently by separate hardware in the processor. In a pipelined processor, the processor's instruction fetch hardware may be fetching one instruction while other hardware is decoding the operation code of another, fetching the operands of another, executing yet another instruction and storing the processed data of a fifth instruction. Pipelining does not speed up processing of an individual instruction, but since the processor begins processing a second instruction before it has finished processing the first, it does speed up processing a series of instructions.
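The effect of pipelining on a series of instructions can be illustrated with a simple cycle count. The following sketch is not taken from any particular processor. It assumes, for illustration only, that each of the five steps takes one clock cycle and that the pipeline never stalls; the function names are hypothetical.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when a new instruction enters the pipeline every cycle.

    Assumes one cycle per stage and no stalls: the first instruction takes
    num_stages cycles, and each later instruction completes one cycle after
    its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


# The five steps: fetch, decode, operand fetch, execute, store.
STAGES = 5

if __name__ == "__main__":
    for count in (1, 10, 100):
        print(count, unpipelined_cycles(count, STAGES), pipelined_cycles(count, STAGES))
```

Under these assumptions a single instruction takes five cycles either way, while a series of 100 instructions takes 104 cycles instead of 500.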
Pipelining has also been used within several of the circuits comprising the processor, most notably the circuits which perform certain arithmetic operations, to speed processing of a series of calculations. Like pipelining of instruction processing, pipelining arithmetic operations does not speed up an individual calculation, but it does speed up processing of a series of calculations.
A pipelined processor is obviously much more complicated than a simple processor in a von Neumann system, as it requires not only the various circuits to perform each of the operations (in a simple von Neumann processor, many circuits could be used to perform several operations), but also control circuits to coordinate the activities of the various circuits. However, the speed-up of the system can be dramatic.
More recently, some processors have been provided with execution hardware which includes multiple functional units, each designed to perform a certain type of mathematical operation. For example, some processors have separate functional units for performing integer arithmetic and floating point arithmetic, since floating point arithmetic requires handling two parts of a floating point number, namely the fraction and the exponent, while numbers in integer arithmetic have only one part. Some processors, for example the CDC 6600 manufactured by Control Data Corporation, include a number of separate hardware functional units, each of which performs one or only a few types of mathematical operations, including addition, multiplication, division, branch, and logical operations, all of which may be executing at once. This can be helpful in speeding up certain calculations, most particularly those in which several functional units may be used at one time for performing part of the calculation.
In a processor which incorporates pipelining or multiple functional units (or both, since both may be incorporated into a processor), a single instruction stream operates on a single data stream. That is, each instruction operates on data to produce one calculation at a time. Such processors have been termed "SISD", for "Single Instruction-Single Data". However, if a program requires one of its segments to operate on a number of diverse elements of data to produce a number of calculations, the program causes the processor to loop through that segment once for each calculation. In some cases, in which the program segment is short or there are only a few data elements, the time required to perform the calculations on the data is not unduly long.
However, for many types of such programs, SISD processors would require a very long time to perform all of the calculations that are required. Accordingly, processors have been developed which incorporate a large number of processing elements, all operating concurrently on the same instruction, with each processing element processing a separate data stream. These processors have been termed "SIMD" processors, for "Single Instruction-Multiple Data".
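The difference between the two organizations can be sketched as follows. This is a minimal illustration, not a description of any actual machine: the class and function names are hypothetical, and Python executes the SIMD loop sequentially, whereas in the hardware described all processing elements would execute the broadcast instruction concurrently.

```python
from typing import Callable, List


def sisd_run(instruction: Callable[[float], float], data: List[float]) -> List[float]:
    """A single processor loops through the program segment once per data element."""
    results = []
    for value in data:  # one calculation at a time
        results.append(instruction(value))
    return results


class ProcessingElement:
    """A hypothetical processing element holding its own data stream."""

    def __init__(self, value: float) -> None:
        self.value = value

    def execute(self, instruction: Callable[[float], float]) -> None:
        self.value = instruction(self.value)


def simd_step(instruction: Callable[[float], float],
              elements: List[ProcessingElement]) -> None:
    """Broadcast one instruction to every processing element.

    Each element applies the same instruction to its own data. The loop
    here stands in for what the hardware would do in a single step.
    """
    for element in elements:
        element.execute(instruction)
```

With three data elements, the SISD version makes three passes through the segment, while the SIMD version issues one broadcast step to three elements and arrives at the same results.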
SIMD processors are useful in a number of applications, including image processing, signal processing, artificial intelligence, database operations and computer simulation of systems such as electronic circuits and fluid dynamics. In image processing, each processor performs processing on a pixel ("picture element") to enhance the overall image. In signal processing, the processors concurrently perform a number of the calculations required to produce the Fast Fourier Transform of the signal. In artificial intelligence, the processors perform searches on extensive databases representing the stored knowledge of the application. In database operations, the processors perform searches, as in the artificial intelligence applications, and they also perform sorting operations. In computer simulation of, for example, electronic circuits, each processor represents one part of the circuit, and the processor's calculations indicate the response of the part to signals from other parts of the circuit. Similarly, in simulating fluid dynamics, which can be useful in a number of applications such as weather prediction and the design of airplanes, each processor is associated with one point in space, and the calculations performed provide information about various factors such as fluid flow, temperature, pressure, and so forth.
Typical SIMD processors include two primary components, namely an array of processor elements and a routing network over which the processor elements may communicate the results of a calculation to other processor elements for use in future calculations. In addition, SIMD processors include a control processor for controlling the operations of the processor elements and routing network in response to instructions and data from a host computer system.
Several routing networks have been used in SIMD processors and a number of others have been proposed. In one routing network, which has been used in the Massively Parallel Processor, manufactured by Goodyear Aerospace Corporation ("Goodyear MPP"), the processor elements are interconnected in a matrix, or mesh, arrangement. In such an arrangement, the processor elements are connected in rows and columns and directly communicate only with their four nearest neighbors. This arrangement can be somewhat slow if communications may be to random processor elements, but the number of wires which are required to make the interconnections is lower than in most other arrangements, on the order of 4n, where "n" is the number of processor elements, assuming only unidirectional transfer of messages over each wire. If each wire can transfer bidirectionally, the number of wires is reduced by half, with a possible reduction in the message transfer rate.
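The mesh arrangement can be sketched as follows. The sketch assumes, for illustration only, that the edges of the array wrap around so that every processor element has exactly four neighbors; how the edges of any actual machine are connected is not stated here. The function names are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[int, int]  # (row, column)


def mesh_neighbors(row: int, col: int, rows: int, cols: int) -> List[Position]:
    """Return the four nearest neighbors (up, down, left, right) of an element.

    Assumption: the edges wrap around, so every element has four neighbors.
    """
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]


def mesh_hops(src: Position, dst: Position, rows: int, cols: int) -> int:
    """Nearest-neighbor transfers needed to move a message from src to dst."""
    d_row = abs(src[0] - dst[0])
    d_col = abs(src[1] - dst[1])
    # With wraparound, a message may travel in whichever direction is shorter.
    return min(d_row, rows - d_row) + min(d_col, cols - d_col)


def mesh_wires(rows: int, cols: int) -> int:
    """Unidirectional wires: one outgoing wire per neighbor per element (4n)."""
    return 4 * rows * cols
```

For a 64-by-64 array (n = 4096), this gives 16,384 unidirectional wires, and a message to the farthest element needs 64 nearest-neighbor transfers, which illustrates why the arrangement is slow for communication with random processor elements.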
The matrix network is also used on the "Connection Machine", manufactured by Thinking Machines Corporation, but that machine also includes a hypercube network allowing communications between random processor elements (that is, processor elements which are not nearest neighbors). In that hypercube network, each processor chip connects directly to twelve other processor chips. Each processor chip includes several processor elements and circuits which form part of the routing network. The routing circuits on each chip receive messages from the processor elements on the chip for transmission to processor elements on other processor chips. In addition, the routing circuits receive messages from other processor chips. If a message from another processor chip is to be received by a processor element on the chip, the routing circuits forward it to that element; however, if the message is to be received by a processor element on another chip, the routing circuits transmit the message over a wire to another processor chip. The procedure is repeated until the message reaches the intended recipient. Thus, the routing circuits on each chip must be able to handle not only messages from the processor elements on the chip, but also messages from other chips which may or may not be addressed to processor elements on the chip.
A hypercube network handles communications fairly rapidly, but it requires a large number of wires, on the order of n log₂ n if messages are transferred unidirectionally over each wire. For example, if "n" were 4096 (4K, K=1024), the hypercube would require on the order of 48K wires. If the wires transfer messages bidirectionally, only 24K wires would be required, but the volume of message traffic that could be carried would also be reduced. Typically, the larger the number of wires in a routing network, the more expensive is the network, and the greater is the likelihood of failure resulting from broken wires or failed switching elements which interconnect the wires.
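The hypercube connections and the hop-by-hop forwarding of messages can be sketched in terms of binary node addresses. The sketch assumes, as is conventional for hypercubes though not stated above, that nodes are numbered so that directly connected nodes differ in exactly one address bit, and that a message is forwarded by correcting one differing bit per hop, lowest bit first. The function names are hypothetical, and the routing order used by any actual machine may differ.

```python
from typing import List


def hypercube_neighbors(node: int, dimensions: int) -> List[int]:
    """Nodes directly wired to `node`: those whose address differs in one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hypercube_route(src: int, dst: int, dimensions: int) -> List[int]:
    """Return the nodes a message visits on its way from src to dst.

    At each hop the current node forwards the message to the neighbor that
    corrects the lowest address bit still differing from the destination.
    """
    path = [src]
    current = src
    for bit in range(dimensions):
        mask = 1 << bit
        if (current ^ dst) & mask:
            current ^= mask  # forward over the wire for this dimension
            path.append(current)
    return path
```

With 12 dimensions there are 4096 nodes, each with 12 neighbors, which corresponds to the 48K unidirectional wires cited above. A message needs at most 12 hops, one for each address bit in which the source and destination differ.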
Another routing arrangement which has been proposed is a crossbar switch, through which each processor element can communicate with any of the other processor elements directly. The crossbar switch provides the most efficient communications of any of the routing networks proposed. However, a crossbar switch also has the most wires and switching elements, both on the order of n², and thus is most expensive and most susceptible to failure due to broken wires and failed switching elements. Using the above example, in which "n" is 4K, the number of wires and switching elements required for the crossbar switch is 16M (M=1,048,576).
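The order-of-magnitude wire counts given for the mesh, hypercube and crossbar arrangements can be compared directly. The following sketch simply evaluates the expressions 4n, n log₂ n and n² for unidirectional wires; the function names are hypothetical, and n is assumed to be a power of two.

```python
K = 1024
M = K * K


def mesh_wire_count(n: int) -> int:
    """Mesh: four unidirectional wires per processor element."""
    return 4 * n


def hypercube_wire_count(n: int) -> int:
    """Hypercube: log2(n) unidirectional wires per node (n a power of two)."""
    dimensions = n.bit_length() - 1
    if 1 << dimensions != n:
        raise ValueError("n must be a power of two")
    return n * dimensions


def crossbar_wire_count(n: int) -> int:
    """Crossbar: on the order of n squared wires and switching elements."""
    return n * n


if __name__ == "__main__":
    n = 4 * K
    print("mesh:     ", mesh_wire_count(n) // K, "K")
    print("hypercube:", hypercube_wire_count(n) // K, "K")
    print("crossbar: ", crossbar_wire_count(n) // M, "M")
```

For n = 4K, the counts are 16K for the mesh, 48K for the hypercube and 16M for the crossbar switch.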
Yet another routing arrangement is an omega network, in which switching is performed through a number of serially-connected stages. Each stage has two inputs, each connected to an output of a prior stage or of a processor chip, and two outputs. The "Butterfly", manufactured by Bolt Beranek and Newman, uses an omega network.
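The text above does not describe how the stages are wired together or how a message selects its path. As an illustration only, the following sketch assumes the textbook form of an omega network: log₂ N stages of two-input, two-output switches, a perfect-shuffle connection ahead of each stage, and destination-tag routing in which each stage examines one bit of the destination address. The function name is hypothetical.

```python
from typing import List


def omega_route(src: int, dst: int, num_ports: int) -> List[int]:
    """Return the line number a message occupies after each stage.

    Assumes a textbook omega network with num_ports a power of two (>= 2).
    """
    stages = num_ports.bit_length() - 1
    if num_ports < 2 or 1 << stages != num_ports:
        raise ValueError("num_ports must be a power of two, at least 2")
    mask = num_ports - 1
    position = src
    path = [position]
    for stage in range(stages):
        # Perfect shuffle: rotate the line address left by one bit.
        position = ((position << 1) | (position >> (stages - 1))) & mask
        # Two-by-two switch: destination bit 0 selects the upper output,
        # bit 1 the lower output, taking destination bits from the top down.
        dst_bit = (dst >> (stages - 1 - stage)) & 1
        position = (position & ~1) | dst_bit
        path.append(position)
    return path
```

Under these assumptions, a message in an eight-port network passes through three stages and always arrives on the line whose number equals the destination address, regardless of the source.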
The cost of a routing network is directly related to the number of wires, as is the likelihood of failure due to discontinuity in a communications path. To reduce the number of wires and achieve a significant fraction of the efficiency of the crossbar switch, a routing network has been proposed in which a multiple-stage omega network performs some portion of the switching. The output from the omega network is connected to a crossbar switch, which would require many fewer switching connections than would be necessary in the absence of the omega network. Depending on the number of stages in the omega network, the number of wires may be less than in a hypercube, while the transfer efficiency would be greater than that of a hypercube. For example, if "n" is 4K, and a seven-stage omega network is provided ahead of a crossbar switch, 36K wires (again assuming unidirectional communications over each wire) would be required to form the routing network.
Using a routing network to transfer data does have a number of limitations. The mesh network is useful generally when transferring data only to the adjacent processors, as each transfer requires commands from the controlling program. A hypercube, crossbar switch, omega, or like network is most useful if message transfers are expected to be to random processors. Some array processors, the Connection Machine, for example, have two mechanisms for transferring data, one for random transfers and the other for matrix transfers. Under some circumstances, however, it may be faster to provide a processor with direct access to memories associated with other processing elements. This may be useful, for example, when, after performing operations in parallel, a serial operation is to be performed on the just processed data. If one processing element has access to the data in at least some other processing elements' memories, that processing element may perform serial operations using that data. Also, the processing element may use those memories if a problem requires more storage capacity than a single processing element would have.
Typically, array processors are used in performing arithmetic operations on numerical values, which are expressed in "floating point" form. In that form, a floating point number has a fraction portion and an exponent portion, with the value of the number being the value contained in the fraction portion multiplied by two raised to the value contained in the exponent portion. When performing arithmetic operations such as addition and subtraction on such numbers, the numbers must be "aligned", that is, they must have the same value of the exponent. To achieve this, the value of the fraction portion of the smaller-magnitude floating point number must be reduced, which raises the effective value of the number's exponent portion, until the exponent values are equal. After the arithmetic operation, the fraction of the result must be normalized, that is, leading zeroes must be removed by decreasing (increasing) the value of the fraction portion, while at the same time increasing (decreasing) the value of the result's exponent. In both the alignment and normalization operations, the fractions are reduced or increased by shifting their values in the locations in which they are stored.
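The alignment and normalization operations can be sketched with integer shifts. The representation used here is an assumption made for illustration: an unsigned fraction held in a hypothetical eight-bit field, a signed integer exponent, the value being the fraction multiplied by two raised to the exponent, and bits shifted out on the right simply discarded. The function names are hypothetical.

```python
from typing import Tuple

FRACTION_BITS = 8  # hypothetical width of the fraction field


def align(frac_a: int, exp_a: int, frac_b: int, exp_b: int) -> Tuple[int, int, int, int]:
    """Give both numbers the larger exponent by shifting one fraction right.

    Returns (frac_a, frac_b, common_exponent, shifts_performed).
    """
    shifts = abs(exp_a - exp_b)
    if exp_a < exp_b:
        frac_a >>= shifts  # reduce the fraction, raising the effective exponent
        exp_a = exp_b
    else:
        frac_b >>= shifts
    return frac_a, frac_b, exp_a, shifts


def normalize(frac: int, exp: int) -> Tuple[int, int, int]:
    """Shift the fraction until its leading one occupies the top bit of the field.

    Returns (fraction, exponent, shifts_performed). A zero fraction is
    returned unchanged.
    """
    shifts = 0
    if frac == 0:
        return frac, exp, shifts
    while frac >= (1 << FRACTION_BITS):       # result overflowed the field
        frac >>= 1
        exp += 1
        shifts += 1
    while frac < (1 << (FRACTION_BITS - 1)):  # leading zeroes remain
        frac <<= 1
        exp -= 1
        shifts += 1
    return frac, exp, shifts
```

For example, subtracting a fraction of 128 from a fraction of 129, both with exponent 0, leaves a fraction of 1; normalizing it takes seven left shifts and yields a fraction of 128 with exponent -7, which represents the same value.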
However, in the alignment and normalization operations, since the values of the numbers processed by the various processing elements are all different, the number of shifts required to effect the alignment or normalization will also be different.
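The variation in shift counts can be illustrated as follows. How the processing elements deal with it is not stated above; the sketch assumes, purely for illustration, that the common control structure broadcasts single-bit shift commands and that each element ignores them once its own fraction is normalized. The eight-bit fraction width and the function names are hypothetical.

```python
from typing import List, Tuple

FRACTION_BITS = 8  # hypothetical width of the fraction field


def normalization_shifts(fraction: int) -> int:
    """Left shifts needed to bring the leading one to the top of the field."""
    if fraction == 0:
        return 0
    return FRACTION_BITS - fraction.bit_length()


def lockstep_normalize(fractions: List[int]) -> Tuple[List[int], int]:
    """Simulate elements that all receive the same single-bit shift command.

    Assumes every fraction fits in FRACTION_BITS bits. Returns the
    normalized fractions and the number of broadcast steps required.
    """
    fractions = list(fractions)
    top_bit = 1 << (FRACTION_BITS - 1)
    steps = 0
    while any(f != 0 and f < top_bit for f in fractions):
        # Elements already normalized (or holding zero) ignore the command.
        fractions = [f << 1 if (f != 0 and f < top_bit) else f for f in fractions]
        steps += 1
    return fractions, steps
```

With three elements holding fractions of 128, 16 and 1, the required shift counts are 0, 3 and 7. Under the assumed scheme, seven broadcast steps are needed before all three elements are normalized, even though one element needed none.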