1. Cross-Reference to Related Applications
The following co-pending patent applications are assigned to the same assignee of the present application and are related to the present application: "Router Chip with Quad-Crossbar and Hyperbar Personalities" by John Zapisek filed concurrently herewith and assigned Ser. No. 07/461,551; "Parallel Processor Memory System" by Won Kim, David Bulfer, John Nickolls, Tom Blank and Hannes Figel filed concurrently herewith and assigned Ser. No. 07/461,567; and "Network and Method for Interconnecting Router Elements Within Parallel Computer System" by Stuart Taylor filed concurrently herewith and assigned Ser. No. 07/461,572. The disclosures of these concurrently filed applications are incorporated herein by reference.
2. Field of the Invention
The invention disclosed here is generally related to parallel processing systems and more specifically to the transmission of information through so-called massively-parallel Single Instruction Multiple Data (SIMD) computing machines.
3. Description of the Relevant Art
It has long been, and continues to be, a goal in the computer arts to produce a computing machine which can process large amounts of data in minimum time. Electronic computing machines have generally been designed within the confines of the so-called "von Neumann" architecture. In such an architecture, all instructions and data are forced to flow serially through a single, and hence central, processing unit (CPU). The bit width of the processor's address/data bus (e.g., 8, 16 or 32 bits wide) and the rate at which the processor (CPU) executes instructions (often measured in millions of instructions per second, "MIPS") tend to act as critical bottlenecks which restrict the flow rate of data and instructions. CPU execution speed and bus width must be continuously pushed to higher levels if processing time is to be reduced.
Von Neumann machines have previously enjoyed quantum reductions in data processing times (by factors of ten every decade or so), but artisans in the computing field are now beginning to suspect that the exponential growth previously witnessed in processor bandwidth (CPU bus width, W, multiplied by CPU instruction-execution speed, f) is about to come to an end. Von Neumann style architectures appear to be reaching the physical limits of presently known semiconductor technology. Attention is being directed, therefore, to a different type of computing architecture wherein problems are solved not serially but rather by way of the simultaneous processing of parallel-wise available data (information) in a plurality of processing units. These machines are often referred to as parallel processing arrays. When large numbers of processing units are employed (e.g., 64, 128, 1024 or more) the machines are referred to as massively parallel computers. When all processors of a massively parallel machine simultaneously receive a single instruction, broadcast from a central array control unit (ACU), the machine is referred to as a SIMD machine (single instruction, multiple data).
The advantage of parallel processing is simple. Even though each processing unit (PU) may have a finite, and therefore speed-limiting, processor bandwidth (abbreviated hereafter as "pubw"), an array having a number N of such processors will have a total computation bandwidth of N times pubw. From the purely conceptual point of view, because the integer N is unlimited, it should be possible to forever increase the resultant computing speed $N \cdot \mathrm{pubw}$ of an array simply by adding more processors. It should be possible to build massively parallel machines having thousands or even millions of processors which in unison provide computing power that eclipses today's standards.
The physical world is unfortunately not kind enough to allow for unchecked growth. It turns out that the benefits derived from increasing the size of a parallel array (scaling N upwardly to an arbitrarily large value) are countered by a limitation in the speed at which messages can be transmitted to and through the parallel array, i.e., from one processor to another or between one processor and an external I/O (input/output) device. Inter-processor messaging is needed so that intermediate results produced by one processing unit ($\mathrm{PU}_1$) can be passed on to another processing unit ($\mathrm{PU}_2$) within the array. Messaging between the array's parallel memory structure and external I/O devices such as high speed disks and graphics systems is needed so that problem data can be quickly loaded into the array and solutions can be quickly retrieved. The array's messaging bandwidth at the local level, which is the maximum rate in terms of bits per second that one randomly located processor unit ($\mathrm{PU}_x$) can send a message to any other randomly located processor unit ($\mathrm{PU}_y$) and/or to any randomly named external I/O device, will be abbreviated herein as "armbw" and referred to as the "serial" messaging bandwidth.
Preferably, messaging should take place in parallel so that a multiple number, M, of processors are simultaneously communicating at one time, thereby giving the array a parallel messaging bandwidth of M times the serial bandwidth armbw. Ideally, M should equal N so that all N processors in the array are simultaneously able to communicate with each other. Unfortunately, there are practical considerations which place limits on the values of M and armbw. Among these considerations are the maximum number of transistors and/or wires which can be defined on a practically-sized integrated circuit chip (IC), the maximum number of IC's and/or wires which can be placed on a practically-sized printed circuit board (PCB) and the maximum number of PCB's which can be enclosed within a practically-sized card cage. Wire density is typically limited to a finite, maximum number of wires per square inch and this tends to limit the value of M in practically-sized systems. Component density is similarly limited so as to require a finite distance between components which, because signals cannot propagate faster than the speed of light, limits the value of armbw. Thus there appears to be an upper bound on the parallel messaging bandwidth, $M \cdot \mathrm{armbw}$, of practical systems.
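By way of illustration only, the two bandwidth products defined above (N times pubw and M times armbw) can be compared numerically. The following is a minimal sketch; the function names and the sample figures are hypothetical and are not taken from any particular machine.

```python
def computation_bandwidth(n_processors, pubw):
    """Total computation bandwidth of the array: N times pubw."""
    return n_processors * pubw


def parallel_messaging_bandwidth(m_paths, armbw):
    """Parallel messaging bandwidth of the array: M times armbw."""
    return m_paths * armbw


def messaging_to_compute_ratio(n_processors, pubw, m_paths, armbw):
    """Messaging bandwidth available per unit of computation bandwidth."""
    return parallel_messaging_bandwidth(m_paths, armbw) / computation_bandwidth(
        n_processors, pubw
    )


if __name__ == "__main__":
    # Hypothetical figures in arbitrary but consistent units.
    for n in (1024, 2048, 4096):
        ratio = messaging_to_compute_ratio(n, pubw=10.0, m_paths=64, armbw=4.0)
        print(n, ratio)
```

Under these assumed figures, each doubling of N with M and armbw held fixed halves the ratio, which is one way of seeing why messaging bandwidth must scale together with computation bandwidth.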
If the ultimate goal of parallel processing is to be realized (unlimited expansion of array size with concomitant improvement in solution speed and price/performance ratio), ways must be found to maximize both the serial random messaging bandwidth, armbw, of the array and the parallel messaging bandwidth, $M \cdot \mathrm{armbw}$, so that the latter factors do not become new bottlenecking limitations on the speed at which parallel machines can input problem data, exchange intermediate results within the array, and output a solution after processing is complete. If ways are not found to expand these messaging bottlenecks, the messaging bandwidth limiting factors of parallel machines (M and armbw) can come to replace the so-called von Neumann bottleneck factors (f and W) that previously limited computing speed in non-parallel (scalar) machines, and the advantage of scalability in massively parallel machines is lost.
Several inter-processor messaging schemes have been proposed. By way of example, Thinking Machines Co. of Boston, Mass. has developed a hypercube structure referred to as the "Connection Machine" which is described in U.S. Pat. No. 4,805,091, issued to Thiel et al. Feb. 14, 1989 and also in U.S. Pat. No. 4,598,400 issued to Hillis, July 1, 1986; the disclosures of said patents being incorporated herein by reference. Goodyear Aerospace Corp. of Ohio has developed an X-Y grid for allowing each processor within a two dimensional array to communicate with its nearest North, East, West and South (NEWS) neighbors. The Goodyear NEWS system is described in U.S. Pat. No. 4,314,349, issued to Batcher Feb. 2, 1982, the disclosure of said patent being incorporated herein by reference. DEC (Digital Equipment Corp. of Massachusetts) has developed a multistage crossbar type of network for allowing clusters of processor units to randomly communicate with other clusters of processor units in a two dimensional $n \times m$ array. The DEC crossbar system is described in PCT application WO 88/06764 of Grondalski which was published Sep. 7, 1987 and is based on U.S. patent application Ser. No. 07/018,937. The disclosures of the Grondalski applications are incorporated herein by reference.
The problems with these previous approaches to interprocessor messaging are as follows. In the Goodyear NEWS network, each processor of a MIMD or SIMD machine is positioned in a two dimensional X-Y grid and limited to communicating by way of hardware with only its four nearest neighbors. Software algorithms (parallel processing programs) which call for messaging between non-neighboring processors do not run efficiently within the constraints of such a two dimensional NEWS topology. Complex software schemes have to be devised so that globally broadcast SIMD instructions ultimately allow a first positioned processor to talk (communicate) with another processor located, for example, three rows and four columns away in the X-Y grid. The message is sent during a first SIMD machine cycle to the memory of a neighboring NEWS processor. The neighbor then passes the message on to one of its NEWS neighbors in a subsequent SIMD machine cycle and the process repeats until eventually the message gets to the intended recipient. In this software-mediated form of a message store and forward scheme, so-called SIMD instruction-obey enabling bits (E-bits) of individual processors are typically toggled on and off so that intermediate processors do not actively accept a message not intended for them. Preferably, message bits of parallel paths should arrive at destination processors in synchronism so that all receiving processors can respond simultaneously within a SIMD machine to a single instruction broadcast by a centralized array control unit. If the time for transmitting a message from source processor to destination processor varies across the array, all receiving processors must wait until the last message is delivered before they can all simultaneously respond to a SIMD instruction broadcast by the centralized array control unit. Sophisticated software has to be developed for routing messages efficiently. The cost of software development and the execution time overhead for such a strategy detract from the performance of the overall system.
As more processors are added to the Goodyear NEWS array, random messaging time disadvantageously tends to increase. This is because the time for message transfer between one randomly located processor and any other randomly located member of the processor array is at least roughly proportional to the two dimensional distance between processors. (Number of hops is roughly proportional to $N^{1/2}$.) Users who attempt to increase the price/performance ratios of their systems by increasing the number N of processors in a NEWS array do not necessarily realize any improvement in system price/performance, and in some instances, the act of increasing array size may actually be detrimental to the price/performance ratio of the machine.
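The hop-count behavior of a NEWS grid described above can be sketched as follows. This is an illustrative model only: it assumes a square grid without wraparound at the edges, one store-and-forward hop per SIMD machine cycle, and hypothetical function names.

```python
import itertools


def news_hops(src, dst):
    """Store-and-forward hops between two (row, col) positions on a NEWS grid."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def mean_hops(side):
    """Mean hop count over all ordered processor pairs of a side-by-side grid."""
    cells = list(itertools.product(range(side), repeat=2))
    total = sum(news_hops(a, b) for a in cells for b in cells)
    return total / (len(cells) ** 2)


def simd_transfer_cycles(messages):
    """All receivers wait for the slowest message, so the cost is the largest hop count."""
    return max(news_hops(src, dst) for src, dst in messages)


if __name__ == "__main__":
    # A destination three rows and four columns away takes 3 + 4 hops.
    print(news_hops((0, 0), (3, 4)))
    # Quadrupling N (doubling the side) roughly doubles the mean hop count.
    for side in (4, 8, 16):
        print(side * side, mean_hops(side))
```

In this model the mean hop count grows in proportion to the side of the grid, that is, to the square root of N, and a batch of simultaneous transfers costs as many cycles as its longest path.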
The hypercube structure of the Connection Machine suffers from similar drawbacks. Instead of being limited to direct communication with only four neighboring processors, each processor of an H-dimensional hypercube can talk via hardware directly with H neighboring processors, each of the neighbors being a processor which belongs to one of H hypercube planes theoretically passing through the message originating processor. A packet switching scheme is used to allow message packets to hop from one node to the next until the message packet reaches a destination node that is identified by a fixed-length header field of the packet. If a message originating processor wishes to communicate with a hypercube member other than its H immediate neighbors, such messaging must be carried out with a store and forward scheme similar to that of the NEWS network, except that it is mediated mostly by hardware rather than software. Message forwarding distance is usually much shorter in the hypercube environment than it is in the two-dimensional NEWS grid (because of the unique H-dimensional nature of a hypercube), but because the packet switching circuitry of each processor (node) in an H-dimension hypercube might be simultaneously receiving as many as H requests from its neighbors to act as an intermediary and to perform message store and forward operations, the message handling capabilities of the intermediate message-forwarding circuitry can be easily overwhelmed when more processors are added (when N is scaled upwardly) and the value of H increases. If the packet-switching circuits of destination processors are also being overwhelmed by store and forward requests, such that they are "too busy" to receive the message packets meant for them, the message packets have to be temporarily directed elsewhere (by modifying the destination field in the packet header) and there is the danger, in some cases, that a (multiply-modified) message packet may never get to its intended recipient. This danger increases as the value of H increases, and thus the hypercube does not provide an architecture whose number of processors (N) may be easily scaled upwardly.
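The neighbor relationship and forwarding distance in an H-dimensional hypercube can be illustrated with binary node addresses. The sketch below is a generic textbook model, not the routing logic of the Connection Machine: it assumes nodes are numbered 0 to 2^H - 1, that neighbors differ in exactly one address bit, and it uses dimension-order forwarding purely as an example.

```python
def hypercube_neighbors(node, h):
    """The H direct neighbors of a node: flip each of its H address bits in turn."""
    return [node ^ (1 << bit) for bit in range(h)]


def hypercube_hops(src, dst):
    """Minimum hop count: the number of address bits in which src and dst differ."""
    return bin(src ^ dst).count("1")


def dimension_order_route(src, dst, h):
    """One shortest path, correcting differing address bits from lowest to highest."""
    path = [src]
    current = src
    for bit in range(h):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path


if __name__ == "__main__":
    print(hypercube_neighbors(0, 3))
    print(dimension_order_route(0b000, 0b101, 3))
    print(hypercube_hops(0b000, 0b111))
```

In this model no path exceeds H hops, which is why forwarding distance is short relative to a NEWS grid, but every intermediate node on a path must store and forward the packet, and each node may be asked to do so by as many as H neighbors at once.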
A further drawback of the hypercube structure has to do with its wire density. At least H message-carrying wires must radiate from each node of a hypercube having $2^{H}$ nodes. (A node can be a single processor or a cluster of processors.) As H increases, the number of wires in the hypercube increases as $\tfrac{1}{2}(H \cdot 2^{H})$. For massively parallel machines (i.e., $H \geq 10$), there is the problem of how to concentrate such a massive number of wires ($H \cdot 2^{H}/2$) in a practical volume and how to minimize cross talk between such radially concentrated wires (H wires per node).
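The wire count stated above follows from counting H wires at each of the 2^H nodes and dividing by two, since each wire joins two nodes. A minimal sketch, assuming exactly one wire per node-to-node link (the text says "at least" H wires per node) and a hypothetical function name:

```python
def hypercube_wires(h):
    """Number of node-to-node wires in an H-dimensional hypercube: H * 2**H / 2."""
    nodes = 2 ** h
    return h * nodes // 2


if __name__ == "__main__":
    for h in (4, 10, 16):
        print(h, 2 ** h, hypercube_wires(h))
```

For H = 10 (1,024 nodes) this gives 5,120 wires, and for H = 16 (65,536 nodes) it gives 524,288 wires.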
The crossbar type of multi-stage interconnect network (MIN) described in the Grondalski application overcomes some of the problems associated with wire concentration. It is not a true crossbar switching matrix of the kind which allows any processor to directly communicate through a single switching element with any other processor in the array, but rather the Grondalski system is a multi-stage interconnect network (MIN) wherein pluralities of processors are grouped into clusters and each cluster communicates indirectly with any other cluster including itself through a message routing path formed by a series of spaced apart router "stages" each having message routing switches (message steering stages) and each being coupled to the next by lengths of stage-connecting wires. Each cluster has one wire for sending a transmission into the multi-stage interconnect network (MIN) and one wire for receiving a transmission from the interconnect network. Processors within a cluster access the transmit and receive wires by way of multiplexing. A routing path is created through the MIN by a sequential series of switch closings in the stages rather than by a single switch closing. This approach of assigning processors to clusters and forming an intercluster message routing network wherein routing paths are defined by plural switches (plural steering stages) advantageously reduces the number of wires and switches that would otherwise be required for a true crossbar switching matrix.
While it has many beneficial attributes, the Grondalski network suffers from a major drawback. The Grondalski routing system has an excessively long per-path message transmission time (e.g., 250 nanoseconds per bit) which grows disadvantageously as the size of the routing system is scaled upwardly. This drawback arises from the same factor which gives the Grondalski network its benefits. It is because each message routing path in the Grondalski network is defined by a plurality of spaced-apart "stages" and thus defined by a plural number of serially coupled switches, relatively long wires or other serially-connected message routing devices, and because each such device has an inherent signal propagation delay (i.e., signal flight time), that the time it takes for a single bit to travel through the message routing path is so long (e.g., 250 ns per bit). Messaging time disadvantageously increases in proportion to the number of serially-connected routing devices employed to define each routing path and the lengths of wires which connect these devices together. Thus, system performance is affected detrimentally as the size of the routing system is scaled upwardly by adding more routing devices and/or longer lengths of connecting wires. But, on the other hand, it is necessary to add more routing devices if the computation bandwidth $N \cdot \mathrm{pubw}$ of a parallel array and the parallel messaging bandwidth, $M \cdot \mathrm{armbw}$, of the routing system are to be scaled upwardly in an efficiently matched manner. If the computation power, $N \cdot \mathrm{pubw}$, of an array were to be increased while the parallel messaging bandwidth, $M \cdot \mathrm{armbw}$, remains constant, messaging time would begin to overshadow computation time.
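The dependence of per-bit transmission time on the number of serially connected devices can be sketched with a simple additive delay model. The model assumes, as the paragraph above implies, that each bit must traverse the entire path before the next bit is sent. The per-stage and per-wire delays used below are hypothetical values chosen only so that they sum to the 250 ns figure; they are not taken from the Grondalski disclosure.

```python
def path_bit_time_ns(stage_delays_ns, wire_delays_ns):
    """Per-bit transit time of one routing path: the sum of all serial delays."""
    return sum(stage_delays_ns) + sum(wire_delays_ns)


def path_bit_rate_mbps(bit_time_ns):
    """Serial bit rate of one path, in megabits per second, for a given bit time."""
    return 1000.0 / bit_time_ns


if __name__ == "__main__":
    # Hypothetical path: three switching stages joined by four wire runs.
    stages = [50, 50, 50]
    wires = [25, 25, 25, 25]
    t = path_bit_time_ns(stages, wires)
    print(t, path_bit_rate_mbps(t))
    # Adding a fourth stage and a fifth wire run lengthens every bit time.
    t_bigger = path_bit_time_ns(stages + [50], wires + [25])
    print(t_bigger, path_bit_rate_mbps(t_bigger))
```

A 250 ns bit time corresponds to a serial rate of 4 megabits per second per path, and in this model each added stage or wire run lowers that rate further.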
A designer wishing to build a system in accordance with the teachings of Grondalski is therefore caught in a dilemma. On the one hand, it is desirable to be able to add more processing units so that the total computational bandwidth, $N \cdot \mathrm{pubw}$, increases. On the other hand, it is necessary to limit wire length and the number of switching elements in each path of the message routing network so that messaging time does not become excessively long. At some point, the advantages of increased computing bandwidth, $N \cdot \mathrm{pubw}$, are outweighed by the drawbacks of decreased messaging bandwidth, armbw and/or $M \cdot \mathrm{armbw}$, and upward scaling of the parallel processing machine no longer makes sense from the vantage point of price versus performance.
There exists in the field of parallel processing a need for a scalable message routing system whose messaging delays do not grow substantially with size.