This invention relates to intercomputer message passing systems and apparatus and, more particularly, in an intercomputer routing system wherein message packets are routed along communications paths from one computer to another, to the improvement comprising a routing automaton disposed at each computer and having an input for receiving a message packet including routing directions as a header thereto and a plurality of outputs for selectively outputting the message packet as a function of the routing directions in the header; and routing logic means disposed within the routing automaton for reading the header, for directing the message packet to one of the outputs as a function of the routing directions contained in the header, and for updating the header to reflect the passage of the message packet through the routing automaton.
For a message-passing, concurrent computer system with very few nodes as depicted in FIG. 1, it is practical to use a full interconnection scheme between the nodes 10 thereof. A full interconnection of channels quickly becomes impractical as the number of nodes increases, since each node of an N node machine must have N-1 connections. A configuration used for larger message-passing multicomputers such as the Caltech Cosmic Cube [Seitz 85] and its commercial descendants is that of a binary n-cube (or hypercube) as depicted in FIG. 2 which is used to connect N = 2^n nodes 10. Each node 10 has n = log_2 N connections, and a message never has to travel through more than n channels to reach its destination.
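The scaling argument above can be illustrated with a short sketch. The function names are hypothetical and chosen only for this illustration; the counts follow directly from the N-1 and log_2 N figures given in the text.

```python
def links_per_node_full(n_nodes: int) -> int:
    """Connections per node when every node is wired to every other node."""
    return n_nodes - 1


def links_per_node_hypercube(n_nodes: int) -> int:
    """Connections per node in a binary n-cube of N = 2^n nodes (n = log_2 N)."""
    n = n_nodes.bit_length() - 1
    if n_nodes < 1 or (1 << n) != n_nodes:
        raise ValueError("a binary n-cube needs a power-of-two node count")
    return n


if __name__ == "__main__":
    for n_nodes in (4, 64, 256, 1024):
        print(n_nodes,
              links_per_node_full(n_nodes),
              links_per_node_hypercube(n_nodes))
```

For a 256-node machine, the sketch gives 255 connections per node under full interconnection against 8 for the hypercube, which is also the maximum number of channels a message must traverse.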
Although the choice of the binary n-cube for the first generation of multicomputers is easily justified, the analyses presented in a 1986 Caltech PhD thesis by William J. Dally [Dally 86] showed that the use of lower dimension versions of a k-ary n-cube [Seitz 84a] connecting N = k^n nodes, e.g. an n = 2 (2-D) torus or mesh, is optimal for minimizing message latency under the assumptions of (1) constant wire bisection and (2) "wormhole" routing [Seitz 84b].
These 2-D (or optionally 3-D) networks also have the advantage that each node has a fixed number of connections to its immediate neighbors, and, if the nodes are also arrayed in two or three dimensions, the projection of the connection plan into the packaging medium has all short wires. Also, the number of nodes in such a machine can be increased at any time with a minimum amount of rewiring. The low dimension k-ary n-cube greatly decreases the number of channels, so that with a fixed amount of wire across the bisection, one may use wider channels of proportionally higher bandwidth. This higher bandwidth, particularly with wormhole routing, can more than compensate for the longer average path a message packet must travel to reach its destination.
The time required for a packet to reach its destination in a synchronous router is given by T_n = T_c (pD + [L/W]), where T_c is the cycle time, p is the number of pipeline stages in each router, D is the number of channels that a packet must traverse to reach its destination, L is the length of the packet, and W is the width of a flow control unit (referred to hereinafter as a "flit").
As an example, let us assume that there are N = 256 nodes, 512 wires crossing the bisection for communication (neglecting overhead from synchronization wires), a message length of 20 bytes (i.e. 160 bits), and an internal 2-stage pipeline. The bisection of a binary hypercube has 128 channels in each direction, each with a width of 2 bits, and an average of (log_2 N)/2 = 4 nodes that must be traversed, so that T_n = (2 × 4 + 160/2) T_c = 88 T_c. By comparison, the bisection of a 2-D (k × k) mesh, where k = 16, has 16 channels in each direction, each with a width of 16 bits, and an average of (2k/3) ≈ 11 nodes must be traversed, so that T_n = (2 × 11 + 160/16) T_c = 32 T_c. Thus, the binary hypercube network in this example has over twice the average latency of a bidirectional mesh network with the same wire bisection.
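The arithmetic of this example can be reproduced with the following minimal sketch. The function and variable names are hypothetical, and the bracketed term [L/W] is assumed here to mean the number of flits rounded up to a whole number; with the figures of the example the division is exact, so that assumption does not affect the results.

```python
def channel_width(bisection_wires: int, channels_per_direction: int) -> int:
    """Bits per channel when the bisection wires are shared by both directions."""
    return bisection_wires // (2 * channels_per_direction)


def latency_cycles(p: int, d: int, length_bits: int, width_bits: int) -> int:
    """Latency in units of the cycle time T_c: p*D + [L/W].

    [L/W] is taken as a ceiling (an assumption of this sketch).
    """
    flits = -(-length_bits // width_bits)  # integer ceiling division
    return p * d + flits


BISECTION_WIRES = 512
MESSAGE_BITS = 160
PIPELINE_STAGES = 2

# Binary hypercube, N = 256: 128 channels each way, average distance 4.
hypercube_width = channel_width(BISECTION_WIRES, 128)
hypercube_latency = latency_cycles(PIPELINE_STAGES, 4, MESSAGE_BITS, hypercube_width)

# 16 x 16 mesh: 16 channels each way, average distance about 11.
mesh_width = channel_width(BISECTION_WIRES, 16)
mesh_latency = latency_cycles(PIPELINE_STAGES, 11, MESSAGE_BITS, mesh_width)

if __name__ == "__main__":
    print("hypercube:", hypercube_width, "bits wide,", hypercube_latency, "cycles")
    print("mesh:     ", mesh_width, "bits wide,", mesh_latency, "cycles")
```

The sketch makes the trade visible: the mesh pays 22 cycles of routing delay against 8 for the hypercube, but moves the 160-bit message in 10 flit times instead of 80.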
The Torus Routing Chip (TRC) designed at Caltech in 1985 [Dally & Seitz 86] used unidirectional channels between the nodes 10 connected in a torus as shown in FIG. 3. This is also the subject of a patent application entitled Torus Routing Chip by Charles L. Seitz and William J. Dally, Ser. No. 944,842, Filed Dec. 19, 1986, and assigned to the common assignee of this application, the teachings of which are incorporated herein by reference. As depicted in FIG. 3, the torus is shown folded in its projection onto a common plane in order to keep all channels the same length. Deadlock (a major consideration in multicomputers) was avoided by using the concept of virtual channels, by which a packet injected into a network travels along a spiral of virtual channels, thus avoiding cyclic dependencies and the possibility of deadlock. The TRC was self-timed to avoid the problems associated with delivering a global clock to a large network. There were a total of 5 channels to deal with, i.e., channels to and from the node and 2 virtual channels each in x and y. Thus, the heart of the TRC involved a 5 × 5 crossbar switch. Although the initial version had a slow critical path, the revised version was expected to operate at 20 MHz, with a latency from input to output of 50 ns. Since each channel had 8 data lines, the TRC achieved a data rate of 20 MB/s. Each packet was made up of a header, consisting of 2 bytes containing the relative x and y address of the destination, any number of non-zero data bytes, and a zero data byte signifying a "tail" or end of the packet. Upon entering the router, each packet had the address in its header decremented and tested for zero and was then passed out through the proper output channel. The connection stayed open for the rest of the message and closed after passage of the tail (wormhole routing). If the desired output channel was unavailable, the message was blocked until the channel became available.
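The packet format and relative addressing described above can be sketched as follows. This is an illustration only, not the TRC logic: the function names are hypothetical, and the sketch assumes that a packet travels first in x and then in y, that one unit of relative address is consumed per hop, and that the unidirectional channels run in the direction of increasing coordinate. Virtual channels and blocking are not modelled.

```python
def make_packet(dx: int, dy: int, payload: bytes) -> bytes:
    """Build a packet: 2 header bytes (relative x, y), data bytes, zero tail."""
    if 0 in payload:
        raise ValueError("data bytes must be non-zero; a zero byte marks the tail")
    return bytes([dx, dy]) + payload + b"\x00"


def split_packet(packet: bytes):
    """Recover (dx, dy, payload) by scanning the body for the zero tail byte."""
    dx, dy = packet[0], packet[1]
    body = packet[2:]
    tail = body.index(0)  # first zero after the header ends the packet
    return dx, dy, body[:tail]


def trace_route(src, dx: int, dy: int, k: int):
    """List the nodes visited on a k x k torus (assumed x-first ordering)."""
    x, y = src
    path = [(x, y)]
    while dx > 0:            # consume the relative x address one hop at a time
        x = (x + 1) % k      # unidirectional channel with wraparound
        dx -= 1
        path.append((x, y))
    while dy > 0:            # then consume the relative y address
        y = (y + 1) % k
        dy -= 1
        path.append((x, y))
    return path


if __name__ == "__main__":
    packet = make_packet(2, 1, b"\x07\x09")
    dx, dy, payload = split_packet(packet)
    print(trace_route((3, 3), dx, dy, k=4))
```

Because the header carries a relative rather than an absolute address, each router needs only a decrement and a zero test to choose an output, which is what keeps the per-hop logic small.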
In the winter and spring of 1986, concurrently with the developments described above, groups of students in the "VLSI Design Laboratory" project course, under the direction of Dr. Charles Seitz of Caltech, were put to work designing different parts of the "Mosaic C" element. This single-chip node of a message-passing multicomputer was to contain a 16-bit central processing unit (CPU), several KBytes of on-chip dynamic random access memory (dRAM), and routing circuitry for communication with other chips. Each chip would form a complete node in a so-called fine-grain concurrent computer.
After looking at a few possible implementations, including the TRC described above, the group working on the routing section decided that a simple, bidirectional 2-D mesh should be used. A mesh had the advantage of keeping the length of wires between chips down to less than one inch, which would allow the use of a synchronous protocol, since clock skew as a function of wire length could be made very small between chips. A mesh would also allow the channels at the edge of the array to be reserved for communications with the outside world. The group also decided to use a bit-serial protocol for packets, both to minimize the number of pins on each chip and to minimize the number of connections needed between them, but to organize the packets into flits sufficiently large that all of the routing information could be contained in the first flit. As in the TRC, the first Mosaic C router as specified by this group was to use virtual channels to avoid the possibility of deadlock. Each packet consisted of a 20-bit header with the relative x and y addresses of the destination and an arbitrary number of 20-bit flits consisting of a 16-bit data word and 4 control bits. The router also used wormhole routing with one of the control bits signifying a tail. Internally, flits were switched between input and output channels using a time multiplexed bus. The control circuitry was kept as simple as possible and, as a result, could not forward a packet by itself. Each time the header of a packet came in, the CPU would be interrupted (using a dual-context processor for fast interrupt handling) to determine which output channel the packet should be connected to. This approach resulted in a latency of several microseconds per step in path formation, but allowed a lot of flexibility in routing under software control. Acknowledgement packets would automatically be sent and received between chips using the same channels to announce the availability of buffers.
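The 20-bit data flit described above can be sketched as a packed integer. The text gives only the field sizes (a 16-bit data word and 4 control bits, one of which signifies a tail); the bit positions used below, and the names pack_flit and unpack_flit, are assumptions made purely for illustration.

```python
DATA_BITS = 16
CONTROL_BITS = 4
FLIT_BITS = DATA_BITS + CONTROL_BITS   # 20 bits per flit
DATA_MASK = (1 << DATA_BITS) - 1
TAIL_BIT = 1 << DATA_BITS              # assumed position of the tail flag


def pack_flit(data: int, tail: bool = False) -> int:
    """Pack a 16-bit data word and a tail flag into a 20-bit flit."""
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data word must fit in 16 bits")
    return data | (TAIL_BIT if tail else 0)


def unpack_flit(flit: int):
    """Return (data word, tail flag) from a 20-bit flit."""
    if not 0 <= flit < (1 << FLIT_BITS):
        raise ValueError("flit must fit in 20 bits")
    return flit & DATA_MASK, bool(flit & TAIL_BIT)


def serialize(flit: int):
    """Yield the flit one bit at a time, least significant bit first (assumed order)."""
    for i in range(FLIT_BITS):
        yield (flit >> i) & 1


if __name__ == "__main__":
    flit = pack_flit(0xBEEF, tail=True)
    print(hex(flit), unpack_flit(flit), sum(1 for _ in serialize(flit)))
```

Carrying the tail marker in a control bit, rather than reserving a data value as the TRC did with its zero byte, leaves all 16-bit data words available to the payload.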
With a 20 MHz system clock (anticipated for 2 micrometer CMOS technology), the bandwidth was expected to be about 2 MB/s on each channel. This initial attempt at a routing circuit for incorporation into the Mosaic C chip was never reduced to a layout. After due consideration, it became obvious that it would consume a large amount of silicon area (on the chip) only to achieve fairly dismal performance.
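The bandwidth figure is consistent with the flit format described earlier, as the following sketch shows. The derivation is not given in the text; the sketch assumes one bit is transferred per clock cycle on the bit-serial channel, counts only the 16 data bits of each 20-bit flit as payload, and ignores header and acknowledgement traffic. The function name is hypothetical.

```python
def payload_bandwidth_bytes_per_s(clock_hz: float,
                                  flit_bits: int = 20,
                                  data_bits: int = 16) -> float:
    """Payload bytes per second on a bit-serial channel (one bit per clock assumed)."""
    payload_bits_per_s = clock_hz * data_bits / flit_bits
    return payload_bits_per_s / 8


if __name__ == "__main__":
    rate = payload_bandwidth_bytes_per_s(20_000_000)
    print(rate / 1_000_000, "MB/s")
```

Under these assumptions a 20 MHz clock yields 16 Mbit/s of data, or 2 MB/s, one tenth of the 20 MB/s quoted for the 8-bit-wide TRC channel at the same clock rate.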
Wherefore, it is an object of the present invention to provide a new method for routing message packets in a message-passing, multicomputer system which will allow the routing processor to provide good performance with a minimum amount of silicon area on the chip consumed thereby.
It is a further object of the present invention to provide a new element for use in a routing processor for routing message packets in a message-passing, multicomputer system.
It is still another object of the present invention to provide a multifunction node chip for use in fine grain message-passing, multicomputer systems incorporating a router for routing message packets in a manner to provide good performance with a minimum amount of silicon area on the chip consumed thereby.
Other objects and benefits of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.