A multicomputer system can generally be described as a system that includes a set of interconnected computers, i.e., processor and memory pairs. System communication and synchronization are performed by the exchange of messages between the computers. A simple example of a multicomputer system is a system including a set of serially connected computers, each of which performs a unique function. The set of functions carried out by the computers defines the overall multicomputer function. In operation, information or data in the form of a message is routed to the first computer, which performs its particular function and passes the output to the next computer. The cascading of information continues until the final output or result is produced at the last computer. In more complex systems, the information paths are not so deterministic nor are the interconnections so simple. Multicomputer systems are often adopted in order to increase the speed of processing.
A multiprocessor system is similar in theory to a multicomputer system, in that the multiprocessor system includes multiple processors and memories. The distinguishing factor is that a multiprocessor system may group the processors together, separately from the memory components. Each processor may then be connected to each memory. Thus, multicomputer and multiprocessor systems each include processors that must communicate with one another, although the processor and memory interconnections may be different. For ease of description, multicomputer systems will be referred to in this application. However, it is to be understood that the discussion, unless otherwise noted, is also applicable to multiprocessor systems.
Individual computers in multicomputer systems are often referred to as nodes. Each node is connected to one or more nodes by communication lines or channels. The nodes are connected in a variety of configurations such as hypercubes, meshes, tori, and three-dimensional cube meshes. The particular interconnection configuration, and specifically the dimension of the system, determines how many channels are connected to each node. For example, in a three-dimensional bidirectional system each node may send or receive a message along six different channels. Although numerous geometric configurations exist, multicomputer configurations are physically limited by the number of wires or channels that can be connected to a particular node and interconnected in a workable system.
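The relationship between dimension and channel count can be illustrated with a minimal sketch. It assumes a torus in which every dimension has the same number of nodes and wraps around; the function name `torus_neighbors` and the coordinate representation are hypothetical and are used only for illustration.

```python
from typing import List, Tuple

Node = Tuple[int, ...]


def torus_neighbors(node: Node, size: int) -> List[Node]:
    """Return the nodes directly connected to `node` in a torus.

    Assumption: each dimension holds `size` nodes and wraps around, so a
    node has one neighbor in each of the two directions of every dimension.
    """
    neighbors = []
    for dim in range(len(node)):
        for step in (-1, +1):
            coords = list(node)
            coords[dim] = (coords[dim] + step) % size
            neighbors.append(tuple(coords))
    return neighbors


# A three-dimensional bidirectional system: two channels per dimension.
channels = torus_neighbors((0, 0, 0), size=4)
print(len(channels))
```

For a three-dimensional torus with at least three nodes per dimension, the list contains six distinct neighbors, matching the six channels described above.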
During normal operation, many messages may be moving through a multicomputer system. Each message includes routing information identifying its destination and possibly its source. If the source and destination nodes are not directly connected, the message must be routed through intermediate nodes. In certain multicomputer systems, each computer includes a routing subsystem to control internode routing. Each routing subsystem includes a routing controller, channel input/output components, and usually memory in the form of message buffers. The routing subsystem may be integrated into or separate from the node computers.
A simple prior art multicomputer architecture is illustrated in FIG. 1. The system is three-dimensional, having dimensions or directions (x,y,z). If a message is to be passed from node A to node D, the message must travel two nodes in the x-direction, one node in the y-direction and zero nodes in the z-direction. Thus, the initial routing information for the message might be represented as (2,1,0). Each time the message is routed along a particular dimension, the address is updated. For example, if the message is routed from node A to node B, the y routing value is decremented by one since the message is one step closer to its destination in the y-direction. The routing address is updated at node B to (2,0,0). A few possible paths for the delivery of the message are: A-B-C-D, A-E-C-D and A-E-F-D. In some multicomputer systems, routing subsystems are responsible for the message routing choices at each node.
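The routing address update described above can be sketched as follows. This is an illustrative sketch only: the function names `hop` and `delivered` are hypothetical, and the handling of negative offsets is an assumption made to cover bidirectional channels.

```python
from typing import Tuple

Offset = Tuple[int, ...]

X, Y, Z = 0, 1, 2  # indices of the dimensions in the routing address


def hop(offset: Offset, dim: int) -> Offset:
    """Return the routing address after one hop toward the destination along `dim`."""
    remaining = offset[dim]
    if remaining == 0:
        raise ValueError("no hops remain along this dimension")
    # Move the value one step closer to zero. Treating a negative value as
    # travel in the opposite direction is an assumption of this sketch.
    step = 1 if remaining > 0 else -1
    updated = list(offset)
    updated[dim] = remaining - step
    return tuple(updated)


def delivered(offset: Offset) -> bool:
    """A message has arrived when no hops remain in any dimension."""
    return all(value == 0 for value in offset)


at_node_a = (2, 1, 0)          # initial routing information for A to D
at_node_b = hop(at_node_a, Y)  # routed from A to B along the y-direction
print(at_node_b)               # (2, 0, 0)
```

Two further hops along the x-direction reduce the address to (0,0,0), at which point the message is at its destination.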
Because message passing between the nodes is central to the operation of a multicomputer system, the determination of the message path between nodes is extremely important. There are three essential properties of routing in a successful system: the router should be free from deadlock, livelock, and starvation.
Deadlock in a multicomputer system occurs when some messages are unable to move regardless of the future (normal) activity of the system. One cause of deadlock is that one or more nodes fail and as a result all messages destined to or through the failed nodes clog up the system. In a deadlocked system, the ultimate result is that the deadlocked messages do not reach their destinations.
Livelock is a serious problem that occurs when a message continually circulates in the network and never reaches its destination. This can happen in a system wherein messages can be derouted, i.e., sent away from their destinations, in order to avoid congestion or deadlock. In the example described above, a message might be derouted so that its node path is A-B-C-E-F-D. The message was derouted when it was sent from C to E since it was routed further away from its destination. Livelock would occur if the message were continually sent along the circular path A-B-C-E-A . . . on its way to node D. One standard solution to livelock is to timestamp every message and use the timestamp to prioritize message delivery decisions. When multiple messages are to be routed along the same dimension, the oldest message is selected and routed first. Eventually, each message ages enough to be delivered along its preferred dimension. The problems with this prioritized solution are that the process of selecting the oldest message complicates and slows the routing decision and that the timestamp portion of the message, which must be sufficiently large not to overflow, adds bits to the message.
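The timestamp-based priority scheme and its two costs can be illustrated with a short sketch. The `Message` type, the function names, and the tick-based clock are hypothetical and are not taken from any particular prior art router.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Message:
    ident: str
    timestamp: int  # injection time in clock ticks; smaller means older


def select_oldest(contenders: Iterable[Message]) -> Message:
    """Pick the oldest of the messages competing for the same dimension.

    The comparison over all contenders is the extra work that complicates
    and slows the routing decision.
    """
    return min(contenders, key=lambda message: message.timestamp)


def timestamp_bits(max_ticks: int) -> int:
    """Bits needed per message so the timestamp can count to `max_ticks` without overflow."""
    return max(1, max_ticks.bit_length())


waiting = [Message("m1", 40), Message("m2", 17), Message("m3", 25)]
print(select_oldest(waiting).ident)  # m2
print(timestamp_bits(1_000_000))     # 20
```

Under these assumptions, a clock that must count one million ticks adds 20 bits to every message, in addition to the comparison logic required at each node.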
Finally, starvation occurs when a node cannot inject its own messages into the system and thereby loses its ability to initiate messages. This occurs when the node's message buffer is always full because "through" messages initiated by and destined for other nodes are always filling the channels. One main goal of a message routing system related to starvation-freedom is to limit the delay before a processor can inject a message.
One prior art routing system is an oblivious router that completely determines a message's path by the message's (source node address, destination node address) pair. Such a router dispatches the messages in a manner analogous to a group of commuters who daily leave their houses and follow a predetermined fixed path to their work places. If the commuters or messages do not interfere with one another, they go directly to their destination. But since only one path is used, the commuters or messages must wait if there are commuters or messages ahead of them. Oblivious routers require only relatively simple logic in order to route messages and to guarantee deadlock freedom. (Such routers are not subject to livelock.) As a result, oblivious routers can be very fast under light to moderate random traffic. However, such routers may experience a relatively high rate of delay under heavy traffic or local congestion. These routers are fault intolerant.
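As an illustration of a path that is fully determined by the (source, destination) pair, the sketch below uses dimension-order routing, in which a message completes all hops in one dimension before starting the next. Dimension-order routing is chosen here only as a familiar example of an oblivious scheme; the text above does not name a specific algorithm, and the coordinates and the function name `oblivious_route` are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[int, ...]


def oblivious_route(source: Node, destination: Node) -> List[Node]:
    """Return the single fixed path from `source` to `destination`.

    Assumption: nodes are addressed by mesh coordinates and dimensions are
    corrected in index order (x, then y, then z). No load or fault
    information is consulted, so the same pair always yields the same path.
    """
    path = [tuple(source)]
    current = list(source)
    for dim in range(len(current)):
        step = 1 if destination[dim] > current[dim] else -1
        while current[dim] != destination[dim]:
            current[dim] += step
            path.append(tuple(current))
    return path


print(oblivious_route((0, 0, 0), (2, 1, 0)))
# [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)]
```

Because no alternative is ever considered, a congested or failed node on this path delays or blocks every message assigned to it, which is consistent with the fault intolerance noted above.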
Routers of another kind are randomized routers, which are meant to increase message delivery speeds over oblivious routers. One type of randomized router sends each message from its source node to a randomly selected intermediate node and from there to its destination node. The route is predetermined to meet these criteria. The practical problem with this type of router is that the length of the average message path doubles because the intermediate node is not necessarily in a direct path between the source and destination nodes. Moreover, the router penalizes average random traffic to improve routing for relatively infrequent worst case traffic.
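The doubling of the average path length can be checked with a small sketch. It assumes a mesh without wraparound, hop counts measured as the sum of coordinate differences, and an intermediate node chosen uniformly from all nodes; the function names are hypothetical.

```python
import itertools
from typing import Tuple

Node = Tuple[int, ...]


def distance(a: Node, b: Node) -> int:
    """Minimum number of hops between two nodes of a mesh."""
    return sum(abs(x - y) for x, y in zip(a, b))


def two_phase_length(source: Node, destination: Node, intermediate: Node) -> int:
    """Hops travelled when the message is first sent to the intermediate node."""
    return distance(source, intermediate) + distance(intermediate, destination)


def average_lengths(size: int, dims: int) -> Tuple[float, float]:
    """Average direct and two-phase path lengths over every possible choice."""
    nodes = list(itertools.product(range(size), repeat=dims))
    pairs = len(nodes) ** 2
    direct = sum(distance(s, d) for s in nodes for d in nodes)
    two_phase = sum(
        two_phase_length(s, d, m) for s in nodes for d in nodes for m in nodes
    )
    return direct / pairs, two_phase / (pairs * len(nodes))


direct_avg, two_phase_avg = average_lengths(size=3, dims=2)
print(direct_avg, two_phase_avg)
```

Under these assumptions the two-phase average is twice the direct average, because each of the two phases is itself a trip between two uniformly chosen nodes.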
Another alternative is an adaptive router that selects message paths based on the local load characteristics at its node. Adaptive routers can avoid local node congestion by exploiting alternative paths to a destination that can be selected locally. Such routers are more fault tolerant than oblivious routers since alternative routes avoiding nonfunctional nodes can be used. Fault tolerance is increasingly important as multicomputer systems get larger. One type of adaptive router is a minimal adaptive router, which always routes messages closer to their destinations. However, such routers do not allow derouting or misrouting, i.e., sending a message further from its destination in the presence of congestion. In nonminimal adaptive routers, derouting is allowed. Such routers may be better at handling nonuniform traffic than the minimal routers. However, although potentially fast and robust, current nonminimal adaptive routers are slowed by the standard, complex livelock protection mechanisms such as priority techniques.
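A minimal adaptive routing decision might be sketched as follows. The use of per-dimension output queue lengths as the local load measure, and the function name `choose_dimension`, are assumptions made for illustration; the text above does not specify how load is measured.

```python
from typing import Optional, Sequence


def choose_dimension(offset: Sequence[int], queue_length: Sequence[int]) -> Optional[int]:
    """Pick the least loaded dimension that moves the message closer to its destination.

    `offset` holds the hops remaining in each dimension, and `queue_length`
    holds a hypothetical local load measure for each dimension's outgoing
    channel. Only dimensions with hops remaining are considered, so the
    message is never derouted. Returns None once the message has arrived.
    """
    productive = [dim for dim, remaining in enumerate(offset) if remaining != 0]
    if not productive:
        return None
    return min(productive, key=lambda dim: queue_length[dim])


# The z channel is idle, but z is not a productive dimension for this
# message, so the lightly loaded y channel is selected instead.
print(choose_dimension(offset=(2, 1, 0), queue_length=(5, 1, 0)))  # 1
```

A nonminimal adaptive router would differ in that, when every productive channel is congested, it could also select a channel outside the `productive` list, which is the derouting that introduces the possibility of livelock.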
The present invention solves these and other problems in the prior art using a relatively simple solution that exploits the inherent asynchrony in independent node processes coupled with a random derouting selection process. The result is a system that is deadlock free and probabilistically and operationally livelock free.