1. Field of the Invention
The present invention relates to multi-node computer systems.
2. Description of the Related Art
Multi-node computer systems have been provided to promote processing ability and speed. An example of such a system is IBM's Blue Gene petaflop supercomputer, which can have 32,000 nodes, with each node being established by a chip having perhaps dozens of microprocessors.
In a multi-node system, processing is undertaken by the nodes acting in concert with each other. Accordingly, the nodes communicate with each other in one of various network topologies, such as grids, meshes, hypercubes, and torus graphs.
Regardless of the topology, however, it is possible that one or more nodes or links between nodes might fail. “Fault tolerance” is a term that refers to the ability of a multi-node system to continue to operate effectively in the presence of such failures.
Specifically, when faults in the network occur, processing formerly undertaken by failed nodes must be assumed by the remaining good nodes, and messages between nodes must be routed around faulty nodes and links. Representative of past solutions to the route-around problem are those set forth in Boppana et al., “Fault-Tolerant Wormhole Routing Algorithms for Mesh Networks”, IEEE Trans. on Computers, 44: 848-864 (1995) and Chalasani et al., “Communication in Multicomputers with Nonconvex Faults”, IEEE Trans. on Computers, 46: 616-622 (1997), incorporated herein by reference. Boppana et al. disclose a method for message route-around that uses only two virtual channels to avoid a message routing interference problem known as “deadlock”, provided that the fault regions are rectangular and the fault rings (non-faulty boundaries around fault regions) do not overlap. As used herein, “k virtual channels” means that a physical channel (communication link) is shared by k different channels, typically in a round-robin manner. Thus, the larger “k” is, the greater the hardware cost of manufacturing the communication link.
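By way of illustration only, the round-robin sharing of one physical channel among k virtual channels can be sketched as follows. The function name `round_robin_link`, the use of one queue per virtual channel, and the message unit labels are assumptions made for this sketch and are not drawn from the cited references.

```python
from collections import deque


def round_robin_link(virtual_channels):
    """Yield (channel_index, unit) pairs in the order in which a single
    physical link might carry them when it is shared round-robin.

    `virtual_channels` is a list of k sequences, one per virtual channel
    (a hypothetical representation chosen for this sketch).
    """
    queues = [deque(vc) for vc in virtual_channels]
    # Keep cycling over the channels until every queue is drained.
    while any(queues):
        for index, queue in enumerate(queues):
            if queue:  # an empty virtual channel simply skips its turn
                yield index, queue.popleft()


# Two virtual channels (k = 2) sharing one physical link.
schedule = list(round_robin_link([["a0", "a1", "a2"], ["b0"]]))
print(schedule)
```

In this sketch each additional virtual channel adds a queue that must be maintained for the link, which is one way of viewing why a larger k implies a greater hardware cost.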
The Boppana et al. method is extended to regions such as crosses, “L”s, and “T”s by Chalasani et al. using four virtual channels and again assuming that fault rings do not overlap. Chen et al., “A Fault-Tolerant Routing Scheme for Meshes with Nonconvex Faults”, IEEE Trans. on Parallel and Distributed Systems, 12: 467-475 (2001), improve on Chalasani et al. in that fault rings are allowed to overlap, and only three virtual channels are required.
Unfortunately, as can be appreciated from the above discussion, the latter two methods, while improving on the relatively limited applicability of Boppana et al., require more than two virtual channels to guarantee deadlock avoidance. Moreover, all of the above-referenced methods assume that the number of “turns” in message routing through the system is not an issue, which may not be the case in practical implementations from a performance standpoint.
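As a purely illustrative sketch of why the number of turns can grow when a message is routed around a fault, the following counts the turns in a route through a two-dimensional mesh. It assumes, for this sketch only, that a “turn” is a change of direction between consecutive hops and that a route is a list of (x, y) node coordinates; the function name `count_turns` and the example routes are hypothetical.

```python
def count_turns(path):
    """Count direction changes along a route given as (x, y) coordinates.

    Assumption for this sketch: a "turn" is any hop whose direction
    differs from that of the hop immediately before it.
    """
    turns = 0
    previous_step = None
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        step = (x1 - x0, y1 - y0)
        if previous_step is not None and step != previous_step:
            turns += 1
        previous_step = step
    return turns


# A straight route along one row of the mesh.
direct_route = [(0, 0), (1, 0), (2, 0), (3, 0)]

# The same source and destination, detouring around a faulty node at (1, 0).
detour_route = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 0), (3, 0)]

print(count_turns(direct_route), count_turns(detour_route))
```

Under these assumptions the straight route has no turns, while the detour around a single faulty node has three.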
The present invention has recognized the above-noted problems and provides solutions to one or more of them as disclosed below.