1. Technical Field
This invention generally relates to fault recovery in a parallel computing system, and more specifically relates to an apparatus and method that use hint bits to dynamically reroute node traffic on the compute nodes of a massively parallel computer system without restarting the applications executing on that system.
2. Background Art
Efficient fault recovery is important for reducing the downtime and repair costs of sophisticated computer systems. On parallel computer systems with a large number of compute nodes, the failure of a single component may cause a large portion of the computer, or the entire computer, to be taken offline for repair. Restarting an application after such a failure may waste a considerable amount of the processing time completed before the failure.
Massively parallel computer systems are one type of parallel computer system that has a large number of interconnected compute nodes. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with two CPUs and memory. The full computer is housed in 64 racks, or cabinets, with 32 node boards in each rack.
The Blue Gene/L supercomputer communicates over several communication networks. The 65,536 computational nodes are arranged into both a logical tree network and a three-dimensional torus network. The logical tree network connects the computational nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its six nearest neighbors in a section of the computer. Since the compute nodes are arranged in torus and tree networks that require communication with adjacent nodes, a hardware failure of a single node can bring a large portion of the system to a standstill until the faulty hardware can be repaired. For example, a single node failure could render inoperable a complete section of the torus network, where a section of the torus network in the Blue Gene/L system is half a rack, or 512 nodes. Further, all the hardware assigned to the partition in which the failure occurred may also need to be taken offline until the failure is corrected.
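The six-neighbor connectivity of the torus network described above can be illustrated with a short sketch. It is an illustration only, not the routing logic of the Blue Gene/L hardware: the function name torus_neighbors, the (x, y, z) coordinate scheme, and the 8 x 8 x 8 shape chosen for a 512-node section are assumptions made for the example. The point it shows is that a torus wraps around at its edges, so every node, including one on a boundary, has exactly six neighbors and therefore depends on six adjacent nodes being operational.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of `coord` in a 3D torus of shape `dims`.

    Hypothetical helper for illustration; coordinates are (x, y, z) tuples.
    The modulo operation provides the wraparound links that distinguish a
    torus from a plain 3D mesh.
    """
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z),  # +x neighbor
        ((x - 1) % dx, y, z),  # -x neighbor
        (x, (y + 1) % dy, z),  # +y neighbor
        (x, (y - 1) % dy, z),  # -y neighbor
        (x, y, (z + 1) % dz),  # +z neighbor
        (x, y, (z - 1) % dz),  # -z neighbor
    ]


# Assumed shape for one 512-node section: 8 * 8 * 8 = 512.
SECTION_DIMS = (8, 8, 8)

# A node on the boundary of the section still has six distinct neighbors,
# because the links wrap around to the opposite face.
for neighbor in torus_neighbors((0, 0, 7), SECTION_DIMS):
    print(neighbor)
```

Under these assumptions, the node at (0, 0, 7) is linked to (0, 0, 0) through the wraparound in the z dimension, which is why the loss of one node affects traffic that would otherwise pass through it from any of six directions.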
On large parallel computer systems in the prior art, a failure of a single node during execution often requires that the software application be restarted from the beginning or from a saved checkpoint. When a failure event occurs, it would be advantageous to be able to move the processing of a failed node to another node so that the application can resume on the backup hardware with minimal delay, thereby increasing overall system efficiency. Without a way to recover more effectively from failed or failing nodes, parallel computer systems will continue to waste potential processing time, which increases operating costs.