1. Technical Field
This invention generally relates to backup and fault recovery in a computing system, and more specifically relates to an apparatus for fast backup of compute nodes in a massively parallel supercomputer.
2. Background Art
Efficient fault recovery is important to decrease down time and repair costs for sophisticated computer systems. On parallel computer systems with a large number of compute nodes, a failure of a single component may cause a large portion of the computer to be taken off line for repair.
Massively parallel computer systems are one type of parallel computer system that has a large number of interconnected compute nodes. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets with 32 node boards in each rack.
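The packaging figures above imply the per-rack and per-board node counts. The following short sketch works through that arithmetic; the constant names are illustrative only, and the sketch assumes the nodes are distributed evenly across racks and node boards.

```python
# Figures stated in the text above.
TOTAL_NODES = 65_536
RACKS = 64
NODE_BOARDS_PER_RACK = 32
CPUS_PER_NODE = 2

# Derived values, assuming an even distribution of nodes.
nodes_per_rack = TOTAL_NODES // RACKS
nodes_per_board = nodes_per_rack // NODE_BOARDS_PER_RACK
total_cpus = TOTAL_NODES * CPUS_PER_NODE

print(nodes_per_rack, nodes_per_board, total_cpus)
```

Under that assumption, each rack holds 1,024 nodes and each node board holds 32 nodes.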
The Blue Gene/L supercomputer communicates over several communication networks. The 65,536 computational nodes are arranged into both a logical tree network and a 3-dimensional torus network. The logical tree network connects the computational nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its closest 6 neighbors in a section of the computer. Because the compute nodes are arranged in torus and tree networks that require communication with adjacent nodes, a hardware failure of a single node can bring a large portion of the system to a standstill until the faulty hardware can be repaired. For example, a single node failure could render inoperable a complete section of the torus network, where a section of the torus network in the Blue Gene/L system is half a rack, or 512 nodes. Further, all the hardware assigned to the partition of the failure may also need to be taken off line until the failure is corrected.
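The neighbor relationships described above can be illustrated with a minimal sketch. The function names, the 8 x 8 x 8 dimensions chosen for a 512-node section, and the binary-heap style numbering of the tree are assumptions made for illustration; they are not taken from the Blue Gene/L design.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord = (8, 8, 8)) -> List[Coord]:
    """Return the 6 nearest neighbors of a node in a 3-D torus.

    Coordinates wrap around at the edges, so every node has exactly
    one neighbor in each direction along each of the three axes.
    dims is a hypothetical section size (8 * 8 * 8 = 512 nodes).
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def tree_links(rank: int, node_count: int) -> Tuple[Optional[int], List[int]]:
    """Return (parent, children) for a node in a binary tree.

    Assumes a hypothetical heap-style numbering: the root is rank 0
    and the children of rank r are 2r + 1 and 2r + 2 when they exist.
    """
    parent = (rank - 1) // 2 if rank > 0 else None
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < node_count]
    return parent, children


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0)))
    print(tree_links(1, 5))
```

Because every node depends on all six torus neighbors and on its tree parent and children, removing one node from either structure leaves its neighbors without a required link, which is why a single failure can idle a whole section.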
On large parallel computer systems in the prior art, a failure of a single node during execution often requires that the data of an entire partition of the computer be saved to an external file system so the partition can be taken off line. The data must then be reloaded to a backup partition for the job to resume. When a failure event occurs, it is advantageous to be able to save the data of the software application quickly so that the application can resume on the backup hardware with minimal delay, thereby increasing overall system efficiency. Without a way to more effectively save the software state and data, parallel computer systems will continue to waste potential computer processing time and increase operating and maintenance costs.