1. Technical Field
This invention generally relates to massively parallel computing systems and development, and more specifically relates to re-utilizing partially failed compute resources as network resources.
2. Background Art
Supercomputers continue to be developed to tackle sophisticated computing jobs. These computers are particularly useful to scientists for high performance computing (HPC) applications including life sciences, financial modeling, hydrodynamics, quantum chemistry, molecular dynamics, astronomy and space research, and climate modeling. Supercomputer developers have focused on massively parallel computer structures to meet the demand for increasingly complex computation. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/P system is a scalable system in which the maximum projected number of compute nodes is 73,728. The Blue Gene/P node consists of a single ASIC (application specific integrated circuit) with 4 CPUs and memory. The full computer would be housed in 72 racks or cabinets with 32 node boards in each.
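By way of illustration only, the packaging figures above can be cross-checked with a short calculation. The number of nodes per node board is not stated in this section; the sketch below derives it from the stated totals, and all constant and function names are hypothetical.

```python
# Figures stated in the text for the full Blue Gene/P configuration.
RACKS = 72
NODE_BOARDS_PER_RACK = 32
TOTAL_NODES = 73_728
CPUS_PER_NODE = 4


def nodes_per_board(total_nodes, racks, boards_per_rack):
    """Derive the nodes per node board implied by the stated totals."""
    boards = racks * boards_per_rack
    if total_nodes % boards:
        raise ValueError("stated totals do not divide evenly across node boards")
    return total_nodes // boards


# Derived values (inferred from the totals, not stated directly in the text).
NODES_PER_BOARD = nodes_per_board(TOTAL_NODES, RACKS, NODE_BOARDS_PER_RACK)
NODES_PER_RACK = NODES_PER_BOARD * NODE_BOARDS_PER_RACK
TOTAL_CPUS = TOTAL_NODES * CPUS_PER_NODE

print(NODES_PER_BOARD, NODES_PER_RACK, TOTAL_CPUS)
```

Under these figures each rack would hold 1,024 nodes, so that half a rack corresponds to 512 nodes.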
The Blue Gene/P supercomputer communicates over several communication networks. The 73,728 computational nodes are arranged into both a logical tree network and a logical 3-dimensional torus network according to the prior art. The logical tree network connects the computational nodes in a binary tree structure so that each node communicates with a parent and two children. The torus network logically connects the compute nodes in a lattice-like structure that allows each compute node 110 to communicate with its closest 6 neighbors. Since the compute nodes are arranged in torus and tree networks that require communication with adjacent nodes, a hardware failure of a single node in the prior art can bring a large portion of the system to a standstill until the faulty hardware can be repaired. This catastrophic failure occurs because a single node failure would break the network structures and prevent communication over these networks. For example, a single node failure would isolate a complete section of the torus network, where a section of the torus network in the Blue Gene/P system is half a rack or 512 nodes.
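The two logical topologies described above may be sketched as follows. This is a minimal illustration, not the Blue Gene/P implementation: the 8 x 8 x 8 torus dimensions (chosen to give 512 nodes), the heap-order numbering of tree nodes, and all function names are assumptions made for the example.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of coord in a 3D torus of size dims.

    Coordinates wrap around at the edges, which is what makes the lattice
    a torus rather than a plain mesh.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def tree_links(index, node_count):
    """Return (parent, children) of a node in a binary tree.

    Assumes heap-order numbering: the root is 0 and the children of node i
    are 2i + 1 and 2i + 2. This numbering is illustrative only.
    """
    parent = (index - 1) // 2 if index > 0 else None
    children = [c for c in (2 * index + 1, 2 * index + 2) if c < node_count]
    return parent, children


def cut_off_by_failure(failed, node_count):
    """List the nodes that lose their path to the root if `failed` stops forwarding."""
    lost = []
    stack = [failed]
    while stack:
        node = stack.pop()
        _, children = tree_links(node, node_count)
        lost.extend(children)
        stack.extend(children)
    return sorted(lost)


print(torus_neighbors((0, 0, 0), (8, 8, 8)))
print(tree_links(1, 15))
print(cut_off_by_failure(1, 15))
```

In this sketch, the failure of tree node 1 in a 15-node tree cuts off nodes 3, 4, 7, 8, 9 and 10 even though those nodes are themselves healthy, which illustrates why a single failed node can disable a much larger portion of the network.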
On a massively parallel supercomputer system like Blue Gene, the mean time before failure of a hardware component may be measured in hours, while the complex computing programs described above may take several hours to several days to run. Thus, it is advantageous to be able to continue to operate the system if there is a failure of an individual compute node or processor, in order to decrease the overall system down time. A parallel computer system could potentially continue processing with only slightly diminished capability when a single compute node has failed, provided the network structure is still viable. Without a way to utilize partially failed compute resources, supercomputers will need to continue to halt all processing for all hardware failures, thereby wasting potential computer processing time.
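The relationship between failure interval and job length can be made concrete with a rough calculation. The sketch below assumes failures arrive independently at a constant rate (an exponential model), which is an assumption of this example rather than a statement from the text; the 6 hour failure interval and 48 hour job are hypothetical values chosen only to fall within the ranges described above.

```python
import math


def expected_failures(job_hours, mtbf_hours):
    """Expected number of hardware failures during one job run."""
    return job_hours / mtbf_hours


def prob_no_failure(job_hours, mtbf_hours):
    """Probability that a job completes with no failure.

    Assumes exponentially distributed times between failures.
    """
    return math.exp(-job_hours / mtbf_hours)


# Hypothetical values: a failure every 6 hours on average, a 48 hour job.
JOB_HOURS = 48.0
MTBF_HOURS = 6.0

print(expected_failures(JOB_HOURS, MTBF_HOURS))
print(prob_no_failure(JOB_HOURS, MTBF_HOURS))
```

With these hypothetical values a job would encounter about eight failures on average and would very rarely finish without one, so a system that halts on every failure would seldom complete such a job in a single run.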