1. Technical Field
This invention generally relates to fault isolation in a computing system, and more specifically relates to autonomic fault isolation in a highly interconnected system such as a massively parallel supercomputer.
2. Background Art
Fault isolation is important for reducing down time and repair costs in sophisticated computer systems. In some such systems, when a failure occurs, the operating system software is able to generate a list of suspect field replaceable units (FRUs) that are potentially the location or cause of the failure. A technician can then change out the suspect FRUs to get the system operational again quickly. A highly interconnected system as used herein is a sophisticated computer system that has a large number of interconnected nodes such as compute nodes. Fault isolation in a highly interconnected system is much more complicated because a failure in one node may cause the system to report a failure in many adjacent interconnected nodes. From the raw failure data, it is difficult for a technician to determine which FRU is most likely the cause of the failure.
Massively parallel computer systems are one type of highly interconnected system that has a large number of interconnected compute nodes. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/P system is a similar scalable system under development. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer would be housed in 64 racks or cabinets with 32 node boards in each rack.
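The packaging figures above imply per-rack and per-board node counts that the text does not spell out. The following is a minimal arithmetic sketch; the constant and variable names are chosen here for illustration only and are not taken from any Blue Gene software.

```python
# Figures stated in the text above.
TOTAL_COMPUTE_NODES = 65_536
RACKS = 64
NODE_BOARDS_PER_RACK = 32
CPUS_PER_NODE = 2

# Derived quantities (simple division of the stated figures).
nodes_per_rack = TOTAL_COMPUTE_NODES // RACKS
nodes_per_node_board = nodes_per_rack // NODE_BOARDS_PER_RACK
total_cpus = TOTAL_COMPUTE_NODES * CPUS_PER_NODE

print(nodes_per_rack, nodes_per_node_board, total_cpus)
```

Under these figures each rack holds 1,024 compute nodes and each node board holds 32, so a single node board is a comparatively coarse unit when it is treated as a suspect FRU.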
The Blue Gene/L supercomputer communicates over several communication networks. The 65,536 computational nodes are arranged into both a logical tree network and a logical 3-dimensional torus network. The logical tree network connects the computational nodes in a binary tree structure so that each node communicates with a parent and two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its closest 6 neighbors in a section of the computer. Since the compute nodes are arranged in torus and tree networks that require communication with adjacent nodes, a hardware failure of a single node in the prior art can bring a large portion of the system to a standstill until the faulty hardware can be repaired. This catastrophic failure occurs because a single node failure would break the network structures and prevent communication over these networks. For example, a single node failure would isolate a complete section of the torus network, where a section of the torus network in the Blue Gene/L system is half a rack, or 512 nodes.
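The two topologies described above can be illustrated with a short sketch. It assumes, purely for illustration, that a 512-node torus section is laid out as 8 x 8 x 8 and that tree nodes are numbered in heap order (root at index 0); neither the dimensions nor the numbering scheme is stated in the text, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int, int]

# Assumed shape of one 512-node torus section (not stated in the text).
SECTION_DIMS: Coord = (8, 8, 8)


def torus_neighbors(node: Coord, dims: Coord = SECTION_DIMS) -> List[Coord]:
    """Return the 6 nearest neighbors of a node, wrapping at the edges."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # Modulo arithmetic provides the torus wrap-around link.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def tree_links(index: int, node_count: int) -> Tuple[Optional[int], List[int]]:
    """Return (parent, children) for a node in a heap-ordered binary tree."""
    parent = (index - 1) // 2 if index > 0 else None
    children = [c for c in (2 * index + 1, 2 * index + 2) if c < node_count]
    return parent, children


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0)))
    print(tree_links(5, 512))
```

In such a layout, a fault in one node may surface as errors reported by up to six torus neighbors as well as by its tree parent and children, which is consistent with the observation that raw failure reports alone do not point to a single FRU.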
On a massively parallel supercomputer system like Blue Gene, the time to troubleshoot a failure of a hardware component is critical. Thus it is advantageous to be able to quickly isolate a fault to an FRU to decrease the overall system down time. Without a way to more effectively isolate faults to FRUs, highly interconnected computers will continue to require manual effort to isolate faults, thereby wasting potential computer processing time and increasing maintenance costs.