1. Technical Field
This invention generally relates to fault recovery on a parallel computing system, and more specifically relates to fault recovery on a massively parallel supercomputer to handle node failures without ending an executing job.
2. Background Art
Supercomputers continue to be developed to tackle sophisticated computing jobs. These computers are particularly useful to scientists for high-performance computing (HPC) applications including life sciences, financial modeling, hydrodynamics, quantum chemistry, molecular dynamics, astronomy and space research, and climate modeling. Supercomputer developers have focused on massively parallel computer structures to meet this demand for increasingly complex computation.
One such massively parallel computer being developed by International Business Machines Corporation (IBM) is the Blue Gene system. The Blue Gene system is a scalable system in which the maximum number of compute nodes is 65,536. Each node consists of a single ASIC (application-specific integrated circuit) and memory. Each node typically has 512 megabytes or 1 gigabyte of local memory. The full computer would be housed in 64 racks or cabinets that are closely arranged in a common location and interconnected with several networks. Each of the racks has 32 node boards, and each node board has 32 nodes with 2 processors for each node.
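The rack, node board, and node counts given above multiply out to the stated maximum of 65,536 compute nodes. The short sketch below merely checks that arithmetic; the constant and function names are hypothetical and are not part of the Blue Gene system, and the values are taken from the description above.

```python
# Illustrative arithmetic only. All names are hypothetical; the values
# come from the configuration described in the text.
RACKS = 64
NODE_BOARDS_PER_RACK = 32
NODES_PER_NODE_BOARD = 32
PROCESSORS_PER_NODE = 2


def total_nodes(racks=RACKS, boards=NODE_BOARDS_PER_RACK,
                nodes=NODES_PER_NODE_BOARD):
    """Return the number of compute nodes in the full system."""
    return racks * boards * nodes


def total_processors(procs_per_node=PROCESSORS_PER_NODE):
    """Return the number of processors across all compute nodes."""
    return total_nodes() * procs_per_node


if __name__ == "__main__":
    print(total_nodes())       # 65536
    print(total_processors())  # 131072
```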
The Blue Gene supercomputer's 65,536 computational nodes and 1024 I/O processors are arranged into both a logical tree network and a logical three-dimensional torus network. The logical tree network is a logical network on top of a collective network topology. Blue Gene can be described as a compute node core with an I/O node surface. Each I/O node handles the input and output function of 64 compute nodes. The I/O nodes have no local storage. The I/O nodes are connected to the compute nodes through the logical tree network and also have functional wide area network capabilities through their built-in gigabit Ethernet network. The nodes can be allocated into multiple node partitions so that individual applications or jobs can be executed on a set of Blue Gene's nodes in a node partition.
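Two properties of the arrangement described above can be illustrated with a brief sketch: the ratio of 64 compute nodes per I/O node yields the stated 1024 I/O nodes, and in a three-dimensional torus each node has six nearest neighbors because every axis wraps around. The function names and the 8 x 8 x 8 example dimensions below are assumptions chosen for illustration; the text above does not specify torus dimensions or any addressing scheme.

```python
# Illustrative sketch only. Function names and the example torus
# dimensions are hypothetical; they are not taken from the Blue Gene design.
COMPUTE_NODES = 65536
COMPUTE_NODES_PER_IO_NODE = 64


def io_node_count(compute_nodes=COMPUTE_NODES,
                  per_io_node=COMPUTE_NODES_PER_IO_NODE):
    """Return how many I/O nodes are needed to serve the compute nodes."""
    return compute_nodes // per_io_node


def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of coord in a 3D torus.

    Each axis wraps around, so a node on the edge of the grid is
    adjacent to the node on the opposite edge.
    """
    neighbors = []
    for axis in range(3):
        for delta in (-1, 1):
            n = list(coord)
            n[axis] = (coord[axis] + delta) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


if __name__ == "__main__":
    print(io_node_count())  # 1024
    # Assumed 8 x 8 x 8 torus, used only as an example size.
    print(torus_neighbors((0, 0, 0), (8, 8, 8)))
```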
Soft failures in a computer system are errors or faults that are not due to a recurring hardware failure or hard fault. A soft failure can be caused by random events such as alpha particles and noise. In most computer systems, such soft failures are quite infrequent and can be dealt with in traditional ways. In a massively parallel computer system like Blue Gene, the problem of soft and hard failures is significantly increased due to the complexity of the system and the number of compute nodes in the system. Further, in the prior art, a failure in a single node can cause a whole partition of the computer system to become unusable or require a job executing on a partition to be aborted and restarted.
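The effect of node count on failure exposure can be illustrated with a simple probability model. The sketch below assumes, purely for illustration, that each node fails independently with the same probability p during a job; under that assumption the chance that at least one of N nodes fails is 1 - (1 - p)^N. Neither the independence assumption nor the example value of p comes from the description above.

```python
# Illustrative model only. It assumes independent, identically likely
# node failures, which is a simplification not stated in the text.
def prob_any_failure(p_node, n_nodes):
    """Return the probability that at least one of n_nodes fails.

    p_node is the assumed probability that a single node fails
    during the period of interest.
    """
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    p = 1e-6  # hypothetical per-node failure probability
    for n in (1, 1024, 65536):
        print(n, prob_any_failure(p, n))
```

Under this model, a per-node failure probability that is negligible for a single machine becomes appreciable across 65,536 nodes, which is why aborting an entire job on any single node failure is costly.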
Computer system downtime and job restarts waste valuable system resources. Without a way to recover more effectively from system faults caused by soft failures, parallel computer systems will continue to suffer from inefficient utilization of hardware and unnecessary computer downtime.