1. Field
The present invention relates to fault resilience in computing systems. Fault-resilient computer programs are required in a wide range of application areas, ranging from simple computations to image rendering and large-scale, complex simulations, including on-the-fly and offline processing. As one important example, mission-critical jobs (e.g. operational weather forecasting) or systems (e.g. the internet) must be resilient to failure. This invention addresses the full range of these application areas, and is focused particularly on distributed, parallel computer programs running on very large high-performance computing systems with data distributed over a number of CPUs.
2. Description of the Related Art
Computationally intensive applications are usually carried out on high performance computer systems. Such high performance computer (HPC) systems often provide distributed environments in which there is a plurality of processing units or cores on which processing threads of an executable can run autonomously in parallel.
Many different hardware configurations and programming models are applicable to high performance computing. A popular approach to high-performance computing currently is the cluster system, in which a plurality of nodes each having one or more multicore processors (or “chips”) are interconnected by a high-speed network. Each node is assumed to have its own area of memory, which is accessible to all cores within that node. The cluster system can be programmed by a human programmer who writes source code, making use of existing code libraries to carry out generic functions. The source code is then compiled to lower-level executable code, for example code at the ISA (Instruction Set Architecture) level capable of being executed by processor types having a specific instruction set, or to assembly language dedicated to a specific processor. There is often a final stage of assembling (or, in the case of a virtual machine, interpreting) the assembly code into executable machine code. The executable form of an application (sometimes simply referred to as an “executable”) is run under supervision of an operating system (OS).
Applications for computer systems having multiple cores may be written in a conventional computer language (such as C/C++ or Fortran), augmented by libraries for allowing the programmer to take advantage of the parallel processing abilities of the multiple cores. In this regard, it is usual to refer to “processes” being run on the cores. A (multi-threaded) process may run across several cores within a multi-core CPU. One such library is the Message Passing Interface, MPI, which uses a distributed-memory model (each process being assumed to have its own area of memory), and facilitates communication among the processes. MPI allows groups of processes to be defined and distinguished, and includes routines for so-called “barrier synchronization”, which is an important feature for allowing multiple processes or processing elements to work together. Barrier synchronization is a technique of holding up all the processes in a synchronization group executing a program until every process has reached the same point in the program. This is achieved by an MPI function which must be called by all members of the group before execution can proceed further.
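By way of illustration only, the barrier behaviour described above may be sketched as follows. The sketch does not use MPI; it assumes that Python threads and the standard-library threading.Barrier stand in for the processes of a synchronization group, and the function name run_with_barrier is hypothetical.

```python
import threading

def run_with_barrier(num_workers=4):
    """Run num_workers threads that all pass through one shared barrier.

    Threads are a stand-in for MPI processes here (an assumption made
    purely for illustration). Returns the ordered event log.
    """
    barrier = threading.Barrier(num_workers)
    events = []                      # ordered log of (kind, worker_id)
    log_lock = threading.Lock()      # protects the shared log

    def worker(worker_id):
        with log_lock:
            events.append(("arrive", worker_id))
        # Blocks until every member of the group has called wait().
        barrier.wait()
        with log_lock:
            events.append(("proceed", worker_id))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

In the returned log, every "arrive" entry precedes every "proceed" entry, since no worker can pass the barrier until all workers have reached it.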
Alternatively, in shared-memory parallel programming, all processes or cores can access the same memory or area of memory. In a shared-memory model there is no need to explicitly specify the communication of data between processes (as any changes made by one process are visible to all others). However, it may be necessary to control access to the shared memory to ensure that only one process at a time modifies the data. In a “threaded” shared memory programming model, such as OpenMP, a single process can have multiple, concurrent execution paths (possibly one thread of execution per physical core available to the process).
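A minimal sketch of controlling access to shared data is given below. It is not OpenMP; it assumes Python threads sharing one counter, with a standard-library threading.Lock ensuring that only one thread at a time modifies the data. The name shared_counter_demo is hypothetical.

```python
import threading

def shared_counter_demo(num_threads=8, increments=10000):
    """Increment one shared counter from several threads under a lock."""
    counter = {"value": 0}           # data visible to all threads
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            # Only one thread at a time may perform the read-modify-write.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

No explicit communication between threads is needed: each update is made directly to the shared data, and the lock serializes the updates so that none is lost.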
The latest generation of supercomputers contains hundreds of thousands or even millions of cores. The three systems on the November 2012 TOP500 list with sustained performance over 10 Pflop/s contain 560,640 (Titan), 1,572,864 (Sequoia) and 705,024 (K computer) cores. In moving from petascale to exascale, the major performance gains will result from an increase in the total number of cores in the system (flops per core is not expected to increase) to 100 million or more. As the number of nodes in the system increases (and especially if low-cost, low-energy nodes are used to maintain an acceptable power envelope) the mean-time-to-component-failure of the system will decrease, eventually to a time shorter than the average simulation run on the system. Hence, it will be necessary for exascale software to be resilient to component failure.
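The scaling of failure time with node count may be sketched as follows, under the simplifying assumption (not taken from the foregoing) that nodes fail independently with a constant failure rate, so that the mean time to the first failure in the system is the per-node figure divided by the number of nodes. The function names and any figures passed to them are hypothetical.

```python
import math

def system_mttf_hours(node_mttf_hours, num_nodes):
    """Mean time to first failure of the whole system, in hours.

    Assumes independent nodes with exponentially distributed lifetimes,
    so the system failure rate is num_nodes times the node failure rate.
    """
    return node_mttf_hours / num_nodes

def prob_run_completes(run_hours, node_mttf_hours, num_nodes):
    """Probability that a run of run_hours sees no node failure at all."""
    system_rate = num_nodes / node_mttf_hours   # failures per hour
    return math.exp(-system_rate * run_hours)
```

Under this model, multiplying the number of nodes by a given factor divides the system mean time to failure by the same factor, and the probability of a failure-free run falls exponentially with both the node count and the run length.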
There are several methods that may be used to achieve fault-resilient exascale application software. These include:
    Improvements to MPI to automatically handle component failure in a manner that is invisible to the application.
    Development of new algorithms that can be implemented within software to allow it to compensate if one (or more) MPI task suffers a fault during execution.
    Improved methods to frequently (and rapidly) checkpoint massively parallel simulations in order that they can be restarted from a point immediately prior to the fault.
    Replication of work, so that tasks are identically executed by more than one processing element; if one processing element suffers a fault then the result from the other is generally still available.
    The use of task pools with reassignment, where a master process coordinates the execution of independent tasks and can reassign a task where the processor originally assigned the work fails.
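As an illustration of the last of these methods, a task pool with reassignment may be sketched as below. The sketch is sequential and uses a hypothetical fault model in which a named worker fails on its first assignment; the function run_task_pool and its parameters are assumptions made for illustration and do not correspond to any particular library.

```python
from collections import deque

def run_task_pool(tasks, workers, failing_workers):
    """Master loop that hands out independent tasks and requeues failures.

    tasks:           dict mapping task id -> zero-argument callable
    workers:         list of worker names
    failing_workers: set of workers that fail on first assignment
                     (hypothetical fault model)
    Returns a dict mapping task id -> (worker name, result).
    """
    pending = deque(tasks)           # task ids still to be completed
    alive = list(workers)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no workers left to run pending tasks")
        task_id = pending.popleft()
        worker = alive[turn % len(alive)]   # simple round-robin choice
        turn += 1
        if worker in failing_workers:
            alive.remove(worker)            # master drops the failed worker
            pending.append(task_id)         # and reassigns its task
            continue
        results[task_id] = (worker, tasks[task_id]())
    return results
```

Because the tasks are independent, only the task held by the failed worker is repeated; the remaining results are unaffected.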
There are problems with each of these prior art methods. An automatic MPI response to a fault may not be optimal for a particular application, so a developer may prefer to retain control of how faults are dealt with. Checkpointing (especially on very large systems) is time consuming and—if a fault occurs just before a checkpoint is due (or during a checkpoint)—may result in a large amount of computation having to be repeated. Replication of work is also expensive—and if the entire program function must be duplicated for fault resilience then, in effect, the available computing power is halved. Task pools with reassignment avoid the need to duplicate so much work, but for some applications (especially the very large applications expected to run on exascale systems) it may not be possible to break the work up into sufficiently fine-grained independent tasks.
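The cost of a fault occurring shortly before a checkpoint is due may be illustrated with the following toy sketch. It assumes a simulation whose state is simply the number of completed steps, a checkpoint written after every checkpoint_interval steps, and at most one fault, which strikes at the start of step fault_step; all names are hypothetical and the time taken to write a checkpoint is ignored.

```python
def run_with_checkpoints(total_steps, checkpoint_interval, fault_step=None):
    """Run a toy simulation with periodic checkpoints and one optional fault.

    Returns (final_state, steps_executed). steps_executed exceeds
    total_steps by the number of steps repeated after the rollback.
    """
    state = 0            # toy state: number of completed steps
    checkpoint = 0       # state saved at the most recent checkpoint
    executed = 0         # total steps actually computed, including repeats
    fault_pending = fault_step is not None
    step = 1
    while step <= total_steps:
        if fault_pending and step == fault_step:
            fault_pending = False
            state = checkpoint        # roll back to the last checkpoint
            step = checkpoint + 1     # and resume from the step after it
            continue
        state += 1                    # one step of (toy) work
        executed += 1
        if step % checkpoint_interval == 0:
            checkpoint = state        # save a checkpoint
        step += 1
    return state, executed
```

With a checkpoint every 10 steps, a fault at step 11 repeats no work, whereas a fault at step 20 (just before the second checkpoint would have been written) repeats nine steps.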
The inventor is aware of a related-art method for algorithm-based fault-tolerance based on the combination method. In this method, the combination method is used within a solver to overcome faults: solutions are computed on several coarse grids, and combined to produce a more accurate solution. A component failure in any one grid reduces the accuracy of the combined solution, but only within a known tolerance. However, there are drawbacks to this method also. In particular:
    It assumes that there is an underlying grid in the simulation. This is not necessarily the case for a general application.
    Failure of one node leads to other nodes also being unable to contribute to the solution (and nodes computing the solution on the coarse grid which the faulty node was working on are unused). If there are a large number of coarse grids this may not be a significant problem, but in general an application will want to exploit all resources available to it.
It is desirable to enable a simulation running over a plurality of processors to run to completion (and retain sufficient accuracy) even when one (or more) of the processors suffers a fault. This would be applicable particularly in exascale computing, in which applications such as simulations will be required to be run using many millions of processors and the likelihood of a small number of failures while the simulation is running is high.