1. Field of the Invention
The present invention relates to the recovery of data from a failed computational node, particularly but not exclusively in a parallel computing environment and in high performance computing (HPC) applications. The present invention finds application particularly in the field of fault-resilient distributed computing, with emphasis on exascale computers. Computationally intense and other large-scale applications are usually carried out on HPC systems. Such HPC systems often provide distributed environments in which there is a plurality of processing units or “cores” on which independent sequences of events such as processing threads or processes of an executable can run autonomously in parallel.
2. Description of the Related Art
Many different hardware configurations and programming models are applicable to HPC. A popular approach to HPC currently is the cluster system, in which a plurality of nodes, each having one or more multicore processors (or “chips”), are interconnected by a high-speed network. Each node is assumed to have its own area of memory, which is accessible to all cores within that node. The cluster system can be programmed by a human programmer who writes source code, making use of existing code libraries to carry out generic functions, such as hardware control. The source code is then compiled to lower-level executable code, for example code at the ISA (Instruction Set Architecture) level capable of being executed by processor types having a specific instruction set, or to assembly language dedicated to a specific processor. There is often a final stage of assembling (or, in the case of a virtual machine, interpreting) the assembly code into executable machine code. The executable form of an application (sometimes simply referred to as an “executable”) is run under supervision of an operating system (O/S) and uses the O/S and libraries to control hardware. The different layers of software used may be referred to together as a software stack.
The term “software stack” as used herein includes all the software required to run an application, including the base level software (the operating system or O/S); libraries interfacing, for example, with hardware components such as an interconnect between nodes, a disc or other memory (these libraries also being a type of system software); and the application itself. The application currently executing may be seen as the top layer of the software stack, above the system software.
Applications for computer systems having multiple cores may be written in a conventional computer language (such as C/C++ or Fortran), augmented by libraries for allowing the programmer to take advantage of the parallel processing abilities of the multiple cores. In this regard, it is usual to refer to “processes” being run on the cores. A (multi-threaded) process may run across several cores within a multi-core CPU and each node may contain one or more CPUs. One such library is the Message Passing Interface, MPI, which uses a distributed-memory model (each process being assumed to have its own area of memory), and facilitates communication among the processes. MPI allows groups of processes to be defined and distinguished, and includes routines for so-called “barrier synchronization”, which is an important feature for allowing multiple processes or processing elements to work together.
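By way of illustration only, the effect of barrier synchronization can be sketched as follows. The sketch does not use MPI: Python threads and `threading.Barrier` stand in for MPI processes and an MPI barrier, and the threads share a list purely so that the result can be inspected. The function name `two_phase_sum` is hypothetical.

```python
import threading

def two_phase_sum(values):
    """Each worker publishes one value, waits at a barrier, then sums all values."""
    n = len(values)
    published = [None] * n          # one slot per worker
    totals = [None] * n             # result computed by each worker
    barrier = threading.Barrier(n)  # releases only when all n workers have arrived

    def worker(rank):
        published[rank] = values[rank]  # phase 1: publish the local result
        barrier.wait()                  # no worker proceeds until all have published
        totals[rank] = sum(published)   # phase 2: every slot is now filled

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Without the barrier, a worker could begin the second phase while another worker's slot was still empty; with it, every worker computes the same total.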
Alternatively, in shared-memory parallel programming, all processes or cores can access the same memory or area of memory. In a shared-memory model there is no need to explicitly specify the communication of data between processes (as any changes made by one process are visible to all others). However, it may be necessary to use a library to control access to the shared memory to ensure that only one process at a time modifies the data.
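A minimal sketch of such access control is given below, assuming Python threads as the concurrent workers and a `threading.Lock` as the mechanism supplied by the library. The function name `run_shared_counter` and its parameters are hypothetical.

```python
import threading

def run_shared_counter(num_workers=4, increments=1000):
    """Each worker adds to one shared total; the lock serialises the updates."""
    state = {"total": 0}      # data visible to every worker thread
    lock = threading.Lock()   # guards modification of state["total"]

    def worker():
        for _ in range(increments):
            with lock:        # only one worker at a time may modify the data
                state["total"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["total"]
```

No data is sent between workers explicitly; each simply reads and writes the shared total, and the lock ensures that concurrent updates are not lost.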
Exascale computers (i.e. HPC systems capable of 1 exaflop (10^18 floating point operations per second) of sustained performance) are expected to be deployed by 2020. Several national projects to develop exascale systems in this timeframe have been announced. The transition from petascale (current state-of-the-art, approximately 10^15 flops) to exascale is expected to require disruptive changes in hardware technology. There will be no further increase in processor clock frequency, so the improved performance will result from an increase in parallelism or concurrency (possibly up to approximately 1 billion cores). The requirement to keep the power usage of an exascale system within an acceptable window means that low-power (and low-cost) components are likely to be used, resulting in a reduced mean-time-to-failure for each component. Thus, an exascale system will contain many more components than today's state-of-the-art systems, and each component is likely to fail more frequently than its equivalent today. It is likely that the mean-time-to-component-failure for an exascale system will be measured in minutes (as opposed to days for current systems).
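By way of illustration only, the dependence of the failure interval on component count can be estimated under a simplifying assumption not stated above: that components fail independently with exponentially distributed lifetimes, so that the failure rates add. The function name and the numerical figures below are hypothetical and are not taken from any measured system.

```python
def system_mttf_minutes(component_mttf_years, num_components):
    """Mean time to the first component failure anywhere in the system, in minutes.

    Assumes independent components with exponentially distributed lifetimes,
    so that system_rate = num_components * component_rate.
    """
    hours_per_year = 365 * 24
    component_mttf_minutes = component_mttf_years * hours_per_year * 60
    return component_mttf_minutes / num_components

# Hypothetical figures: one million components, each with a 5-year MTTF.
example = system_mttf_minutes(5, 1_000_000)
```

Under these assumed figures, `example` is about 2.6 minutes, which is consistent with a failure interval measured in minutes rather than days.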
Therefore, exascale software in particular will require increased resilience to these faults and will need to be able to continue to run through component failure. Since HPC applications are generally carefully load balanced to ensure that work is distributed across all of the available computational cores, it can be important that a replacement node be made available to the application to carry out the work allocated to the failed node (assigning this work to one or more of the remaining nodes which is already loaded is likely to disrupt the load balance and lead to a significant performance degradation).
FIG. 1 illustrates the process of replacing the failed node with a replacement node. The diagram shows six nodes in a system (nodes 0-5), with node 5 failing and no longer being able to contribute to running the application. Of course in reality many more nodes make up a computer system. A replacement node (node 6) is made available to the system and is inserted to replace the failed node. Once the replacement node has been assigned to the application, it is also necessary to initialize it with the data required in order to continue execution (e.g. the values of variables computed by the application).
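The load-balancing consequence of the replacement can be sketched as follows, assuming a toy mapping of work partitions to node numbers with one partition per node, and assuming at least one node survives. All function names are hypothetical, and the sketch covers only the reassignment of work, not the initialization of the replacement node with data.

```python
def replace_failed_node(assignment, failed, replacement):
    """Return a new partition->node map with the failed node swapped out."""
    return {part: (replacement if node == failed else node)
            for part, node in assignment.items()}

def redistribute_to_survivors(assignment, failed):
    """Move each partition of the failed node to the least-loaded survivor."""
    result = {p: n for p, n in assignment.items() if n != failed}
    survivors = sorted(set(result.values()))
    for part, node in assignment.items():
        if node == failed:
            load = {s: 0 for s in survivors}
            for n in result.values():
                load[n] += 1
            result[part] = min(survivors, key=lambda s: load[s])
    return result

def max_load(assignment):
    """Largest number of partitions held by any single node."""
    load = {}
    for node in assignment.values():
        load[node] = load.get(node, 0) + 1
    return max(load.values())

# Six nodes (0-5), one partition each; node 5 fails and node 6 is available.
initial = {part: part for part in range(6)}
replaced = replace_failed_node(initial, failed=5, replacement=6)
redistributed = redistribute_to_survivors(initial, failed=5)
```

With the replacement node, the maximum load per node stays at one partition; redistributing to the survivors instead leaves one node with two partitions, which in a synchronised computation would hold back all the others.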
The need to initialize a replacement node is not new, and known initialisation techniques include:

Restarting the node from a checkpoint file: This method guarantees that the data is initialised to the correct values. However, generating a checkpoint is time consuming (since it involves copying large amounts of data either to the memory on another node or to a file on disk). Hence, data is generally checkpointed periodically with relatively large gaps between checkpoints. As a result, when recovering data from a checkpoint, all computation since the last checkpoint needs to be repeated (at least on the failed node and possibly globally). There is therefore a two-fold overhead in time from checkpointing: once when creating the checkpoint and once again when reading and re-computing to recover from it.

Interpolating values on the failed node from those of the equivalent data on the surviving nodes: This method is not always possible (e.g. if each node is responsible for a discrete region of space in a modelling algorithm then it is unlikely to be possible to interpolate the solution across the whole of this region simply from the values on its boundary). Even if it is possible to interpolate the data on the failed node from that on other nodes, there will be a loss of accuracy as a result of doing this.
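By way of illustration only, the cost of each known technique can be sketched on a toy problem. The sketch assumes a single failure, an in-memory copy standing in for a checkpoint file, and linear interpolation between two boundary values; all function names are hypothetical and neither function represents any particular prior implementation.

```python
import copy

def run_with_checkpoints(total_steps, checkpoint_interval, fail_at_step):
    """Toy iteration with one failure; returns (final value, steps recomputed)."""
    state = {"step": 0, "value": 0}
    checkpoint = copy.deepcopy(state)   # stands in for a checkpoint file
    recomputed = 0
    failed = False
    while state["step"] < total_steps:
        if not failed and state["step"] == fail_at_step:
            # Failure: the live state is lost, so restart from the checkpoint.
            failed = True
            recomputed = state["step"] - checkpoint["step"]
            state = copy.deepcopy(checkpoint)
            continue
        state["step"] += 1
        state["value"] += state["step"]
        if state["step"] % checkpoint_interval == 0:
            checkpoint = copy.deepcopy(state)   # periodic checkpoint
    return state["value"], recomputed

def interpolate_lost_block(left_value, right_value, block_size):
    """Linearly interpolate block_size lost values between two boundary values."""
    span = block_size + 1
    return [left_value + (right_value - left_value) * (i / span)
            for i in range(1, block_size + 1)]

# Checkpoints every 30 steps; the failure occurs after 75 completed steps.
final_value, recomputed_steps = run_with_checkpoints(100, 30, 75)

# True data x*x at x = 0..4; the values at x = 1, 2, 3 are lost.
true_block = [1.0, 4.0, 9.0]
estimated_block = interpolate_lost_block(0.0, 16.0, 3)
```

In the checkpoint case the final value is correct, but the 15 steps completed since the checkpoint at step 60 must be repeated. In the interpolation case nothing is recomputed, but the estimate `[4.0, 8.0, 12.0]` differs from the true values `[1.0, 4.0, 9.0]`.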
Both of these prior art techniques have deficiencies and thus it is desirable to provide an alternative way of initialising a replacement node.