1. Field of the Invention
The present invention relates to techniques for providing fault-tolerance in parallel-processing systems.
2. Related Art
High-performance computing (HPC) applications often use message-passing techniques, such as the Message Passing Interface (MPI) technique, to facilitate executing distributed parallel-computing applications. The MPI technique allows computationally-intensive and memory-intensive jobs to be decomposed into smaller problems which are executed in parallel across a number of computing nodes.
For example, a problem can be decomposed into N “chunks,” and the chunks can be distributed across N computing nodes to be processed in parallel, thereby decreasing the execution time of the distributed parallel-computing application by a factor of approximately N (less the overhead due to inter-process communications and the overhead for combining the processed chunks). Unfortunately, one drawback of existing message-passing techniques for parallel-computing applications is that they lack a fault-tolerance mechanism. Consequently, if one of the computing nodes fails before all of the chunks complete, the entire parallel-processing job needs to be restarted from the beginning.
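By way of illustration only, the approximate factor-of-N speedup described above can be sketched with a simple timing model. The function name, parameter names, and the additive overhead model are assumptions introduced for this sketch and are not part of any message-passing library.

```python
def estimated_speedup(serial_time, n_nodes, comm_overhead, combine_overhead):
    """Estimate speedup when a job is split into n_nodes equal chunks.

    Hypothetical model: each node processes serial_time / n_nodes of work,
    and the overheads for inter-process communication and for combining
    the processed chunks are added on top of that.
    """
    parallel_time = serial_time / n_nodes + comm_overhead + combine_overhead
    return serial_time / parallel_time


# With no overhead the speedup equals N; with overhead it falls below N.
ideal = estimated_speedup(100.0, 10, 0.0, 0.0)
realistic = estimated_speedup(100.0, 10, 2.0, 3.0)
```

In this model, a job that takes 100 time units serially and is spread across 10 nodes achieves a speedup of 10 with no overhead, and a smaller speedup once communication and combining costs are included.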
One solution to this fault-tolerance problem is to use checkpointing to save the state of the parallel-computing problem to memory and/or disk at regular intervals (at some frequency F). The frequency F should be selected with care because the checkpointing operation imposes a nontrivial overhead penalty on the execution time of the distributed parallel-computing application. If checkpoints are taken too frequently, the checkpointing overhead can negate the speedup gains that result from parallel computing. On the other hand, if checkpoints are taken too infrequently, a failure is likely to destroy a larger amount of the data that has been computed since the last checkpoint was taken.
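The trade-off in selecting the checkpoint frequency can be illustrated with a minimal sketch, assuming a job made of equal unit-cost steps, a fixed cost per checkpoint, and at most one failure. The function total_cost and all of its parameters are hypothetical names chosen for this example; the sketch models the cost accounting only and does not perform actual checkpointing.

```python
def total_cost(total_steps, checkpoint_every, checkpoint_cost, fail_at_step=None):
    """Return the time units needed to finish total_steps of work.

    Assumptions (illustrative only): each step costs one time unit, a
    checkpoint is taken after every checkpoint_every completed steps, and a
    single failure during step fail_at_step rolls the job back to the most
    recent checkpoint (or to the beginning if none was taken).
    """
    cost = 0.0
    step = 0
    last_checkpoint = 0
    failed = False
    while step < total_steps:
        step += 1
        cost += 1.0  # time spent on this step, even if it is later lost
        if not failed and step == fail_at_step:
            failed = True
            step = last_checkpoint  # work since the last checkpoint is lost
            continue
        if step % checkpoint_every == 0 and step < total_steps:
            cost += checkpoint_cost  # overhead of saving state
            last_checkpoint = step
    return cost


# 100 steps, checkpoint cost of 2 time units, one failure during step 55.
too_frequent = total_cost(100, 1, 2.0, fail_at_step=55)
moderate = total_cost(100, 10, 2.0, fail_at_step=55)
too_infrequent = total_cost(100, 100, 2.0, fail_at_step=55)
```

Under these assumptions, checkpointing after every step spends most of the time on checkpoint overhead, while never checkpointing forces the job to restart from the beginning after the failure. A moderate interval costs less than either extreme.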
Hence, what is needed is a method and an apparatus for improving fault-tolerance in a parallel-processing system without the problems described above.