1. Field of the Invention
The present invention relates to techniques for enhancing throughput and fault-tolerance in parallel-processing systems. More specifically, the present invention relates to a method and an apparatus that enhance throughput and fault-tolerance in a parallel-processing system by using standby computing nodes to take over jobs from computing nodes that are determined to be at risk of failure.
2. Related Art
Computation-intensive and memory-intensive applications, such as proteomics and genomics in life sciences, often use message-passing techniques to distribute computational work across multiple computing nodes. This typically involves decomposing a problem into multiple smaller problems, which are then executed in parallel across a plurality of computing nodes on a parallel-processing system.
For example, a problem can be decomposed into N “chunks,” and the chunks can be distributed across N computing nodes to be processed in parallel, thereby decreasing the execution time of the parallel-computing application by a factor of approximately N (less the overhead due to inter-process communications and the overhead for combining the processed chunks). In particular, one class of problems, referred to in high-performance computing (HPC) as “embarrassingly parallel” problems, incurs minimal overhead from inter-process communications and from combining the processed chunks, because the associated parallel processes are either independent or very loosely coupled. Hence, using N identical computing nodes, this class of problems can achieve speedup factors very close to N.
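The relationship between N, the overheads, and the resulting speedup can be illustrated with a minimal sketch. The function name, the parameter names, and the assumption that the overheads simply add to the per-node execution time are illustrative only and are not part of the techniques described above.

```python
def estimated_speedup(serial_time, n_nodes, comm_overhead=0.0, combine_overhead=0.0):
    """Estimate the speedup from splitting a job evenly across n_nodes.

    Assumes (for illustration) that the work divides evenly and that the
    communication and combining overheads add directly to the parallel time.
    All times are in the same unit, e.g. seconds.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    parallel_time = serial_time / n_nodes + comm_overhead + combine_overhead
    return serial_time / parallel_time


# Embarrassingly parallel case: negligible overhead, speedup equals N.
ideal = estimated_speedup(1000.0, 10)

# Same job with 20 time units of communication and 5 of combining overhead.
with_overhead = estimated_speedup(1000.0, 10, comm_overhead=20.0, combine_overhead=5.0)

print(ideal, with_overhead)  # 10.0 8.0
```

Under these assumptions, the speedup falls from 10 to 8 once the overheads amount to a quarter of the per-node compute time, which is why loosely coupled problems come closest to a factor of N.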
Unfortunately, one drawback of existing message-passing techniques for parallel-computing applications is that they lack a fault-tolerance mechanism. Consequently, if one of the computing nodes fails before all of the chunks complete, the entire parallel-processing job needs to be restarted from the beginning.
One solution to this fault-tolerance problem is to use a checkpointing technique, wherein the system periodically stores the states of all computing nodes to memory and/or disk. With periodic checkpoints, if a machine or a single computing node crashes during execution, the system does not have to restart the entire problem from the beginning. Instead, the system can simply return to the last checkpoint, retrieve the saved state information, and restart from there.
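The checkpoint-and-restart behavior described above can be sketched for a single process as follows. This is a simplified illustration, not an implementation of any particular checkpointing system: the function names, the dictionary used as the saved state, and the `fail_at` parameter used to simulate a crash are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place, so a
    crash during the write cannot corrupt the previous checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    If a checkpoint exists at `path`, execution resumes from it. `fail_at`
    (hypothetical) raises an error at that step to simulate a node crash.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo():
    """Crash at step 7, then restart from the checkpoint taken at step 5."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "job.ckpt")
        try:
            run_job(path, total_steps=10, checkpoint_every=5, fail_at=7)
        except RuntimeError:
            pass  # the work done in steps 5 and 6 is lost
        return run_job(path, total_steps=10, checkpoint_every=5)


print(demo())  # 45
```

In this sketch the second call to `run_job` repeats only steps 5 and 6 rather than the whole job, and the final result matches an uninterrupted run.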
Unfortunately, the checkpointing process can increase the execution time of the parallel-computing application. Specifically, if checkpoints are taken too frequently, the checkpointing overhead can become significant enough to largely negate the speedup gains that result from parallel execution. On the other hand, if checkpoints are taken too infrequently, there is an increased likelihood of losing a large amount of data that has been computed since the last checkpoint was taken.
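This trade-off can be made concrete with a rough model. The sketch below assumes that each checkpoint has a fixed cost and that a failure is equally likely at any point between two checkpoints, so that on average half an interval of work is lost per failure. Both assumptions, and the function and parameter names, are introduced here for illustration only.

```python
def checkpoint_costs(run_time, interval, checkpoint_cost):
    """Return (total checkpoint overhead, expected work lost per failure).

    Assumes a fixed cost per checkpoint and a failure time that is uniformly
    distributed within a checkpoint interval. All values share one time unit.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    n_checkpoints = int(run_time // interval)
    overhead = n_checkpoints * checkpoint_cost
    expected_loss_per_failure = interval / 2.0
    return overhead, expected_loss_per_failure


# A 3600-second job with a 5-second checkpoint cost.
frequent = checkpoint_costs(3600.0, 60.0, 5.0)      # checkpoint every minute
infrequent = checkpoint_costs(3600.0, 1800.0, 5.0)  # checkpoint every half hour

print(frequent)    # (300.0, 30.0)
print(infrequent)  # (10.0, 900.0)
```

Under these assumptions, checkpointing every minute adds 300 seconds of overhead but risks only about 30 seconds of work per failure, whereas checkpointing every half hour adds 10 seconds of overhead but risks about 900 seconds of work per failure.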
Hence, what is needed is a method and an apparatus for enhancing the throughput and fault-tolerance in a parallel-processing environment without the above-described problems.