1. Field of the Invention
The present invention relates to computing systems and techniques for enhancing throughput in these computing systems. More specifically, the present invention relates to adjusting a checkpointing frequency in computing systems based on risk metrics for computing nodes in these computing systems.
2. Related Art
Distributed high-performance computing systems (such as grid computing systems), in which multiple computing nodes are linked by optical fibers, can provide significant computational capacity. These computing systems allow complicated problems to be divided into separate jobs that are processed in parallel by the computing nodes.
However, as the size and complexity of a computing system increase, the computing system can become more vulnerable to failures. For example, if there is a failure on a computing node that is executing one of the jobs, all of the jobs may need to be repeated.
In existing computing systems, this problem can be addressed using checkpointing. During checkpointing, the operation of a computing node is typically interrupted and a current state of a job executing on the computing node may be stored to facilitate a subsequent recovery of the job in the event of a failure.
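By way of illustration only, the checkpoint-and-recover cycle described above can be sketched as follows. This is a minimal single-node sketch, not an implementation of any particular system: the function names (save_checkpoint, load_checkpoint, run_job), the dictionary used as job state, the checkpoint_every interval, and the fail_at parameter that simulates a node failure are all hypothetical and are introduced solely for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Store the current job state (hypothetical helper).

    The state is written to a temporary file and then renamed, so a
    failure during the write cannot corrupt the previous checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last stored state, or `default` if none exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Toy job that sums the integers 0..total_steps-1.

    The job resumes from the last checkpoint if one exists. `fail_at`
    is an assumption of this sketch: it raises an error at the given
    step to imitate a failure of the computing node.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        # Interrupt the computation periodically to store its state.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

In this sketch, a job that fails partway through can be restarted by calling run_job again with the same path; only the work performed since the most recent checkpoint is repeated, rather than the entire job.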
Unfortunately, the input/output (I/O) bandwidth of the optical links has been increasing more slowly than other performance characteristics of a grid computing system, such as processor performance, Linpack performance, and hard-disk-drive capacity. As these other characteristics improve, the amount of data to be checkpointed correspondingly increases. However, this increase in data has not been matched by a corresponding increase in I/O bandwidth through the optical links. Consequently, the time needed to checkpoint large grid computing systems through such optical links has been increasing and may soon exceed the mean time between failures of computing nodes in such computing systems.
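The relationship described above can be illustrated with a short calculation. The sketch below is illustrative only and rests on stated assumptions: the time to checkpoint is approximated as the amount of data divided by the available I/O bandwidth, and the mean time between failures of the system as a whole is approximated as the per-node value divided by the number of computing nodes, which assumes independent node failures. The function names and all numerical values are hypothetical and are not taken from any measured system.

```python
def checkpoint_time_seconds(data_bytes, bandwidth_bytes_per_s):
    """Approximate time to write one checkpoint: data / bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return data_bytes / bandwidth_bytes_per_s


def system_mtbf_seconds(node_mtbf_s, node_count):
    """Approximate system-wide mean time between failures.

    Assumption of this sketch: node failures are independent, so the
    system MTBF is roughly the node MTBF divided by the node count.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_s / node_count


def checkpointing_is_viable(data_bytes, bandwidth_bytes_per_s,
                            node_mtbf_s, node_count):
    """True if a checkpoint is expected to finish before the next failure."""
    t_ckpt = checkpoint_time_seconds(data_bytes, bandwidth_bytes_per_s)
    return t_ckpt < system_mtbf_seconds(node_mtbf_s, node_count)


# Hypothetical figures, chosen only to show the trend.
TB = 10**12
GB = 10**9
FIVE_YEARS_S = 5 * 365 * 24 * 3600  # assumed per-node MTBF, in seconds

# 100 TB of state over a 10 GB/s link, 10,000 nodes.
print(checkpointing_is_viable(100 * TB, 10 * GB, FIVE_YEARS_S, 10_000))
# Twice the data over the same link.
print(checkpointing_is_viable(200 * TB, 10 * GB, FIVE_YEARS_S, 10_000))
```

Under these assumed figures, doubling the amount of checkpointed data without increasing the I/O bandwidth moves the checkpoint time from below the system-wide mean time between failures to above it, which is the trend described above.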
Hence, there is a need to provide other techniques for checkpointing computing nodes in a computing system.