The present invention relates generally to computer systems, and more particularly to dynamically determining a checkpoint trigger on a computer system.
In computer systems that include multiple processing resources for executing a plurality of tasks, distribution of task execution is important to system performance. Some computing systems include multiple processing nodes to execute tasks in parallel. Memory- and processing-bandwidth-intensive tasks can be distributed to the processing nodes for parallel execution.
In high-performance computing, applications execute over long periods of time. To support error recovery, checkpoints can be established periodically to capture the state of critical values needed to restart execution and recover from an error condition. In systems of higher complexity, checkpoint overhead typically increases, reducing overall available processing throughput. Latency associated with lower-bandwidth paths further increases checkpoint overhead.