The present invention relates generally to computer systems, and more particularly to establishing checkpoints on a hybrid computing node.
In computer systems that include multiple processing resources for executing a plurality of tasks, the distribution of task execution is important to system performance. Some computing systems include processing accelerators that assist a main processor in executing tasks. Memory-bandwidth-intensive tasks can be distributed to processing accelerators that have high-bandwidth, locally available memory, and the processing results can be reported back to the main processor.
In high-performance computing, applications execute over long periods of time. To support error recovery, checkpoints can be established periodically to capture the state of critical values needed to restart execution and recover from an error condition. As system complexity increases, checkpoint overhead typically increases, which reduces the overall available processing throughput. Latency associated with lower-bandwidth paths further increases checkpoint overhead.