Many large computational problems must run concurrently as multiple threads or processes across a distributed set of compute nodes, such as those in a compute cluster. These workloads are typical of what has traditionally been called High Performance Computing (HPC), and they may use programming models such as the Message Passing Interface (MPI) to coordinate their distributed computations. Distributed computing, however, is not limited to HPC or to MPI. Grid computing and cloud computing also run distributed concurrent computations. Although grid computing and cloud computing may use MPI, they also use other programming models, such as distributed shared memory.
Regardless of the computing domain and the programming model, all of these distributed computations share the same challenge: how to maintain coordinated progress despite continual, unpredictable failures of the underlying systems. For coordinated programs, a failure of any one component typically interrupts all of the distributed threads and processes. To mitigate this problem, a strategy referred to as checkpoint-restart is employed, in which the distributed state is periodically persisted to stable storage and subsequently used to restart the processes following a failure. Distributed state capture, however, is extremely challenging, primarily because both the distributed memories and any pending messages must be captured.
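By way of illustration only, the checkpoint-restart strategy described above can be sketched for a single process as follows. This is a minimal sketch under stated assumptions, not a distributed implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter used to simulate a failure are all hypothetical, and a local file stands in for stable storage.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Persist state atomically: write a temp file, sync it, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to the storage device
    # The rename is atomic when both paths are on the same filesystem,
    # so a crash never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last persisted state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, num_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..num_steps-1, checkpointing periodically.

    `fail_at` is a hypothetical hook that simulates a failure at that step.
    """
    state = load_checkpoint(path)  # restart from the last checkpoint, if any
    while state["step"] < num_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

A first call to `run` that fails partway loses only the work done since the last checkpoint; a second call with the same path resumes from the persisted step rather than from the beginning. The sketch deliberately omits what makes the distributed case hard: every process would have to checkpoint a mutually consistent state, and messages in flight between processes would have to be captured as well.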
A need therefore exists for improved techniques for performing distributed state capture in parallel computing environments.