Parallel storage systems are widely used in many computing environments. Parallel storage systems provide a high degree of concurrency, in which many distributed processes within a parallel application simultaneously access a shared file namespace. Parallel computing techniques are used in many industries and applications for implementing computationally intensive models or simulations.
In many parallel computing applications, a group of distributed processes typically protects itself against failure using synchronous checkpoints. Synchronous checkpointing is an extremely difficult workload for the storage system since every process simultaneously writes data to the storage system. Synchronous checkpoints also result in wasted resources since the storage system must be extremely powerful to absorb each checkpoint burst, yet it remains substantially idle between checkpoint phases.
Capturing a synchronous image of a distributed state while a large number of messages are frequently exchanged is difficult. Therefore, the simplest way to checkpoint is to halt the messages by having each process call a barrier operation, such as an MPI_Barrier( ) operation in a Message Passing Interface (MPI) implementation. Every process then simultaneously stores its own portion of the distributed state.
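The barrier-then-store pattern described above can be illustrated with a minimal sketch. It is not an MPI program: threads stand in for distributed processes, Python's threading.Barrier stands in for an MPI_Barrier( ) call, and the function name run_synchronous_checkpoint, the in-memory dictionary used as the "storage system", and the toy computation are all hypothetical choices made for illustration.

```python
import threading


def run_synchronous_checkpoint(num_workers=4, steps=3):
    """Simulate a barrier-coordinated checkpoint (illustrative only)."""
    # Stand-in for MPI_Barrier(): no worker passes until all have arrived.
    barrier = threading.Barrier(num_workers)
    # Hypothetical in-memory stand-in for the shared storage system.
    checkpoints = {}
    lock = threading.Lock()

    def worker(rank):
        state = 0
        for _ in range(steps):
            state += rank + 1  # toy local computation
        # Halt: every worker stops here, so no messages are in flight.
        barrier.wait()
        # All workers now write during the same phase (the burst on storage).
        with lock:
            checkpoints[rank] = state
        # Resume computation only after every worker has stored its state.
        barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints
```

Calling run_synchronous_checkpoint(4, 3) returns one saved state per worker. All writes fall between the two barriers, which mirrors the bursty, simultaneous load on the storage system noted earlier.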
Asynchronous checkpointing provides a better solution from the perspective of storage system efficiency and also provides faster throughput for the system. Asynchronous checkpointing, however, requires the logging of all messages, since the checkpoints do not correspond to a synchronous moment in the state of the distributed data structure. In other words, the complete state can be reconstructed only from the asynchronous checkpoints together with all of the logged messages. The difficulty is that, for typical parallel computing applications, the number of exchanged messages that must be stored can be quite large. Saving the message log to disk storage would be extremely slow, and saving it to a faster memory-based storage system would be too expensive, as it would require much more memory.
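The dependence of recovery on both the checkpoint and the message log can be sketched for a single process as follows. This is a simplified illustration under stated assumptions: the class name LoggedProcess, its methods, and the use of integer "messages" that are added to a numeric state are all hypothetical, and the log is an in-memory list rather than a real storage tier.

```python
class LoggedProcess:
    """Toy process that checkpoints asynchronously and logs every message."""

    def __init__(self):
        self.state = 0
        self.log = []               # every received message is retained
        self.checkpoint = (0, 0)    # (saved state, log position at save time)

    def receive(self, delta):
        # Log the message before applying it, so it can be replayed later.
        self.log.append(delta)
        self.state += delta

    def take_checkpoint(self):
        # Taken at an arbitrary moment, with no coordination with peers.
        self.checkpoint = (self.state, len(self.log))

    def recover(self):
        # Rebuild the state from the checkpoint plus the messages after it.
        state, position = self.checkpoint
        for delta in self.log[position:]:
            state += delta
        return state


p = LoggedProcess()
p.receive(5)
p.take_checkpoint()
p.receive(3)
p.receive(-1)
recovered = p.recover()
```

Here the checkpoint alone would restore a stale state; the correct state is recovered only by replaying the messages logged after the checkpoint. The log grows with every message received, which is the storage cost described above.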
A need therefore exists for improved techniques for checkpointing in parallel computing environments.