Parallel storage systems are widely used in many computing environments. They provide a high degree of concurrency, allowing many distributed processes within a parallel application to access a shared file namespace simultaneously. Parallel computing techniques are used in many industries and applications for implementing computationally intensive models or simulations.
In many parallel computing applications, a group of distributed processes typically protect themselves against failure using checkpoints. Checkpointing is an extremely difficult workload for the storage system since each application simultaneously writes data to the storage system. Checkpoints thus create a bursty input/output (I/O) pattern in which the storage system is mostly idle except for infrequent periods of I/O during which the bandwidth of the entire storage system is saturated and the expensive distributed processes in the compute nodes are idle. Checkpoints often result in wasted resources since the storage system must be extremely powerful while remaining substantially idle between checkpoint phases.
It is desirable for storage systems to provide a minimum amount of capacity to store the required checkpoint data while also requiring a minimum amount of bandwidth to perform each checkpoint operation quickly enough so that the expensive processors in the compute nodes are not idle for excessive periods of time. A need therefore exists for improved checkpointing techniques in parallel computing environments.