A requirement of any robust computing environment is to be able to recover from errors, such as device hardware errors (e.g., mechanical or electrical errors) or recording media errors. In order to recover from some device or media errors, it is necessary to restart a program, either from the beginning or from some other point within the program.
To facilitate recovery of a program, especially a long-running program, the intermediate state of the program is saved at particular intervals. This is referred to as checkpointing the program. Checkpointing enables the program to be restarted from the last checkpoint, rather than from the beginning of the program.
When checkpointing a program, it is important to generate a complete new checkpoint file before destroying any old checkpoint file. This is to ensure that at any instant there is a valid checkpoint file from which the program can be restored. If an old checkpoint file is erased before the new checkpoint file is completed (or if the old checkpoint file is directly overwritten with the new checkpoint file), it is possible that a system failure will occur at precisely the moment when the old checkpoint file no longer exists, but the new checkpoint file is not yet valid. This causes a situation in which there is no valid checkpoint file.
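The ordering described above, in which the new checkpoint file is completed before the old one is destroyed, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a method prescribed by this document: the function names `save_checkpoint` and `load_checkpoint`, the use of `pickle` as the file format, and the temporary-file naming are all hypothetical. The sketch writes the new checkpoint to a temporary file in the same directory, forces it to storage, and only then renames it over the old file with `os.replace`, so that a reader of the checkpoint path sees either the old file or the complete new file.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write a complete new checkpoint, then replace the old one in one step.

    Hypothetical helper: the name, signature, and pickle format are
    assumptions made for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target so that
    # the final rename stays within a single file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the new checkpoint data to storage
        # Only now is the old checkpoint replaced. Until this call, the old
        # file at `path` is untouched and remains a valid restart point.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, discard the incomplete file; the old checkpoint survives.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the most recently completed checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

If a failure occurs while the temporary file is being written, the file at `path` still holds the previous checkpoint, which is the property the paragraph above calls for.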
When checkpointing a parallel program, there is an additional complication. The state of all the processes of the parallel program is to be saved in a consistent manner. Thus, in general, it is not sufficient to simply take a checkpoint of each of the processes individually. Instead, the processes are coordinated, so that the resulting checkpoints, when taken as a whole, reflect a valid state of the parallel program.
A problem arises if any one of the processes has a checkpoint file that is inconsistent with those of the others. For example, assume a parallel program has a plurality of processes, and all but one of those processes have completed taking a new checkpoint when a failure occurs. If one of the processes that finished taking a checkpoint has already erased its old checkpoint file, then upon restart there is no complete set of consistent checkpoint files. This is because the process that erased its old checkpoint file now has only a new checkpoint file, while the process that failed does not have a new checkpoint file.
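The failure scenario above can be made concrete with a short Python sketch. It shows one possible way to avoid the problem, offered as an assumption for illustration rather than as the approach of this document: each process writes its checkpoint under a generation number, and old generations are deleted only after every process has a file for the newer generation. The file naming scheme and the helpers `write_checkpoint`, `is_complete`, `latest_complete_generation`, and `prune_older` are hypothetical, and the processes are simulated by rank numbers within a single directory. In a real parallel program, a barrier or similar agreement step would be needed between writing and pruning.

```python
import os


def ckpt_name(rank, generation):
    # Hypothetical naming scheme: one file per process (rank) per generation.
    return f"rank{rank}.gen{generation}.ckpt"


def write_checkpoint(directory, rank, generation, data):
    """Write one process's checkpoint for a generation (data is bytes)."""
    final = os.path.join(directory, ckpt_name(rank, generation))
    tmp = final + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # the file appears under its final name only when complete


def is_complete(directory, generation, num_ranks):
    """True if every process has a checkpoint file for this generation."""
    return all(
        os.path.exists(os.path.join(directory, ckpt_name(r, generation)))
        for r in range(num_ranks)
    )


def latest_complete_generation(directory, num_ranks, max_generation):
    """Return the newest generation that all processes completed, or None."""
    for g in range(max_generation, -1, -1):
        if is_complete(directory, g, num_ranks):
            return g
    return None


def prune_older(directory, generation, num_ranks):
    """Delete generations older than `generation`, but only if it is complete."""
    if not is_complete(directory, generation, num_ranks):
        raise RuntimeError("newer generation is incomplete; keeping old checkpoints")
    for g in range(generation):
        for r in range(num_ranks):
            path = os.path.join(directory, ckpt_name(r, g))
            if os.path.exists(path):
                os.remove(path)
```

With three processes, if generation 0 is complete and only two of the processes have written generation 1, `latest_complete_generation` returns 0 and `prune_older` for generation 1 refuses to delete anything, so a consistent set of checkpoint files remains available for restart.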
Based on the foregoing, a need exists for a capability that ensures the capture of a complete and consistent set of checkpoint files for a parallel program. A further need exists for a capability that identifies a complete and consistent set of checkpoint files for a parallel program.