This invention relates, in general, to processing of parallel programs and, in particular, to performing checkpoint and restart of a parallel program.
Enhancing the performance of computing environments continues to be a challenge for system designers, as well as for programmers. In order to help meet this challenge, parallel processing environments have been created, thereby setting the stage for parallel programming.
A parallel program includes a number of processes that are independently executed on one or more processors. The processes communicate with one another via, for instance, messages. As the number of processors used for a parallel program increases, so does the likelihood of a system failure. Thus, it is important in a parallel processing environment to be able to recover efficiently so that system performance is only minimally impacted.
To facilitate recovery of a parallel program, especially a long running program, intermediate results of the program are taken at particular intervals. This is referred to as checkpointing the program. Checkpointing enables the program to be restarted from the last checkpoint, rather than from the beginning.
One technique for checkpointing and restarting a program is described in U.S. Pat. No. 5,301,309 entitled "Distributed Processing System With Checkpoint Restart Facilities Wherein Checkpoint Data Is Updated Only If All Processors Were Able To Collect New Checkpoint Data", issued on Apr. 5, 1994. With that technique, processes external to the program are responsible for checkpointing and restarting the program. In particular, failure processing tasks detect that there has been a system failure. Restart processing tasks execute the checkpoint restart processing in response to the detection of the system failure, and checkpoint processing tasks determine the data necessary for the restart processing. Thus, the external processes are intimately involved in checkpointing and restarting the program.
Although the above-described technique and other techniques have been used to checkpoint and restart programs, further enhancements are needed. For example, checkpoint/restart capabilities are needed in which the checkpointing and restarting of a process of a parallel program is handled by the process itself, instead of by external processes. Further, a need exists for checkpoint/restart capabilities that enable the saving of interprocess message state and the restoring of that message state. Additionally, a need exists for a checkpoint capability that provides for the committing of a checkpoint file, so that only one checkpoint file for a process need be saved for restart purposes. Yet further, a need exists for a checkpoint capability that allows the writing of checkpoint files to either global or local storage. Finally, a need exists for checkpoint/restart capabilities that allow migration of the processes from one processor to another.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a system of checkpointing parallel programs. This system includes means for taking a checkpoint of a parallel program having a plurality of processes, which includes means for writing, by a process of the plurality of processes, message data to a checkpoint file corresponding to the process. The message data includes an indication that there are no messages, or it includes one or more in-transit messages between the process writing the message data and one or more other processes of the plurality of processes.
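As a minimal illustrative sketch only, and not the claimed implementation, the writing of message data by a process might look as follows. The function names `write_message_data` and `read_message_data`, the JSON encoding, and the representation of an in-transit message as a (peer, payload) pair are all assumptions made for illustration; the disclosure does not prescribe a file format.

```python
import json


def write_message_data(path, in_transit):
    """Write one process's message data to its own checkpoint file.

    `in_transit` is a list of (peer_rank, payload_bytes) pairs (hypothetical
    representation). An empty list is recorded as an explicit "no messages"
    indication rather than by omitting the record.
    """
    record = {
        "no_messages": len(in_transit) == 0,
        "messages": [
            {"peer": peer, "payload": payload.hex()}
            for peer, payload in in_transit
        ],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)


def read_message_data(path):
    """Read the message data back; returns [] when no messages were saved."""
    with open(path, "r", encoding="utf-8") as f:
        record = json.load(f)
    if record["no_messages"]:
        return []
    return [(m["peer"], bytes.fromhex(m["payload"])) for m in record["messages"]]
```

Recording the "no messages" case explicitly lets a reader of the file distinguish an empty message state from a checkpoint whose message data was never written.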
In one embodiment, the checkpoint file is stored in local storage accessible by the process. However, in a further embodiment, the checkpoint file is stored in global storage accessible by the plurality of processes of the parallel program.
In one aspect of the present invention, the system includes means for restoring the process that wrote the message data to the checkpoint file. The means for restoring includes means for copying the message data from the checkpoint file to memory of a computing unit executing the process. In one example, the computing unit executing the process is a different computing unit from the one on which the process took the checkpoint.
In another embodiment of the present invention, the means for taking a checkpoint includes means for taking a checkpoint by a number of processes. The means for taking a checkpoint by the number of processes includes means for writing data to a number of checkpoint files, wherein each process of the number of processes takes a corresponding checkpoint.
In a further example of the present invention, the system includes means for coordinating the taking of the corresponding checkpoints by the number of processes.
In yet another aspect of the present invention, a system of checkpointing parallel programs is provided. The system includes a computing unit being adapted to write to a data section of a process of a parallel program a signal state and/or one or more file offsets, and to subsequently write the data section to a checkpoint file corresponding to the process. Further, the computing unit is adapted to write message data to the checkpoint file, wherein the message data includes an indication that there are no messages, or it includes one or more in-transit messages between the process and one or more other processes of a parallel program. Additionally, the computing unit is adapted to write executable information, stack contents and/or register contents to the checkpoint file.
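By way of a hedged example, the handling of file offsets might be sketched as below: the offsets of a process's open files are captured so they can be placed in the data section, and on restart the files are reopened and repositioned. The helper names `save_file_offsets` and `restore_file_offsets` are hypothetical, and the sketch assumes files opened in binary mode by path; signal state, stack contents and register contents are not shown.

```python
def save_file_offsets(open_files):
    """Return {path: byte offset} for the given open binary file objects."""
    return {f.name: f.tell() for f in open_files}


def restore_file_offsets(offsets, mode="rb"):
    """Reopen each file and seek to the offset recorded at checkpoint time."""
    restored = []
    for path, offset in offsets.items():
        f = open(path, mode)
        f.seek(offset)
        restored.append(f)
    return restored
```

After restoration, a read on a returned file object continues from the position the process had reached when the checkpoint was taken.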
In a further aspect of the present invention, a system of restoring parallel programs is provided. The system includes, for instance, means for restarting one or more processes of a parallel program on one or more computing units, wherein at least one process of the one or more processes is restarted on a different computing unit from the computing unit that was previously used to take at least one checkpoint for the at least one process. The system of restoring further includes means for copying data stored in one or more checkpoint files corresponding to the one or more restarted processes into memory of the one or more computing units executing the one or more restarted processes.
In another aspect of the present invention, a system of checkpointing parallel programs is provided. The system includes means for indicating, by a process of a parallel program, that the process is ready to take a checkpoint; means for receiving, by the process, an indication to take the checkpoint; means for taking the checkpoint, wherein the means for taking the checkpoint includes means for having the process copy data from memory associated with the process to a checkpoint file corresponding to the process; and means for indicating, by the process, completion of the taking of the checkpoint.
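The four steps recited above (indicate readiness, receive the indication, take the checkpoint, indicate completion) can be sketched, under stated assumptions, with threads standing in for the processes of a parallel program. The coordinator, the queue-based signalling, the names `worker` and `run_checkpoint`, and the `ckpt.<rank>` file naming are illustrative assumptions, not details taken from the disclosure.

```python
import json
import os
import queue
import threading


def worker(rank, state, to_coord, from_coord, ckpt_dir):
    to_coord.put(("ready", rank))                 # 1. indicate readiness
    command = from_coord.get()                    # 2. receive the indication
    if command != "take_checkpoint":
        return
    path = os.path.join(ckpt_dir, f"ckpt.{rank}")
    with open(path, "w", encoding="utf-8") as f:  # 3. the process itself copies
        json.dump(state, f)                       #    its data to its own file
    to_coord.put(("done", rank))                  # 4. indicate completion


def run_checkpoint(states, ckpt_dir):
    """Coordinate one checkpoint across len(states) simulated processes."""
    to_coord = queue.Queue()
    from_coord = [queue.Queue() for _ in states]
    threads = [
        threading.Thread(target=worker,
                         args=(rank, state, to_coord, from_coord[rank], ckpt_dir))
        for rank, state in enumerate(states)
    ]
    for t in threads:
        t.start()
    # Workers block until told to proceed, so the first len(states)
    # messages received here are necessarily the "ready" indications.
    ready = {to_coord.get()[1] for _ in states}
    for q in from_coord:
        q.put("take_checkpoint")
    done = {to_coord.get()[1] for _ in states}
    for t in threads:
        t.join()
    return ready, done
```

In this sketch the coordinator only sequences the checkpoint; the copying of data to each checkpoint file is performed by the process that owns it.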
In accordance with the principles of the present invention, checkpoint/restart capabilities are provided that allow the processes themselves to take the checkpoint and to restart after a failure. Additionally, in-transit messages between processes (interprocess messages), or an indication that there are no messages, are saved during the checkpointing of the program. The messages are saved without having to log the messages in a log file. Further, after the processes have taken their checkpoints, the checkpoint files are committed, so that there is only one checkpoint file for each process at the time of restart. Yet further, the capabilities of the present invention allow the writing of checkpoints to either global or local storage. Additionally, migration of the processes from one system to another is allowed when the checkpoints are written to global storage.
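One way the committing of a checkpoint file could be realized, offered purely as an assumption-laden sketch, is to write each new checkpoint under a temporary name and then rename it over the previous committed file. The write-then-rename approach, the names `write_uncommitted` and `commit`, and the `ckpt.<rank>` naming are illustrative choices and are not stated in the disclosure.

```python
import os


def write_uncommitted(ckpt_dir, rank, data):
    """Write a new checkpoint under a temporary name (not yet committed)."""
    tmp = os.path.join(ckpt_dir, f"ckpt.{rank}.tmp")
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to storage before committing
    return tmp


def commit(ckpt_dir, rank):
    """Commit: replace the previous checkpoint, leaving one file per process."""
    tmp = os.path.join(ckpt_dir, f"ckpt.{rank}.tmp")
    final = os.path.join(ckpt_dir, f"ckpt.{rank}")
    os.replace(tmp, final)  # overwrites any existing committed file
    return final
```

Until `commit` is called, the previously committed file remains intact, so a failure during the writing of a new checkpoint still leaves a usable checkpoint for restart.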
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.