This invention relates, in general, to processing of parallel programs and, in particular, to performing checkpoint and restart of a parallel program.
Enhancing the performance of computing environments continues to be a challenge for system designers, as well as for programmers. In order to help meet this challenge, parallel processing environments have been created, thereby setting the stage for parallel programming.
A parallel program includes a number of processes that are independently executed on one or more processors. The processes communicate with one another via, for instance, messages. As the number of processors used for a parallel program increases, so does the likelihood of a system failure. Thus, it is important in a parallel processing environment to be able to recover efficiently so that system performance is only minimally impacted.
To facilitate recovery of a parallel program, especially a long running program, intermediate results of the program are taken at particular intervals. This is referred to as checkpointing the program. Checkpointing enables the program to be restarted from the last checkpoint, rather than from the beginning.
One technique for checkpointing and restarting a program is described in U.S. Pat. No. 5,301,309 entitled “Distributed Processing System With Checkpoint Restart Facilities Wherein Checkpoint Data Is Updated Only If All Processors Were Able To Collect New Checkpoint Data”, issued on Apr. 5, 1994. With that technique, processes external to the program are responsible for checkpointing and restarting the program. In particular, failure processing tasks detect that there has been a system failure. Restart processing tasks execute the checkpoint restart processing in response to the detection of the system failure, and checkpoint processing tasks determine the data necessary for the restart processing. Thus, the external processes are intimately involved in checkpointing and restarting the program.
Although the above-described technique, as well as other techniques, has been used to checkpoint and restart programs, further enhancements are needed. For example, checkpoint/restart capabilities are needed in which the checkpointing and restarting of a process of a parallel program is handled by the process itself, instead of by external processes. Further, a need exists for checkpoint/restart capabilities that enable the saving of interprocess message state and the restoring of that message state. Additionally, a need exists for a checkpoint capability that provides for the committing of a checkpoint file, so that only one checkpoint file for a process need be saved for restart purposes. Yet further, a need exists for a checkpoint capability that allows the writing of checkpoint files to either global or local storage. Finally, a need exists for checkpoint/restart capabilities that allow migration of the processes from one processor to another.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of an article of manufacture, including at least one computer usable medium having computer readable program code means embodied therein for causing the checkpointing of parallel programs. The computer readable program code means in the article of manufacture includes, for instance, computer readable program code means for causing a computer to take a checkpoint of a parallel program. The parallel program includes a plurality of processes, and the computer readable program code means for causing a computer to take a checkpoint includes computer readable program code means for causing a computer to write, by a process of the plurality of processes, message data to a checkpoint file corresponding to the process. The message data includes an indication that there are no messages, or it includes one or more in-transit messages between the process writing the message data and one or more other processes of the plurality of processes.
In a further embodiment of the invention, the computer readable program code means for causing a computer to take a checkpoint further includes computer readable program code means for causing a computer to write, by a process of the plurality of the processes, a data section, a signal state and/or one or more file offsets to a checkpoint file corresponding to the process that is writing the data section, signal state and/or file offset(s).
In another aspect of the present invention, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform a method of checkpointing parallel programs is provided. The method includes, for instance, taking a checkpoint by a process of a parallel program. The taking of the checkpoint includes writing to a data section of the process at least one of a signal state and one or more file offsets; subsequently, writing the data section to a checkpoint file corresponding to the process; writing message data to the checkpoint file, wherein the message data includes an indication that there are no messages, or it includes one or more in-transit messages between the process and one or more other processes of the parallel program; and writing at least one of executable information, stack contents and register contents to the checkpoint file.
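By way of illustration only, the ordering of the writes recited above can be sketched in Python. The length-prefixed JSON layout, the function names and the field names are assumptions made for this sketch; the description does not specify a file format, and a real implementation would capture stack and register contents through platform-specific means rather than as plain values.

```python
import io
import json
import struct


def _write_section(out, payload: bytes) -> None:
    # Hypothetical framing: 4-byte little-endian length, then the payload.
    out.write(struct.pack("<I", len(payload)))
    out.write(payload)


def _read_section(inp) -> bytes:
    (length,) = struct.unpack("<I", inp.read(4))
    return inp.read(length)


def write_checkpoint(out, data_section, signal_state, file_offsets,
                     in_transit, exec_state) -> None:
    # 1. Write the signal state and file offsets into the data section.
    data_section = dict(data_section,
                        signal_state=signal_state,
                        file_offsets=file_offsets)
    # 2. Subsequently, write the data section to the checkpoint file.
    _write_section(out, json.dumps(data_section).encode())
    # 3. Write message data: either an explicit "no messages" indication
    #    or the in-transit messages themselves.
    message_data = {"no_messages": not in_transit,
                    "messages": list(in_transit)}
    _write_section(out, json.dumps(message_data).encode())
    # 4. Write executable information, stack contents and register contents
    #    (represented here as an ordinary dictionary for illustration).
    _write_section(out, json.dumps(exec_state).encode())


def read_checkpoint(inp):
    # Sections are read back in the same order in which they were written.
    data_section = json.loads(_read_section(inp))
    message_data = json.loads(_read_section(inp))
    exec_state = json.loads(_read_section(inp))
    return data_section, message_data, exec_state
```

Because the signal state and file offsets are placed in the data section first, they are saved as part of that section and need no separate region of the checkpoint file in this sketch.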
In yet another aspect of the present invention, an article of manufacture including at least one computer usable medium having computer readable program code means embodied therein for causing the restoring of parallel programs is provided. The computer readable program code means in the article of manufacture includes computer readable program code means for causing a computer to restart one or more processes of a parallel program on one or more computing units, wherein at least one process is restarted on a different computing unit from the computing unit that was previously used to take at least one checkpoint for the at least one process. Further, the article of manufacture includes computer readable program code means for causing a computer to copy data stored in one or more checkpoint files corresponding to the one or more restarted processes into memory of the one or more computing units executing the one or more restarted processes.
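As a non-limiting illustration, restarting a process on a different computing unit can be sketched as follows. The sketch assumes that checkpoint files reside in a directory visible to every computing unit and are named by process rank rather than by host; the function names, the JSON encoding and the host names are hypothetical and are not taken from the description.

```python
import json
import os


def checkpoint_path(global_dir: str, rank: int) -> str:
    # Assumption: the file name depends only on the process rank, so any
    # computing unit that can see global_dir can locate the file.
    return os.path.join(global_dir, f"ckpt.{rank}.json")


def take_checkpoint(global_dir: str, rank: int, host: str, memory: dict) -> None:
    # Record which computing unit took the checkpoint, plus the saved data.
    with open(checkpoint_path(global_dir, rank), "w") as f:
        json.dump({"taken_on": host, "memory": memory}, f)


def restart(global_dir: str, rank: int, host: str) -> dict:
    # Copy the data stored in the checkpoint file into the memory of the
    # computing unit that is now executing the restarted process.
    with open(checkpoint_path(global_dir, rank)) as f:
        saved = json.load(f)
    return {
        "rank": rank,
        "host": host,
        "migrated": saved["taken_on"] != host,
        "memory": saved["memory"],
    }
```

For example, a process of rank 0 checkpointed on a unit named "nodeA" can be passed to restart with host "nodeB", in which case the returned state reports that the process migrated.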
In yet a further aspect of the present invention, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform a method of checkpointing parallel programs is provided. The method includes indicating, by a process of a parallel program, that the process is ready to take a checkpoint; receiving, by the process, an indication to take the checkpoint; taking the checkpoint, wherein the process copies data from memory associated with the process to a checkpoint file corresponding to the process; and indicating, by the process, completion of taking the checkpoint.
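The four steps of this method (indicating readiness, receiving an indication, taking the checkpoint, indicating completion) can be sketched as a single-threaded simulation. The Coordinator class below is a hypothetical stand-in for whatever entity issues the indication to take the checkpoint; it, the file naming and the use of pickle are assumptions of this sketch rather than features stated in the description.

```python
import os
import pickle


class Coordinator:
    """Hypothetical party that collects readiness and completion reports."""

    def __init__(self, ranks):
        self.ranks = set(ranks)
        self.ready = set()
        self.done = set()

    def report_ready(self, rank):
        self.ready.add(rank)

    def report_done(self, rank):
        self.done.add(rank)

    def all_ready(self):
        return self.ready == self.ranks

    def all_done(self):
        return self.done == self.ranks


class Process:
    def __init__(self, rank, memory, ckpt_dir):
        self.rank = rank
        self.memory = memory
        self.path = os.path.join(ckpt_dir, f"ckpt.{rank}")

    def indicate_ready(self, coord):
        # Step 1: the process indicates that it is ready to take a checkpoint.
        coord.report_ready(self.rank)

    def take_checkpoint(self, coord):
        # Step 3: the process itself copies its memory to its checkpoint file.
        with open(self.path, "wb") as f:
            pickle.dump(self.memory, f)
        # Step 4: the process indicates completion of taking the checkpoint.
        coord.report_done(self.rank)


def run_round(processes, coord):
    for p in processes:
        p.indicate_ready(coord)
    if not coord.all_ready():
        return False  # no indication is issued until every process is ready
    for p in processes:
        # Step 2: calling take_checkpoint models receipt of the indication.
        p.take_checkpoint(coord)
    return coord.all_done()
```

In this sketch each process writes its own file, consistent with the checkpoint being taken by the process rather than by an external task.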
In accordance with the principles of the present invention, checkpoint/restart capabilities are provided that allow the processes themselves to take the checkpoint and to restart after a failure. Additionally, in-transit messages between processes (interprocess messages), or an indication that there are no messages, are saved during the checkpointing of the program. The messages are saved without having to log the messages in a log file. Further, after the processes have taken their checkpoints, the checkpoint files are committed, so that there is only one checkpoint file for each process at the time of restart. Yet further, the capabilities of the present invention allow the writing of checkpoints to either global or local storage. Additionally, migration of the processes from one system to another is allowed when the checkpoints are written to global storage.
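One possible way to realize the committing of a checkpoint file is sketched below. It assumes that a new checkpoint is first written under a temporary name and is then committed by renaming it over the previously committed file; this rename-based mechanism, the ".tmp" suffix and the function names are assumptions of the sketch, since the summary does not state how the commit is performed.

```python
import os


def write_uncommitted(path: str, payload: bytes) -> str:
    # Write the new checkpoint beside the committed one; the previously
    # committed file remains intact and usable until commit() is called.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the data to storage before committing
    return tmp


def commit(tmp: str, path: str) -> None:
    # os.replace renames the file and overwrites any existing destination,
    # so exactly one checkpoint file remains for the process afterwards.
    os.replace(tmp, path)
```

If a failure occurs before the commit, the earlier committed file is still available for restart; after the commit, only the newer file exists.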
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.