This application contains subject matter which is related to the subject matter of the following applications, each of which is assigned to the same assignee as this application and filed on the same day as this application. Each of the below listed applications is hereby incorporated herein by reference in its entirety:
“A SYSTEM OF PERFORMING CHECKPOINT/RESTART OF A PARALLEL PROGRAM,” by Meth et al., Ser. No. 09/181,981;
“PROGRAM PRODUCTS FOR PERFORMING CHECKPOINT/RESTART OF A PARALLEL PROGRAM,” by Meth et al., Ser. No. 09/182,555;
“CAPTURING AND IDENTIFYING A COMPLETE AND CONSISTENT SET OF CHECKPOINT FILES,” by Meth et al., Ser. No. 09/182,175;
“RESTORING CHECKPOINTED PROCESSES INCLUDING ADJUSTING ENVIRONMENT VARIABLES OF THE PROCESSES,” by Meth et al., Ser. No. 09/182,357; and
“RESTORING CHECKPOINTED PROCESSES WITHOUT RESTORING ATTRIBUTES OF EXTERNAL DATA REFERENCED BY THE PROCESSES,” by Meth et al., issued Jul. 3, 2001 as U.S. Pat. No. 6,256,751.
This invention relates, in general, to processing of parallel programs and, in particular, to performing checkpoint and restart of a parallel program.
Enhancing the performance of computing environments continues to be a challenge for system designers, as well as for programmers. In order to help meet this challenge, parallel processing environments have been created, thereby setting the stage for parallel programming.
A parallel program includes a number of processes that are independently executed on one or more processors. The processes communicate with one another via, for instance, messages. As the number of processors used for a parallel program increases, so does the likelihood of a system failure. Thus, it is important in a parallel processing environment to be able to recover efficiently so that system performance is only minimally impacted.
To facilitate recovery of a parallel program, especially a long running program, intermediate results of the program are taken at particular intervals. This is referred to as checkpointing the program. Checkpointing enables the program to be restarted from the last checkpoint, rather than from the beginning.
One technique for checkpointing and restarting a program is described in U.S. Pat. No. 5,301,309 entitled “Distributed Processing System With Checkpoint Restart Facilities Wherein Checkpoint Data Is Updated Only If All Processors Were Able To Collect New Checkpoint Data”, issued on Apr. 5, 1994. With that technique, processes external to the program are responsible for checkpointing and restarting the program. In particular, failure processing tasks detect that there has been a system failure. Restart processing tasks execute the checkpoint restart processing in response to the detection of the system failure, and checkpoint processing tasks determine the data necessary for the restart processing. Thus, the external processes are intimately involved in checkpointing and restarting the program.
Although the above-described technique, as well as other techniques, has been used to checkpoint and restart programs, further enhancements are needed. For example, checkpoint/restart capabilities are needed in which the checkpointing and restarting of a process of a parallel program is handled by the process itself, instead of by external processes. Further, a need exists for checkpoint/restart capabilities that enable the saving of interprocess message state and the restoring of that message state. Additionally, a need exists for a checkpoint capability that provides for the committing of a checkpoint file, so that only one checkpoint file for a process need be saved for restart purposes. Yet further, a need exists for a checkpoint capability that allows the writing of checkpoint files to either global or local storage. Further, a need exists for checkpoint/restart capabilities that allow migration of the processes from one processor to another.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method of checkpointing parallel programs. The method includes, for instance, taking a checkpoint of a parallel program, wherein the parallel program includes a plurality of processes. The taking of a checkpoint includes writing, by a process of the plurality of processes, message data to a checkpoint file corresponding to the process. The message data includes an indication that there are no messages, or it includes one or more in-transit messages between the process writing the message data and one or more other processes of the plurality of processes.
In a further embodiment, the taking of a checkpoint further includes writing, by a process of the plurality of processes, a data section, a signal state and/or one or more file offsets to a checkpoint file corresponding to the process that is writing the data section, signal state and/or file offset(s).
In yet a further embodiment, the taking of a checkpoint further includes writing, by a process of the plurality of processes, executable information, stack contents, and/or register contents to a checkpoint file corresponding to the process writing the executable information, the stack contents and/or the register contents.
In another embodiment of the invention, the method includes restoring the process that wrote the message data to the checkpoint file, wherein the restoring includes copying the message data from the checkpoint file to memory of the computing unit executing the process.
In one example, the computing unit executing the process is a different computing unit from when the checkpoint was taken by the process.
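By way of illustration only, the writing of message data by a process and the later copying of that data back into memory might be sketched as follows. The JSON file layout, the NO_MESSAGES marker, the message fields (src, dst, payload) and both function names are assumptions made for this sketch and are not taken from the described embodiments.

```python
import json

# Hypothetical marker recorded when the process has no in-transit messages.
NO_MESSAGES = "NO_MESSAGES"


def write_message_data(path, in_transit):
    """Write the message data of one process to its own checkpoint file.

    The file holds either the in-transit messages or an explicit
    indication that there are none.
    """
    if in_transit:
        record = {"message_data": list(in_transit)}
    else:
        record = {"message_data": NO_MESSAGES}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)


def restore_message_data(path):
    """Copy message data from the checkpoint file back into memory."""
    with open(path, "r", encoding="utf-8") as f:
        record = json.load(f)
    data = record["message_data"]
    return [] if data == NO_MESSAGES else data
```

Because the restore step reads only the checkpoint file, the same call could in principle be made on a different computing unit than the one that took the checkpoint, provided that the file is reachable from that unit.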
In another embodiment of the invention, the taking of a checkpoint further includes taking a checkpoint by a number of processes of the plurality of the processes. The taking of a checkpoint by the number of processes includes writing data to a number of checkpoint files, wherein each process of the number of processes takes a corresponding checkpoint.
In a further example, the taking of the corresponding checkpoints by the number of processes is coordinated.
In another aspect of the invention, a method of restoring parallel programs is provided. The method includes, for instance, restarting one or more processes of the parallel program on one or more computing units, wherein at least one of the processes is restarted on a different computing unit from the computing unit that was previously used to take at least one checkpoint for the at least one process. Further, data stored in one or more checkpoint files corresponding to the one or more restarted processes is copied into memory of the one or more computing units executing the restarted processes.
In yet a further aspect of the invention, a method of checkpointing parallel programs is provided. The method includes indicating, by a process of a parallel program, that the process is ready to take a checkpoint; receiving, by the process, an indication to take the checkpoint; taking the checkpoint, which includes having the process copy data from memory associated with the process to a checkpoint file corresponding to the process; and indicating, by the process, completion of taking the checkpoint.
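The four steps recited above (indicating readiness, receiving an indication to take the checkpoint, taking the checkpoint, and indicating completion) might be sketched as follows. Threads stand in for the processes of the parallel program, and the coordinating function, the queue-based signalling, the message strings and the ckpt.<rank> file names are all hypothetical choices made for this sketch rather than features of the described method.

```python
import json
import os
import queue
import threading


def process(rank, state, to_coord, from_coord, ckpt_dir):
    """One process of the parallel program taking its own checkpoint."""
    to_coord.put(("ready", rank))            # 1. indicate readiness
    if from_coord.get() != "take":           # 2. receive the indication
        raise RuntimeError("unexpected indication")
    path = os.path.join(ckpt_dir, f"ckpt.{rank}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)                  # 3. copy memory to the file
    to_coord.put(("done", rank))             # 4. indicate completion


def coordinate(states, ckpt_dir):
    """Hypothetical coordinator: release the processes once all are ready."""
    to_coord = queue.Queue()
    inboxes = [queue.Queue() for _ in states]
    workers = [
        threading.Thread(
            target=process, args=(r, s, to_coord, inboxes[r], ckpt_dir)
        )
        for r, s in enumerate(states)
    ]
    for w in workers:
        w.start()
    # No process can report "done" before "take" is sent, so the first
    # len(states) entries on the queue are all readiness indications.
    ready = {to_coord.get()[1] for _ in states}
    for box in inboxes:
        box.put("take")
    done = {to_coord.get()[1] for _ in states}
    for w in workers:
        w.join()
    return ready, done
```

In this sketch each process writes its own checkpoint file; the coordinator only sequences the steps and never touches process data.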
In accordance with the principles of the present invention, checkpoint/restart capabilities are provided that allow the processes themselves to take the checkpoint and to restart after a failure. Additionally, in-transit messages between processes (interprocess messages) or an indication that there are no messages is saved during the checkpointing of the program. The messages are saved without having to log the messages in a log file. Further, after the processes have taken their checkpoints, the checkpoint files are committed, so that there is only one checkpoint file for each process at the time of restart. Yet further, the capabilities of the present invention allow the writing of checkpoints to either global or local storage. Additionally, migration of the processes from one system to another is allowed, when the checkpoints are written to global storage.
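One possible way to arrive at a single checkpoint file per process at the time of restart is sketched below. The text above does not say how committing is carried out; the temporary-file-plus-rename approach, the function names and the ckpt.<rank> and .tmp file names are assumptions introduced only for illustration.

```python
import os


def write_uncommitted(ckpt_dir, rank, data):
    """Write a new checkpoint under a temporary name (not yet committed)."""
    tmp = os.path.join(ckpt_dir, f"ckpt.{rank}.tmp")
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to storage before committing
    return tmp


def commit(ckpt_dir, rank):
    """Commit the checkpoint by renaming it over the previous one.

    os.replace overwrites any existing destination, so after the commit
    exactly one checkpoint file remains for the process.
    """
    tmp = os.path.join(ckpt_dir, f"ckpt.{rank}.tmp")
    final = os.path.join(ckpt_dir, f"ckpt.{rank}")
    os.replace(tmp, final)
    return final
```

Under these assumptions, a failure that occurs before the commit leaves the previously committed file untouched, so a restart would still find one usable checkpoint file for the process.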
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.