The present invention is generally directed to checkpoint and restart of cooperating processes employed in multinode data processing systems. More particularly the present invention is directed to such systems in which shared memory is present, between or among several processes, but which is not persistent and is saved as needed during checkpoint operations in nonvolatile storage systems, such as disk memory. Even more particularly, the present invention is directed to methods for the establishment, restoration and release of shared memory regions that are accessed by multiple cooperating processes.
In checkpoint operations, all processes (or threads) are stopped and held in a stable state so that the operational states are captured in a set of files which may be later read and used to restore the same stable state as part of an application restart operation. The state information includes the contents of process stack, heap, registers and other resources “owned by” the process. Once a checkpoint or restart operation is completed, control is returned to all threads or processes which then resume execution beginning at their next instruction, whatever that may be. It is assumed that threads and processes find all of the memory and other resources looking just as they were at the time they were checkpointed. Resumption of execution is a minor issue as long as the checkpoint operation itself has not needed to undo some parts of the application state. However, the restarting of operations is more complex and is particularly considered with respect to the invention disclosed herein. Accordingly, in the discussions herein, references to taking a “checkpoint” imply that a checkpoint operation has been carried out and has generated information with the needed coherence for later use in a restart operation.
Checkpoint and restart capabilities in a data processing system, particularly a multinode data processing system, provide fault tolerance and system preemption mechanisms. Certain forms of process migration are also implemented with the help of checkpoint and restart operations. These make checkpoint and restart features very appealing to large scale parallel processing applications. A parallel job is any set of processes that work together or communicate with one another for the purpose of achieving a common goal. To checkpoint a parallel job, not only does one have to save the private states of all participating processes but one must also save the shared resources and the relationships among the processes. Shared memory is one of the shared resources that should be handled very carefully in checkpoint and restart operations.
Checkpoint and restart features are typically implemented at the operating system level and higher. Operating systems, such as AIX (marketed and distributed by the assignee of the present invention) and other UNIX based systems are oriented to the handling of processes. However, in the case of multiple processes that work cooperatively and use shared memory resources, information about such cooperation is unavailable at the operating system level. One of the fundamental problems with shared resources, such as shared memory, is that they do not have a single owner. Additionally, it is very desirable that the above-mentioned system level checkpoint and restart operations be transparent to user level applications. In the typical large data processing system, most of the resources are accessible to the operating system kernel; and the system kernel maintains control of all running processes. Therefore, there are very few restrictions on when and how a checkpoint can be taken. Because checkpoints can occur at unpredictable times, a system hangup is possible. In terms of the checkpointing of parallel jobs that store information in the shared memory regions, the operating system kernel can easily freeze the participating processes of the parallel job and save the entire shared memory region. However, a problem here is that the operating system does not know when these shared resources will be at a stable state since this information is only available at the application level. This adds a level of difficulty to checkpointing operations. Typically, only the application itself has exact knowledge of which state and resources should be saved in order to reconstruct the process later. At the system level there is no knowledge of desired or necessary application level resources.
Nevertheless, AIX and other operating systems provide certain mechanisms which can be used to solve these problems. In particular, AIX employs a defined data structure called a mutex which is a locking mechanism which is useful in providing data integrity and coherency. It is described more particularly below in the context of its use in the present invention. Operating systems also typically provide mechanisms called handlers which are essentially subroutines or functions that are invoked upon the occurrence of a defined event. Handlers are registered with the operating system and provide application level programming with an extremely flexible method of handling certain events, such as a checkpointing operation. It is also worth mentioning here that for AIX, and for other UNIX-like operating systems, the word “process” is a term of art that refers to the execution of a program.
In prior approaches the operating system handles resources that are germane to a single process but cannot handle resources that transcend a single process. The operating system can provide hooks in the form of checkpoint, restart and resume handler invocations at the appropriate moments, but the application level must use those hooks. The present invention provides a mechanism to do so safely in the case of shared memory across multiple cooperating processes. Specifically, there exists a mechanism which allows a process at the application level to handle resources that transcend a single process. Application level programming registers checkpoint and restart handlers and/or callback functions for this purpose. The current AIX operating system checkpoint facility provides hooks (that is, well defined interface mechanisms) for registering these handlers. When a checkpoint operation is requested, the registered application checkpoint handlers or the callback functions run first to take care of checkpointing those resources before the system kernel starts its checkpointing process. During the restart, the system kernel operations are performed first and then the restart handlers finish application level restoration before the operating system allows the process to continue. Since the shared memory regions are shared among participating processes, the checkpoint and restart handlers contain synchronization points among the processes for the integrity of the content of the shared memory regions. When there are multiple levels of shared memory usage, multiple checkpoint/restart handlers are run in a specific order. Some data processing operating systems (Silicon Graphics, Inc.'s IRIX operating system, for example) use a first-in-first-out (FIFO) order for such handlers while others (International Business Machines' AIX with Parallel Environment for AIX, for example) employ a first-in-last-out (FILO) order for these handlers.
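By way of illustration only, the ordering rules described above can be sketched in a few lines of Python. The sketch is not the AIX or IRIX interface: the class `HandlerRegistry`, the functions `checkpoint`, `restart` and `demo`, and the handler labels are all hypothetical names introduced here. It assumes only what the text states, namely that checkpoint handlers run before the kernel step, that restart handlers run after it, and that handlers are invoked in either FIFO or FILO order of registration.

```python
from typing import Callable, List


class HandlerRegistry:
    """Hypothetical registry of application level handlers (not a real OS API)."""

    def __init__(self, order: str = "FILO") -> None:
        if order not in ("FIFO", "FILO"):
            raise ValueError("order must be 'FIFO' or 'FILO'")
        self.order = order
        self._handlers: List[Callable[[], None]] = []

    def register(self, handler: Callable[[], None]) -> None:
        # Handlers are remembered in registration order.
        self._handlers.append(handler)

    def run(self) -> None:
        # FIFO runs the oldest registration first; FILO runs the newest first.
        if self.order == "FIFO":
            ordered = list(self._handlers)
        else:
            ordered = list(reversed(self._handlers))
        for handler in ordered:
            handler()


def checkpoint(registry: HandlerRegistry, kernel_step: Callable[[], None]) -> None:
    # Application handlers save shared resources before the kernel checkpoints.
    registry.run()
    kernel_step()


def restart(registry: HandlerRegistry, kernel_step: Callable[[], None]) -> None:
    # The kernel restores per-process state first; handlers then finish the job.
    kernel_step()
    registry.run()


def demo(order: str) -> List[str]:
    log: List[str] = []
    registry = HandlerRegistry(order)
    registry.register(lambda: log.append("library handler"))      # registered first
    registry.register(lambda: log.append("application handler"))  # registered second
    checkpoint(registry, lambda: log.append("kernel checkpoint"))
    return log
```

Under these assumptions, `demo("FIFO")` records the library handler before the application handler, while `demo("FILO")` records them in the opposite order; in both cases the kernel step is recorded last.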
For instance, a parallel job on a data processing system exploits a shared memory segment for communication. Participating processes register the handlers at the beginning of the job. Once the handlers are registered, a checkpoint is allowed to proceed and the data structures in the shared memory segment necessary for restart are saved by the handlers during the checkpoint operation. Note that if the checkpoint operation is requested while some process has not registered its handlers, and the active handlers expect that unregistered process to participate, then the synchronization among the checkpoint handlers can never complete, thus resulting in a hung process. Thus, in prior systems, if checkpointing of a parallel job using parallel shared resources is attempted before the needed checkpoint and restart handlers are registered, then deadlocks may happen. Accordingly, ensuring that exactly those processes which have registered their handlers are expected to participate in the synchronization process is an important step.
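The hazard described above can be made concrete with a small Python sketch. It is a simulation under stated assumptions rather than any actual implementation: threads stand in for the cooperating processes, a `threading.Barrier` stands in for the synchronization point inside the checkpoint handlers, and a timeout stands in for what would otherwise be an indefinite hang. The function name `run_checkpoint` and its parameters are hypothetical.

```python
import threading
from typing import List


def run_checkpoint(expected: int, arriving: int, timeout: float = 0.5) -> bool:
    """Return True if every arriving handler passes the synchronization point.

    `expected` is the number of participants the handlers wait for;
    `arriving` is the number of handlers that were actually registered
    and therefore actually run.
    """
    barrier = threading.Barrier(expected)
    results: List[bool] = []

    def handler() -> None:
        try:
            # Each registered handler waits for all expected participants.
            barrier.wait(timeout=timeout)
            results.append(True)
        except threading.BrokenBarrierError:
            # Without the timeout this wait would never return: a hung process.
            results.append(False)

    threads = [threading.Thread(target=handler) for _ in range(arriving)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return all(results)
```

When the expected count equals the number of registered handlers, for example `run_checkpoint(expected=3, arriving=3)`, the synchronization completes. When one participant is expected but has not yet registered, for example `run_checkpoint(expected=3, arriving=2)`, the remaining handlers wait until the timeout expires, which models the deadlock.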
Restrictions on when checkpoint operations can be undertaken are not acceptable in certain circumstances, especially when dynamic tasking and third party libraries are used. (See below for a more complete description of these terms.) For example, consider an application that does not create shared memory regions directly but uses the IBM MPI (Message Passing Interface) or LAPI (Low level Application Programming Interface) libraries. Quite often, these communication libraries make use of shared memory within one data processing node for better performance. The application may not be aware of shared memory usage employed on its behalf. In order to be checkpoint/restart safe, communication libraries should register checkpoint and restart handlers for handling of the shared memory regions during initialization. The problem is that the communication libraries may be initialized at different times by different processes. Therefore, it is possible that some processes may perform a lot of computation before communicating with other processes. Not allowing a checkpoint operation before the initialization of all communicating processes means these computation results cannot be checkpointed.
Normally, tasks of a parallel job and their relationships are determined statically at the job startup time and do not change until the end of the job. However, MPI (Message Passing Interface) allows dynamic task creation and termination after a job has started. Accordingly, the group of tasks of the job can change and the group of tasks sharing a shared memory resource can also dynamically change as a result. Third party libraries are libraries that are part of neither the operating system nor the user application. MPI is commonly implemented as a third party library.
In the present invention, there are provided methods to establish, to restore and to release shared memory regions so that the use of application level handlers is sufficient without imposing unnecessary restrictions on when checkpoint operations can be undertaken. The description of the present invention and its methods is based on the implementation of the shared memory communication operation in the MPI (Message Passing Interface) library on the IBM AIX operating system and within the IBM Parallel Environment for AIX. However, it is noted that this particular platform is used as a mechanism for better explaining the concepts of the invention and also to describe the embodiment preferred by the inventors; however, it should be fully appreciated that the use and operation of the present invention is applicable to any shared memory data processing system running processes in parallel or simultaneously. The AIX/PE platform supports registration of application level checkpoint/restart handlers. Users are able to register three different sets of handlers: checkpoint, resume and restart in the application code. Checkpoint handlers run during checkpoint operations. Either the resume or restart handlers run after a checkpoint is completed, depending on whether the job is to continue or is being restarted. The AIX operating system (made and sold by the assignee of the present invention) also provides a mechanism for a process to block checkpointing operations. Checkpointing of a process only occurs when no thread of that process is holding a “Pthread mutex” lock (“mutex” for short; see below for an explanation of the meaning and operation of a mutex lock). Checkpoint blocking is used to protect critical sections of application level code. (In the present context, a critical section of code is one whose operations must be allowed to complete atomically in order to ensure data integrity and/or data coherence.) Various blocking mechanisms are provided by different systems.
In some cases, it is done by simply masking/unmasking the checkpoint signal. The blocking of checkpointing for a thread within a critical section would not be required in a system in which “resume” or “restart” functionality provides an assurance that threads in a critical section at checkpoint time will resume or restart still holding whatever lock they held before the checkpoint operation began. The present invention also employs a blocking mechanism but it is noted that the specific nature of the blocking mechanism is not at all critical.
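The idea of blocking a checkpoint while a critical section is in progress can be illustrated with the following Python sketch. It is a simplified model only and does not use the AIX facility: a single `threading.Lock` named `checkpoint_block` stands in for the mutex, and the hypothetical function `try_checkpoint` defers the checkpoint (by returning False) whenever that lock is held. The `head` and `tail` counters are an invented example of shared state that is coherent only when the two values are equal.

```python
import threading
from typing import Callable, Dict, List, Tuple

# Stands in for the mutex whose ownership blocks checkpointing (assumption).
checkpoint_block = threading.Lock()


def try_checkpoint(take_checkpoint: Callable[[], None]) -> bool:
    """Take a checkpoint only if no critical section currently holds the lock."""
    if not checkpoint_block.acquire(blocking=False):
        return False  # deferred: the requester would retry later
    try:
        take_checkpoint()
        return True
    finally:
        checkpoint_block.release()


def demo() -> Tuple[bool, bool, List[Dict[str, int]]]:
    state = {"head": 0, "tail": 0}  # coherent only when head == tail
    snapshots: List[Dict[str, int]] = []

    def take_checkpoint() -> None:
        snapshots.append(dict(state))

    with checkpoint_block:  # critical section begins
        state["head"] += 1
        # A request arriving here would capture head != tail, so it is deferred.
        inside = try_checkpoint(take_checkpoint)
        state["tail"] += 1
    # The critical section is over; the state is coherent again.
    outside = try_checkpoint(take_checkpoint)
    return inside, outside, snapshots
```

In this model the request made inside the critical section is refused and the request made afterwards succeeds, so the only snapshot recorded is the coherent one. Masking and unmasking a checkpoint signal, as mentioned above, would serve the same purpose by different means.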
To checkpoint or restart a parallel job, which can be any set of processes working together or communicating with one another for a common goal, a framework is desired to coordinate the activities involved in the checkpoint or in the restart process. This framework is typically provided above the level of the operating system facilities that support single process checkpoint and restart operations. Within this framework, checkpoint signals are delivered to all participating processes, checkpoint file sets are gathered and packaged, and processes are restarted with the checkpoint file sets. The simplest form of the framework is just a set of directions and rules that users follow so that the parallel job is able to be correctly restarted. In the present implementation, the framework is the IBM PE (Parallel Environment) for AIX platform, as mentioned above, which automates signal delivery and checkpoint file set management. However, the present methods to establish, to restore and to release shared memory regions apply to any such framework thus allowing user level shared memory checkpoint and restart operations to be carried out, without any timing restriction, both coherently and correctly.