1. Field of the Invention
The present invention relates to operating systems for computers. More specifically, the present invention relates to a method and an apparatus for recovering a multi-threaded process from a checkpoint.
2. Related Art
Computer systems often provide a checkpointing mechanism for fault-tolerance purposes. A checkpointing mechanism operates by periodically performing a checkpointing operation that stores a snapshot of the state of a running computer system to a checkpoint repository, such as a file. If the computer system subsequently fails, the computer system can roll back to a previous checkpoint by using information from the checkpoint file to recreate the state of the computer system at the time of the checkpoint. This allows the computer system to resume execution from the checkpoint, without having to redo the computational operations performed prior to the checkpoint.
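By way of illustration only, the following minimal sketch shows periodic checkpointing to a file and a subsequent rollback. It assumes a toy computation whose entire state is a small dictionary. The names `save_checkpoint`, `load_checkpoint`, and `run`, the JSON file layout, and the `fail_at` parameter used to simulate a failure are all hypothetical and are not taken from this specification.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write a snapshot of `state` to the checkpoint file at `path`.

    The snapshot is written to a temporary file first and then renamed,
    so a failure during the write cannot leave a truncated checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read the most recent snapshot back from the checkpoint file."""
    with open(path) as f:
        return json.load(f)


def run(total_steps, path, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    If a checkpoint file already exists, execution resumes from it, so the
    steps completed before that checkpoint are not repeated.
    `fail_at` (hypothetical) simulates a failure at the given step.
    """
    if os.path.exists(path):
        state = load_checkpoint(path)
    else:
        state = {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

In this sketch, work performed after the last checkpoint and before the failure is lost and must be redone, whereas work performed prior to the checkpoint is not.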
The checkpoint recovery process is complicated for multi-threaded processes that support multiple threads of execution, which share a single address space. Such multi-threaded processes are becoming increasingly common.
Note that native threads within an operating system are often referred to as “light-weight processes” (LWPs). LWPs are typically created and scheduled by the operating system, and the operating system typically provides only a minimal application program interface (API) to manipulate LWPs from outside the operating system kernel. The abstraction of an LWP through an API is often referred to as a “thread”. Within this specification, we refer to both an “LWP” and an abstraction of the LWP through an API as a “thread”.
In order to checkpoint a process with multiple threads, it is necessary to record thread-specific information from inside the kernel of an operating system, so that the threads can be accurately recreated during a checkpoint recovery operation. For example, thread identifiers must be accurately recreated during a recovery operation because some aspects of program execution may depend upon thread identifiers. Hence, if threads are not recreated with the same identifiers, the restored program may behave differently than the original program.
Unfortunately, retrieving thread-specific information from the kernel and using this information to restore threads within the kernel may require complicated additions and/or modifications to the kernel, and such kernel additions are typically very hard to debug and maintain.
Alternatively, thread-specific information can be manipulated through modifications and/or additions to the thread library that provides user-level linkages to thread-specific information in the kernel. However, modifying the thread library can potentially cause unexpected side-effects, and can additionally create maintenance problems.
Another option is to modify an application program to recreate the necessary threads. However, this involves a great deal of additional work for the application programmer.
What is needed is a method and an apparatus for restoring a process with multiple threads without the above-described complications.
One embodiment of the present invention provides a system for recovering a process that is multi-threaded from checkpoint information that was previously stored for the process. During a recovery operation, the system first retrieves the checkpoint information for the process. Next, the system extracts an identifier for a program being run by the process as well as parameters of the program from the checkpoint information. The system also extracts thread identifiers for threads associated with the process from the checkpoint information. Next, the system modifies the program so that executing the program will cause threads associated with the process to be restored. The system then creates a replacement process to replace the process, and causes the replacement process to execute the modified program so that the threads are reconstituted within the replacement process.
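By way of illustration only, the extraction step described above can be sketched as follows. The sketch assumes, purely for the sake of example, that the checkpoint information is stored as JSON with the keys `program`, `parameters`, and `threads`. This layout, the `RecoveryPlan` type, and the `extract_recovery_plan` function are hypothetical and are not prescribed by this specification.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecoveryPlan:
    """Items extracted from the checkpoint information (hypothetical layout)."""
    program: str                          # identifier for the program being run
    parameters: List[str]                 # parameters of the program
    thread_ids: List[int]                 # identifiers of threads to recreate
    registers: Dict[int, Dict[str, int]]  # saved registers, keyed by thread id


def extract_recovery_plan(checkpoint_text):
    """Extract the program identifier, parameters, and thread identifiers."""
    info = json.loads(checkpoint_text)
    threads = info["threads"]
    return RecoveryPlan(
        program=info["program"],
        parameters=list(info["parameters"]),
        thread_ids=sorted(t["id"] for t in threads),
        registers={t["id"]: dict(t["registers"]) for t in threads},
    )
```

The resulting plan carries the items that a replacement process would need in order to execute the modified program and reconstitute the threads.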
In one embodiment of the present invention, the modified program causes the threads associated with the process to be restored. This is accomplished by creating threads with identifiers matching the thread identifiers extracted from the checkpoint information, and then restoring registers for the threads from the checkpoint information. This modified program also restores an address space for the process, and activates the threads so that they commence execution.
In a variation on this embodiment, restoring the address space for the process involves overwriting the modified program with an unmodified version of the program. This unmodified version of the program does not contain the modifications that cause the threads to be restored.
In one embodiment of the present invention, creating the threads involves creating threads for successive identifiers until a thread with a highest identifier in the extracted identifiers is created. The system then disposes of threads with identifiers that do not match the extracted identifiers.
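By way of illustration only, the create-then-dispose approach described above can be sketched as follows. The sketch assumes a thread facility that assigns identifiers 1, 2, 3, and so on in creation order and does not reuse the identifier of a disposed thread. The `SequentialThreadAllocator` class is a hypothetical stand-in for such a facility and does not create real threads; `recreate_threads` is likewise a hypothetical name.

```python
class SequentialThreadAllocator:
    """Hypothetical stand-in for a facility that hands out thread
    identifiers 1, 2, 3, ... in creation order and never reuses them."""

    def __init__(self):
        self._next_id = 1
        self.live = set()

    def create(self):
        tid = self._next_id
        self._next_id += 1
        self.live.add(tid)
        return tid

    def dispose(self, tid):
        self.live.remove(tid)


def recreate_threads(allocator, wanted_ids):
    """Recreate threads whose identifiers match `wanted_ids`.

    Threads are created for successive identifiers until the highest
    wanted identifier exists; the threads whose identifiers were not
    wanted are then disposed of.
    """
    wanted = set(wanted_ids)
    if not wanted:
        return set()
    highest = max(wanted)
    created = []
    while True:
        tid = allocator.create()
        created.append(tid)
        if tid == highest:
            break
        if tid > highest:
            # The allocator had already moved past the identifier we need.
            raise RuntimeError("cannot recreate thread id %d" % highest)
    # Dispose of the filler threads only after all creations are done.
    for tid in created:
        if tid not in wanted:
            allocator.dispose(tid)
    return set(allocator.live)
```

For example, given extracted identifiers 1, 4, and 6, the sketch creates threads 1 through 6 and then disposes of threads 2, 3, and 5. Deferring disposal until all threads have been created matches the order described above and does not depend on how a facility treats the identifiers of disposed threads.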
In one embodiment of the present invention, other processes continue executing while the process is being recovered.
In one embodiment of the present invention, obtaining the checkpoint information for the process involves retrieving the checkpoint information from a file.
In one embodiment of the present invention, the process is recovered by code executing in user space, outside of a kernel of an operating system, so that no modifications to the kernel are required to facilitate the recovery process.
In one embodiment of the present invention, the modified program makes system calls to create the threads.
In one embodiment of the present invention, modifying the program involves pre-loading and linking a dynamic library containing code that causes threads associated with the process to be restored.
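By way of illustration only, one conceivable way to pre-load a dynamic library into a replacement process is sketched below. The sketch assumes a dynamic linker that honors the `LD_PRELOAD` environment variable with a space-separated list of libraries. This specification does not name that mechanism, so it is an assumption, as are the `build_recovery_launch` function and the `RESTORE_CHECKPOINT` variable, which is a hypothetical way of telling the pre-loaded library where the checkpoint file resides.

```python
import os


def build_recovery_launch(program, parameters, restore_library,
                          checkpoint_path, base_env=None):
    """Build the argument vector and environment for a replacement process.

    The program itself is left untouched on disk; the restore code is
    introduced by asking the dynamic linker to pre-load `restore_library`
    (assumption: the linker honors LD_PRELOAD).
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Put the restore library first so it is loaded ahead of any others.
    if existing:
        env["LD_PRELOAD"] = restore_library + " " + existing
    else:
        env["LD_PRELOAD"] = restore_library
    # Hypothetical variable read by the pre-loaded library.
    env["RESTORE_CHECKPOINT"] = checkpoint_path
    argv = [program] + list(parameters)
    return argv, env
```

The returned values could then be passed to a process-creation call such as `subprocess.Popen(argv, env=env)` to start the replacement process.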