“Checkpointing” has long been used as a method for achieving fault tolerance in computer systems. It is a procedure for establishing and recording a consistent system state from which all running applications can be safely resumed following a fault. In particular, in order to checkpoint a system, the complete state of the system, that is, the contents of all processor and input/output (I/O) registers, cache memories, and main memory at a specific instant in time, is periodically recorded to form a series of checkpointed states. When a fault is detected, the system, possibly after first diagnosing the cause of the fault and circumventing any malfunctioning component, is returned to the last checkpointed state by restoring the contents of all registers, caches and main memory from the values stored during the last checkpoint. The system then resumes normal operation. If inputs and outputs (I/Os) to and from the computer are correctly handled, and if, in particular, the communication protocols being supported provide appropriate protection against momentary interruptions, this resumption from the last checkpointed state can be effected with no loss of data or program continuity. In most cases, the resumption is completely transparent to users of the computer.
Checkpointing has been accomplished in commercial computers at two different levels. Early checkpoint-based fault-tolerant computers relied on application-directed checkpointing. In this technique, one or more backup computers were designated for each running application. The application was then designed, or modified, to send periodically to its backup computer all state information that would be needed to resume the application should the computer on which it was running fail before the application was able to establish the next checkpoint.
This type of checkpointing could be accomplished without any specialized hardware, but required that all recoverable applications be specially designed to support this feature, since most applications would normally not write the appropriate information to a backup computer. This special design placed a severe burden on the application programmer not only to ensure that checkpoints were regularly established, but also to recognize what information had to be sent to the backup computer. Consequently, application-directed checkpointing has generally been used only for those programs that have been deemed especially critical and therefore worth the significantly greater effort required to program them to support checkpointing.
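The application-directed approach described above can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `BackupNode` class stands in for the backup computer, `OrderTally` is an invented application, and the checkpoint interval of two operations is arbitrary. The point it shows is that the application itself must decide which fields to send and when to send them.

```python
import pickle


class BackupNode:
    """Hypothetical stand-in for the backup computer; keeps the latest checkpoint."""

    def __init__(self):
        self._blob = None

    def receive(self, blob):
        self._blob = blob

    def latest(self):
        return self._blob


class OrderTally:
    """Hypothetical application that checkpoints its own state."""

    def __init__(self, backup):
        self.backup = backup
        self.processed = 0  # number of orders handled so far
        self.total = 0      # running sum of order amounts

    def process(self, amount):
        self.total += amount
        self.processed += 1

    def checkpoint(self):
        # The programmer must enumerate every field needed for recovery.
        state = {"processed": self.processed, "total": self.total}
        self.backup.receive(pickle.dumps(state))

    @classmethod
    def resume(cls, backup):
        # Rebuild the application on the backup from the last checkpoint, if any.
        app = cls(backup)
        blob = backup.latest()
        if blob is not None:
            state = pickle.loads(blob)
            app.processed = state["processed"]
            app.total = state["total"]
        return app


def run_with_failure():
    backup = BackupNode()
    app = OrderTally(backup)
    orders = [10, 20, 30, 40, 50]
    for i, amount in enumerate(orders, start=1):
        app.process(amount)
        if i % 2 == 0:  # assumed checkpoint interval: every two orders
            app.checkpoint()
    # Simulate a failure of the primary after the fifth order. The last
    # checkpoint was taken after the fourth, so one order must be replayed.
    recovered = OrderTally.resume(backup)
    for amount in orders[recovered.processed:]:
        recovered.process(amount)
    return recovered.processed, recovered.total
```

If a field were omitted from the dictionary built in `checkpoint`, recovery would silently resume from an incomplete state, which is the programming burden the preceding paragraph describes.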
System-directed checkpointing has also been implemented in commercial computer systems. The term “system-directed” refers to the fact that checkpointing is accomplished entirely at the system software level and applications do not have to be modified in any way to take advantage of the fault-recovery capability offered through checkpointing. System-directed checkpointing has the distinct advantage of relieving the application programmer of all responsibility for establishing checkpoints. System-directed checkpointing involves periodically establishing checkpoints in which the system state at that instant is recorded in such a way that, should a fault occur before reaching the next checkpoint, the system can be rolled back and the state that prevailed at the last checkpoint can be restored. Either of two basic methods is used to accomplish this. The first, called pre-image checkpointing, requires the contents of any page in memory to be copied to a checkpoint buffer before that page is allowed to be modified. The second, called post-image checkpointing, depends on the existence of a shadow memory with a shadow page for each page in main memory. In this case, when an attempt is made to write to a page in main memory, its address is captured and placed on an address queue. Following each checkpoint, all modified pages are copied into a shadow buffer and from there into the shadow memory.
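The difference between the two methods can be made concrete with a small software model. This is only a sketch under stated assumptions: real implementations operate on hardware pages, whereas the model below uses Python bytearrays; the class names, the page size, and the `rollback` procedures are illustrative and are not taken from any particular system.

```python
class PreImageMemory:
    """Model of pre-image checkpointing: copy a page before its first write."""

    def __init__(self, num_pages, page_size=4):
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.checkpoint_buffer = {}  # page index -> contents at last checkpoint

    def write(self, page, offset, value):
        # Save the pre-image only on the first write since the last checkpoint.
        if page not in self.checkpoint_buffer:
            self.checkpoint_buffer[page] = bytes(self.pages[page])
        self.pages[page][offset] = value

    def checkpoint(self):
        # Main memory is now the recovery point; old pre-images are obsolete.
        self.checkpoint_buffer.clear()

    def rollback(self):
        # Restore every modified page from its saved pre-image.
        for page, image in self.checkpoint_buffer.items():
            self.pages[page][:] = image
        self.checkpoint_buffer.clear()


class PostImageMemory:
    """Model of post-image checkpointing: queue addresses, copy at checkpoint."""

    def __init__(self, num_pages, page_size=4):
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.shadow = [bytes(page_size) for _ in range(num_pages)]
        self.address_queue = []  # pages written since the last checkpoint

    def write(self, page, offset, value):
        # Capture the address of the written page; the data is copied later.
        if page not in self.address_queue:
            self.address_queue.append(page)
        self.pages[page][offset] = value

    def checkpoint(self):
        # Stage modified pages in a shadow buffer, then commit to shadow memory.
        shadow_buffer = {p: bytes(self.pages[p]) for p in self.address_queue}
        for p, image in shadow_buffer.items():
            self.shadow[p] = image
        self.address_queue.clear()

    def rollback(self):
        # Assumed recovery step: reload modified pages from shadow memory.
        for p in self.address_queue:
            self.pages[p][:] = self.shadow[p]
        self.address_queue.clear()
```

In this model the pre-image method pays its copying cost at the time of the first write to each page, while the post-image method records only an address at write time and defers the copying to the checkpoint. Both end in the same restored state after `rollback`.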
While system-directed checkpointing has obvious advantages over application-directed checkpointing, its implementation has traditionally been accomplished through the use of customized hardware and software, making it virtually impossible for such systems to remain competitive in an era of rapidly advancing state-of-the-art commodity computers and operating systems.
More recently, techniques have been disclosed for achieving system-directed checkpointing on standard computer platforms. These techniques, however, all require either modified hardware or else modifications to the operating system kernel. The first of these techniques involves modifying the hardware to capture the information needed to establish a checkpoint. This procedure is best implemented in the memory controller hardware, but unfortunately, standard memory controllers do not support the required functionality. The second technique entails modifying the operating system kernel to enable certain memory writes to be interrupted momentarily so that either the pre-image of the addressed section of memory, or the address itself, can be captured and recorded elsewhere in memory. The problem with this approach is that it can be implemented only on systems having operating systems that have been so modified.