1. Field of the Invention
The present invention is related to checkpointing and error recovery in computer systems, particularly for fault tolerant computer systems.
2. Description of the Related Art
A fault which occurs during execution of machine instructions often renders data or subsequent execution of machine instructions invalid. Instead of halting operation entirely and restarting the execution of the program anew, it is preferable to recover from the fault and to continue processing the machine instructions with a minimum amount of disruption while preserving data and subsequent instructions. Fault recovery has traditionally been implemented in software, in hardware, or in a combination of the two.
Software recovery techniques are well known in the art. In a typical application, periodically, or upon the occurrence of specific events, software “checkpoints” the system by recording data adequate to restore the system to a known valid state. When the software detects a fault, the file modifications performed since the last checkpoint are undone, the computing system is “rolled back” to the most recent checkpoint, and operation of the system is resumed from that point.
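By way of illustration only, the software checkpoint and rollback cycle described above may be sketched as follows. This is a minimal sketch and not the method of any particular prior art system; the class name `CheckpointedState` and its methods are hypothetical, and the whole state is assumed to be small enough to copy in full at each checkpoint.

```python
import copy


class CheckpointedState:
    """Hypothetical application state with software checkpoint and rollback."""

    def __init__(self, initial):
        self.state = copy.deepcopy(initial)
        # Copy of the most recent known valid state.
        self._checkpoint = copy.deepcopy(initial)

    def checkpoint(self):
        # Record data adequate to restore the state later.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Undo every modification made since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


store = CheckpointedState({"balance": 100})
store.state["balance"] -= 30
store.checkpoint()               # known valid state: balance of 70
store.state["balance"] -= 500    # modification made before a fault is detected
store.rollback()                 # fault detected: return to the checkpoint
```

Note that the calls to `checkpoint()` must be placed by the programmer, which is the source of the burden discussed below.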
Software techniques such as this are not transparent to an applications programmer because the programmer must carefully write checkpointing instructions into each application in order to record enough information to restore the application to a valid state. This requirement places a serious burden on the programmer and has impeded the widespread use of checkpointing as a means for achieving fault tolerance. In addition, since the scheme requires the programmer to select which information to record at each checkpoint and when to record the information, it is prone to human error. If the checkpoint code contains flaws, needed data may be overwritten or otherwise lost before proper recording.
In addition, checkpointing through software is very slow. When a fault occurs, certain software routines must be executed to diagnose the problem and to circumvent any permanently malfunctioning component of the computer. As a consequence, the resulting recovery time may preclude the use of this technique for achieving fault tolerance for some real-time applications where response times on the order of milliseconds or less are required. The layering of multiple applications further compounds this problem. Each application may have its own checkpointing subroutines, which, when layered (for example, a Java™ applet running inside a web browser running within an operating system), duplicate the checkpointing processes and substantially decrease the operating efficiency of the entire system.
Other methods for capturing data for checkpointing purposes have been proposed, for example, by Kirrmann (U.S. Pat. No. 4,905,196). Kirrmann's method involves a cascade of memory storage elements consisting of a main memory, followed by two archival memories, each of the same size as the main memory. Writes to the main memory are simultaneously copied into a write buffer. When it is time to establish a checkpoint, the buffered data is then copied by the processor first to one of the archival memories and then to the second. The two archival memories ensure that at least one of them contains a valid checkpoint. Drawbacks of this architecture include the triplication of memory, the use of slow memory for the archival memories, and degraded processor performance, since the three memory elements are different ports on the same bus.
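The cascade described above may be modeled, purely for illustration, as in the following sketch. It is not taken from the Kirrmann patent. The class `CascadedMemory`, the per-archive `valid` flags, the `fault_in` test hook, and the `recover` procedure are all assumptions introduced here to show why updating the two archival memories one after the other leaves at least one valid checkpoint.

```python
class CascadedMemory:
    """Hypothetical model: main memory, write buffer, two archival memories."""

    def __init__(self, size):
        self.main = [0] * size
        self.archives = [[0] * size, [0] * size]
        self.valid = [True, True]   # assumed flag: archive holds a complete checkpoint
        self.write_buffer = []      # (address, value) pairs since the last checkpoint

    def write(self, addr, value):
        # Each write to main memory is also captured in the write buffer.
        self.main[addr] = value
        self.write_buffer.append((addr, value))

    def establish_checkpoint(self, fault_in=None):
        # Copy the buffered writes to the first archive, then to the second.
        # fault_in is a test hook that simulates a fault while that archive
        # is being updated.
        for i, archive in enumerate(self.archives):
            self.valid[i] = False
            for addr, value in self.write_buffer:
                archive[addr] = value
                if fault_in == i:
                    raise RuntimeError("simulated fault during archive update")
            self.valid[i] = True
        self.write_buffer.clear()

    def recover(self):
        # Only one archive is updated at a time, so one is always valid.
        src = self.valid.index(True)
        good = self.archives[src]
        self.main[:] = good
        self.archives[1 - src][:] = good
        self.valid[1 - src] = True
        self.write_buffer.clear()


mem = CascadedMemory(4)
mem.write(0, 7)
mem.establish_checkpoint()
mem.write(0, 9)
mem.write(1, 5)
try:
    mem.establish_checkpoint(fault_in=0)   # fault while updating the first archive
except RuntimeError:
    mem.recover()                          # main memory returns to the prior checkpoint
```

In this sketch a fault during the update of the first archive restores the previous checkpoint from the second, while a fault during the update of the second archive restores the new checkpoint from the first. The three full-size arrays make the triplication of memory apparent.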
Other techniques have been developed to establish mirroring of data on disks rather than in main memory. U.S. Pat. No. 5,247,618 discloses one example of such a scheme. As a disk access is orders of magnitude slower than a main memory access, such schemes have been limited to mirroring data files, that is, to providing a backup to disk files should the primary access path to those files be disabled by a fault. No attempt is made to retain program continuity or to recover the running applications transparently to the users of the system. In some cases, it is not even possible to guarantee that different mirrored files are consistent with one another, only that each file is consistent with other copies of the same file.
Disk control systems have also been developed as an alternative method of checkpointing. Shimizu discloses one such system in U.S. Pat. No. 5,752,268. In Shimizu's system, when an operating system generates a write request to a disk device, both the write request and the associated write data are first stored into a nonvolatile memory whereupon a signal is sent to the operating system acknowledging the storage of the write request and write data in nonvolatile memory. Afterwards, the write request and write data are read from the nonvolatile memory and stored in the hard disk. As this architecture combines both hardware and software, it suffers from problems common to both the software and hardware checkpointing designs. The use of a slow disk drive for the archival memory can also decrease processor performance significantly. In addition, since the Shimizu scheme is not user transparent, it requires the programmer to select which information to record at each checkpoint and when to record the information. Consequently, this architecture is programmer intensive and prone to human error.
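The write path described above, in which a request is acknowledged once it is held in nonvolatile memory and is only later transferred to the disk, may be illustrated by the following sketch. It is not taken from the Shimizu patent. The class `BufferedDiskController`, the `"ACK"` return value, and the choice to discard a buffered entry only after the disk write completes are assumptions made for the example; an in-memory queue and dictionary stand in for the nonvolatile memory and the hard disk.

```python
from collections import deque


class BufferedDiskController:
    """Hypothetical model of a disk write path staged through nonvolatile memory."""

    def __init__(self):
        self.nvram = deque()   # stands in for the nonvolatile memory
        self.disk = {}         # stands in for the hard disk: block number -> data

    def write(self, block, data):
        # Store the write request and its data, then acknowledge at once.
        self.nvram.append((block, data))
        return "ACK"

    def flush(self):
        # Afterwards, read each buffered request and store it on the disk.
        while self.nvram:
            block, data = self.nvram[0]
            self.disk[block] = data
            # Assumed ordering: discard the entry only after the disk write.
            self.nvram.popleft()


ctrl = BufferedDiskController()
ack = ctrl.write(3, b"payload")   # acknowledged before the disk is touched
pending = len(ctrl.nvram)         # one request is still held in nonvolatile memory
ctrl.flush()                      # the request is now stored on the disk
```

The acknowledgement hides the disk latency from the requester, but every request must still reach the slow disk eventually.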