Fault tolerance in computers is generally realized in one of two ways: through a hardware-intensive technique called masking, or through a software-based approach called checkpointing. Masking is achieved by replicating hardware and executing computer programs on several independent units in parallel. The outputs of these units are then compared to determine their validity. In the simplest and oldest embodiment of this technique, three complete computers are implemented and a simple majority vote on their outputs is used to determine the "correct" output. If at least two of the computers are functioning properly and the voter system itself is also working correctly, the potentially incorrect output of the malfunctioning computer is outvoted and the correct answer is indeed presented to the user. While there are other embodiments of masking that are somewhat more efficient, masking systems generally suffer from the significantly increased cost of the hardware that must be added to mask out the effect of a faulty component. In addition, masking protects only against hardware faults; a software bug that causes one unit to malfunction will also cause other units running the same software to malfunction in the same way. All outputs will then contain the same error, which will consequently pass undetected.
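The three-unit majority vote described above can be illustrated with a minimal sketch. The function name `majority_vote` and the representation of unit outputs as plain values are assumptions made for illustration only; an actual voter is a hardware element.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of units, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


# One malfunctioning unit out of three is outvoted.
hardware_fault = majority_vote([42, 42, 7])

# A software bug common to all units yields identical wrong outputs,
# which the voter accepts because they agree with one another.
common_mode_bug = majority_vote([41, 41, 41])

# With three different outputs no majority exists.
no_majority = majority_vote([1, 2, 3])
```

The second call shows why masking alone does not protect against software faults: the voter can only compare outputs, not judge their correctness.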
The alternative technique, checkpointing, has the potential to provide fault tolerance in a considerably more cost-effective way. This technique requires that the state of the entire computer be periodically recorded at time intervals designated as checkpoints. A fault may be detected either by a hardware fault monitor (e.g., by a decoder operating on data encoded using an error detecting code, by a temperature or voltage sensor, or by one device monitoring another identical device) or by a software fault monitor (e.g., an assertion executed as part of the executing code that checks for out-of-range conditions on stack pointers or addresses into a data structure). If a fault is detected, recovery involves first diagnosing and circumventing a malfunctioning unit, if possible, and then returning the system to the last checkpoint and resuming normal operation from that point.
Recovery is possible if sufficient hardware remains operational after any elements identified as faulty during the recovery process have been circumvented. In a multiprocessor system, for example, the system can continue to operate as long as at least one of the processors continues to function. Similarly, a system that can remap memory or redirect I/O through alternate ports can survive the loss of memory or I/O resources as well. Moreover, most faults encountered in a computer system are transient or intermittent in nature, exhibiting themselves as momentary glitches. It is therefore generally possible to recover from such faults without circumventing any hardware. However, since transient and intermittent faults can, like permanent faults, corrupt the data that is being manipulated at the time of the fault, it is necessary to have a consistent state to which the computer can return following such events. This is the purpose of the periodic checkpointed state.
Since checkpoints are typically established every 50 milliseconds or so, rolling an executing program back to its last checkpoint is generally entirely transparent to a user. If handled properly, all applications can be resumed from their last checkpoints with no loss of continuity and no contamination of data.
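The checkpoint and rollback cycle described above can be sketched as follows. This is a simplified model under stated assumptions: the class `CheckpointedMachine`, the exception `FaultDetected`, the helper functions, and the representation of machine state as a dictionary are all hypothetical, and checkpoints are taken every few steps rather than on a timer.

```python
import copy


class FaultDetected(Exception):
    """Raised by a (hypothetical) hardware or software fault monitor."""


class CheckpointedMachine:
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def establish_checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        self.state = copy.deepcopy(self._checkpoint)


def run(machine, steps, checkpoint_every=2, max_rollbacks=3):
    """Execute steps in order; on a detected fault, roll back and resume from the last checkpoint."""
    i = resume_at = rollbacks = 0
    while i < len(steps):
        try:
            steps[i](machine.state)
        except FaultDetected:
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise  # treat as a permanent fault that needs diagnosis
            machine.roll_back()
            i = resume_at
            continue
        i += 1
        if i % checkpoint_every == 0:
            machine.establish_checkpoint()
            resume_at = i
    return rollbacks


def add(n):
    def step(state):
        state["total"] += n
    return step


def glitchy_add(n):
    """Corrupts the state and signals a fault on its first execution only (a transient)."""
    fired = []

    def step(state):
        if not fired:
            fired.append(True)
            state["total"] = -999  # data corrupted by the transient
            raise FaultDetected("transient glitch")
        state["total"] += n
    return step


machine = CheckpointedMachine({"total": 0})
rollbacks = run(machine, [add(1), glitchy_add(2), add(3), add(4)])
```

In this sketch the corrupted value never survives, because execution resumes from the checkpointed state and the transient does not recur on the retry.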
There are two primary advantages to checkpointing relative to masking. First, checkpointing is considerably less expensive to implement. Second, checkpointing offers protection against software as well as hardware faults. The first advantage simply reflects the fact that checkpointing does not require massive replication of hardware. The second advantage is a consequence of the fact that most software bugs remaining in well tested, mature software are exposed only in exceptional situations. Were this not true, the bugs would have been found and removed during normal testing. Such exceptional situations are generally caused by some asynchronous event such as an interrupt that forces program execution to follow a sequence that would not otherwise have been followed. If the system is forced to roll back to a consistent state and continue forward, that is, if the software bug is treated like a hardware transient, it is highly unlikely that the system will encounter exactly the same exception in exactly the same state as before. Consequently, it is highly unlikely that it will encounter the same bug a second time.
Checkpointing also suffers from two potential disadvantages relative to masking. First, masking generally results in instantaneous or near-instantaneous recovery from faults. Any resulting errors are simply masked, so no explicit recovery is necessary. Checkpointing requires that certain software routines be executed to diagnose the problem and to circumvent any permanently malfunctioning component of the computer. As a consequence, the resulting recovery time, typically on the order of one second, may preclude the use of this technique for achieving fault tolerance for some real-time applications where response times on the order of milliseconds or less are required. In applications in which humans directly interact with the computer (e.g., transaction processing applications), however, a momentary interruption of a second or so is entirely acceptable and, in fact, is generally not even perceptible. Thus, this potential disadvantage of checkpointing is not relevant to that class of applications.
Second, checkpointing has traditionally been achieved at the application level. Thus, the application programmer has been required to be concerned about what data has to be checkpointed, and when it should be done. This requirement places a serious burden on the programmer and has seriously impeded the widespread use of checkpointing as a means for achieving fault tolerance.
More recently, techniques have been developed that allow checkpointing to be done at the system software level so that the application programmer need not be concerned with attempting to identify the data that has to be checkpointed or even be aware that checkpointing is taking place. For this to be possible, the system itself must be able to establish periodic checkpoints, regardless of the applications that it might be running. U.S. Pat. Nos. 4,654,819 and 4,819,154 to Stiffler describe a computer system capable of doing exactly that. The system accomplishes this kind of checkpointing by requiring each of its processors to retain all modified data in its local cache until it is time to establish a new checkpoint, at which time all modified data is flushed out to main memory. Such caches are sometimes called blocking caches. Prior to flushing its blocking cache, a processor does a context switch during which it places the contents of its internal registers, including its program counter, on a stack which is flushed out with all the other modified data. Consequently, memory is updated all at once with data that is internally consistent, thereby establishing a checkpoint to which the system can safely return should it subsequently suffer a fault. To guarantee the ability to survive both main memory faults and faults experienced during the flushing operation itself, memory is duplicated, with each data item stored in both a primary location and a shadow location.
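The role of the primary and shadow locations during a flush can be illustrated with a simplified model. This is a sketch only and does not reproduce the mechanism of the cited patents: the class `DuplicatedMemory`, the `primary_complete` flag, and the `fail_after` fault-injection parameter are assumptions introduced for illustration.

```python
class DuplicatedMemory:
    """Toy model: every item is stored in a primary and a shadow location."""

    def __init__(self):
        self.primary = {}
        self.shadow = {}
        self.primary_complete = True  # False while the primary copy is mid-update

    def flush(self, dirty, fail_after=None):
        """Write all modified data to primary, then to shadow.

        fail_after (hypothetical) injects a fault after that many writes.
        """
        writes = 0
        self.primary_complete = False
        for target in (self.primary, self.shadow):
            for addr, value in dirty.items():
                if fail_after is not None and writes == fail_after:
                    raise RuntimeError("fault during flush")
                target[addr] = value
                writes += 1
            self.primary_complete = True  # primary now holds the new checkpoint

    def recover(self):
        """Return memory to a consistent checkpoint after a fault."""
        if self.primary_complete:
            self.shadow = dict(self.primary)   # new checkpoint survives
        else:
            self.primary = dict(self.shadow)   # fall back to previous checkpoint
        self.primary_complete = True
        return dict(self.primary)


mem = DuplicatedMemory()
mem.flush({"a": 1, "b": 2})  # first checkpoint, no fault

try:
    mem.flush({"a": 10, "b": 20}, fail_after=1)  # fault while updating primary
except RuntimeError:
    state_after_early_fault = mem.recover()

try:
    mem.flush({"a": 10, "b": 20}, fail_after=3)  # fault while updating shadow
except RuntimeError:
    state_after_late_fault = mem.recover()
```

Under these assumptions, a fault during the first half of the flush returns the system to the previous checkpoint, while a fault during the second half leaves the new checkpoint intact; at no point are both copies inconsistent at once.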
While this technique does accomplish its goal of establishing checkpoints without burdening the application programmer, it does have certain disadvantages due to its dependence on the use of a blocking cache. Since a processor cannot write any cache line back to main memory unless it writes back all currently modified lines at the same time, any cache overflow or any request by one processor for data held in another processor's cache requires the processor releasing the data to flush its entire cache. This requirement precludes the use of standard cache coherency protocols (for example, the protocol described in U.S. Pat. No. 5,276,848 to Gallagher) and creates potential porting and performance problems when programs are executed that rely on such standard protocols.
Other methods for capturing data for checkpointing purposes have been proposed, for example, by Kirrmann (U.S. Pat. No. 4,905,196) and by Lee et al. ("A Recovery Cache for the PDP-11", IEEE Trans. on Computers, June, 1980). Kirrmann's method involves a cascade of memory storage elements consisting of a main memory, followed by two archival memories, each of the same size as the main memory. Writes to the main memory are also written by the processor into a write buffer. When it is time to establish a checkpoint, the buffered data is then copied by the processor first to one of the archival memories and then to the second, although techniques are also described that eliminate the need for one of the copies. The two archival memories ensure that at least one of them contains a valid checkpoint, even if a fault occurs while a buffer-to-memory copy is in progress. Some problems with this architecture include a triplication of memory, the use of slow memory for the archival memory and the effect on processor performance since the three memory elements are different ports on the same bus.
The paper by Lee et al. discusses a method for saving data in a recovery cache before updated data is written to memory, for all memory locations falling within an application-specified range of addresses. This method involves converting all writes to memory within the range specified by the application into read-before-write operations. If a fault occurs during the execution of the application, the contents of the recovery cache are stored back into main memory, thereby restoring it to the state it was in when the application began its current execution. One problem with this method is that it slows the host system, because the required read-before-write operations interfere with memory cycles. It also requires checkpointing to be handled or considered by the application programmer.
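The read-before-write idea can be sketched in a few lines. This is an illustrative model only, not the design from the Lee et al. paper: the class `RecoveryCache`, its method names, and the use of a Python list as main memory are assumptions.

```python
class RecoveryCache:
    """Saves the old contents of protected locations before they are overwritten."""

    def __init__(self, memory, lo, hi):
        self.memory = memory        # main memory, modeled as a list
        self.lo, self.hi = lo, hi   # application-specified range [lo, hi)
        self.saved = {}             # address -> value at start of execution

    def write(self, addr, value):
        # Within the protected range, read the old value before the first write.
        if self.lo <= addr < self.hi and addr not in self.saved:
            self.saved[addr] = self.memory[addr]
        self.memory[addr] = value

    def restore(self):
        """On a fault, store the saved values back into main memory."""
        for addr, old in self.saved.items():
            self.memory[addr] = old
        self.saved.clear()

    def commit(self):
        """On successful completion, discard the saved values."""
        self.saved.clear()


memory = [0, 1, 2, 3, 4]
cache = RecoveryCache(memory, lo=1, hi=4)
cache.write(1, 10)
cache.write(1, 11)   # only the original value 1 was saved
cache.write(4, 40)   # outside the protected range, not recoverable
cache.restore()
```

The extra read on the first write to each protected address is the source of the memory-cycle overhead noted above, and the need to choose `lo` and `hi` is what places the burden on the application programmer.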
Other techniques have been developed to establish mirroring of data on disks rather than in main memory. Since disk access is orders of magnitude slower than main memory access, such schemes have been limited to mirroring data files, that is, to providing a backup to disk files should the primary access path to those files be disabled by a fault. No attempt is made to retain program continuity or to recover the running applications transparently to the users of the system. In some cases, it is not even possible to guarantee that mirrored files are consistent with each other, only that they are consistent with other copies of the same file. U.S. Pat. No. 5,247,618 discloses one example of such a scheme.