This invention relates to apparatus and techniques for achieving fault tolerance in computer systems and, more particularly, to techniques and apparatus for establishing and recording a consistent system state from which all running applications can be safely resumed following a fault.
"Checkpointing" has long been used as a method for achieving fault tolerance in computer systems. It is a procedure for establishing and recording a consistent system state from which all running applications can be safely resumed following a fault. In particular, in order to checkpoint a system, the complete state of the system, that is, the contents of all processor registers, cache memories, and main memory at a specific instant in time, is periodically recorded to form a series of checkpointed states. When a fault is detected, the system, possibly after first diagnosing the cause of the fault and circumventing any malfunctioning component, is returned to the last checkpointed state by restoring the contents of all registers, caches and main memory from the values stored during the last checkpoint. The system then resumes normal operation. If inputs and outputs (I/Os) to and from the computer are correctly handled, and if, in particular, the communication protocols being supported provide appropriate protection against momentary interruptions, this resumption from the last checkpointed state can be effected with no loss of data or program continuity. In most cases, the resumption is completely transparent to users of the computer.
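The record-and-restore cycle described above can be illustrated with a minimal sketch. It assumes a toy machine whose entire state is a register dictionary and a small memory list; the names `MachineState` and `Checkpointer` are hypothetical and are used only for illustration, not taken from any actual system.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MachineState:
    """Toy stand-in for the complete system state (hypothetical model)."""
    registers: dict = field(default_factory=dict)
    memory: list = field(default_factory=lambda: [0] * 8)


class Checkpointer:
    """Records full copies of the state and restores the most recent one."""

    def __init__(self, state):
        self.state = state
        self.saved = copy.deepcopy(state)  # initial checkpoint

    def checkpoint(self):
        # Record the complete state at this instant.
        self.saved = copy.deepcopy(self.state)

    def rollback(self):
        # Restore registers and memory from the last checkpointed state.
        self.state.registers = copy.deepcopy(self.saved.registers)
        self.state.memory = list(self.saved.memory)


state = MachineState(registers={"pc": 0})
cp = Checkpointer(state)

state.registers["pc"] = 10
state.memory[3] = 42
cp.checkpoint()            # consistent state recorded here

state.registers["pc"] = 99  # work performed after the checkpoint
state.memory[3] = -1
cp.rollback()               # a fault is detected; return to the checkpoint
```

After the rollback, the toy state again holds the values recorded at the checkpoint, so execution could resume from that point. Copying the whole state at every checkpoint is the simplest possible approach and is shown only to fix ideas.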
Checkpointing has been accomplished in commercial computers at two different levels. Early checkpoint-based fault-tolerant computers relied on application-directed checkpointing. In this technique, one or more backup computers were designated for each running application. The application was then designed, or modified, to periodically send to its backup computer all state information that would be needed to resume the application should the computer on which it was currently running fail in some way before the application was able to establish the next checkpoint.
This type of checkpointing could be accomplished without any specialized hardware, but required that all recoverable applications be specially designed to support this feature, since most applications would normally not write the appropriate information to a backup computer. This special design placed a severe burden on the application programmer not only to ensure that checkpoints were regularly established, but also to recognize what information had to be sent to the backup computer. Therefore, in general, application-directed checkpointing has been used only for those programs that have been deemed especially critical and therefore worth the significantly greater effort required to program them to support checkpointing.
System-directed checkpointing has also been implemented in commercial computer systems. The term "system-directed" refers to the fact that checkpointing is accomplished entirely at the system software level and applications do not have to be modified in any way to take advantage of the fault-recovery capability offered through checkpointing. System-directed checkpointing has the distinct advantage of relieving the application programmer of all responsibility for establishing checkpoints. Unfortunately, its implementation has been accomplished through the use of specialized hardware and software, making it virtually impossible for such systems to remain competitive in an era of rapidly advancing state-of-the-art commodity computers.
More recently, techniques have been disclosed for achieving system-directed checkpointing on standard computer platforms. These techniques, however, all require specialized plug-in hardware components. These plug-in components intercept either all reads from memory, or all writes to memory, so that the information needed to establish a checkpoint is made available to the checkpointing software. This approach suffers from two major disadvantages. First, the intercepting hardware introduces additional delays in the processor-to-memory path, making it difficult to meet the very tight timing requirements for memory access in state-of-the-art computers. Second, new hardware has to be developed for each set of memory-control chips used in systems that are to be endowed with this capability. Since new memory-control chip sets are developed with high frequency in the rapidly evolving computer industry, it is costly to make this capability available on a continuing basis.
In accordance with one illustrative embodiment of the invention, a memory map that is normally used to convert virtual memory addresses into physical memory addresses is also used to guarantee that an image of each data page to be modified is captured before that modification occurs. In particular, following each checkpoint, all pages, including both read-only pages and read/write pages, are mapped as read-only pages. Therefore, when an attempt is made to write to a page, a system interrupt is generated. If the page is a genuine read-only page, then the normal page-fault interrupt protocol is followed. If the page is a read/write page that has temporarily been labeled read-only, the page is copied to a buffer and the memory map is changed to indicate that the page is now a read/write page. Normal processing then resumes.
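The write-protection scheme of this embodiment can be modeled in a few lines. The following is a minimal simulation, not an operating-system implementation: the class `PagedMemory`, the exception `ReadOnlyViolation`, and the four-byte page size are hypothetical, and the `rollback` method reflects an assumption about how the buffered page images would be used after a fault.

```python
PAGE_SIZE = 4  # hypothetical, tiny page size for illustration


class ReadOnlyViolation(Exception):
    """Stands in for the normal page-fault protocol on a true read-only page."""


class PagedMemory:
    def __init__(self, num_pages, read_only=()):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.truly_read_only = set(read_only)
        self.writable_now = set()  # pages currently mapped read/write
        self.pre_images = {}       # buffer: page number -> image before modification

    def checkpoint(self):
        # Commit: discard old images and map every page read-only again.
        self.pre_images.clear()
        self.writable_now.clear()

    def write(self, page, offset, value):
        if page not in self.writable_now:
            self._write_fault(page)  # simulated system interrupt
        self.pages[page][offset] = value

    def _write_fault(self, page):
        if page in self.truly_read_only:
            raise ReadOnlyViolation(page)
        # Temporarily protected read/write page: capture its image first,
        # then change the map so later writes proceed without faulting.
        self.pre_images[page] = bytes(self.pages[page])
        self.writable_now.add(page)

    def rollback(self):
        # Assumed recovery step: restore every modified page from the buffer.
        for page, image in self.pre_images.items():
            self.pages[page][:] = image
        self.checkpoint()


mem = PagedMemory(num_pages=4, read_only={0})
mem.write(1, 0, 7)
mem.checkpoint()

mem.write(1, 0, 9)   # first write since the checkpoint: page 1 is buffered
mem.write(1, 1, 3)   # no further fault for page 1
mem.write(2, 1, 5)   # page 2 is buffered
buffered = sorted(mem.pre_images)
mem.rollback()
```

Only the first write to each page after a checkpoint incurs the cost of a fault and a copy, which is the property that lets the scheme run on an unmodified memory-management unit.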
In accordance with another embodiment of the invention, after the aforementioned system interrupt occurs, the identity of the page is recorded, but the page itself is not copied. In addition, the locations of all pages modified through I/O events are also recorded. At the time of a checkpoint, the checkpoint software copies the contents of all modified pages from the primary computer to a secondary computer. The primary computer halts all further processing until all pages have been thus copied, at which time the checkpoint is committed and normal processing resumes.
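A minimal sketch of this variant follows. It assumes that I/O writes do not pass through the write-fault path and are therefore logged separately; the classes `Primary` and `Secondary` and their method names are hypothetical, and the transfer to the secondary computer is modeled as a simple in-process copy.

```python
PAGE_SIZE = 4  # hypothetical, tiny page size for illustration


class Secondary:
    """Holds the last committed image of every page."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]


class Primary:
    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.writable_now = set()  # pages currently mapped read/write
        self.dirty = set()         # identities of pages modified since the checkpoint

    def cpu_write(self, page, offset, value):
        if page not in self.writable_now:  # simulated write fault
            self.dirty.add(page)           # record the identity only; no copy
            self.writable_now.add(page)
        self.pages[page][offset] = value

    def io_write(self, page, data):
        # Assumption: I/O does not trigger the write fault, so the
        # affected page is recorded explicitly.
        self.dirty.add(page)
        self.pages[page][:len(data)] = data

    def checkpoint(self, secondary):
        # Processing is assumed halted while the modified pages are copied.
        copied = sorted(self.dirty)
        for page in copied:
            secondary.pages[page] = bytes(self.pages[page])
        # Commit: clear the record and write-protect every page again.
        self.dirty.clear()
        self.writable_now.clear()
        return copied


primary = Primary(num_pages=4)
secondary = Secondary(num_pages=4)

primary.cpu_write(1, 0, 7)
primary.io_write(3, b"\x01\x02")
copied = primary.checkpoint(secondary)   # pages 1 and 3 are transferred
second = primary.checkpoint(secondary)   # nothing modified since the commit
```

Compared with buffering each page at fault time, this variant defers all copying to the checkpoint itself, so the work done inside the fault handler is reduced to recording a page number.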
In accordance with yet another embodiment, a secondary computer is used as described in the previous embodiment, but instead of halting the primary computer during page copying, normal processing is resumed as soon as all caches have been flushed, and the modified pages are then copied as a background task.
In still another embodiment of the invention, the write-buffering technique used for local and remote checkpointing can also be used in a clustered environment with one computer effectively serving as a backup for every other computer in the cluster. The aforementioned method and apparatus enable checkpointing techniques to be realized using standard hardware platforms running standard operating systems. As a consequence, otherwise standard computers can be endowed with a significant level of fault tolerance without requiring the major hardware and software modifications normally associated with fault-tolerant computers. All applications receive the benefit of fault tolerance without having to be modified in any way.