The present invention relates to fault tolerant computers, and more particularly, to computer systems that utilize a checkpointing error recovery system to recover from a system failure.
Consider a computer system that must function properly for an extended period of time. If an error occurs during the system's operation, the system can, in principle, be restarted from some known point in the operation that precedes the error, thereby losing only the time invested in the system operation between the restart point and the point at which the failure occurred.
One type of fault tolerant computer system utilizes a fault detection system that depends on the state of the computer being periodically recorded. For example, U.S. patent application Ser. No. 09/111,250, which is hereby incorporated by reference, describes a fault tolerant computer system in which the state of the computer is recorded in a second "slave" computer. In this system, the master computer stores data in a master memory. The CPU includes at least one cache and reads and writes in cache lines having several words per line. Each time a line is written into master memory, a copy of the line is transferred to the slave memory and a copy of the contents of the slave memory at that location is transferred to a FIFO that is connected to the slave computer. If an error occurs, the contents of the FIFO buffer can be utilized to reconstruct the state of the slave computer's memory at the last checkpoint, and the system is restarted on the slave computer in a state that matches that of the master computer at the end of the last checkpoint.
At periodic intervals, the slave computer generates, or receives, a checkpoint signal. The signal may be generated internally by an interrupt timer associated with either of the computers, by a hardware or software interrupt generated by either computer, or by hardware that is external to both CPUs. When the checkpoint signal is received by the slave computer, the slave computer stores the contents of its registers in a predetermined location so that the slave can be returned to the state at the checkpoint. Since no errors have occurred since the last checkpoint, the FIFO buffer is merely dumped, and the next computational cycle is begun.
As noted above, if an error occurs in the master computer system and the master computer system cannot recover from the error, program execution is transferred to the slave computer system, which begins execution from the last checkpoint. The slave computer uses the contents of the FIFO buffer to return the slave memory to the state it had at the end of the last checkpoint. At this point, the contents of the master and slave memories are synchronized as of that checkpoint. The slave computer then loads its registers from the register images stored at the end of the last checkpoint interval and picks up where the master computer left off.
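The master/slave scheme described above can be illustrated with a minimal Python sketch. It is a toy model, not the implementation of the referenced application: the class name `SlaveMirror`, its method names, and the representation of a cache line as a single integer are all hypothetical. In particular, replaying the FIFO newest-first during recovery is an assumption; the text states only that the FIFO contents are used to reconstruct the slave memory.

```python
from collections import deque


class SlaveMirror:
    """Toy model of the slave side: mirrored memory plus a FIFO of overwritten lines.

    All names here are hypothetical; a cache line is modeled as one integer.
    """

    def __init__(self, num_lines):
        self.memory = [0] * num_lines   # slave copy of the master memory
        self.fifo = deque()             # (address, previous slave contents)
        self.saved_registers = {}       # register image taken at the last checkpoint

    def mirror_write(self, address, line):
        # Save the slave's old contents before accepting the master's new line.
        self.fifo.append((address, self.memory[address]))
        self.memory[address] = line

    def checkpoint(self, registers):
        # No error since the last checkpoint: record the registers, dump the FIFO.
        self.saved_registers = dict(registers)
        self.fifo.clear()

    def recover(self):
        # Assumption: undo newest-first, so the oldest saved value for an
        # address is the one that ends up in memory.
        while self.fifo:
            address, old_line = self.fifo.pop()
            self.memory[address] = old_line
        return dict(self.saved_registers)
```

In this sketch, calling `recover()` after a simulated master failure returns the slave memory to its contents at the last checkpoint and hands back the register image from which execution would resume.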
While this system provides a high degree of fault tolerance, it is quite expensive, since it requires a second computer to act as the slave.
Broadly, it is the object of the present invention to provide an improved checkpoint computer system.
It is a further object of the present invention to provide a checkpoint computer system that operates on a single computer, thereby eliminating the need for a slave computer.
These and other objects of the present invention will become apparent to those skilled in the art from the following detailed description of the invention and the accompanying drawings.
The present invention is a fault-tolerant computer system having an application memory organized as a plurality of cache lines, each cache line being identified by an address in the memory. A FIFO buffer stores a plurality of such cache lines. The system includes at least one CPU for executing instructions stored in the application memory. The system includes a state memory for storing the contents of the internal registers of the CPU. A checkpoint controller defines a series of repeating checkpoint cycles. The application memory and FIFO buffer are operated under the control of a memory controller. The checkpoint controller also has access to a plurality of registers in the CPU that define the state of that CPU at a point in each checkpoint cycle that is controllable by the checkpoint controller. When the memory controller receives a cache line from the CPU in response to a write command specifying an address A in the application memory at which the cache line is to be stored, the cache line currently stored in the application memory at A is copied into the FIFO buffer if the write command is the first one specifying A since the start of the current checkpoint cycle. The cache line received in the write command is then used to overwrite the contents of address A in the application memory. At a predetermined point in each checkpoint cycle, the checkpoint controller causes the CPU to cease processing instructions from the application memory; to write back to the memory all dirty cache lines; and to store its internal registers defining the state of the CPU in the state memory. The checkpoint controller empties the contents of the FIFO buffer at the end of each checkpoint cycle if no error has been detected by the end of the checkpoint phase of the cycle. If an error is detected, the contents of the FIFO buffer are read back into the application memory, and the contents of the state memory are read back into the CPU. The system is then restarted after completing any hardware configuration needed for the restart.
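The single-computer scheme summarized above can likewise be sketched in Python. This is an illustration under stated assumptions rather than the claimed hardware: the class `CheckpointedMemory`, its methods, the `logged` set used to recognize the first write to an address in a cycle, and the modeling of a cache line as one integer are all hypothetical, since this summary does not say how the memory controller tracks first writes.

```python
class CheckpointedMemory:
    """Toy model of application memory, FIFO buffer, and state memory.

    All names are hypothetical; a cache line is modeled as one integer.
    """

    def __init__(self, num_lines):
        self.memory = [0] * num_lines   # application memory, one entry per cache line
        self.fifo = []                  # (address, contents at the start of the cycle)
        self.logged = set()             # assumption: addresses already logged this cycle
        self.state_memory = {}          # CPU register image from the last checkpoint

    def write(self, address, line):
        # Only the first write to an address in a cycle saves the old line.
        if address not in self.logged:
            self.fifo.append((address, self.memory[address]))
            self.logged.add(address)
        self.memory[address] = line

    def checkpoint(self, cpu_registers, dirty_lines):
        # The CPU has stopped: write back dirty cache lines, then save registers.
        for address, line in dirty_lines.items():
            self.write(address, line)
        self.state_memory = dict(cpu_registers)
        # No error detected by the end of the checkpoint phase: empty the FIFO.
        self.fifo.clear()
        self.logged.clear()

    def rollback(self):
        # Error detected: read the saved lines back into application memory
        # and return the register image to be loaded into the CPU.
        for address, old_line in self.fifo:
            self.memory[address] = old_line
        self.fifo.clear()
        self.logged.clear()
        return dict(self.state_memory)
```

Because each address is logged at most once per cycle, the FIFO in this sketch holds exactly the pre-cycle contents of every modified line, so the order in which entries are read back during `rollback()` does not matter.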