One type of fault-tolerant computer system utilizes a fault detection system that depends on the state of the computer being periodically recorded at checkpoints. In one version of this type of system, the state of the computer is recorded in a second "slave" computer. If an error is detected between checkpoints, the slave computer takes over from the state recorded at the last checkpoint. When a cache line is written into the memory of the "master" computer, the same cache line is copied into a buffer in the slave computer system. At each checkpoint, the contents of the buffer are written into the memory of the slave computer, thereby bringing the master and slave memories into synchronization at the checkpoint. If a failure occurs, the slave computer's memory is already synchronized with the master computer's memory as it existed at the last checkpoint. Hence, the slave computer can take over the computation starting from that point.
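The buffering scheme described above can be illustrated with a minimal Python sketch. It is a toy model, not an implementation from any particular system: the class name `CheckpointedSlave`, its method names, and the use of a dictionary as the slave memory image are all assumptions made for illustration, as is the choice to discard uncommitted writes when the slave takes over.

```python
from collections import deque


class CheckpointedSlave:
    """Toy model of the slave side: a FIFO buffer plus a memory image."""

    def __init__(self):
        self.fifo = deque()  # cache-line writes received since the last checkpoint
        self.memory = {}     # slave memory image: address -> cache-line contents

    def record_write(self, address, line):
        # Called whenever the master writes a cache line into its own memory.
        self.fifo.append((address, line))

    def checkpoint(self):
        # Drain the FIFO in arrival order so the latest write to an address wins.
        while self.fifo:
            address, line = self.fifo.popleft()
            self.memory[address] = line

    def take_over(self):
        # Assumption: writes buffered after the last checkpoint are discarded,
        # so the slave resumes from the state committed at that checkpoint.
        self.fifo.clear()
        return dict(self.memory)


slave = CheckpointedSlave()
slave.record_write(0x100, "A")
slave.record_write(0x140, "B")
slave.checkpoint()              # slave memory now matches the master
slave.record_write(0x100, "C")  # written after the checkpoint, not yet committed
recovered = slave.take_over()   # fault detected before the next checkpoint
```

In this sketch, `recovered` holds the two cache lines committed at the checkpoint, and the later write of `"C"` is not part of the recovered state.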
The buffer is typically a first-in-first-out (FIFO) buffer. The FIFO must be large enough to store all of the writes that occur between checkpoints. If a buffer overflow occurs, the state of the two systems will not be synchronized at the next checkpoint, and the error recovery system will fail. Accordingly, a large FIFO must be utilized, which increases the cost of the system.
Unfortunately, no FIFO size can guarantee that an overflow will not occur. Consider a case in which the FIFO gradually accumulates data during a checkpoint period. The transfer of the data to the slave memory for this checkpoint period does not start until the checkpoint period is completed. At this point, the slave begins to read entries from the FIFO and write those entries into the slave's memory. In the meantime, checkpoint data for the next period is arriving at the FIFO for storage. The FIFO now holds partial checkpoint data for both the previous period and the current period. If the inflow rate is particularly high, the FIFO can have more than two intervals' worth of data stored in it. The rate at which data accumulates is ultimately limited by the speed at which the slave computer can read the FIFO and then write its main memory. If the applications are generating a series of writes with no intervening memory cycles, the data will accumulate in the FIFO. The extent of the accumulation depends on the density of writes; hence, no FIFO size can assure that a failure will not occur. Such a failure would require stopping both machines and copying the master memory in its entirety into the slave memory. Since the memories in question may be quite large, it is advantageous to avoid such system failures.
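The accumulation described above can be made concrete with a small simulation. The following Python sketch is illustrative only: the function name `fifo_occupancy`, the per-period granularity, and the numeric rates are assumptions, and the model simply treats entries written during a period as eligible for draining only after that period has ended.

```python
def fifo_occupancy(writes_per_period, drain_per_period):
    """Return the FIFO occupancy at the end of each checkpoint period.

    writes_per_period: entries the master adds during each period.
    drain_per_period: maximum entries the slave can move from the FIFO
        into its memory during one period.
    """
    backlog = 0       # entries from completed periods still held in the FIFO
    occupancies = []
    for incoming in writes_per_period:
        # The slave drains only entries belonging to completed periods.
        drained = min(backlog, drain_per_period)
        # The current period's writes pile up on top of whatever is left over.
        backlog = backlog - drained + incoming
        occupancies.append(backlog)
    return occupancies


# Slave drains more slowly than the master writes: occupancy keeps growing.
growing = fifo_occupancy([100] * 5, drain_per_period=80)

# Slave drains at least as fast as the master writes: occupancy stays bounded.
bounded = fifo_occupancy([100] * 5, drain_per_period=120)
```

Under these assumed rates, `growing` rises by 20 entries every period, so any fixed FIFO size is eventually exceeded, whereas `bounded` never holds more than one period's worth of writes.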
Broadly, it is the object of the present invention to provide an improved checkpoint memory system.
It is a further object of the present invention to provide a checkpoint memory system that requires less FIFO buffer space than prior art systems.
It is a still further object of the present invention to provide a checkpoint memory system that does not fail if a buffer overflow occurs.
These and other objects of the present invention will become apparent to those skilled in the art from the following detailed description of the invention and the accompanying drawings.