In many computing environments, it is desirable to provide continuous operation, even in the event of component failure. Maintaining operation of a computing system during a component failure requires fault tolerance. One known technique for achieving fault tolerance employs a redundant process pair consisting of a primary process and a backup process. The primary process performs the work and periodically synchronizes the backup process with itself using checkpointing techniques. With prior known checkpointing techniques, the primary process sends the backup process messages that contain information about changes in the state of the primary process. Immediately after each checkpoint, the primary and backup processes are in the same state.
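By way of illustration only, the process-pair checkpointing described above can be sketched as follows. The class names `PrimaryProcess` and `BackupProcess`, the dictionary-based state, and the `pending` buffer are hypothetical and are not taken from any cited reference; the sketch assumes reliable, in-order delivery of checkpoint messages.

```python
from dataclasses import dataclass, field


@dataclass
class BackupProcess:
    """Hypothetical backup that mirrors the primary's state."""
    state: dict = field(default_factory=dict)

    def apply_checkpoint(self, changes: dict) -> None:
        # Merge the state changes carried by the checkpoint message.
        self.state.update(changes)


@dataclass
class PrimaryProcess:
    """Hypothetical primary that performs the work and checkpoints to a backup."""
    backup: BackupProcess
    state: dict = field(default_factory=dict)
    pending: dict = field(default_factory=dict)  # changes since the last checkpoint

    def write(self, key: str, value: int) -> None:
        # A state-changing operation is applied locally and remembered.
        self.state[key] = value
        self.pending[key] = value

    def checkpoint(self) -> None:
        # Send the accumulated changes to the backup, then clear them.
        self.backup.apply_checkpoint(dict(self.pending))
        self.pending.clear()


backup = BackupProcess()
primary = PrimaryProcess(backup)
primary.write("balance", 100)
primary.write("balance", 120)
states_match_before = primary.state == backup.state  # backup has seen nothing yet
primary.checkpoint()
states_match_after = primary.state == backup.state   # identical after the checkpoint
```

In this sketch the two processes diverge between checkpoints and are in the same state only immediately after `checkpoint()` returns.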
In other prior known checkpointing methods, no distinction is made between operations that change state (such as write operations) and operations that do not change state (such as read operations), and all operations are checkpointed to the backup process. Such a system is shown in U.S. Pat. No. 4,590,554 (Glazer—Parallel Computer Systems), where all inputs to the primary are provided via messages and all messages sent to the primary are made available to the secondary or backup, essentially allowing the backup to “listen in on” the primary's messages. Another such system is described in U.S. Pat. No. 5,363,503 (Gleeson—Unisys Corporation), where checkpointing is provided as described in U.S. Pat. No. 4,590,554.
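The "listen in" arrangement described above can be illustrated with a simplified sketch. It is not the implementation of either cited patent; the `Replica` class, the tuple message format, and the `deliver` helper are assumptions made purely for illustration.

```python
class Replica:
    """Hypothetical process that handles read and write messages."""

    def __init__(self):
        self.store = {}
        self.messages_handled = 0

    def handle(self, message):
        op, key, *rest = message
        self.messages_handled += 1
        if op == "write":
            self.store[key] = rest[0]
            return None
        # A read does not change state, yet it is still counted as handled.
        return self.store.get(key)


def deliver(message, primary, backup):
    # Every message, whether read or write, is also made available to the backup.
    backup.handle(message)
    return primary.handle(message)


primary, backup = Replica(), Replica()
messages = [("write", "x", 1), ("read", "x"), ("read", "x"), ("write", "y", 2)]
for message in messages:
    deliver(message, primary, backup)

# Reads were forwarded even though they could not have changed the backup's state.
reads_forwarded = sum(1 for m in messages if m[0] == "read")
```

Here the backup handles all four messages, including the two reads, which shows the redundant work that results when state-changing and non-state-changing operations are not distinguished.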
Other prior art, such as that shown in U.S. Pat. No. 4,228,496 (Katzman—Tandem Computers), describes a primary that receives a message, processes the message, and produces data. The produced data is stored in the primary's data space, thereby changing the primary's data space. The change in the primary's data space triggers a checkpointing operation in which the data space is made available to the backup. Thus, there is frequent copying of the primary's data space to the backup's data space, which uses a significant amount of time and memory for transferring the state of the primary to the backup. Such copying may also result in the interruption of service upon failure of the primary. The overhead of such checkpointing methods can impose considerable performance penalties.
Other prior art examples attempt to update only those portions of the primary's state that have changed since the previous update, but use complex memory and data management schemes. In others, as shown in U.S. Pat. No. 5,621,885 (Del Vigna—Tandem Computers), the primary and backup, which run on top of a fault tolerant runtime support layer (that is, an interface between the application program and operating system), are resident in memory and accessible by both the primary and backup CPUs used in the described fault-tolerance model. The primary and backup processes perform the same calculations because they include the same code.
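The difference between copying an entire data space on each change and transferring only the portions changed since the previous update can be illustrated with a simplified sketch. The `DataSpace` class and its dirty-index set are hypothetical; the memory and data management schemes of actual prior art systems are considerably more complex than this.

```python
class DataSpace:
    """Hypothetical data space that records which cells have changed."""

    def __init__(self, size):
        self.cells = {i: 0 for i in range(size)}
        self.dirty = set()  # indices modified since the last checkpoint

    def store(self, index, value):
        self.cells[index] = value
        self.dirty.add(index)

    def full_checkpoint(self):
        # Copy every cell, regardless of whether it changed.
        self.dirty.clear()
        return dict(self.cells)

    def delta_checkpoint(self):
        # Copy only the cells modified since the last checkpoint.
        delta = {i: self.cells[i] for i in self.dirty}
        self.dirty.clear()
        return delta


space = DataSpace(1000)
backup_cells = space.full_checkpoint()  # initial synchronization of all cells

space.store(7, 42)
space.store(8, 43)
delta = space.delta_checkpoint()
backup_cells.update(delta)

full_size = len(space.cells)  # entries a full copy would transfer
delta_size = len(delta)       # entries the delta actually transfers
```

In this sketch a full copy would transfer all 1000 entries after the two stores, while the delta transfers only the two changed entries, at the cost of tracking which entries are dirty.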
[Possible other prior art: Pat. Nos. 5,455,932; 5,157,663; 4,823,256; 5,155,678; 5,968,185; 5,802,265]
In addition, systems that provide fault tolerance and implement conventional checkpointing schemes typically use primary and backup systems that are physically equivalent in their implementation or that at least share resources.
In light of the above, it is desirable to arrive at an approach to checkpointing that may be used to foster fault tolerance without some or all of the drawbacks of the conventional checkpointing approaches described above.