Fault tolerant systems support computer architectures that can tolerate only a few minutes of downtime a year. Achieving extended computing uptime often requires redundant computing systems with multiple processors, specialized interconnects, and various monitoring and control modules. One approach to fault tolerant system design uses two or more processors operating in lock step. In these lock step systems, the processors perform substantially the same operations and provide substantially the same output data at substantially the same time. Accordingly, if one of the processors fails, a particular transaction or mathematical operation is still in process within the other secondary or standby processors as a result of the redundant processing paths. This processing redundancy is advantageous, but not without additional costs and considerations.
Another approach for achieving fault tolerance is to have two computers interconnected, such that one computer, the active computer or host, actively makes calculations while the other computer is idle or on standby in case the active computer experiences some failure. In these systems, information about the state of the active computer must be saved periodically to the standby computer, so that the standby computer can substantially take over at the point in the calculations where the active computer experienced a failure.
One way to synchronize the state of operations for two processors is through checkpointing. In checkpointing, the active processor halts, either periodically or in response to a specific event, and sends data about its state changes to the standby computer. During the checkpointing operation, the host computer is not performing useful calculations. The length of the checkpointing interval therefore needs to be kept to a minimum while still providing sufficient time for the requisite checkpoint operations to take place. Because of the nature of checkpointing data, the data must be complete and in the correct order on the standby computer when the data is acted upon or committed.
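The completeness and ordering requirement can be illustrated with a minimal sketch. It is not drawn from any particular system: the class names `Active` and `Standby`, the per-chunk sequence numbers, and the in-memory transfer are all assumptions made for illustration. The standby buffers incoming checkpoint data and commits only when every chunk has arrived, applying the chunks strictly in sequence order.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    """Hypothetical standby side: buffers chunks, commits only complete sets."""
    committed: dict = field(default_factory=dict)
    _pending: dict = field(default_factory=dict)

    def receive(self, seq: int, key: str, value: int) -> None:
        # Chunks may arrive in any order; hold them by sequence number.
        self._pending[seq] = (key, value)

    def commit(self, expected: int) -> bool:
        # Refuse to commit unless every chunk 0..expected-1 has arrived.
        if set(self._pending) != set(range(expected)):
            return False
        # Apply strictly in sequence order so later writes win.
        for seq in range(expected):
            key, value = self._pending[seq]
            self.committed[key] = value
        self._pending.clear()
        return True


@dataclass
class Active:
    """Hypothetical active side: tracks state changes since the last checkpoint."""
    state: dict = field(default_factory=dict)
    _dirty: list = field(default_factory=list)

    def write(self, key: str, value: int) -> None:
        self.state[key] = value
        self._dirty.append((key, value))

    def checkpoint(self, standby: Standby) -> bool:
        # The active side does no useful work while this method runs,
        # which is why the checkpoint interval should stay short.
        changes, self._dirty = self._dirty, []
        for seq, (key, value) in enumerate(changes):
            standby.receive(seq, key, value)
        return standby.commit(len(changes))
```

In this sketch, two writes to the same key within one checkpoint interval reach the standby as separate chunks, so committing them out of order would leave the standby holding a stale value; committing before all chunks arrive would leave it with a partial state.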
This issue becomes especially important when each processor runs a separate virtual machine for each of its applications. Each virtual machine requires its own checkpoint data and the transfer of that data to the corresponding standby virtual machine. Checkpointing several virtual machines can require a significant amount of time and is complicated to perform in a timely and organized manner. Processing slowdowns and errors can occur when performing such checkpointing.
The present invention addresses this issue.