There are a variety of ways to achieve fault tolerant computing. Specifically, hardware and software are typically used either alone or together. As an example, it is possible to connect two (or more) computers, such that one computer, the active computer or host, actively makes calculations while the other computer (or computers) is idle or on standby in case the active computer, or a hardware or software component thereon, experiences some type of failure. In these systems, the information about the state of the active computer must be saved periodically to the standby computer so that the standby computer can substantially take over at the point in the calculations where the active computer experienced a failure. This example can be extended to the modern day practice of using a virtualized environment as part of a cloud or other computing system.
Virtualization is used in many fields to reduce the number of servers or other resources needed for a particular project or organization. Present day virtual machine computer systems utilize virtual machines (VMs) operating as guests within a physical host computer. Each virtual machine includes its own virtual operating system and operates under the control of a managing operating system or hypervisor executing on the host physical machine. Each virtual machine executes one or more applications and accesses physical data storage and computer networks as required by the applications. In addition, each virtual machine may in turn act as the host computer system for another virtual machine.
Multiple virtual machines may be configured as a group to execute one or more of the same programs. Typically, one virtual machine in the group is the primary or active virtual machine and the remaining virtual machines are the secondary or standby virtual machines. If something goes wrong with the primary virtual machine, one of the secondary virtual machines can take over and assume its role in the fault tolerant computing system. This redundancy allows the group of virtual machines to operate as a fault tolerant computing system. The primary virtual machine executes applications, receives and sends network data, and reads and writes to data storage while performing automated or user initiated tasks or interactions. The secondary virtual machines have the same capabilities as the primary virtual machine, but do not take over the relevant tasks and activities until the primary virtual machine fails or is affected by an error.
For such a collection of virtual machines to function as a fault tolerant system, the operating state, memory and data storage contents of a secondary virtual machine should be equivalent to the operating state, memory and data storage contents of the primary virtual machine. If this condition is met, the secondary virtual machine may take over for the primary virtual machine without a loss of any data. To assure that the state of the secondary machine and its memory is equivalent to the state of the primary machine and its memory, it is necessary for the primary virtual machine periodically to transfer its state and memory contents to the secondary virtual machine. It is also necessary to coordinate the release of primary virtual machine egress network traffic with this periodic update of the secondary.
The periodic exchange of data to maintain synchrony between the states of the virtual machines is termed checkpointing. A checkpoint cycle comprises steps to identify, acquire, transfer, acknowledge, and commit. These cycles repeat, with each one defining a potential starting point for the secondary virtual machine in the event of a failure of the primary virtual machine.
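As an illustration only, the five steps of a checkpoint cycle can be sketched in a few lines of Python. The class and function names (PrimaryVM, SecondaryVM, checkpoint_cycle), the page-keyed memory model, and the reading of "identify" as dirty-page tracking and "acquire" as copying those pages are all assumptions made for this sketch; they are not taken from any particular hypervisor.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryVM:
    """Hypothetical primary VM: memory is modeled as a page-number -> bytes map."""
    memory: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)

    def write(self, page: int, data: bytes) -> None:
        self.memory[page] = data
        self.dirty.add(page)  # remember what changed since the last checkpoint


@dataclass
class SecondaryVM:
    """Hypothetical standby VM: a committed image plus a staging area."""
    committed: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)

    def receive(self, pages: dict) -> bool:
        self.staged = dict(pages)
        return True  # acknowledgement sent back to the primary

    def commit(self) -> None:
        self.committed.update(self.staged)
        self.staged = {}


def checkpoint_cycle(primary: PrimaryVM, secondary: SecondaryVM) -> bool:
    """Run one identify/acquire/transfer/acknowledge/commit cycle."""
    changed = sorted(primary.dirty)                         # identify
    snapshot = {p: primary.memory[p] for p in changed}      # acquire
    primary.dirty.clear()
    acknowledged = secondary.receive(snapshot)              # transfer + acknowledge
    if not acknowledged:
        primary.dirty.update(changed)  # retry these pages in the next cycle
        return False
    secondary.commit()                                      # commit
    return True
```

In this sketch the secondary's committed image changes only at the commit step, so it always reflects a complete checkpoint rather than a partially transferred one.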
In the event of a primary VM failure, the secondary VM is ‘rolled back’ to the most recently committed checkpoint and all pending (buffered) network egress frames from the failed primary are discarded. This allows the secondary to safely roll back and restart its processing without presenting conflicting results to network clients. Any new network egress traffic is again buffered until the next checkpoint cycle ‘commit’ allows it to be released.
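The buffering and discard behavior described above can be sketched as a small queue. This is a minimal illustration under stated assumptions: the EgressBuffer class, its method names, and the use of a Python list to stand in for the network are all hypothetical.

```python
from collections import deque


class EgressBuffer:
    """Hypothetical egress buffer: frames are held until a checkpoint commits."""

    def __init__(self) -> None:
        self.pending = deque()  # frames produced since the last commit
        self.released = []      # stand-in for frames actually put on the network

    def send(self, frame: bytes) -> None:
        # Outbound frames are never released directly; they wait for a commit.
        self.pending.append(frame)

    def on_commit(self) -> None:
        # The secondary now holds the state that produced these frames,
        # so clients may safely see them.
        while self.pending:
            self.released.append(self.pending.popleft())

    def on_primary_failure(self) -> int:
        # Frames from uncommitted execution are discarded; the secondary
        # resumes from the last committed checkpoint and regenerates output.
        dropped = len(self.pending)
        self.pending.clear()
        return dropped


buf = EgressBuffer()
buf.send(b"reply-1")
buf.on_commit()                      # reply-1 becomes visible to clients
buf.send(b"reply-2")                 # produced after the last commit
dropped = buf.on_primary_failure()   # reply-2 is never seen by clients
```

Because no client ever observes a frame from uncommitted execution, the secondary is free to produce different output after the roll back without contradicting anything already sent.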
The buffering of egress network traffic is thus an integral part of a checkpointing system. Unfortunately, this buffering adds substantial latency which can only be reduced by increasing the rate of checkpointing, which in turn increases system load. Even at the highest checkpointing rate possible, though, network latency continues to be substantially higher than with a non-checkpointing system due to the fundamental steps of checkpoint cycle processing.
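A rough model makes this tradeoff concrete. The sketch below assumes, purely for illustration, that frames are generated at uniformly random times within a checkpoint interval and that each frame then waits for a fixed cycle processing time before release. The function names and all numeric values are hypothetical and are not measurements.

```python
def mean_egress_delay_ms(interval_ms: float, cycle_overhead_ms: float) -> float:
    """Average added delay under the assumed model: half an interval of
    waiting for the next checkpoint, plus the cycle processing time."""
    return interval_ms / 2 + cycle_overhead_ms


def checkpoints_per_second(interval_ms: float) -> float:
    """Checkpoint rate implied by a given interval."""
    return 1000 / interval_ms


# Illustrative values only: a fixed 5 ms cycle overhead is assumed.
for interval in (100, 50, 10):
    print(interval,
          checkpoints_per_second(interval),
          mean_egress_delay_ms(interval, 5))
# prints:
# 100 10.0 55.0
# 50 20.0 30.0
# 10 100.0 10.0
```

Under this model, shortening the interval raises the checkpoint rate, and with it the system load, while the added delay never falls below the cycle overhead itself.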
A need therefore exists for ways to selectively eliminate this buffering-induced latency and corresponding checkpoint cycle overhead for applications capable of correctly handling a roll back.