There are a variety of ways to achieve fault tolerant computing. Specifically, fault tolerant hardware and fault tolerant software may be used either alone or together. As an example, it is possible to connect two (or more) computers, such that one computer, the active computer or host, actively makes calculations while the other computer (or computers) is idle or on standby in case the active computer, or a hardware or software component thereon, experiences some type of failure. In these systems, the information about the state of the active computer must be saved periodically to the standby computer so that the standby computer can substantially take over from the previously active computer at the point in the calculations where the active computer experienced a failure. This approach can be extended to the modern practice of using a virtualized environment as part of a cloud or other computing system.
Virtualization is used in many fields to reduce the number of servers or other resources needed for a particular project or organization. Present day virtual machine computer systems utilize virtual machines (VMs) operating as guests within a physical host computer. Each virtual machine includes its own virtual operating system and operates under the control of a managing operating system or hypervisor executing on the host physical machine. Each virtual machine executes one or more applications and accesses physical data storage and computer networks as required by the applications. In addition, each virtual machine may in turn act as the host computer system for another virtual machine.
Multiple virtual machines may be configured as a group to execute one or more of the same programs. Typically, one virtual machine in the group is the primary or active virtual machine, and the remaining virtual machines are the secondary or standby virtual machines. If something goes wrong with the primary virtual machine, one of the secondary virtual machines can take over and assume its role in the fault tolerant computing system. This redundancy allows the group of virtual machines to operate as a fault tolerant computing system. The primary virtual machine executes applications, receives and sends network data, and reads and writes to data storage while performing automated or user-initiated tasks or interactions. The secondary virtual machines have the same capabilities as the primary virtual machine, but do not take over the relevant tasks and activities until the primary virtual machine fails or is affected by an error.
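As an illustration only, the primary and standby roles described above can be sketched in a few lines of Python. The class name FaultTolerantGroup, the method report_failure, and the rule of promoting the first listed standby are hypothetical choices made for this sketch and are not taken from any particular system.

```python
class FaultTolerantGroup:
    """Toy model of a VM group: one primary and zero or more standbys.

    Names and the promotion rule are assumptions made for illustration.
    """

    def __init__(self, vm_names):
        if not vm_names:
            raise ValueError("a group needs at least one virtual machine")
        self.primary = vm_names[0]
        self.standbys = list(vm_names[1:])

    def report_failure(self, vm_name):
        """Record a failed VM and return the name of the current primary."""
        if vm_name == self.primary:
            if not self.standbys:
                raise RuntimeError("primary failed and no standby is available")
            # Assumed policy: promote the first standby in the list.
            self.primary = self.standbys.pop(0)
        elif vm_name in self.standbys:
            # A failed standby simply leaves the pool.
            self.standbys.remove(vm_name)
        return self.primary
```

In this sketch, a failure of the primary promotes a standby, while a failure of a standby only shrinks the pool of machines available to take over.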
For such a collection of virtual machines to function as a fault tolerant system, the operating state of a secondary virtual machine, which defines its memory and data storage contents, should be equivalent to the operating state, that is, the memory and data storage contents, of the primary virtual machine. If this condition is met, the secondary virtual machine may take over for the primary virtual machine without a loss of any data. To assure that the state of the secondary machine and its memory is equivalent to the state of the primary machine and its memory, it is necessary for the primary virtual machine to periodically transfer its state and memory contents to the secondary virtual machine.
The periodic transfer of data to maintain synchrony between the states of the virtual machines is termed checkpointing. A checkpoint defines a point in time when the data is to be transferred. During a checkpoint, the processing on the primary virtual machine is paused, so that the final state of the virtual machine and associated memory is not changed during the checkpoint interval. Once the relevant data is transferred, both the primary and secondary virtual machines are in the same state. The primary virtual machine is then resumed and continues to run the application until the next checkpoint, when the process repeats.
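The pause, transfer, and resume cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an actual hypervisor interface: the VirtualMachine class, its fields, and the checkpoint function are hypothetical, memory is modeled as a dictionary of pages, and the sketch assumes that only pages modified since the previous checkpoint are copied.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """Toy stand-in for a VM; 'memory' maps page numbers to page contents."""
    memory: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)
    running: bool = True

    def write(self, page, value):
        """Modify a page and remember that it changed since the last checkpoint."""
        if not self.running:
            raise RuntimeError("cannot write while the VM is paused")
        self.memory[page] = value
        self.dirty.add(page)


def checkpoint(primary, secondary):
    """Pause the primary, copy its changed pages to the secondary, then resume.

    Returns the number of pages transferred.
    """
    primary.running = False          # pause so state cannot change mid-transfer
    try:
        for page in primary.dirty:   # assumption: only modified pages are sent
            secondary.memory[page] = primary.memory[page]
        transferred = len(primary.dirty)
        primary.dirty.clear()
    finally:
        primary.running = True       # resume the primary even if the copy fails
    return transferred
```

After checkpoint returns, the memory of the secondary matches that of the primary, and the primary continues running until the next checkpoint.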
Checkpoints can be triggered either by the passage of a fixed amount of elapsed time since the last checkpoint or by the occurrence of some event during the execution of the application, such as: the number of modified memory pages (termed dirty pages); the occurrence of a network event (such as a network acknowledgement that is output from the primary virtual machine); or the occurrence of excessive buffering on the secondary virtual machine (as compared to available memory). Elapsed time checkpointing is considered fixed checkpointing, while event based checkpointing is considered dynamic or variable-rate checkpointing.
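By way of illustration, the triggers listed above can be combined into a single decision function. The sketch below is hypothetical: the CheckpointPolicy class, the checkpoint_reason function, the threshold values, and the order in which the conditions are tested are all assumptions chosen for the example. A real system may use only one class of trigger.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointPolicy:
    # All thresholds are illustrative assumptions, not values from the text.
    max_interval_s: float = 0.05       # fixed (elapsed time) trigger
    max_dirty_pages: int = 1000        # dirty page trigger
    max_buffer_fraction: float = 0.8   # secondary buffering vs. available memory


def checkpoint_reason(policy: CheckpointPolicy,
                      elapsed_s: float,
                      dirty_pages: int,
                      ack_pending: bool,
                      buffer_fraction: float) -> Optional[str]:
    """Return the name of the trigger that fired, or None if none did."""
    if elapsed_s >= policy.max_interval_s:
        return "elapsed_time"           # fixed checkpointing
    # The remaining triggers are event based (dynamic checkpointing).
    if dirty_pages >= policy.max_dirty_pages:
        return "dirty_pages"
    if ack_pending:
        return "network_event"          # e.g. an outgoing acknowledgement
    if buffer_fraction >= policy.max_buffer_fraction:
        return "secondary_buffering"
    return None
```

For example, with the default policy, a pending network acknowledgement shortly after the last checkpoint returns "network_event", while an idle system returns None until the fixed interval elapses.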
Excessive checkpointing can lead to performance degradation of the primary virtual machine. In turn, this performance degradation can result in delays and data loss, which can compromise the fault tolerant nature of the system. Triggering checkpoints in response to network traffic can be particularly taxing for a checkpointing system.
Therefore, a need exists for ways to reduce overhead in the system without reducing the reliability of the system.
Embodiments of the invention address this need and others.