Many computing systems are migrating to a “cloud computing” environment. Cloud computing is the use of a virtualized resource (referred to herein as a “virtual machine”) as a service over a network. The virtual machine can execute over a general technology infrastructure in the cloud. In other words, the virtual machine can operate on many different types of hardware computing systems or across several computing systems. The hardware computing systems are generally commodity-type systems that are both inexpensive and easy to operate. Cloud computing often provides common business applications online that are accessed over the network, while the software and data are stored on servers. Cloud computing generally eliminates the need to use specially designed hardware.
Unfortunately, commodity-type hardware can be prone to faults or breakdowns. As a result, the virtual machine may also be prone to faults caused by the loss of the underlying hardware platform. Some virtual machines execute applications that are required to be highly available. In other words, the applications cannot be prone to frequent faults. There have been attempts to create systems or processes to make virtual machines highly available. However, these prior approaches generally have significant shortcomings.
Prior approaches suffer from two major problems. First, to keep the protected virtual machine and the associated standby virtual machine in sync, the high availability protection has to be applied at the time when the virtual machine starts. The reason is that, once the virtual machine is running, the memory image associated with the virtual machine in the active host server's memory and on disk changes constantly. To have high availability protection, the protected virtual machine and the associated standby virtual machine must be in perfect sync (i.e., there cannot be changes while information is being replicated). Second, to capture the running state of the virtual machine (e.g., using live migration), the virtual machine must be suspended for an indeterminate amount of time, ranging from 10 ms to 10 seconds or even several minutes, depending on the state of the virtual machine. This suspension can be a major problem for communication services because, with a long suspension, live communication connections, e.g., Transmission Control Protocol (TCP) connections, will be lost due to timeouts. The loss of communication connections can then result in service disruptions that defeat the purpose of high availability protection.