Many computing systems are migrating to a “cloud computing” environment. Cloud computing is the use of a virtualized resource (referred to herein as a “virtual machine”) as a service over a network. The virtual machine can execute over a general technology infrastructure in the cloud. In other words, the virtual machine can operate on many different types of hardware computing systems or over several computing systems. The hardware computing systems are generally commodity type systems that are both inexpensive and easy to operate. Cloud computing often provides common business applications online that are accessed over the network, while the software and data are stored on servers. Cloud computing generally precludes the need to use specially designed hardware.
Unfortunately, the commodity type hardware can be prone to faults or breakdowns. As a result, the virtual machine may also be prone to faults from losing the underlying hardware platform. Some virtual machines execute applications that are required to be highly available. In other words, the applications cannot be prone to frequent faults. There have been attempts to create systems or processes to make virtual machines highly available. However, these prior approaches generally suffer from problems.
To copy data stored in memory used by a virtual machine (VM), the protected VM is generally suspended, and the memory pages that have changed (dirty pages) are copied to a local memory buffer. Once the copying process is completed, the protected VM resumes running while the dirty pages in the local memory buffer are transmitted to a standby host for system replication. Generally, the local memory buffer is pre-allocated with a fixed capacity in random access memory.
If the local memory buffer cannot hold all the dirty pages of the protected VM, prior systems generally flush the buffer by sending all the data in the local memory buffer to the standby host. Then, once the local memory buffer is empty again, the memory replication module copies the remaining dirty pages of the protected VM to the local memory buffer. This process is repeated until all dirty pages of the protected VM have been copied to the buffer, and the protected VM resumes running only once this copying process is completed. Thus, the VM remains suspended through at least an initial copy process, the send process, and then the rest of the copy process.
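The overflow handling described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the implementation of any particular prior system: the function name `checkpoint_with_fixed_buffer`, the `send_to_standby` callback, and the representation of pages as `bytes` objects are all hypothetical.

```python
from typing import Callable, List, Sequence


def checkpoint_with_fixed_buffer(
    dirty_pages: Sequence[bytes],
    buffer_capacity: int,
    send_to_standby: Callable[[List[bytes]], None],
) -> int:
    """Copy all dirty pages through a fixed-capacity buffer.

    Returns the number of flushes performed while the protected VM is
    still suspended (0 means the buffer never overflowed).
    """
    if buffer_capacity <= 0:
        raise ValueError("buffer_capacity must be positive")

    buffer: List[bytes] = []
    flushes_while_suspended = 0

    # The protected VM is assumed to be suspended from this point on.
    for page in dirty_pages:
        if len(buffer) == buffer_capacity:
            # Overflow: the buffer must be drained over the network
            # before any further dirty page can be copied.
            send_to_standby(buffer)
            buffer = []
            flushes_while_suspended += 1
        buffer.append(page)
    # All dirty pages are copied; the protected VM would resume here.

    # The last buffer contents are transmitted after the VM has resumed.
    if buffer:
        send_to_standby(buffer)
    return flushes_while_suspended
```

In this sketch, ten dirty pages and a buffer that holds four pages produce two flushes while the VM is suspended, followed by one final transmission after the VM resumes.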
This overflow of the buffer makes the memory replication very inefficient. The cost of handling a local memory buffer overflow is large because the protected VM has to remain suspended and wait until the copy and flush process completes. Further, the memory copying process is itself suspended until the local memory buffer has been emptied. The flushing process therefore adds network transmission overhead to the suspension time of the protected VM, and this overhead can be large, depending on the network. In addition, the network transmission overhead of each flush is proportional to the size of the local memory buffer being used, because the buffer size determines the amount of data that must be transmitted before the copying process can continue and before the protected VM is allowed to resume running.
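As a rough illustration of this cost, the following Python sketch estimates the suspension time under a deliberately simple model. The model is an assumption made only for this example: memory copying and network transmission each proceed at a constant rate, every flush sends exactly one full buffer, and the function name and parameters are hypothetical.

```python
import math


def suspension_time_s(
    dirty_bytes: float,
    buffer_bytes: float,
    copy_bytes_per_s: float,
    network_bytes_per_s: float,
) -> float:
    """Estimate how long the protected VM stays suspended (in seconds).

    Assumed model: every dirty byte is copied once at the memory copy
    rate, and each overflow adds one full-buffer network transfer.
    """
    # Number of full-buffer flushes needed before the last dirty page fits.
    flushes = max(0, math.ceil(dirty_bytes / buffer_bytes) - 1)
    copy_time = dirty_bytes / copy_bytes_per_s
    # Each flush costs time proportional to the buffer size.
    flush_time = flushes * buffer_bytes / network_bytes_per_s
    return copy_time + flush_time
```

Under this model, the suspension time equals the copy time alone when the buffer never overflows, and every overflow adds a term that grows with the buffer size and shrinks with the network bandwidth.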
In other systems, the local memory buffer is made very large to ensure that the buffer never overflows. Unfortunately, the number of dirty pages to copy for the protected VM often arrives in peaks and can vary by more than an order of magnitude depending on the running state of the protected VM. Thus, to protect against overflow, the buffer must be sized for the largest peak, which is also inefficient and costly. The large buffer creates a large memory footprint and takes away a significant portion of the system resources. Further, because of this resource overhead, the approach does not scale to support multiple protected VMs.
The management of the read and write threads that move data into and out of the buffer can be taxing on the processor and the system. Generally, a ring buffer requires frequent coordination between the read and write threads. The coordination ensures that the two threads operate in sequence and are synchronized. However, establishing the coordination between the read and write threads can incur significant performance overhead, because it requires signals to be sent between the controllers of the threads to wake up and stop the threads. These signals can become cumbersome, as greater synchronicity requires more frequent signals and more stop-and-go action for the read and write threads.
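The coordination described above can be illustrated with a minimal Python sketch of a bounded ring buffer shared by one write thread and one read thread. The class `RingBuffer`, the helper `replicate`, and the `signals` counter are hypothetical names introduced only for this example, and integers stand in for dirty pages. The sketch uses a single condition variable from Python's standard `threading` module; real systems may coordinate their threads differently.

```python
import threading
from typing import List, Optional, Tuple


class RingBuffer:
    """Fixed-capacity ring buffer shared by one writer and one reader."""

    def __init__(self, capacity: int) -> None:
        self._slots: List[Optional[int]] = [None] * capacity
        self._capacity = capacity
        self._head = 0    # index of the next slot to read
        self._tail = 0    # index of the next slot to write
        self._count = 0   # number of occupied slots
        self._cond = threading.Condition()
        self.signals = 0  # wake-up signals sent between the threads

    def put(self, item: int) -> None:
        with self._cond:
            while self._count == self._capacity:
                self._cond.wait()  # writer stops until a slot is freed
            self._slots[self._tail] = item
            self._tail = (self._tail + 1) % self._capacity
            self._count += 1
            self.signals += 1
            self._cond.notify_all()  # wake the reader if it is waiting

    def get(self) -> Optional[int]:
        with self._cond:
            while self._count == 0:
                self._cond.wait()  # reader stops until data arrives
            item = self._slots[self._head]
            self._head = (self._head + 1) % self._capacity
            self._count -= 1
            self.signals += 1
            self._cond.notify_all()  # wake the writer if it is waiting
            return item


def replicate(pages: List[int], capacity: int) -> Tuple[List[Optional[int]], int]:
    """Move pages through the ring buffer using two threads."""
    ring = RingBuffer(capacity)
    received: List[Optional[int]] = []

    def writer() -> None:
        for page in pages:
            ring.put(page)

    def reader() -> None:
        for _ in pages:
            received.append(ring.get())

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received, ring.signals
```

In this sketch every `put` and every `get` takes the shared lock and issues a signal, so moving N pages costs 2N signals regardless of the buffer capacity, which gives a sense of how the coordination overhead grows with the amount of data moved.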