Embodiments presented herein generally relate to fault tolerance in virtual machines, and more specifically, to improving micro-checkpointing performance to provide fault tolerance to a virtual machine cluster.
Fault tolerance allows a virtual machine to continue executing with little to no interruption after failure of one or more underlying physical components. Typical techniques for providing fault tolerance include synchronizing the memory contents of a virtual machine executing on a primary server with a copy hosted on a secondary server. As a result, the virtual machine state remains consistent across both the primary and secondary servers, so that even if the primary server goes offline (e.g., due to disk failure, power outage, routine maintenance, etc.), the virtual machine may continue to execute uninterrupted on the secondary server.
One approach for providing fault tolerance is micro-checkpointing (also known as continuous migration). Micro-checkpointing is a fault tolerance technique typically used to achieve symmetric multiprocessing (SMP) for host systems in a virtualization environment. In micro-checkpointing, the primary server initially uploads the memory of the virtual machine to the secondary server. Thereafter, the primary server periodically uploads updated memory pages and virtual machine state information (e.g., I/O state, processor state, network state, etc.) to the secondary server. To do so, the primary server suspends execution of the virtual machine and identifies changes made to the virtual machine since the last upload. The primary server then sends the updated pages and other state information to the secondary server. Suspending execution of the virtual machine prevents the virtual machine from making further memory updates while the primary server copies the identified updates to the secondary server, thus preserving consistency between the virtual machine state on the primary server and the secondary server. However, because the virtual machine remains stopped each time the primary server sends memory content to the secondary server, and these uploads recur continuously, performance of the virtual machine may suffer.
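For illustration only, the micro-checkpointing cycle described above can be sketched as a small in-memory simulation. This is a simplified sketch under stated assumptions, not an implementation of any particular hypervisor: the names ToyVM, Secondary, initial_upload, and micro_checkpoint are hypothetical, memory pages are modeled as dictionary entries, and dirty-page tracking is modeled as a set of page numbers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a virtual machine on the primary server."""
    pages: dict = field(default_factory=dict)      # page number -> contents
    cpu_state: dict = field(default_factory=dict)  # stand-in for processor/I/O state
    dirty: set = field(default_factory=set)        # pages written since last upload
    running: bool = True

    def write(self, page_no, data):
        # A suspended virtual machine cannot update memory.
        if not self.running:
            raise RuntimeError("virtual machine is suspended")
        self.pages[page_no] = data
        self.dirty.add(page_no)


@dataclass
class Secondary:
    """Hypothetical stand-in for the copy hosted on the secondary server."""
    pages: dict = field(default_factory=dict)
    cpu_state: dict = field(default_factory=dict)

    def apply(self, page_updates, cpu_state):
        self.pages.update(page_updates)
        self.cpu_state = dict(cpu_state)


def initial_upload(vm, secondary):
    """Send the entire memory once, then start tracking changes from here."""
    secondary.apply(dict(vm.pages), vm.cpu_state)
    vm.dirty.clear()


def micro_checkpoint(vm, secondary):
    """Suspend, send only the pages changed since the last upload, resume.

    Returns the number of pages sent.
    """
    vm.running = False  # suspend so no further memory updates can occur
    try:
        updates = {n: vm.pages[n] for n in vm.dirty}
        secondary.apply(updates, vm.cpu_state)
        vm.dirty.clear()
        return len(updates)
    finally:
        vm.running = True  # resume execution


# Example usage of the sketch.
vm = ToyVM(pages={0: "a", 1: "b", 2: "c"}, cpu_state={"pc": 0})
secondary = Secondary()
initial_upload(vm, secondary)

vm.write(1, "B")          # the virtual machine dirties one page
vm.cpu_state["pc"] = 42   # and its processor state advances
pages_sent = micro_checkpoint(vm, secondary)
```

In this sketch, the virtual machine is unavailable for the whole duration of micro_checkpoint, which mirrors the performance concern noted above: the longer the copy takes, the longer the virtual machine stays stopped.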