This invention relates to computing systems and, more particularly, to adding fault tolerance to virtual machine (VM) computing systems.
In recent years, virtualization has not only evolved into a key consolidation technology but has also formed the foundation for cloud computing. The ability to create and manage virtual machines has become a necessity for data-center management. Cloud providers typically use these techniques to consolidate multiple application VMs (potentially from different clients) onto a single node to minimize cost and take maximum advantage of virtualization. At the same time, cloud consumers expect VMs (and applications) to meet the agreed SLAs (Service Level Agreements) in order to satisfy the needs of their own clients. Today, two important ingredients of such SLAs are performance and availability constraints.
On the availability front, variants of SLAs typically include: (i) local restart, (ii) remote restart, (iii) live migration, and (iv) micro-checkpointing. While the first two are simple to implement, they incur some downtime for the application and lose all of the current VM state. Live migration, while offering no downtime, can take a few minutes to complete. The last item, micro-checkpointing, has been gaining attention in recent years as a way to switch to a secondary copy instantaneously, thereby offering high availability at the expense of some performance impact.
Most micro-checkpointing implementations today have a primary VM and a secondary VM (typically on different nodes to survive a node failure), with the secondary VM mirroring the primary. Frequent checkpoints (at millisecond intervals), each containing the data most recently modified on the primary VM, are sent from the primary to the secondary over the network. The secondary VM does not execute any application itself; it simply applies, in memory, all the modifications sent by the primary to reach the same state as the primary.
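By way of illustration only, the primary/secondary arrangement described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular implementation: VM memory is modeled as a dictionary of pages, the network transfer is reduced to a function call, and all names (`PrimaryVM`, `SecondaryVM`, `take_checkpoint`, `apply_checkpoint`, `run_demo`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PrimaryVM:
    """Hypothetical primary: runs the workload and tracks modified pages."""
    memory: Dict[int, bytes] = field(default_factory=dict)
    dirty: Set[int] = field(default_factory=set)

    def write(self, page_no: int, data: bytes) -> None:
        # Every write updates memory and marks the page as modified.
        self.memory[page_no] = data
        self.dirty.add(page_no)

    def take_checkpoint(self) -> Dict[int, bytes]:
        # A checkpoint carries only the pages modified since the last one.
        checkpoint = {p: self.memory[p] for p in self.dirty}
        self.dirty.clear()
        return checkpoint


@dataclass
class SecondaryVM:
    """Hypothetical secondary: executes nothing, only mirrors the primary."""
    memory: Dict[int, bytes] = field(default_factory=dict)

    def apply_checkpoint(self, checkpoint: Dict[int, bytes]) -> None:
        # Apply the received modifications in memory.
        self.memory.update(checkpoint)


def run_demo():
    primary, secondary = PrimaryVM(), SecondaryVM()

    # First checkpoint interval: two pages are written.
    primary.write(0, b"a")
    primary.write(1, b"b")
    first = primary.take_checkpoint()
    secondary.apply_checkpoint(first)  # stands in for the network transfer

    # Second interval: only page 1 changes, so only page 1 is sent.
    primary.write(1, b"c")
    second = primary.take_checkpoint()
    secondary.apply_checkpoint(second)

    return primary, secondary, first, second
```

In this sketch the second checkpoint contains a single page even though the VM holds two, which is why the per-checkpoint traffic tracks the rate of modification rather than the total memory size. The secondary nevertheless ends up holding a full copy of the primary's memory.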
While this approach allows an instantaneous switchover to the secondary in the event of a primary crash or failure, it has several disadvantages: (i) if the primary crashes, only one copy remains, so the guaranteed SLA is not honored for a span of time (converting the secondary into a new primary and instantiating a new secondary takes some time); (ii) the secondary consumes as much memory as the primary, which means that if the primary resides on a large-memory node running an enterprise application, the cloud provider must dedicate a similar node to the secondary at all times; (iii) a typical heterogeneous infrastructure, with several nodes of different memory sizes connected by different network links, is not used to its full extent (since only two nodes are involved); and (iv) if both the primary and secondary nodes crash, the application crashes (that is, the approach does not tolerate two node failures).