Virtualization is commonly applied on computer clusters to improve the robustness of the implemented computing architecture to faults and to increase utilization of the resources of the architecture. In a virtualized architecture, the processor units, e.g. processors and/or processor cores, of the computer systems in the cluster act as the physical hosts of virtual machines (VMs), which are seen by the outside world as independent entities. This facilitates robustness of the architecture to hardware failures, as upon a hardware failure a VM previously hosted by the failed hardware may be failed over to another host in some manner without the user becoming aware of the hardware failure. This concept is an important facilitator of so-called ‘high availability’ of a service provided by such a VM.
Implementing such a failover is not a trivial task, as the VM ideally should be relaunched in a state that is identical to the state of the VM at the point of the hardware failure to avoid inconvenience to the user.
In one approach, failover is provided by running multiple copies of a single VM in lock-step on different entities, e.g. different physical servers, such that upon the failure of one entity another entity can take over the responsibility for hosting the VM. A significant drawback of such lock-step arrangements is that processing resources are consumed by a failover copy of a VM, thus reducing the available bandwidth of the system, i.e. reducing the total number of VMs that can be hosted by a system.
In another approach commonly found in commercial products, a physical host responds to a failure of another physical host by simply rebooting the VM from a shared disk state, e.g. a shared image of the VM. This, however, increases the risk of disk corruption and of losing the exposed state of the VM altogether.
A different failover approach is disclosed in “Remus: High Availability via Virtual Machine Replication” by Brendan Cully et al. in NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, 2008, pages 161-174. In this approach, all VM memory is periodically marked as read-only to allow for changes to the VM memory to be replicated in a copy of the VM memory on another host. In this read-only state, a hypervisor is able to trap all writes that a VM makes to memory and maintain a map of pages that have been dirtied since the previous round. Each round, the migration process atomically reads and resets this map, and the iterative migration process involves chasing dirty pages until progress can no longer be made. This approach improves failover robustness because a separate, up-to-date image of the VM memory is periodically created on a backup host, which can simply launch a replica of the VM using this image following a hardware failure of the primary host.
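The per-round dirty-page tracking described above can be illustrated with a minimal software sketch. This is not the Remus implementation: the class `DirtyTracker`, the function `replicate_round`, and the 4096-byte page size are hypothetical names and values chosen for illustration, and the write trap of a real hypervisor is merely simulated by routing every write through a method.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)


class DirtyTracker:
    """Toy stand-in for a hypervisor that traps writes and keeps a dirty-page map."""

    def __init__(self, num_pages):
        self.memory = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()  # pages written since the previous round

    def write(self, addr, data):
        """Simulated guest write; a real hypervisor would trap it via a page fault."""
        for offset, byte in enumerate(data):
            page, index = divmod(addr + offset, PAGE_SIZE)
            self.memory[page][index] = byte
            self.dirty.add(page)

    def read_and_reset(self):
        """Return the pages dirtied since the last round and clear the map."""
        dirtied, self.dirty = self.dirty, set()
        return sorted(dirtied)


def replicate_round(tracker, backup):
    """Copy every dirtied page in full to the backup image; return bytes sent."""
    pages = tracker.read_and_reset()
    for page in pages:
        backup[page] = bytes(tracker.memory[page])
    return len(pages) * PAGE_SIZE
```

In this sketch a single one-byte write causes a whole page to be copied in the next round, and a round with no intervening writes copies nothing.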
However, a drawback of this approach is that, as the VM remains operational during the read-only state of its VM memory, a large number of page faults can be generated. In addition, this approach does not allow for the easy detection of what portion of a page has been altered, such that whole pages must be replicated even if only a single bit has been changed on the page. This is detrimental to the performance of the overall architecture: small page sizes have to be used to avoid excessive data traffic between systems, which in turn reduces the performance of the operating system, as the operating system is unable to use large pages.
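The effect of page size on replication traffic can be made concrete with a small calculation. The function `replication_traffic` and the example addresses below are hypothetical and serve only to illustrate whole-page replication; they are not taken from the cited work.

```python
def replication_traffic(write_addresses, page_size):
    """Bytes sent when every page touched by a write is replicated in full."""
    dirty_pages = {addr // page_size for addr in write_addresses}
    return len(dirty_pages) * page_size


KIB = 1024
MIB = 1024 * KIB

# Three single-byte writes at widely separated addresses (illustrative values).
writes = [0, 3 * MIB, 10 * MIB]

small = replication_traffic(writes, 4 * KIB)  # 3 pages of 4 KiB
large = replication_traffic(writes, 2 * MIB)  # 3 pages of 2 MiB
print(small, large)  # 12288 6291456
```

Under these assumptions the same three modified bytes cost 12 KiB of traffic with 4 KiB pages but 6 MiB with 2 MiB pages, which is why page-granular replication pushes towards small pages.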
U.S. Pat. No. 5,893,155 discloses a digital computer memory cache organization implementing efficient selective cache write-back, mapping and transferring of data for the purpose of roll-back and roll-forward of e.g. databases. Write or store operations to cache lines tagged as logged are written through to a log block builder associated with the cache. Non-logged store operations are handled locally in the cache, as in a write-back cache. The log block builder combines write operations into data blocks and transfers the data blocks to a log splitter. The log splitter demultiplexes the logged data into separate streams based on address.
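The data flow described above, from tagged cache lines through a log block builder to a log splitter, may be sketched in software as follows. The cited patent describes a hardware organization; the classes `Cache`, `LogBlockBuilder` and `LogSplitter`, the block size counted in records, and the splitting by fixed-size address region are all assumptions made here for illustration only.

```python
from collections import defaultdict


class LogSplitter:
    """Demultiplexes logged writes into per-region streams (region size is assumed)."""

    def __init__(self, region_size):
        self.region_size = region_size
        self.streams = defaultdict(list)

    def accept(self, block):
        for address, value in block:
            self.streams[address // self.region_size].append((address, value))


class LogBlockBuilder:
    """Combines logged writes into blocks and hands full blocks to the splitter."""

    def __init__(self, block_size, splitter):
        self.block_size = block_size  # records per block (illustrative)
        self.splitter = splitter
        self.pending = []

    def record(self, address, value):
        self.pending.append((address, value))
        if len(self.pending) >= self.block_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.splitter.accept(self.pending)
            self.pending = []


class Cache:
    """Stores to lines tagged as logged are written through to the builder."""

    def __init__(self, builder):
        self.lines = {}
        self.logged = set()
        self.builder = builder

    def tag_logged(self, address):
        self.logged.add(address)

    def store(self, address, value):
        self.lines[address] = value  # every store updates the cache itself
        if address in self.logged:   # only logged lines reach the log path
            self.builder.record(address, value)
```

In this sketch a store to an untagged line stays in the cache only, whereas stores to tagged lines are batched and end up in the stream for their address region.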
In short, the above approaches are not without problems. For instance, while the VM memory is held in a read-only state the still-running VM is prone to generating page faults, as previously explained. Furthermore, large amounts of data may have to be stored for each checkpoint, which puts pressure on the resource utilization of the computing architecture, in particular on the data storage facilities of the architecture.