This invention relates to the field of checkpoint-based high-availability solutions in mirrored virtual machines. In particular, the invention relates to storage writes in mirrored virtual machine checkpointing.
A virtual machine mirror is a way of running a virtual machine (VM) such that, if a hardware failure occurs, execution can continue from the mirror that exists on a second physical machine or a logical partition of the same physical machine. The virtual machine state is exchanged between a primary virtual machine and a secondary virtual machine. This is done by checkpointing the primary virtual machine: its state is captured and transferred to the secondary virtual machine. The aim is to reduce downtime caused by hardware failure in a computing system.
These checkpoint-based systems are built on top of existing virtual machine hypervisors and extend the hypervisor's functionality by capturing modifications to a primary virtual machine's memory state and transferring them to a secondary computing system at very frequent intervals (for example, every 25 ms).
The core idea is that, should the primary computing system fail, the secondary computing system has a virtual machine in almost precisely the same state, ready for immediate execution. When this secondary virtual machine is activated, it starts to receive and transmit network packets and perform disk I/O just as the virtual machine did when it ran on the primary computing system. Seen from the outside world, the effect is a brief (milliseconds) interruption of activity, much as if the network connection to the virtual machine had been briefly disconnected and reconnected.
Because the virtual machines are not kept in complete lockstep, but only synchronize on these frequent checkpoints, writes by the primary virtual machine to disk have to be handled specially. This is because, to ensure correctness, the secondary virtual machine must not only resume from a valid checkpoint of the primary virtual machine's state, but disk storage must also be in precisely the same state. In effect, the secondary virtual machine is the primary virtual machine “rolled back” some number of milliseconds, to the last checkpoint.
Checkpoint-based high-availability is a technique whereby a virtual machine running on a host machine (the “primary host”) regularly (for example, every 25 ms) mirrors its processor and memory state onto another host machine (the “secondary host”). The primary and secondary host machines may be logical partitions of the same physical machine.
The basic approach to the mirroring process involves the following steps:
        tracking changes to the memory of the virtual machine;
        periodically stopping the virtual machine;
        sending these changes over a network to the secondary host;
        waiting for the secondary host to acknowledge receipt of the memory and CPU state update; and
        resuming the virtual machine.
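The steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class names `PrimaryVM` and `SecondaryHost`, the page-indexed dictionary standing in for memory, and the direct method call standing in for the network are all hypothetical, and CPU state is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class SecondaryHost:
    """Hypothetical secondary host: stores the last mirrored memory state."""
    memory: dict = field(default_factory=dict)

    def receive(self, dirty_pages: dict) -> bool:
        self.memory.update(dirty_pages)
        return True  # acknowledgement of receipt


@dataclass
class PrimaryVM:
    """Hypothetical primary VM that records pages changed since the last checkpoint."""
    memory: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)
    running: bool = True

    def write_memory(self, page: int, value: bytes) -> None:
        self.memory[page] = value
        self.dirty.add(page)                      # track changes to memory

    def checkpoint(self, secondary: SecondaryHost) -> None:
        self.running = False                      # stop the virtual machine
        changes = {p: self.memory[p] for p in self.dirty}
        acked = secondary.receive(changes)        # send the changes
        if not acked:                             # wait for the acknowledgement
            raise RuntimeError("secondary did not acknowledge the checkpoint")
        self.dirty.clear()
        self.running = True                       # resume the virtual machine


secondary = SecondaryHost()
vm = PrimaryVM()
vm.write_memory(0, b"a")
vm.write_memory(7, b"b")
vm.checkpoint(secondary)
```

After `checkpoint` returns, the secondary holds a copy of every page the primary modified, and the primary's dirty set is empty again, ready for the next interval.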
This ensures that the secondary host is able to resume the workload with no loss of service should the primary host suffer a sudden hardware failure. This process is known as “failover”.
In a very naive implementation, network and disk I/O must cause checkpoints to be performed. This is because the primary host must not release a network packet or modify a block on disk and then fail, leaving the secondary host to resume from the last checkpoint and transmit the packet a second time, or read a block that no longer matches the checkpointed state. A packet must be transmitted only once, and the disk state must match that at the time the checkpoint was taken.
Concerning disk I/O, a naive implementation of “checkpoint-on-write” would perform a checkpoint on the primary just prior to issuing the I/O operation to the disk controller. One basic optimisation to “checkpoint-on-write” is to combine multiple writes and checkpoint several writes in one go. Conventionally, the virtual machine will track these I/O operations as pending until the checkpoint has completed and the I/O operation has been issued to, and completed on, the disk subsystem. This knowledge of pending I/O operations is exchanged as part of the checkpoint state, along with the CPU and memory state of the virtual machine. An example can be seen in FIG. 1.
Referring to FIG. 1, a diagrammatic illustration 100 of checkpoint-on-write as known in the prior art is provided, in which time progresses vertically down the illustration 100. A primary virtual machine 110 writes to disk blocks 120. Changed blocks are shown by diagonally hashed shading.
In this illustration 100, a first block 131 of a sequence of blocks 130 is modified by the primary virtual machine 110, followed by a second block 132. The modifications to the first and second blocks 131, 132 are held 141, 142 and written 143 to the disk blocks 120 at the next checkpoint 150. An acknowledgement 144 is sent by the disk blocks 120 to confirm the writes.
After the checkpoint 150, a further third block 133 is modified, followed by a fourth block 134. The modifications to the third and fourth blocks 133, 134 are held 145, 146 and written 147 to the disk blocks 120 at the next checkpoint 151. An acknowledgement 148 is sent by the disk blocks 120 to confirm the writes.
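The sequence of FIG. 1 can be sketched as a buffer that holds writes until the next checkpoint. This is a minimal model, not an implementation from the text: the class name `CheckpointOnWriteDisk` is hypothetical, and the block identifiers simply reuse the figure's reference numerals 131 and 132 for readability.

```python
class CheckpointOnWriteDisk:
    """Hypothetical model: writes are held and only reach disk at a checkpoint."""

    def __init__(self):
        self.blocks = {}   # on-disk state
        self.held = []     # writes held since the last checkpoint, in issue order

    def write(self, block, data):
        # The write is held; the on-disk state is unchanged for now.
        self.held.append((block, data))

    def checkpoint(self):
        """Write all held modifications to disk and return the acknowledged blocks."""
        acked = []
        for block, data in self.held:
            self.blocks[block] = data
            acked.append(block)
        self.held.clear()
        return acked


disk = CheckpointOnWriteDisk()
disk.write(131, b"first")             # modification held
disk.write(132, b"second")            # modification held
on_disk_before = dict(disk.blocks)    # still empty: nothing written yet
acked = disk.checkpoint()             # both writes issued and acknowledged
```

Neither write is visible on disk until `checkpoint` runs, which is the source of the added latency discussed next.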
The problem with this approach is the additional latency that it adds to write operations. A write operation does not complete successfully until after the next checkpoint, and so in a system where checkpoints are taken every 25 ms, this would add an average of 12.5 ms to every write.
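The 12.5 ms figure follows if writes are assumed to be issued at times spread evenly across the checkpoint interval, so that the mean wait is half the interval. The short calculation below checks this numerically; the function name `mean_added_latency` and the even spacing of arrival times are assumptions made for illustration.

```python
def mean_added_latency(interval_ms: float, samples: int = 1000) -> float:
    """Mean wait until the next checkpoint for writes issued at evenly spaced times.

    Each sample is a write issued at the midpoint of one of `samples`
    equal slices of the checkpoint interval.
    """
    step = interval_ms / samples
    waits = [interval_ms - (i + 0.5) * step for i in range(samples)]
    return sum(waits) / samples


latency = mean_added_latency(25.0)   # approximately 12.5 (half of 25 ms)
```

A workload whose writes cluster just after each checkpoint would see a longer mean wait, and one whose writes cluster just before would see a shorter one.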
If the primary host were to fail between an acknowledged checkpoint and the next checkpoint, it may be difficult to determine whether the pending I/O operations had completed. As such, all pending I/O operations are re-issued, forcing the disk subsystem to reflect the correct state.
A straightforward optimisation to the above is to ignore any I/O operations that do not modify the on-disk state (for example, a simple read operation). These can be allowed directly through without a checkpoint being performed, since they do not modify any state.
Two key drawbacks with the described approach are as follows:
        1. The storage I/O operations of the virtual machine must be intercepted and delayed until the next checkpoint is exchanged with the secondary machine. This increases the latency of I/O operations.
        2. In anything but the most naive implementation, operations that do not modify storage (such as a simple read) must be distinguished from those operations that do modify storage. This removes the latency overheads from those operations, but at the cost of having to inspect and understand the semantics of each I/O operation as it is performed.
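Re-issuing every pending operation can be sketched as follows. The sketch assumes that each pending operation is a whole-block overwrite, so that applying it a second time leaves the disk unchanged; the function name `fail_over` and the dictionary standing in for the disk are hypothetical.

```python
def fail_over(disk_blocks: dict, pending_writes: list) -> dict:
    """Re-issue every pending write recorded in the last acknowledged checkpoint.

    Writes that the failed primary had already completed are simply
    overwritten with the same data, so the result does not depend on
    how far the primary got before failing.
    """
    for block, data in pending_writes:
        disk_blocks[block] = data
    return disk_blocks


# Pending writes carried in the checkpoint state.
pending = [(131, b"first"), (132, b"second")]

# Assume the primary completed only the first write before failing.
disk = {131: b"first"}
fail_over(disk, pending)
```

The secondary does not need to know which writes completed, because the disk ends in the same state either way.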
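A minimal sketch of this optimisation is given below, assuming each request carries an explicit operation type. The names `Op` and `FilteringDisk` are hypothetical. For simplicity, a read returns only the on-disk state; how a read of a block with a held write should behave is not addressed here.

```python
from enum import Enum


class Op(Enum):
    READ = "read"
    WRITE = "write"


class FilteringDisk:
    """Hypothetical model: reads pass straight through, writes wait for a checkpoint."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})   # on-disk state
        self.held = []                     # writes held until the next checkpoint

    def submit(self, op, block, data=None):
        if op is Op.READ:
            # A read does not modify on-disk state, so no checkpoint is needed.
            return self.blocks.get(block)
        # Any operation that modifies state is held.
        self.held.append((block, data))
        return None

    def checkpoint(self):
        for block, data in self.held:
            self.blocks[block] = data
        self.held.clear()


disk = FilteringDisk({1: b"old"})
value = disk.submit(Op.READ, 1)        # returned immediately
disk.submit(Op.WRITE, 2, b"new")       # held until the checkpoint
held_count = len(disk.held)
disk.checkpoint()
```

The branch on `op` is where the cost arises: every request has to be inspected and classified before it can be forwarded or held.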
Therefore, there is a need in the art to address the aforementioned problem.