Virtualized computer systems have been used to provide fault tolerance capability. In these systems, a primary host computer system provides a primary virtual machine (VM), which executes a guest operating system and whatever applications or programs are required for the particular implementation. A second computer system, referred to herein as the “backup host,” executes an identical virtual machine, referred to herein as the “backup VM,” in parallel with the primary VM. The two virtual machines are configured identically with software and virtual hardware, and have identical virtual disks. They are executed simultaneously so that the virtual processors on each VM follow identical execution paths. In this fault-tolerant mode of operation, only the primary VM communicates with the outside world, i.e., provides services as needed to other computer systems or devices connected over a network or system bus. If the primary host fails, then the backup VM on the backup host can immediately take over.
In order for the backup VM to effectively reproduce the execution of the primary VM, it must receive nondeterministic events at the same point in its execution as the primary VM. Nondeterministic events are events that cannot be predicted based on the state of the processor. They include (i) inputs from the network external to the virtualized computer system, (ii) information regarding when virtual interrupts were delivered to the virtual processor due to external events, (iii) timer interrupts delivered to the processor, and (iv) timestamps delivered to the processor when it acquires the current time via various hardware functionality. To ensure that nondeterministic events are synchronized, existing systems use what is referred to as “record-replay”. In a record operation, the primary VM records each nondeterministic event, along with identifiers that specify the point in execution of the primary VM at which the nondeterministic event is received. These events are recorded in a log that is provided to the backup host, which injects the events into the backup VM at the corresponding point in its execution. Thus, the backup VM executes at a slight delay, on the order of hundreds of milliseconds, behind the primary VM.
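The record-replay scheme described above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the mechanism of any particular hypervisor: `ToyVM`, `LogEntry`, `run_primary`, and `run_backup` are hypothetical names, the "point in execution" is modeled as a simple step counter, and each nondeterministic event is reduced to an integer payload.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

MOD = 2**31 - 1  # prime modulus for the toy state update (illustrative choice)


@dataclass(frozen=True)
class LogEntry:
    exec_point: int  # hypothetical identifier: step count at which the event arrived
    kind: str        # e.g. "net", "timer" (labels only, not interpreted here)
    payload: int


class ToyVM:
    """Toy machine whose state evolves deterministically between events."""

    def __init__(self) -> None:
        self.exec_point = 0
        self.state = 0

    def step(self) -> None:
        # Deterministic execution: depends only on the current state.
        self.state = (self.state * 31 + 7) % MOD
        self.exec_point += 1

    def deliver(self, payload: int) -> None:
        # A nondeterministic event perturbs the state.
        self.state = (self.state + payload) % MOD


def run_primary(steps: int, events: Dict[int, List[Tuple[str, int]]]):
    """Run the primary and record each event with its execution point."""
    vm, log = ToyVM(), []
    for _ in range(steps):
        for kind, payload in events.get(vm.exec_point, []):
            log.append(LogEntry(vm.exec_point, kind, payload))
            vm.deliver(payload)
        vm.step()
    return vm, log


def run_backup(steps: int, log: List[LogEntry]) -> ToyVM:
    """Replay: inject each logged event at the same execution point."""
    vm, i = ToyVM(), 0
    for _ in range(steps):
        while i < len(log) and log[i].exec_point == vm.exec_point:
            vm.deliver(log[i].payload)
            i += 1
        vm.step()
    return vm


primary, log = run_primary(100, {3: [("net", 42)], 50: [("timer", 7)]})
backup = run_backup(100, log)
```

In this sketch the backup reaches the same final state as the primary only because every event is injected at the recorded step; a backup that never receives the log follows a different execution path.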
A number of storage-related issues must be addressed in order to effectively implement record and replay for VM fault tolerance. The general problem is that input/output (IO) to storage devices is asynchronous, and hence can be another source of non-determinism. For example, the state of the disk can be non-deterministic if multiple asynchronous IOs attempt to write to the same location on the disk. Also, in virtualized computer systems, physical DMA (Direct Memory Access) may be done directly from the physical disk that contains an image of the virtual disk to memory mapped to the virtual machine's virtual memory, and thus races on memory access may arise as well. That is, disk IOs may be implemented via DMA directly to the virtual memory of the VMs. Thus, any possible races caused by the asynchronous modification of the virtual memories of the VMs should be resolved. In addition, the exact same IO to a storage device can result in different completion statuses (either success or various kinds of errors), so the IO completion status can be another source of non-determinism.
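The disk-state race mentioned above can be made concrete with a small sketch. It only demonstrates the problem, under the assumption that a disk is a flat byte array and that each write is applied atomically when it completes; `apply_writes` and `overlaps` are hypothetical helper names introduced for illustration.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Write = Tuple[int, bytes]  # (byte offset, data), an illustrative representation


def apply_writes(disk_size: int, writes: List[Write],
                 completion_order: Sequence[int]) -> bytes:
    """Apply writes to a zeroed disk in the order in which they complete."""
    disk = bytearray(disk_size)
    for idx in completion_order:
        offset, data = writes[idx]
        disk[offset:offset + len(data)] = data
    return bytes(disk)


def overlaps(a: Write, b: Write) -> bool:
    """Return True if the two writes touch at least one common byte."""
    (off_a, data_a), (off_b, data_b) = a, b
    return off_a < off_b + len(data_b) and off_b < off_a + len(data_a)


def possible_outcomes(disk_size: int, writes: List[Write]) -> set:
    """Collect every final disk image reachable under some completion order."""
    orders = permutations(range(len(writes)))
    return {apply_writes(disk_size, writes, order) for order in orders}


# Two in-flight writes that share bytes 2..3: the final image depends on
# which one completes last.
racy = [(0, b"AAAA"), (2, b"BBBB")]
racy_outcomes = possible_outcomes(8, racy)

# Two disjoint writes: every completion order gives the same image.
disjoint = [(0, b"AAAA"), (4, b"BBBB")]
disjoint_outcomes = possible_outcomes(8, disjoint)
```

With overlapping writes, the primary and backup disks can diverge if their IOs complete in different orders, which is why the completion behavior of such IOs is a source of non-determinism that record and replay has to account for.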
Conventional computer systems typically implement fault tolerance by running virtual machines in exact lockstep using specialized hardware. However, this solution is not available to commodity systems, which lack such specialized hardware.