Conventional fault tolerance (FT) systems are typically based on lockstep execution of redundant hardware. That is, custom hardware ensures that a primary machine and a secondary machine are synchronized by driving the same clock signal to CPUs and Input/Output (I/O) subsystems of both the primary machine and the secondary machine; given the same initial state, and fully deterministic CPUs, two machines driven by the same clock signal remain in lockstep. Similarly, motherboards and chipsets are kept in lockstep by using a single system clock source. Custom logic is often used to compare I/O outputs of all motherboards, and initiate corrective actions such as failover on output mismatch.
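By way of illustration only, the lockstep-and-compare arrangement described above may be sketched in software as follows. This is a minimal sketch, not a description of any actual FT hardware: the `Machine` class, its `fault_at_tick` parameter, and the `run_lockstep` function are hypothetical names introduced here, and the state-update arithmetic is an arbitrary stand-in for a deterministic CPU.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Machine:
    """Hypothetical deterministic machine that advances once per clock tick."""
    state: int = 0
    ticks: int = 0
    # Assumption for illustration: corrupt the output on this tick to mimic a fault.
    fault_at_tick: Optional[int] = None

    def tick(self, io_input: int) -> int:
        """Consume one input, update state deterministically, return the I/O output."""
        self.ticks += 1
        self.state = (self.state * 31 + io_input) % 1_000_003
        output = self.state
        if self.fault_at_tick == self.ticks:
            output ^= 1  # simulated hardware fault flips one output bit
        return output


def run_lockstep(primary: Machine, secondary: Machine,
                 inputs: Iterable[int]) -> Optional[int]:
    """Drive both machines from one shared 'clock' and compare outputs each tick.

    Returns the tick number of the first output mismatch (the point at which
    corrective action such as failover would be initiated), or None if the
    machines stayed in lockstep for every input.
    """
    for tick, io_input in enumerate(inputs, start=1):
        out_primary = primary.tick(io_input)
        out_secondary = secondary.tick(io_input)
        if out_primary != out_secondary:
            return tick
    return None
```

In this sketch, two machines started from the same initial state and fed the same inputs never diverge, so `run_lockstep(Machine(), Machine(), [1, 2, 3, 4])` reports no mismatch, whereas injecting a fault into one machine causes the comparison to flag the tick at which the outputs first differ.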
Virtual machine technology has become widely recognized, and a virtualized computer system is often provided with FT capabilities so that it may continue to operate properly in the event of a failure of one of the virtual machines (VMs) running thereon. However, FT in virtualized computer systems has specific requirements that make hardware-based fault tolerance less convenient. First, a VM, rather than a physical machine, is the primary unit of management in a virtualized computer system. In particular, while some VMs running on a host computer might need FT, other VMs might not. Although it is not uncommon to aggregate many VMs on the same host computer, the number of VMs with FT requirements is generally small relative to the total number of VMs running on that host computer. Thus, it is inefficient to use customized hardware to provide FT when many of the VMs running on the host computer do not need FT.
Second, virtualized workloads are mobile. For example, techniques exist to migrate VMs across host computer systems. In such environments, individual host computer systems are treated as members of a larger resource pool. As such, the use of custom FT hardware is inconsistent with an individual host computer's being a member of a larger resource pool.
Third, in some virtualized computer systems, guest operating systems operate under an illusion of utilizing the same virtual hardware, regardless of the underlying physical hardware. This improves VM mobility, and speeds up hardware upgrade cycles since the VMs are generally unaware of upgrades to the physical hardware of the host computer system. Some conventional, hardware-based FT systems use modified kernel drivers to shield a guest operating system in a VM from detected hardware failures. However, in such systems, the guest operating system is generally aware of the special nature of FT hardware, even though this is inconsistent with a guest operating system's illusion of utilizing the same virtual hardware.
In addition to the above-identified issues with providing FT in virtualized computer systems, full host lockstep execution is becoming increasingly difficult to achieve as CPU speeds increase. Moreover, CPUs might be internally non-deterministic, and I/O based on the newer PCI Express interface may be less synchronous than I/O based on the older PCI interface, thereby making hardware-based FT less reliable. Further, custom hardware used for FT is more expensive than commodity hardware.