A fundamental tradeoff in providing computer system fault tolerance is low fault-free execution overhead versus short fault-recovery latency. Current techniques for creating fault-tolerant computer systems balance consuming significant computing resources to provide faster recovery times against tolerating longer recovery times to conserve such resources. For example, while multiple instances of the same system may be run in parallel to provide robust fault tolerance, such redundancy increases the cost of failure-free operation in terms of hardware, processing overhead, memory bandwidth, power consumption and other computing resources. More passive checkpointing and replication techniques, such as starting up backup instances of the system only upon a failure of the primary instance, achieve lower total computing and/or hardware overhead but require longer recovery times and/or provide incomplete recovery that is visible to software in the system. Furthermore, in any of the foregoing techniques, implementing fault tolerance for a particular system requires costly and complex modifications to the hardware, operating system and/or applications to coordinate multiple instances of the system.
A virtual machine platform such as VMware Workstation 6 can provide fault tolerance without modifying the hardware, operating system and/or applications of a particular system running within a virtual machine. Assuming that the initial system states of a primary virtual machine and a backup virtual machine are identical, the virtual machine monitor layer of the primary virtual machine can capture the virtual machine's instruction stream in real time and transmit such instructions to the backup virtual machine, which “replays” the primary virtual machine's execution in a synchronized fashion. If the primary virtual machine fails, the backup virtual machine can take over. Fault tolerance implemented with multiple simultaneously running virtual machines thus allows the instances to be coordinated at a virtualization software layer without modifying the hardware, operating system or applications. However, failure-free operation remains expensive, since the primary virtual machine and the backup virtual machine both execute all instructions simultaneously, thereby consuming computing resources that cannot otherwise be utilized.
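The capture-and-replay arrangement described above can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not the mechanism of any actual virtualization product: the `ToyVM` class, its three-field instruction tuples, and the in-memory `channel` standing in for the link between the two machines are all hypothetical names introduced for illustration. It shows that a backup starting from the same initial state and replaying the same instructions reaches the same state as the primary, and that the two machines together execute every instruction twice.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a virtual machine: a few named registers."""
    registers: dict = field(default_factory=dict)
    executed: int = 0  # instructions run so far, a rough proxy for CPU cost

    def execute(self, instr):
        op, reg, value = instr
        if op == "set":
            self.registers[reg] = value
        elif op == "add":
            self.registers[reg] = self.registers.get(reg, 0) + value
        else:
            raise ValueError(f"unknown op: {op}")
        self.executed += 1


def run_primary(primary, program, channel):
    """Execute each instruction on the primary and forward it to the backup."""
    for instr in program:
        primary.execute(instr)
        channel.append(instr)  # stands in for transmission to the backup


def replay_on_backup(backup, channel):
    """Drain the channel, replaying the primary's instructions in order."""
    while channel:
        backup.execute(channel.popleft())


program = [("set", "r0", 5), ("add", "r0", 3), ("set", "r1", 7)]
primary, backup = ToyVM(), ToyVM()  # identical (empty) initial states
channel = deque()

run_primary(primary, program, channel)
replay_on_backup(backup, channel)

# The backup mirrors the primary and could take over on a failure,
# but the pair has executed each of the three instructions twice.
print(primary.registers == backup.registers, primary.executed + backup.executed)
```

In this sketch the combined instruction count is double the length of the program, which is the failure-free overhead noted above: the backup consumes as much processing as the primary even when no fault ever occurs.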