Cloud computing systems incorporate technologies to protect against software and hardware failures. For example, cloud providers seek to provide “high availability” of service to clients by minimizing the amount of downtime associated with a software or hardware failure. High-availability technologies are typically used to protect systems that can tolerate at least brief interruptions of service. Cloud providers also seek to provide systems that offer “fault tolerance” by eliminating the downtime associated with software and hardware failures. Fault tolerance is typically used to protect mission-critical systems, which cannot tolerate any interruption of service or loss of data.
Both software and hardware redundancies are often built into cloud computing systems to protect against hardware and software failures. For example, if a first virtual machine fails, traffic may be redirected from the failed first virtual machine to a redundant second virtual machine whose configuration is identical to that of the first. In some cases, the first and second virtual machines may have been sharing load before the failure. In other cases, the second virtual machine acts as a stand-by server that remains powered off until the failure of the first virtual machine is detected. In some examples, virtual machines are configured in clusters that operate on different physical nodes and/or at different physical locations. In this manner, a failure at one physical node or location can be handled by hardware and software at another physical node or location. In some cases, a software failure on a virtual machine is handled by attempting to restart the crashed virtual machine.
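The stand-by case described above can be sketched in a few lines of Python. This is a minimal illustration only, not any provider's implementation: the names VirtualMachine, FailoverRouter, route, and the failed flag are hypothetical, and failure detection is reduced to a simple health check made before each request.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """Hypothetical stand-in for a virtual machine that serves requests."""
    name: str
    powered_on: bool = True
    failed: bool = False
    handled: list = field(default_factory=list)

    def is_healthy(self) -> bool:
        return self.powered_on and not self.failed

    def handle(self, request: str) -> str:
        if not self.is_healthy():
            raise RuntimeError(f"{self.name} is unavailable")
        self.handled.append(request)
        return f"{self.name} handled {request}"


class FailoverRouter:
    """Sends traffic to the active VM and redirects it when that VM fails."""

    def __init__(self, primary: VirtualMachine, standby: VirtualMachine):
        self.active = primary
        self.standby = standby

    def route(self, request: str) -> str:
        # Failure detection is assumed to be a health check per request.
        if not self.active.is_healthy():
            self._fail_over()
        return self.active.handle(request)

    def _fail_over(self) -> None:
        # Power on the stand-by (it may have been off) and swap roles.
        # If the promoted VM is itself unhealthy, handle() will raise.
        self.standby.powered_on = True
        self.active, self.standby = self.standby, self.active
```

In this sketch the stand-by stays powered off until the router notices that the active virtual machine is unhealthy, so requests routed before the failure go to the first virtual machine and requests routed after it go to the second. A load-sharing variant would instead keep both machines powered on and distribute requests between them until one fails.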
One product that protects against hardware failure, VMware® Fault Tolerance, uses a technology called vLockstep to guarantee that the state of a primary virtual machine operating on a first physical host processing system is the same as the state of a stand-by virtual machine operating on a different, second physical host processing system. The system operates by causing the primary and stand-by virtual machines to execute identical sets of x86 instructions. Both the primary and stand-by virtual machines process read/write operations in response to system inputs, but the outputs of the stand-by virtual machine are suppressed so that only the output operations performed by the primary virtual machine take effect. When the primary virtual machine stops executing due to a hardware failure of the first host processing system, the outputs of the stand-by virtual machine on the second host processing system are no longer suppressed, and the stand-by virtual machine becomes the primary in a manner that is transparent to system users.
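The lockstep behavior described above can be pictured with a toy model. The following Python sketch is an assumption-laden simplification and does not reflect VMware's actual implementation: ReplicaVM, LockstepPair, submit, and fail_primary_host are hypothetical names, and "identical sets of x86 instructions" is reduced to applying the same deterministic key/value write to both replicas.

```python
class ReplicaVM:
    """Hypothetical replica that applies inputs and may suppress its outputs."""

    def __init__(self, name: str, suppress_output: bool):
        self.name = name
        self.suppress_output = suppress_output
        self.state = {}      # internal state, kept identical across replicas
        self.emitted = []    # outputs that actually took effect

    def execute(self, key: str, value: int) -> None:
        # Every replica performs the same deterministic state update.
        self.state[key] = value
        output = f"wrote {key}={value}"
        # Only a replica whose output is not suppressed makes it visible.
        if not self.suppress_output:
            self.emitted.append(output)


class LockstepPair:
    """Feeds identical inputs to a primary and a stand-by replica."""

    def __init__(self):
        self.primary = ReplicaVM("vm-a", suppress_output=False)
        self.standby = ReplicaVM("vm-b", suppress_output=True)

    def submit(self, key: str, value: int) -> None:
        for vm in (self.primary, self.standby):
            if vm is not None:
                vm.execute(key, value)

    def fail_primary_host(self) -> None:
        # Simulate a hardware failure of the primary's host: the stand-by
        # stops suppressing output and takes over as the new primary.
        if self.standby is None:
            raise RuntimeError("no stand-by replica available")
        self.standby.suppress_output = False
        self.primary, self.standby = self.standby, None
```

Because both replicas apply every input, their states match at the moment of failure, which is why the promoted replica can continue without the caller of submit noticing a change. The sketch leaves the pair without a stand-by after promotion; a real system would presumably need to create a new one to restore redundancy.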