The Service Availability (SA) Forum has defined the Availability Management Framework (AMF) for managing the availability of services provided by a compliant system. The availability management of virtual machines (VMs) and the applications residing on them is an active topic, as availability is a key premise of cloud computing. The research community has proposed different solutions for different layers of the availability management architecture. These solutions usually overlap and, as a result, often interfere with each other when used together.
Virtualization solutions provide some support for availability. In this context, each physical host runs one or more virtual machines, which can be treated as logical nodes. The software that manages the VMs on a host is the virtual machine manager (VMM) or hypervisor. For availability management, VMMs run in a cluster. VMMs typically detect host and VM failures.
In general, there are two types of solutions to VM failures. Most often, a failed VM is restarted from its stored image on the same VMM. If the host fails, all of its VMs are restarted on a different VMM.
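The restart-based recovery described above can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the `Host` model, the function names, and the least-loaded placement rule are assumptions introduced here for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A physical host running one VMM (hypothetical model)."""
    name: str
    alive: bool = True
    vms: list = field(default_factory=list)


def plan_vm_failure(host, vm):
    """A failed VM is restarted from its stored image on the same VMM."""
    return [(vm, host.name)]


def plan_host_failure(cluster, failed):
    """All VMs of a failed host are restarted on a different VMM."""
    failed.alive = False
    survivors = [h for h in cluster if h.alive]
    if not survivors:
        raise RuntimeError("no surviving host in the cluster")
    actions = []
    for vm in list(failed.vms):
        # Assumed placement rule: pick the least loaded surviving host.
        target = min(survivors, key=lambda h: len(h.vms))
        failed.vms.remove(vm)
        target.vms.append(vm)
        actions.append((vm, target.name))
    return actions


cluster = [
    Host("host-a", vms=["vm1", "vm2"]),
    Host("host-b", vms=["vm3"]),
    Host("host-c"),
]
vm_actions = plan_vm_failure(cluster[0], "vm1")
host_actions = plan_host_failure(cluster, cluster[0])
```

In this sketch a VM failure yields a restart on the same host, while a host failure moves every VM of that host to a surviving member of the cluster.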
Some vendors provide another solution in which VMs run in tandem. The protected VM is replicated on a different host as a hot standby VM, which runs in parallel with, and is synchronized to, the protected primary VM using, for example, lockstep. In lockstep, the standby VM receives all the input the primary VM receives and executes everything the primary does, but its output is suppressed by the VMM. As a result, the standby can take over execution at any moment the primary VM fails, as long as the failure has an external cause.
One main advantage of these virtualization solutions is that they do not modify the applications that run on the virtual machines. However, these solutions are unaware of the applications, so they do not detect, and therefore do not react to, application failures. In addition, when replication is done at the VM level, there is no fault isolation between the primary and the standby VMs. As a result, application failures propagate to the standby VM: the failure occurs in the standby in exactly the same way as in the primary.
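The lockstep behavior and its lack of fault isolation can be illustrated with a small simulation. This is only a sketch under stated assumptions: `CounterApp`, `LockstepPair`, and the rule that a negative input crashes the application are hypothetical stand-ins for a deterministic application and its replicated VMs, not part of any real VMM.

```python
class CounterApp:
    """Deterministic toy application running inside a VM (hypothetical)."""

    def __init__(self):
        self.total = 0

    def handle(self, value):
        if value < 0:
            # Assumed deterministic application bug: negative input crashes.
            raise ValueError("application fault")
        self.total += value
        return self.total


class LockstepPair:
    """Primary and hot standby that execute the same inputs in lockstep."""

    ROLES = ("primary", "standby")

    def __init__(self):
        self.apps = {role: CounterApp() for role in self.ROLES}
        self.up = {role: True for role in self.ROLES}

    def deliver(self, value):
        results = {}
        for role in self.ROLES:
            if self.up[role]:
                try:
                    results[role] = self.apps[role].handle(value)
                except ValueError:
                    self.up[role] = False  # this replica crashed on the input
        # Only one output is emitted: the primary's while it is up,
        # otherwise the standby's. The other output is suppressed.
        for role in self.ROLES:
            if role in results:
                return results[role]
        return None  # both replicas failed, so the service is lost


# External failure: the standby has the same state and takes over.
pair = LockstepPair()
pair.deliver(5)
pair.deliver(7)
pair.up["primary"] = False  # e.g. the primary's host fails
after_takeover = pair.deliver(3)

# Application failure: the same input crashes both replicas.
faulty = LockstepPair()
faulty.deliver(5)
after_app_fault = faulty.deliver(-1)
```

Because both replicas execute identical inputs, the standby continues seamlessly after an external failure of the primary, but an input that triggers an application fault brings down both replicas in the same way.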
Furthermore, the virtualization layer (e.g., in a cloud computing environment) hides the underlying infrastructure from applications. As a result, availability solutions at the application level are disabled. Therefore, these virtualization solutions cannot decide on the proper distribution of redundant entities to protect application services against hardware failures.
There is a need for application-level availability management that provides the needed hardware redundancy without interfering with other layers that may react to the same failure.