Virtualization is a key technology in enterprise datacenters and cloud services. It provides flexibility and allows multiple virtual machines (VMs) to run on a single physical server, which increases hardware utilization. Server consolidation, however, carries the risk that a single hardware failure will impact more VMs, and therefore more applications and services. As a result, a primary consideration in the architecture of a virtual datacenter is how to maximize the availability of the services provided by the virtual machines. Availability solutions are designed to improve the resiliency of local systems or entire sites and fall broadly into the categories of downtime avoidance and fault recovery.
Fault recovery solutions include high availability. High availability (HA) is an automated failover solution, typically within a single datacenter, that responds to unplanned outages and restarts or migrates virtual machines as appropriate. For example, if the host computer running a virtual machine fails, HA may respond by restarting the virtual machine on another host computer. HA has become more important than ever, as the unavailability of services can cost a business up to millions of dollars per hour.
HA solutions provide for recovery in case of server (host) failure, guest (VM) operating system failure, VM application failure, and storage failure. In a virtualization environment, however, VMs also rely on physical network interface controller (PNIC) connectivity to communicate with VMs on other hosts and the external world. Although PNIC teaming technology provides redundancy of network connectivity and eliminates a single point of failure, a VM network may still fail due to failures of the backing PNIC(s) or switch ports, network cable disconnections, switch misconfigurations, power failures, etc. When such a failure occurs, the VM network is lost and clients cannot access the services running on the VMs, even though the VMs and the corresponding applications otherwise continue to run properly within the host computer.
A VM network may be created such that it shares the same PNICs with a management network. In this configuration, when a network failure causes VM network loss, the management network also fails, and the management network isolation response helps initiate a restart of the VMs on other healthy hosts. Configuring the VM and management networks to share the same PNICs, however, has the side effect of also restarting VMs in response to what would otherwise be only a management network isolation event. Restarting VMs in response to such an event causes unnecessary service downtime for customers.
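The side effect described above can be illustrated with a minimal sketch. It assumes a simplified model in which the isolation response is triggered solely by loss of the management network. The names HostNetworkState, isolation_response_restarts_vms, and classify are hypothetical and are introduced only for illustration; they do not correspond to any particular HA product's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostNetworkState:
    """Hypothetical snapshot of a host's network reachability."""
    management_reachable: bool
    vm_network_reachable: bool


def isolation_response_restarts_vms(state: HostNetworkState) -> bool:
    # Assumption: the isolation response looks only at the management network.
    return not state.management_reachable


def classify(state: HostNetworkState) -> str:
    """Compare what the isolation response does with what the VMs need."""
    restarts = isolation_response_restarts_vms(state)
    vm_network_lost = not state.vm_network_reachable
    if restarts and vm_network_lost:
        return "restart needed"       # clients could not reach the VMs
    if restarts:
        return "unnecessary restart"  # VMs were still reachable by clients
    if vm_network_lost:
        return "undetected VM network loss"
    return "no action"


# A failure of the shared PNICs takes down both networks.
print(classify(HostNetworkState(False, False)))
# Only the management network is isolated; the VMs are restarted anyway.
print(classify(HostNetworkState(False, True)))
```

In this model, the second case is the problematic one: the VMs remain reachable by clients, yet the isolation response still restarts them.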
Additionally, application-level HA solutions may be added to the applications running inside VMs to protect these applications from network failure. These solutions, however, are costly. In a virtualization environment, a network failure can impact a large number of VMs, and an application-level HA solution would need to be applied in each of the impacted VMs to provide protection. Furthermore, an application-level HA solution is specific to an application and an operating system. Protecting multiple VMs is therefore complicated by the need to account for various application and operating system types.