A primary consideration for the architecture of a virtual datacenter is how best to maximize the availability of the services provided by the virtual machines. Availability solutions are designed to improve the resiliency of local systems or entire sites and fall broadly into the categories of downtime avoidance and fault recovery. Fault recovery solutions include high availability and disaster recovery. High availability (HA) is an automated failover solution, typically within a single datacenter, that responds to unplanned outages and restarts virtual machines as appropriate. For example, if a virtual machine fails on one host device, HA may respond by restarting the virtual machine on another host device. Disaster recovery is a manual process for recovering all or a portion of a datacenter at a recovery site from replicated data. For example, a disaster recovery tool alerts an administrator to a possible site failure. The administrator may then provide input to the disaster recovery tool to initiate recovery of all or a portion of the inventory of virtual machines within the protected datacenter.
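The HA behavior described above, in which a virtual machine from a failed host device is restarted on another host device, can be illustrated with a minimal sketch. The `Host` class, its `capacity` field, and the "most free slots" placement rule are hypothetical and chosen only for illustration; they do not represent any particular HA product.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    capacity: int                      # hypothetical: max VMs this host can run
    healthy: bool = True
    vms: list[str] = field(default_factory=list)


def fail_over(hosts: list[Host]) -> dict[str, str]:
    """Restart VMs from failed hosts on healthy hosts.

    Returns a mapping of each restarted VM to its new host name.
    """
    placements: dict[str, str] = {}
    for failed in [h for h in hosts if not h.healthy]:
        for vm in list(failed.vms):
            # Assumed placement rule: the healthy host with the most free slots.
            candidates = [h for h in hosts
                          if h.healthy and len(h.vms) < h.capacity]
            if not candidates:
                break  # no capacity left; remaining VMs stay down
            target = max(candidates, key=lambda h: h.capacity - len(h.vms))
            failed.vms.remove(vm)
            target.vms.append(vm)
            placements[vm] = target.name
    return placements
```

Note that this sketch restarts virtual machines in whatever order they are listed on the failed host and only within one inventory of hosts, which mirrors the single-datacenter scope of HA described above.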
Recently, HA has been applied to clusters of devices that span datacenter sites. These “stretched clusters” offer the ability to balance workloads between two datacenters, enabling migration of services between geographically close sites without sustaining an outage. Stretched clusters add benefits to site-level availability and downtime avoidance, but they introduce considerable complexity at the network and storage layers and demand rigorous operational management and change control. A cluster depends upon a single (logical) storage subsystem and a single virtualization management server. As a result, the stretched cluster does not provide fault tolerance for the virtualization management server. A stretched cluster expands upon the functionality of a cluster by enabling devices within multiple locations to be a part of a single cluster. For example, disk writes are committed synchronously at both locations to ensure that data is consistent, regardless of the location from which it is being read. The stretched cluster replication model, however, does not support asynchronous replication and requires significant bandwidth and very low latency between the sites involved in the cluster. As a result, stretched cluster sites are kept within a limited geographic range, e.g., within 100 kilometers or 5 milliseconds round-trip time latency.
Additionally, should a major portion of the virtual environment fail, current implementations of HA are not designed for complex disaster recovery scenarios in which virtual machines must start in a particular sequence. For example, critical virtual machines may need to start prior to other systems that are dependent on those virtual machines. Current implementations of HA are unable to control this start order, handle alternate workflows, or handle different scenarios for failure. Current implementations of HA also do not provide geographically distant multisite recovery.
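The start-order requirement described above, in which critical virtual machines start before the systems that depend on them, amounts to ordering a dependency graph. The following is a minimal sketch using the Python standard library's `graphlib` module; the function name `start_waves` and the virtual machine names in the usage note are hypothetical and serve only as illustration.

```python
from graphlib import TopologicalSorter


def start_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group VMs into waves; each wave starts only after all earlier waves.

    `depends_on` maps a VM name to the set of VMs that must be running first.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()                   # validates the graph (detects cycles)
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorter.get_ready()     # VMs whose dependencies are all started
        waves.append(sorted(ready))    # sorted only to make output stable
        sorter.done(*ready)            # mark this wave as started
    return waves
```

As a usage illustration, calling `start_waves({"app": {"db", "dns"}, "db": {"dns"}, "web": {"app"}})` places `dns` in the first wave, then `db`, then `app`, and finally `web`. Virtual machines within a single wave have no dependencies on one another and could be started in parallel.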
Although disaster recovery tools enable complex recovery scenarios and provide site and virtualization management server fault tolerance, current implementations of HA restrict the ability to use disaster recovery tools, because HA is dependent upon a single virtualization management server whereas disaster recovery tools are dependent upon multiple virtualization management servers.