A primary consideration for the architecture of a virtual datacenter is how to maximize the availability of the services provided by the virtual machines. Availability solutions are designed to improve the resiliency of local systems or entire sites and fall broadly into two categories: downtime avoidance and fault recovery. Fault recovery solutions include high availability and disaster recovery.
High availability (HA) is an automated failover solution, typically within a single datacenter, that responds to unplanned outages and restarts virtual machines as appropriate. For example, if the host computer running a virtual machine fails, HA may respond by restarting the virtual machine on another host computer.
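The HA behavior described above can be sketched in a few lines. This is a minimal illustration, not any product's actual algorithm: the `Host` type, the `restart_failed_vms` function, and the least-loaded placement policy are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host record: a name, a health flag, and the VMs it runs."""
    name: str
    healthy: bool = True
    vms: list[str] = field(default_factory=list)


def restart_failed_vms(hosts: list[Host]) -> dict[str, str]:
    """Move VMs off failed hosts and return a {vm: new_host_name} map."""
    placements: dict[str, str] = {}
    survivors = [h for h in hosts if h.healthy]
    for failed in (h for h in hosts if not h.healthy):
        while failed.vms:
            if not survivors:
                raise RuntimeError("no healthy host available for restart")
            # Assumed policy: restart on the survivor running the fewest VMs.
            target = min(survivors, key=lambda h: len(h.vms))
            vm = failed.vms.pop(0)
            target.vms.append(vm)
            placements[vm] = target.name
    return placements
```

A real HA solution would also detect the failure (for example through heartbeats) and check resource capacity before placement; both are omitted here for brevity.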
Disaster recovery is a process for recovering all or a portion of a datacenter at a recovery site from replicated data. For example, a logical storage device within a protected datacenter site may be configured for active-passive replication to a recovery datacenter site. The protected logical storage device is active, e.g., configured to be available for read and write commands. The recovery logical storage device is passive, e.g., configured not to be available for read and write commands, to prevent corruption of the replicated data. A disaster recovery tool may initiate recovery of all or a portion of the data replicated from the protected datacenter by making the recovery logical storage device active and then registering all the virtual machines stored on it at the recovery datacenter.
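The active-passive arrangement and the recovery step can be modeled as follows. This is a simplified sketch under stated assumptions: `Mode`, `StorageDevice`, `replicate`, and `fail_over` are hypothetical names, and replication is reduced to copying a list of virtual machine names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"    # available for read and write commands
    PASSIVE = "passive"  # not available for read and write commands


@dataclass
class StorageDevice:
    """Hypothetical logical storage device holding virtual machine names."""
    site: str
    mode: Mode
    vm_files: list[str] = field(default_factory=list)

    def store_vm(self, vm_name: str) -> None:
        # A passive device rejects writes so the replica cannot be corrupted.
        if self.mode is not Mode.ACTIVE:
            raise PermissionError(f"device at {self.site} is passive")
        self.vm_files.append(vm_name)


def replicate(protected: StorageDevice, recovery: StorageDevice) -> None:
    """Copy the protected contents to the replica, bypassing the write path."""
    recovery.vm_files = list(protected.vm_files)


def fail_over(recovery: StorageDevice, inventory: set[str]) -> None:
    """Make the recovery device active, then register the VMs stored on it."""
    recovery.mode = Mode.ACTIVE
    inventory.update(recovery.vm_files)
```

The ordering in `fail_over` mirrors the text: the recovery device becomes active first, and only then are its virtual machines registered at the recovery site.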
A new class of storage system products called “stretched storage” has emerged in the storage industry. These systems expose storage that can be stretched across two datacenters, enabling zero-downtime failover of virtual machines across sites. For example, a stretched storage device is presented both to the protected site and to the recovery site as, effectively, the same device, enabling live migration across sites. Stretched storage devices add benefits to site-level availability and downtime avoidance, but they introduce considerable complexity at the network and storage layers and demand rigorous operational management and change control. For example, stretched storage is configured to be active on both the protected site and the recovery site for read/write operations. Storage device writes are usually committed synchronously at both locations to ensure that data is consistent. If the stretched logical storage devices at both sites were to remain active despite a loss of communication between them, however, each would treat the replicated data as active and would allow conflicting modifications to the data. Such a scenario is referred to as a “split-brain” scenario. To avoid the split-brain scenario, a site preference is configured for each stretched logical storage device to elect a datacenter in which the stretched logical storage device is to be active in the event of a loss of connectivity to the other datacenter. Consequently, the stretched logical storage device in the other datacenter is configured to become passive.
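The site-preference rule can be illustrated with a short sketch. The names `StretchedDevice`, `on_link_loss`, and `can_write` are assumptions made for illustration and do not correspond to any vendor's interface.

```python
from dataclasses import dataclass


@dataclass
class StretchedDevice:
    """Hypothetical stretched logical storage device spanning two sites."""
    preferred_site: str     # site elected to stay active on loss of connectivity
    active_sites: set[str]  # sites currently accepting reads and writes


def on_link_loss(device: StretchedDevice) -> None:
    """Apply the site preference: only the preferred site remains active."""
    device.active_sites = {device.preferred_site}


def can_write(device: StretchedDevice, site: str) -> bool:
    """Return True if the device accepts writes at the given site."""
    return site in device.active_sites
```

Because exactly one site remains active after connectivity is lost, the two copies cannot accept conflicting modifications, which is the split-brain condition the preference is meant to prevent.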