Virtual machine high availability, also known as “HA,” is a technology that minimizes unplanned virtual machine (VM) downtime by monitoring for and detecting failures that bring down VMs and orchestrating recovery of those VMs in response to the failures. An exemplary HA design is described in commonly-assigned U.S. Pat. No. 8,924,967, issued Dec. 30, 2014, entitled “Maintaining High Availability of a Group of Virtual Machines Using Heartbeat Messages.”
One limitation with current HA designs (collectively referred to herein as “traditional HA”) is that they are generally constrained to operating within the context of a single “cluster,” where a cluster is a user-defined group of host systems that are managed by a common instance of a virtual infrastructure management server, or “VIMS.” There are two reasons for this limitation. First, in traditional HA, an HA agent is installed on each host system that is part of an HA-enabled cluster, and these agents collaborate with each other to perform failure monitoring, detection, and VM recovery (i.e., failover) entirely within the confines of the cluster. There is no structured way for the HA agents in one cluster to communicate or collaborate with HA agents in a different cluster, regardless of whether the clusters are managed by the same VIMS instance.
Second, traditional HA is generally reliant on shared storage; in other words, it requires that all host systems in an HA fault domain have access to the same storage devices and/or logical storage volumes (e.g., datastores) for retrieving VM files. This requirement arises out of the fact that traditional HA only relocates VMs between different host systems in the case of a failure; traditional HA does not move or replicate the VMs' persistent data between different storage devices/datastores. Accordingly, in a scenario where a VM is failed over from a first host system H1 to a second host system H2, H2 needs to have access to the same storage as H1 in order to read and write the VM's files. This need for shared storage is usually not an issue within a single cluster, but can be problematic in multi-cluster deployments because such deployments typically assign different storage to each cluster for performance and/or other reasons.
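The two constraints described above can be illustrated with a minimal sketch. It assumes a simplified model in which each host belongs to one cluster and can reach a fixed set of datastores; the names `Host`, `VM`, and `failover_candidates` are hypothetical and are not drawn from any actual HA implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    cluster: str
    datastores: frozenset  # datastores this host can read and write


@dataclass(frozen=True)
class VM:
    name: str
    datastore: str  # datastore that holds the VM's files


def failover_candidates(vm, failed_host, hosts):
    """Return the names of hosts that could restart `vm` under the
    assumed traditional-HA constraints (hypothetical helper)."""
    return [
        h.name
        for h in hosts
        if h.name != failed_host.name
        # Constraint 1: HA agents only collaborate within one cluster.
        and h.cluster == failed_host.cluster
        # Constraint 2: VM files are not moved, so the target host
        # must already have access to the VM's datastore.
        and vm.datastore in h.datastores
    ]


h1 = Host("H1", "cluster-A", frozenset({"ds-A"}))
h2 = Host("H2", "cluster-A", frozenset({"ds-A"}))
h3 = Host("H3", "cluster-B", frozenset({"ds-B"}))
vm = VM("vm-1", "ds-A")

same_cluster = failover_candidates(vm, h1, [h1, h2, h3])
cross_cluster_only = failover_candidates(vm, h1, [h1, h3])
```

In this sketch, H2 qualifies as a failover target because it shares both the cluster and the datastore with H1, whereas H3 is excluded on both counts. If H2 is unavailable, the candidate list is empty even though H3 is healthy.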
The foregoing means that traditional HA cannot be used to orchestrate VM recovery across different clusters that make use of non-shared storage. There are certain existing technologies, such as datacenter disaster recovery solutions, that are capable of failing over VMs from one cluster managed by one VIMS instance (at, e.g., a first datacenter) to another cluster managed by another VIMS instance (at, e.g., a second datacenter). However, these disaster recovery solutions are specifically designed for deployments with multiple VIMS instances, and thus do not address the need to enable cross-cluster VM recovery within a single VIMS instance.
This gap in functionality is a pain point for organizations that deploy only one VIMS instance, but wish to fail over VMs across clusters that may use non-shared storage in that one instance. For example, some organizations do not want the additional complexity and costs of deploying a second VIMS instance. This gap is also problematic for organizations that have multiple VIMS instances at geographically distant locations, and prefer that an attempt be made to restart a failed VM locally (i.e., within or across clusters in a VIMS instance at a single geographic location) if possible.