Virtual machine high availability (referred to herein simply as “high availability,” or HA) and hypervisor-converged object-based (HC/OB) storage are two emerging technologies in the field of computer virtualization. HA is designed to minimize virtual machine (VM) downtime by monitoring the availability of host systems and VMs in a host cluster. If an outage, such as a host or network failure, causes one or more VMs to stop executing, HA detects the outage and automatically restarts the affected VMs on active host systems in the cluster. In this way, HA ensures that guest applications running within the VMs continue to remain operational throughout the outage. One exemplary HA implementation is described in commonly-assigned U.S. Patent Application Publication No. 2012/0278801, published Nov. 1, 2012, entitled “Maintaining High Availability of a Group of Virtual Machines Using Heartbeat Messages.”
HC/OB storage is a distributed, software-based storage technology that leverages the local or direct attached storage resources (e.g., solid state disks, spinning hard disks, etc.) of host systems in a host cluster by aggregating these locally-attached resources into a single, logical storage pool. Thus, this technology effectively re-purposes the host cluster to also act as a distributed storage cluster. A hypervisor-based storage system layer (referred to herein generically as a “VSAN layer” comprising “VSAN modules”) manages the logical storage pool and enables interactions between the logical storage pool and storage clients, such as VMs running on host systems in the cluster. For example, the VSAN layer allows the VMs to access the logical storage pool during VM runtime in order to store and retrieve persistent VM data (e.g., virtual disk data).
The qualifier “object-based” in “hypervisor-converged object-based storage” refers to the manner in which VMs are maintained within HC/OB storage—in particular, the state of each VM is organized as a hierarchical collection of distinct storage objects (or simply “objects”). For example, the files that hold the metadata/configuration of a VM may reside in a file system that is created within a namespace object (also known as a “file system object”), the virtual disks of the VM may reside in virtual disk objects, and so on. Each of these storage objects may be composed of multiple component objects. The VSAN layer provisions, manages, and monitors each of these storage objects individually. For instance, in order to meet a particular storage policy for a particular virtual disk VMDK1, the VSAN layer may determine that the component storage objects that make up the virtual disk object corresponding to VMDK1 should be striped across the locally-attached storage of three different host systems. Through these and other mechanisms, HC/OB storage can provide improved ease of management, scalability, and resource utilization over traditional storage solutions. One exemplary implementation of an HC/OB storage system is described in commonly-assigned U.S. patent application Ser. No. 14/010,293, filed Aug. 26, 2013, entitled “Scalable Distributed Storage Architecture.”
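The object hierarchy described above can be pictured with a small data model. The following is only an illustrative sketch: the class names (`Component`, `StorageObject`, `VMState`), the `stripe` helper, and the host names are hypothetical and are not drawn from any actual VSAN implementation.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Component:
    name: str
    host: str  # host whose locally-attached storage holds this component


@dataclass
class StorageObject:
    name: str
    kind: str  # e.g., "namespace" or "vdisk" (hypothetical labels)
    components: List[Component] = field(default_factory=list)

    def hosts(self) -> Set[str]:
        """Return the distinct hosts that hold a component of this object."""
        return {c.host for c in self.components}


@dataclass
class VMState:
    vm_name: str
    objects: List[StorageObject] = field(default_factory=list)


def stripe(obj_name: str, kind: str, hosts: List[str]) -> StorageObject:
    """Create one component per listed host (a simplified stand-in for striping)."""
    comps = [Component(f"{obj_name}-c{i}", h) for i, h in enumerate(hosts)]
    return StorageObject(obj_name, kind, comps)


# A VM whose namespace object lives on one host and whose virtual disk
# object VMDK1 is striped across three different hosts.
vm = VMState("vm1", [
    stripe("vm1-namespace", "namespace", ["hostA"]),
    stripe("VMDK1", "vdisk", ["hostA", "hostB", "hostC"]),
])
vmdk1_hosts = vm.objects[1].hosts()
```

In this sketch each storage object is tracked on its own, so a question such as "which hosts hold VMDK1?" is answered per object rather than per container.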
Unlike in non-object-based storage systems, the state of a VM in HC/OB storage is not contained within a larger, coarse storage container (e.g., a LUN). Having such storage containers provides a couple of benefits. First, a coarse storage container provides a convenient location to store information common to all VMs that use the container. For example, it is possible to create a file system on top of a LUN, create a directory within the file system for each VM whose state is stored on the underlying storage device(s), and then create a directory at the root to store shared information. Second, for a given class of failures, one can reason about the availability/accessibility of all of the VM data stored within a storage container by reasoning about the availability/accessibility of the container itself. For instance, one can determine whether a network failure impacts the accessibility of the VM data by determining if the container is accessible. As a result, there is no need to track the accessibility of each individual VM stored in a single storage container; instead, it is sufficient to track the accessibility of the container itself.
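The difference in bookkeeping can be sketched as follows. This is a simplified illustration under assumed inputs: the function names and the dictionaries mapping VMs to containers or objects are hypothetical.

```python
from typing import Dict, List, Set


def accessible_vms_coarse(container_ok: Dict[str, bool],
                          vm_container: Dict[str, str]) -> Set[str]:
    """Container-based storage: a VM is accessible if its container is."""
    return {vm for vm, c in vm_container.items() if container_ok.get(c, False)}


def accessible_vms_per_object(object_ok: Dict[str, bool],
                              vm_objects: Dict[str, List[str]]) -> Set[str]:
    """Object-based storage: every object of the VM must be checked."""
    return {vm for vm, objs in vm_objects.items()
            if all(object_ok.get(o, False) for o in objs)}


# One accessibility fact about LUN1 covers both VMs.
coarse = accessible_vms_coarse(
    {"LUN1": True},
    {"vm1": "LUN1", "vm2": "LUN1"},
)

# Without a container, each object must be tracked separately.
fine = accessible_vms_per_object(
    {"vm1-ns": True, "vm1-disk": False, "vm2-ns": True, "vm2-disk": True},
    {"vm1": ["vm1-ns", "vm1-disk"], "vm2": ["vm2-ns", "vm2-disk"]},
)
```

Here a single entry for `LUN1` determines the result for both VMs, whereas in the per-object case `vm1` is excluded because one of its objects is inaccessible.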
The lack of coarse storage containers raises unique challenges when attempting to use HC/OB storage and HA concurrently in the same virtualized compute environment. As one example, existing HA implementations typically maintain information known as “HA protection state” that identifies the VMs in a host cluster that should be failed-over/restarted in the event of a failure. The “master” HA module in the cluster (i.e., the HA module that is responsible for detecting failures and orchestrating VM failovers/restarts) manages this HA protection state by persisting it to a centralized file (or set of files) on the storage tier. If there is an outage that affects a subset of host systems in the cluster, one or more new master HA modules may be elected. Each newly elected master HA module may then retrieve the file from the storage tier to determine which VMs are HA protected. This approach works well if the storage tier is implemented using dedicated shared storage, since the HA protection file can be placed in the storage container storing the configurations for the protected VMs. On the other hand, if the storage tier is implemented using HC/OB storage, there is no convenient location to store such information that is shared across VMs.
As another example, in existing HA implementations, when a master HA module detects a failure that requires one or more VMs to be failed-over/restarted, the master HA module executes a conventional failover workflow that involves (1) identifying active host systems that can meet the VMs' resource needs and on which the VMs can therefore be placed, and (2) initiating VM restarts on the identified host systems. If the VMs are stored on dedicated shared storage, these two steps are generally sufficient for successfully completing the failover. However, if the VMs are stored on HC/OB storage, there may be cases where a VM cannot be restarted because one or more of its storage objects are not yet accessible to the host system on which the master HA module is executing (and/or to the host system on which the restart is being attempted). This situation cannot be uncovered using conventional coarse-grained storage accessibility checks. This, in turn, can cause the conventional failover workflow to break down, or result in multiple continuous restart attempts, which can increase the load on the affected host systems.
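The failure mode described above can be illustrated with a toy model of the two-step workflow. This sketch is not an actual HA implementation: the function names, the first-fit placement rule, the memory-only resource model, and the retry limit are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def pick_host(need_mb: int, free_mb: Dict[str, int]) -> Optional[str]:
    """Step (1): pick the first active host (by name) with enough free memory."""
    for host in sorted(free_mb):
        if free_mb[host] >= need_mb:
            return host
    return None


def try_restart(vm_objects: List[str], host: str,
                visible: Dict[str, Set[str]]) -> bool:
    """Step (2): the restart succeeds only if the host can reach every object."""
    return all(obj in visible.get(host, set()) for obj in vm_objects)


def conventional_failover(vm_objects: List[str], need_mb: int,
                          free_mb: Dict[str, int],
                          visible: Dict[str, Set[str]],
                          max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Place by resources only, then retry the restart on the chosen host."""
    host = pick_host(need_mb, free_mb)
    if host is None:
        return None, 0
    for attempt in range(1, max_attempts + 1):
        if try_restart(vm_objects, host, visible):
            return host, attempt
    return None, max_attempts


# hostB has enough memory but cannot yet reach the virtual disk object.
free_mb = {"hostB": 4096, "hostC": 4096}
visible = {"hostB": {"vm1-namespace"},
           "hostC": {"vm1-namespace", "VMDK1"}}
result = conventional_failover(["vm1-namespace", "VMDK1"], 2048, free_mb, visible)
```

Because placement in this model considers only resource needs, `hostB` is chosen even though it cannot reach `VMDK1`, and every restart attempt on it fails. A placement step that ignores per-object accessibility has no way to prefer `hostC`.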
As yet another example, there are certain types of network partitions that can further complicate the HA protection state persistence and VM failover/restart workflows noted above. For instance, if there is a failure that causes the VSAN modules to observe a partition while the HA modules do not, there may be instances where the host system on which the master HA module is running does not have access/visibility to a particular VM (and thus cannot update/retrieve HA protection state information for the VM, or determine its accessibility for failover purposes), while the host systems of other, slave HA modules do have such access/visibility.
Accordingly, it would be desirable to have techniques for integrating HA with distributed object-based storage systems like HC/OB storage that overcome these, and other similar, issues.