Distributed systems allow multiple clients in a network to access a pool of shared resources. For example, a distributed storage system allows a cluster of host computers to aggregate local disks (e.g., SSD, PCI-based flash storage, SATA, or SAS magnetic disks) located in or attached to each host computer to create a single, shared pool of storage. This pool of storage (sometimes referred to herein as a “datastore” or “store”) is accessible by all host computers in the cluster and may be presented as a single namespace of storage entities (such as a hierarchical file system namespace in the case of files, a flat namespace of unique identifiers in the case of objects, etc.). Storage clients, such as virtual machines spawned on the host computers, may in turn use the datastore, for example, to store virtual disks that the virtual machines access during their operation. Because the shared local disks that make up the datastore may have different performance characteristics (e.g., capacity, input/output operations per second (IOPS) capabilities, etc.), usage of such shared local disks to store virtual disks or portions thereof may be distributed among the virtual machines based on the needs of each given virtual machine.
This approach provides enterprises with cost-effective performance. For instance, distributed storage using pooled local disks is inexpensive, highly scalable, and relatively simple to manage. Because such distributed storage can use commodity disks in the cluster, enterprises do not need to invest in additional storage infrastructure. However, one issue that arises with a datastore that is backed by the local storage of a “cluster” of host computers (or “nodes”) is that one or more nodes can fail, become inactive, or otherwise go offline, such that the nodes are no longer visible to the other nodes in the cluster. For example, a cluster may subdivide into a number of “partitions” of sub-clusters because of failures or misconfigurations in the network that connects the nodes. Node failures are often indistinguishable from network partitions, making this partitioning effect difficult to diagnose and respond to. In what is referred to as a “split brain” situation, the nodes within a particular partition caused by such failures may continue to communicate normally with one another but fail to communicate with nodes in other partitions. In such a “split brain” situation, nodes in different partitions may execute operations that access or modify their local copy of data without coordination with nodes in other partitions, resulting in inconsistent or inaccurate data across the entire cluster. To avoid such issues, existing distributed storage systems allow at most one partition of the cluster to be functional at any point in time. However, this approach is conservative and may disallow access to certain data which, given the data's particular distribution and replication characteristics within the various nodes of the cluster, could be safely accessed within a particular partition.
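The contrast between the conservative cluster-wide policy and per-data accessibility may be illustrated with a minimal sketch. The sketch is not drawn from any particular system: the node names, the object names, the functions `majority_partition` and `accessible_objects`, and the rule that an object is safely accessible where a strict majority of its replicas is visible are all assumptions chosen for illustration.

```python
from typing import Dict, FrozenSet, List, Optional

Partition = FrozenSet[str]


def majority_partition(partitions: List[Partition], cluster_size: int) -> Optional[Partition]:
    """Conservative policy: only a partition holding a strict majority of
    ALL cluster nodes stays functional; every other partition is disabled."""
    for partition in partitions:
        if len(partition) * 2 > cluster_size:
            return partition
    return None  # no partition has a cluster-wide majority


def accessible_objects(partition: Partition, placement: Dict[str, FrozenSet[str]]) -> List[str]:
    """Assumed per-object rule: an object is accessible in a partition that
    contains a strict majority of that object's replicas. At most one
    partition can satisfy this for a given object, so two partitions can
    never both modify the same object."""
    return sorted(
        name
        for name, replicas in placement.items()
        if len(replicas & partition) * 2 > len(replicas)
    )


# Hypothetical six-node cluster split evenly by a network failure.
cluster = frozenset({"n1", "n2", "n3", "n4", "n5", "n6"})
partitions = [frozenset({"n1", "n2", "n3"}), frozenset({"n4", "n5", "n6"})]

# Hypothetical replica placement: object name -> nodes holding a replica.
placement = {
    "vdisk-a": frozenset({"n1", "n2", "n4"}),
    "vdisk-b": frozenset({"n4", "n5", "n6"}),
    "vdisk-c": frozenset({"n3", "n4"}),
}

print(majority_partition(partitions, len(cluster)))
for partition in partitions:
    print(sorted(partition), accessible_objects(partition, placement))
```

In this hypothetical even split, neither partition holds a majority of the six nodes, so the conservative policy leaves no partition functional. Under the assumed per-object rule, `vdisk-a` remains accessible in the first partition and `vdisk-b` in the second, while `vdisk-c`, whose two replicas are split across the partitions, is accessible in neither.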