Modern distributed computing systems have evolved to include combinations of hardware and software so as to dynamically coordinate configurations of computing hardware, storage devices, networking hardware, and/or other distributed resources in such a way that incremental scaling can be accomplished in many dimensions. For example, a set of clusters in a distributed computing system might deploy hundreds of computing nodes (or more), any of which can support several thousand (or more) virtualized entities (VEs) such as virtual machines (VMs), containers, etc. that are individually tasked to perform one or more of a broad range of computing workloads. In many cases, several thousand VEs might be launched (e.g., in a swarm) to perform some set of tasks, then finish and collate their results, then self-terminate. As such, the working data, configuration (e.g., topology, resource distribution, etc.), and/or other characteristics of the distributed computing system can be highly dynamic as the workload fluctuates.
A system administrator might add or remove nodes in a given cluster to scale or balance the resource capacity of the cluster. For example, scaling or balancing actions might include actions taken by a technician to physically install a hardware unit that comprises multiple nodes. The topology of such a cluster might describe a hardware partitioning such as a rack that can hold multiple (e.g., 42) such hardware units. The system administrator may also modify the physical and/or logical arrangement of the nodes based on then-current or forecasted resource usage. Such ongoing changes to the node topology raise certain events within the distributed computing system. Such events may in turn raise further events that invoke processes that re-evaluate the logical arrangement of the nodes and other distributed resources such as storage devices and/or the data or metadata stored on the storage devices.
In clustered computing environments such as heretofore described, distributed storage resources comprise aggregated physical storage facilities that form a logical storage pool throughout which data may be efficiently distributed according to various metrics and/or objectives. Metadata describing the storage pool and/or its data may be replicated any number of times across various hardware of the distributed computing system.
Users of these distributed systems have a data consistency expectation (e.g., “strictly consistent”) that the computing platform provide consistent and predictable storage behavior (e.g., availability, accuracy, etc.) for metadata and corresponding underlying data. Accordingly, distributed computing platform providers can address such expectations by implementing data replication such that at least one copy of any stored item survives even in the event of certain hardware failures. For example, a given data replication policy might indicate that two replica copies of certain subject data (e.g., metadata, user data, etc.) may be distributed across available hardware in the cluster.
In some computing clusters, the hardware for managing the distributed data is mapped into a logical replication configuration (e.g., ring configuration). Determining which of many possible replication configurations would comply with a given data replication policy requirement to avoid a total loss of any particular item of data, while at the same time observing physical hardware partitioning constraints and the separation (e.g., skew) between hardware partitions, can present challenges. Specifically, certain challenges arise when enumerating and/or evaluating replication configurations that satisfy the replication policies and are, at the same time, fault tolerant with respect to the hardware partitioning.
More specifically, in certain replication configurations, multiple copies of data are stored in several different locations (e.g., in or at storage devices of computing nodes). Given multiple copies that are stored at different locations, if a location becomes inaccessible (e.g., a computing node fails or its storage device fails), then the stored data at a different location can be accessed, and thus the data is not lost entirely. Due to the nature of computing equipment, when one location fails for a particular reason (e.g., a motherboard hosting multiple computing nodes fails), other locations (e.g., the multiple computing nodes that are hosted on the same motherboard) often fail for the same reason. A boundary around a set of certain hardware elements (e.g., nodes, motherboards, racks, etc.) constitutes an availability domain. One way to mitigate the possible loss of all copies of subject data is to configure the multiple locations into different availability domains, and to store the copies across those locations such that all of the multiple locations are unlikely to be lost by reason of a single availability domain failure.
A particular selection of locations constitutes a replication configuration. A replication configuration in which all locations/occurrences of the subject data are lost upon failure of a single availability domain (e.g., failure of a single hardware block) is considered to be “availability domain unaware” or “hardware block unaware”. By contrast, the terms “availability domain fault tolerant”, “availability domain aware”, and “hardware block aware” all refer to a replication configuration that retains at least one occurrence of the subject data even after a failure of any single availability domain.
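The distinction between an availability domain unaware configuration and an availability domain aware configuration can be illustrated with a minimal sketch. The function name, the node names, and the mapping `domain_of` (node to availability domain) are hypothetical and are introduced only for illustration; the sketch simply simulates the failure of each availability domain in turn and checks whether any replica location survives.

```python
from typing import Dict, Iterable


def survives_any_single_domain_failure(
    replica_nodes: Iterable[str], domain_of: Dict[str, str]
) -> bool:
    """Return True if at least one replica location remains after the
    failure of any single availability domain (hypothetical helper)."""
    nodes = list(replica_nodes)
    if not nodes:
        # No replicas at all: nothing can survive.
        return False
    # Simulate the failure of each domain that hosts at least one replica.
    for failed_domain in {domain_of[n] for n in nodes}:
        survivors = [n for n in nodes if domain_of[n] != failed_domain]
        if not survivors:
            # Every replica was inside the failed domain.
            return False
    return True


# Assumed example topology: two hardware blocks, two nodes each.
example_domain_of = {"n1": "blockA", "n2": "blockA", "n3": "blockB", "n4": "blockB"}

unaware = survives_any_single_domain_failure(["n1", "n2"], example_domain_of)
aware = survives_any_single_domain_failure(["n1", "n3"], example_domain_of)
```

Under these assumptions, placing both copies on nodes of the same hardware block is hardware block unaware, whereas placing the copies on nodes of different hardware blocks is hardware block aware.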
Some replication configuration selection techniques might select the locations for distribution of subject data and its replication copies without considering the topology of availability domains with respect to the hardware topology. In such cases, various availability domain failures can result in complete loss of all of the subject data. To avoid a complete loss of all of the subject data, certain techniques add more availability domains to a cluster (e.g., more hardware appliances, more nodes, more racks, more sites, more data centers, etc.) so as to reduce the likelihood of loss of all occurrences of the data. However, this often imposes a significant (and possibly unnecessary) implementation expense.
Some techniques seek to reduce the likelihood of loss of all data by storing replicas in a ring topology, where data stored at one node of a ring is also stored at one or more neighboring nodes. This complicates determination of the desired availability domain aware configurations. For example, in a ring topology where copies of data are stored in neighboring nodes of the ring (e.g., either a clockwise neighbor or a counter-clockwise neighbor), maintenance of a domain aware configuration becomes more complicated as hardware elements are added (e.g., due to addition of new hardware in a manner that expands the size of the ring) and/or when hardware elements are removed (e.g., due to failure or decommissioning of a hardware element that contracts the size of the ring). Managing availability domain awareness under conditions of ring expansion or ring contraction becomes even more complicated when additional constraints such as replication factor requirements and/or optimization objectives such as load balancing are considered. Specifically, when apportioning data to hardware elements of a ring, the data could be apportioned to a particular hardware element that is a neighbor in the ring when traversing in a clockwise direction around the ring, or it could be apportioned to an alternate hardware element by traversing in a counter-clockwise direction around the ring. Thus, there is a need to choose from among these alternatives.
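The neighbor-based placement described above can be sketched as follows. This is an illustrative sketch only, under stated assumptions: the ring is modeled as an ordered list of node names, the replication factor `rf` is assumed to count the total number of copies including the copy on the starting node, and the names `replica_set`, `ring_is_domain_aware`, and `domain_of` are hypothetical.

```python
from typing import Dict, List


def replica_set(ring: List[str], start: int, rf: int, clockwise: bool = True) -> List[str]:
    """Nodes holding the rf copies of data whose first copy is at ring[start].
    Assumption: rf counts all copies, including the one on the starting node."""
    step = 1 if clockwise else -1
    return [ring[(start + step * k) % len(ring)] for k in range(rf)]


def ring_is_domain_aware(ring: List[str], domain_of: Dict[str, str], rf: int) -> bool:
    """True if every clockwise replica set spans at least two availability
    domains, so that no single domain failure removes all copies."""
    for start in range(len(ring)):
        domains = {domain_of[n] for n in replica_set(ring, start, rf)}
        if len(domains) < 2:
            return False
    return True


# Assumed topology: two availability domains (A and B), two nodes each.
domain_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

grouped_ring = ["a1", "a2", "b1", "b2"]      # same-domain nodes are neighbors
interleaved_ring = ["a1", "b1", "a2", "b2"]  # domains alternate around the ring

grouped_ok = ring_is_domain_aware(grouped_ring, domain_of, rf=2)
interleaved_ok = ring_is_domain_aware(interleaved_ring, domain_of, rf=2)
```

In this sketch the same four nodes yield a domain unaware ring when nodes of the same domain are adjacent and a domain aware ring when the domains alternate, which suggests why the position at which a node is inserted into, or removed from, the ring affects availability domain awareness. The `clockwise` flag merely shows that the two traversal directions can select different neighbors for the same starting node.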
Unfortunately, all of the aforementioned techniques fail to consider load balancing and/or other costs or effects of data reapportionment when choosing from among such alternatives. This failure to consider the costs or load-balance effects of reapportionment leads to deployment of sub-optimal ring configurations. What is needed are techniques that avoid deployment of sub-optimal ring configurations that incur avoidable costs.
Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by their inclusion in this section.