Enterprises are inexorably demanding more and more flexibility in scaling and configuring their computing systems. Modern distributed computing systems have evolved to facilitate such flexibility by natively coordinating distributed compute, storage, networking, and/or other distributed resources in such a way that incremental scaling can be accomplished in any or all of the dimensions of computing capabilities, storage capabilities, and networking capabilities.
As an example of scaling computing resources, some clusters in an enterprise's distributed computing system might comprise many nodes that support numerous autonomous virtualized entities (e.g., virtual machines, executable containers, etc.) deployed so as to perform assigned parallelizable computing tasks. In many cases, a large number of virtualized entities (VEs) might be launched (e.g., in a swarm) across a large number of nodes, where each node performs some subset of the computing tasks. As an example of scaling storage capabilities, modern distributed storage resources comprise aggregations of physical storage devices (e.g., disks) that are organized into a logical storage area throughout which data may be distributed according to various metrics and/or partitioning. In particular, data may be distributed across many nodes and/or disks in the distributed computing system. In addition to the aforementioned distribution across nodes, the data might be replicated any number of times onto any number of disks so as to achieve high availability of the stored data (e.g., data consistency) even in the event of a node or disk failure. Increasing the number of times a data item is replicated tends to decrease the likelihood of data loss, so a naïve approach might be to add nodes or disks to support more replicas.
Unfortunately, increasing the number of nodes or disks in a cluster increases the likelihood of a disk failure in any given time period (i.e., since there are more nodes or disks that are susceptible to failure). As the number of nodes in a cluster increases, so does the number of disks in the cluster that store any particular dataset and its replicas. This increase in the number of nodes in a cluster in turn increases the probability that a failure at any one of those disks will occur before an earlier disk failure is remedied. As an example, with a replication factor (RF) of two (e.g., a dataset and one replica of the dataset), if a first disk storing a storage extent fails, and a second disk storing a replica of the storage extent also fails before the content of the first disk is rebuilt, the data comprising the storage extent will be lost. At the same time, one can observe that as the time to perform a rebuild becomes longer, such as due to storage I/O congestion when the data distribution is across fewer disks, the likelihood of a second failure increases merely due to the passage of time. Thus, on one hand, distribution of data to more disks might lead to reduced rebuild times (e.g., since there is less storage I/O congestion), whereas distribution to fewer disks might lead to a reduced likelihood of disk failure in a given time period (e.g., since disk drive failure rates are cumulative).
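The two opposing effects described above can be illustrated with a minimal numerical sketch. It is not drawn from any particular system and rests on simplifying assumptions: disks fail independently at a constant rate derived from an annualized failure rate, and rebuild throughput scales linearly with the number of disks sharing the work. The function names, parameters, and sample values (`rebuild_hours`, `p_second_failure`, 4 TB, 50 MB/s, a 2% annual failure rate) are hypothetical and chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365


def rebuild_hours(data_tb, peer_disks, per_disk_mb_s):
    """Hours needed to re-replicate data_tb of data.

    Assumption: the peer disks share the rebuild work evenly and in
    parallel, so throughput grows linearly with peer_disks.
    """
    total_mb = data_tb * 1_000_000  # decimal terabytes to megabytes
    seconds = total_mb / (peer_disks * per_disk_mb_s)
    return seconds / 3600


def p_second_failure(peer_disks, window_hours, annual_failure_rate):
    """Probability that at least one of peer_disks fails within window_hours.

    Assumption: independent disks with a constant failure rate, where
    annual_failure_rate is the chance that one disk fails within a year.
    """
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * peer_disks * window_hours)


if __name__ == "__main__":
    # Hypothetical values: 4 TB to rebuild, 50 MB/s per disk, 2% annual failure rate.
    for peers in (4, 16, 64):
        hours = rebuild_hours(4, peers, 50)
        risk_fixed_window = p_second_failure(peers, 10, 0.02)
        print(f"peers={peers:3d}  rebuild={hours:6.2f} h  "
              f"P(failure among peers in a fixed 10 h window)={risk_fixed_window:.6f}")
```

In this sketch, spreading the replicas over more peer disks shortens the rebuild, while for any fixed time window a larger set of peer disks is more likely to contain a failure. Likewise, for a fixed set of disks, a longer rebuild window raises the probability of a second failure.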
One legacy approach to addressing such data loss challenges is to place a limit on the number of nodes or disks in a cluster. However, this places a burden on the system administrators of large enterprises as their distributed systems scale (e.g., since the system administrators might then have to manage more and more clusters as separate entities). Further, as storage technologies continue to advance, the number of disks assigned to a given node also continues to increase, which increases the likelihood of disk drive data loss even with a constraint on the number of nodes in a given cluster.
Another legacy approach increases the replication factor to mitigate the increased probability of data loss in a growing cluster (e.g., due to use of more disks that could fail and cause a loss of data). This approach, however, is implemented at a substantial additional cost. For example, to increase the replication factor from “RF=2” to “RF=3”, approximately 50% more computing, storage, and networking resources would need to be deployed to manage the additional data replication. What is needed is a technological solution to relieve artificial constraints on the number of nodes in a cluster while still maintaining an acceptably low likelihood of data loss in a given time period.
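The approximately 50% figure for moving from RF=2 to RF=3 follows from simple arithmetic, sketched below for raw storage capacity only. The sketch assumes that every logical byte is stored exactly RF times and ignores metadata and other overhead; the function names and the 100 TB sample value are hypothetical.

```python
def raw_capacity_tb(logical_tb, replication_factor):
    """Raw capacity consumed when each logical byte is stored replication_factor times."""
    return logical_tb * replication_factor


def relative_increase(rf_old, rf_new):
    """Fractional growth in raw capacity when the replication factor changes."""
    return (rf_new - rf_old) / rf_old


if __name__ == "__main__":
    logical_tb = 100  # hypothetical amount of user data
    before = raw_capacity_tb(logical_tb, 2)
    after = raw_capacity_tb(logical_tb, 3)
    print(f"RF=2: {before} TB raw, RF=3: {after} TB raw, "
          f"increase: {relative_increase(2, 3):.0%}")
```

Computing and networking costs for the additional replica need not scale in exactly the same proportion; the sketch covers only the storage term.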
Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by their inclusion in this section.