Modern distributed computing systems have evolved to coordinate the deployment and use of computing resources, storage resources, networking resources, and/or other resources in such a way that incremental scaling can be accomplished by adding computing capabilities, storage capabilities, networking capabilities, etc. For example, a computing system might be composed of hundreds of nodes or more, any one of which might support several thousand or more autonomous virtualized entities (VEs), such as virtual machines (VMs), that are individually tasked to perform one or more of a broad range of computing and/or storage workloads. As the workloads fluctuate, the demand on the resources of the distributed computing system fluctuates as well. In some cases, system administrators might address fluctuating resource demands by adding or removing nodes. In some cases, administrators might deploy certain types of nodes that are configured to handle particular classes or types of workloads (e.g., storage-centric workloads), while other nodes might be configured to handle other classes or types of workloads (e.g., compute-centric workloads). Administrators might also change the physical and/or logical arrangement (e.g., topology) of the nodes based on the then-current or forecasted resource usage and/or workload schedule. Such ongoing changes result in a highly dynamic, ever-changing computing system.
The highly dynamic, ever-changing nature of modern computing systems, combined with the ever-increasing storage demands of such computing systems, has exacerbated the need for highly configurable storage resources that can be added at will into such mixed node-type computing systems. For example, in many environments, such computing systems comprise aggregated physical storage facilities that implement a logical storage pool within which stored data needs to be efficiently distributed and/or replicated according to various metrics and/or objectives. Users of these computing systems expect the platform to provide consistent and predictable storage behavior (e.g., availability, accuracy, etc.) for all types of data (e.g., data and metadata).
Administrators can address such expectations by implementing a fault tolerance policy (e.g., specified in or derived from a service level agreement (SLA)) to facilitate a certain degree of fault tolerance in case of a node and/or storage device failure. At the same time, administrators are also tasked with managing the storage capacity consumed by the working data and replicated data in the system. Erasure coding (EC) is one technique that might be implemented to reduce the overall storage capacity demand on the computing and storage system while maintaining compliance with fault tolerance policies, replication factor policies, and/or other data availability policies. Erasure coding works by forming a parity block that corresponds to two or more data blocks. If one of the data blocks is lost, the data of the lost block can be reconstructed by combining the data blocks that were not lost with the parity block. As a simple example, if block B1=1 and block B2=0, then the parity block over blocks B1 and B2 is P1=1, so as to achieve even parity in the combination (i.e., P1 is the exclusive-OR of B1 and B2). If block B2 is lost, then given the combination of B1=1 and P1=1, it can be known that B2 must have been 0. This simple example can be extended to cover more complex erasure coding configurations, possibly involving more data blocks and possibly involving more parity blocks.
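The single-parity example above can be sketched in code. The following is a minimal Python sketch, assuming the parity block is the bytewise exclusive-OR (XOR) of equal-length data blocks, which is one common way to form a single parity block. The function names xor_blocks, make_parity, and reconstruct are hypothetical and are used for illustration only. Configurations with more than one parity block generally require other codes and are not covered by this sketch.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple per byte position across all blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Form a single parity block over two or more data blocks (assumed XOR parity)."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity):
    """Rebuild the one lost data block from all surviving data blocks plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# The example from the text: B1=1, B2=0, so P1=1.
b1 = bytes([1])
b2 = bytes([0])
p1 = make_parity([b1, b2])

# B2 is lost; B1 and P1 together determine that B2 must have been 0.
recovered = reconstruct([b1], p1)
print(p1[0], recovered[0])
```

Under this assumption, a single parity block tolerates the loss of any one block in the group, since XOR-ing all of the remaining blocks yields the missing one.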
Unfortunately, applying an erasure coding configuration to a computing system relies on manual determinations by administrators. As computing systems become more complex, this places an undue burden on the system administrators.
Furthermore, manually determining appropriate erasure coding configurations for a computing system often leads to suboptimal configurations, and implementing a change from one particular EC configuration to another EC configuration can be costly and/or bothersome. What is needed is a technological solution to reduce the burden on administrators when determining and implementing erasure coding in dynamically-changing computing and storage systems.