The use of virtualized environments in distributed computing systems to improve the utilization of computing resources continues to increase. For example, virtual machines (VMs) and/or application containers (ACs) can be implemented in full virtualization environments and/or operating system virtualization environments, respectively, in such distributed computing systems. The high storage I/O demand of VMs and ACs has precipitated an increase in deployment of distributed storage systems. Modern hyperconverged distributed systems (e.g., combining distributed computing and distributed storage) have evolved to comprise autonomous nodes that facilitate incremental and/or linear scaling. In some cases, the distributed systems comprise numerous nodes supporting multiple user VMs and ACs running a broad variety of applications, tasks, and/or processes. For example, with as many as several thousand autonomous VMs per cluster, the storage I/O activity in a distributed system can be highly dynamic. Providers of such large-scale, highly dynamic distributed systems have implemented certain metrics to characterize the behavior of the systems. For example, the system behavior can be monitored by collecting periodic measurements for the metrics. In some cases, the metrics can be used as an indication of system performance. Thresholds might also be established for the metrics which, when breached, can trigger various alerts and/or actions pertaining to the corresponding metric. For example, when a threshold related to storage I/O activity is breached, an alert recommending an increase in storage I/O capacity (e.g., adding more nodes) might be issued.
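By way of illustration only, the threshold-and-alert pattern described above might be sketched as follows. The metric name, the limit value, the sample values, and the names `Threshold` and `check_breaches` are hypothetical and are chosen solely for this example; they do not describe any particular provider's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Threshold:
    """A fixed limit on one system metric (all values are hypothetical)."""
    metric: str   # name of the monitored metric
    limit: float  # value above which the threshold is considered breached
    action: str   # recommendation issued when a breach occurs


def check_breaches(samples: Iterable[float], threshold: Threshold) -> List[str]:
    """Return one alert message per periodic sample that exceeds the limit."""
    alerts = []
    for value in samples:
        if value > threshold.limit:
            alerts.append(
                f"ALERT: {threshold.metric}={value} exceeds "
                f"{threshold.limit}; recommended action: {threshold.action}"
            )
    return alerts


if __name__ == "__main__":
    # Hypothetical periodic measurements of storage I/O operations per second.
    iops_samples = [1200.0, 5400.0, 800.0, 6100.0]
    iops_threshold = Threshold(
        metric="storage_iops",
        limit=5000.0,
        action="increase storage I/O capacity (e.g., add nodes)",
    )
    for alert in check_breaches(iops_samples, iops_threshold):
        print(alert)
```

In this sketch the limit is a static value, so every sample above it produces an alert regardless of the workload that generated the sample.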
Unfortunately, legacy techniques for establishing metric thresholds present limitations, at least in their ability to determine system metric thresholds that correlate to the system performance as perceived by the user. Specifically, certain legacy approaches merely set the thresholds of certain system metrics based on default values specified by the providers of the distributed systems. Such default thresholds do not account for the particular dynamic user environment (e.g., applications, workloads, etc.) implemented in the distributed system. For example, a default threshold might underestimate the perceived performance in one user environment and overestimate the perceived performance in another user environment. In another example, a default threshold might trigger multiple alerts that are ignored by a user since the user is satisfied with the system performance. As yet another example, a default threshold might not be breached even though the user is not satisfied with the system performance. Certain approaches might allow a user to set the thresholds for certain detailed system metrics (e.g., CPU utilization, storage access latency, etc.). However, many users may not understand the relationship (if any) between the metrics and perceived performance, resulting in user-specified thresholds that might be ineffective at (or even deleterious to) improving perceived performance. In the foregoing and other legacy approaches, certain actions might also be taken based on observations that are statistically unreliable and/or uncorrelated. For example, a threshold breach at one moment in time may precipitate an action, such as adding a node to a cluster, yet the mere occurrence of such a breach might not have a statistically significant correlation to improving the user's perception of the cluster performance, thus resulting in an expense without corresponding system improvements.
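As a purely illustrative sketch of the correlation concern noted above, the following computes a Pearson correlation coefficient between a system metric and a user-reported satisfaction score. All of the data values, the 1-to-5 satisfaction scale, and the function name `pearson` are assumptions made for this example; the sketch is not drawn from any legacy approach and says nothing about statistical significance for samples this small.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observations taken over the same six time intervals.
satisfaction = [5, 4, 4, 3, 2, 1]        # user-reported score (assumed scale 1-5)
latency_ms = [2, 3, 4, 6, 9, 12]         # storage access latency
cpu_percent = [50, 70, 40, 65, 45, 60]   # CPU utilization

if __name__ == "__main__":
    print("latency vs. satisfaction:", round(pearson(latency_ms, satisfaction), 3))
    print("CPU vs. satisfaction:", round(pearson(cpu_percent, satisfaction), 3))
```

With these assumed values, latency tracks the satisfaction score closely while CPU utilization does not, so a threshold on CPU utilization in this hypothetical environment could be breached, or not breached, with little relation to how the user perceives performance.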
What is needed is a technique or techniques to improve over legacy and/or over other considered approaches. Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.