Computing systems are becoming increasingly complex. In some contemporary settings, a computing system might comprise hundreds of computing nodes that support thousands (or more) virtualized entities (e.g., virtual machines, executable containers, etc.) running a broad mix of workloads (e.g., applications, tasks, web services, user processes, system processes, etc.). Providers of such large-scale, highly dynamic computing systems have implemented various techniques to facilitate characterization and monitoring of the behavior or “health” of the systems. As an example, certain system health monitoring tools might collect observations of various system behaviors related to measurable system metrics (e.g., CPU usage, occurrences of different types of input/output (I/O or IO) operations, I/O latency, etc.), compare the observations to expected behaviors or values, and then emit system messages when the observations breach corresponding expected values and/or ranges and/or time limits. For example, if a set of observations suggests that 97% of available cycles of a CPU are being used consistently for several minutes or more, that might be considered to breach an established CPU headroom value (e.g., 95%) for CPU usage, and a warning message or some other sort of alert might be emitted (e.g., to an administrative dashboard). In this case, static threshold values (e.g., a CPU headroom threshold value and a time period threshold value) are applied to determine whether or not to emit an alert. In other cases, further thresholds (e.g., 0% to 50% CPU utilization) can be defined to bound normal/acceptable behavior.
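As a minimal sketch of the static-threshold check just described, the following Python fragment flags a breach only when usage stays above a fixed limit for a fixed number of consecutive samples. The function name, the default limit of 95%, and the use of a sample count to stand in for the time period threshold are illustrative assumptions, not details of any particular monitoring tool.

```python
from typing import Sequence


def breaches_static_threshold(
    samples: Sequence[float],
    usage_limit: float = 95.0,   # hypothetical CPU headroom threshold (percent)
    min_consecutive: int = 3,    # hypothetical stand-in for the time period threshold
) -> bool:
    """Return True if usage exceeds usage_limit for at least
    min_consecutive consecutive samples."""
    run = 0
    for value in samples:
        if value > usage_limit:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            # Any sample at or below the limit resets the streak.
            run = 0
    return False
```

Both values are fixed in advance, so the check behaves identically regardless of the then-current configuration or workload.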
Unfortunately, there is often uncertainty in the range between the deemed normal threshold and the abnormal threshold. This is a “gray area” where the observed behavior might be deemed to be normal or might be deemed to be abnormal. Failure to consider this “gray area” often produces alerts that are not meaningful in determining the actual health and/or normal behaviors of the computing system (e.g., false alarms). Also, failure to consider this “gray area” can mask alerts that should be emitted (e.g., missed alerts). As an example, implementing static threshold values to determine behavioral alerts does not take into account the dynamic nature of the configuration and workloads of a modern computing system, where VMs might temporarily demand a lot of CPU resources but then settle down into a lower range. As such, a system administrator might receive dozens or hundreds of alerts on a given day, many of which might be false alarms that correspond to behavior that would be considered to be normal given the then-current configuration and/or workload. Faced with so many false alarms, the system administrator might begin to ignore such alerts, thereby potentially overlooking meaningful alerts that are buried among the dozens or hundreds of alerts.
Furthermore, the aforementioned approaches miss various other types of abnormal behaviors. For example, if a database is normally accessed at a rate of 10,000 transactions per hour, and in some hour-long observation period the observed number of transactions is only 500, a system that tests only against a threshold for a maximum transaction rate will not trigger an alert, thus failing to alert the administrator to the possibility of system problems that might be the cause of the low transaction rate. Still further, alert thresholds that are derived from historical observations will inherently contain some noise and/or error, which can lead to still more erroneous or unnecessary alert reporting.
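The database example above can be illustrated with a short Python sketch that contrasts a maximum-only test with a test that also has a lower bound. The function names and the specific bounds (20,000 and 5,000 transactions per hour) are hypothetical values chosen only for illustration; the lower-bound test is shown for contrast and is itself still a static threshold.

```python
def max_only_alert(observed: int, max_rate: int) -> bool:
    """Alert only when the observed rate exceeds a maximum."""
    return observed > max_rate


def bounded_alert(observed: int, min_rate: int, max_rate: int) -> bool:
    """Alert when the observed rate falls outside [min_rate, max_rate]."""
    return observed < min_rate or observed > max_rate


# Hypothetical bounds around a normal rate of 10,000 transactions per hour.
MAX_RATE = 20_000
MIN_RATE = 5_000

low_hour = 500  # the unusually quiet hour from the example above

missed = not max_only_alert(low_hour, MAX_RATE)         # maximum-only test stays silent
caught = bounded_alert(low_hour, MIN_RATE, MAX_RATE)    # lower bound flags the drop
```

With these assumed values, the maximum-only test raises no alert for the 500-transaction hour, while the bounded test does.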
What is needed is an approach to computing system health monitoring and handling of alerts that limits or eliminates erroneous or unnecessary alert reporting.