The background description provided here is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Modest computing systems can have hundreds or even thousands of processor cores, memory arrays, storage arrays, networking ports and additional peripherals. In large-scale computing systems such as a data center or supercomputer, the number of processor cores can be in the hundreds of thousands to millions. Each hardware component may have a number of associated parameters such as clock speed, temperature, idle time, etc. Some of these parameters may be reported and/or measured by the computing system itself. Other parameters may be monitored by an associated monitoring system.
These parameters are referred to in this disclosure as metrics and may be defined at a component level, such as available space on a given magnetic disk, or at a subsystem level, such as the amount of available storage space in a storage area network. Metrics may also be defined at a system level, such as the number of transactions per second in a database or the delay in returning results for a query. A monitoring system for a large computing system may measure and/or collect thousands, millions, or even billions of time-series metrics (that is, metrics measured over time). Monitoring metrics allows problems to be quickly identified and resolved, ideally before they affect business outcomes, such as by losing users, missing revenue, or decreasing productivity.
Currently, problems are detected by skilled system administrators who manually create rules to generate alerts for specific metrics. For example, an administrator may set a threshold for available disk space such that an alert will be generated when available disk space decreases below 10% of total disk space. For many metrics, the “correct” threshold may not be known a priori. Instead, the administrator may have to observe the metric over time and infer a threshold based on the historical metric values.
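By way of illustration only, the manually created threshold rule described above might be sketched as follows. The function name `disk_space_alert` and the parameter names are hypothetical and do not appear in the disclosure; only the 10% threshold is taken from the example above.

```python
def disk_space_alert(available_bytes: int, total_bytes: int,
                     threshold_fraction: float = 0.10) -> bool:
    """Return True when available space is below the threshold fraction of total space.

    The default of 0.10 mirrors the 10% example in the text; all names here
    are hypothetical and chosen for illustration.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return available_bytes / total_bytes < threshold_fraction
```

A rule of this kind is simple to state, but the administrator still has to choose `threshold_fraction`, which for many metrics requires observing historical values first.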
Administrators may watch scores of metrics, such as in a dashboard, and use experience and intuition to determine if any of the metrics indicate the onset of a problem. However, regardless of how many computer screens are used, the number of metrics that can be visually tracked is limited.
Further, manually setting rules is a tedious and difficult task. For example, some values of a metric may be associated with problems at some times but with normal operation at others. Sometimes this inconsistency can be resolved by combining metrics. For example, an alert can be defined for when processor utilization is above a first threshold and memory utilization is above a second threshold. However, these thresholds may vary over time and their interrelationship may vary depending on tasks the system is performing. When combining metrics, some relationships may be well understood by the administrator but others are less apparent, escaping the administrator's notice.
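As a minimal sketch of the combined rule described above, an alert may be raised only when both metrics exceed their respective thresholds. The threshold values of 0.90 and 0.85 and the name `combined_alert` are assumptions made for illustration; the disclosure does not specify them.

```python
def combined_alert(cpu_utilization: float, memory_utilization: float,
                   cpu_threshold: float = 0.90,
                   memory_threshold: float = 0.85) -> bool:
    """Return True only when BOTH utilizations exceed their thresholds.

    Utilizations are fractions in [0, 1]. The default thresholds are
    hypothetical values, not values taken from the disclosure.
    """
    return (cpu_utilization > cpu_threshold
            and memory_utilization > memory_threshold)
```

Because the thresholds are fixed constants, a rule like this does not adapt when the appropriate values change over time or with the tasks the system is performing, which is the difficulty noted above.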
Because of the limitations of visual and programmatic oversight by human administrators, big data principles have been applied to the problem of monitoring systems. Automated processes may evaluate every single metric, a significant advance compared to the tiny fraction that a human administrator can review, and determine normal historical behavior for each metric. However, automated processes don't have the insight and experience of an administrator, and this insight generally has to be manually taught to the system.
Machine learning can calculate statistics of the values of a metric over time and declare that an anomaly is present when the metric deviates from algorithmically determined behavior. However, determining this behavior algorithmically means that false positives will occur, because metrics drift slowly over time and various circumstances, such as bursts of activity, cause fluctuations at higher rates.
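One simple way to realize the statistical approach described above is to compare each new value against the mean and standard deviation of a trailing window of earlier values. The following is a sketch under stated assumptions, not the method of any particular monitoring product: the window length of 20 samples, the 3-sigma limit, and the name `find_anomalies` are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=20, max_sigma=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than max_sigma standard deviations.

    window and max_sigma are hypothetical defaults chosen for illustration.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale for deviation; skip it.
            continue
        if abs(values[i] - mu) > max_sigma * sigma:
            anomalies.append(i)
    return anomalies
```

A detector of this kind illustrates the trade-off noted above: a short burst of activity can exceed the limit and be flagged even though nothing is wrong, while a slow drift shifts the window statistics themselves.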
When a monitoring system is collecting millions of metrics, the number of false positives, even with a very low false positive rate, can quickly become noise from which a human administrator cannot detect the signal. As just one example, a recent security breach at a major retailer was detected by security monitoring software. However, these detections were mixed in with so many false positives that the security software's detection of the breach was only recognized after the breach was reported in the press.
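The scale of the problem can be illustrated with simple arithmetic. The figures used below (one million metrics, one sample every five minutes, and a false positive rate of 0.01% per sample) are hypothetical values chosen only to show the calculation; they are not taken from the disclosure or from any measured system.

```python
def expected_false_alerts(num_metrics: int, samples_per_day: int,
                          false_positive_rate: float) -> float:
    """Expected number of false alerts per day, assuming every sample of
    every metric is evaluated independently at the given rate."""
    return num_metrics * samples_per_day * false_positive_rate


# Hypothetical inputs: 1,000,000 metrics, 288 samples per day (one every
# five minutes), and a 0.01% false positive rate per sample.
alerts_per_day = expected_false_alerts(1_000_000, 288, 0.0001)
```

Under these assumed inputs the expected count is roughly 28,800 false alerts per day, far more than an administrator could review individually.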