The background description provided here is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Modest computing systems can have hundreds or even thousands of processor cores, memory arrays, storage arrays, networking ports and additional peripherals. In large-scale computing systems such as a data center or supercomputer, the number of processor cores can be in the hundreds of thousands to millions. Each hardware component may have a number of associated parameters such as clock speed, temperature, idle time, etc. Some of these parameters may be reported and/or measured by the computing system itself. Other parameters may be monitored by an associated monitoring system.
These parameters are referred to in this disclosure as metrics. Metrics may be defined at a component level, such as available space on a given magnetic disk, or at a subsystem level, such as the amount of available storage space in a storage area network. Metrics may also be defined at a system level, such as the number of transactions per second in a database, the delay in returning results for a query, or the execution time of a particular function. A monitoring system for a large computing system may measure and/or collect thousands, millions, or even billions of time-series metrics (that is, metrics measured over time). Monitoring metrics allows problems to be quickly identified and resolved, ideally before they negatively affect business outcomes, for example by alienating users, missing revenue, or decreasing productivity.
Currently, problems are detected by skilled system administrators who manually create rules to generate alerts for specific metrics. For example, an administrator may set a threshold for available disk space such that an alert will be generated when available disk space decreases below 10% of total disk space. For many metrics, the “correct” threshold may not be known a priori. Instead, the administrator may have to observe the metric over time and infer a threshold based on the historical metric values.
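By way of illustration only, the kind of manually created rule described above might be sketched as follows. The function names, the 10% default, and the choice of a low percentile of historical values as the inferred threshold are assumptions made for this example; the disclosure does not prescribe how an administrator derives a threshold.

```python
def disk_space_alert(available_bytes, total_bytes, threshold_fraction=0.10):
    """Return True when available space is below the given fraction of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return available_bytes / total_bytes < threshold_fraction


def infer_low_threshold(history, fraction=0.05):
    """Hypothetical helper: choose a threshold from observed history.

    Returns the historical value below which roughly `fraction` of the
    samples fall. This is one possible heuristic, not a prescribed method.
    """
    if not history:
        raise ValueError("history must not be empty")
    ordered = sorted(history)
    index = int(fraction * (len(ordered) - 1))
    return ordered[index]
```

A rule of this kind covers a single metric, and either the fixed fraction or the historical samples must be supplied by the administrator for each metric separately.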
Administrators may watch scores of metrics, such as in a dashboard, and use experience and intuition to determine if any of the metrics indicate the onset of a problem. However, regardless of how many computer screens are used, the number of metrics that can be visually tracked is limited.
Because of the limitations of visual and programmatic oversight by human administrators, big data principles have been applied to the problem of monitoring systems. Automated processes may evaluate every single metric, a significant advance compared to the tiny fraction that a human administrator can review, and determine normal historical behavior for each metric.
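As a minimal sketch of what determining normal historical behavior for a metric could look like, the following example summarizes a metric's history by its mean and standard deviation and treats values far from the mean as abnormal. The mean-and-deviation model, the function names, and the multiplier `k` are assumptions chosen for illustration; automated systems may characterize normal behavior in other ways.

```python
from statistics import mean, pstdev


def learn_baseline(history):
    """Summarize historical values of one metric as (center, spread)."""
    if not history:
        raise ValueError("history must not be empty")
    return mean(history), pstdev(history)


def is_abnormal(value, baseline, k=3.0):
    """Return True when value lies more than k spreads away from the center.

    The k-sigma rule is an assumption for this sketch, not a method
    taken from the disclosure.
    """
    center, spread = baseline
    if spread == 0:
        # A perfectly constant history: any different value is a deviation.
        return value != center
    return abs(value - center) > k * spread
```

Because the baseline is computed from the data itself, the same two functions can be applied to every metric without an administrator choosing a threshold for each one.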