This application is the National Stage filing of PCT Application Ser. No. PCT/US12/49101 filed on Aug. 1, 2012 and claims priority to the PCT application under 35 U.S.C. §371.
Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
The online detection of anomalous system behavior caused by operator errors, hardware/software failures, resource over/under-provisioning, and similar causes is a vital element of operations in large-scale data centers and utility clouds. Conventional detection methods currently used in industry are often based on setting thresholds. Threshold values may come from pre-defined performance knowledge or constraints (e.g., service level objectives (SLOs)) or from predictions based on long-term historical data analysis. Whenever any metric observation violates a threshold limit, an anomaly alarm is triggered. Although this approach is simple to implement and easy to present visually, it may not have sufficient robustness and scalability for utility cloud needs.
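By way of illustration only, the conventional threshold-based approach described above may be sketched as follows. The metric names, the limit values, and the function name check_thresholds are hypothetical and are chosen solely for this example; they are not taken from any particular monitoring product.

```python
from typing import Dict, List, Tuple

# Hypothetical static limits (lower, upper) per metric. In practice such
# values might come from SLOs or from long-term historical data analysis.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "cpu_utilization_pct": (0.0, 90.0),
    "response_time_ms": (0.0, 500.0),
}


def check_thresholds(
    observation: Dict[str, float],
    thresholds: Dict[str, Tuple[float, float]] = THRESHOLDS,
) -> List[str]:
    """Return the names of metrics whose observed value violates a limit."""
    alarms = []
    for metric, value in observation.items():
        if metric not in thresholds:
            continue  # no limit configured for this metric
        low, high = thresholds[metric]
        if value < low or value > high:
            alarms.append(metric)  # an anomaly alarm is raised for this metric
    return alarms


if __name__ == "__main__":
    sample = {"cpu_utilization_pct": 95.0, "response_time_ms": 120.0}
    print(check_thresholds(sample))
```

In such a sketch, every metric requires its own manually maintained limit, and a single noisy observation is enough to raise an alarm, which illustrates the robustness and scalability concerns noted above.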
Therefore, as the scale and complexity of cloud-based software, applications, and workload patterns increase, anomaly detection methods for cloud monitoring should operate automatically at run time and without the need for prior knowledge about normal or anomalous behaviors. These anomaly detection methods should also be sufficiently general to apply to multiple levels of abstraction and sub-systems and to the different metrics used in large-scale systems. In addition, collection of status data may not always be successful and on time. Hence, anomaly detection should be robust enough to achieve a high detection rate while maintaining a low false positive rate under various scenarios, such as noise corruption or incomplete data collection.