The client/server computing environment continues to expand into web services (often referred to as “Web 2.0”), the latest iteration of Internet-supported programmatic access to services and data provided by data centers. Commercial data centers are characterized by scale and complexity. By way of illustration, an individual application (e.g., a Web 2.0 application) may operate across thousands of servers, and utility clouds may serve more than two million enterprises, each with different workload characteristics. Accordingly, data center management is a difficult task, and failing to respond to malfunctions can lead to lost productivity and profits.
Anomaly detection as deployed in many data centers compares the metrics being monitored in the data center to fixed thresholds. These thresholds may be determined “offline,” e.g., using training data, and tend to remain constant during the entire monitoring process. Static thresholds are invariant to changes in the statistical distributions of the metrics that occur over time due to factors such as man, material, machine, and processes. Thus, static thresholds do not adapt to intermittent bursts or to workloads that change in nature over time. Static thresholds also cannot be used to effectively identify anomalous behavior unless that behavior produces metric values that are extremely large or extremely small. These factors reduce accuracy and tend to cause false alarms.
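By way of illustration only, the following minimal Python sketch shows how a threshold fixed offline can fail once the workload changes. The mean plus or minus k standard deviations rule, the function names `fit_static_threshold` and `exceeds_static_threshold`, and the sample values are all assumptions introduced for this example and do not describe any particular data center product.

```python
from statistics import mean, stdev

def fit_static_threshold(training, k=3.0):
    """Compute fixed lower/upper bounds once, offline, from training data.

    Assumption: bounds are mean +/- k sample standard deviations.
    """
    mu = mean(training)
    sigma = stdev(training)
    return mu - k * sigma, mu + k * sigma

def exceeds_static_threshold(value, bounds):
    """Return True when the value falls outside the fixed bounds."""
    lower, upper = bounds
    return value < lower or value > upper

# Hypothetical CPU-utilization samples (percent) collected offline.
training = [20, 22, 19, 21, 23, 20, 18, 22]
bounds = fit_static_threshold(training)

# The normal workload later shifts to about 60 percent. The bounds never
# change, so every sample from the new, healthy workload raises an alarm.
shifted = [58, 61, 60, 59, 62]
flags = [exceeds_static_threshold(v, bounds) for v in shifted]
```

Because the bounds are computed once and never updated, each value in `shifted` is flagged even though it may reflect a legitimate change in workload rather than a malfunction.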
Approaches such as Multivariate Adaptive Statistical Filtering (MASF) maintain a separate threshold for each segment of data, where the data is segmented and aggregated by time (e.g., hour of day, day of week). However, these techniques assume a Gaussian data distribution when determining the thresholds. This assumption is frequently violated in practice.
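By way of illustration only, a time-segmented threshold scheme in the spirit of MASF may be sketched in Python as follows. This is a simplified example under stated assumptions, not the published MASF procedure: segmentation is by hour of day only, the bounds are the per-segment mean plus or minus three standard deviations (the step that relies on the Gaussian assumption), and the names `fit_segment_thresholds` and `outside_segment_threshold` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def fit_segment_thresholds(samples, k=3.0):
    """Build one (lower, upper) pair per hour of day.

    samples: iterable of (hour_of_day, value) pairs.
    Assumption: values within each hour are roughly Gaussian, so
    mean +/- k standard deviations covers normal behavior.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    thresholds = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # a sample standard deviation needs two or more points
        mu, sigma = mean(values), stdev(values)
        thresholds[hour] = (mu - k * sigma, mu + k * sigma)
    return thresholds

def outside_segment_threshold(hour, value, thresholds):
    """Compare a value only against the bounds for its own hour."""
    lower, upper = thresholds[hour]
    return value < lower or value > upper

# Hypothetical history: light load at 03:00, heavy load at 14:00.
history = [(3, 10), (3, 12), (3, 11), (3, 9),
           (14, 70), (14, 75), (14, 72), (14, 68)]
thresholds = fit_segment_thresholds(history)
```

With this history, a reading of 70 is within bounds at hour 14 but is flagged at hour 3, which a single static threshold could not express. When the values within a segment are skewed or multimodal, the mean and standard deviation no longer describe the typical range, and bounds computed this way become unreliable.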
Academic statistical techniques cannot be implemented at the scale of data centers and cloud computing systems, and do not work well in online environments because of their high computing overhead and their use of very large amounts of raw metric information. In addition, these techniques typically need prior knowledge about the application service level objectives, service implementations, and request semantics. These techniques also tend to focus only on solving certain well-defined problems at specific levels of abstraction.