Anomalies in performance metric behavior of a large-scale distributed web service may be symptoms of service problems that may lead to loss of revenue for the web service provider and reduced satisfaction of service users. Accurate detection of anomalies in performance metric behavior can support correct diagnosis of service problems, recovery from them, and minimization of their impact. Both online detection (e.g., real-time detection of anomalies as metric measurements are taken) and offline detection (e.g., detection of anomalies in stored measurements indicating changes in past behavior or recurring problems) may be used to discover and address service problems.
Performance metrics, such as response time or throughput, may be sampled at regularly spaced time intervals by information technology (IT) management tools. Some IT management tools detect performance anomalies by setting thresholds for various performance metrics; e.g., an anomaly is detected when a performance metric exceeds or falls below a designated threshold. In some cases, an alarm may be generated when a performance metric either exceeds an upper threshold or falls below a lower threshold.
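By way of illustration only, threshold-based detection of this kind may be sketched as follows. The function name `detect_threshold_anomalies`, its parameters, and the sample values are hypothetical and are not taken from any particular IT management tool; the sketch assumes the metric has already been sampled into a list of numeric values.

```python
from typing import List, Optional, Tuple


def detect_threshold_anomalies(
    samples: List[float],
    lower: Optional[float] = None,
    upper: Optional[float] = None,
) -> List[Tuple[int, float]]:
    """Return (sample index, value) pairs that violate a threshold.

    A sample is flagged when it exceeds `upper` or falls below `lower`.
    Either threshold may be omitted by passing None.
    """
    anomalies = []
    for index, value in enumerate(samples):
        if upper is not None and value > upper:
            anomalies.append((index, value))
        elif lower is not None and value < lower:
            anomalies.append((index, value))
    return anomalies


if __name__ == "__main__":
    # Hypothetical response-time samples in milliseconds.
    response_times_ms = [120.0, 135.0, 480.0, 128.0, 40.0]
    print(detect_threshold_anomalies(response_times_ms, lower=50.0, upper=400.0))
```

Each sample is judged in isolation against fixed limits, which is what makes the approach simple to deploy.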
Performance management tools that employ threshold-based anomaly detection techniques can produce false alarms. In some situations, a performance metric may exceed its threshold in the absence of any major service problems. For example, a metric with a threshold set at the 99th percentile of its historical values is expected to exceed the threshold approximately once in every 100 time samples even in the absence of any service problems, generating a false alarm. In addition, a threshold-based approach may not provide a global view of detected anomalies, such as whether a performance anomaly is an isolated event or the result of a larger pattern.
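The 99th-percentile example may be checked with a small simulation. The following is a minimal sketch under stated assumptions: the metric is modeled as independent draws from a fixed log-normal distribution (a stand-in for problem-free behavior, not a claim about any real service), and `percentile_threshold` and `false_alarm_rate` are hypothetical helper names introduced here for illustration.

```python
import math
import random
from typing import List


def percentile_threshold(history: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest historical value such that
    at least pct percent of the history lies at or below it."""
    ordered = sorted(history)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def false_alarm_rate(
    n_history: int = 10_000,
    n_new: int = 100_000,
    pct: float = 99.0,
    seed: int = 0,
) -> float:
    """Fraction of new, problem-free samples that exceed the threshold."""
    rng = random.Random(seed)
    # Assumption: healthy behavior is a stationary log-normal metric.
    history = [rng.lognormvariate(0.0, 0.5) for _ in range(n_history)]
    threshold = percentile_threshold(history, pct)
    new_samples = [rng.lognormvariate(0.0, 0.5) for _ in range(n_new)]
    alarms = sum(1 for value in new_samples if value > threshold)
    return alarms / n_new


if __name__ == "__main__":
    print(f"false alarm rate: {false_alarm_rate():.4f}")
```

Under these assumptions the observed rate should fall close to 0.01, i.e., roughly one alarm per 100 samples even though no service problem was simulated.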