The performance of a large-scale distributed web service may be gauged by monitoring various performance metrics of the service, such as throughput, response time, etc. Anomalies in the performance metric behavior are often symptoms of service problems that lead to loss of revenue for the service providers and reduced satisfaction of the service users. Therefore, performance management tools typically are used for purposes of detecting anomalies in the performance metric behavior in order to diagnosis and recover from service performance problems and minimize impact of the problems on the service providers and service users.
A traditional approach to detect a performance anomaly involves detecting when a particular performance metric passes a threshold. The threshold may be set either manually or automatically. Setting a threshold manually may be challenging due to such factors as the occurrence of different normal performance modes and a large number of monitored performance metrics. Automatically determining the threshold settings for a performance metric may involve determining a statistical distribution of a historical measurement of the performance metric. For example, thresholds may be set at the fifth percentile and ninety-fifth percentile of the historical measurements of the performance metric; three standard deviations above and below the average, or mean, of the historical measurements; etc.