Existing IT performance management tools detect performance changes by thresholding on performance metrics: a change is flagged when a metric crosses a threshold. For example, a threshold can be set for each performance metric, and an alarm is generated at any time sample at which at least one metric exceeds its threshold. In a specific example, an alarm can be generated when the response time for a web page exceeds a threshold of 3 seconds. Often two thresholds, an upper and a lower, are set, and an alarm is generated when a performance metric either exceeds the upper threshold or falls below the lower threshold.
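The per-sample alarm logic described above can be sketched as follows; the metric names and threshold values here are hypothetical, chosen only to mirror the web-page response-time example:

```python
# Hypothetical per-metric (lower, upper) thresholds; None means no bound on that side.
THRESHOLDS = {
    "web_response_time_s": (None, 3.0),    # alarm if response time exceeds 3 seconds
    "throughput_rps":      (100.0, None),  # alarm if throughput falls below 100 req/s
}

def check_sample(sample):
    """Return the threshold breaches at one time sample of metric measurements."""
    alarms = []
    for metric, value in sample.items():
        lower, upper = THRESHOLDS.get(metric, (None, None))
        if lower is not None and value < lower:
            alarms.append((metric, "below", lower))
        if upper is not None and value > upper:
            alarms.append((metric, "above", upper))
    return alarms
```

A sample such as `{"web_response_time_s": 3.5, "throughput_rps": 150.0}` would then raise a single alarm for the response-time metric exceeding its upper threshold.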
The thresholds can be set either manually or automatically. Setting thresholds manually is challenging because a large-scale distributed service typically exposes hundreds to thousands of performance metrics, each with potentially different characteristics. An alternative is automated threshold setting, in which thresholds are based on statistics such as means, standard deviations, or percentiles computed from historical measurements of the metrics. For instance, the thresholds can be set at the 5th and 95th percentiles of a metric's historical measurements, or at three standard deviations above and below the mean of those measurements.
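Both automated schemes mentioned above can be sketched with the standard library; this is a minimal illustration, not a specific tool's implementation:

```python
import statistics

def percentile_thresholds(history, ):
    """Lower/upper thresholds at the 5th and 95th percentiles of the history."""
    # statistics.quantiles with n=20 returns cut points at 5%, 10%, ..., 95%,
    # so the first and last cut points are the 5th and 95th percentiles.
    cuts = statistics.quantiles(history, n=20)
    return cuts[0], cuts[-1]

def sigma_thresholds(history, k=3):
    """Lower/upper thresholds k standard deviations below/above the mean."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return mu - k * sigma, mu + k * sigma
```

For a metric whose history spans 1 to 100, the percentile thresholds fall near 5 and 95, while the three-sigma thresholds lie well outside the observed range, illustrating how the two schemes can differ substantially on the same data.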
Detecting changes through thresholding is a poor approach for several reasons. First, thresholds are misleading when a performance metric exhibits multiple behaviors due to cyclic variations, for example weekly or monthly patterns. In such cases, a single set of thresholds, such as one pair of upper and lower thresholds, cannot capture the full range of normal behavior and is an unreliable basis for detection decisions. Second, thresholding assumes that the impact of a change depends only on the magnitude of the change and ignores its duration, leading to false change detection alarms as well as missed detections. Finally, thresholding does not provide a global view of the detected changes. For example, it remains unclear when the new performance metric behavior starts and ends, making it difficult to reach accurate diagnosis and recovery decisions after a change is detected.
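The first drawback can be made concrete with a small numeric sketch. The values below are synthetic and purely illustrative: a metric that normally sits near 1.0 s on weekdays and near 3.0 s on weekends. No single pair of mean-plus-k-sigma thresholds fits both regimes:

```python
import statistics

# Synthetic two-regime history (hypothetical values):
# 50 weekday samples near 1.0 s, 20 weekend samples near 3.0 s.
history = [1.0] * 50 + [3.0] * 20

mu = statistics.mean(history)    # approximately 1.57
sigma = statistics.stdev(history)  # approximately 0.91

tight_hi = mu + 1 * sigma  # a tight threshold: every normal weekend sample alarms
loose_hi = mu + 3 * sigma  # a loose threshold: a real weekday jump to 2.5 s is missed

assert 3.0 > tight_hi  # false alarms on perfectly normal weekend behavior
assert 2.5 < loose_hi  # missed detection of a genuine weekday degradation
```

Tightening the threshold trades missed detections for false alarms, and loosening it does the reverse; neither setting distinguishes normal cyclic variation from an actual change.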