Detection and management of performance issues in complex computing systems has traditionally been accomplished by applying fixed thresholds to system-specific metric values that are collected over time. FIG. 1 illustrates a fixed threshold 100 that has been set to a value of 75 for a metric (e.g., disk reads per second) whose measured value normally varies in a sinusoidal manner depending on the hour of the day, as shown by line 101. Systems using a fixed threshold make a simple arithmetic comparison of the current metric value against the fixed threshold and alert administrators when the threshold is crossed (from below or from above, depending on metric semantics). In the example shown in FIG. 1, such a system generates a false alert at 12 PM, when a measurement 102 of the metric is at value 80, even though this value is less than normal (as shown by line 101) for that hour of the day. The system also fails to generate an alert at 12 AM, when the measurement 103 of the metric is at value 70, even though this value is greater than normal for that hour.
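The behavior described for FIG. 1 can be illustrated with a minimal Python sketch. The threshold of 75 and the two measurements (80 at 12 PM, 70 at 12 AM) come from the example above. The amplitude and phase of the sinusoidal baseline, and the function names `expected_value`, `fixed_threshold_alert`, and `classify`, are assumptions made only for illustration, since the text does not give the exact shape of line 101. The sketch also assumes that an alert is raised when the metric exceeds the threshold.

```python
import math

FIXED_THRESHOLD = 75.0  # fixed threshold 100 in FIG. 1


def expected_value(hour: int) -> float:
    """Hypothetical stand-in for line 101: a daily sinusoid that peaks at
    noon and dips at midnight. Amplitude and offset are assumed values."""
    return 75.0 + 20.0 * math.cos(2.0 * math.pi * (hour - 12) / 24.0)


def fixed_threshold_alert(value: float) -> bool:
    """Simple arithmetic comparison against the fixed threshold."""
    return value > FIXED_THRESHOLD


def classify(hour: int, value: float) -> str:
    """Compare the fixed-threshold decision with what the hourly norm suggests."""
    alerted = fixed_threshold_alert(value)
    above_normal = value > expected_value(hour)
    if alerted and not above_normal:
        return "false alert"
    if not alerted and above_normal:
        return "missed alert"
    return "correct alert" if alerted else "correct silence"


if __name__ == "__main__":
    print(classify(12, 80.0))  # measurement 102 at 12 PM
    print(classify(0, 70.0))   # measurement 103 at 12 AM
```

Under the assumed baseline, the 12 PM measurement is flagged although it is below the norm for that hour, and the 12 AM measurement passes silently although it is above the norm for that hour.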
In addition to missed alerts and false alerts, systems using fixed thresholds for detection of performance anomalies suffer from a number of other shortcomings. In particular, such systems are labor-intensive, error-prone, and subjective. Fixed threshold systems are labor-intensive because administrators must often perform extensive manual configuration (and re-configuration) to initialize and set up the detection mechanisms. Fixed threshold systems are error-prone in that they fail to adjust to expected fluctuations in performance and frequently either fail to signal real problems or signal falsely. Moreover, fixed thresholds are subjective in that every system must be individually configured, often in the absence of accurate historical information, so administrators must make educated (or arbitrary) guesses.
U.S. Pat. No. 6,675,128 granted to Hellerstein on Jan. 6, 2004, entitled “Methods And Apparatus For Performance Management Using Self-Adjusting Model-Based Policies” is incorporated by reference herein in its entirety as background. This patent describes using models of measurement variables to provide self-adjusting policies that reduce the administrative overhead of specifying thresholds and provide a means for pro-active management by automatically constructing warning thresholds based on the probability of an alarm occurring within a time horizon. Hellerstein's method includes components for model construction, threshold construction, policy evaluation, and action taking. Hellerstein's thresholds are computed dynamically, based on historical data, metric models, and separately specified policies for false alarms and warnings. Hellerstein describes an example in which a metric model is used to determine the metric's 95th percentile for the time interval in which the control policy is being evaluated, and this percentile is used as the alarm threshold. Hellerstein does not appear to be interested in using a model to determine thresholds of very high significance.
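The idea of using a 95th percentile for a given time interval as an alarm threshold can be sketched as follows. This is not Hellerstein's method, which derives the percentile from a metric model. The sketch substitutes a plain empirical nearest-rank percentile over historical samples grouped by hour of day, and the names `nearest_rank_percentile` and `hourly_thresholds`, as well as the `(hour, value)` layout of the history, are assumptions made for illustration.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Empirical percentile by the nearest-rank method (assumed here in
    place of a model-derived percentile)."""
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def hourly_thresholds(history, pct=95.0):
    """Group (hour, value) samples by hour and return one threshold per hour."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: nearest_rank_percentile(vals, pct)
            for hour, vals in by_hour.items()}


if __name__ == "__main__":
    # Synthetic history: 100 samples at noon, 20 samples at midnight.
    history = [(12, v) for v in range(1, 101)] + [(0, v) for v in range(1, 21)]
    thresholds = hourly_thresholds(history)
    print(thresholds[12], thresholds[0])
```

Because each hour receives its own threshold, a value that is ordinary at noon can still be flagged at midnight, which is the adjustment a single fixed threshold cannot make.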
U.S. Pat. No. 6,675,128 does not appear to explicitly describe how a metric model is to be constructed. Hellerstein states that a model constructor 230 is used to estimate the values of unknown constants in models based on historical values of measurement data 215. Hellerstein further states that the operation of component 230 is well understood, as disclosed in the literature on time series forecasting, e.g., G. E. P. Box and G. M. Jenkins, “Time Series Analysis,” Prentice Hall, 1977.
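As a hedged illustration of what estimating the unknown constants of a model from historical measurements can mean, the following sketch fits a first-order autoregressive model, x[t] = c + phi * x[t-1], by ordinary least squares. The choice of this model is an assumption. Neither the patent as summarized above nor this document states which time series model the model constructor 230 uses, and the function names `fit_ar1` and `forecast_next` are hypothetical.

```python
def fit_ar1(series):
    """Estimate (c, phi) in x[t] = c + phi * x[t-1] by ordinary least squares."""
    if len(series) < 3:
        raise ValueError("at least three observations are required")
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    var_prev = sum((p - mean_prev) ** 2 for p in prev)
    if var_prev == 0:
        raise ValueError("series is constant; phi is not identifiable")
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    phi = cov / var_prev
    const = mean_curr - phi * mean_prev
    return const, phi


def forecast_next(series, const, phi):
    """One-step-ahead forecast from the fitted constants."""
    return const + phi * series[-1]


if __name__ == "__main__":
    # Noise-free synthetic history generated with c = 2 and phi = 0.5.
    history = [0.0]
    for _ in range(10):
        history.append(2.0 + 0.5 * history[-1])
    c, phi = fit_ar1(history)
    print(c, phi, forecast_next(history, c, phi))
```

On noise-free data the fit recovers the generating constants. With real measurements the estimates would carry error, and a seasonal model would likely be needed for a metric with a daily cycle such as the one in FIG. 1.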