1. Field of the Invention
The invention generally relates to the monitoring of a computer-based system, such as a web site system, to detect events or problems that may require a corrective action. More specifically, the invention relates to a methodology of monitoring such a system by computationally predicting the values of data representative of its normal operating conditions, and computationally evaluating observed values against predicted values in real time or near real time. In particular, the invention relates to improved prediction (“forecasting”) methods for use in such a methodology.
2. Description of the Related Art
Techniques for forecasting are useful in a number of applications. One such application involves the monitoring of a computer system, such as a web site system or an email system, to detect various types of problems associated with the operation or use of the system. For instance, a forecasting algorithm may be used to predict the server response times that will be experienced by users at a particular point in time. These predicted values may then be compared to actual response time values to evaluate whether the monitored system is functioning properly. Data corresponding to measurements of an “observable” over time are commonly referred to as a time series. Observables (e.g., server response time) that are useful in evaluating system health are commonly referred to as metrics.
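By way of illustration only, the comparison of observed values against predicted values might be sketched as follows. The function name `deviates`, the relative tolerance of 50%, and the sample response times are assumptions chosen for the example and do not come from any particular monitoring system.

```python
def deviates(observed, predicted, rel_tolerance=0.5):
    """Return True when the observed value differs from the predicted
    value by more than rel_tolerance times the predicted value.

    The relative-tolerance rule is a hypothetical choice for this sketch.
    """
    if predicted <= 0:
        raise ValueError("predicted value must be positive")
    return abs(observed - predicted) > rel_tolerance * predicted


# Hypothetical predicted server response time and three observations (ms).
predicted_ms = 200.0
alarms = [deviates(obs, predicted_ms) for obs in (210.0, 190.0, 950.0)]
# Only the last observation exceeds the tolerance band.
```

The usefulness of such a check depends on the quality of the predicted value, which is the subject of the forecasting techniques discussed below.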
Time series forecasting algorithms vary greatly in complexity. The simplest techniques forecast a single value. More sophisticated techniques forecast trend, seasonal cycles (periodic behavior), and combinations thereof. Some techniques (“robust techniques”) mitigate the impact of isolated outliers in historical data on forecasts. While such techniques may isolate a single outlier in a historical data series, an abnormality in a monitored system may result in a relatively large number of consecutive data points corresponding to anomalous data. These anomalous data points are not effectively handled by conventional forecasting techniques.
One simple technique is the exponentially weighted moving average (“exponential smoothing”), which forecasts a single value by averaging historical data, with the weight given to each observation decreasing exponentially with its age. Applied to the page latency (load time) of a web server, for example, the technique predicts the next latency value as a weighted average of past latencies in which the most recent measurements count the most.
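The following is a minimal sketch of exponential smoothing under stated assumptions: the smoothing factor `alpha`, the choice to initialize the level with the first observation, and the latency values are all illustrative and are not prescribed by the technique itself.

```python
def exponential_smoothing(observations, alpha=0.3):
    """Return the single-value forecast after processing all observations.

    Each update blends the newest observation with the previous level,
    so older observations receive exponentially smaller weight.
    """
    if not observations:
        raise ValueError("need at least one observation")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = observations[0]  # assumed initialization: first observation
    for x in observations[1:]:
        level = alpha * x + (1.0 - alpha) * level
    return level


# Hypothetical page latencies in milliseconds.
latencies_ms = [200.0, 210.0, 190.0, 205.0]
forecast = exponential_smoothing(latencies_ms, alpha=0.5)  # 201.25
```

A larger `alpha` makes the forecast react faster to recent data, and a smaller `alpha` yields a smoother forecast that reacts more slowly.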
Forecasting algorithms that account for trend are useful for metrics that exhibit growth or decay. For example, a web site with a steadily increasing user base will likely exhibit an increasing trend in the number of web pages requested per minute, and this trend should be exploited in forecasting the page request rate. One algorithm that predicts trend is an extension of the exponential smoothing technique commonly referred to as Holt-Winters.
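A trend-aware extension of exponential smoothing can be sketched as below. This form, which maintains a smoothed level and a smoothed trend, is often described in the literature as double exponential smoothing or Holt's linear method. The parameter names `alpha`, `beta`, and `horizon`, the initialization from the first two observations, and the request-rate values are assumptions made for the example.

```python
def holt_linear_forecast(observations, alpha=0.5, beta=0.5, horizon=1):
    """Forecast `horizon` steps ahead using a smoothed level and trend."""
    if len(observations) < 2:
        raise ValueError("need at least two observations")
    # Assumed initialization: first value as level, first difference as trend.
    level = observations[0]
    trend = observations[1] - observations[0]
    for x in observations[1:]:
        prev_level = level
        # Blend the new observation with the trend-adjusted previous level.
        level = alpha * x + (1.0 - alpha) * (prev_level + trend)
        # Blend the newly observed slope with the previous trend estimate.
        trend = beta * (level - prev_level) + (1.0 - beta) * trend
    return level + horizon * trend


# Hypothetical page requests per minute for a steadily growing site.
requests_per_min = [100.0, 110.0, 120.0, 130.0]
next_rate = holt_linear_forecast(requests_per_min)  # 140.0
```

For a perfectly linear series such as this one, the forecast simply continues the line; for noisy data, `alpha` and `beta` control how quickly the level and trend estimates adapt.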
Forecasting algorithms that account for seasonal variations are useful where the metric or activity being monitored tends to vary in a predictable manner over cyclical time periods or “seasons.” For example, metrics associated with the load placed on a large scale server system will commonly vary over daily, weekly, and/or yearly cycles in response to customer behavior. One algorithm that predicts seasonal variations is an extension of the Holt-Winters technique commonly referred to as Holt-Winters Seasonal (“HWS”).
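For illustration, a seasonal forecast in the additive form of HWS might be sketched as follows. The additive (rather than multiplicative) form, the simple initialization from the first two seasons, the smoothing factors, and the sample data are all assumptions of this sketch; practical implementations differ in these details.

```python
def holt_winters_additive(observations, period, alpha=0.3, beta=0.1,
                          gamma=0.3, horizon=1):
    """Return a list of `horizon` forecasts using level, trend, and
    one additive seasonal offset per position in the cycle."""
    if period < 1 or len(observations) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    first = observations[:period]
    second = observations[period:2 * period]
    # Assumed initialization: mean of the first season as the level,
    # average per-step change between the first two seasons as the trend,
    # and first-season deviations from the mean as seasonal offsets.
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / (period * period)
    seasonal = [x - level for x in first]
    for t in range(period, len(observations)):
        x = observations[t]
        i = t % period
        prev_level = level
        level = alpha * (x - seasonal[i]) + (1.0 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1.0 - beta) * trend
        seasonal[i] = gamma * (x - level) + (1.0 - gamma) * seasonal[i]
    n = len(observations)
    return [level + h * trend + seasonal[(n + h - 1) % period]
            for h in range(1, horizon + 1)]


# Hypothetical load metric repeating over a three-sample cycle.
load = [10.0, 20.0, 30.0, 10.0, 20.0, 30.0, 10.0, 20.0, 30.0]
next_cycle = holt_winters_additive(load, period=3, horizon=3)
```

With this purely periodic input, the forecast reproduces the next cycle of the pattern. Because every observation updates the level, the trend, and a seasonal offset, a corrupted observation affects all three quantities.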
Forecasting algorithms sometimes employ techniques intended to discard outliers in the observed data. For example, a 911 call answer time metric is likely to be quite stable but may have infrequent anomalies. One approach to discarding such outliers is control charting. Robust techniques are typically ineffective, however, when an anomaly spans several successive observations or when an anomaly is present in the most recent observation(s).
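A simple form of control charting can be sketched as follows, assuming control limits placed at the baseline mean plus or minus `k` standard deviations. The value `k = 3`, the function names, and the answer-time data are illustrative assumptions.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits at mean +/- k standard deviations."""
    mean = statistics.fmean(baseline)
    sigma = statistics.pstdev(baseline)
    return mean - k * sigma, mean + k * sigma


def discard_outliers(baseline, new_points, k=3.0):
    """Keep only the new points that fall within the control limits."""
    lower, upper = control_limits(baseline, k)
    return [x for x in new_points if lower <= x <= upper]


# Hypothetical call answer times in seconds: a stable baseline,
# followed by new observations containing one isolated anomaly.
baseline_s = [5.0, 6.0, 5.0, 7.0, 6.0, 5.0, 6.0, 6.0]
kept = discard_outliers(baseline_s, [6.0, 45.0, 5.0])  # [6.0, 5.0]
```

This approach handles an isolated spike well. If a run of anomalous values enters the baseline itself, however, the mean and standard deviation shift and the limits widen, which is consistent with the limitation noted above.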
One problem with existing forecasting methods, including those that account for seasonal variations, is that they commonly produce inaccurate results for a period of time after an anomaly occurs. For example, an anomalous event in the operation of a computer system, such as a server failure or denial of service (DoS) attack, will typically result in one or more anomalous data values in the time series of a representative metric; moreover, while the failure is ongoing, these data points will typically be the most recent available observations. As described above, typical outlier removal methods are not effective in this situation. Methods such as exponential smoothing will perform poorly from the start of the event until well after the event has subsided, because the anomalous data continue to influence the forecast. Seasonal forecasting algorithms may produce poor forecasts for several cycles (e.g., days, weeks or years); HWS typically does not completely recover within a meaningful time scale. As a result, the forecasting-based monitoring system may both fail to accurately detect problems that require attention and generate an unacceptably large number of false alarms.
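The lingering effect of an anomaly on exponential smoothing can be illustrated with a small simulation. The series, the smoothing factor of 0.2, and the 10 ms error threshold are hypothetical values chosen only to make the effect visible.

```python
def one_step_forecasts(observations, alpha):
    """Return, for each observation, the forecast made before seeing it.

    The forecast for the very first observation is taken to be that
    observation itself (an assumed initialization).
    """
    forecasts = []
    level = observations[0]
    for x in observations:
        forecasts.append(level)
        level = alpha * x + (1.0 - alpha) * level
    return forecasts


# Hypothetical latency series (ms): 20 normal samples, a 5-sample
# anomaly (e.g., during a server failure), then 40 normal samples.
normal_ms, anomalous_ms = 100.0, 1000.0
series = [normal_ms] * 20 + [anomalous_ms] * 5 + [normal_ms] * 40
forecasts = one_step_forecasts(series, alpha=0.2)

recovery_start = 25  # index of the first normal sample after the anomaly
# Count post-anomaly samples whose forecast is still off by more than 10 ms.
bad_steps = sum(
    1 for t in range(recovery_start, len(series))
    if abs(forecasts[t] - normal_ms) > 10.0
)
```

Under these assumptions, the forecast remains inflated for several times the length of the five-sample anomaly itself, so observations that are in fact normal would continue to be evaluated against a distorted prediction.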
Another problem with the more complex conventional forecasting methods is that they are typically computationally intensive. Robust spline fitting and robust LOESS, which provide some improvement in outlier mitigation, are two examples of such methods. Real-time monitoring of a complex computer system such as a web site requires timely anomaly detection (often within one to five minutes) and involves processing many concurrent metrics (hundreds or thousands), which makes such computationally intensive methods impractical.