Predictive models are widely used for tasks in many domains. Examples include: anticipating future customer demands in retailing by extrapolating historical trends; planning equipment acquisition in manufacturing by predicting the outputs that can be achieved by production lines once the desired machines are incorporated; and diagnosing computer performance problems by using queuing models to reverse engineer the relationships between response times and service times and/or arrival rates.
Predictive models can take many forms. Linear forecasting models, such as Box-Jenkins models, are widely used to extrapolate trends. Weather forecasting often uses systems of differential equations. Analysis of computer and manufacturing systems frequently uses queuing models.
Predictive models are of two types. Off-line models estimate their parameters from historical data. This is effective for processes that are well understood (e.g., industrial control) but is much less effective for processes that change rapidly (e.g., web traffic). On-line models adjust their parameters with changes in the data and so are able to adapt to changes in the process. For this reason, a focus of the present invention is on-line models.
Another consideration is the exploitation of multiple models. For example, in computer systems, forecasting models are used to anticipate future workloads, and queuing models are employed to assess the performance of equipment at the future workload levels. Indeed, over time, it is often necessary to use many models in combination.
To illustrate this point, we consider a forecasting model for web server traffic. Consider the model described in J. Hellerstein, F. Zhang, and P. Shahabuddin, “An Approach to Predictive Detection for Service Level Management,” Integrated Network Management VI, edited by M. Sloman et al., IEEE Publishing, May 1999, the disclosure of which is incorporated by reference herein, that forecasts the number of hypertext operations per second at time t, which we denote by S(t). The following models are considered:
    1. S(t) is determined entirely by its mean. That is, S(t)=mean+e(t), where e(t) is the model's “residual,” i.e., what is left after the effect of the model is removed.
    2. S(t) is determined by its mean and time of day. That is, t=(i,l), where i is an interval during a 24 hour day and l specifies the day. For example, days might be segmented into five minute intervals, in which case i ranges from 1 to 288. Thus, S(i,l)=mean+mean_tod(i)+e(i,l).
    3. S(t) is determined by its mean, time of day and day of week. That is, t=(i,j,l), where i is an interval during a 24 hour day, j indicates the day of week (e.g., Monday, Tuesday), and l specifies the week instance. Thus, S(i,j,l)=mean+mean_tod(i)+mean_day-of-week(j)+e(i,j,l).
    4. S(t) is determined by its mean, time of day, day of week and month. Here, t=(i,j,k,l), where k specifies the month and l specifies the week instance within a month. Thus, S(i,j,k,l)=mean+mean_tod(i)+mean_day-of-week(j)+mean_month(k)+e(i,j,k,l).
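As an illustration of the second model in the list above, the following is a minimal sketch of how the mean and the time-of-day term might be estimated and then used to form a prediction and a residual. The estimator (interval average minus overall average) and the function names fit_mean_tod, predict, and residual are assumptions made for this example; they are not taken from the referenced article.

```python
from collections import defaultdict
from statistics import mean


def fit_mean_tod(observations):
    """Estimate the overall mean and a time-of-day effect per interval.

    observations: list of (i, value) pairs, where i is the interval index
    within the day (e.g., 1..288 for five minute intervals).
    Returns (overall_mean, {i: mean_tod(i)}).
    Assumption: mean_tod(i) is the interval average minus the overall mean.
    """
    overall = mean(value for _, value in observations)
    by_interval = defaultdict(list)
    for i, value in observations:
        by_interval[i].append(value)
    tod = {i: mean(values) - overall for i, values in by_interval.items()}
    return overall, tod


def predict(overall, tod, i):
    """Predicted S(i,l) = mean + mean_tod(i); unseen intervals fall back to the mean."""
    return overall + tod.get(i, 0.0)


def residual(overall, tod, i, value):
    """e(i,l): what is left after the effect of the model is removed."""
    return value - predict(overall, tod, i)
```

The day-of-week and month terms of the third and fourth models could be estimated in the same way, each from its own grouping of the data.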
It turns out that the S(i,j,k,l) model provides the best accuracy. This raises the question: Why not use this model and ignore the others? The answer lies in the fact that the data is non-stationary. Using the techniques employed in the above-referenced Hellerstein, Zhang, and Shahabuddin article, obtaining estimates of mean_tod(i) requires at least one measurement of the ith time of day value. Similarly, at least one week of data is required to estimate mean_day-of-week(j) and several months of data are required to estimate mean_month(k).
Under these circumstances, a reasonable approach is to use different models depending on the data available. For example, we could use model (1) above when less than a day of history is present, model (2) when more than a day and less than a week is present, and so on.
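The selection rule just described can be sketched as a simple threshold test on the amount of history. The thresholds below are assumptions: the text specifies only "a day," "a week," and "several months," so the 90 day figure is a hypothetical placeholder, as is the choice to treat exactly one day of history as sufficient for model (2).

```python
# Assumed history thresholds, in days. The 90 day value stands in for
# "several months" and is a hypothetical placeholder.
DAYS_FOR_TOD = 1
DAYS_FOR_DAY_OF_WEEK = 7
DAYS_FOR_MONTH = 90


def select_model(history_days):
    """Return the number (1 to 4) of the richest model the history supports."""
    if history_days < DAYS_FOR_TOD:
        return 1  # mean only
    if history_days < DAYS_FOR_DAY_OF_WEEK:
        return 2  # mean + time of day
    if history_days < DAYS_FOR_MONTH:
        return 3  # mean + time of day + day of week
    return 4      # mean + time of day + day of week + month
```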
Actually, the requirements are a bit more complex still. A further issue arises in that we need to detect when the characteristics of the data have changed so that a new model is needed. This is referred to as change-point detection, see, e.g., Basseville and Nikiforov, “Detection of Abrupt Changes,” Prentice Hall, 1993, the disclosure of which is incorporated by reference herein. Change-point detection tests for identically distributed observations (i.e., stationarity) under the assumption of independence. However, it turns out that the residuals of the above model are not independent (although they are identically distributed under the assumption of stationarity and the model being correct). Thus, still another layer of modeling is required. In the above-referenced Hellerstein, Zhang, and Shahabuddin article, a second order autoregressive model is used. That is, e(t)=a1*e(t−1)+a2*e(t−2)+y(t), where a1 and a2 are constants estimated from the data and y(t) is the remaining noise term.
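By way of example only, the constants a1 and a2 of the second order autoregressive model could be estimated by ordinary least squares, as in the following sketch. Least squares is an assumption made for illustration; the text does not state which estimator the referenced article uses. The function names fit_ar2 and innovations are likewise hypothetical.

```python
def fit_ar2(e):
    """Least-squares estimate of (a1, a2) in e(t) = a1*e(t-1) + a2*e(t-2) + y(t).

    e: list of residuals ordered by time. Solves the 2x2 normal equations.
    """
    if len(e) < 4:
        raise ValueError("need at least four residuals")
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(e)):
        x1, x2, target = e[t - 1], e[t - 2], e[t]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += target * x1
        r2 += target * x2
    det = s11 * s22 - s12 * s12
    if det == 0.0:
        raise ValueError("lagged residuals are collinear; cannot estimate")
    a1 = (r1 * s22 - r2 * s12) / det
    a2 = (r2 * s11 - r1 * s12) / det
    return a1, a2


def innovations(e, a1, a2):
    """Return y(t) for t >= 2, i.e., the residuals with the AR(2) part removed."""
    return [e[t] - a1 * e[t - 1] - a2 * e[t - 2] for t in range(2, len(e))]
```

The values returned by innovations are the quantities to which a change-point test that assumes independence would then be applied.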
So the question arises: What happens after a change-point is detected? There are two possibilities. The first is to continue using the old model even though it is known not to accurately reflect the process. A second approach is to re-estimate process parameters. That is, data that had been used previously to estimate parameter values must be flushed and new data must be collected. During this period, no prediction is possible. In general, some prediction is required during this transition period. Thus, it may be that a default model is used until sufficient data is collected.
The foregoing motivates the requirements that the present invention envisions for providing adaptive prediction. First, it must be possible to add new modeling components (e.g., include time-of-day in addition to the process mean) when sufficient data is available to estimate these components and it is determined that adding the components improves modeling accuracy. Second, we must be able to remove modeling components selectively as non-stationarities are discovered. For example, it may be that the day-of-week effect changes in a way that does not impact time-of-day. Thus, we need to re-estimate the mean_day-of-week(j) but we can continue using the mean_tod(i).
Existing art includes: the use of multiple models, e.g., U.S. Pat. No. 5,862,507 issued to Wu et al.; multiple models, e.g., P. Eide and P. Maybeck, “MMAE Failure Detection System for the F-16,” IEEE Transactions on Aerospace Electronic Systems, vol. 32, no. 3, 1996; adaptive models, e.g., V. Kadirkamanathan and S. G. Fabri, “Stochastic Method for Neural-adaptive Control of Multi-modal Nonlinear Systems,” Conference on Control, p. 49–53, 1998; and the use of multiple modules that adaptively select data, e.g., Rajesh Rao, “Dynamic Appearance-based Recognition,” IEEE Computer Society Conference on Computer Vision, p. 540–546, 1997, the disclosures of which are incorporated by reference herein. However, none of these addresses the issue of “dynamic management of multiple on-line models” in that this art does not consider either: (a) when to exclude a model; or (b) when to include a model.
There is a further consideration as well. This relates to the manner in which measurement data is managed. On-line models must (in some way) separate measurement data into “training data” and “test data.” Training data provides a means to estimate model parameters, such as mean, mean_tod(i), mean_day-of-week(j), mean_month(k). Test data provides a means to check for change points. In the existing art, a single repository (often in-memory) is used to accumulate data for all sub-models. Data in this repository is partitioned into training and test data. Once sufficient data has been accumulated to estimate parameter values for all sub-models and sufficient training data is present to test for independent and identically distributed residuals, then the validity of the complete model is checked. A central observation is that a dynamic management of multiple models requires having separate training data for each model. Without this structure, it is very difficult to selectively include and exclude individual models. However, this structure is not present in the existing art.
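To make the contrast with a single shared repository concrete, the following sketch keeps separate training data for each sub-model, so that one sub-model can be flushed and re-estimated without disturbing the others. It is an illustration under stated assumptions and not a description of any particular embodiment: the class names ComponentStore and ModelManager, the per-component sample thresholds, and the rule that a component is included once its threshold is met are all hypothetical.

```python
class ComponentStore:
    """Training data held for one sub-model (e.g., mean, time of day)."""

    def __init__(self, name, min_samples):
        self.name = name
        self.min_samples = min_samples  # assumed readiness threshold
        self.training = []

    def add(self, value):
        self.training.append(value)

    def ready(self):
        """True once enough data is present to estimate this component."""
        return len(self.training) >= self.min_samples

    def flush(self):
        """Discard this component's data only, e.g., after a change point."""
        self.training.clear()


class ModelManager:
    """Tracks which sub-models can currently be included."""

    def __init__(self, requirements):
        # requirements: {component name: minimum number of samples}
        self.stores = {n: ComponentStore(n, m) for n, m in requirements.items()}

    def observe(self, value):
        for store in self.stores.values():
            store.add(value)

    def active_components(self):
        return [n for n, s in self.stores.items() if s.ready()]

    def exclude(self, name):
        self.stores[name].flush()
```

With this arrangement, excluding the time-of-day component after a detected change leaves the data for the mean untouched, so a prediction based on the mean alone remains available while the time-of-day data is collected again.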