In many applications, data of interest comprises multiple sequences that each evolve over time. Examples include currency exchange rates, network traffic data from different network elements, demographic data from multiple jurisdictions, patient data varying over time, and so on.
These sequences are not independent; in fact, they frequently exhibit high correlation. Much useful information is therefore lost if each sequence is analyzed individually. It is desirable to be able to analyze the entire set of sequences as a whole, where the number of sequences in the set can be very large. For example, if each sequence represents data recorded from a network element in some large network, then the number of sequences could easily be in the thousands, or even millions.
It is typically the case that the results of an analysis are most useful immediately, based upon the portion of each sequence seen so far, without waiting for "completion". In fact, these sequences can be extremely long, and may have no predictable termination in the future. What is required is to be able to "repeat" the analysis as the next element (or batch of elements) in each data sequence is revealed. This must be done on potentially very long sequences, indicating a need for analytical techniques that have low incremental computational complexity.
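As an illustration of low incremental computational complexity, the following minimal sketch maintains a least-squares line between two sequences from running sums, so that each new time-tick costs a constant amount of work regardless of how many ticks came before. The class name RunningLinearFit and the choice of a single-predictor linear model are assumptions made for illustration only; they are not taken from the methodology described here.

```python
class RunningLinearFit:
    """Least-squares line y ~ intercept + slope * x kept from running sums.

    Hypothetical helper for illustration; only five sums are stored, so
    memory and per-tick work do not grow with the sequence length.
    """

    def __init__(self):
        self.n = 0
        self.sx = 0.0
        self.sy = 0.0
        self.sxx = 0.0
        self.sxy = 0.0

    def update(self, x, y):
        # Constant work per new observation, independent of history length.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        # Closed-form simple linear regression from the accumulated sums.
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope

    def predict(self, x):
        intercept, slope = self.coefficients()
        return intercept + slope * x
```

Calling update once per time-tick and predict whenever an estimate is needed "repeats" the analysis without revisiting earlier elements of the sequences.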
TABLE 1
______________________________________________________________________________
sequence   s.sub.1        s.sub.2        s.sub.3             s.sub.4
time       packets-sent   packets-lost   packets-corrupted   packets-repeated
______________________________________________________________________________
1          50             20             10                  3
2          55             20             10                  10
. . .      . . .          . . .          . . .               . . .
N - 1      73             25             18                  12
N          ??             25             18                  18
______________________________________________________________________________
Table 1 above illustrates a snapshot of a set of co-evolving sequences. It shows k=4 time sequences, and the value of each time sequence at every time-tick (e.g., every minute) is desired. Suppose that one of the time sequences, e.g., s.sub.1, always arrives slightly late, so that its most recent value is unknown, as designated by "??". The desired analysis is to make the best prediction of the last "current" value of this sequence, given all the past information about this sequence and all the past and current information for the other sequences. It is desired to do this at every point in time, given all the information up to that time.
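One way to make the prediction task concrete is to treat the delayed sequence as a linear function of the current values of the other sequences and fit the weights by ordinary least squares on the past time-ticks. The sketch below does this with the normal equations. The linear model and the function names solve_linear_system and estimate_delayed_value are assumptions introduced for illustration, and for brevity the sketch ignores the delayed sequence's own past values as predictors.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations update both sides together.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def estimate_delayed_value(past_targets, past_predictors, current_predictors):
    """Fit target ~ w0 + w . predictors on past ticks, then apply to the current tick.

    past_targets:       past values of the delayed sequence, one per tick
    past_predictors:    for each past tick, the values of the other sequences
    current_predictors: values of the other sequences at the current tick
    """
    rows = [[1.0] + list(p) for p in past_predictors]  # leading 1.0 is the intercept
    dim = len(rows[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    xty = [sum(r[i] * y for r, y in zip(rows, past_targets)) for i in range(dim)]
    w = solve_linear_system(xtx, xty)
    return w[0] + sum(wi * v for wi, v in zip(w[1:], current_predictors))
```

Note that if two predictor sequences are perfectly correlated over the fitted ticks, the normal equations become singular and this sketch raises an error; one of the two would have to be dropped before fitting.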
More generally, given a missing or delayed value in some sequence, it is desirable to be able to estimate it as accurately as possible using all other information available from this and other related sequences. The same analysis can also be used to find "unexpected values", i.e., cases in which the actual observation differs greatly from its estimate computed as above. Such an "outlier" may be indicative of an interesting event in the specific time series affected.
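A simple way to decide when an observation "differs greatly" from its estimate is to compare the current estimation error against the spread of past estimation errors. The sketch below flags a value when its error lies more than a chosen number of standard deviations from the mean past error. The function name is_outlier and the default threshold of two standard deviations are assumptions made for illustration, not values given in this description.

```python
import math


def is_outlier(actual, estimate, past_errors, num_std=2.0):
    """Flag an observation whose estimation error is large relative to past errors.

    past_errors holds earlier (actual - estimate) differences for the same
    sequence; num_std is an assumed, tunable threshold.
    """
    if len(past_errors) < 2:
        raise ValueError("need at least two past errors to measure spread")
    mean = sum(past_errors) / len(past_errors)
    # Sample variance of the past errors.
    var = sum((e - mean) ** 2 for e in past_errors) / (len(past_errors) - 1)
    std = math.sqrt(var)
    return abs((actual - estimate) - mean) > num_std * std
```

For example, with past errors of 1, -1, 2, -2 and 0, an observation of 60 against an estimate of 50 would be flagged, whereas an observation of 51 would not.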
A closely associated problem is the derivation of quantitative correlations between sequences, e.g., that "packets-lost" is perfectly correlated with "packets-corrupted", or that "packets-repeated" lags "packets-corrupted" by 1 time-tick.
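Correlations of this kind, including lags, can be quantified with the Pearson correlation coefficient computed between one sequence and a time-shifted copy of another. The following is a minimal sketch under that assumption; the function names pearson, lagged_correlation and best_lag are hypothetical, and the sample values in the usage note are illustrative only, not measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def lagged_correlation(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag] for lag >= 0."""
    if len(leader) != len(follower):
        raise ValueError("sequences must have equal length")
    if lag < 0 or lag > len(leader) - 2:
        raise ValueError("lag leaves fewer than two overlapping points")
    return pearson(leader[:len(leader) - lag], follower[lag:])


def best_lag(leader, follower, max_lag):
    """Lag in 0..max_lag at which follower correlates most strongly with leader."""
    return max(range(max_lag + 1),
               key=lambda lag: lagged_correlation(leader, follower, lag))
```

With a hypothetical packets-corrupted sequence of 10, 10, 12, 15, 11, 18, 14 and a packets-repeated sequence of 3, 10, 10, 12, 15, 11, 18, the correlation at a lag of 1 time-tick is perfect, and best_lag with a maximum lag of 3 selects that lag.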
Methodologies are known that analyze single time sequences. One example is the "Box-Jenkins" methodology, also referred to as the "Auto-Regressive Integrated Moving Average" (ARIMA) methodology, disclosed in, for example, George Box et al., "Time Series Analysis: Forecasting and Control", Prentice Hall, Englewood Cliffs, N.J., 1994, 3rd Edition. However, the Box-Jenkins methodology focuses on a single time sequence rather than multiple co-evolving time sequences.
Based on the foregoing, there is a need for a method and apparatus that can analyze co-evolving sequences to solve the above-described problems. The analysis should be able to adapt to changing correlations, operate on-line, make predictions in time that is independent of the number N of past time-ticks, and scale up well with the number of time sequences k.