Modern computing technology makes it possible to gather and process large quantities of data in a variety of fields such as finance, commerce, and operations. In some cases, efficient and quick analysis of such high-speed data streams is very valuable because it allows a change in trends or conditions to be detected as early as possible. Click-through stream mining in e-commerce, where the goal of the application is to predict shopping behavior or the effect of advertising, is one notable example. Additional examples of high-speed data streams include computerized production environment monitoring applications whose goal is failure detection, traffic monitoring applications that give driving recommendations or on-line alerts, and power grid applications that detect changes in load profiles and forecasts. In all those scenarios analysis is best done on-line, at the speed at which the data arrives, since a delay in analysis often translates into a delayed response, which can be costly.
In almost every one of these scenarios, the data streams are affected in one way or another by human behavior, which itself changes in response to the physical world (time of day or season), fashion, fads, psychological reasons, action by trendsetters, current events, or the economy. Any data stream analysis algorithm must therefore take into account and respond to the non-stationary nature of the data distribution.
Furthermore, in many application domains, the change in the underlying distribution of the data is the most interesting event of all. In e-commerce, it can be the result of a change in the competitive scenario. In computerized environment monitoring, it can signal the spread of a new type of failure—such as a new computer virus. Lastly, in stock trading it may signal the move from a bull to a bear market or vice versa. Changes in the mechanism which generates the data are denoted concept drifts. They are especially important because they evoke a need for new responses, different from those dictated by models which were learned before the change occurred.
Most data stream mining algorithms acknowledge the need to handle concept drifts. Two approaches are prevalent: one is to discard old observations; the other is to relearn the model, or parts of the model, when a concept drift becomes evident. However, most data stream mining algorithms rely on a decline in the performance of the model as the indication of a concept drift. This method, while sometimes effective, has no statistical backing and can therefore be expected to yield inferior results compared to statistically based change point detection algorithms.
From a statistical point of view, the change point detection problem can be solved optimally by computing the prefix of the current sequence of samples which maximizes the probability that the suffix was sampled from a different distribution. This can be done subject to a set of assumptions on the distribution of the samples (e.g., that it is Normal) and of changes (e.g., that their arrival rate is Poissonian). This approach is, however, impractical for a large number of samples. The state of the art in statistical change point detection on data streams is therefore to use the Page-Hinkley test (PHT), whose run-time is linear in the number of samples. In a streaming setup that would mean maintaining a test statistic of constant size and performing O(1) updates to it per new sample. Naturally, run-time performance like this can only be achieved at a significant cost in terms of false alarm rate, the number of samples needed to detect a change, and the accuracy at which the change point is detected.
The present invention relates to an alternative to PHT which relies on the best practice of solving the more informed problem of testing whether two sets of samples were derived from the same distribution. The algorithms of the invention make use of the unique convergence properties of two-sample tests to probabilistically find the point which maximizes their value. That point closely approximates the change point. As both analysis and experiments show, the probabilistic algorithm of the invention maintains just O(1) candidate change points and their related aggregate information. Therefore, it requires only O(1) update operations per new sample, which is comparable with PHT. However, because the two-sample tests used by the invention are much more powerful than PHT, and because the probabilistic algorithm of the invention does not degrade that power significantly, the algorithm of the invention is far better than PHT both in terms of its false negative to false positive rate and in terms of the accuracy with which it locates the change point. This superiority is further exemplified in a simple application in which the algorithm monitors the mean of a piece-wise stationary data stream with far better accuracy than that achieved using PHT or other previous approaches.
Notations
Let $X_n = \{x_0, x_1, \ldots, x_n\}$ be a prefix of an open-ended stream of samples such that $x_i \in D$. For each point $i$ in the prefix, denote the samples $x_0, \ldots, x_{i-1}$ the head of the prefix and the samples $x_i, \ldots, x_n$ the tail of the prefix. When, for some point in the stream, the head and the tail follow different distributions, that point is denoted $x_c$.
All of the tests described herein measure a test statistic on the stream and indicate a change whenever that statistic exceeds a user-provided constant $\lambda$. The timeliness of a test is the minimal $n$ larger than $c$ at which the test statistic exceeds $\lambda$. The run length of a test is the $n$ at which the test statistic first exceeds $\lambda$ even though no change has occurred (i.e., $n < c$). Since the run length depends on random variations in the data, we usually refer to the average run length (ARL), which is its average over multiple executions. In all of the algorithms discussed herein, the test indicates not only the fact of the change but also the point $x_{max}$ at which it suspects the change occurred. The distance of that point from the actual change point, $|max - c|$, is the test accuracy.
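The three measures defined above can be illustrated with a minimal Python sketch. The function names and the representation of the test statistic as a list indexed by n are assumptions made for illustration only; they are not part of the tests themselves.

```python
def run_length(stats, lam):
    """First n at which the statistic exceeds lam on a stream with no change.

    Returns None if the statistic never exceeds lam.
    """
    return next((n for n, s in enumerate(stats) if s > lam), None)


def timeliness(stats, lam, c):
    """Minimal n larger than the change point c at which the statistic exceeds lam."""
    return next((n for n, s in enumerate(stats) if n > c and s > lam), None)


def accuracy(max_index, c):
    """Distance between the reported change point and the actual one, |max - c|."""
    return abs(max_index - c)


def average_run_length(run_lengths):
    """ARL: the average of the run lengths observed over multiple executions."""
    return sum(run_lengths) / len(run_lengths)
```

Here `stats[n]` is assumed to hold the value of the test statistic after the nth sample, and `lam` plays the role of the threshold λ.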
Let $f$ be a two-sample test statistic. We denote by $f_i(n)$ the same test statistic as applied to the head and the tail of a prefix of size $n$, relative to the $i$th point. Note that because $f_i(n)$ is not independent of either $f_i(n-1)$ or $f_j(n)$ for $j \neq i$, the original statistical meaning of $f$ is lost. The test statistics retain, however, important convergence properties, as discussed further below.
The Page-Hinkley Test (PHT)
The Page-Hinkley test (PHT) is based on the concept of a log-likelihood ratio. The key statistical property of this ratio is that a change in the mean of the data is reflected as a change in the sign of the mean value of the log-likelihood ratio. That is, the ratio exhibits a negative drift before the change, and a positive drift after the change. This difference in behavior is the key to detecting the change.
PHT assumes that the observed samples follow a normal distribution. It also assumes that the true mean μ before change is known. This is usually not the case in real-life data, but it is possible to estimate the mean by averaging the observed samples.
Let $\mu_n$ denote the sample mean of the samples $x_0, x_1, \ldots, x_n$. PHT involves a cumulative variable

$$U_n = \sum_{i=0}^{n} \left( x_i - \mu_n - \frac{\delta}{2} \right),$$

defined as the difference between the observed samples $x_i \in \mathbb{R}$ and their sample mean $\mu_n$ cumulated up to step $n$, where $\delta$ is a minimum change magnitude to be detected, which is selected a priori. The minimum value

$$m_n = \min_{0 \le k \le n} (U_k)$$

of this variable is also computed and updated on-line. The difference between the variable and its minimum value, $U_n - m_n$, is the test statistic that is monitored. When this difference is greater than the given threshold $\lambda$, the test alerts that an increase in the mean has occurred. Increasing $\lambda$ causes fewer false alarms, but might delay or miss altogether the detection of some change points. Given that a change is detected, the estimated change point, $x_{max}$, is the sample at which the minimum value $m_n$ was last obtained.
Since the mean can either decrease or increase, PHT can be executed twice to detect changes in both directions (see Algorithm 1).
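The two-sided test described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class and attribute names are hypothetical, each term of the cumulative sums uses the running mean available when the sample arrives (the usual streaming reading of the sum), and both cumulative variables start at zero before the first sample.

```python
class PageHinkley:
    """Two-sided Page-Hinkley test with constant-size state and O(1) updates."""

    def __init__(self, delta, lam):
        self.delta = delta        # minimum change magnitude to be detected
        self.lam = lam            # alert threshold (lambda)
        self.n = 0                # number of samples seen so far
        self.mean = 0.0           # running sample mean
        self.u = 0.0              # cumulative variable U_n (increase test)
        self.t = 0.0              # cumulative variable T_n (decrease test)
        self.u_min, self.u_min_at = 0.0, 0   # m_n and where it was last obtained
        self.t_max, self.t_max_at = 0.0, 0   # M_n and where it was last obtained

    def update(self, x):
        """Consume one sample.

        Returns None, or (direction, index of the estimated change point).
        """
        idx = self.n
        self.n += 1
        self.mean += (x - self.mean) / self.n          # incremental mean
        self.u += x - self.mean - self.delta / 2
        self.t += x - self.mean + self.delta / 2
        if self.u <= self.u_min:                       # track the latest minimum
            self.u_min, self.u_min_at = self.u, idx
        if self.t >= self.t_max:                       # track the latest maximum
            self.t_max, self.t_max_at = self.t, idx
        if self.u - self.u_min > self.lam:
            return ("increase", self.u_min_at)
        if self.t_max - self.t > self.lam:
            return ("decrease", self.t_max_at)
        return None
```

A caller would feed samples one at a time, for example `detector = PageHinkley(delta=1.0, lam=10.0)` followed by `detector.update(x)` for each new sample, and treat any non-None result as an alert.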
Algorithm 1: Page-Hinkley Test (PHT)

Detection of an increase in the mean:
Define $U_n = \sum_{i=0}^{n} \left( x_i - \mu_n - \frac{\delta}{2} \right)$, $U_0 = 0$
Define $m_n = \min_{0 \le k \le n} (U_k)$
Alert when $U_n - m_n > \lambda$

Detection of a decrease in the mean:
Define $T_n = \sum_{i=0}^{n} \left( x_i - \mu_n + \frac{\delta}{2} \right)$, $T_0 = 0$
Define $M_n = \max_{0 \le k \le n} (T_k)$
Alert when $M_n - T_n > \lambda$

The χ² Two-Sample Test
The $\chi^2$ two-sample test is a standard statistical tool for comparing two samples over the same categorical domain $\mathbb{C}$. For two samples, one of size $S$ with $S_i$ samples in every category $C_i \in \mathbb{C}$, and the other of size $R$ with $R_i$ samples in every category $C_i \in \mathbb{C}$, the $\chi^2$ test requires that a simple statistic, Eq. 1, be computed.

$$\chi^2 = \sum_{j=1}^{|\mathbb{C}|} \frac{\left( \sqrt{S/R}\, R_j - \sqrt{R/S}\, S_j \right)^2}{R_j + S_j}. \qquad (1)$$
The predominant characteristic of the $\chi^2$ test is that if the two samples are derived from the same (unknown) distribution, the statistic, itself a random variable, follows a known distribution: the $\chi^2$ distribution with $|\mathbb{C}| - 1$ degrees of freedom. If, on the other hand, the two samples come from distributions in which the means of some categories differ, then the statistic tends to grow as the two samples grow.
When applied to the head and the tail of the prefix of a stream, as denoted above, the $\chi^2$ test statistic, $\chi_i^2$, can be rewritten according to Eq. 1 as:

$$\chi_i^2(n) = \sum_{j=1}^{|\mathbb{C}|} \frac{\left( \sqrt{i/(n-i)}\, R_j - \sqrt{(n-i)/i}\, S_j \right)^2}{R_j + S_j}. \qquad (2)$$
To simplify the explanation, we consider below the simple case in which there are only two categories. The method of the invention generalizes directly to the $\chi^2$ test with more than two categories, and can be so applied by any person skilled in the art.
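As an illustration of Eq. 1 and Eq. 2, the following Python sketch computes the two-sample statistic from per-category counts and applies it to the head and the tail of a prefix. The function names are hypothetical, and the prefix is assumed to be held in a list of n categorical samples so that the head has size i and the tail has size n − i. The sketch recomputes all counts from scratch and is therefore not an O(1) streaming update.

```python
from collections import Counter
from math import sqrt


def chi2_two_sample(s_counts, r_counts):
    """Eq. 1: chi-square statistic of two samples given their category counts."""
    s_total = sum(s_counts.values())   # S, the size of the first sample
    r_total = sum(r_counts.values())   # R, the size of the second sample
    if s_total == 0 or r_total == 0:
        raise ValueError("both samples must be non-empty")
    stat = 0.0
    for category in set(s_counts) | set(r_counts):
        s_j = s_counts.get(category, 0)
        r_j = r_counts.get(category, 0)
        if s_j + r_j == 0:
            continue                   # empty category contributes nothing
        diff = sqrt(s_total / r_total) * r_j - sqrt(r_total / s_total) * s_j
        stat += diff ** 2 / (r_j + s_j)
    return stat


def chi2_split(prefix, i):
    """Eq. 2: statistic for the head prefix[:i] versus the tail prefix[i:]."""
    return chi2_two_sample(Counter(prefix[:i]), Counter(prefix[i:]))
```

For a prefix whose head and tail have the same category proportions the statistic is zero, and it grows as the proportions in the two parts diverge.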
The Student's Two-Sample t-Test
Like the two-sample $\chi^2$ test, Student's two-sample t-test determines whether the mean has changed between two samples. However, Student's t-test applies to real-valued samples rather than categorical ones. Let $n_S$, $\hat{X}_S$, and $\nu_S$ be the number of samples, the sample mean, and the unbiased estimator of the variance of one sample, and let $n_R$, $\hat{X}_R$, and $\nu_R$ be the same aggregates for the other sample, respectively. The Student's t-test statistic is:

$$T = \frac{\hat{X}_S - \hat{X}_R}{\sqrt{\dfrac{\nu_S}{n_S} + \dfrac{\nu_R}{n_R}}}. \qquad (3)$$

When the test is applied to the head and the tail of a prefix of a stream, $T_i$ can be written as:

$$T_i(n) = \frac{\hat{X}_S - \hat{X}_R}{\sqrt{\dfrac{\nu_S}{i} + \dfrac{\nu_R}{n-i}}}. \qquad (4)$$
The aggregates $i$, $\hat{X}_S$, and $\nu_S$ require no update when a new sample is taken. The aggregates $n$, $\hat{X}_R$, and $\nu_R$ can be updated incrementally by using the aggregates $\mathrm{sumR}_n$ and $\mathrm{sumR}^2_n$. The sample mean is

$$\hat{X}_R = \frac{1}{n-i-1} \sum_{j=i}^{n} x_j = \frac{\mathrm{sumR}_n}{n-i-1},$$

where $\mathrm{sumR}_n = \mathrm{sumR}_{n-1} + x_n$. The unbiased estimator of the variance is

$$\nu_R = \frac{1}{n-i-1} \sum_{j=i}^{n} x_j^2 - \frac{n-i}{n-i-1} \left( \hat{X}_R \right)^2 = \frac{\mathrm{sumR}^2_n}{n-i-1} - \frac{n-i}{n-i-1} \left( \hat{X}_R \right)^2,$$

where $\mathrm{sumR}^2_n = \mathrm{sumR}^2_{n-1} + x_n^2$.
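The incremental bookkeeping described above can be sketched in Python as follows. The class and function names are hypothetical. The sketch uses the textbook definitions in terms of the number of samples m held by each part (mean = sum / m, unbiased variance = sum of squares / (m − 1) − m / (m − 1) · mean²); how m maps onto the indices n and i is an assumption of this illustration.

```python
from math import sqrt


class Aggregates:
    """Running count, sum, and sum of squares of one part of the prefix."""

    def __init__(self):
        self.count = 0        # number of samples in this part
        self.total = 0.0      # running sum of the samples
        self.total_sq = 0.0   # running sum of the squared samples

    def add(self, x):
        """O(1) update when a new sample joins this part."""
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.count

    def variance(self):
        """Unbiased variance estimator; requires at least two samples."""
        m = self.count
        return self.total_sq / (m - 1) - m / (m - 1) * self.mean() ** 2


def t_statistic(head, tail):
    """Two-sample t statistic of the head versus the tail (Eq. 3 and Eq. 4)."""
    spread = head.variance() / head.count + tail.variance() / tail.count
    return (head.mean() - tail.mean()) / sqrt(spread)
```

For a fixed candidate point only the tail aggregates change as new samples arrive, so each update costs O(1). The sketch does not guard against a zero variance in both parts, which would make the denominator zero.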
The test is considered valid when each sample is indeed random, the samples are independent, and the samples follow a normal distribution with an unknown mean.
The predominant characteristic of Student's t-test is that if both samples are derived from the same unknown distribution, then the test statistic has a known distribution: Student's t distribution with the degrees of freedom calculated using

$$\frac{\left( \nu_S/n_S + \nu_R/n_R \right)^2}{\left( \nu_S/n_S \right)^2/(n_S - 1) + \left( \nu_R/n_R \right)^2/(n_R - 1)}.$$

If, on the other hand, the two samples come from distributions in which the mean is different, then the value computed by the test statistic tends to grow with every increase in sample sizes.

Confidence Intervals on the Mean
Let $R$ be a sample of size $n$ which follows the binomial distribution $\mathrm{Bin}(n, p)$. If $\hat{p}$ is the sample mean of $R$, then the normal approximation interval estimates that, with probability greater than $1 - \alpha$, the value of $p$ is in the range

$$\hat{p} \pm Z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}. \qquad (5)$$

Here, $Z_{1-\alpha/2}$ denotes the $1 - \alpha/2$ percentile of a standard normal distribution $N(0, 1)$.
If $R$ follows the normal distribution $N(\mu, \sigma^2)$, and $\hat{p}$ and $sd$ are the unbiased estimators of the mean and the standard deviation of $R$, the approximation interval estimates that, with probability greater than $1 - \alpha$, the value of the actual mean $\mu$ is in the range:

$$\hat{p} \pm t^*_{1-\alpha/2} \frac{sd}{\sqrt{n}}. \qquad (6)$$

Here, $t^*_{1-\alpha/2}$ denotes the $1 - \alpha/2$ percentile of Student's t distribution.
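The two intervals can be computed with a short Python sketch. The function names are hypothetical. The normal percentile is taken from the standard library, while the Student's t percentile is assumed to be supplied by the caller (for example from a statistical table), since the Python standard library does not provide it.

```python
from math import sqrt
from statistics import NormalDist


def binomial_mean_interval(p_hat, n, alpha):
    """Eq. 5: normal approximation interval for p of a binomial sample."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{1 - alpha/2} of N(0, 1)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def normal_mean_interval(mean_hat, sd, n, t_quantile):
    """Eq. 6: interval for the mean of a normal sample.

    t_quantile is the 1 - alpha/2 percentile of Student's t distribution,
    provided by the caller.
    """
    half_width = t_quantile * sd / sqrt(n)
    return mean_hat - half_width, mean_hat + half_width
```

Both functions return the lower and upper ends of the interval as a pair, so a caller can check whether a candidate value of the mean falls inside it.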