Sensor Data
Up to now, detailed real-time monitoring of equipment and/or an environment with sensors has only been economically feasible for large, expensive, safety- and mission-critical installations. However, the rapid progress of computer technology, and more specifically, the advent of low-cost sensor networks, cheap wireless communications, and powerful embedded processors, has made it possible to implement equipment condition monitoring (ECM) technology for much cheaper equipment, such as electric motors, turbines, power switchgear, and HVAC equipment, and for an ever-expanding range of industrial processes, such as oil refining, food processing, and product manufacturing, as well as for large-scale environments.
The resulting increase in the amount of sensor data constantly streaming from sensor networks would quickly overwhelm any human supervisor tasked with monitoring the data. The only viable solution to the problem of processing the sensor data quickly and accurately is to develop automated change detection (ACD) methods. Although such automated methods are not likely to reach the competence and versatility of a well-trained human supervisor, they can still be very effective and accurate when designed to look for specific events in the sensor data stream.
One of the most important among these events is an abrupt change in the sensor data. Detecting such abrupt changes is not a trivial problem, because all but the simplest data streams vary, even when no change in the process that generates the data has occurred. This variation might be caused either by the natural variability of the process, e.g., when the data come from a dynamical system, or by noise that is due to measurement errors, hidden variables, and the like. In such cases, the detection of abrupt changes is done in a statistical sense, i.e., the problem reduces to detecting a difference between the probability distributions from which the data are sampled before and after the change. In manufacturing applications, this task is often called statistical process control (SPC). In SPC, the objective is to detect a departure from the in-control distribution of the data to some other, out-of-control distribution.
CUSUM
When the in-control and out-of-control distributions have known parametric forms and the respective parameters are known, it has been shown that the cumulative sum (CUSUM) procedure is optimal; see Page, “Continuous Inspection Schemes,” Biometrika 41, pp. 100-114, 1954, and Basseville et al., “Detection of Abrupt Changes: Theory and Application,” Englewood Cliffs, N.J.: Prentice Hall, 1993.
However, explicit modeling of the in-control and all possible out-of-control distributions is typically a laborious and expensive process, and might even be intractable. Therefore, it is desired to provide a method that can detect any changes only by inspecting the sensor data streams and reasoning about their probability distributions.
Abrupt Change Detection
At a current time t, the most recent d-dimensional data vector from a sensor sample stream is x_t. The problem of abrupt change detection is to determine whether a change in the distribution of the data has occurred at or before the current time t. An important assumption for this problem is that the change is permanent, i.e., after the change has occurred, all subsequent readings come from a new distribution. This is the typical situation for industrial equipment when the change is destructive, e.g., when the equipment fails.
All sensor samples before the change are assumed to be independent and identically-distributed (i.i.d.) random variables sampled from a distribution p0(x). Similarly, all samples after the change are assumed to be i.i.d. variables sampled from a distribution p1(x).
For cases when the distributions p0(x) and p1(x) are known, Page describes the CUSUM procedure, which accumulates the log-likelihood ratio of the samples with respect to the two distributions and makes a decision based on the auxiliary variable

$$g_t = S_t - m_t, \qquad m_t = \min_{1 \le j \le t} S_j, \qquad S_t = \sum_{i=1}^{t} s_i, \qquad s_i = \log \frac{p_1(x_i)}{p_0(x_i)}.$$
A change is declared to have occurred if g_t > h for a predetermined threshold h. This decision can be shown to be optimal with respect to maximizing the detection probability, for a given false-positive rate.
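The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not Page's published procedure verbatim: it assumes one-dimensional samples, two fully known Gaussian distributions, and an arbitrarily chosen threshold. The names `gaussian_logpdf` and `cusum_detect` are hypothetical helpers introduced only for this example.

```python
import math
import random

def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian (assumed model for this sketch)."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)

def cusum_detect(samples, logpdf0, logpdf1, h):
    """Return the first 0-based index t at which g_t = S_t - m_t exceeds h, else None."""
    s_cum = 0.0        # S_t: running sum of log-likelihood ratios s_i
    s_min = math.inf   # m_t: minimum of S_j over all j up to and including t
    for t, x in enumerate(samples):
        s_cum += logpdf1(x) - logpdf0(x)   # s_i = log p1(x_i) - log p0(x_i)
        s_min = min(s_min, s_cum)
        if s_cum - s_min > h:
            return t
    return None

# Synthetic stream: 200 in-control samples from N(0, 1), then 100 from N(2, 1).
random.seed(0)
stream = [random.gauss(0.0, 1.0) for _ in range(200)]
stream += [random.gauss(2.0, 1.0) for _ in range(100)]

alarm = cusum_detect(
    stream,
    lambda x: gaussian_logpdf(x, 0.0, 1.0),   # p0, assumed known
    lambda x: gaussian_logpdf(x, 2.0, 1.0),   # p1, assumed known
    h=10.0,                                   # illustrative threshold
)
print("alarm raised at index:", alarm)
```

Only two running scalars are kept per stream, so the per-sample cost is constant. Note that both density functions have to be supplied by the caller.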
However, the CUSUM method has the significant disadvantage that both distributions p0(x) and p1(x) must be known beforehand. Specifying an accurate probability distribution p0(x) for the normal operation of industrial equipment or normal conditions in an environment is typically hard and laborious even for the very engineers who designed the equipment. Specifying all possible distributions p1(x) for the out-of-control case might be outright impossible. Furthermore, the correct parametric forms for these distributions might not be available.
These limitations of CUSUM have spurred extensive research on alternative methods for ACD that are more data driven and do not rely on pre-specified distributions. An important direction of research is to use nonparametric statistics, such as rank statistics, Brodsky et al., “Nonparametric Methods in Change-Point Problems,” Kluwer, 1991.
Machine Learning
Another line of research has focused on ACD methods based on machine learning. With machine learning, the fundamental idea is to ‘learn’ (fit) two probability distributions from the samples before and after a hypothesized change point, and then to test for differences between the two distributions, often using information-theoretic distance measures such as the Kullback-Leibler divergence and the Rényi divergence, Guha et al., “Streaming and sublinear approximation of entropy and information distances,” Proceedings of SODA'06, pp. 733-742, ACM Press, 2006. However, there are a number of problems with such methods.
The first problem is to learn the two distributions from the samples. When the two distributions are known to be Gaussian, the sample means and variances for the two sub-windows can be determined, and the two distributions can be compared using Student's t-statistic, Gosset, “The probable error of a mean,” Biometrika, 1908.
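For the Gaussian case, the comparison of two sub-windows can be sketched as follows. The example assumes one-dimensional samples and equal variances in the two sub-windows, which is the setting of the classical pooled two-sample t-statistic. The function name `two_sample_t` is a hypothetical helper, and the sample values are made up for illustration.

```python
import math
import statistics

def two_sample_t(before, after):
    """Pooled-variance two-sample t-statistic (assumes equal variances)."""
    n0, n1 = len(before), len(after)
    mean0, mean1 = statistics.mean(before), statistics.mean(after)
    var0 = statistics.variance(before)   # unbiased sample variance (divides by n - 1)
    var1 = statistics.variance(after)
    pooled = ((n0 - 1) * var0 + (n1 - 1) * var1) / (n0 + n1 - 2)
    return (mean0 - mean1) / math.sqrt(pooled * (1.0 / n0 + 1.0 / n1))

# Two small sub-windows on either side of a hypothesized change point.
before = [1.0, 2.0, 3.0]
after = [2.0, 3.0, 4.0]
print(two_sample_t(before, after))
```

A large absolute value of the statistic, compared against the t-distribution with n0 + n1 - 2 degrees of freedom, suggests that the two sub-window means differ.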
The much more important case is when the two probability density functions (pdfs) of the distributions are not Gaussian, for example, when the distributions are multi-modal because the system switches between several distinct modes. Gaussian mixture models, which are otherwise an excellent choice for modeling multi-modal distributions, are fairly poor solutions for this particular problem because they are parametric and their fitting requires multiple iterative adjustments of the respective parameters, Hastie et al., “The Elements of Statistical Learning,” Springer, 2001. This is prohibitively time consuming when many possible change points must be considered. Thus, such methods are not suitable for time-critical applications.
A much better alternative is to use memory-based methods, such as Parzen's kernel density estimate, Parzen, “On estimation of a probability density function and mode,” Ann. Math. Stat. 33, pp. 1065-1076, 1962, also known as the Nadaraya-Watson estimate of a probability density function, see Hastie et al. In that method, the probability density p(x) is represented as a normalized sum of kernel values:
$$p(x) = \frac{1}{n} \sum_{i=1}^{n} w(x - x_i), \qquad (1)$$

where w is a suitably selected kernel function, and x_i, i = 1, ..., n, are samples drawn from the distribution to be modeled. Popular choices for the kernel are Gaussian and tri-cubic distributions.
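Equation (1) translates almost directly into code. The sketch below assumes one-dimensional samples and a Gaussian kernel with a hand-picked bandwidth; the bandwidth parameter is an assumption of this example and is not specified in the text. The names `gaussian_kernel` and `parzen_density` are hypothetical.

```python
import math

def gaussian_kernel(u, bandwidth):
    """Gaussian kernel w(u) with the given bandwidth (integrates to 1)."""
    z = u / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2.0 * math.pi))

def parzen_density(x, samples, bandwidth=1.0):
    """Kernel density estimate p(x) = (1/n) * sum_i w(x - x_i), as in equation (1)."""
    return sum(gaussian_kernel(x - xi, bandwidth) for xi in samples) / len(samples)

# A bimodal sample set: the estimate is higher near the modes than between them.
samples = [-2.1, -1.9, -2.0, 2.0, 1.8, 2.2]
for query in (-2.0, 0.0, 2.0):
    print(query, parzen_density(query, samples, bandwidth=0.5))
```

No fitting step is needed: the samples themselves are the model, which is why the method is called memory-based. The price is that each density evaluation costs one kernel evaluation per stored sample.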
The second problem is comparing the two distributions after they have been fit from the samples. Some popular methods employ information-theoretic distance measures, such as the well known Kullback-Leibler (KL) divergence:
$$D_{KL}(p_0 \,\|\, p_1) = \int \int \cdots \int_{x \in \Omega} p_0(x) \log \frac{p_0(x)}{p_1(x)} \, dx.$$
The main difficulty when using the KL divergence is the need to integrate over the entire domain Ω of the samples x. This can be time consuming, even in one-dimensional domains, and might be impossible in multivariate cases. Using other popular information-theoretic distance measures, such as the Rényi divergence, the Jensen-Shannon distance, the Bregman divergence, and the Hellinger-Matsushita-Bhattacharyya distance, leads to similar integration-related difficulties.
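The integration cost can be made concrete with a one-dimensional sketch. The example below approximates the KL integral with a midpoint rule on a uniform grid; the integration limits, the number of grid cells, and the two Gaussian test densities are all assumptions chosen for illustration. The names `kl_divergence_1d` and `gaussian_pdf` are hypothetical.

```python
import math

def kl_divergence_1d(p0, p1, lo, hi, n_cells=2000, floor=1e-300):
    """Midpoint-rule approximation of the integral of p0(x) * log(p0(x) / p1(x)) on [lo, hi]."""
    dx = (hi - lo) / n_cells
    total = 0.0
    for k in range(n_cells):
        x = lo + (k + 0.5) * dx
        a = p0(x)
        if a <= 0.0:
            continue               # the integrand tends to 0 where p0 vanishes
        b = max(p1(x), floor)      # guard against taking the log of zero
        total += a * math.log(a / b) * dx
    return total

def gaussian_pdf(mean, std):
    """Return a univariate Gaussian density as a callable."""
    def pdf(x):
        z = (x - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return pdf

p0 = gaussian_pdf(0.0, 1.0)
p1 = gaussian_pdf(1.0, 1.0)
kl = kl_divergence_1d(p0, p1, -10.0, 10.0)
print(kl)   # the closed form for these two unit-variance Gaussians is 0.5
```

Even in this simple case, each divergence evaluation requires both densities at 2000 grid points. With the same grid resolution per axis, a d-dimensional domain would need 2000^d evaluations, which is the multivariate difficulty described above.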
As a result, much research has focused on approximate computation of these distances. For example, Guha et al. describe several polynomial-time approximation schemes (PTAS) that can compute approximations to most of the above distances in polynomial time. While valuable from a theoretical point of view, such PTAS methods are not very likely to yield techniques that can be used for monitoring in practical applications.
Another method based on machine learning uses two sub-windows of a buffer of stored samples, and estimates one pdf from each sub-window. If the window size is made large, then each estimate fits the true pdf from which the data were sampled well. However, with a large window, new samples affect the estimated post-change distribution only slowly. This increases the detection time when an actual change in distributions occurs, making it difficult to detect abrupt changes promptly.
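The window-size trade-off can be illustrated with a deliberately simplified sketch. Instead of estimating and comparing two pdfs, the example below compares only the means of the two sub-windows, which is an assumption made to keep the example short; the stream, the threshold, and the name `detection_delay` are likewise hypothetical.

```python
from collections import deque

def detection_delay(stream, change_index, w, threshold):
    """Delay (in samples) until the two sub-window means differ by more than threshold.

    The buffer holds the last 2*w samples: the older w form the reference
    sub-window and the newer w form the test sub-window.
    """
    buf = deque(maxlen=2 * w)
    for t, x in enumerate(stream):
        buf.append(x)
        if len(buf) < 2 * w:
            continue                      # wait until both sub-windows are full
        items = list(buf)
        ref_mean = sum(items[:w]) / w
        test_mean = sum(items[w:]) / w
        if abs(test_mean - ref_mean) > threshold:
            return t - change_index
    return None

# Noise-free step change at index 100, used only to isolate the window effect.
stream = [0.0] * 100 + [1.0] * 100
for w in (10, 20, 40):
    print("window", w, "delay", detection_delay(stream, 100, w, threshold=0.5))
```

In this noise-free setting the delay grows in proportion to the sub-window size, because more post-change samples must enter the test sub-window before its statistic moves far enough from the reference.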