Automated monitoring of the condition of machines and equipment uses methods that process very large streams of sensor data, including many individual readings obtained by sampling various sensors at high rates. The rapidly decreasing costs of data acquisition, communication, and storage technologies have made it economically feasible to accumulate vast amounts of data in the form of multivariate time series, where each component (variable) of the time series can be viewed as a separate dimension of an observation vector that is indicative of a state of the system being monitored.
One of the main uses of such data is to automatically detect anomalous conditions that might signify a fault in the system. Such faults can include loose or broken components, incorrect sequence of operations, unusual operating conditions, etc. In most cases, the immediate discovery of such anomalous conditions is very desirable in order to ensure safety, minimize waste of materials, or perform maintenance to avoid catastrophic failures.
One possible way of discovering anomalies is to specify explicitly the conditions that are considered anomalous, in the form of logical rules that describe when a variable is out of its normal range. For some systems, this approach can be very successful, for example when monitoring processes where some parameters such as temperature, pressure, humidity, etc. are actively regulated, and their normal ranges are known.
When such ranges are not available, the normal operating limits may be obtained by means of a data-driven approach, where data variables are measured under normal conditions, and descriptors of the normal operating ranges are extracted from this data. Examples of such descriptors are logical rules or probability distributions. For example, if x denotes a vector of instantaneous measured variables from the monitored system, and ƒ(x) is a probability density function over the domain of x that describes how likely the value x is under normal operation of the system, then this probability density can be evaluated continuously, and an alarm can be signaled when ƒ(x) is less than a predetermined threshold τ.
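The alarm rule described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `make_alarm`, the single-variable Gaussian model of normal operation, and the numeric values of the mean, standard deviation, and threshold τ are all hypothetical and are not taken from any particular system.

```python
from statistics import NormalDist

def make_alarm(density, tau):
    """Return a function that flags x as anomalous when density(x) < tau."""
    def alarm(x):
        return density(x) < tau
    return alarm

# Hypothetical model of normal operation for one regulated variable.
normal_model = NormalDist(mu=70.0, sigma=2.0)
alarm = make_alarm(normal_model.pdf, tau=1e-3)

print(alarm(71.0))  # reading close to the mean of the model
print(alarm(85.0))  # reading 7.5 standard deviations from the mean
```

Any density estimate can be passed in place of `normal_model.pdf`, so the thresholding step is independent of how ƒ(x) is obtained.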
The question then becomes how to determine a suitable estimate of the probability density function ƒ(x), given a database $X = [x_1, x_2, \ldots, x_N]$ of observed data, where $x_t$ is the observation column vector determined at time t, for $t = 1, \ldots, N$. The vector $x_t$ includes M variables, such that $x_{it}$ is the value of the i-th variable at time t, for $i = 1, \ldots, M$.
There are many methods for estimating a probability density function over a domain from acquired samples of data points in that domain. Parametric methods make an explicit assumption about the type of the distribution, and then estimate the parameters of the distribution. For example, if the function is a Gaussian distribution, the parameters are a mean μ and a covariance matrix S of the distribution. In this case, $\mu = \sum_{t=1}^{N} x_t / N$ and $S = (X - \mu)(X - \mu)^T / (N - 1)$, where μ is subtracted from each column of X and T is the transpose operator.
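As an illustration of the Gaussian estimates above, the following dependency-free sketch computes μ and S for a small data matrix. The function name `fit_gaussian` and the example readings are hypothetical; X is stored as M rows of N readings, so that column t holds the observation vector at time t. In practice a numerical library would normally be used instead of nested lists.

```python
def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix of the columns of X.

    X is a list of M rows, each holding the N readings of one variable.
    """
    M, N = len(X), len(X[0])
    mu = [sum(row) / N for row in X]
    # Subtract the mean of each variable from all of its readings.
    centered = [[value - mu[i] for value in X[i]] for i in range(M)]
    # S = (X - mu)(X - mu)^T / (N - 1)
    S = [[sum(a * b for a, b in zip(centered[i], centered[j])) / (N - 1)
          for j in range(M)]
         for i in range(M)]
    return mu, S

# Hypothetical data: M = 2 variables, N = 4 readings.
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0]]
mu, S = fit_gaussian(X)
print(mu)
print(S)
```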
When the number of variables M is very large, as is typical for many industrial systems, the resulting estimate is likely to be inaccurate and inconvenient to use. It might be inaccurate because the correct probability distribution might be very different from a Gaussian distribution. It is likely to be inconvenient to use because the covariance matrix S, although symmetric, contains on the order of $M^2$ numbers, and when M is very large, for example in the thousands or millions, maintaining S in memory becomes practically unmanageable. Moreover, a full covariance matrix with independent entries cannot be estimated unless the number of readings N is larger than the dimensionality M of the data vector, and at least M+1 of the data points are in general position, i.e., affinely independent.
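The requirement that N exceed M can be seen in a very small case. The sketch below, which uses hypothetical helper names and hand-picked readings, estimates a covariance matrix for M = 2 variables first from N = 2 readings and then from N = 3 readings in general position. With only two readings the estimate is singular (its determinant is zero), so it cannot serve as the covariance of a proper Gaussian density.

```python
def sample_covariance(points):
    """points: list of N observation vectors, each of length M."""
    N, M = len(points), len(points[0])
    mu = [sum(p[i] for p in points) / N for i in range(M)]
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / (N - 1)
             for j in range(M)]
            for i in range(M)]

def det2(S):
    """Determinant of a 2 x 2 matrix."""
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]

# N = 2 readings of M = 2 variables: too few points.
S_two = sample_covariance([[1.0, 2.0], [3.0, 5.0]])
# N = 3 readings that do not lie on a common line.
S_three = sample_covariance([[1.0, 2.0], [3.0, 5.0], [2.0, 0.0]])

print(det2(S_two))
print(det2(S_three))
```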
Whereas other estimation models and methods, such as mixtures of Gaussian distributions, can be used to overcome the accuracy problem of a single multivariate Gaussian distribution, those methods still suffer from the problem of working with large covariance matrices, which is even worse when more than one Gaussian component is considered.
In contrast to parametric models, non-parametric density estimation methods, such as the Parzen kernel density estimate (PKDE), do not assume a particular parametric form for the distribution, but estimate the density $f(x) = \sum_{t=1}^{N} K(x - x_t) / N$ as the sum of individual components, one for each of the acquired data points, via a suitable kernel function K. However, the choice of the kernel function is typically not easy, and that method also needs to retain all N acquired data points in memory, which is problematic when that number is large, or even infinite.
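A minimal one-variable sketch of the PKDE formula above is given below. The Gaussian kernel and its bandwidth `h` are assumptions made only for illustration, as are the function name `parzen_density` and the sample values. Note that the returned function keeps a reference to every acquired sample, which is the memory cost mentioned above.

```python
from statistics import NormalDist

def parzen_density(samples, h):
    """Return f with f(x) = sum_t K(x - x_t) / N for a Gaussian kernel K.

    h is the kernel bandwidth (standard deviation), chosen by the user.
    """
    kernel = NormalDist(mu=0.0, sigma=h)
    N = len(samples)
    def f(x):
        # One kernel evaluation per stored sample.
        return sum(kernel.pdf(x - xt) for xt in samples) / N
    return f

# Hypothetical readings of a single variable.
f = parzen_density([0.0, 1.0, 2.0], h=0.5)
print(f(1.0))  # inside the range of the samples
print(f(5.0))  # far from every sample
```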
Another common shortcoming of these methods is that they cannot easily handle data of mixed type, for example when some variables are continuous, and others are discrete.
A more efficient approach to dealing with the high dimensionality of the data vector, when the number of data vectors is large, is to try to decompose (factor) the probability distribution ƒ(x) into P individual probability distributions over subsets of the data vector x, such that $f(x) = \prod_{p=1}^{P} f_p(x^{(p)})$, where $f_p(x^{(p)})$ is a probability density function over the subset $x^{(p)}$ of the data vector. Let $\pi_p$ denote the projection operator from x to $x^{(p)}$, that is, $x^{(p)} = \pi_p(x)$. Let V={1, 2, . . . , M} be the set of all indices of data variables, $V_p$ be the set of indices of variables in part p, and $M_p = |V_p|$ be the number of variables in part p. Then, it is desired to obtain a suitable partitioning of V into sets $V_p$, such that $V = \bigcup_{p=1}^{P} V_p$, and, correspondingly, $M = \sum_{p=1}^{P} M_p$.
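The factored density can be evaluated part by part, as in the following sketch. It works with logarithms, so that the product over parts becomes a sum. The names `is_partition` and `factored_log_density` are hypothetical, indices are 0-based rather than 1-based as in the text, and the standard normal part densities are placeholders chosen only so that the result can be checked by hand.

```python
import math
from statistics import NormalDist

def is_partition(parts, M):
    """True when the index sets V_p are disjoint and together cover 0..M-1."""
    flat = [i for V_p in parts for i in V_p]
    return sorted(flat) == list(range(M))

def factored_log_density(x, parts, part_densities):
    """log f(x) = sum over p of log f_p(x^(p)), with x^(p) = pi_p(x)."""
    total = 0.0
    for V_p, f_p in zip(parts, part_densities):
        x_p = [x[i] for i in V_p]  # projection of x onto part p
        total += math.log(f_p(x_p))
    return total

std = NormalDist()  # placeholder density for each variable
parts = [[0], [1, 2]]  # P = 2 parts with M_1 = 1 and M_2 = 2
part_densities = [
    lambda v: std.pdf(v[0]),
    lambda v: std.pdf(v[0]) * std.pdf(v[1]),
]

print(is_partition(parts, 3))
value = factored_log_density([0.0, 0.0, 0.0], parts, part_densities)
print(value)
```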
By changing the size of each part, the number of parameters that need to be estimated and stored in memory can be controlled. For example, if Gaussian models are fit to each part, then the covariance matrix for part p contains on the order of $M_p^2$ elements. That approach also handles variables of mixed type, where continuous and discrete variables can be put in different parts, and different parametric models can be fit to the parts, for example Gaussian, Bernoulli, and multinomial models.
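The effect of the part sizes on the number of stored parameters can be checked with a short calculation. The sketch below assumes a Gaussian model per part and counts $M_p$ mean entries plus the $M_p(M_p+1)/2$ distinct entries of a symmetric covariance matrix; the function name and the choice of M = 1000 are hypothetical.

```python
def gaussian_parameter_count(part_sizes):
    """Mean entries plus distinct covariance entries, summed over all parts."""
    return sum(m + m * (m + 1) // 2 for m in part_sizes)

M = 1000
full = gaussian_parameter_count([M])           # one part, P = 1
blocks = gaussian_parameter_count([10] * 100)  # P = 100 parts of 10 variables
trivial = gaussian_parameter_count([1] * M)    # P = M parts of one variable

print(full, blocks, trivial)
```

Under these assumptions, splitting 1000 variables into 100 parts of 10 reduces the parameter count from 501500 to 6500.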
However, using the smallest possible parts is less effective for the purposes of anomaly detection. A trivial factoring, where each variable is in its own part, such that P=M, $V_p = \{p\}$, $M_p = 1$, would indeed result in a very compact representation of the probability density, but would fail to capture the dependency between variables, and would not be able to detect so-called contextual anomalies. Such anomalies are manifested by readings of one variable that are possible overall, but not when another variable takes on a specific value. For example, the measured air temperature can be 90° F., and this by itself does not necessarily signal anomalous climate conditions, but if the value of another variable, signifying the calendar month, is set to December, and the measurement location is in the northern hemisphere, then both readings together would clearly signify an anomaly.
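The temperature example can be made concrete with a small sketch. All readings below are invented for illustration, and the model choices (a Gaussian over temperature, relative frequencies over the month) are assumptions. The trivial factoring scores a 90° F. reading identically in July and in December, whereas a model that keeps both variables in one part assigns the December reading a vanishingly small density.

```python
from statistics import NormalDist

# Hypothetical training readings (air temperature in deg F, by calendar month).
readings = {
    "July": [85.0, 88.0, 90.0, 92.0, 95.0],
    "December": [25.0, 28.0, 30.0, 32.0, 35.0],
}
total = sum(len(temps) for temps in readings.values())
p_month = {month: len(temps) / total for month, temps in readings.items()}

# Trivial factoring: temperature and month are modeled in separate parts.
all_temps = [t for temps in readings.values() for t in temps]
temp_marginal = NormalDist.from_samples(all_temps)

def independent_density(temp, month):
    return temp_marginal.pdf(temp) * p_month[month]

# Both variables in one part: the temperature model depends on the month.
temp_given_month = {m: NormalDist.from_samples(t) for m, t in readings.items()}

def joint_density(temp, month):
    return temp_given_month[month].pdf(temp) * p_month[month]

for month in ("July", "December"):
    print(month, independent_density(90.0, month), joint_density(90.0, month))
```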
Therefore, it is desired to determine a partitioning method that has a reasonable balance between the size of the identified parts, the number of points available for the estimation of the individual density functions in each part, and the accuracy of the resulting density.