Data sequences often contain redundancy, context dependency and state dependency. Often the relationships within the data are complex, non-linear and unknown, and the application of existing control and processing algorithms to such data sequences does not generally lead to useful results.
Statistical Process Control (SPC) essentially began with the Shewhart chart and since then extensive research has been performed to adapt the chart to various industrial settings. Early SPC methods were based on two critical assumptions:
i) there exists a priori knowledge of the underlying distribution (often, observations are assumed to be normally distributed); and
ii) the observations are independent and identically distributed (i.i.d.).
In practice, these assumptions are frequently violated in industrial processes.
Current SPC methods can be categorized into groups using two different criteria, as follows:
1) methods for independent data, where observations are not interrelated, versus methods for dependent data;
2) methods that are model-specific, requiring a priori assumptions on the process characteristics and its underlying distribution, versus methods that are model-generic. The latter methods try to estimate the underlying model with minimum a priori assumptions.
FIG. 1 is a chart of relationships between different SPC methods and includes the following:
Information Theoretic Process Control (ITPC) is an independent-data based and model-generic SPC method. It utilizes information theory principles, such as maximum entropy, subject to constraints derived from the dynamics of the process. It provides a theoretical justification for the traditional Gaussian assumption and suggests a unified control chart, as opposed to traditional SPC methods, which require a separate chart for each moment.
Traditional SPC methods, such as Shewhart, Cumulative Sum (CUSUM) and Exponentially Weighted Moving Average (EWMA), are for independent data and are model-specific. These traditional SPC methods are extensively implemented in industry. The independence assumptions on which they rely are frequently violated in practice, especially since automated testing devices increase the sampling frequency and introduce autocorrelation into the data. Moreover, the implementation of feedback control devices at the shop floor level tends to create structured dynamics in certain system variables. Applying traditional SPC to such interrelated processes increases the frequency of false alarms and shortens the ‘in-control’ average run length (ARL) in comparison to uncorrelated observations. As shown later in this section, these methods can be modified to control autocorrelated data.
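The effect of autocorrelation on false alarms can be illustrated with a minimal simulation sketch. The sketch assumes an individuals (Shewhart-type) chart whose standard deviation is estimated from the average moving range, and an in-control AR(1) process; the function names, the seed, and the value phi = 0.9 are illustrative assumptions and are not taken from the disclosure.

```python
import random

def ar1_series(n, phi, seed):
    """Generate x_t = phi * x_{t-1} + e_t with standard normal noise e_t."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def individuals_chart_alarm_rate(data):
    """Fraction of points outside the 3-sigma limits of an individuals chart.

    Sigma is estimated as the average moving range divided by d2 = 1.128,
    the usual constant for moving ranges of span 2.
    """
    mean = sum(data) / len(data)
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    sigma_hat = (sum(moving_ranges) / len(moving_ranges)) / 1.128
    upper, lower = mean + 3 * sigma_hat, mean - 3 * sigma_hat
    alarms = sum(1 for x in data if x > upper or x < lower)
    return alarms / len(data)

# No shift is introduced, so every alarm counted here is a false alarm.
iid_rate = individuals_chart_alarm_rate(ar1_series(5000, 0.0, seed=1))
ar_rate = individuals_chart_alarm_rate(ar1_series(5000, 0.9, seed=1))
print(f"i.i.d. rate: {iid_rate:.4f}, AR(1) phi=0.9 rate: {ar_rate:.4f}")
```

With positive autocorrelation, successive observations are close to each other, so the moving range underestimates the process spread and the control limits become too narrow.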
The majority of model-specific methods for dependent data are time-series based. The underlying principle of such model-dependent methods is as follows: assuming a time-series model family can best capture the autocorrelation process, it is possible to use that model to filter the data, and then apply traditional SPC schemes to the stream of residuals. In particular, the ARIMA (Auto Regressive Integrated Moving Average) family of models is widely applied for the estimation and filtering of process autocorrelation. Under certain assumptions, the residuals of the ARIMA model are independent and approximately normally distributed, so that traditional SPC can be applied to them. Furthermore, it is commonly believed that ARIMA models, mostly the simple ones such as AR(1), can effectively describe a wide variety of industrial processes.
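The filter-then-chart principle can be sketched for the simplest case, AR(1). The sketch below assumes a least-squares estimate of the autoregressive coefficient and a plain 3-sigma rule on the residuals; all names and the simulated data (phi = 0.8) are hypothetical and serve only as an illustration.

```python
import random

def fit_ar1(data):
    """Least-squares estimate of phi in (x_t - mu) = phi * (x_{t-1} - mu) + e_t."""
    mu = sum(data) / len(data)
    centered = [x - mu for x in data]
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    den = sum(a * a for a in centered[:-1])
    return mu, num / den

def ar1_residuals(data, mu, phi):
    """One-step-ahead prediction errors of the fitted AR(1) model."""
    return [(b - mu) - phi * (a - mu) for a, b in zip(data, data[1:])]

def lag1_autocorr(values):
    """Sample lag-1 autocorrelation."""
    m = sum(values) / len(values)
    c = [v - m for v in values]
    return sum(a * b for a, b in zip(c, c[1:])) / sum(a * a for a in c)

def shewhart_alarms(values):
    """Indices of values farther than 3 standard deviations from the mean."""
    m = sum(values) / len(values)
    sd = (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [i for i, v in enumerate(values) if abs(v - m) > 3 * sd]

# Simulated in-control AR(1) process (phi = 0.8 is an arbitrary choice).
rng = random.Random(7)
series, x = [], 0.0
for _ in range(5000):
    x = 0.8 * x + rng.gauss(0.0, 1.0)
    series.append(x)

mu_hat, phi_hat = fit_ar1(series)
residuals = ar1_residuals(series, mu_hat, phi_hat)
alarms = shewhart_alarms(residuals)
```

The raw series is strongly autocorrelated, whereas the residuals are approximately uncorrelated, which is what makes the traditional chart applicable to them.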
Model-specific methods for autocorrelated data can be further partitioned into parameter-dependent methods, which require explicit estimation of the model parameters, and parameter-free methods, where the model parameters are only implicitly derived, if at all.
Several parameter-dependent methods have been proposed over the years for autocorrelated data. One of these is the Special Cause Chart (SCC), in which the Shewhart method is applied to the stream of residuals. The SCC was shown to have major advantages over Shewhart with respect to mean shifts. The deficiency of the SCC lies in the need to explicitly estimate all the ARIMA parameters. Moreover, the method performs poorly for a large positive autocorrelation, since the mean shift tends to stabilize rather quickly to a steady state value, and the shift is poorly manifested in the residuals.
Some approaches implemented traditional SPC for autocorrelated data using CUSUM methods. Others extended the method by using the EWMA method with a small difference: a random error was added to the ARIMA model. The drawback of these models is the need for explicit parameter estimation and for estimation of their process-dependent features. It was demonstrated that for certain autocorrelated processes, the use of traditional SPC yields an improved performance in comparison to ARIMA-based methods.
The Generalized Likelihood Ratio Test (GLRT) method takes advantage of the transient dynamics of the residuals in the ARIMA model when a mean shift is introduced. The generalized likelihood ratio may be applied to the filtered residuals. Comparing the method with the Shewhart, CUSUM and EWMA methods for autocorrelated data suggests that the choice of an adequate time-series based SPC method depends strongly on the characteristics of the specific process being controlled. Moreover, it has been emphasized that modeling errors in the ARIMA parameters have a strong impact on the performance (e.g., the ARL) of parameter-dependent SPC methods for autocorrelated data. If the process can be accurately defined by an ARIMA time series, the parameter-dependent SPC methods are superior to non-parametric methods, since they allow efficient statistical analysis. If such a definition is not possible, then the effort of estimating the time-series parameters becomes impractical. This conclusion, among other reasons, triggered the development of parameter-free methods that avoid the impractical estimation of time-series parameters.
A parameter-free model was proposed as an approximation procedure based on EWMA, in which the EWMA statistic is used as a one-step-ahead prediction value for the IMA(1,1) model. The underlying assumption was that even if the process is better described by another member of the ARIMA family, the IMA(1,1) model is a good enough approximation. A later study, however, compared several SPC methods and showed that Montgomery's approximation performed poorly. It proposed employing the EWMA statistic for stationary processes, but with the process variance adjusted according to the autocorrelation effects.
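The use of the EWMA statistic as a one-step-ahead predictor can be sketched as follows. The recursion z_t = lam * x_t + (1 - lam) * z_{t-1} is the standard EWMA update; the function name, the smoothing constant lam = 0.5 and the starting value are illustrative assumptions.

```python
def ewma_one_step_errors(data, lam, start):
    """Return (predictions, errors) with the EWMA used as one-step-ahead forecast.

    The forecast for x_t is z_{t-1}, where z_t = lam * x_t + (1 - lam) * z_{t-1}.
    """
    z = start
    predictions, errors = [], []
    for x in data:
        predictions.append(z)          # forecast made before observing x
        errors.append(x - z)           # one-step-ahead prediction error
        z = lam * x + (1 - lam) * z    # update the EWMA with the new observation
    return predictions, errors

preds, errs = ewma_one_step_errors([10.0, 12.0, 11.0], lam=0.5, start=10.0)
print(preds, errs)
```

The prediction errors, rather than the raw observations, are then the quantities placed on a control chart.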
Another approach uses the weighted batch mean (WBM) and the unified batch mean (UBM) methods. The WBM method assigns weights to the observations within each batch mean and defines the batch size so that the autocorrelation among batches reduces to zero. In the UBM method the batch size is defined (with unified weights) so that the autocorrelation remains under a certain level.
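The equal-weight batching idea can be sketched as follows. The rule used here, doubling the batch size until the lag-1 autocorrelation of the batch means falls to 0.1 or below, is an assumption chosen for illustration and is not prescribed by the text; the function names and the simulated AR(1) data are likewise hypothetical.

```python
import random

def lag1_autocorr(values):
    """Sample lag-1 autocorrelation."""
    m = sum(values) / len(values)
    c = [v - m for v in values]
    return sum(a * b for a, b in zip(c, c[1:])) / sum(a * a for a in c)

def batch_means(data, batch_size):
    """Equal-weight means of consecutive, non-overlapping batches."""
    n_batches = len(data) // batch_size
    return [sum(data[i * batch_size:(i + 1) * batch_size]) / batch_size
            for i in range(n_batches)]

def choose_batch_size(data, max_autocorr=0.1, min_batches=20):
    """Double the batch size until the batch means are nearly uncorrelated,
    or until doubling again would leave fewer than min_batches batches."""
    size = 1
    while len(data) // (size * 2) >= min_batches:
        if lag1_autocorr(batch_means(data, size)) <= max_autocorr:
            break
        size *= 2
    return size

# Simulated AR(1) data with phi = 0.5 (arbitrary choice).
rng = random.Random(11)
series, x = [], 0.0
for _ in range(20000):
    x = 0.5 * x + rng.gauss(0.0, 1.0)
    series.append(x)

size = choose_batch_size(series)
means = batch_means(series, size)
```

A traditional chart can then be applied to `means`, at the cost of a slower reaction, since one charted point is produced only once per batch.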
Runger and Willemain demonstrated that weights estimated from the ARIMA model do not guarantee a performance improvement and that it is beneficial to apply the simpler UBM method. In general, parameter-free methods do not require explicit ARIMA modeling; however, they are all based on the implicit assumption that the time-series model is adequate to describe the process. While this can be true in some industrial environments, such an approach cannot capture more complex and non-linear process dynamics that depend on the state in which the system operates, for example processes that are described by Hidden Markov Models (HMM).
The Problem of Pattern Classification
In general, the goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are called patterns, and they may be printed letters or characters, biological cells, electronic waveforms or signals, “states” of a system, or any number of other things that one may desire to classify. If there exists some set of labeled patterns, namely patterns whose classes are known, then one has a problem in supervised pattern recognition. The basic procedure followed in the design of a supervised pattern recognition system involves extracting a portion of a set of labeled patterns and using it to derive a classification algorithm. These patterns are called the training set. The remaining patterns are then used to test the classification algorithm; these patterns are collectively referred to as the test set. Since the correct classes of the individual patterns in the test set are also known, the performance of the algorithm can be evaluated. In supervised pattern recognition problems, the results are preferably evaluated by a “teacher” or “supervisor” whose output dictates suitable modifications to the algorithm, hence the term supervised pattern recognition. Once a desired level of performance is achieved (measured in terms of a misclassification rate), the algorithm can be used on initially unlabeled patterns. At this point, the feedback loop involving the teacher is formally broken. Nonetheless, it is usually advisable to have some spot-checking of results. Such checks can be accommodated either by providing an alternative classification algorithm or, if possible, a human observer. In some situations it may be feasible to wait a certain length of time until the correct classification is known. If the classes of all of the available patterns are unknown, and perhaps even the number of these classes is unknown, then one has a problem in unsupervised pattern recognition or clustering.
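The training-set and test-set procedure can be sketched with a deliberately simple classifier. The nearest-class-mean rule, the synthetic one-dimensional patterns and the 700/300 split below are assumptions made for illustration; they are not the classification scheme of the present disclosure.

```python
import random

def train_nearest_mean(samples):
    """samples: list of (value, label) pairs. Returns {label: class mean}."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(means, value):
    """Assign the label whose class mean is closest to value."""
    return min(means, key=lambda label: abs(value - means[label]))

def misclassification_rate(means, samples):
    """Fraction of labeled samples that the classifier gets wrong."""
    errors = sum(1 for value, label in samples if classify(means, value) != label)
    return errors / len(samples)

# Hypothetical labeled patterns: two classes with different means.
rng = random.Random(3)
labeled = [(rng.gauss(0.0, 1.0), "A") for _ in range(500)]
labeled += [(rng.gauss(4.0, 1.0), "B") for _ in range(500)]
rng.shuffle(labeled)

training_set, test_set = labeled[:700], labeled[700:]
model = train_nearest_mean(training_set)
test_error = misclassification_rate(model, test_set)
print(f"misclassification rate on the test set: {test_error:.3f}")
```

Because the test patterns were not used to derive the classifier, the rate measured on them estimates the performance to be expected on initially unlabeled patterns.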
In clustering problems, one attempts to find classes of patterns with similar properties where sometimes even these properties may be undefined. The unsupervised pattern recognition or clustering problem is a much more difficult one than the supervised pattern recognition problem. Nevertheless, useful algorithms have been developed in this area and success depends to a large extent on the ability to learn the structure of pattern measurement data in high-dimensional spaces. The present disclosure focuses on a supervised pattern recognition scheme.
The Pattern Recognition Approach
In the typical pattern recognition approach, observations first undergo feature transformation and then classification in order to arrive at an output decision. An observation vector x is first transformed, by the feature transformation, into another vector y whose components are called features. The features are intended to be fewer in number than the observations but should collectively contain most of the information needed for classification of the patterns. By reducing the observations to a smaller number of features, one hopes to design a decision rule that is more reliable. The feature vector y can be represented in a feature space Y, similar to the way that observation vectors are represented in the observation space. The dimension of the feature space, however, is usually much lower than the dimension of the observation space. Procedures that analyze data in an attempt to define appropriate features are called feature extraction procedures. The feature vector y is passed to a classifier whose purpose is to make a decision about the pattern. The classifier essentially induces a partitioning of the feature space into a number of disjoint regions. If the feature vector corresponding to a pattern falls into region R_i, the pattern is assigned to class W_i.
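The two stages, feature transformation followed by a partition of the feature space, can be sketched as follows. The two features (sample mean and mean absolute successive difference), the threshold of 1.0 and the example waveforms are hypothetical choices made only to show the flow from x to y to a class decision.

```python
def feature_transform(x):
    """Map an observation vector x to a 2-component feature vector y.

    y[0] is the sample mean, y[1] the mean absolute successive difference.
    """
    mean = sum(x) / len(x)
    roughness = sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)
    return (mean, roughness)

def classify(y, roughness_threshold=1.0):
    """Partition the feature space into two disjoint regions.

    Region R1 = {y : y[1] < threshold} is assigned to class W1;
    the complementary region R2 is assigned to class W2.
    """
    return "W1" if y[1] < roughness_threshold else "W2"

# Two hypothetical 8-sample observation vectors.
smooth = [1.0, 1.1, 1.2, 1.1, 1.0, 0.9, 1.0, 1.1]
jagged = [1.0, 3.0, 0.0, 4.0, 1.0, 3.5, 0.5, 2.0]

for name, x in (("smooth", smooth), ("jagged", jagged)):
    y = feature_transform(x)
    print(name, y, classify(y))
```

Here an 8-dimensional observation is reduced to a 2-dimensional feature vector, and the decision rule only needs to be defined on the lower-dimensional space.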
In general, the symbol x is used herein to represent observation vectors and y is used to represent feature vectors.
There are several ways to perform pattern recognition. We classify the pattern recognition methods into different classes, as shown in the tree depicted in attached FIG. 1A. We detail those branches of the tree that are related to the present disclosure.
The first classification is between supervised pattern recognition vs. unsupervised pattern recognition:
In supervised pattern recognition, the types and the number of the existing classes are known. In addition, the classes in the training set are tagged.
By contrast, in unsupervised pattern recognition, the classes of all of the available patterns are unknown, and in some cases even the number of these classes is unknown. Consequently, in such situations the classes in the training set are not tagged and the problem becomes a clustering problem.
The present disclosure concerns problems of supervised pattern recognition since, as will be explained below in the description of the specific embodiments, the construction algorithm may make use of the different tagged classes in the training set to generate a different context-tree model for each class. For example, in the promoter recognition problem there are two tagged classes: “promoters” and “non-promoters”. We thus continue to detail the supervised pattern recognition branch.
The second classification distinguishes between statistical and logical methods.
Logical Methods are usually used when the classification problem involves nominal data, for instance descriptions, that are discrete and without any natural notion of similarity or even ordering. The decision tree is an example of a logical method. This branch is irrelevant to the present disclosure.
Statistical Methods use statistical tools and are based on feature vectors of real-valued and discrete-valued numbers. There can be a natural measure of distance between these vectors. In this category, which is relevant to the present disclosure, we make another distinction between Unknown probabilistic models and Known probabilistic models.
In unknown probabilistic models, the underlying probabilistic model is unknown. In many cases researchers make use of discriminant functions to address these types of problems. Since we assume that a general context-tree model can well represent the different classes (although the parameters of the tree are unknown and need to be estimated from the training set), we do not consider this branch of methods.
Known probabilistic models: the distribution function or a general probabilistic model, such as a transition probability tree, is assumed known. We assume that a general context-tree model can well represent the different classes. In this category, which is relevant to the present disclosure, we distinguish between the following two types of models:
Known parameters: models based on known parameters. This is often the easiest, albeit the rarer, problem. In this case, researchers typically use Bayesian decision theory to classify the unknown object.
Unknown parameters: in these cases, researchers often estimate the parameters by known methods such as maximum likelihood estimation (where parameters are assumed to be fixed), Bayesian estimation (where parameters are assumed to be random variables), and Gibbs sampling. The present disclosure belongs to this branch of methods. This branch also includes some other state-dependent models, such as Markov models, Hidden Markov Models, neural nets, etc. Note that once the parameters of the model are estimated, conventional methods of classification can be used, such as those based on Bayesian decision theory.
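The two-step procedure, parameter estimation followed by a Bayesian decision, can be sketched for a case much simpler than a context tree. The sketch assumes one-dimensional Gaussian class models with maximum likelihood estimates and priors taken from class frequencies; the class names and training values are hypothetical.

```python
import math

def ml_gaussian(values):
    """Maximum likelihood estimates (mean, variance) of a 1-D Gaussian."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # ML divides by n, not n - 1
    return mean, var

def log_gaussian(x, mean, var):
    """Log density of a Gaussian with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def bayes_classify(x, class_params, priors):
    """Pick the class maximizing log prior + log likelihood,
    which equals the log posterior up to an additive constant."""
    return max(class_params,
               key=lambda c: math.log(priors[c]) + log_gaussian(x, *class_params[c]))

# Hypothetical tagged training set with two classes.
training = {"low": [1.0, 2.0, 3.0], "high": [7.0, 8.0, 9.0]}
total = sum(len(v) for v in training.values())
params = {c: ml_gaussian(v) for c, v in training.items()}
priors = {c: len(v) / total for c, v in training.items()}

print(bayes_classify(2.5, params, priors), bayes_classify(7.5, params, priors))
```

The same pattern applies when the per-class model is richer: only the estimation step and the likelihood function change, while the decision rule stays the same.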
Given the above classification, note that Markov models are the closest to the methods of the present disclosure. In the following, we briefly sketch the Markov models.
Markov Models
Markov models are based on a finite memory assumption, i.e., that each symbol depends only on its k predecessors, where k is fixed. The simplest model is the first-order Markov model, which assumes that each symbol at time i depends only on the symbol at time i-1: P(x_i = W(i) | x_1 = W(1), x_2 = W(2), ..., x_{i-1} = W(i-1)) = P(x_i = W(i) | x_{i-1} = W(i-1)), where W(i) denotes the state at time i.
In order to calculate the probability that the model generates a particular sequence, the successive probabilities should simply be multiplied.
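The multiplication of successive probabilities can be sketched for a first-order model over a two-symbol alphabet. The initial and transition probabilities below are made-up values; the logarithmic variant is added only because products of many small probabilities underflow in floating point.

```python
import math

def sequence_probability(seq, initial, transition):
    """P(seq) under a first-order Markov model: initial probability of the
    first symbol times the product of the successive transition probabilities."""
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob

def sequence_log_probability(seq, initial, transition):
    """Same quantity in log space, for long sequences."""
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp

# Hypothetical model parameters.
initial = {"a": 0.6, "b": 0.4}
transition = {"a": {"a": 0.7, "b": 0.3},
              "b": {"a": 0.2, "b": 0.8}}

p = sequence_probability("aab", initial, transition)  # 0.6 * 0.7 * 0.3
print(p)
```

In a classification setting, one such model would be estimated per class, and a new sequence would be assigned to the class whose model gives it the highest probability.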
Markov models of higher order simply extend the size of the memory. The suggested methods of the present disclosure can be viewed as a varying-order Markov model, since the order of the memory need not be fixed, as explained later.
In general, Markov models assume that the states are accessible. In many cases, however, the perceiver does not have access to the states. Consequently, the Markov model should be augmented to a Hidden Markov Model, which is a Markov model with invisible states. Hidden Markov Models have a number of parameters whose values are set so as to best explain the training patterns for the known category.
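When the states are hidden, the probability of an observed sequence is obtained by summing over all state paths, which the standard forward algorithm does efficiently. The sketch below uses a two-state model with made-up initial, transition and emission probabilities; the state and symbol names are hypothetical.

```python
def forward_probability(observations, initial, transition, emission):
    """Probability of an observation sequence under an HMM (forward algorithm).

    alpha[s] holds the joint probability of the observations seen so far
    and of being in hidden state s at the current time step.
    """
    states = list(initial)
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs] * sum(alpha[r] * transition[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state model emitting symbols "x" and "y".
initial = {"s1": 0.5, "s2": 0.5}
transition = {"s1": {"s1": 0.9, "s2": 0.1},
              "s2": {"s1": 0.2, "s2": 0.8}}
emission = {"s1": {"x": 0.8, "y": 0.2},
            "s2": {"x": 0.3, "y": 0.7}}

print(forward_probability("xxy", initial, transition, emission))
```

Setting the parameter values so as to best explain training patterns requires an additional estimation procedure, which is not shown in this sketch.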
An alternative to the Markovian model is the context-tree, which was originally suggested for data compression purposes and later modified. The tree representation of a finite-memory source is advantageous since states are defined as contexts, graphically represented by branches of variable length in the context-tree, and hence it requires less estimation effort than is required for a Markov representation. The context-tree is an irreducible set of conditional probabilities of output symbols given their contexts. The tree is conveniently estimated by the context algorithm. The algorithm generates an asymptotically minimal tree fitting the data. The attributes of the context-tree, along with the ease of its estimation, make it suitable for a model-generic classifier, as explained later.
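The notion of variable-length contexts can be illustrated with a simplified sketch. The code below is not the context algorithm referred to above: it merely counts next-symbol occurrences for every context up to a maximum depth and predicts from the longest context observed at least a minimum number of times. The function names, the depth limit and the count threshold are assumptions introduced for illustration.

```python
from collections import defaultdict

def count_contexts(sequence, max_depth):
    """Count next-symbol occurrences for every context of length 0..max_depth.

    A context is the string of symbols immediately preceding the current one.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(sequence):
        for depth in range(min(i, max_depth) + 1):
            context = sequence[i - depth:i]
            counts[context][symbol] += 1
    return counts

def predict(counts, history, symbol, max_depth, min_count=2):
    """Conditional probability of symbol given the longest suffix of history
    that was observed at least min_count times (shorter contexts as fallback)."""
    for depth in range(min(len(history), max_depth), -1, -1):
        context = history[len(history) - depth:]
        total = sum(counts[context].values()) if context in counts else 0
        if total >= min_count:
            return counts[context][symbol] / total
    return 0.0

counts = count_contexts("abababab", max_depth=2)
print(predict(counts, "ab", "a", max_depth=2))  # uses the depth-2 context "ab"
print(predict(counts, "bb", "a", max_depth=2))  # "bb" unseen, falls back to "b"
print(predict(counts, "", "a", max_depth=2))    # empty context (root)
```

The memory length used for a prediction thus depends on the context itself, which is the property that distinguishes such a model from a fixed-order Markov model. A full context-tree estimation would additionally prune contexts that do not change the conditional distribution.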