Many industrial applications that rely on pattern recognition and/or the classification of objects, such as automated manufacturing inspection or sorting systems, utilize supervised learning techniques. A supervised learning system, as represented in FIG. 1, is a system that utilizes a supervised learning algorithm 4 to create a trained classifier 6 based on a representative input set of labeled training data 2. Each member of the set of training data 2 consists of a vector of features, xi, and a label indicating the unique class, ci, to which the particular member belongs. Given a feature vector, x, the trained classifier, f, will return a corresponding class label, f(x)=ĉ. The goal of the supervised learning algorithm 4 is to maximize the accuracy, or a related measure, of the classifier 6, not on the training data 2, but rather on similarly obtained set(s) of testing data that are not made available to the learning algorithm 4. If the set of class labels for a particular application contains just two entries, the application is referred to as a binary (or two-class) classification problem. Binary classification problems are common in automated inspection, for example, where the goal is often to determine if manufactured items are good or bad. Multi-class problems are also encountered, for example, in sorting items into one or more sub-categories (e.g., fish by species, computer memory by speed, etc.). Supervised learning has been widely studied in statistical pattern recognition, and a variety of learning algorithms and methods for training classifiers and predicting performance of the trained classifier on unseen testing data are well known.
Referring again to FIG. 1, given a labeled training data set 2 (D={xi, ci}), a supervised learning algorithm 4 can be used to produce a trained classifier 6 (f(x)=ĉ). A risk or cost, αij, can be associated with mistakenly classifying a sample as belonging to class i when the true class is j. Traditionally, correct classification is assigned zero cost, αii=0. A typical goal is to estimate and minimize the expected loss, namely the weighted average of the costs the classifier 6 would be expected to incur on new samples drawn from the same process. The concept of loss is quite general. Setting αij=1 when i and j differ, and αii=0 when they are identical (so-called zero/one loss) is equivalent to treating all errors as equal and leads to minimization of the overall misclassification rate. More typically, different types of errors will have different associated costs. More complicated loss formulations are also possible. For example, the losses αij can be functions rather than constants. In every case, however, some measure of predicted classifier performance is defined, and the goal is to maximize that performance, or, equivalently, to minimize loss.
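By way of illustration only, the expected loss under constant costs can be sketched as follows. The function name, the cost values, and the sample labels are hypothetical and are not taken from any particular system; the cost table is indexed as cost[i][j], the cost of assigning class i when the true class is j.

```python
def expected_loss(true_labels, predicted_labels, cost):
    """Average cost per sample; cost[i][j] is the cost of predicting i when the truth is j."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    total = sum(cost[p][t] for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)


# Zero/one loss: every error costs 1, so the result is the misclassification rate.
zero_one = [[0, 1],
            [1, 0]]

# Hypothetical asymmetric costs: predicting class 0 (good) when the truth is
# class 1 (defect) costs 10, while a false alarm costs 1.
asymmetric = [[0, 10],
              [1, 0]]

# Illustrative labels only (0 = good, 1 = defect).
true_labels = [0, 0, 1, 1, 0]
predicted_labels = [0, 1, 1, 0, 0]

misclassification_rate = expected_loss(true_labels, predicted_labels, zero_one)
weighted_loss = expected_loss(true_labels, predicted_labels, asymmetric)
```

With zero/one costs the function reduces to the overall misclassification rate, while the asymmetric table weights a missed defect more heavily than a false alarm.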
There are several prior art techniques for predicting classifier performance. One such technique is to use independent training and testing data sets. A trained classifier is constructed using the training data, and then performance of the trained classifier is evaluated based on the independent testing data. In many applications, collection of labeled data is difficult and expensive, however, so it is desirable to use all available data during training to maximize accuracy of the resulting classifier.
Another prior art technique for predicting classifier performance, known as “conventional k-fold cross-validation” or simply “k-fold cross-validation”, avoids the need for separate testing data, allowing all available data to be used for training. In k-fold cross-validation, as illustrated in FIGS. 2A and 2B, the training data {xi, ci} are split at random into k subsets, Di, 1≤i≤k, of approximately equal size (FIG. 2B, step 11). For iterations i=1 to k (steps 12-17), a supervised learning algorithm is used to train a classifier (step 14) using all the available data except Di. This trained classifier is then used to classify all the samples in subset Di (step 15), and the classified results are stored (step 16). In many cases, summary statistics can also be saved (at step 16) instead of individual classifications. With constant losses, for example, it suffices to save the total number of errors of various types. After k iterations, true (ci) and estimated (ĉi) class labels (or corresponding sufficient statistics) are known for the entire data set. Performance estimates such as misclassification rate, operating characteristic curves, or expected loss may then be computed (step 18). If the total number of samples is n, then the expected loss per sample can be estimated as (1/n)·Σi α(ĉi, ci), for example, where the sum runs over all n samples and each term is the cost of assigning the estimated label ĉi to a sample whose true label is ci. When k=n, k-fold cross-validation is also known as “leave-one-out cross-validation”. A computationally more efficient variant known as “generalized cross-validation” may be preferred in some applications. Herein we refer to these and similar prior art techniques as “conventional cross-validation” without differentiating between them.
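A minimal sketch of the k-fold procedure of FIG. 2B is given below for illustration. The nearest-mean learner, the synthetic one-dimensional data, and all function names are assumptions introduced here as stand-ins; any supervised learning algorithm could take the place of the hypothetical train_nearest_mean.

```python
import random


def train_nearest_mean(samples):
    """Hypothetical stand-in learner: remember the mean feature value of each class."""
    sums, counts = {}, {}
    for x, c in samples:
        sums[c] = sums.get(c, 0.0) + x
        counts[c] = counts.get(c, 0) + 1
    means = {c: sums[c] / counts[c] for c in sums}
    # The trained classifier returns the class whose mean is closest to x.
    return lambda x: min(means, key=lambda c: abs(x - means[c]))


def k_fold_cross_validation(data, k, train, seed=0):
    """Return a (true label, estimated label) pair for every sample in data."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # step 11: k subsets of roughly equal size
    results = []
    for i in range(k):                          # steps 12-17
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        classifier = train(training)            # step 14: train on everything except fold i
        for x, c in folds[i]:                   # step 15: classify the held-out fold
            results.append((c, classifier(x)))  # step 16: store the result
    return results


# Synthetic two-class data, for illustration only (0 = good, 1 = defect).
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(100)]

pairs = k_fold_cross_validation(data, k=5, train=train_nearest_mean)
error_rate = sum(t != p for t, p in pairs) / len(pairs)  # step 18: summary estimate
```

Every sample is classified exactly once, by a classifier that did not see it during training, so the stored pairs cover the entire data set.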
In k-fold cross-validation, data samples are used to estimate performance only when they do not contribute to training of the classifier, resulting in a fair estimate of performance. Additionally, for large enough k, the training set size (approximately (k−1)/k·n, where n is the number of labeled training data samples) during each iteration above is only slightly less than that of the full data set, leading to only mildly pessimistic estimates of performance.
Many supervised learning algorithms lead to classifiers with one or more adjustable parameters controlling the operating point. For simplicity, discussion is herein restricted to binary classification problems, where ci is a member of one or the other of two different classes. However, it will be appreciated that the principles discussed herein may be extended to multiple-class classification problems. In binary classification, a false positive is defined as mistakenly classifying a sample as belonging to the positive (or defect) class when it actually belongs to the negative (or good) class. Similarly, a true positive is defined as correctly classifying a sample as belonging to the positive class. False positive rate (also known as false alarm rate) may then be defined as the number of false positives divided by the number of members of the negative class. Similarly, sensitivity is defined as the number of true positives divided by the number of members of the positive class. With these definitions, performance of a binary classifier with an adjustable operating point can be summarized by an operating characteristic curve, sometimes called a receiver operating characteristic (ROC) curve, exemplified by FIG. 3. Varying the classifier operating point is equivalent to choosing a point lying on the ROC curve. At each operating point, estimates of the rates at which misclassifications of either type occur are known. If the associated costs, αij, are also known, an expected loss can be computed for any operating point. For monotonic operating characteristics, a unique operating point that minimizes expected loss can be chosen. As noted above, k-fold cross-validation provides the information required to construct an estimated ROC curve for binary classifiers.
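The selection of an operating point can be sketched as follows. This example assumes, purely for illustration, that the classifier emits a numeric score and that the adjustable parameter is a threshold on that score; the scores, labels, costs, and function names are hypothetical.

```python
def operating_points(scores, labels):
    """List (threshold, false positive rate, sensitivity) for each candidate threshold."""
    positives = sum(1 for c in labels if c == 1)
    negatives = len(labels) - positives
    points = []
    # A sample is called positive when its score is at or above the threshold;
    # the infinite threshold corresponds to calling nothing positive.
    for threshold in sorted(set(scores)) + [float("inf")]:
        fp = sum(1 for s, c in zip(scores, labels) if s >= threshold and c == 0)
        tp = sum(1 for s, c in zip(scores, labels) if s >= threshold and c == 1)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def min_loss_point(scores, labels, cost_fp, cost_fn):
    """Return (threshold, expected loss per sample) at the lowest-loss operating point."""
    n = len(labels)
    positives = sum(1 for c in labels if c == 1)
    negatives = n - positives
    best = None
    for threshold, fpr, sensitivity in operating_points(scores, labels):
        loss = (cost_fp * fpr * negatives + cost_fn * (1.0 - sensitivity) * positives) / n
        if best is None or loss < best[1]:
            best = (threshold, loss)
    return best


# Illustrative scores and true labels only (1 = positive/defect, 0 = negative/good).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

costly_misses = min_loss_point(scores, labels, cost_fp=1, cost_fn=5)
costly_alarms = min_loss_point(scores, labels, cost_fp=5, cost_fn=1)
```

When missed defects are the more expensive error, the lowest-loss threshold moves down to catch every positive sample; when false alarms are more expensive, it moves up.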
Beyond making effective use of all available data, k-fold cross-validation also allows the reliability of the predicted performance to be estimated. The k-fold cross-validation algorithm can be repeated, each time with a different pseudo-random segregation of the data into the k subsets. This approach can be used, for example, to compute not just the expected loss, but also the standard deviation of this estimate. Similarly, non-parametric hypothesis testing can be performed (for example, repeated k-fold cross-validation can be used to answer questions such as “how likely is the loss to exceed twice the estimated value?”).
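The repetition described above might be sketched as follows. The midpoint-threshold learner, the synthetic data, and the choice of 20 repetitions are assumptions made only for this illustration.

```python
import random
import statistics


def cv_error_rate(data, k, seed):
    """One pass of k-fold cross-validation using a hypothetical midpoint-threshold learner."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # a different seed gives a different split
    folds = [shuffled[i::k] for i in range(k)]
    errors = 0
    for i, held_out in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        mean0 = statistics.mean(x for x, c in training if c == 0)
        mean1 = statistics.mean(x for x, c in training if c == 1)
        threshold = (mean0 + mean1) / 2  # assumes class 1 lies above class 0
        errors += sum((x > threshold) != (c == 1) for x, c in held_out)
    return errors / len(data)


# Synthetic two-class data, for illustration only.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(60)]
data += [(rng.gauss(2.0, 1.0), 1) for _ in range(60)]

# Repeat the procedure with 20 different pseudo-random splits.
estimates = [cv_error_rate(data, k=5, seed=s) for s in range(20)]
mean_error = statistics.mean(estimates)
spread = statistics.stdev(estimates)  # variability of the estimate across splits
```

The list of per-split estimates can also be examined directly, for example by counting how often an estimate exceeds some multiple of the mean.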
Prior art methods for predicting classifier performance assume that the set of training data is representative. If it is not, and in particular if the process giving rise to the training data samples is characterized by temporal variation (e.g., the process drifts or changes with time), then the trained classifier may perform much more poorly than predicted. Such discrepancies or changes in performance can be used to detect temporal variation when it occurs, but it would be preferable to detect temporal variation in the process during the training phase. Supervised learning does not typically address this problem.
Two techniques that do explicitly deal with the prediction of temporal variation in a process are time series analysis and statistical process control. Time series analysis attempts to understand and model temporal variations in a data set, typically with the goal of either predicting behavior for some period into the future, or correcting for seasonal or other variations. Statistical process control (SPC) provides techniques to keep a process operating within acceptable limits and for raising alarms when unable to do so. Ideally, statistical process control could be used to keep a process at or near its optimal operating point, almost eliminating poor classifier performance due to temporal variation in the underlying process. In practice, this ideal is rarely approached because of the time, cost, and difficulty involved. As a result, temporal variation may exist within predefined limits even in well controlled processes, and this variation may be sufficient to interfere with the performance of a classifier created using supervised learning. Neither time series analysis nor statistical process control provides tools directly applicable for analysis and management of such classifiers in the presence of temporal process variation.
Prior art methods for predicting classifier performance are applicable when either a) the underlying process which generated the set of training data has no significant temporal variation, or b) temporal variation is present, but the underlying process is stationary and ergodic, and samples are collected over a long enough period that they are representative. In many cases where there is explicit or implicit temporal variation in the underlying process, the assumption that the set of training data is representative of the underlying process is not justified, and k-fold cross-validation can dramatically overestimate performance. Consider, for example, the processes illustrated in FIGS. 4A, 4B, and 4C. “State” in these figures is meant only for purposes of illustration. The actual state will be of high, often unknown, dimension and is itself rarely known. The process illustrated in FIG. 4A has no temporal variation. The process illustrated in FIG. 4B is a stationary process with random, ergodic fluctuations. The process illustrated in FIG. 4C shows steady drift accompanied by random fluctuations about the local mean. Conventional k-fold cross-validation will correctly predict classifier performance for the process illustrated in FIG. 4A given sufficient training data. For the process illustrated in FIG. 4B, correct results will also be attained if the data set is collected over a sufficiently long period so that states are sampled with approximately the equilibrium distribution. Failing this, performance will typically be overestimated. For the process illustrated in FIG. 4C, actual performance may match predicted performance initially, but will degrade as points further into the future are sampled. This list of sample processes is for purposes of illustration only and is by no means exhaustive.
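The effect of steady drift, in the manner of FIG. 4C, can be illustrated with a deliberately simplified, noise-free toy model. The process, the drift rate, the time windows, and the threshold learner below are all hypothetical and are chosen only so that the outcome is easy to follow; leave-one-out cross-validation is used so that no random split is needed.

```python
def drifting_process(times, drift=0.1):
    """Hypothetical process: both class means move upward by `drift` per time step."""
    samples = []
    for t in times:
        samples.append((drift * t - 1.0, 0))  # good sample at time t
        samples.append((drift * t + 1.0, 1))  # defect sample at time t
    return samples


def fit_threshold(samples):
    """Stand-in learner: threshold midway between the two class means."""
    good = [x for x, c in samples if c == 0]
    bad = [x for x, c in samples if c == 1]
    return (sum(good) / len(good) + sum(bad) / len(bad)) / 2


def error_rate(threshold, samples):
    """Fraction of samples misclassified when x > threshold means defect."""
    return sum((x > threshold) != (c == 1) for x, c in samples) / len(samples)


training = drifting_process(range(0, 10))   # data available at training time
future = drifting_process(range(20, 30))    # data encountered after further drift

# Leave-one-out cross-validation on the training window.
loo_errors = 0
for i, (x, c) in enumerate(training):
    rest = training[:i] + training[i + 1:]
    loo_errors += (x > fit_threshold(rest)) != (c == 1)
predicted_error = loo_errors / len(training)

# Actual error of the classifier trained on the full window, applied later.
actual_error = error_rate(fit_threshold(training), future)
```

In this toy model cross-validation predicts no errors, yet every good sample in the later window has drifted above the learned threshold, so half of the later samples are misclassified.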
The determination of whether the set of training data is representative of the process often requires the collection of additional labeled training data, which can be prohibitively expensive. As an example, consider fabrication of complex printed circuit assemblies. Using SPC, individual solder joints on such printed circuit assemblies may be formed with high reliability, e.g., with defect rates on the order of 100 parts-per-million (ppm). Defective joints may therefore be quite rare. Large printed circuit assemblies can exceed 50,000 joints, however, so the economic impact of defects would be enormous without the ability to automatically detect joints that are in need of repair. Supervised learning is often used to construct classifiers for this application. Thousands of defects are desirable for training, but since good joints outnumber bad joints by 10,000 to 1, millions of good joints must be examined in order to obtain sufficient defect samples for training the classifier. This poses a significant burden on the analyzer (typically a human expert) tasked with assigning true class labels, so collection of training data is time-consuming, expensive, and error-prone. In addition, the collection of more training data than necessary slows the training process without improving performance. Accordingly, it is desirable to use the smallest training data set possible that yields the desired performance.
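The scale of the labeling burden follows from simple arithmetic, sketched below. The defect rate and joint count come from the example above; the target of 2,000 defect samples is an assumed value standing in for "thousands of defects".

```python
defect_rate_ppm = 100      # defective joints per million, from the example above
joints_per_board = 50_000  # joints on a large printed circuit assembly
defects_wanted = 2_000     # assumed target number of defect samples

# Expected number of defective joints on one large assembly.
expected_defects_per_board = joints_per_board * defect_rate_ppm / 1_000_000

# Joints that must be examined, on average, to observe the desired defects.
joints_needed = defects_wanted * 1_000_000 // defect_rate_ppm

# Equivalent number of large assemblies.
boards_needed = joints_needed // joints_per_board
```

Under these assumptions each large assembly is expected to contain about five defective joints, and roughly 20 million joints (about 400 such assemblies) would have to be examined to collect the desired defect samples.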
For the reasons described above, it would be desirable to be able to detect the presence or possible presence of temporal variation in the process from indications in the training data itself. It would be further desirable to be able to predict expected future classifier performance even in the presence of temporal variation in the underlying process. Finally, it would be useful to project the performance gain likely to result from collection of additional training data, and to explore various options for its use (for example, to answer the question of whether it would be better to simply add to the existing training data or to periodically retrain the classifier based on a sliding window of training data samples).