Many decision systems are applied in dynamic situations where the system must adapt or learn its internal structure or parameters for each situation in order to optimize or maintain its performance. For example, an image-based decision system that inspects semiconductor wafers for defects or measures system alignment and registration precision can expect the image characteristics to change as new layers are added to the wafer during various processing stages. Other dynamic applications include live cell analysis that tracks a group of cells over time and patient-specific medical image analysis (such as MRI, CT, X-ray) that needs to account for the differences between imaging system setups or patients' characteristics and response to the imaging system. There are also many non-image-based dynamic decision system applications, including data mining and decision support in security and financial applications, speech recognition, mobile robots, medical diagnosis and treatment support, etc.
Ideally, a learning process would take place in a supervised learning fashion wherein domain experts provide the truth labels associated with input data (labeled data) [Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Academic Press, 1999, pp. 3-7]. This results in learning data for which the desired outcome has been provided, and the system learns to match particular distributions of inputs to the associated outputs (desired response). In practice, however, such truth labeling presents significant problems in time, expense, and availability of data that often preclude its use. Furthermore, explicit learning after a system is placed into production interrupts normal system operation, which could significantly impact the productivity of the system.
It is desirable for the system to be able to learn online while performing productive work, without taking time away from production for learning and with minimal user intervention. To do this, the system must automatically learn from unlabeled data to improve its performance on future data. This is related to but different from the prior art approach known as unsupervised learning [Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Academic Press, 1999, pp. 351-383].
I. Learning Environment
Ideally, a decision system that needs to adapt to changed application characteristics would have an opportunity to learn about the application data in a highly supervised environment. A domain expert would inspect a large number of subjects, identify the ones on which the system should alarm and the ones on which it should not, and provide that information to the decision system. In practice, however, there are reasons why this may not be practical or possible. Data collection can be time-intensive, especially when the system is continually being exposed to new conditions. Expert truth labelers are sometimes not available in a timely fashion. In addition, for assembly-line processes, such as manufactured part defect inspection, it may simply be impossible to take the system offline for sufficient time to enable the domain expert to label a large volume of data and present it to the system for learning. To deal with this problem, the invention separates learning into two phases, startup learning 102 and online learning 106, shown in FIG. 1. The startup learning approach assumes a very small amount of labeled data 100 and so imposes strong constraints on the shape of the data distribution models that can be learned, so as to minimize the variance that results from limited sample size. Online learning 106 performs further learning by automatically acquiring large volumes of learning data, which are not labeled. A decision process 104 produces the decision output.
Feature Distribution Modeling
One of the prior art concepts of decision or classification is the representation of the data being classified, such as images, sampled sounds, or more abstract objects, in terms of a set of features, each of which is represented as a discrete or continuous value [L. Breiman, J. Friedman, R. Olshen, C. Stone, “Classification and Regression Trees”, CRC Press LLC, 1998, pp. 1-17]. For example, a mechanism for inspecting people for potential health problems might use as its features cholesterol level, blood pressure, and resting pulse rate.
For a particular population, the set of samples has a particular distribution across the feature space. FIG. 2 shows an example blood pressure feature probability density distribution for healthy 200 and unhealthy 202 patients.
A higher-level feature (or combination of features) that was specifically designed to be an indicator of a healthy patient (i.e. any condition on which we wish to alarm) is derived. This could be accomplished by incorporating a large variety of lower-level features to arrive at some output for which increasing value indicates higher probability. To identify the percent of the population that were most likely to be healthy based on the blood pressure feature, we would use the cumulative distribution function (CDF). The CDF of a one-dimensional probability density distribution f(x′) is described as:

$$c_f(x) = \int_{-\infty}^{x} f(x')\,dx'$$
In this example, x is the blood pressure value and the value of $c_f(x)$ for the healthy population distribution 200 is the percent of the population likely to be healthy.
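As an illustration of the CDF definition above, the following minimal sketch evaluates $c_f(x)$ for a normal density in two ways: in closed form via the error function, and by numerically integrating the density. The population parameters (mean 120, standard deviation 10) and the function names are hypothetical values chosen for illustration; they are not taken from FIG. 2.

```python
import math

def normal_pdf(x, mean, std):
    """Density f(x) of a normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean, std):
    """Closed-form c_f(x) for a normal density, using the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def numeric_cdf(pdf, x, lower, steps=20000):
    """Trapezoidal approximation of the integral of pdf from `lower` to x.

    `lower` stands in for -infinity and must lie far out in the left tail.
    """
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(x))
    for k in range(1, steps):
        total += pdf(lower + k * h)
    return total * h

# Hypothetical healthy-population model (assumed, not from the source).
MEAN, STD = 120.0, 10.0

def healthy_pdf(x):
    return normal_pdf(x, MEAN, STD)

x = 130.0
closed = normal_cdf(x, MEAN, STD)
approx = numeric_cdf(healthy_pdf, x, lower=MEAN - 10.0 * STD)
print(f"c_f({x}): closed form = {closed:.4f}, numeric = {approx:.4f}")
```

The numeric route applies to any density, including empirical ones for which no closed form exists; the closed form is shown only as a cross-check.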
To do automated decision making, first measure (or model) the probability density distributions for the populations of interest. Classification is done by thresholding these distributions. There are two basic paradigms for modeling densities: functional and empirical. A functional model is constrained to a particular shape and need only learn the parameters. For example, a distribution model might be constrained to a normal distribution and need only learn the mean and variance. An empirical distribution uses actual data to construct a distribution and therefore generally requires much more data in order to reduce variance due to limited sample size.
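The functional paradigm can be sketched as follows: each population is constrained to a normal shape, only its mean and variance are learned from a few labeled samples, and classification is a single threshold on the feature. The sample values are made up for illustration, and placing the threshold where the two fitted densities are equal is an assumption of this sketch; the text above does not specify how the threshold is chosen.

```python
import math
import statistics

def fit_normal(samples):
    """Functional model: shape fixed to normal; learn mean and variance only."""
    return statistics.fmean(samples), statistics.variance(samples)

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def density_crossing(model_a, model_b, iters=60):
    """Bisection for the point between the two means where the densities match.

    Assumes mean_a < mean_b and that each density dominates at its own mean.
    """
    (mean_a, var_a), (mean_b, var_b) = model_a, model_b
    lo, hi = mean_a, mean_b
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        diff = normal_pdf(mid, mean_a, var_a) - normal_pdf(mid, mean_b, var_b)
        if diff > 0:
            lo = mid   # model_a still dominates: crossing lies to the right
        else:
            hi = mid   # model_b dominates: crossing lies to the left
    return 0.5 * (lo + hi)

def classify(x, threshold):
    """Threshold decision on a single feature value."""
    return "unhealthy" if x > threshold else "healthy"

# Hypothetical labeled blood pressure samples (illustrative values only).
healthy_samples = [112.0, 118.0, 121.0, 125.0, 124.0]
unhealthy_samples = [142.0, 150.0, 155.0, 147.0, 156.0]

healthy_model = fit_normal(healthy_samples)
unhealthy_model = fit_normal(unhealthy_samples)
threshold = density_crossing(healthy_model, unhealthy_model)

print(f"threshold = {threshold:.2f}")
print(classify(115.0, threshold), classify(160.0, threshold))
```

Because only two parameters per population are estimated, a model of this kind can be fitted from a handful of samples, which is the trade-off the functional paradigm makes against the flexibility of an empirical distribution.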
A prior art class of models called kernel-based models [Theodoridis, S., Koutroumbas, K., “Pattern Recognition”, Academic Press, pp. 41-44, 1999] has densities in the one-dimensional case that take the form

$$f(x) = \sum_i w_i \, g(x - x_i)$$

under the condition that $\int f(x)\,dx = 1$, where g(x) is the kernel distribution, $x_i$ is the location of atom i, and $w_i$ is the weight of atom i. A commonly used kernel distribution is a Gaussian distribution. When g(x) is a Gaussian distribution,

$$\sum_i w_i = 1.$$

An example of a one-dimensional kernel-based model for three atoms is shown in FIG. 3. In the example, there are three atoms 300, 302, 304 at locations 0.2, 0.5, and 0.7 with weights of 0.7, 0.4, and 0.8 respectively. The individual component densities for atom 1 (301), atom 2 (303), and atom 3 (305) are weighted and summed to construct the overall model density 306. A simple way to produce a model of this type from a set of sampled empirical data is to set all the weights the same and use the samples themselves as atoms. The variance of the underlying Gaussian distributions can be estimated as the variance in the samples themselves.
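A minimal sketch of such a kernel-based model with a Gaussian kernel is given below. It uses the atom locations and weights quoted for FIG. 3; because those weights sum to 1.9, the sketch rescales them so that the density integrates to 1, which is an assumption of this example. The kernel width `SIGMA` and the sample list are likewise hypothetical. The helper `model_from_samples` follows the simple construction described above: equal weights, samples as atoms, and kernel variance taken from the sample variance.

```python
import math
import statistics

def gaussian_kernel(u, sigma):
    """Kernel g(u): a zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kernel_density(x, atoms, weights, sigma):
    """f(x) = sum_i w_i * g(x - x_i)."""
    return sum(w * gaussian_kernel(x - xi, sigma) for xi, w in zip(atoms, weights))

def normalize(weights):
    """Rescale the weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def model_from_samples(samples):
    """Equal weights, samples as atoms, kernel variance = sample variance."""
    n = len(samples)
    sigma = math.sqrt(statistics.variance(samples))
    return list(samples), [1.0 / n] * n, sigma

def integrate(f, lo, hi, steps=20000):
    """Trapezoidal rule, used here only to check the normalization."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

# Atoms and weights as quoted for FIG. 3; SIGMA is an assumed kernel width.
atoms = [0.2, 0.5, 0.7]
weights = normalize([0.7, 0.4, 0.8])
SIGMA = 0.05

fig3_total = integrate(lambda x: kernel_density(x, atoms, weights, SIGMA), -1.0, 2.0)
print(f"integral of the three-atom model = {fig3_total:.6f}")

# Empirical construction from hypothetical samples.
samples = [0.18, 0.22, 0.49, 0.52, 0.71]
s_atoms, s_weights, s_sigma = model_from_samples(samples)
sample_total = integrate(
    lambda x: kernel_density(x, s_atoms, s_weights, s_sigma), -3.0, 4.0
)
print(f"integral of the sample-based model = {sample_total:.6f}")
```

Using the full sample variance as the kernel variance gives a fairly smooth density; a narrower kernel would follow the individual samples more closely at the cost of higher variance in the estimate.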