1. Field of Invention
This invention relates generally to systems and methods for assigning input patterns to one of two classes, and relates specifically to a method for estimating an optimal Bayes decision boundary for discriminating between a class-of-interest and a class-other when information, through training samples or otherwise, is provided a priori only for the class-of-interest, without any a priori knowledge of any other classes that may exist in the data set to be classified.
2. Prior Art—FIGS. 1, 2, 3, and 4
Pattern recognition is used in a variety of engineering and scientific areas. Interest in the area of pattern recognition has been renewed recently due to emerging new applications which are very challenging [A. K. Jain, R. W. Duin, and J. Mao, “Statistical Pattern Recognition: A Review”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, January 2000, p. 4]. These applications include classification of remotely sensed images (thematic mapping, crop inventorying), document classification (searching for text documents), financial forecasting, organization and retrieval of multimedia databases, and recognition of objects of interest in images, such as real-time identification of high-value military targets in imagery and screening of x-rays and MRIs for medical conditions.
Most of the literature on pattern recognition is restricted to fully supervised pattern recognition applications where training samples are available which completely characterize all of the classes (objects) to be recognized in the data set to be classified. Using these training samples, optimal discriminant functions can be derived which provide minimum error in recognizing these known classes (objects) in a data set.
However, in the real world, there are many applications where prior knowledge, through training samples or otherwise, is only available for a single class, the class-of-interest. The distribution of the other-class may be unknown, may have changed, or may be inaccurate due to an insufficient number of samples used to estimate the distribution of the other-class. In addition, obtaining labeled samples for the purpose of defining all the classes in a given data set, by collecting ground truth or otherwise, may be very expensive or impossible. Often one is only interested in one class or a small number of classes.
The simplest technique for handling the problem of unknown classes consists of thresholding based on a measure of similarity of a measurement to the class-of-interest [B. Jeon and D. A. Landgrebe, “Partially Supervised Classification With Optimal Significance Testing,” Geoscience and Remote Sensing Symposium, 1993, pp. 1370-1372]. If the similarity measure (the statistical probability) is lower than some threshold, the sample is assumed to belong to an unknown class; otherwise, it is assigned to the class-of-interest. Even if an optimal threshold is selected, this procedure does not ensure minimum probability of error in classification.
The Adaptive Bayes Decision Rule
Another approach for handling unknown classes is to use a modified form of the Bayes decision rule. Bayes decision theory is a fundamental approach to the problem of pattern recognition [R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: John Wiley & Sons, 1973, pp. 11-17]. The approach is based on the assumption that the decision problem can be posed in probabilistic terms where all of the relevant probability values are known. Specifically, the application of a standard Bayesian classifier usually requires estimation of the posterior probabilities of each class. If information about the probability distributions of classes is available, the posterior probability can be calculated for each measurement and each measurement is attributed to the class with the highest posterior probability.
However, this traditional approach is not feasible when the presence of an unknown other-class has to be considered. Traditional approaches require that the statistics of the other-class be known in advance or that a training set be available for estimating the statistics of the other-class. Minter [T. C. Minter, “A Discriminant Procedure for Target Recognition in Imagery Data”, Proceedings of the IEEE 1980 National Aerospace and Electronic Conference—NAECON 1980, May 20-22, 1980] proposed an alternative formulation of the standard Bayes decision rule to address this problem.
The decision making process for Bayes pattern recognition can be summarized as follows: Given a set of measurement vectors, X={X1,X2, . . . XN}, it is desired to associate the measurements with either the class-of-interest or the other-class with minimum probability of error. The measurement, X, can conveniently be represented as a d-dimensional vector in the measurement space. This vector will be called the measurement vector or simply a sample or a pattern and will be denoted as X=(x1,x2, . . . xd)T where d is the number of measurements or the dimensionality of the measurement space.
For the moment, let us assume that complete information is available on the class-of-interest and the other-class. Using training samples from these two classes, we can estimate conditional probability density functions for the two classes, P(X/Cint) for the class-of-interest, and P(X/Cother) for the other-class. We will assume the prior probabilities for the two classes, PCint and PCother, are known. Using these conditional probability estimates, the standard maximum likelihood decision rule for this two-class pattern recognition problem is:
$$\text{If: } P_{C_{int}}P(X/C_{int}) \ge P_{C_{other}}P(X/C_{other}), \tag{1}$$
Classify X as the class-of-interest
where
P(X/Cint) = conditional probability density function of the class-of-interest
P(X/Cother) = conditional probability density function of class-other
PCint = prior probability of the class-of-interest
PCother = prior probability of class-other
Illustrated in FIG. 1, is a maximum likelihood classifier where we have assumed normal distributions for the class-conditional probability density functions P(X/Cint) 12 and P(X/Cother) 14 and the uni-variate Gaussian density function is defined as
$$P(X/C_i) = \frac{1}{(2\pi)^{1/2}\,\sigma_i}\; e^{-\frac{1}{2}\left(\frac{x-\mu_i}{\sigma_i}\right)^{2}} \tag{2}$$
Referencing FIG. 1, it can be seen that the decision boundary 10 is located at the point where the two conditional probability density functions are equal.
The density function parameters in FIG. 1 are μCint=7, μCother=13, σ2Cint=3, and σ2Cother=3. The prior probabilities are PCint=0.5 and PCother=0.5.
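As a rough numerical check of FIG. 1, the following sketch evaluates eq. (1) and eq. (2) with the parameters stated above and locates the decision boundary by bisection. The function names (`gaussian_pdf`, `weighted_difference`, `find_boundary`) and the choice of searching between the two means are illustrative assumptions, not part of the cited work.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Univariate Gaussian density, eq. (2), parameterized by the variance."""
    sigma = math.sqrt(variance)
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

# Parameters stated for FIG. 1.
MU_INT, MU_OTHER = 7.0, 13.0
VAR_INT, VAR_OTHER = 3.0, 3.0
P_INT, P_OTHER = 0.5, 0.5

def weighted_difference(x):
    """Left side minus right side of eq. (1); non-negative means class-of-interest."""
    return (P_INT * gaussian_pdf(x, MU_INT, VAR_INT)
            - P_OTHER * gaussian_pdf(x, MU_OTHER, VAR_OTHER))

def find_boundary(lo=MU_INT, hi=MU_OTHER, tol=1e-10):
    """Bisection between the two means, where the difference changes sign."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_difference(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

boundary = find_boundary()
```

With equal priors and equal variances the boundary falls at the midpoint of the two means, x = 10.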
An equivalent decision rule, to that shown in eq. (1), is obtained by dividing both sides of eq. (1) by the unconditional probability of X, which is P(X). We get
$$\text{If: } \frac{P_{C_{int}}P(X/C_{int})}{P(X)} \ge \frac{P_{C_{other}}P(X/C_{other})}{P(X)}, \tag{3}$$
Classify X as the class-of-interest
where
$$P(X) = P_{C_{int}}P(X/C_{int}) + P_{C_{other}}P(X/C_{other}) \tag{4}$$
Referencing FIG. 2, a graph is shown of P(X) 16, as defined in eq. (4). Eq. (3) is the Bayes decision rule. It can also be defined in terms of posterior probabilities as:
$$\text{If: } P(C_{int}/X) \ge P(C_{other}/X), \tag{5}$$
Classify X as the class-of-interest
where P(Cint/X) and P(Cother/X) are the posterior probability functions for the class-of-interest and the other-class respectively, which are defined as:
$$P(C_{int}/X) = \frac{P_{C_{int}}P(X/C_{int})}{P(X)} \tag{6}$$
$$P(C_{other}/X) = \frac{P_{C_{other}}P(X/C_{other})}{P(X)} \tag{7}$$
Referencing FIG. 3, a graph is shown of the class-of-interest 20 and class-other posterior distribution functions 22, as defined in eq. (6) and (7). Again, referencing FIG. 3, it can be seen that the two class-conditional posterior distribution functions have maximum values of one and the two functions are equal to one-half 24 at the decision boundary 18.
Noting that the two posterior probability functions sum to one, or
$$P(C_{int}/X) + P(C_{other}/X) = 1 \tag{8}$$
we can rearrange eq. (8) to get
$$P(C_{other}/X) = 1 - P(C_{int}/X) \tag{9}$$
Substituting eq. (9) into eq. (5) and simplifying, we obtain a decision rule which is equivalent to the standard Bayes decision function, but only involves the posterior distribution function for the class-of-interest, namely
$$\text{If: } P(C_{int}/X) \ge \tfrac{1}{2}, \tag{10}$$
Classify X as the class-of-interest
Otherwise classify X as class-other
where
$$P(C_{int}/X) = \frac{P_{C_{int}}P(X/C_{int})}{P(X)} \tag{11}$$
Eq. (10) is referred to as the adaptive Bayesian decision rule. Referencing FIG. 4, a graph is shown of the class-of-interest posterior distribution function, P(Cint/X) 28, as defined in eq. (11). The decision boundary 26 is located at the point where the class-of-interest posterior distribution function is equal to one-half, 30. The a priori probability PCint is assumed to be known.
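A minimal sketch of eq. (10) and eq. (11), reusing the FIG. 1 parameters, is given here. Note that P(X) is built from the known two-class mixture purely for illustration; in the setting described in this section it would have to be estimated from unlabeled data. All function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Univariate Gaussian density, eq. (2), parameterized by the variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)

P_INT = 0.5  # a priori probability of the class-of-interest, assumed known

def density_interest(x):
    """P(X/Cint) with the FIG. 1 parameters."""
    return gaussian_pdf(x, 7.0, 3.0)

def density_unconditional(x):
    """P(X), eq. (4); built from the known mixture for illustration only."""
    return P_INT * gaussian_pdf(x, 7.0, 3.0) + (1.0 - P_INT) * gaussian_pdf(x, 13.0, 3.0)

def posterior_interest(x):
    """P(Cint/X), eq. (11)."""
    return P_INT * density_interest(x) / density_unconditional(x)

def classify(x):
    """Adaptive Bayes decision rule, eq. (10)."""
    return "class-of-interest" if posterior_interest(x) >= 0.5 else "class-other"
```

The rule needs only the class-of-interest density, the unconditional density, and the prior; no density for class-other appears in `classify`.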
The adaptive Bayesian decision rule, eq. (10), is adaptive in the sense that it adapts the decision boundary to provide optimal discrimination between the class-of-interest and any unknown class-other which may exist in the data set to be classified. Implementing the adaptive Bayes rule requires that we obtain estimates for the two density functions in eq. (11). The class-conditional probability density function, P(X/Cint), in eq. (11), can be estimated using labeled samples from the class-of-interest. The unconditional probability density function, P(X), in eq. (11), is not conditioned on a class and can be estimated using unlabeled samples from the data set to be classified. The a priori probability, PCint, is assumed to be known. A number of nonparametric density function estimation techniques are available for estimating P(X). Using estimates for P(X/Cint) and P(X), the posterior distribution of the class-of-interest, eq. (11), can be defined, and the input data set can then be classified using the adaptive Bayes rule, eq. (10).
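The estimation procedure described in the preceding paragraph might be sketched as follows, using a Parzen estimate with a Gaussian kernel for both densities. The synthetic data, the sample sizes, the fixed random seed, and the bandwidth of 0.5 are all assumptions made for illustration; they are not taken from the cited work.

```python
import math
import random

def parzen_density(x, samples, bandwidth):
    """Parzen estimate: average of Gaussian kernels centered on the samples."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * bandwidth)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return norm * total / len(samples)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
sigma = math.sqrt(3.0)

# Labeled samples from the class-of-interest (synthetic, hypothetical data).
labeled = [rng.gauss(7.0, sigma) for _ in range(500)]
# Unlabeled samples from the data set to be classified: an even mix of both classes.
unlabeled = ([rng.gauss(7.0, sigma) for _ in range(500)]
             + [rng.gauss(13.0, sigma) for _ in range(500)])

P_INT = 0.5        # a priori probability, assumed known
BANDWIDTH = 0.5    # assumed kernel width; in practice this needs tuning

def posterior_interest(x):
    """Estimate of eq. (11) from the two Parzen density estimates."""
    p_x = parzen_density(x, unlabeled, BANDWIDTH)
    if p_x == 0.0:
        return 0.0  # no unlabeled data anywhere near x
    return P_INT * parzen_density(x, labeled, BANDWIDTH) / p_x

def classify(x):
    """Adaptive Bayes decision rule, eq. (10), applied to the estimate."""
    return "class-of-interest" if posterior_interest(x) >= 0.5 else "class-other"
```

Because the two densities are estimated independently, the ratio is not guaranteed to stay within [0, 1]; only its position relative to one-half matters for the decision.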
In addition, it is shown below that the class-of-interest posterior distribution, P(Cint/X), can be approximated using a least squares estimator.
Approximating the Class-of-Interest Posterior Distribution Function Using Nonparametric Density Estimation Techniques
Density functions P(X/Cint) and P(X), eq. (11), can be estimated using any of several non-parametric density estimation techniques such as histogramming, Parzen kernel density estimation, and Kth nearest neighbor estimation. Gorte [B. Gorte and N. Gorte-Kroupnova, “Non-parametric classification algorithm with an unknown class”, Proceedings of the International Symposium on Computer Vision, 1995, pp. 443-448], Mantero [P. Mantero, “Partially supervised classification of remote sensing images using SVM-based probability density estimation”, IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 3, March 2005, pp. 559-570], and Guerrero-Curieses [A. Guerrero-Curieses, A. Biasiotto, S. B. Serpico, and G. Moser, “Supervised Classification of Remote Sensing Images with Unknown Classes,” Proceedings of IGARSS-2002 Conference, Toronto, Canada, June 2002] investigated the use of the Kth nearest neighbor probability estimation [R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: John Wiley & Sons, 1973, pp. 95-98] in approximating the class-of-interest posterior distribution function, P(Cint/X), and its use in classifying remotely sensed data using the adaptive Bayes decision rule, eq. (10). Kth nearest neighbor has two disadvantages. The first disadvantage is that a Kth nearest neighbor estimate of the class-of-interest posterior probability function P(Cint/X) is very dependent on the value selected for K. Fukunaga [K. Fukunaga, D. M. Hummels, “Bayes Error Estimation Using Parzen and k-NN Procedures”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, Number 5, September 1987, pp. 634-643] concluded there is no optimal method for selecting a value for K. The approach often used is to evaluate the classification accuracy obtained using various values of K and select the value of K that maximizes classification accuracy.
However, this approach requires that labeled samples be available from all the classes for use in evaluating classification accuracy. The second disadvantage is that Kth nearest neighbor is computationally slow as a result of the need to repeatedly compute the distance from the measurement vector to be classified to the other measurement vectors in the data set.
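Both disadvantages can be seen in a small one-dimensional sketch of a Kth nearest neighbor density estimate, written here in the common form k / (N · V), where V = 2·r_k is the length of the interval reaching the Kth nearest sample. The function name and the toy data are hypothetical.

```python
def knn_density(x, samples, k):
    """One-dimensional Kth nearest neighbor density estimate: k / (N * 2 * r_k)."""
    # Every query computes and sorts the distance to every sample,
    # which is the source of the computational cost.
    distances = sorted(abs(x - s) for s in samples)
    radius = distances[k - 1]            # distance to the Kth nearest sample
    if radius == 0.0:
        raise ValueError("Kth neighbor coincides with x; choose a larger k")
    return k / (len(samples) * 2.0 * radius)

samples = [0.0, 1.0, 2.0, 3.0, 4.0]      # hypothetical toy data
estimate_k3 = knn_density(2.0, samples, 3)   # 3 / (5 * 2 * 1) = 0.3
estimate_k5 = knn_density(2.0, samples, 5)   # 5 / (5 * 2 * 2) = 0.25
```

The same query point yields different density values for K = 3 and K = 5, so a posterior formed as a ratio of two such estimates inherits that dependence on K.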
Least Squares Estimation of the Class-of-Interest Posterior Distribution Function
Minter [T. C. Minter, “A Discriminant Procedure for Target Recognition in Imagery Data”, Proceedings of the IEEE 1980 National Aerospace and Electronic Conference—NAECON 1980, May 20-22, 1980] proposed a least squares criterion for estimating the class-of-interest posterior distribution, P(Cint/X), in eq. (10). The class-of-interest posterior distribution can be approximated by minimizing the mean square difference between the estimated posterior distribution function and the true posterior distribution function for the class-of-interest. This is accomplished using the following least squares criterion:
$$J = \int \left(\hat{P}(C_{int}/X) - P(C_{int}/X)\right)^{2} P(X)\,dX + K \tag{12}$$
where
$$P(C_{int}/X) = \frac{P_{C_{int}}P(X/C_{int})}{P(X)} \tag{13}$$
In eq. (12), $\hat{P}(C_{int}/X)$ is the estimated class-of-interest posterior distribution function, P(Cint/X) is the true (but unknown) class-of-interest posterior distribution, and K is an arbitrary constant. However, the least squares criterion, shown in eq. (12), cannot be minimized directly since the true class-of-interest posterior distribution function, P(Cint/X), is unknown.
The least squares criterion is reformulated below to provide an equivalent criterion that can be minimized and used to estimate the class-of-interest posterior distribution function $\hat{P}(C_{int}/X)$.
First, expanding the least squares criterion, eq. (12), we get
$$J = \int \left(\hat{P}(C_{int}/X)^{2} - 2\hat{P}(C_{int}/X)P(C_{int}/X) + P(C_{int}/X)^{2}\right) P(X)\,dX + K \tag{14}$$
$$J = \int \hat{P}(C_{int}/X)^{2} P(X)\,dX - \int 2\hat{P}(C_{int}/X)P(C_{int}/X)P(X)\,dX + \int P(C_{int}/X)^{2} P(X)\,dX + K \tag{15}$$
Substituting eq. (13) for P(Cint/X) in the second term, the factor P(X) cancels:
$$J = \int \hat{P}(C_{int}/X)^{2} P(X)\,dX - \int 2\hat{P}(C_{int}/X)\frac{P_{C_{int}}P(X/C_{int})}{P(X)}P(X)\,dX + \int P(C_{int}/X)^{2} P(X)\,dX + K$$
$$J = \int \hat{P}(C_{int}/X)^{2} P(X)\,dX - \int 2\hat{P}(C_{int}/X)P_{C_{int}}P(X/C_{int})\,dX + \int P(C_{int}/X)^{2} P(X)\,dX + K \tag{16}$$
Now let
$$K' = 2P_{C_{int}} = 2P_{C_{int}}\int P(X/C_{int})\,dX \tag{17}$$
and we get:
$$J = \int \hat{P}(C_{int}/X)^{2} P(X)\,dX - 2P_{C_{int}}\int \left[\hat{P}(C_{int}/X) - 1\right] P(X/C_{int})\,dX + K' \tag{18}$$
Next we define the expected value with respect to the labeled samples from the class-of-interest as:
$$E_{C_{int}}(\circ) = \int (\circ)\,P(X/C_{int})\,dX \tag{19}$$
The expected value with respect to the unlabeled samples from P(X) (the data to be classified) is defined as:
$$E(\circ) = \int (\circ)\,P(X)\,dX \tag{20}$$
Using these definitions, the least squares criterion, eq. (18), can be rewritten as:
$$J = E\left[\hat{P}(C_{int}/X)^{2}\right] - 2P_{C_{int}}E_{C_{int}}\left[\hat{P}(C_{int}/X) - 1\right] + K' \tag{21}$$
We will approximate the class-of-interest posterior distribution, $\hat{P}(C_{int}/X)$, using the following linear combination of functions-of-the-measurements.
Let
$$\hat{P}(C_{int}/X) \cong A^{T}F(X) \tag{22}$$
where F(X) is a vector containing functions-of-the-measurements, or
$$F(X) = \left(f(X)_1, f(X)_2, \ldots, f(X)_n\right)^{T} \tag{23}$$
and A is a vector of weights for the f(X)'s
$$A = (a_1, a_2, \ldots, a_n)^{T} \tag{24}$$
Substituting eq. (22) for $\hat{P}(C_{int}/X)$ in eq. (21) we get:
$$J = E\left[\left(A^{T}F(X)\right)^{2}\right] - 2P_{C_{int}}E_{C_{int}}\left[A^{T}F(X) - 1\right] + K' \tag{25}$$
This formulation of the least squares error criterion, eq. (25), is equivalent to the original least squares criterion, eq. (12). However, unlike eq. (12), eq. (25) can be evaluated since it contains no unknowns. In addition, eq. (25) can be evaluated using only labeled samples from the class-of-interest and unlabeled samples from P(X), which is the data set to be classified.
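To illustrate that the criterion of eq. (25) involves only quantities available from samples, the sketch below replaces the two expectations with sample means. The cross term is taken with the negative sign that follows from eq. (18), the constant K′ is dropped because it does not affect the minimization, and the two-element basis F(X) = (1, x)ᵀ and the toy data are hypothetical choices.

```python
def features(x):
    """Hypothetical one-dimensional basis F(X) = (1, x)^T."""
    return [1.0, x]

def dot(a, f):
    """Inner product A^T F(X)."""
    return sum(ai * fi for ai, fi in zip(a, f))

def criterion(a, unlabeled, labeled, prior):
    """Sample estimate of eq. (25) with the constant K' dropped."""
    # E[(A^T F(X))^2], averaged over the unlabeled samples.
    first = sum(dot(a, features(x)) ** 2 for x in unlabeled) / len(unlabeled)
    # E_Cint[A^T F(X) - 1], averaged over the labeled class-of-interest samples.
    second = sum(dot(a, features(x)) - 1.0 for x in labeled) / len(labeled)
    return first - 2.0 * prior * second

unlabeled = [0.0, 2.0]   # toy unlabeled data
labeled = [0.0]          # toy class-of-interest sample
j_one = criterion([1.0, 0.0], unlabeled, labeled, 0.5)    # 1.0 - 2*0.5*0.0
j_half = criterion([0.5, 0.0], unlabeled, labeled, 0.5)   # 0.25 - 2*0.5*(-0.5)
```

In this toy case the constant estimate 0.5 scores lower (0.75) than the constant estimate 1.0 (1.0), as expected when only half of the data is assumed to belong to the class-of-interest.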
An estimate of the parameters of the weighting vector A, eq. (24), is obtained by minimization of the least-square criterion, defined in eq. (25), with-respect-to the vector A.
Differentiating J in eq. (25) with-respect-to A and setting to zero we get:
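$$\frac{\partial J}{\partial A} = 2E\left[\left(F(X)F(X)^{T}A\right)\right] - 2P_{C_{int}}E_{C_{int}}\left[F(X)\right] = 0 \tag{26}$$
Rearranging yields
$$E\left[\left(F(X)F(X)^{T}\right)\right]A = P_{C_{int}}E_{C_{int}}\left[F(X)\right] \tag{27}$$
and finally we get
$$A = P_{C_{int}}E\left[\left(F(X)F(X)^{T}\right)\right]^{-1} \cdot E_{C_{int}}\left[F(X)\right] \tag{28}$$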
Given a set of N unlabeled samples (X1,X2, . . . XN) from the data set to be classified and M labeled samples from the class-of-interest, (X1(Cint),X2(Cint), . . . XM(Cint)), the weighting vector A may be estimated as follows:
$$A = P_{C_{int}} \left[ \frac{1}{N} \sum_{i=1}^{N} \left( F(X_i) F(X_i)^T \right) \right]^{-1} \cdot \frac{1}{M} \sum_{j=1}^{M} \left[ F(X_j(C_{int})) \right] \tag{29}$$
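As a minimal sketch of how eq. (29) might be evaluated numerically, the following NumPy code computes A from a set of unlabeled samples and a set of class-of-interest samples. The function names `estimate_A` and `classify`, the argument names, and the use of `numpy.linalg.solve` in place of an explicit matrix inverse are illustrative assumptions and are not taken from the method as described.

```python
import numpy as np

def estimate_A(F, X_unlabeled, X_interest, prior_interest):
    """Evaluate eq. (29) for a user-supplied basis function F (hypothetical helper)."""
    Phi = np.array([F(x) for x in X_unlabeled])      # N x k matrix of F(X_i)
    Phi_int = np.array([F(x) for x in X_interest])   # M x k matrix of F(X_j(Cint))
    S = Phi.T @ Phi / len(Phi)       # (1/N) * sum of F(X_i) F(X_i)^T
    m = Phi_int.mean(axis=0)         # (1/M) * sum of F(X_j(Cint))
    # Assumption: solve S a = m instead of forming the inverse explicitly.
    # This is mathematically equivalent to the inverse in eq. (29).
    return prior_interest * np.linalg.solve(S, m)

def classify(A, F, x):
    """Threshold A^T F(X) at 1/2, as in the adaptive Bayes decision rule."""
    return "class-of-interest" if A @ F(x) >= 0.5 else "class-other"

# Toy one-dimensional usage with a first-order basis F(x) = (1, x)^T.
F_linear = lambda x: np.array([1.0, x])
A_toy = estimate_A(F_linear, [0.0, 0.0, 1.0, 1.0], [1.0, 1.0], 0.5)
```

In this toy data set half of the unlabeled samples sit at x = 1 and all class-of-interest samples sit at x = 1, so the fitted function takes the value 1 at x = 1 and 0 at x = 0.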
Using the parameter vector A, estimated in eq. (29), the adaptive Bayes decision rule, eq. (10), can now be written as

If: $A^T F(X) \ge \tfrac{1}{2}$,  (30)
        Classify X as the class-of-interest
        Otherwise, classify X as class-other

where eq. (22) has been substituted for $P(C_{int}/X)$ in eq. (10).

Least Squares Approximation of the Posterior Distribution of the Class-of-Interest Using a Polynomial
The choice of functions used to approximate the posterior distribution function $P(C_{int}/X)$ is important. Minter [T. C. Minter, “A Discriminant Procedure for Target Recognition in Imagery Data”, Proceedings of the IEEE 1980 National Aerospace and Electronic Conference, NAECON 1980, May 20-22, 1980] proposed using a multi-dimensional polynomial to approximate the class-of-interest posterior probability distribution function, $\hat{P}(C_{int}/X)$. The class-of-interest posterior distribution function, $\hat{P}(C_{int}/X)$, can be approximated with a polynomial of any order (first, second, third, etc.). However, the order of the polynomial used to fit the class-of-interest posterior distribution also determines the order of the decision boundary used to separate the two classes, the class-of-interest and the class-other.
For example, if we have a two-dimensional measurement vector, we can approximate the class-of-interest posterior probability distribution function using a second order polynomial function of the form:

$$\hat{P}(C_{int}/X) \cong a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 + a_4 x_1^2 + a_5 x_2^2 \tag{31}$$

or, using vector notation,

$$\hat{P}(C_{int}/X) \cong A^T F(X) \tag{32}$$

where

$$A = (a_0, a_1, a_2, a_3, a_4, a_5)^T \tag{33}$$

and

$$F(X) = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)^T \tag{34}$$
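The second order basis of eq. (34) and the inner product of eq. (32) can be sketched in a few lines of Python. The function names `F2` and `posterior_estimate` and the coefficient values in the usage line are hypothetical and chosen only for illustration.

```python
def F2(x1, x2):
    """Second order polynomial basis of eq. (34) for a two-dimensional X."""
    return [1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2]

def posterior_estimate(A, x1, x2):
    """Evaluate A^T F(X), eq. (32), for coefficients A = (a0, ..., a5)."""
    return sum(a * f for a, f in zip(A, F2(x1, x2)))

# Illustrative coefficients only: a0 = 0.1 and a4 = 0.1, all others zero.
example_value = posterior_estimate([0.1, 0.0, 0.0, 0.0, 0.1, 0.0], 2.0, 3.0)
```

Because the estimate is a quadratic function of x1 and x2, setting it equal to a constant threshold traces a quadratic curve in the measurement space.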
Use of the second order function in eq. (31) implies the decision boundary will be quadratic. If the two class density functions are Gaussian with unequal covariances, a quadratic decision boundary is optimal [R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: John Wiley & Sons, 1973, p. 30].
If the expected decision boundary is highly complex, an even higher order polynomial may be required.
The use of polynomials in approximating the class-of-interest posterior probability distribution function, $\hat{P}(C_{int}/X)$, has two disadvantages. First, a priori knowledge of the complexity of the decision boundary is required to select the appropriate order polynomial. Second, the size of the F(X) vector, eq. (34), is a function of the number of measurements and the order of the polynomial used. For a second order polynomial, the number of elements in F(X), eq. (34), is (1+2d+d(d−1)/2), where d is the number of dimensions or number of measurements. When the size of F(X) becomes too large, the inversion of the $F(X)F(X)^T$ matrix, eq. (29), becomes problematic and limits the usefulness of polynomial approximations of $\hat{P}(C_{int}/X)$. For example, for a 25 dimension measurement vector (d=25) and a second order polynomial, the vector F(X) has 351 elements and the $F(X)F(X)^T$ matrix, eq. (29), is a 351×351 matrix. Cross-product terms account for most of the 351 elements in vector F(X). Inverting such a large matrix is computationally expensive and prone to numerical errors. In addition, classification of one of these twenty-five dimension measurement vectors would require the multiplication of a 351×351 matrix and a 351×1 vector, which is also computationally expensive.
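The growth in the size of F(X) can be checked with a short Python sketch. A second order basis in d dimensions contains one constant term, d linear terms, d squared terms, and d(d−1)/2 cross-product terms. The function names `second_order_basis` and `basis_size` are hypothetical, and the term ordering used here (squares before cross-products) is an assumption that differs from the ordering in eq. (34).

```python
from itertools import combinations

def second_order_basis(x):
    """Build a second order F(X) for a d-dimensional measurement vector x."""
    linear = list(x)                                  # d linear terms
    squares = [v * v for v in x]                      # d squared terms
    cross = [a * b for a, b in combinations(x, 2)]    # d(d-1)/2 cross-products
    return [1.0] + linear + squares + cross

def basis_size(d):
    """Closed-form element count: 1 + 2d + d(d-1)/2."""
    return 1 + 2 * d + d * (d - 1) // 2

# For d = 25, count the cross-product terms and the total number of elements.
n_cross_25 = 25 * 24 // 2
n_total_25 = len(second_order_basis([0.0] * 25))
```

For d = 25 the cross-product terms number 300 of the 351 elements, which is why they dominate the size of F(X).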