1. Field of Invention
This invention relates generally to systems and methods for extracting optimal features for discriminating between two classes, a class-of-interest and a class-other, when training samples or otherwise, are provided a priori only for the class-of-interest thus eliminating the requirement for any a priori knowledge of the other classes in the input-data-set while exploiting the potentially robust and powerful feature extraction capability provided by fully supervised feature extraction approaches.
2. Prior Art
Pattern recognition requires that objects be described in terms of a set of measurable features. In pattern recognition, the patterns are generally represented as a vector of feature values (measurements). The selection of features can have considerable impact on classification performance.
Typically, the objective in pattern recognition is to minimize the classification error rate. The classification error rate is closely related to class separation, training sample size, dimensionally, and measurement noise. Hughes [G. F. Hughes, “On the Mean Accuracy of Statistical pattern Recognizers,” IEEE Trans. Info. Theory, vol. IT-14, no. 1 (1968): pp. 55-63] showed that classification accuracy normally increases as the number of features increases until reaching an optimal number of features, at which point classification accuracy declines with the addition of new features. Adding features beyond an optimal number only adds noise to the parameters used in estimating the decision boundary, thus reducing classifier performance. The optimal number of features is a function of training sample size, measurement noise, and class separation.
Fukunaga [K. Fukunaga, “Effects of sample size in classifier design,” IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 873-885, August 1989] showed that the number of training samples, required for a linear classifier to achieve a certain amount of error, is proportional to the number of dimensions. For quadratic classifiers, it is proportional to the square of the number of dimensions.
Feature selection seeks to reduce the original number of features used in classification while maintaining acceptable classification accuracy. Less discriminatory features are eliminated, leaving a subset of the original features which retain sufficient information to discriminate well among classes. Feature selection typically involves use of iterative search techniques which evaluate various combinations of features to select the best combination. If a large number of features are involved, feature selection tends to be computationally intensive [A. K. Jain, R. W. Duin, and J. Mao, “Statistical Pattern Recognition: A Review”, IEEE transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, January 2000, pp.14-17].
Feature extraction is a more general method in which the original set of features is transformed using a linear combination of the original vectors [A. K. Jain, R. W. Duin, and J. Mao, “Statistical Pattern Recognition: A Review”, IEEE transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, January 2000, pp. 12-14]. In particular, linear feature extraction techniques find a set of vectors that effectively represents the information content of an observation while reducing dimensionality. Many feature extraction methods derive the linear transformation in a single iteration, thus utilizing significantly less computer resources.
Interest in feature selection and extraction is being driven by: 1) the availability of new high-spectral resolution imaging sensors; 2) multi-sensor fusion (concatenating data from multiple sensors); 3) integration of data from multiple data models (sensor data is modeled using different approaches with the model parameters from multiple models being used as features); 4) the need for low dimensional (two or three dimensions) data representations to aid in visual exploration of data; and 5) data mining [A. K. Jain, R. W. Duin, and J. Mao, “Statistical Pattern Recognition: A Review”, IEEE transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 14, January 2000].
In order to avoid the difficulties associated with too many features, preprocessing of the data using linear combinations of features to reduce dimensionality is highly desirable.
Most of the literature on feature extraction is restricted to pattern recognition applications where training samples are available which completely characterize all of the classes (objects) to be recognized in a data set. Using these training samples, optimal sets of features can be extracted which provide minimum error in classifying a data set.
However, in the real world, there are many applications where a priori knowledge, through training samples or otherwise, is only available for a single class, the classes-of-interest. The distribution of the other class may be 1) unknown, 2) may have changed, 3) may be inaccurate due to insufficient numbers of samples used to estimate the distribution of the other classes, or 4) the cost of obtaining labeling samples, for purposes of defining all the classes in a given dataset, by collecting ground truth or otherwise, may be very expensive or impossible to obtain. Often one is only interested in one class or a small number of classes.
This invention relates generally to systems and methods for extracting optimal features for discriminating between two classes, a class-of-interest and a class-other, when training samples or otherwise, are provided a priori only for the class-of-interest thus eliminating the requirement for any a priori knowledge of all of the other classes in a data set while exploiting the potentially robust and powerful discriminating capability provided by fully supervised feature extraction approaches. Adaptive Bayes Feature extraction is accomplished by exploiting the ability of the adaptive Bayes classifier to define optimal decision boundaries using only labeled samples from the class-of-interest and unlabeled samples from the data to be classified. Optimal features are derived from vectors which are normal to the decision boundary defined by the adaptive Bayes classifier.
The Adaptive Bayesian Approach to Pattern Recognition
Bayes decision theory is a fundamental approach to the problem of pattern recognition. The approach is based on the assumption that the decision problem can be poised in probabilistic terms where all of the relevant probability values are known. Specifically, the application of a standard Bayesian statistical supervised classifier usually requires estimating, the posterior probabilities of each class [R. O. Duda and P. E. Hart, Pattern classification and Scene Analysis, New York: John Wiley & Sons, 1973, pp. 10-31]. If information about probability distributions of classes is available, the posterior probability is calculated for every measurement and each measurement is attributed to the class with the maximum posterior probability.
The decision making process for using Bayes pattern recognition to classify data into two known classes can be summarized as follows: Given a set of measurement vectors, it desired is to associate the measurements with either the classes-of-interest or a class-other with minimum probability error. The set of measurements, X, can conveniently be represented as a vector in the measurement space. This vector will be called the measurement vector or simply a sample or a pattern and will be denoted as X=(x1, x2, . . . xd)T where d is the number of measurements or the dimensionality of the measurement space.
Let us defined the probability density functions of the two classes as follows—the class-of-interest is P(X/Cint) and the class-other is P(X/Cother). Each class has associated a priori probabilities of PCint and PCother, respectively.
The standard maximum likelihood decision rule for this two class pattern recognition problem is:If: PCintP(X/Cint)≧PCotherP(X/Cother),  (1)                Classify X as the class-of-interest        Otherwise, classify X as the class-otherwhere            P(X/Cint)=Conditional probability density function of the class-of-interest    P(X/Cother)=Conditional probability density function of class-other    PCint=a priori probability of the class-of-interest    PCother=a priori probability of class-other
An equivalent decision rule, to eq. (1), can be obtained by dividing both sides of eq. (1) by the unconditional probability of X, which is P(X), or:
                                                        If              ⁢                              :                            ⁢                                                                    P                                          C                      int                                                        ⁢                                      P                    ⁡                                          (                                              X                        /                                                  C                          int                                                                    )                                                                                        P                  ⁡                                      (                    X                    )                                                                        ≥                                                            P                                      C                    other                                                  ⁢                                  P                  ⁡                                      (                                          X                      /                                              C                        other                                                              )                                                                              P                ⁡                                  (                  X                  )                                                              ;                ⁢                                  ⁢                  Classify          ⁢                                          ⁢          X          ⁢                                          ⁢          as          ⁢                                          ⁢          the          ⁢                                          ⁢          class          ⁢                      -                    ⁢          of          ⁢                      -                    ⁢          interest                ⁢                                  ⁢                  Otherwise          ,                      classify            ⁢                                                  ⁢            X            ⁢                                                  ⁢            as            ⁢                                                  ⁢            the            ⁢                                                  ⁢            class            ⁢                          -                        ⁢            other                                              (        2        )            whereP(X)=PCintP(X/Cint)+PCotherP(X/Cother)  (3)Eq. (2) is the Bayes decision rule. It can be defined in terms of posterior probabilities as:If: P(Cint/X)≧P(Cother/X),  (4)                Classify X as the class of interest        Otherwise, classify X as the class-otherwhere P(Cint/X) and P(Cother/X) are the posterior probability distribution functions for the class-of-interest and the class-other. The posterior probability distribution functions are defined as:        
                              P          ⁡                      (                                          C                int                            /              X                        )                          =                                            P                              C                int                                      ⁢                          P              ⁡                              (                                  X                  /                                      C                    int                                                  )                                                          P            ⁡                          (              X              )                                                          (        5        )            
                              P          ⁡                      (                                          C                other                            /              X                        )                          =                                            P                              C                other                                      ⁢                          P              ⁡                              (                                  X                  /                                      C                    other                                                  )                                                          P            ⁡                          (              X              )                                                          (        6        )            
Minter [T. C. Minter, “A Discriminant Procedure for Target Recognition in Imagery Data”, Proceedings of the IEEE 1980 National Aerospace and Electronic Conference—NAECON 1980, May 20-22, 1980] proposed an alternative to the Bayes decision rule. It was noted that the posterior probability functions sum to 1, namelyP(Cint/X)+P(Cother/X)=1  (7)Rearranging equation (7) we getP(Cother/X)=1−P(Cint/X)  (8)Substituting eq. (8) into eq. (4) and simplifying we get an alternative Bayes decision function which only involves the posterior probability function for the class-of-interest, P(Cint/X), namelyIf: P(Cint/X)≧½,  (9)                Classify X as the class-of-interest        Otherwise, classify X as class-otherwhere again        
                              P          ⁡                      (                                          C                int                            /              X                        )                          =                                            P                              C                int                                      ⁢                          P              ⁡                              (                                  X                  /                                      C                    int                                                  )                                                          P            ⁡                          (              X              )                                                          (        10        )            
Equation (9) is the referred to as the adaptive Bayes decision rule or the adaptive Bayes classifier. This formulation of the Bayes decision rule is particularly useful since it only requires a priori knowledge of the of the conditional density function for the class-of-interest, P(X/Cint), and the a priori probability for the class-of-interest, PCint. The unconditional probability function, P(X) can be estimated using any of a number of nonparametric density function techniques using only unlabeled samples from the data set of to be classified. However, an alternative least squares estimation technique is presented below for estimating of the posterior probability distribution function, P(Cint/X), which uses only labeled sample from the class-of-interest and unlabeled samples from the data set to be classified.
The adaptive Bayesian decision rule, eq. (9), is adaptive in the sense that it adapts the decision boundary to provide optimal discrimination between the class-of-interest and any unknown class-other which may exist in the data set to be classified.
The adaptive Bayes decision rule, equation (9), can also be exploited for use in feature extraction. In particular, it will be shown that normal vectors to the decision boundary can be constructed to the decision boundary provided by adaptive Bayes decision rule. These normal vectors can then be used to extract discriminate features. This approach allows optimal features to be extracted using only labeled sample from the class-of-interest and unlabeled samples from the data set to be classified.
Procedures for Estimation of the Adaptive Bayes Decision Rule
First we will review two approaches for estimating the class-of-interest posterior distribution function, P(Cint/X), used by the adaptive Bayes decision rule, eq. (9). The first approach uses nonparametric density function estimation techniques to approximate P(Cint/X). The second approach approximates the posterior distribution function P(Cint/X) in eq. (9) using a least squares estimation procedure.
Approximating the Class-of-Interest Posterior Distribution Function Using Nonparametric Density Estimation Techniques
The density functions P(X/Cint) and, P(X) in eq. (10), can be estimated using any of several non-parametric density techniques such as histogramming, Parzen kernel density estimation, and Kth nearest neighbor estimation. Gorte [B. Gorte and N. Gorte-Kroupnova, “Non-parametric classification algorithm with an unknown class”, Proceedings of the International Symposium on Computer Vision, 1995, pp. 443-448], Mantero [P. Mantero, “Partially supervised classification of remote sensing images using SVM-based probability density estimation”, IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 3, March 2005, pp. 559-570], and Guerrero-Curieses [A. Guerrero-Curieses, A Biasiotto, S. B. Serpico, and G. Moser, “Supervised Classification of Remote Sensing Images with Unknown Classes,” Proceedings of IGARSS-2002 Conference, Toronto, Canada, June 2002] investigated the use of the Kth nearest neighbor probability estimation [R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: John Wiley & Sons, 1973, pp. 95-98] to approximate the class-of-interest posterior distribution function, P(Cint/X). They used a Kth nearest neighbor approximation of P(Cint/X) to classify remotely sensed data using the adaptive Bayes decision rule, eq. (9).
Kth nearest neighbor has two disadvantages. The first disadvantage is that a Kth nearest neighbor estimate of the class-of-interest posterior probability function P(Cint/X) is very dependent on the value selected for K. Fukunaga [K. Fukunaga, D. M. Hummels, “Bayes Error Estimation Using Parzen and k-NN Procedures”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, Number 5, September 1987, p. 634-643] concluded there is no optimal method for selecting a value for K. The approach often used is to evaluate the classification accuracy obtained using various values of K and select the value of K that maximizes classification accuracy. However, this approach requires that labeled samples be available from all the classes for use in evaluating classification accuracy. The second disadvantage is that Kth nearest neighbor is computationally slow as a result of the need to repeatedly compute the distance, from the measurement vector to be classified, to the other measurements vectors in the data set.
Least Squares Estimation of the Adaptive Bayes Decision Function
Minter [T. C. Minter, “A Discriminant Procedure for Target Recognition in Imagery Data”, Proceedings of the IEEE 1980 National Aerospace and Electronic Conference—NAECON 1980, May 20-22, 1980], proposed a least squares criterion for estimating the class-of-interest posterior distribution in eq. (9). The posterior distribution for the class-of-interest can be approximated by minimizing the mean square difference between the estimated posterior distribution function and the true posterior distribution function for the class-of-interest. This is accomplished using the following least squares criterion:J=∫({circumflex over (P)}(Cint/X)−P(Cint/X))2P(X)dX+K  (11)where
                              P          ⁡                      (                                          C                int                            /              X                        )                          =                                            P                              C                int                                      ⁢                          P              ⁡                              (                                  X                  /                                      C                    int                                                  )                                                          P            ⁡                          (              X              )                                                          (        12        )            
In eq. (11), {circumflex over (P)}(Cint/X) is the estimated posterior distribution for the class-of-interest, and P(Cint/X) is the true (but unknown) posterior distribution of the class-of-interest. K is an arbitrary constant.
However, the least squares criteria, eq. (11), cannot be minimized directly since the true posterior distribution function, P(Cint/X), is unknown.
However, the least square criterion, eq. (11), can be reformulated to provide an equivalent criterion that can be minimized to estimate the class-of-interest posterior distribution function {circumflex over (P)}(Cint/X) in eq. (9) as is shown below.
First, expanding the least squares criteria, eq. (11), we getJ=∫({circumflex over (P)}(Cint/X)2−2{circumflex over (P)}(Cint/X)P(Cint/X)+P(Cint/X)2)P(X)dX+K  (13)J=∫({circumflex over (P)}(Cint/X)2−2{circumflex over (P)}(Cint/X)P(Cint/X)+P(Cint/X)2)P(X)dX+K  (14)J=∫({circumflex over (P)}(Cint/X)2P(X)dX−∫2{circumflex over (P)}(Cint/X)P(Cint/X)P(X)dX+∫P(Cint/X)2P(X)dX+K  (15)
                    J        =                  ∫                      (                                                                                                      P                      ^                                        ⁡                                          (                                                                        C                          int                                                /                        X                                            )                                                        2                                ⁢                                  P                  ⁡                                      (                    X                    )                                                  ⁢                                  ⅆ                  X                                            -                              ∫                                  2                  ⁢                                                            P                      ^                                        ⁡                                          (                                                                        C                          int                                                /                        X                                            )                                                        ⁢                                                                                    P                                                  C                          int                                                                    ⁢                                              P                        ⁡                                                  (                                                      X                            /                                                          C                              int                                                                                )                                                                                                            P                      ⁡                                              (                        X                        )                                                                              ⁢                                      P                    ⁡                                          (                      X                      )                                                        ⁢                                      ⅆ                    X                                                              +                              ∫                                                                            P                      ⁡                                              (                                                                              C                            int                                                    /                          X                                                )                                                              2                                    ⁢                                      P                    ⁡                                          (                      X                      )                                                        ⁢                                      ⅆ                    X                                                              +              K                                                          (        16        )            J=∫({circumflex over (P)}(Cint/X)2P(X)dX−∫2{circumflex over (P)}(Cint/X)PCintP(X/Cint)P(X)dX+∫P(Cint/X)2P(X)dX+KNow letK′=2PCint=2PCint∫P(X/Cint)dX  (17)and we get:J=∫({circumflex over (P)}(Cint/X)2P(X)dX−2PCint∫[{circumflex over (P)}(Cint/X)−1]P(X/Cint)dX+K′  (18)where K′ is another constant, as defined in eq. (17).
Next we define the expected value with respect to the labeled samples from the class-of-interest as:ECint(∘)=∫(∘)P(X/Cint)dX  (19)
The expected value with respect to the unlabeled samples from P(X) (the data to be classified) is defined as:E(∘)=∫(∘)P(X)dX  (20)
Using these definitions, the least square criteria, eq. (18), can be rewritten as:J=E[{circumflex over (P)}(Cint/X)2]+2PCintECint[{circumflex over (P)}(Cint/X)−1]+K′  (21)
The posterior distribution of the class-of-interest {circumflex over (P)}(Cint/X) is approximated using the following linear combination of “functions of the measurements”.{circumflex over (P)}(Cint/X)≅ATF(X)  (22)where F(X) is as vector containing functions of the measurements orF(X)=(f(X)1, f(X)2, . . . f(X)n)T  (23)and A is a vector of weights for the f(X)'sA=(a1, a2, . . . an)T  (24)Substituting eq. (25) for {circumflex over (P)}(Cint/X) in eq. (24) we get:J=E[(ATF(X))2]+2PCintECint[ATF(X)−1]+K′  (25)
This formulation of the least square error criteria, eq. (25), is equivalent to the original least squares criterion, eq. (11), however, eq. (25) can be evaluated since there are no unknowns. In addition, eq. (25) can be evaluated using only labeled samples from the class-of-interest and unlabeled samples from the data set to be classified, P(X).
An estimate of the parameters of the weighting vector A, in eq. (24), is obtained by minimization of the least-square criterion, defined in eq. (25).
Differentiating the J in eq. (25) with-respect-to A and setting to zero we get:
                                          δ            ⁢                                                  ⁢            J                                δ            ⁢                                                  ⁢            A                          =                                            2              ⁢                              E                ⁡                                  [                                      (                                                                  F                        ⁡                                                  (                          X                          )                                                                    ⁢                                                                        F                          ⁡                                                      (                            X                            )                                                                          T                                            ⁢                      A                                        )                                    ]                                                      +                          2              ⁢                              P                                  C                  int                                            ⁢                                                E                                      C                    int                                                  ⁡                                  [                                      F                    ⁡                                          (                      X                      )                                                        ]                                                              =          0                                    (        26        )            Rearranging yieldsE[(F(X)F(X)T)]A=PCintECint[F(X)]  (27)and finally we getA=PCintE[(F(X)F(X)T)]−1·ECint[F(X)]  (28)
Given a set of N unlabeled samples (X1, X2, . . . XN) from the data set to be classified and M labeled samples from the class-of-interest, (X1(Cint), X2(Cint), . . . XM(Cint)) the parameter vector A may be estimated as follows:
                    A        =                                                                              P                                      C                    int                                                  ⁡                                  [                                                            1                      N                                        ⁢                                                                  ∑                                                  i                          =                          1                                                N                                            ⁢                                                                                          ⁢                                              (                                                                              F                            ⁡                                                          (                                                              X                                i                                                            )                                                                                ⁢                                                                                    F                              ⁡                                                              (                                                                  X                                  i                                                                )                                                                                      T                                                                          )                                                                              ]                                                            -                1                                      ·                          1              M                                ⁢                                    ∑                              j                =                1                            m                        ⁢                                                  ⁢                          [                              f                ⁡                                  (                                                            X                      j                                        ⁡                                          (                                              C                        int                                            )                                                        )                                            ]                                                          (        29        )            Using the parameter vector A, estimated in eq. (29), the adaptive Bayes decision rule, eq. (9) can be rewritten asIf: ATF(X)≧½,  (30)                Classify X as class-of-interest        Otherwise classify X as class-otherwhere eq. (22) has been substituted for {circumflex over (P)}(Cint/X), in the adaptive Bayes decision rule, eq. (9).Functions for Least Squares Approximation of the Posterior Distribution of the Class-Of-Interest        
Below, two approaches are presented for approximating the class-of-interest posterior distribution, P(Cint/X) in eq. (9), using the least square criteria. The first method approximates the class-of-interest posterior distribution function using a polynomial. The second method approximates the class-of-interest posterior distribution function using Parzen kernels.
Polynomial Approximation of the Posterior Distribution Function
Minter [T. C. Minter, “A Discriminant Procedure for Target Recognition in Imagery Data”, Proceedings of the IEEE 1980 National Aerospace and Electronic Conference—NAECON 1980, May 20-22, 1980] proposed using a polynomial to approximate the class-of-interest posterior probability {circumflex over (P)}(Cint/X).
The class-of-interest posterior distribution function, {circumflex over (P)}(Cint/X), can be approximated with a polynomial of any order—first, second, third, etc. However, the order of the polynomial used to fit the class-of-interest posterior distribution also determines the order of the decision boundary used to separate the class-of-interest and the class-other.
For example, if we have a two dimension measurement vector, we can approximate the class-of-interest posterior probability distribution function using a second order polynomial function, of the form:{circumflex over (P)}(Cint/X)≅a0+a1x1+a2x2+a3x1x2+a4x12+a5x22  (31)or using vector notation{circumflex over (P)}(Cint/X)≅ATF(X)  (32)whereA=(a0, a1, a2, a3, a4, a5)T  (33)andF(X)=(1, x1, x2, x1x2, x12, x22)  (34)
Use of the second order function in eq. (31) implies the decision boundary will be quadratic. If the distributions of the two class density functions are Gaussian with unequal covariance matrices, a quadratic decision boundary is optimal [R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: John Wiley & Sons, 1973, pp. 30].
If the expected decision boundary is highly complex, an even higher order polynomial may be required.
The use of polynomials in approximating the class-of-interest posterior probability distribution function, {circumflex over (P)}(Cint/X), has two disadvantages. First, a priori knowledge of the complexity of the decision boundary is required to select the appropriate order polynomial.
Second, the size of the F(X) vector, eq. (34), is a function of the number of measurements and the order of the polynomial used. For a second order polynomial, the number of elements in F(X), eq. (34), is (1+2d+d(d−1)/2) where d is the number of dimensions or number of measurements. When the size of F(X) becomes too large, the inversion of the F(X)F(X)T matrix, eq. (29), becomes problematic and limits the usefulness of polynomial approximations of {circumflex over (P)}(Cint/X). For example, for a 25 dimension measurement vector (d=25) and a second order polynomial, the vector, F(X) eq. (34), has 351 elements and the F(X)F(X)T matrix, eq. (29), is a 351×351 matrix. Cross-product terms account for most of the 351 elements in vector F(X) . Inverting such a large matrix is computationally expensive and prone to numerical errors. In addition, classification of one of these twenty-five dimension measurement vectors would require the multiplication of a 351×351 matrix and a 351×1 vector, which is also computationally expensive.
However, if the dimensionality of the data is low (for example, if d=8 and a second order polynomial is used) then the size of the F(X)F(X)T matrix is only a 45×45 matrix. Most computers can easily invert a matrix of this size. Therefore polynomials are useful in approximating of {circumflex over (P)}(Cint/X) for low dimension data.
Approximating the Class-of-Interest Posterior Distribution Function with Parzen Kernels Densities
The kernel method of estimating density functions is a well-known and much studied technique for nonparametric density estimation [R. O. Duda and P. E. Hart, Pattern Cassification and Scene Analysis, New York: John Wiley & Sons, 1973, pp. 88-95]. The need for nonparametric techniques stems from a wide range of applications in which the experimenter is unwilling to assume a parametric family for the true underlying probability density function. The basic foundation for nonparametric density estimation was Fix and Hodges' [E. Fix and J. L. Hodges, “Discriminatory analysis, nonparametric discrimination,” U. S. Air Force School of Aviation Medicine, Randolph Field, Tex Project 21-49-004, Rep. 4, Contract AF-41-(128)-31, February 1951] original work. They based their results on the concept that the value of a density function at a point can be estimated using the number of sample observations that fall within a small region around that point.
Rosenblatt [M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” Ann. Math. Statist., vol 27, pp.832-837, 1956], Whittle [P. Whittle, “On the smoothing of probability density functions,” J. Roy. Statist., Ser B, vol. 20, pp. 334-343, 1958], Parzen [17], and Cacoullos [T. Cacoullos, “Estimation of a multivariate density,” Ann. Inst. Statist. Math., vol. 18, pp. 179-189, 1966] generalized these results and developed the Parzen kernel class of estimators. Conditions on the kernel function were derived to ensure asymptotically unbiased, and uniformly consistent estimators.
Given R samples, S={X1, . . . XR} drawn from a population with probability density function P(X), the Parzen density estimate {circumflex over (P)}(X) of the unknown probability function a sample X is defined as
                              P          ⁡                      (            X            )                          =                              1            R                    ⁢                                    ∑                              i                =                1                            M                        ⁢                                                  ⁢                                          1                h                            ⁢                              K                ⁡                                  (                                                            X                      -                                              X                        i                                                              h                                    )                                                                                        (        35        )            where K(·) is a window or kernel function and h is the window width, smoothing parameter, or simply the kernel size. The samples in the set S={X1, . . . XR} are used as the function kernels for Ki.
Often it is convenient to assume a d-dimension Gaussian form for kernels, or
                                          P            ^                    ⁡                      (                          X              /                              k                i                                      )                          =                              1                          2              ⁢                              π                                  d                  /                  2                                            ⁢                                                                  H                                                                    1                  /                  2                                                              ⁢                      ⅇ                                                            -                  1                                /                2                            ⁢                                                (                                      X                    -                                          X                      i                                                        )                                T                            ⁢                                                H                                      -                    1                                                  ⁡                                  (                                      X                    -                                          X                      i                                                        )                                                                                        (        36        )            where H is the kernel smoothing parameter and the data points Xi,, from S={X1, . . . XR} replaces the mean of {circumflex over (P)}(X/ki). H is defined as
                    H        =                  [                                                                      h                  11                  2                                                            …                                                              h                                      1                    ⁢                    d                                    2                                                                                    ⋮                                            ⋱                                            ⋮                                                                                      h                                      d                    ⁢                                                                                  ⁢                    1                                    2                                                            …                                                              h                  dd                  2                                                              ]                                    (        37        )            
This expanded form of h accounts for both measurement scaling and correlation between the measurements. A procedure for estimating H will be defined later.
Using eq. (36), the multi-dimensional Parzen density estimate at X is defined as
                                          P            ^                    ⁡                      (            X            )                          =                              1            R                    ⁢                                    ∑                              i                =                1                            M                        ⁢                                                  ⁢                                          P                ^                            ⁡                              (                                  X                  /                                      k                    i                                                  )                                                                        (        38        )            
A modified Parzen kernel estimator can be used to approximate the class-of-interest posterior distribution, P(Cint/X).
The approximation for the class-of-interest posterior distribution function P(Cint/X) using ATF(X), is defined as follows:
The function-of-features, f(X)'s, (in the vector F(X), eq. (23)) are defined as
                                          f            ⁡                          (              X              )                                i                =                              P            ⁡                          (                              X                /                                  k                  i                                            )                                            P            ⁡                          (              X              )                                                          (        39        )            The vector, F(X) is then defined as
                              F          ⁡                      (            X            )                          =                              [                                                            P                  ⁡                                      (                                          X                      /                                              k                        1                                                              )                                                                    P                  ⁡                                      (                    X                    )                                                              ,                                                P                  ⁡                                      (                                          X                      /                                              k                        2                                                              )                                                                    P                  ⁡                                      (                    X                    )                                                              ,                              …                ⁢                                                                  ⁢                                                      P                    ⁡                                          (                                              X                        /                                                  k                          R                                                                    )                                                                            P                    ⁡                                          (                      X                      )                                                                                            ]                    T                                    (        40        )            where P(X/ki) is defined in eq. (36) and P(X) is defined in eq. (38).The parameters of the weighting vector, A, are defined asA=(a1, a2, . . . aR)T  (41)Taking the product of ATF(X), we get
                                          A            T                    ⁢                      F            ⁡                          (              X              )                                      =                                            a              1                        ⁢                                          P                ⁡                                  (                                      X                    /                                          k                      1                                                        )                                                            P                ⁡                                  (                  X                  )                                                              +                                    a              2                        ⁢                                          P                ⁡                                  (                                      X                    /                                          k                      2                                                        )                                                            P                ⁡                                  (                  X                  )                                                      ⁢                                                  ⁢            …                    +                                    a              R                        ⁢                                          P                ⁡                                  (                                      X                    /                                          k                      R                                                        )                                                            P                ⁡                                  (                  X                  )                                                                                        (        42        )            or
                                          A            T                    ⁢                      F            ⁡                          (              X              )                                      =                              ∑                          i              =              1                        R                    ⁢                                          ⁢                                    a              i                        ⁢                                          P                ⁡                                  (                                      X                    /                                          k                      i                                                        )                                                            P                ⁡                                  (                  X                  )                                                                                        (        43        )            
Thus, ATF(X) is seen to be weighted linear combination of posterior distributions estimated using Parzen kernels. The weighting vector A is estimated using the least squares estimator, eq. (29).
The number of computations required to compute the Parzen kernel estimate of the class-of-interest posterior distribution function ATF(X), eq. (43), increases linearly as the number of features increases. This linear relationship between the number of features and the number of computations means that the Parzen kernel estimator can be used to extract useful features from high dimensional data.
Estimating the Smoothing Parameter, H, for the Parzen Kernels
A number of authors have studied the problem of determining a value for the Parzen smoothing parameter h. Fukunaga [K. Fukunaga, D. M. Hummels, “Bayes Error Estimation Using Parzen and k-NN Procedures”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, Number 5, September 1987, p. 634-643] evaluated a least squares approach for estimating h. However, Fukunaga's least square estimator of h is not directly related to the set of kernels selected and was shown to provide inconsistent results.
U.S. Pat. No. 6,317,517, issued to Liang, et al., disclosed a method for using classification accuracy as a means for selecting an optimal value for smoothing parameter h. However this approach requires that training samples be available for all the classes.
Jones [M. C. Jones and D. A. Henderson, “Maximum likelihood kernel density estimation,” Technical Report 05/01, Department of Statistics, Open University] and Grim [J. Grim, J. Novovicova, P. Pudil, and P. Somol, “Initialing Normal Mixtures of Densities,” Proceedings of the 14th International Conference on Pattern Recognition-Volume 1, p. 886, 1998] suggested an approach for estimating the kernel smoothing parameter, H, using maximum likelihood estimation. They used the well known log-likelihood criterion for finite mixtures [R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: John Wiley & Sons, 1973, pp. 189-201, pp. 189-201] to estimate an optimal smoothing matrix for the unknown density. In particular, a smoothing parameter, Hi, is estimated for each kernel density, ki i=1 . . . R, using an iterative equation. An optimal smoothing parameter, H, was obtained for the unknown density function by computing a weighted average of the individual kernels Hi's. This procedure provides useful results but is computationally intensive.
An alternative maximum likelihood kernel density estimation technique is presented below which provides a maximum likelihood estimate of the smoothing matrix for the unknown density function. An iterative technique is derived which estimates a value for H which is common all kernels.
One interpretation of the kernel density estimator is that it is a special case of the mixture density model of the form
                              P          ⁡                      (            X            )                          =                              ∑                          i              =              1                        M                    ⁢                                          ⁢                                    P                              k                i                                      ⁢                          1                              2                ⁢                                  π                                      d                    /                    2                                                  ⁢                                                                          H                                                                            1                    /                    2                                                                        ⁢                          ⅇ                                                                    -                    1                                    /                  2                                ⁢                                                      (                                          X                      -                                              X                        i                                                              )                                    T                                ⁢                                                      H                                          -                      1                                                        ⁡                                      (                                          X                      -                                              X                        i                                                              )                                                                                                          (        44        )            where the data points Xi, i=1 . . . R replace the Gaussian means, Pki is the a priori probability of the kernel, and all the kernels share a common smoothing parameter H.
First we will let the a priori probability of the kernels be
                              P                      k            i                          =                  1          R                                    (        45        )            
Now, suppose we are given a set Ψ={X1, X2, . . . XN} of N unlabeled samples drawn independently from the mixture density,
                              P          ⁡                      (                          X              ❘              H                        )                          =                              ∑                          i              =              1                        R                    ⁢                                          ⁢                                    P                              k                i                                      ⁢                          P              ⁡                              (                                                      X                    /                                          k                      i                                                        ,                  H                                )                                                                        (        46        )            where the smoothing parameter H is fixed but unknown.
The likelihood of the observed samples is by definition the joint probability
                              P          ⁡                      (                          Ψ              ❘              H                        )                          =                              ∏                          j              =              1                        N                    ⁢                                          ⁢                      P            ⁡                          (                                                X                  j                                /                H                            )                                                          (        47        )            
The maximum likelihood estimate of H is that value of Ĥ that maximizes P(Ψ|H). Let L be the logarithm of the likelihood, then
                    L        =                              ∑                          j              =              1                        N                    ⁢                                          ⁢                      log            ⁢                                                  ⁢                          P              ⁡                              (                                                      X                    j                                    /                  H                                )                                                                        (        48        )            
Differentiating L, eq. (48), with respect to H and setting the resulting expression to zero, we get
                                          δ            ⁢                                                  ⁢            L                                δ            ⁢                                                  ⁢            H                          =                                            ∑                              j                =                1                            N                        ⁢                                                  ⁢                                          ∑                                  i                  =                  1                                R                            ⁢                                                          ⁢                                                                                          P                                              k                        i                                                              ·                                          P                      ⁡                                              (                                                                                                            X                              j                                                        /                                                          k                              i                                                                                ,                          H                                                )                                                                                                  P                    ⁡                                          (                                                                        X                          j                                                /                        H                                            )                                                                      [                                                                  ⁢                                                                            -                                              1                        2                                                              ·                                          H                                              -                        1                                                                              +                                                            1                      2                                        ⁢                                                                  H                                                  -                          1                                                                    ⁡                                              (                                                  Xj                          -                                                      X                            i                                                                          )                                                              ⁢                                                                  (                                                                              X                            j                                                    -                                                      X                            i                                                                          )                                            T                                        ⁢                                          H                                              -                        1                                                                                            ]                                              =          0                                    (        49        )            Solving eq. (49) for H, we obtain the following maximum likelihood estimator for H
                              H          ^                =                              ∑                          j              =              1                        N                    ⁢                                          ⁢                                    ∑                              i                =                1                            R                        ⁢                                                  ⁢                                                                                P                                          k                      i                                                        ·                                      P                    ⁡                                          (                                                                        X                          j                                                /                                                  k                          i                                                                    )                                                                                        P                  ⁡                                      (                                          X                      j                                        )                                                              ·                              [                                                      1                    2                                    ⁢                                      (                                                                  X                        j                                            -                                              X                        i                                                              )                                    ⁢                                                            (                                                                        X                          j                                                -                                                  X                          i                                                                    )                                        T                                                  ]                                                                        (        50        )            
Eq. (50) is an iterative maximum likelihood estimator for the smoothing parameter, H. Given an initial value for H0, an updated value is obtained for H. After each update of H, the logarithm of the likelihood, eq. (48) is evaluated. This process is repeated until there is no further change in the logarithm of the likelihood, eq. (48).
Adaptive Feature Selection Using Decision Boundaries Provided by the Adaptive Bayes Pattern Recognition Algorithm
Lee and Landgrebe [C. Lee and D. L. Landgrebe, “Feature Extraction Based on Decision Boundaries,” IEEE Trans. On Pattern Analysis and Machine Intelligence, vol. 15, no. 4, April 1993] and [C. Lee and D. L. Landgrebe, “Decision Boundary Feature Extraction for Nonparametric Classification,” IEEE Trans. On Systems, Man, and Cybernetics, vol. 23, no. 2, March/April 1993] showed that optimal features can be extracted for use in classification based directly on decision boundaries. They showed that discriminant features are directly related to decision boundaries and these features can be derived from decision boundaries. Their algorithm requires that labeled training samples (or alternatively, class density functions) be available for all classes. It will be shown below that this concept, coupled with the previously presented adaptive Bayes classifier, eq. (9), can be applied to perform Adaptive Bayes Feature Extraction.
Two non-parametric adaptive feature selection techniques are presented below that extracts linear combination of discriminant features using only labeled samples from the class-of-interest and unlabeled samples from the data set to be classified and the decision boundary provide by adaptive Bayes classifier.
The Adaptive Bayes Decision Boundary
The adaptive Bayres decision boundary is the locus of all the points where the posterior probability function in eq. (9) is equal to ½, or:P(Cint/X)−½=0  (51)or, using the least-squares approximation of {circumflex over (P)}(Cint/X) the decision boundary, eq.(30), is the locus of all the points where;ATF(x)−½=0  (52)
FIG. 1 shows an example of the decision boundary for two classes, 10 and 12, where the covariance matrices are equal. The decision boundary, 14, in this example, is linear.
In this example, the mean vectors and covariance matrices of the two bi-variant Gaussian classes are as follows:
                                          μ                          Class_of              ⁢              _Interest                                =                      [                                                            1                                                                              2                                                      ]                          ,                              Σ                          Class_of              ⁢              _Interest                                ⁢                                          =                      [                                                            1                                                  0.5                                                                              0.5                                                  1                                                      ]                                              (        53        )            
                                          μ                          Class              ⁢                              -                            ⁢              other                                =                      [                                                            2                                                                              1                                                      ]                          ,                                  ⁢                              Σ                          Class              ⁢                              -                            ⁢              other                                =                      [                                                            1                                                  0.5                                                                              0.5                                                  1                                                      ]                                              (        54        )            
These distributions are shown in FIG. 1 as two sigma “ellipses of concentration”, 10 and 12.
Also shown in FIG. 1 is a unit normal vector to the decision boundary 16. This normal vector is the primary discriminant feature vector for this example. A linear combination of features based on this vector will provide maximum discrimination between these two classes. The intrinsic dimensionality of this problem is 1 since a second feature vector, orthogonal to the first normal, would provide no additional information for discriminating between the two classes.
The unit normal vector 16 in FIG. 1 is defined as
                    N        =                              1                          2                                ⁢                      (                                          -                1                            ,              1                        )                                              (        55        )            Decision Boundary Feature Matrix
The effectiveness of the unit normal vector as a discriminant feature is roughly proportional to the area of the decision boundary that has the same normal vector.
Let N(X) be the unit normal vector to the decision boundary at a point X on the decision boundary for a given pattern classification problem. Lee and Landgrebe [C. Lee and D. L. Landgrebe, “Feature Extraction Based on Decision Boundaries,” IEEE Trans. On Pattern Analysis and Machine Intelligence, vol. 15, no. 4, April 1993] define a decision boundary feature matrix, MDBFM as
                              M          DBFM                =                              1            K                    ⁢                                    ∫              S                                                                    ⁢                                          N                ⁡                                  (                  X                  )                                            ⁢                                                N                  t                                ⁡                                  (                  X                  )                                            ⁢                              p                ⁡                                  (                  X                  )                                            ⁢                                                          ⁢                              ⅆ                X                                                                        (        56        )            where p(X) is the probability density function and K is
                    K        =                              ∫            S                                                          ⁢                                    p              ⁡                              (                X                )                                      ⁢                                                  ⁢                          ⅆ              X                                                          (        57        )            and S is the decision boundary, and the integral is performed over the decision boundary.
In the FIG. 1, it is obvious that the vector normal to the decision boundary is the same for any point on the decision boundary. The decision boundary feature matrix for the decision boundary in FIG. 1 is given by
                              M          DBFM                =                              1            K                    ⁢                                    ∫              S                                                                    ⁢                                          N                ⁡                                  (                  X                  )                                            ⁢                                                N                  T                                ⁡                                  (                  X                  )                                            ⁢                              p                ⁡                                  (                  X                  )                                            ⁢                                                          ⁢                              ⅆ                X                                                                        (        58        )            
                              M          DBFM                =                                            1              K                        ⁢                          NN              T                        ⁢                                          ∫                S                                                                              ⁢                                                p                  ⁡                                      (                    X                    )                                                  ⁢                                                                  ⁢                                  ⅆ                  X                                                              =                      NN            T                                              (        59        )            and the decision boundary feature matrix, MDBFM, for the normal vector in FIG. 1 is
                              M          DBFM                =                                            1                              2                                      ⁢                                                            (                                                            -                      1                                        ,                    1                                    )                                t                            ·                              1                                  2                                                      ⁢                          (                                                -                  1                                ,                1                            )                                =                                    1              2                        ⁡                          [                                                                    1                                                                              -                      1                                                                                                                                  -                      1                                                                            1                                                              ]                                                          (        60        )            and the rank of MDBFM, isRank(MDBFM)=1  (61)
In this example, since the rank of the decision boundary feature matrix, MDBFM, is one, the intrinsic discriminant dimensionality of MDBFM is one.
Solving for the eigenvalues and eigenvectors of the decision boundary feature matrix, defined in eq. (60), we get:
                              eigenvalues          ⁢                      {                          M              DBFM                        )                          =                  [                                                    1                                                                    0                                              ]                                    (        62        )            and the two eigenvectors are
                              eigenvector          ⁢                                    {                              M                DBFM                            )                        1                          =                  [                                                    0.707                                                                                      -                  0.707                                                              ]                                    (        63        )            
                              eigenvector          ⁢                                    {                              M                DBFM                            )                        2                          =                  [                                                    0.707                                                                    0.707                                              ]                                    (        64        )            
We see that the eigenvalue for the first eigenvector is one and the eigenvalue of the second eigenvector is zero. This indicates that the first eigenvector, eq. (63), provides all of the discrimination between the classes.
FIG. 2 is an example of two bivariate Gaussian classes, 18 and 20, with the following mean vectors and covariance matrices:
                                          μ                          Class_of              ⁢              _Interest                                =                      [                                                            1                                                                              2                                                      ]                          ,                                  ⁢                              Σ                          Class_of              ⁢              _Interest                                =                      [                                                            1                                                  0.5                                                                              0.5                                                  1                                                      ]                                              (        65        )            
                                          μ                          Class              ⁢                              -                            ⁢              other                                =                      [                                                            1                                                                              2                                                      ]                          ,                                  ⁢                              Σ                          Class              ⁢                              -                            ⁢              other                                =                      [                                                            1                                                                      -                    0.5                                                                                                                    -                    0.5                                                                    1                                                      ]                                              (        66        )            Again, these distributions are shown in FIG. 2 as “ellipses of concentration”.
Since the covariance matrices are unequal in this example, the decision boundary 22 is a quadratic curve. It can be seen that the directions of the unit normals 24 to this decision boundary vary along the decision boundary. This implies that two features are needed to achieve the same classification accuracy as in the original feature space.
Decision Boundaries and Effective Decision Boundaries
It was seen in FIGS. 1 and 2, that the decision boundary of a two-class problem is the locus of points for which the class-of-interest posteriori probability is equal to one-half. Although a decision boundary can be extended to infinity, only a portion of the decision boundary affects the classification results. We define the effective decision boundary as the intersection of the decision boundary with the region where most of the data is located. FIGS. 3 and 4 show examples of decision boundaries, 30 and 38, and effective decision boundaries, 32 and 40.
In a similar way, we define an effective boundary feature matrix (EDBFM) in the same as the decision boundary feature matrix, except that only the effective decision boundary is considered instead of the entire decision boundary. MEDBFM is defined as:
                              M          DBFM                =                              1                          K              ′                                ⁢                                    ∫                              S                ′                                                                                  ⁢                                          N                ⁡                                  (                  X                  )                                            ⁢                                                N                  ⁡                                      (                    X                    )                                                  T                            ⁢                              p                ⁡                                  (                  X                  )                                            ⁢                                                          ⁢                              ⅆ                X                                                                        (        67        )            where p(X) is the probability density function and
                              K          ′                =                              ∫                          S              ′                                                                      ⁢                                    p              ⁡                              (                X                )                                      ⁢                                                  ⁢                          ⅆ              X                                                          (        68        )            and S′ is the effective decision boundary, and the integral is performed over the effective decision boundary.Procedure for Finding the Decision Boundary Feature Matrix
As was shown in FIGS. 3 and 4, the effective decision boundaries, 32 and 40, displayed as bold lines, plays a significant role in discriminating between the classes, whereas the parts of the decision boundary displayed as dashed lines, 30 and 38, are rarely used in discriminating between the classes. Feature extraction based on the effective decision boundary instead of the complete decision boundary will result in fewer features while achieving nearly the same classification accuracy.
The basic concept for calculating the effective decision boundary feature matrix is illustrated in FIG. 5. First the unlabeled samples from the data set under consideration are classified as either class-of-interest or class-other using the adaptive Bayes decision rule, eq. (9). As shown in FIG. 5, a subset the classified samples lying near the decision boundary 46 is extracted. Samples lying near the decision boundary are readily identified, since the value of their posterior probability, {circumflex over (P)}(Cint/X), is close to ½.
For each sample classified as “class-of-interest”, the nearest sample classified as “class-other”, on the other side of the decision boundary, is located. Vectors 50 are constructed between pairs of samples as shown in FIG. 5. A search is made along this vector for the point where the vector crosses the decision boundary. At the point where the vector crosses the decision boundary, the posterior probably, {circumflex over (P)}(Cint/X), is equal to ½. Unit normal vectors to the decision boundary 48 are computed at these points. This procedure is repeated for all samples classified as “class-other”. These unit normals are then used to derive a set of orthogonal feature extraction vectors.
FIG. 6 shows the results of an example of the application of this boundary finding procedure to find points 58 on the decision boundary 56 using the data shown in FIG. 3. These points 58 are used to calculate normals to the decision boundary.