1. Field of the Invention
This invention relates generally to pattern recognition and particularly to the feature selection problem in statistical pattern recognition. More specifically, the invention provides a method for ranking features in order of importance and for selecting the features that are important to the classification of objects and recognition of patterns.
2. Description of the Related Art
The human brain is a supreme pattern recognition system. We rely on its ability to successfully process and recognize patterns to accomplish a variety of tasks. When walking down a street, we avoid tripping over curbs or bumping into walls. When driving, we stay in a lane and stop at red lights. We recognize words when reading newspapers or sounds when listening to the radio. We can fumble in our pocket to find a quarter for the vending machine. Although the exact output needed from our brain to accomplish any of these tasks is certainly not restrained by definite limits, identifying the object's proper class is of primary importance.
Today, simple pattern recognition systems are readily found, such as bar code scanners, magnetic strip scanners, or bill exchangers found in vending machines. However, the recent growth in research and development of automated pattern recognition systems and more advanced computing facilities has allowed more complex problems such as character recognition or automated speech recognition to be addressed. As the problems become more complex, the limits on the practical applicability of existing theories, as explained in the following paragraphs, are exposed, and the need arises to create techniques to deal with these practical applications and their associated pitfalls.
An important class of methods used to determine the identity of objects is referred to as statistical pattern recognition (see PATTERN CLASSIFICATION, Duda and Hart et al. (1973); INTRODUCTION TO STATISTICAL PATTERN RECOGNITION, Fukunaga (1990); and A PROBABILISTIC THEORY OF PATTERN RECOGNITION, Devroye, Györfi et al. (1996)), which is well known in the art. These methods make use of a probabilistic model to define the identity of each class. For example, a decision processor determines the object's identity by comparing an observation of that object against a set of class probabilistic models, hence the name statistical pattern recognition. The success of statistical pattern recognition depends on: 1) creating a consistent discrimination rule for the decision processor; 2) choosing the best probabilistic model and being able to accurately estimate the parameters of that model; and 3) finding an optimal set of feature attributes that makes the class identities distinguishable from each other.
The pitfalls associated with the object's observation or set of feature attributes affect many aspects of statistical pattern recognition theory when it is applied to real-world problems, such as automatic speech recognition (see “A Review of Large-Vocabulary Continuous-Speech Recognition,” IEEE SIGNAL PROCESSING MAGAZINE, Vol. 13, No. 5, Young (1996); and STATISTICAL METHODS FOR SPEECH RECOGNITION, Jelinek (1997)). A feature-based statistical pattern recognition system uses a set of feature attributes, known as the feature set, to describe the properties of an object. These feature attributes, for example, can be a collection of measurements taken from different sensors. For many applications, the feature attributes that allow the statistical models for each of the classes to be mutually distinguishable are unknown.
An improper feature set will create difficulties for the decision processor and statistical models. The size of the feature set, that is, the number of feature attributes that describe the properties of an object, is a chief concern. If the feature set is too small, then there will be an inadequate amount of discriminative information to separate the classes. The lack of discriminative information will result in a decision processor that performs poorly. On the other hand, if the feature set is too large, the ratio between the number of observed objects and the number of parameters in the statistical models to be estimated becomes disproportionate. This is well known in the art as the “curse of dimensionality.” An unrestrained expansion of the size of the feature set will also result in a decision processor that performs poorly.
A principal reason that the “curse of dimensionality” exists is that, for many practical pattern recognition problems, the number of observations available to estimate the parameters of the statistical models is finite or very expensive to obtain in sufficient quantities. As the number of feature attributes, also known as the dimension of the observation space, increases, the available data becomes sparse in the observation space. This sparsity of the data makes it more difficult to obtain accurate estimates of the model parameters that are needed for good robust decision processors. For example, in a speech recognition problem the basic sound units may be represented as 50 different phonemes. Suppose 20 examples of each phoneme are measured, for a total of 1,000 observations, using 36 different feature attributes, so that d=36 is the dimension of the observation space, and each phoneme is modeled with a normal distribution, which has $0.5(d^2+3d)$ model parameters, or 702 parameters per phoneme. The 1,000 observations supply 36,000 measured feature values, from which 50×702=35,100 parameters must be estimated. It follows that the model parameters estimated from the observed data might not be very accurate and the corresponding decision processor may have poor performance.
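The arithmetic of the phoneme example can be checked with a short sketch. The function and variable names below are illustrative only and do not appear in the original text.

```python
def gaussian_param_count(d: int) -> int:
    """Free parameters of one d-dimensional normal model: d mean entries
    plus d(d+1)/2 distinct covariance entries, i.e. 0.5 * (d**2 + 3*d)."""
    return (d * d + 3 * d) // 2


num_classes = 50          # phonemes
examples_per_class = 20   # observations measured per phoneme
d = 36                    # feature attributes per observation

observations = num_classes * examples_per_class          # 1,000
measured_values = observations * d                       # 36,000
params_total = num_classes * gaussian_param_count(d)     # 35,100

print(observations, measured_values, params_total)
```

The number of parameters to be estimated is nearly as large as the number of measured values, which is the imbalance the paragraph describes.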
Another pitfall associated with a large feature set size is related to the increased complexity of the pattern recognition system. From an economic perspective, decision processors that run fast and require little computer memory are usually preferable. The dimension of the observation space is a factor in establishing a lower bound on the number of floating point operations needed for the decision processor to assign a class identity to an observation. The dimension of the observation space often controls the number of parameters for statistical models, which directly affects the amount of computer memory needed to store that model.
The selection of a proper feature set affects all types of decision processors, whether neural network or classical approaches. The selection of a proper feature set can be broken down into a variable selection and a feature selection problem. Variable selection is a more abstract problem of selecting input variables for the observation space that are the most predictive of an outcome. Feature selection is the problem of finding the optimal subset of features derived from the observation space. The difference between variable selection and feature selection problems is very small, and therefore, confusion often arises because these terms are used interchangeably.
The variable selection problem is simply the identification of feature attributes that provide discriminatory information that allows the class identities to become mutually distinguishable. However, as stated, arbitrarily creating a high dimensional observation space creates problems for the decision processor and statistical models. The variable selection problem therefore establishes the observation space that feeds the feature selection problem, where a prime objective of the feature selection problem is to reduce the dimension of the observation space so that the optimal feature attributes are found. This enables the pattern recognition system designer to manage the complexity of the decision processor, as well as control the robustness and accuracy of the pattern recognition system.
The feature selection problem in a statistical pattern recognition system can be illustrated with the following example. Without loss of generality, consider a problem where the objective is to determine an observation's identity as coming from one of two classes. The observation is a vector $x$ in a Euclidean space $E^d$, which is a collection of d-tuples $(x_1, x_2, \ldots, x_d)$ of real numbers $x_i$ for $i = 1, 2, \ldots, d$ with a metric

$$\sqrt{\sum_{i=1}^{d} x_i^2}.$$

A random variable $X$ is defined to provide a connection between any element of $E^d$ and the value assigned to an observation vector $x$, also known as a feature vector. Following the framework given by Devroye (see A PROBABILISTIC THEORY OF PATTERN RECOGNITION, Devroye, Györfi, et al. (1996)), the decision rule, also known as a classifier, is a function that maps the observation vector $x \in E^d$ to a class label $l$ from the set $\{\omega_1, \omega_2\}$. That is, the classifier is defined as

$$g(x): E^d \to \{\omega_1, \omega_2\}. \tag{1}$$
When each observation has an identity, the observation and its identity can be joined together to create observation-label pairs $(x, l)$, where $x \in E^d$ and $l \in \{\omega_1, \omega_2\}$. A pair of random variables $(X, L)$ is defined to provide a connection between any element of $E^d \times \{\omega_1, \omega_2\}$ and the value assigned to an observation-label pair $(x, l)$. At a fundamental level, this random valued pair $(X, L)$ describes the pattern recognition problem. By defining the distribution for that random valued pair $(X, L)$, a pattern recognition problem takes on a probabilistic framework, i.e., statistical pattern recognition.
Equation (1) does not provide insight as to the proper choice for the classifier g(•). A reasonable criterion to adopt for choosing g(•) is one that minimizes the average probability of making an error when choosing the class identity of an observation.
Since a classification error occurs when $g(X) \neq L$, the average error probability for the classifier g(•) given by equation (1) is defined as

$$\mathrm{err}(g) = \Pr\{g(X) \neq L\} = 1 - \sum_{i=1}^{2} \Pr\{L = \omega_i,\ g(X) = \omega_i \mid X = x\}. \tag{2}$$

The best possible classifier, also known as the Bayes classifier (see PATTERN CLASSIFICATION AND SCENE ANALYSIS, Duda and Hart (1973); and INTRODUCTION TO STATISTICAL PATTERN RECOGNITION, Fukunaga (1990)), is a mapping function $g^*(\cdot)$ that minimizes the average error probability, that is

$$g^*(X) = \underset{g:\, E^d \to \{\omega_1, \omega_2\}}{\arg\min}\ \Pr\{g(X) \neq L\}. \tag{3}$$
The classifier defined from equation (3) operates in a d-dimensional space, because the observation vector $x$ has d feature attributes. A generic approach to feature selection is to find p features, where p is an integer less than d, that would allow the optimal classifier in p dimensions to be created. The dimension of the observation space, for example, can be reduced by an arbitrary measurable function defined as

$$T(x): E^d \to E^p. \tag{4}$$
The function T(•) reduces a d-dimensional observation vector $x$ to a p-dimensional vector $y$. When the transformation is applied to the pattern recognition problem, the problem changes from a distribution on $(X, L)$ to a distribution on $(Y = T(X), L)$. A classifier that employs an arbitrary measurable function to reduce the dimension of the observation space is defined as

$$g(y): \left(E^d \xrightarrow{\,y = T(x)\,} E^p\right) \to \{\omega_1, \omega_2\}. \tag{5}$$

The Bayes classifier, which operates in dimension p and minimizes the average error probability, is defined as

$$g^* = \underset{g:\, \left(E^d \xrightarrow{\,T\,} E^p\right) \to \{\omega_1, \omega_2\}}{\arg\min}\ \Pr\{g(Y) \neq L\}. \tag{6}$$
The Bayes classifiers defined by equations (3) and (6) are not fundamentally different. In equation (6), the transformation T(•) and the integer p are explicitly shown as undetermined variables in the minimization problem. The two formulations show that the feature selection problem is not a separate problem from finding the optimal decision rule. However, the difference between the two formulations does suggest a way to break the problem into more manageable pieces. It may be easier to work the problem if the search for the “best” decision rule is separate from the search for the “best” transformation function. This can be accomplished, for example, if a functional form for the decision rule g(•) is assumed before searching for the “best” linear transformation.
This simplification may be far from optimal for any distribution $(X, L)$ and the given classifier $g(X)$. However, the optimal transformation for each p, where $1 \le p < d$, given a classifier g(•) would be defined as

$$\hat{T}_p(X) = \underset{T:\, E^d \to E^p}{\arg\min}\ \Pr\{g(T(X)) \neq L\}. \tag{7}$$

Insofar as the functional form of g(•) is appropriate for the distribution of $(X, L)$ and $(Y = T(X), L)$, equation (7) would be a reasonable simplification.
To illustrate an example of a transformation function T(•), consider reducing a d-dimensional observation vector $x$ to a p-dimensional vector $y$ by means of a linear transformation, that is, $y = \Psi_p x$ where $\Psi_p$ is a p×d matrix. Desirable properties for this matrix would be p linearly independent row vectors, that is, rank($\Psi_p$) = p, and row vectors of $\Psi_p$ that each have an L2-norm of one.
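As a minimal sketch of such a linear transformation, the following NumPy fragment scales the rows of an arbitrary p×d matrix to unit L2-norm, checks the rank condition, and applies the result to one observation. The matrix entries, the observation, and the helper name `normalize_rows` are assumptions chosen only for illustration.

```python
import numpy as np


def normalize_rows(raw: np.ndarray) -> np.ndarray:
    """Scale each row of a p x d matrix so that its L2-norm is one."""
    return raw / np.linalg.norm(raw, axis=1, keepdims=True)


# Hypothetical 2 x 5 matrix with linearly independent rows (p = 2, d = 5).
raw = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 3.0, 4.0, 0.0]])
psi_p = normalize_rows(raw)

# Hypothetical d-dimensional observation and its p-dimensional image.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = psi_p @ x

print(np.linalg.matrix_rank(psi_p), np.linalg.norm(psi_p, axis=1), y)
```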
Without loss of generality, consider the extreme case of finding the “best” linear transformation when p=1, that is, find a vector $z^T = \Psi_{p=1}$ that transforms a d-dimensional observation $x$ into a scalar $y$ for a pattern recognition problem with two equally likely classes. Assume the class identities are modeled by a normal distribution, that is, the class conditional density for the ith class is defined as

$$p_i(x) = N(x;\ \mu_i, \Sigma_i) \tag{8}$$

where $\mu_i$ is the class mean vector and $\Sigma_i$ is the class covariance matrix for the multi-variate normal distribution. The optimal decision rule that determines the class identity given the observation vector $x$ in d-dimensions is defined as

if $p_1(x) > p_2(x)$, then $x$ belongs to class 1, otherwise class 2.  (9)

Under this rule, the number of incorrectly identified observations will be minimized.
The decision rule can be rewritten by using discriminant functions $g_i(x)$ that are a function of the class conditional density function (8). The discriminant function for each class is defined as

$$g_i(x) = -\tfrac{1}{2} x^T \Sigma_i^{-1} x + \mu_i^T \Sigma_i^{-1} x - \tfrac{1}{2} \mu_i^T \Sigma_i^{-1} \mu_i - \tfrac{1}{2} \log_e |\Sigma_i|, \quad i = 1, 2. \tag{10}$$

The optimal decision rule using these discriminant functions, also known as the Bayes classifier in dimension d, is defined as

if $g(x) = g_1(x) - g_2(x) > 0$, then $x$ belongs to class 1, otherwise class 2.  (11)
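The discriminant functions of equation (10) and the decision rule of equation (11) can be sketched in NumPy as follows. The class means, covariance matrices, and test points are hypothetical values chosen for illustration, and the function names are not taken from the original text.

```python
import numpy as np


def discriminant(x, mean, cov):
    """Equation (10): log of the normal density up to an additive constant."""
    cov_inv = np.linalg.inv(cov)
    _, log_det = np.linalg.slogdet(cov)   # log_e |cov|
    return (-0.5 * x @ cov_inv @ x
            + mean @ cov_inv @ x
            - 0.5 * mean @ cov_inv @ mean
            - 0.5 * log_det)


def classify(x, mean1, cov1, mean2, cov2):
    """Equation (11): class 1 if g1(x) - g2(x) > 0, otherwise class 2."""
    return 1 if discriminant(x, mean1, cov1) - discriminant(x, mean2, cov2) > 0 else 2


# Hypothetical two-class problem in d = 2 with unequal covariances.
mean1, cov1 = np.array([0.0, 0.0]), np.eye(2)
mean2, cov2 = np.array([3.0, 3.0]), 2.0 * np.eye(2)

print(classify(np.array([0.5, 0.2]), mean1, cov1, mean2, cov2))
print(classify(np.array([3.0, 2.5]), mean1, cov1, mean2, cov2))
```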
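Given any vector $z$ to transform the observations $x$ into observations $y$ by means of the linear transformation $y = z^T x$, the class conditional densities for observations $y$ are normal and defined as

$$p_i(y) = N(y;\ \mu_i = z^T \mu_i,\ \sigma_i^2 = z^T \Sigma_i z) \tag{12}$$

where, on the right-hand sides, $\mu_i$ and $\Sigma_i$ are the class mean vector and class covariance matrix of equation (8), and the resulting scalars $\mu_i$ and $\sigma_i^2$ are the mean and variance of the normal distribution of $y$.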
Given the vector $z$, the classifier for observation $y = z^T x$ is defined in terms of its discriminant functions $g_i(\cdot)$ as

if $g(y) = g_1(y) - g_2(y) > 0$, then $y$ belongs to class 1, otherwise class 2.  (13)

The discriminant functions $g_i(\cdot)$ for equation (13) are defined as

$$g_i(y) = -\frac{(y - \mu_i)^2}{2\sigma_i^2} - \log_e \sigma_i, \quad i = 1, 2 \tag{14}$$

where $y = z^T x$, with the scalar mean $\mu_i = z^T \mu_i$ and variance $\sigma_i^2 = z^T \Sigma_i z$ as in equation (12), for i = 1, 2.
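The projected classifier of equations (12) and (13) can be sketched as follows. The sketch writes each discriminant as the logarithm of the projected normal density with the constant dropped, so that the larger value indicates the more likely class, which is the one-dimensional case of equation (10). The direction `z`, the class parameters, and the function names are assumptions made for illustration.

```python
import math

import numpy as np


def project_class(z, mean, cov):
    """Equation (12): scalar mean and variance of y = z^T x for one class."""
    return float(z @ mean), float(z @ cov @ z)


def discriminant_1d(y, mu, var):
    """Log of N(y; mu, var) up to a constant; note log(sigma) = 0.5 * log(var)."""
    return -((y - mu) ** 2) / (2.0 * var) - 0.5 * math.log(var)


def classify_projected(x, z, class1, class2):
    """Equation (13): class 1 if g1(y) - g2(y) > 0, otherwise class 2."""
    y = float(z @ x)
    g1 = discriminant_1d(y, *project_class(z, *class1))
    g2 = discriminant_1d(y, *project_class(z, *class2))
    return 1 if g1 - g2 > 0 else 2


# Hypothetical unit-norm direction and class models (mean vector, covariance).
z = np.array([1.0, 1.0]) / math.sqrt(2.0)
class1 = (np.array([0.0, 0.0]), np.eye(2))
class2 = (np.array([3.0, 3.0]), 2.0 * np.eye(2))

print(classify_projected(np.array([0.5, 0.2]), z, class1, class2))
print(classify_projected(np.array([3.0, 2.5]), z, class1, class2))
```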
However, the classifier defined by (13) is not the Bayes classifier unless $z$ is the “best” linear transformation vector, which can be found by solving the following, given the classifier g(•) defined by equation (13):

$$z^* = \underset{z \in E^d,\ \|z\| = 1}{\arg\min}\ \Pr\{g(z^T X) \neq L\}. \tag{15}$$

A universal solution for the problem defined by equation (15) is not obvious, and Antos (see “Lower Bounds for Bayes Error Estimation,” IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, Antos, Devroye et al. (1999)) provides a proof that no one can “claim to have a universally superior feature selection” method. That is to say, claiming to have a feature selection method that finds the optimal set of features for any pattern recognition problem defined by any distribution type $(X, L)$ is not possible.
The most obvious approach to solving the problem defined by equation (15) would consist of defining the error probability function for the classifier and then minimizing that function, that is, generating an analytic form for $\Pr\{g(z^T X) \neq L\}$. However, this approach of finding an analytic form for the error probability function has its flaws. In general, the complexity and form of the error probability function depend on the distribution for $(X, L)$. Even when the distributions are normal, one is hard pressed to find an analytic solution for equation (15).
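For very small problems, the minimization in equation (15) can at least be approximated numerically. The following sketch, which is not a method described in the original text, draws samples from two hypothetical normal classes in two dimensions, sweeps unit vectors $z$ over a grid of angles, and keeps the direction with the lowest empirical error for the projected classifier. The sample size, class parameters, grid resolution, and names are all assumptions.

```python
import numpy as np


def projected_error(z, X, labels, params):
    """Empirical error rate of the projected two-class rule for direction z."""
    y = X @ z
    scores = []
    for mean, cov in params:
        mu, var = z @ mean, z @ cov @ z
        scores.append(-((y - mu) ** 2) / (2.0 * var) - 0.5 * np.log(var))
    predicted = np.where(scores[0] - scores[1] > 0, 1, 2)
    return float(np.mean(predicted != labels))


rng = np.random.default_rng(0)
shared_cov = np.array([[1.0, 0.0], [0.0, 4.0]])
params = [(np.array([0.0, 0.0]), shared_cov),
          (np.array([2.0, 0.0]), shared_cov)]

n = 500  # samples per class, so the classes are equally likely
X = np.vstack([rng.multivariate_normal(m, c, n) for m, c in params])
labels = np.repeat([1, 2], n)

# Unit vectors z = (cos a, sin a); angles beyond pi only repeat directions.
angles = np.linspace(0.0, np.pi, 181)
errors = [projected_error(np.array([np.cos(a), np.sin(a)]), X, labels, params)
          for a in angles]
best = int(np.argmin(errors))
z_best = np.array([np.cos(angles[best]), np.sin(angles[best])])

print(z_best, errors[best])
```

An exhaustive sweep of this kind scales poorly with the dimension d, which is one reason criterion functions are used instead.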
An alternative approach is to define a criterion function that, under optimal conditions, has a minimum or maximum that is correlated with the minimum of the error probability function. The selection and development of the appropriate criterion function that is correlated with the error probability function is not obvious. Several methods have been proposed and developed that address the feature selection problem. Many of these methods, such as Fisher's Linear Discriminant Analysis (LDA), Heteroscedastic LDA, Average Symmetric Divergence LDA, and Bhattacharyya LDA, positively advance the science and work well when the assumptions used to develop the method are met by the pattern recognition problem. These methods have some limitations, such as a requirement that the class covariance matrices be the same for all classes, a requirement that the class distributions be normal in order to have an analytic form for a key integral, or a need for matrix inversion, which causes potential problems when those matrices are ill conditioned. A description of the four methods previously mentioned follows.
Fisher (“The Use of Multiple Measurements in Taxonomic Problems,” ANNALS OF EUGENICS, Fisher (1936)) first introduced a method, which is known as Fisher's Linear Discriminant Analysis, that uses a linear function to discriminate between two species of flowers, Iris setosa and Iris versicolor. He proposed that the best linear function to separate the classes is the one that maximizes the ratio between the squared difference of the class means and the variance within the classes. In 1948, Rao (“The Utilization of Multiple Measurements in Problems of Biological Classification,” JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B (METHODOLOGICAL), Vol. 10, No. 2, Rao (1948)) extended this approach for classification problems with multiple classes.
Fisher's linear discriminant is found by maximizing one of the criterion functions in equation (16),

$$f(\Psi) = \frac{\left|\Psi S_B \Psi^T\right|}{\left|\Psi S_W \Psi^T\right|} \quad \text{or} \quad f(\Psi) = \frac{\left|\Psi S_T \Psi^T\right|}{\left|\Psi S_W \Psi^T\right|}. \tag{16}$$

Equation (17),

$$S_B = \frac{1}{N} \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^T \ \text{ with } \ \mu = \frac{1}{N} \sum_{i=1}^{C} n_i \mu_i, \qquad S_W = \frac{1}{N} \sum_{i=1}^{C} n_i \Sigma_i \tag{17}$$

defines the matrix $S_B$, which is the between class scatter matrix, and the matrix $S_W$, which is the within class scatter matrix, where $n_i$ is the number of observations in class i and $N$ is the total number of observations. $S_T$ is the total scatter matrix where $S_T = S_B + S_W$. If there is a limited amount of training data, then the criterion function containing the matrix $S_T$ may provide a more accurate discriminant since more data went into its estimation.
The maximization of equation (16) is obtained by computing the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$ or $S_W^{-1} S_T$. For $S_W^{-1} S_B$, there are at most rank($S_B$) eigenvectors that have non-zero eigenvalues, and rank($S_B$) is at most $C-1$, where $C$ is the number of classes to discriminate, because the $C$ class means are tied together by their overall mean. Therefore, there are at most $C-1$ linearly independent eigenvectors for the transformation matrix $\Psi$ that can be used to reduce the dimension of the problem from $E^d$ to $E^{C-1}$. Thus, the p×d matrix $\Psi_p$ is limited to $p \le C-1$.
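The eigenvector computation described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the data are synthetic, the overall mean is taken as the mean of all observations, the within class covariances are maximum likelihood estimates, and the function name `fisher_lda` is hypothetical.

```python
import numpy as np


def fisher_lda(X, labels, p):
    """Return a p x d matrix of unit-norm rows and the sorted eigenvalues
    of inv(S_W) @ S_B, using the scatter matrices of equation (17)."""
    N, d = X.shape
    mu = X.mean(axis=0)                      # overall mean of all observations
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        n_c = Xc.shape[0]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        S_B += n_c * (diff @ diff.T) / N
        S_W += n_c * np.cov(Xc, rowvar=False, bias=True) / N
    # Eigen-decomposition of inv(S_W) @ S_B without forming the inverse.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    psi = eigvecs[:, order[:p]].real.T
    psi = psi / np.linalg.norm(psi, axis=1, keepdims=True)
    return psi, eigvals.real[order]


# Synthetic example: C = 3 classes in d = 4 dimensions, 30 samples each.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0, 0.0, 0.0],
                  [3.0, 0.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0, 0.0]])
X = np.vstack([rng.standard_normal((30, 4)) + m for m in means])
labels = np.repeat([0, 1, 2], 30)

psi, eigvals = fisher_lda(X, labels, p=2)
Y = X @ psi.T                                # observations reduced to p = 2
print(psi.shape, eigvals)
```

In this example only two eigenvalues are meaningfully different from zero, in line with the limit $p \le C-1$ for three classes.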
It is known that Fisher's method for Linear Discriminant Analysis (LDA) does not achieve Bayes error probability under many conditions. One case is when the class distributions are normal with class covariance matrices that are not equal and not proportional. Kumar and Andreou (“Investigation of Silicon Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition,” ELECTRICAL AND COMPUTER ENGINEERING, Johns Hopkins Univ., Kumar (1997)) proposed a maximum likelihood approach for pattern recognition problems where Fisher's LDA does not achieve Bayes error probability. This approach is known as Heteroscedastic Discriminant Analysis (HDA). Saon et al. (MINIMUM BAYES ERROR FEATURE SELECTION FOR CONTINUOUS SPEECH RECOGNITION, Saon, Padmanabhan et al. (2000)) proposed a method very similar in form to Kumar's approach. Neither of these approaches has a closed form solution. Therefore, a solution is obtained through numerical optimization, such as a gradient ascent procedure.
Saon's approach, known as Heteroscedastic LDA, is defined by a criterion function that is the product of the squared distance to the class covariance weighted by the a-priori class probability. The function is defined as

$$H(\Psi_p)=\prod_{i=1}^{C}\left(\frac{\left|\Psi_p S_B \Psi_p^T\right|}{\left|\Psi_p \Sigma_i \Psi_p^T\right|}\right)^{p_{\omega_i}}=\frac{\left|\Psi_p S_B \Psi_p^T\right|}{\prod_{i=1}^{C}\left|\Psi_p \Sigma_i \Psi_p^T\right|^{p_{\omega_i}}} \tag{18}$$

where $S_B$ is defined by equation (17) and $p_{\omega_i}$ is the class a-priori probability. The second equality holds because the a-priori probabilities sum to one. In order to make equation (18) applicable for a numerical optimization algorithm, like a gradient ascent algorithm, its logarithm is taken, and equation (18) is rewritten as equation (19).
The criterion function that Heteroscedastic LDA must solve using a numerical optimization algorithm is defined by

$$\hat{\Psi}_p=\underset{\Psi\in\{E^p\times E^d\},\ \operatorname{rank}(\Psi)=p}{\arg\max}\left\{\log_e\left(\left|\Psi S_B\Psi^T\right|\right)-\sum_{i=1}^{C}p_{\omega_i}\log_e\left(\left|\Psi\Sigma_i\Psi^T\right|\right)\right\} \tag{19}$$

and its gradient is defined by

$$\nabla H(\Psi_p)=2\left(\Psi_p S_B\Psi_p^T\right)^{-1}\Psi_p S_B-2\sum_{i=1}^{C}p_{\omega_i}\left(\Psi_p\Sigma_i\Psi_p^T\right)^{-1}\Psi_p\Sigma_i. \tag{20}$$

Heteroscedastic LDA has the same limitations in some cases as Fisher's LDA, because it is derived from the same criterion function.
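The criterion of equation (19) and the gradient of equation (20) can be evaluated numerically as sketched below. This is only an illustration, assuming NumPy and synthetic class statistics; the function names and the finite-difference check are hypothetical additions used to confirm that the gradient expression agrees with the criterion.

```python
import numpy as np

def hda_objective(Psi, S_B, covs, priors):
    """Equation (19): log|Psi S_B Psi^T| - sum_i p_i log|Psi Sigma_i Psi^T|."""
    _, value = np.linalg.slogdet(Psi @ S_B @ Psi.T)
    for p_i, Sigma_i in zip(priors, covs):
        _, logdet_i = np.linalg.slogdet(Psi @ Sigma_i @ Psi.T)
        value -= p_i * logdet_i
    return value

def hda_gradient(Psi, S_B, covs, priors):
    """Equation (20), evaluated at Psi."""
    grad = 2.0 * np.linalg.solve(Psi @ S_B @ Psi.T, Psi @ S_B)
    for p_i, Sigma_i in zip(priors, covs):
        grad -= 2.0 * p_i * np.linalg.solve(Psi @ Sigma_i @ Psi.T, Psi @ Sigma_i)
    return grad

def numerical_gradient(f, Psi, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    grad = np.zeros_like(Psi)
    for idx in np.ndindex(*Psi.shape):
        step = np.zeros_like(Psi)
        step[idx] = eps
        grad[idx] = (f(Psi + step) - f(Psi - step)) / (2.0 * eps)
    return grad

# Synthetic problem: C = 5 classes, d = 4 features, projection to p = 2.
rng = np.random.default_rng(1)
C, d, p = 5, 4, 2
priors = np.full(C, 1.0 / C)
means = rng.normal(size=(C, d))
mu = priors @ means
S_B = sum(w * np.outer(m - mu, m - mu) for w, m in zip(priors, means))
covs = []
for _ in range(C):
    A = rng.normal(size=(d, d))
    covs.append(A @ A.T + d * np.eye(d))   # symmetric positive definite
Psi = rng.normal(size=(p, d))

analytic = hda_gradient(Psi, S_B, covs, priors)
numeric = numerical_gradient(lambda M: hda_objective(M, S_B, covs, priors), Psi)
max_gap = np.max(np.abs(analytic - numeric))
```

A gradient ascent procedure would repeatedly add a small multiple of `hda_gradient` to `Psi`; the step size and stopping rule are left open here.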
It is known that the divergence between two distributions for the class models is a reasonable measure of class separability, or the “degree of difficulty” of discriminating between two classes. Ultimately, if the divergence between two distributions could be modified to make it a metric, then it would be very useful for classification. The best that can be done is to make the measure positive and symmetric. Kullback (INFORMATION THEORY AND STATISTICS, Kullback (1968)) defines the symmetric divergence between two classes with density functions $p_i(x)$ and $p_j(x)$ as

$$D(i,j)=\int p_i(x)\log\frac{p_i(x)}{p_j(x)}\,dx-\int p_j(x)\log\frac{p_i(x)}{p_j(x)}\,dx. \tag{21}$$
The use of the divergence measure as a way to find the linear transformation matrix has a long history. Tou and Heydorn proposed the method for a two-class problem in 1967. Babu and Kalra extended the approach to handle multiple classes in 1972. Decell and Quirein provided a correct definition for the gradient of the divergence and provided some mathematical analysis for this technique. Saon and Padmanabhan applied this technique to a voice mail transcription task.
In order to utilize a symmetric divergence measure as a method to find a linear transformation matrix, the following assumptions about the classification problem are made. First, by assuming normal class conditional densities, that is, $p_i(x)=N(x;\mu_i,\Sigma_i)$, a closed form solution for the integrals of equation (21) is established and defined as

$$D(i,j)=\frac{1}{2}\operatorname{trace}\left\{\Sigma_i^{-1}\left[\Sigma_j+(\mu_i-\mu_j)(\mu_i-\mu_j)^T\right]\right\}+\frac{1}{2}\operatorname{trace}\left\{\Sigma_j^{-1}\left[\Sigma_i+(\mu_i-\mu_j)(\mu_i-\mu_j)^T\right]\right\}-d. \tag{22}$$

Second, by defining the average pair-wise symmetric divergence, this method is extended to handle classification problems with more than two classes.
The average pair-wise symmetric divergence is defined as

$$\bar{D}=\frac{2}{C(C-1)}\sum_{1\le i\le j\le C}D(i,j). \tag{23}$$

The terms with $i=j$ do not contribute, because $D(i,i)=0$. By assuming normal class distributions and creating a criterion function that is an average, the scope of problems for which this approach may be optimal has been limited.
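As a check on the closed form of equation (22), the sketch below evaluates it for two one-dimensional normal densities and compares the result with a direct numerical evaluation of the integrals in equation (21). NumPy, the function names, the integration grid, and the example parameters are all assumptions made for illustration.

```python
import numpy as np

def symmetric_divergence(mu_i, cov_i, mu_j, cov_j):
    """Closed form of equation (22) for two normal densities."""
    mu_i = np.atleast_1d(mu_i).astype(float)
    mu_j = np.atleast_1d(mu_j).astype(float)
    cov_i = np.atleast_2d(cov_i).astype(float)
    cov_j = np.atleast_2d(cov_j).astype(float)
    d = mu_i.size
    delta = np.outer(mu_i - mu_j, mu_i - mu_j)
    term_i = np.trace(np.linalg.solve(cov_i, cov_j + delta))
    term_j = np.trace(np.linalg.solve(cov_j, cov_i + delta))
    return 0.5 * term_i + 0.5 * term_j - d

def normal_log_pdf(x, mean, var):
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# Class i: N(0, 1); class j: N(1, 4).
closed_form = symmetric_divergence(0.0, 1.0, 1.0, 4.0)

# Equation (21) by a Riemann sum on a wide, fine grid.
x = np.linspace(-30.0, 30.0, 200001)
dx = x[1] - x[0]
log_p_i = normal_log_pdf(x, 0.0, 1.0)
log_p_j = normal_log_pdf(x, 1.0, 4.0)
numeric = np.sum((np.exp(log_p_i) - np.exp(log_p_j)) * (log_p_i - log_p_j)) * dx
```

For these parameters both computations give a divergence of 1.75, and swapping the two classes leaves the value unchanged.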
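Given a $p\times d$ matrix $\Psi_p$ and with the assumption of normal class conditional densities, $p_i(x)=N(x;\mu_i,\Sigma_i)$, the class conditional densities in the projected space are also normal, that is, $p_i(y=\Psi_p x)=N(y;\Psi_p\mu_i,\Psi_p\Sigma_i\Psi_p^T)$. The average pair-wise divergence of equation (23) in the projected space can be rewritten in terms of $\Psi_p$, such that

$$\bar{D}(\Psi_p)=\frac{1}{C(C-1)}\operatorname{trace}\left\{\sum_{i=1}^{C}\left(\Psi_p\Sigma_i\Psi_p^T\right)^{-1}\Psi_p S_i\Psi_p^T\right\}-p \tag{24}$$

where

$$S_i=\sum_{j\ne i}\left[\Sigma_j+(\mu_i-\mu_j)(\mu_i-\mu_j)^T\right].$$

The normalization $1/(C(C-1))$ follows from substituting equation (22) into equation (23): each unordered class pair contributes two trace terms weighted by $\tfrac{1}{2}$, which are collected into the $S_i$. $\bar{D}(\Psi_p)$ is known as the projected average pair-wise divergence.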
Since the symmetric divergence $D(i,j)$ is positive, the average symmetric divergence $\bar{D}$ is positive. By the data processing inequality (ELEMENTS OF INFORMATION THEORY, Cover and Thomas (1991)), a projection cannot increase the divergence, so $\bar{D}(\Psi_p)$ is positive and no larger than $\bar{D}$. Therefore, $\Psi_p$ can be found by maximizing $\bar{D}(\Psi_p)$, because the objective is to make the difference $\bar{D}-\bar{D}(\Psi_p)$ as small as possible. This can be formulated as

$$\hat{\Psi}_p=\underset{\Psi\in\{E^p\times E^d\},\ \operatorname{rank}(\Psi)=p}{\arg\max}\ \bar{D}(\Psi). \tag{25}$$
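The relationship between the pair-wise divergence, its average, and the projected average can be examined numerically with the sketch below. It assumes NumPy and synthetic class statistics, and all function names are hypothetical. The trace term is normalized by $1/(C(C-1))$, which is what results from summing the closed form of equation (22) over all class pairs and averaging as in equation (23); the last lines compare the two computations for $\Psi=I$ and for a random projection.

```python
import numpy as np
from itertools import combinations

def pair_divergence(mu_i, cov_i, mu_j, cov_j):
    """Closed-form symmetric divergence between two normal densities."""
    delta = np.outer(mu_i - mu_j, mu_i - mu_j)
    t_i = np.trace(np.linalg.solve(cov_i, cov_j + delta))
    t_j = np.trace(np.linalg.solve(cov_j, cov_i + delta))
    return 0.5 * t_i + 0.5 * t_j - mu_i.size

def average_divergence(means, covs):
    """Average over all class pairs i < j (pairs with i == j contribute zero)."""
    C = len(means)
    total = sum(pair_divergence(means[i], covs[i], means[j], covs[j])
                for i, j in combinations(range(C), 2))
    return 2.0 * total / (C * (C - 1))

def class_scatter(means, covs, i):
    """S_i = sum over j != i of [Sigma_j + (mu_i - mu_j)(mu_i - mu_j)^T]."""
    S = np.zeros_like(covs[i])
    for j in range(len(means)):
        if j != i:
            diff = means[i] - means[j]
            S += covs[j] + np.outer(diff, diff)
    return S

def projected_average_divergence(Psi, means, covs):
    """Trace form of the projected average pair-wise divergence."""
    C, p = len(means), Psi.shape[0]
    total = 0.0
    for i in range(C):
        S_i = class_scatter(means, covs, i)
        total += np.trace(np.linalg.solve(Psi @ covs[i] @ Psi.T, Psi @ S_i @ Psi.T))
    return total / (C * (C - 1)) - p

# Synthetic problem: C = 3 classes, d = 4 features, projection to p = 2.
rng = np.random.default_rng(2)
C, d, p = 3, 4, 2
means = [rng.normal(size=d) for _ in range(C)]
covs = []
for _ in range(C):
    A = rng.normal(size=(d, d))
    covs.append(A @ A.T + d * np.eye(d))
Psi = rng.normal(size=(p, d))

full = average_divergence(means, covs)
full_trace_form = projected_average_divergence(np.eye(d), means, covs)
projected = projected_average_divergence(Psi, means, covs)
projected_pairwise = average_divergence([Psi @ m for m in means],
                                        [Psi @ S @ Psi.T for S in covs])
```

Here `full` and `full_trace_form` agree, `projected` and `projected_pairwise` agree, and `projected` does not exceed `full`, as the data processing inequality requires.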
The problem above can be solved by using a numerical optimization algorithm, such as gradient ascent. This type of algorithm requires the gradient of $\bar{D}(\Psi)$, which is defined as

$$\nabla\bar{D}(\Psi)=\frac{2}{C(C-1)}\sum_{i=1}^{C}\left(\Psi\Sigma_i\Psi^T\right)^{-1}\left[\Psi S_i-\Psi S_i\Psi^T\left(\Psi\Sigma_i\Psi^T\right)^{-1}\Psi\Sigma_i\right]. \tag{26}$$
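A gradient ascent on the projected average pair-wise divergence might look like the following sketch. It assumes NumPy, synthetic class statistics, a trace term normalized by $1/(C(C-1))$ (the normalization for which the gradient of equation (26) is the exact derivative), and a simple step-halving rule that is not taken from the text. All function names are hypothetical.

```python
import numpy as np

def class_scatter(means, covs, i):
    """S_i = sum over j != i of [Sigma_j + (mu_i - mu_j)(mu_i - mu_j)^T]."""
    S = np.zeros_like(covs[i])
    for j in range(len(means)):
        if j != i:
            diff = means[i] - means[j]
            S += covs[j] + np.outer(diff, diff)
    return S

def projected_divergence(Psi, covs, scatters):
    C, p = len(covs), Psi.shape[0]
    total = sum(np.trace(np.linalg.solve(Psi @ Sig @ Psi.T, Psi @ S @ Psi.T))
                for Sig, S in zip(covs, scatters))
    return total / (C * (C - 1)) - p

def divergence_gradient(Psi, covs, scatters):
    """Equation (26), evaluated at Psi."""
    C = len(covs)
    grad = np.zeros_like(Psi)
    for Sig, S in zip(covs, scatters):
        A = Psi @ Sig @ Psi.T
        inner = Psi @ S - Psi @ S @ Psi.T @ np.linalg.solve(A, Psi @ Sig)
        grad += np.linalg.solve(A, inner)
    return 2.0 * grad / (C * (C - 1))

def gradient_ascent(Psi, covs, scatters, steps=50, rate=0.1):
    """Ascent with step halving: a step is accepted only if it improves the value."""
    value = projected_divergence(Psi, covs, scatters)
    for _ in range(steps):
        grad = divergence_gradient(Psi, covs, scatters)
        step = rate
        while step > 1e-12:
            candidate = Psi + step * grad
            candidate_value = projected_divergence(candidate, covs, scatters)
            if candidate_value > value:
                Psi, value = candidate, candidate_value
                break
            step *= 0.5
    return Psi, value

# Synthetic problem: C = 3 classes, d = 4 features, projection to p = 2.
rng = np.random.default_rng(3)
C, d, p = 3, 4, 2
means = [rng.normal(size=d) for _ in range(C)]
covs = []
for _ in range(C):
    A = rng.normal(size=(d, d))
    covs.append(A @ A.T + d * np.eye(d))
scatters = [class_scatter(means, covs, i) for i in range(C)]
Psi0 = rng.normal(size=(p, d))

# Finite-difference check of the gradient at the starting point.
analytic = divergence_gradient(Psi0, covs, scatters)
numeric = np.zeros_like(Psi0)
eps = 1e-6
for idx in np.ndindex(*Psi0.shape):
    e = np.zeros_like(Psi0)
    e[idx] = eps
    numeric[idx] = (projected_divergence(Psi0 + e, covs, scatters)
                    - projected_divergence(Psi0 - e, covs, scatters)) / (2.0 * eps)

start_value = projected_divergence(Psi0, covs, scatters)
Psi_hat, final_value = gradient_ascent(Psi0, covs, scatters)
full_value = projected_divergence(np.eye(d), covs, scatters)
```

The ascent never decreases the criterion, and the value it reaches stays below the divergence of the unprojected problem, `full_value`.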
When it is not possible to obtain a closed-form expression for the error probability, then it is reasonable to define an upper bound for that error. The Bhattacharyya bound, a special case of the Chernoff bound, does just that when the class conditional densities are assumed to have normal distributions. The use of the Bhattacharyya distance as a basis to find a linear transformation for pattern recognition problems with two classes is described by Fukunaga. Saon and Padmanabhan proposed a method to find a p×d matrix Ψp based on the union of Bhattacharyya upper bounds. Their criterion function can handle pattern recognition problems with more than two classes.
The Chernoff bound is an upper bound on the Bayes error that makes use of the geometric mean. The Bayes error for a two-class problem is

$$p(\text{error})=\int p(\text{error},x)\,dx=\int p(\text{error}\mid x)\,p(x)\,dx \tag{27}$$

where the error probability given $x$ is the minimum of the class posterior probabilities under the Bayes decision rule, that is, $p(\text{error}\mid x)=\min[p(\omega_i\mid x),p(\omega_j\mid x)]$. Since the probability distribution functions are non-negative, then by the geometric mean inequality, $\min[a,b]\le a^{s}b^{1-s}$ with $s\in[0,1]$, the Bayes error can be bounded above by

$$p(\text{error})\le p_{\omega_i}^{s}\,p_{\omega_j}^{1-s}\int p_i^{s}(x)\,p_j^{1-s}(x)\,dx \tag{28}$$

where $p_{\omega_i}$, $p_{\omega_j}$ are the class a-priori probabilities and $p_i(x)$, $p_j(x)$ are the class conditional densities.
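The bound of equation (28) can be illustrated numerically for two one-dimensional normal classes, as in the sketch below. NumPy, the grid, the priors, and the class parameters are assumptions chosen for illustration; the integrals are approximated by Riemann sums rather than evaluated in closed form.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

x = np.linspace(-20.0, 20.0, 40001)
dx = x[1] - x[0]

prior_i, prior_j = 0.6, 0.4
joint_i = prior_i * normal_pdf(x, 0.0, 1.0)   # p_wi * p_i(x)
joint_j = prior_j * normal_pdf(x, 2.0, 2.0)   # p_wj * p_j(x)

# Equation (27): the Bayes error integrates the smaller of the two joint densities.
bayes_error = np.sum(np.minimum(joint_i, joint_j)) * dx

def upper_bound(s):
    """Right-hand side of equation (28), using min[a, b] <= a^s b^(1-s)."""
    return np.sum(joint_i ** s * joint_j ** (1.0 - s)) * dx

bounds = {s: upper_bound(s) for s in (0.1, 0.3, 0.5, 0.7, 0.9)}
tightest_s = min(bounds, key=bounds.get)
```

Every entry of `bounds` lies above `bayes_error`; how close the tightest one comes depends on the class parameters and on the value of `s`.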
The integral found in equation (28) has a closed form solution when the class conditional density functions are normal, that is, $p_i(x)=N(x;\mu_i,\Sigma_i)$. This is called the Chernoff bound and is defined as

$$\varepsilon_{\tau}(i,j;s)=p_{\omega_i}^{s}\,p_{\omega_j}^{1-s}\,e^{-\tau(i,j;s)} \tag{29}$$

where the Chernoff distance between two classes is defined as

$$\tau(i,j;s)=\frac{s(1-s)}{2}(\mu_i-\mu_j)^T\left[s\Sigma_i+(1-s)\Sigma_j\right]^{-1}(\mu_i-\mu_j)+\frac{1}{2}\ln\frac{\left|s\Sigma_i+(1-s)\Sigma_j\right|}{\left|\Sigma_i\right|^{s}\left|\Sigma_j\right|^{1-s}}. \tag{30}$$
The Bhattacharyya bound is the special case of the Chernoff bound obtained with τ(i, j; s=½). Fixing the value of s makes the Bhattacharyya bound less complicated to compute, but it means that the optimum Chernoff bound, defined by some s in the interval [0,1], may not be used. The Bhattacharyya distance between two classes is defined as

$$\rho(i,j) = \frac{1}{8}(\mu_i-\mu_j)^T\left[\frac{\Sigma_i+\Sigma_j}{2}\right]^{-1}(\mu_i-\mu_j) + \frac{1}{2}\log_e\frac{\left|\frac{\Sigma_i+\Sigma_j}{2}\right|}{\sqrt{|\Sigma_i|\,|\Sigma_j|}} \tag{31}$$

and the Bhattacharyya bound on the error probability is defined as

$$\varepsilon_{\rho(i,j)} = \sqrt{p_{\omega_i}\,p_{\omega_j}}\;e^{-\rho(i,j)}. \tag{32}$$
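Equations (31) and (32) can be evaluated directly once the means, covariances and prior probabilities of two classes are known. The following is a minimal sketch in Python with NumPy, assuming normal class models; the function names `bhattacharyya_distance` and `bhattacharyya_bound` and the example values are illustrative assumptions and are not part of the described method.

```python
import numpy as np

def bhattacharyya_distance(mu_i, sigma_i, mu_j, sigma_j):
    """Bhattacharyya distance of equation (31) between two normal class models."""
    diff = mu_i - mu_j
    w = 0.5 * (sigma_i + sigma_j)  # averaged covariance (Sigma_i + Sigma_j) / 2
    # Quadratic term: (mu_i - mu_j)^T W^{-1} (mu_i - mu_j), solved without an explicit inverse
    quad = diff @ np.linalg.solve(w, diff)
    # Log-determinants via slogdet to avoid overflow/underflow of raw determinants
    logdet_w = np.linalg.slogdet(w)[1]
    logdet_i = np.linalg.slogdet(sigma_i)[1]
    logdet_j = np.linalg.slogdet(sigma_j)[1]
    return 0.125 * quad + 0.5 * (logdet_w - 0.5 * (logdet_i + logdet_j))

def bhattacharyya_bound(prior_i, prior_j, rho):
    """Two-class bound of equation (32): sqrt(p_i * p_j) * exp(-rho)."""
    return np.sqrt(prior_i * prior_j) * np.exp(-rho)

# Hypothetical example: two unit-covariance classes whose means are two units apart
mu_a = np.array([0.0, 0.0])
mu_b = np.array([2.0, 0.0])
cov = np.eye(2)

rho = bhattacharyya_distance(mu_a, cov, mu_b, cov)
bound = bhattacharyya_bound(0.5, 0.5, rho)
print(rho, bound)
```

In this example the covariances are equal, so the log-determinant term vanishes and the distance reduces to the quadratic term, 4/8 = 0.5.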
The Bhattacharyya bound given by equation (32) is an upper bound on the Bayes error between two classes. In order to make this bound applicable to pattern recognition problems with more than two classes, an upper bound on the Bayes error for multiple classes is needed. Saon and Padmanabhan show that the following is an upper bound on the Bayes error for pattern recognition problems with more than two classes.
Upon inspection of equation (33),

$$P(\text{error}) \le \sum_{1\le i<j\le C}\sqrt{p_{\omega_i}\,p_{\omega_j}}\int\sqrt{p_i(x)\,p_j(x)}\;dx, \tag{33}$$

it can be seen that terms of the form of equation (32) are summed: for normal class conditional densities, the integral in each term equals $e^{-\rho(i,j)}$. Taking the union of the Bhattacharyya bounds defined by (32) therefore gives an upper bound for the summation in (33).
Thus, the union of Bhattacharyya bounds is defined as

$$P(\text{error}) \le \sum_{1\le i<j\le C}\sqrt{p_{\omega_i}\,p_{\omega_j}}\;e^{-\rho(i,j)}. \tag{34}$$
Equation (34) forms the basis for the criterion function that can be used to find a p×d matrix $\Psi_p$ that will be used to reduce the dimension of the observation space. Given a matrix $\Psi_p$, and with the assumption that the class conditional densities are normal, $p_i(x)=N(x;\mu_i,\Sigma_i)$, the class conditional densities in the projected space are also normal, that is, $p_i(y=\Psi_p x)=N(y;\Psi_p\mu_i,\Psi_p\Sigma_i\Psi_p^T)$. The Bhattacharyya distance in the projected space between two classes becomes

$$\rho_{\Psi_p}(i,j) = \frac{1}{8}\,\mathrm{trace}\left\{(\Psi_p W_{ij}\Psi_p^T)^{-1}\,\Psi_p S_{ij}\Psi_p^T\right\} + \frac{1}{2}\ln\frac{\left|\Psi_p W_{ij}\Psi_p^T\right|}{\sqrt{\left|\Psi_p\Sigma_i\Psi_p^T\right|\,\left|\Psi_p\Sigma_j\Psi_p^T\right|}} \tag{35}$$

where $S_{ij}=(\mu_i-\mu_j)(\mu_i-\mu_j)^T$ and $W_{ij}=\tfrac{1}{2}(\Sigma_i+\Sigma_j)$ for $1\le i<j\le C$.
The Bhattacharyya bound in the projected space is

$$B(\Psi) = \sum_{1\le i<j\le C}\sqrt{p_{\omega_i}\,p_{\omega_j}}\;e^{-\rho_{\Psi}(i,j)}. \tag{36}$$
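The projected distance of equation (35) and the bound of equation (36) can be computed for any candidate projection matrix. The following is a minimal sketch in Python with NumPy, assuming normal class models and a full-rank projection; the function names, the three-class example and the two candidate projections are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def projected_distance(psi, mu_i, sigma_i, mu_j, sigma_j):
    """Bhattacharyya distance of equation (35) after projecting with psi (p x d)."""
    diff = mu_i - mu_j
    s_ij = np.outer(diff, diff)          # S_ij = (mu_i - mu_j)(mu_i - mu_j)^T
    w_ij = 0.5 * (sigma_i + sigma_j)     # W_ij = (Sigma_i + Sigma_j) / 2
    pw = psi @ w_ij @ psi.T
    ps = psi @ s_ij @ psi.T
    trace_term = np.trace(np.linalg.solve(pw, ps))
    logdet_w = np.linalg.slogdet(pw)[1]
    logdet_i = np.linalg.slogdet(psi @ sigma_i @ psi.T)[1]
    logdet_j = np.linalg.slogdet(psi @ sigma_j @ psi.T)[1]
    return 0.125 * trace_term + 0.5 * (logdet_w - 0.5 * (logdet_i + logdet_j))

def projected_bound(psi, priors, means, covs):
    """Union of pairwise bounds, equation (36), summed over all pairs i < j."""
    total = 0.0
    for i, j in combinations(range(len(priors)), 2):
        rho = projected_distance(psi, means[i], covs[i], means[j], covs[j])
        total += np.sqrt(priors[i] * priors[j]) * np.exp(-rho)
    return total

# Hypothetical example: three equiprobable classes separated only along the first axis
priors = [1 / 3, 1 / 3, 1 / 3]
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.eye(2)] * 3

bound_x = projected_bound(np.array([[1.0, 0.0]]), priors, means, covs)  # keep axis 1
bound_y = projected_bound(np.array([[0.0, 1.0]]), priors, means, covs)  # keep axis 2
bound_full = projected_bound(np.eye(2), priors, means, covs)            # no reduction
print(bound_x, bound_y, bound_full)
```

With the identity matrix as the projection, `projected_bound` reduces to the unprojected union bound of equation (34). In this example, keeping only the second axis collapses all class means onto one point and the bound rises to 1, while keeping the first axis preserves the bound of the full space.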
Given that B(Ψ) in equation (36) is an upper bound on the Bayes error probability, the problem to be solved is to find a matrix $\Psi_p$ that minimizes equation (36). Thus,

$$\hat{\Psi}_p = \underset{\Psi\in\{E_p\times E_d\},\ \mathrm{rank}(\Psi)=p}{\arg\min}\; B(\Psi), \tag{37}$$

is solved by using a numerical optimization algorithm, such as gradient descent.
This type of numerical optimization algorithm requires the gradient of B(Ψ), which is defined as

$$\nabla B(\Psi) = -\sum_{1\le i<j\le C}\sqrt{p_{\omega_i}\,p_{\omega_j}}\;\exp\left(-\rho_{\Psi}(i,j)\right)\nabla\rho_{\Psi}(i,j) \tag{38}$$

where

$$\nabla\rho_{\Psi}(i,j) = \frac{1}{2}(\Psi W_{ij}\Psi^T)^{-1}\left[\Psi S_{ij} - \Psi S_{ij}\Psi^T(\Psi W_{ij}\Psi^T)^{-1}\Psi W_{ij}\right] + (\Psi W_{ij}\Psi^T)^{-1}\Psi W_{ij} - \frac{1}{2}\left[(\Psi\Sigma_i\Psi^T)^{-1}\Psi\Sigma_i + (\Psi\Sigma_j\Psi^T)^{-1}\Psi\Sigma_j\right]. \tag{39}$$
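The minimization of equation (37) can be illustrated with a small gradient descent loop. The following is a minimal sketch in Python with NumPy under several stated assumptions: the gradient is approximated by central finite differences as a stand-in for the analytic expressions (38) and (39), the step size is halved whenever a step fails to lower the bound, and the rank constraint is not enforced explicitly. The function names, step size, iteration count and example data are all hypothetical.

```python
import numpy as np
from itertools import combinations

def projected_bound(psi, priors, means, covs):
    """B(psi) of equation (36) for normal class models."""
    total = 0.0
    for i, j in combinations(range(len(priors)), 2):
        diff = means[i] - means[j]
        w = psi @ (0.5 * (covs[i] + covs[j])) @ psi.T
        s = psi @ np.outer(diff, diff) @ psi.T
        ld_w = np.linalg.slogdet(w)[1]
        ld_i = np.linalg.slogdet(psi @ covs[i] @ psi.T)[1]
        ld_j = np.linalg.slogdet(psi @ covs[j] @ psi.T)[1]
        rho = 0.125 * np.trace(np.linalg.solve(w, s)) + 0.5 * (ld_w - 0.5 * (ld_i + ld_j))
        total += np.sqrt(priors[i] * priors[j]) * np.exp(-rho)
    return total

def numerical_gradient(f, psi, eps=1e-6):
    """Central-difference approximation of the gradient of f at psi."""
    grad = np.zeros_like(psi)
    for idx in np.ndindex(psi.shape):
        step = np.zeros_like(psi)
        step[idx] = eps
        grad[idx] = (f(psi + step) - f(psi - step)) / (2 * eps)
    return grad

def minimize_bound(psi0, priors, means, covs, lr=0.5, iters=200):
    """Gradient descent on B(psi) with step halving (an assumed safeguard)."""
    f = lambda m: projected_bound(m, priors, means, covs)
    psi = psi0.copy()
    for _ in range(iters):
        grad = numerical_gradient(f, psi)
        step = lr
        # Halve the step until the bound actually decreases
        while step > 1e-8 and f(psi - step * grad) >= f(psi):
            step *= 0.5
        if step <= 1e-8:
            break  # no improving step found; treat as converged
        psi = psi - step * grad
    return psi

# Hypothetical example: three equiprobable classes separated only along the first axis
priors = [1 / 3, 1 / 3, 1 / 3]
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.eye(2)] * 3

psi0 = np.array([[0.6, 0.8]])  # initial 1 x 2 projection, deliberately misaligned
psi_hat = minimize_bound(psi0, priors, means, covs)
initial = projected_bound(psi0, priors, means, covs)
final = projected_bound(psi_hat, priors, means, covs)
print(initial, final, psi_hat)
```

Because B(Ψ) is unchanged when Ψ is multiplied by an invertible p×p matrix, only the subspace spanned by the rows of the returned matrix is meaningful. In this example the descent rotates the projection toward the first axis, the only direction along which the class means differ.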