Subspace learning is becoming increasingly important in many applications such as web document classification and face recognition. Some of the most widely used linear subspace learning algorithms are Principal Component Analysis (“PCA”), Linear Discriminant Analysis (“LDA”), and Maximum Margin Criterion (“MMC”). These subspace learning algorithms learn a projection from an original high dimensional space to a low dimensional space. For example, a web document may be represented by a feature vector with an entry for each possible word that can be contained in a web document. Thus, the dimension of the feature vector would be one million if there were one million possible words. It would be impracticable to train a classifier of web documents using training data with such a large feature vector. A subspace learning algorithm, however, can project the training data from a high dimensional space to a low dimensional space so that the training data in the low dimensional space can be used to train a classifier.
PCA is an unsupervised subspace learning algorithm that attempts to find the geometrical structure of the training data and projects the training data along the directions of maximal variance. PCA, however, does not use the classification information of the training data. LDA, in contrast, is a supervised subspace learning algorithm that searches for the projection axes on which data points of different classifications are far from each other while data points of the same classification are close to each other. MMC is also a supervised subspace learning algorithm and has the same goal as LDA. MMC, however, has a much lower computational complexity than LDA because its objective function is less complex.
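The PCA projection described above can be illustrated with a minimal sketch. It assumes NumPy and synthetic data; the function name `pca_projection`, the variable names, and the data are hypothetical choices made for illustration and are not taken from any particular implementation.

```python
import numpy as np

def pca_projection(X, p):
    """Return a d x p matrix whose columns are the p directions of
    largest variance of the samples in X (one sample per row)."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / X.shape[0]             # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:p]    # indices of the p largest eigenvalues
    return eigvecs[:, order]

# Synthetic data (assumption): 200 samples in 5 dimensions with unequal spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

W = pca_projection(X, 2)                     # projection matrix, 5 x 2
Y = (X - X.mean(axis=0)) @ W                 # data in the 2-dimensional subspace
```

Note that no class labels enter the computation, which is why the resulting directions are not necessarily the most discriminant ones.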
PCA, LDA, and MMC are batch algorithms that require that data be available in advance and be processed together. Many current applications, however, cannot be implemented effectively with a batch algorithm. For example, as classified web documents are received by a classification system, the classification system may want to revise its projection based on the newly received web documents. With a batch algorithm the classification system would need to regenerate the projection matrix using all the data that had been previously received, which can be computationally very expensive. Thus, an incremental technique would be useful to compute the adaptive subspace for data that is received sequentially. One such incremental algorithm is Incremental PCA (“IPCA”). IPCA, however, ignores the classification information, and the features derived by IPCA may not be the most discriminant ones.
LDA generates a linear projection matrix as represented by the following equation:

$$W \in \mathbb{R}^{d \times p} \tag{1}$$

where $W$ is the projection matrix, $d$ is the dimension of the original high dimensional space, and $p$ is the dimension of the low dimensional space. LDA attempts to maximize the Fisher criterion as represented by the following equation:
$$J(W) = \frac{W^T S_b W}{W^T S_w W} \tag{2}$$

where

$$S_b = \sum_{i=1}^{c} p_i (m_i - m)(m_i - m)^T \tag{3}$$

$$S_w = \sum_{i=1}^{c} p_i \, E\left[(u_i - m_i)(u_i - m_i)^T\right] \tag{4}$$

where $S_b$ is the inter-class scatter matrix, $S_w$ is the intra-class scatter matrix, $c$ is the number of classes, $m$ is the mean of all samples, $m_i$ is the mean of the samples belonging to class $i$, $p_i$ is the prior probability of a sample belonging to class $i$, $u_i$ denotes the data samples of class $i$, and $E$ is the expectation taken over the samples of class $i$. LDA obtains the projection matrix $W$ by solving the generalized eigenvector decomposition problem represented by the following equation:

$$S_b w = \lambda S_w w \tag{5}$$

Since there are at most $c-1$ nonzero eigenvalues, the upper bound of $p$ is $c-1$. Moreover, at least $d+c$ data samples are required to ensure that $S_w$ is not singular. These constraints limit the application of LDA. Furthermore, it is difficult for LDA to handle a large training set when the dimension of the feature space is high.
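The scatter matrices and the generalized eigenvector problem above can be sketched in batch form as follows. This is an illustrative sketch only: it assumes NumPy, estimates each prior as the class frequency in the training data, replaces the expectation with a sample average, and uses synthetic data. The function names `scatter_matrices` and `lda_projection` are hypothetical.

```python
import numpy as np

def scatter_matrices(X, y):
    """Estimate the inter-class (Sb) and intra-class (Sw) scatter matrices
    from samples X (one per row) and integer class labels y."""
    n, d = X.shape
    m = X.mean(axis=0)                       # mean of all samples
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for label in np.unique(y):
        Xi = X[y == label]                   # samples of class i
        p_i = Xi.shape[0] / n                # prior estimated as class frequency (assumption)
        m_i = Xi.mean(axis=0)                # class mean
        diff = (m_i - m)[:, None]
        Sb += p_i * (diff @ diff.T)
        Xc = Xi - m_i
        Sw += p_i * (Xc.T @ Xc) / Xi.shape[0]   # sample average in place of E[...]
    return Sb, Sw

def lda_projection(X, y, p):
    """Solve Sb w = lambda Sw w and keep the p leading eigenvectors.
    Requires Sw to be nonsingular."""
    Sb, Sw = scatter_matrices(X, y)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:p]
    return eigvecs[:, order].real, eigvals.real[order]

# Synthetic data (assumption): c = 3 classes in d = 4 dimensions.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
X = np.vstack([rng.normal(loc=mu, size=(50, 4)) for mu in means])
y = np.repeat([0, 1, 2], 50)

Sb, Sw = scatter_matrices(X, y)
W, lam = lda_projection(X, y, p=2)           # p is bounded by c - 1 = 2
```

With three classes, Sb has rank at most two, so only two useful projection axes exist. The call to `np.linalg.solve` is the step that fails when Sw is singular, and the whole computation must be repeated over all samples whenever new data arrives.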
MMC uses a different and more computationally efficient objective function, or feature extraction criterion. Using the same representation as LDA, the goal of MMC is to maximize the criterion represented by the following equation:

$$J(W) = W^T (S_b - S_w) W \tag{6}$$

The subtraction of the scatter matrices by MMC can be performed in a more computationally efficient manner than the division of the scatter matrices by LDA. Although both MMC and LDA are supervised subspace learning algorithms, the computation of MMC is easier than that of LDA because MMC does not require a matrix inverse operation. The projection matrix $W$ can be obtained by solving the eigenvector decomposition problem represented by the following equation:

$$(S_b - S_w) w = \lambda w \tag{7}$$

Nevertheless, both LDA and MMC are batch algorithms. It would be desirable to have an incremental algorithm that applies the principles of LDA and MMC.
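The batch MMC computation can be sketched as an ordinary symmetric eigenvector problem on the difference of the two scatter matrices. The sketch below assumes NumPy; the function name `mmc_projection` and the small diagonal scatter matrices are hypothetical values chosen only so that the result is easy to check by hand.

```python
import numpy as np

def mmc_projection(Sb, Sw, p):
    """Return the p leading eigenvectors (as columns) and eigenvalues
    of the symmetric matrix Sb - Sw. No inverse of Sw is computed."""
    eigvals, eigvecs = np.linalg.eigh(Sb - Sw)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:p]        # indices of the p largest eigenvalues
    return eigvecs[:, order], eigvals[order]

# Toy scatter matrices (assumption). Sw is deliberately singular.
Sb = np.diag([4.0, 1.0, 0.0])
Sw = np.diag([1.0, 0.5, 0.0])

W, lam = mmc_projection(Sb, Sw, p=2)
J = W.T @ (Sb - Sw) @ W                          # criterion matrix evaluated at W
```

In this toy case Sb - Sw is diag(3, 0.5, 0), so the two leading eigenvalues are 3 and 0.5. Because only a subtraction and a symmetric eigendecomposition are involved, the computation proceeds even though Sw here has no inverse.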