Conventionally, in the field of pattern recognition, the similarity between patterns such as characters or human faces has been determined by extracting feature vectors from the input patterns, extracting from those vectors the feature vectors that are effective for identification, and comparing the feature vectors obtained from the respective patterns.
In the case of face verification, for example, pixel values of a facial image normalized with the positions of the eyes or the like are raster-scanned to transform the pixel values into a one-dimensional feature vector, and the principal component analysis is performed by using this feature vector as an input feature vector (non-patent reference 1: Moghaddam et al., “Probabilistic Visual Learning for Object Detection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 7, pp. 696-710, 1997) or linear discriminant analysis is performed on the principal components of the feature vector (non-patent reference 2: W. Zhao et al., “Discriminant Analysis of Principal Components for Face Recognition”, Proceedings of the IEEE Third International Conference on Automatic Face and Gesture Recognition, pp. 336-341, 1998), thereby reducing dimensions and performing personal identification or the like based on faces by using obtained feature vectors.
In these methods, covariance matrices, within-class covariance matrices, and between-class covariance matrices are calculated with respect to prepared learning samples, and basis vectors are obtained as solutions to the eigenvalue problems in the covariance matrices. The features of input feature vectors are then transformed by using these basis vectors.
Linear discriminant analysis will be described in more detail below.
Linear discriminant analysis is a method of obtaining a transformation matrix W that maximizes the ratio of the between-class covariance matrix SB to the within-class covariance matrix SW of the M-dimensional vector y (= W^T x) obtained when an N-dimensional feature vector x is transformed by W. The evaluation function for this covariance ratio is defined by equation (1):
$$J(W) = \frac{|S_B|}{|S_W|} = \frac{|W^T \Sigma_B W|}{|W^T \Sigma_W W|} \tag{1}$$
In this equation, the within-class covariance matrix ΣW is the average, weighted by the a priori probabilities P(ωi), of the covariance matrices Σi of the C classes ωi (i = 1, 2, . . . , C; class ωi contains ni samples) in the set of feature vectors x of the learning sample, and the between-class covariance matrix ΣB is the covariance matrix between the classes. They are respectively represented by:
$$\Sigma_W = \sum_{i=1}^{C} P(\omega_i)\,\Sigma_i = \sum_{i=1}^{C} \left( P(\omega_i)\,\frac{1}{n_i} \sum_{x \in \omega_i} (x - m_i)(x - m_i)^T \right) \tag{2}$$

$$\Sigma_B = \sum_{i=1}^{C} P(\omega_i)\,(m_i - m)(m_i - m)^T \tag{3}$$

where mi is the mean vector of class ωi (equation (4)), and m is the mean vector of all x (equation (5)):
$$m_i = \frac{1}{n_i} \sum_{x \in \omega_i} x \tag{4}$$

$$m = \sum_{i=1}^{C} P(\omega_i)\, m_i \tag{5}$$
If the a priori probability P(ωi) of each class ωi is to reflect the sample count ni, it suffices to assume P(ωi) = ni/n, where n is the total number of learning samples. If the probabilities can be assumed to be equal, it suffices to set P(ωi) = 1/C.
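The covariance matrices of equations (2) to (5) can be computed directly from labeled learning samples. The following is a minimal NumPy sketch, not the method of any cited reference; the function name `scatter_matrices`, its `equal_priors` flag, and the random demo data are assumptions made for illustration.

```python
import numpy as np

def scatter_matrices(X, labels, equal_priors=False):
    """Return (Sigma_W, Sigma_B) as in equations (2) and (3).

    X is an (n, N) array of feature vectors, labels an (n,) array of
    class labels. Priors are n_i/n by default, or 1/C if equal_priors.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, C = len(X), len(classes)

    means, covs, priors = [], [], []
    for c in classes:
        Xc = X[labels == c]
        m_i = Xc.mean(axis=0)                 # equation (4)
        D = Xc - m_i
        covs.append(D.T @ D / len(Xc))        # class covariance Sigma_i
        means.append(m_i)
        priors.append(1.0 / C if equal_priors else len(Xc) / n)

    m = sum(p * mi for p, mi in zip(priors, means))            # equation (5)
    Sigma_W = sum(p * S for p, S in zip(priors, covs))         # equation (2)
    Sigma_B = sum(p * np.outer(mi - m, mi - m)
                  for p, mi in zip(priors, means))             # equation (3)
    return Sigma_W, Sigma_B

# Hypothetical demo data: 3 classes, 10 samples each, 4 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
labels = np.repeat([0, 1, 2], 10)
Sigma_W, Sigma_B = scatter_matrices(X, labels)
```

With the priors P(ωi) = ni/n, the sum ΣW + ΣB equals the total covariance matrix of the samples (normalized by n), which gives a convenient sanity check on an implementation.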
The transformation matrix W that maximizes equation (1) can be obtained as the set of generalized eigenvectors corresponding to the M largest eigenvalues of equation (6), which is an eigenvalue problem for the column vectors wi of W. The transformation matrix W obtained in this manner will be referred to as a discriminant matrix.
$$\Sigma_B w_i = \lambda_i \Sigma_W w_i \tag{6}$$
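One common way to solve a generalized eigenvalue problem of this form, when the within-class covariance matrix is symmetric positive definite, is to reduce it to an ordinary symmetric eigenvalue problem with a Cholesky factorization. The sketch below uses that approach as an assumption; the cited references do not prescribe a particular solver, and the function name `discriminant_matrix` and the small example matrices are hypothetical.

```python
import numpy as np

def discriminant_matrix(Sigma_W, Sigma_B, M):
    """Solve Sigma_B w = lambda Sigma_W w and keep the M largest eigenvalues.

    Assumes Sigma_W is symmetric positive definite (otherwise the
    Cholesky factorization raises numpy.linalg.LinAlgError).
    Returns (W, lam): W has the eigenvectors as columns.
    """
    L = np.linalg.cholesky(Sigma_W)          # Sigma_W = L L^T
    L_inv = np.linalg.inv(L)
    A = L_inv @ Sigma_B @ L_inv.T            # symmetric matrix, same eigenvalues
    A = (A + A.T) / 2                        # remove round-off asymmetry
    lam, V = np.linalg.eigh(A)               # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:M]        # indices of the M largest
    W = L_inv.T @ V[:, order]                # back-substitute: w = L^{-T} v
    return W, lam[order]

# Hypothetical 3-dimensional example.
Sigma_W = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.2],
                    [0.0, 0.2, 1.5]])
d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([0.0, 1.0, -1.0])
Sigma_B = 0.5 * np.outer(d1, d1) + 0.5 * np.outer(d2, d2)

W, lam = discriminant_matrix(Sigma_W, Sigma_B, M=2)
x = np.array([0.3, -1.0, 2.0])
y = W.T @ x                                  # M-dimensional feature y = W^T x
```

A side effect of this construction is that the returned columns are normalized so that W^T ΣW W is the identity matrix.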
Note that a conventional linear discriminant analysis method is disclosed in, for example, non-patent reference 5: Richard O. Duda et al., “Pattern Recognition” (supervised/translated by Morio Onoue, Shingijutu Communications, 2001, pp. 113-122).
Assume that the number of dimensions of the input feature vector x is especially large. In this case, if only a small amount of learning data is available, ΣW becomes singular, because its rank cannot exceed the number of learning samples minus the number of classes. As a consequence, the eigenvalue problem of equation (6) cannot be solved by a general method.
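The singularity can be observed numerically. The sketch below builds the within-class covariance matrix from synthetic data with far fewer samples than dimensions and checks its rank; the dimensions, class counts, and random data are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, per_class = 50, 3, 5           # 50 dimensions, 3 classes, 5 samples each
n = C * per_class                    # 15 learning samples in total
X = rng.normal(size=(n, N))
labels = np.repeat(np.arange(C), per_class)

# Within-class covariance with priors P(w_i) = n_i / n.
Sigma_W = np.zeros((N, N))
for c in range(C):
    Xc = X[labels == c]
    D = Xc - Xc.mean(axis=0)         # centering removes one degree of freedom per class
    Sigma_W += (len(Xc) / n) * (D.T @ D) / len(Xc)

rank = np.linalg.matrix_rank(Sigma_W)
print(f"rank = {rank}, upper bound n - C = {n - C}, dimension N = {N}")
```

Because each class contributes at most ni - 1 independent directions after its mean is subtracted, the rank is bounded by n - C, which here is well below N, so the matrix cannot be inverted.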
As described in patent reference 1: Japanese Patent Laid-Open No. 7-296169, it is known that a high-order component with a small eigenvalue in a covariance matrix includes a large parameter estimation error, which adversely affects recognition precision.
According to the above article by W. Zhao et al., the principal component analysis is performed on input feature vectors, and discriminant analysis is applied to principal components with large eigenvalues. More specifically, as shown in FIG. 2, after principal components are extracted by projecting an input feature vector by using a basis matrix obtained by the principal component analysis, a feature vector effective for identification is extracted by projecting principal components by using the discriminant matrix obtained by discriminant analysis as a basis matrix.
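The two-stage projection described above (principal component analysis followed by discriminant analysis on the leading principal components) can be sketched as follows. This is an illustrative NumPy outline under stated assumptions, not the implementation of W. Zhao et al.; the function names, the choice of `k` and `M`, and the synthetic data are all hypothetical.

```python
import numpy as np

def pca_basis(X, k):
    """Return the k leading eigenvectors (columns) of the total covariance."""
    Xc = X - X.mean(axis=0)
    lam, U = np.linalg.eigh(Xc.T @ Xc / len(X))
    return U[:, np.argsort(lam)[::-1][:k]]

def lda_basis(Z, labels, M):
    """Return M discriminant vectors (columns) for data Z, priors n_i/n."""
    n, k = Z.shape
    m = Z.mean(axis=0)
    S_W = np.zeros((k, k))
    S_B = np.zeros((k, k))
    for c in np.unique(labels):
        Zc = Z[labels == c]
        m_i = Zc.mean(axis=0)
        D = Zc - m_i
        S_W += D.T @ D / n                                # (n_i/n) * Sigma_i
        S_B += (len(Zc) / n) * np.outer(m_i - m, m_i - m)
    # Assumes S_W is nonsingular in the reduced space (k <= n - C).
    lam, V = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(lam.real)[::-1][:M]
    return V[:, order].real

def fit_pca_lda(X, labels, k, M):
    mean = X.mean(axis=0)
    U = pca_basis(X, k)                                   # first basis matrix
    W = lda_basis((X - mean) @ U, labels, M)              # second basis matrix
    return mean, U, W

def transform(X, mean, U, W):
    return ((X - mean) @ U) @ W                           # project twice

# Hypothetical data: 3 classes, 20 samples each, 100 dimensions.
rng = np.random.default_rng(0)
C, per_class, N = 3, 20, 100
labels = np.repeat(np.arange(C), per_class)
centers = rng.normal(size=(C, N))
X = centers[labels] + 0.5 * rng.normal(size=(C * per_class, N))

mean, U, W = fit_pca_lda(X, labels, k=20, M=2)
Y = transform(X, mean, U, W)                              # (60, 2) features
```

Keeping k no larger than the number of samples minus the number of classes is what makes the within-class covariance matrix invertible in the reduced space in this sketch.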
According to the computation scheme for feature transformation matrices described in patent reference 1: Japanese Patent Laid-Open No. 7-296169, the number of dimensions is reduced by deleting high-order eigenvalues of total covariance matrix ΣT and corresponding eigenvectors, and discriminant analysis is applied to a reduced feature space. Deleting high-order eigenvalues of total covariance matrix and corresponding eigenvectors is equivalent to performing discriminant analysis in a space of only principal components with large eigenvalues by the principal component analysis. In this sense, this technique, like the method by W. Zhao, provides stable parameter estimation by removing high-order features.
The principal component analysis using the total covariance matrix ΣT, however, does no more than sequentially select orthogonal axes of the feature space in the directions in which the variance is large. For this reason, a feature axis effective for pattern identification may be lost.
Assume that the feature vector x is comprised of three elements (x = (x1, x2, x3)^T), where x1 and x2 are features which have large variances but are irrelevant to pattern identification, and x3 is effective for pattern identification but has a small variance (its ratio of between-class variance to within-class variance, i.e., Fisher's ratio, is large, but the variance value itself is sufficiently smaller than those of x1 and x2). In this case, if the principal component analysis is performed and only two dimensions are selected, the feature space associated with x1 and x2 is selected, and the contribution of x3, which is effective for identification, is neglected.
This phenomenon will be described with reference to the accompanying drawings. Assume that FIG. 3A is the distribution of data viewed from a direction almost perpendicular to the plane defined by x1 and x2, with the black circles and white circles representing data points in different classes. When viewed in the space defined by x1 and x2 (the plane in FIG. 3A), the black and white circles cannot be distinguished. When, however, viewed along the feature axis x3 perpendicular to this plane as shown in FIG. 3B, the black and white circles can be separated from each other. If, however, axes with large variances are selected, the plane defined by x1 and x2 is selected as the feature space, which is equivalent to performing discrimination by looking at FIG. 3A. This makes it difficult to perform discrimination.
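The three-element example can be reproduced with synthetic data. In the sketch below, x1 and x2 are class-independent with large variance and x3 separates two classes with small variance; the specific standard deviations, sample counts, and the helper `fisher_ratio` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per = 200
labels = np.repeat([0, 1], n_per)

# x1, x2: large variance, unrelated to the class.
x1 = rng.normal(0.0, 10.0, size=2 * n_per)
x2 = rng.normal(0.0, 10.0, size=2 * n_per)
# x3: small variance, but its mean depends on the class.
x3 = np.where(labels == 0, -1.0, 1.0) + rng.normal(0.0, 0.1, size=2 * n_per)
X = np.column_stack([x1, x2, x3])

def fisher_ratio(v, labels):
    """Between-class variance divided by within-class variance of one feature."""
    m = v.mean()
    between = within = 0.0
    for c in np.unique(labels):
        vc = v[labels == c]
        p = len(vc) / len(v)
        between += p * (vc.mean() - m) ** 2
        within += p * vc.var()
    return between / within

ratios = [fisher_ratio(X[:, j], labels) for j in range(3)]

# Principal component analysis on the total covariance, keeping two axes.
Xc = X - X.mean(axis=0)
lam, U = np.linalg.eigh(Xc.T @ Xc / len(X))
top2 = U[:, np.argsort(lam)[::-1][:2]]
x3_weight = np.abs(top2[2]).max()    # how much of x3 the two kept axes contain

print("Fisher ratios (x1, x2, x3):", ratios)
print("largest |x3 component| in the two leading axes:", x3_weight)
```

In this setup the Fisher ratio of x3 is far larger than those of x1 and x2, yet the two leading principal axes lie almost entirely in the x1-x2 plane, so the projection discards nearly all of the discriminative information.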
In the prior art, this is a phenomenon which cannot be avoided by the principal component analysis and the technique of deleting spaces with small eigenvalues in (total) covariance matrices.