1. Field of the Disclosure
The present disclosure generally relates to image pattern recognition for use in various applications including automatic target recognition, face recognition, fingerprint recognition, and iris recognition, among others and, more particularly, to image pattern recognition using correlation filters.
2. Brief Description of Related Art
Ever since the advent of the optical frequency plane correlator or correlation filter (A. VanderLugt, IEEE Trans. Inf. Th., 10 (1964) 139), there has been considerable interest in using correlators for pattern recognition. Correlators are shift-invariant filters (i.e., no need to center the input face image during testing), which allow one to locate the object of interest in the input scene merely by locating the correlation peak. Thus, one may not need to segment or register an object in the input scene prior to correlation, as is required in many other image pattern recognition methods. Much of the earlier work in correlation-based pattern recognition was devoted to recognizing military vehicles in scenes. Correlators (or more correctly, correlation filters) can be used for recognition of other patterns such as fingerprints, face images, etc. Authenticating the identity of a user based on their biometrics (e.g., face, iris, fingerprint, voice, etc.) is a growing research topic with a wide range of applications in e-commerce, computer security and consumer electronics. In authentication (also termed “verification”), a stored biometric is compared to a live biometric to determine whether the live biometric is that of an authorized user. There is a wide range of computing platforms that can be used to host biometric authentication systems. With current desktop computing power, researchers may not have to worry about the complexity of the algorithms; however, embedding such verification modules in small form factor devices such as cell phones and personal digital assistants (PDAs) is a challenge because these platforms are limited by their memory and computing power. In applications where these filters are stored directly on a chip (such as in system-on-chip implementations), the memory available may be limited. Therefore, it is desirable to devise correlation filters with reduced memory requirements.
The matched spatial filter (MSF) (D. O. North, Proc. IEEE, 51 (1963) 1016) is based on a single view of the target and is optimal (in the sense of yielding maximal signal-to-noise ratio (SNR)) for detecting a completely known pattern in the presence of additive white noise (noise with equal power at all frequencies). Unfortunately, MSFs are not suitable for practical pattern recognition because their correlation peak degrades rapidly when the input patterns deviate (sometimes even slightly) from the reference. These variations in the patterns are often due to common phenomena such as pose, illumination and scale changes. In optical implementations, the MSF is represented by a transparency; thus, the transmittance of the filter is less than or equal to 1 at all spatial frequencies. This attenuates much of the incoming light, leaving low light levels for the detector in the correlation plane. To address this issue, Horner and Gianino (J. L. Horner and P. D. Gianino, Appl. Opt. 23, 812-816, 1984) suggested setting the filter magnitude to 1 at all frequencies. The resulting filter contains only phase information and is known as the Phase-Only Filter (POF). The POF has 100% light throughput efficiency.
In optical correlators, matched filters are represented on spatial light modulators (SLMs), which convert electrical inputs to optical properties such as transmittance and reflectance. Examples of SLMs are the magneto-optic SLM (MOSLM) and the liquid crystal display (LCD). The magneto-optic SLM can be operated at only two levels of effective transmittance. To accommodate this limitation, Psaltis et al. (D. Psaltis, E. G. Paek, and S. S. Venkatesh, Opt. Eng, 23, 698-704, 1984), and Horner and others (J. L. Horner and H. O. Bartlett, Appl. Opt. 24, 2889-2893, 1985; J. L. Horner and J. R. Leger, Appl. Opt. 24, 609-611, 1985), suggested the use of Binary Phase Only Filters (BPOFs), which use only two levels in the filter. Psaltis et al., supra, suggested binarizing the real part of the matched spatial filter, while Horner and others, supra, suggested binarizing the imaginary part of the matched spatial filter. Later, Cottrell et al. (D. M. Cottrell, R. A. Lilly, J. A. Davis, and T. Day, Appl. Opt. 26, 3755-3761, 1987) proposed binarizing the sum of the real and imaginary parts of the matched spatial filter. The main attribute of the BPOFs is that they are well suited for implementation on a binary SLM such as the magneto-optic SLM; they were not designed specifically for digital implementations.
Dickey and Hansche (F. M. Dickey, and B. D. Hansche, Appl. Opt. 28, 1611-1613, 1989) extended the BPOF idea to the Quad-Phase Only Filters (QPOFs), which have 4 possible phase levels (namely ±π/4, ±3π/4) and could be implemented using two MOSLMs (each capable of providing only 2 phase levels) to effectively obtain the 4 phases needed in a QPOF. QPOFs are based on a single image and, being all-pass filters, have no ability to suppress noise. An effort to improve the signal-to-noise ratio (SNR) of the QPOF led to the development of the complex ternary matched filter (CTMF) defined below (F. M. Dickey, B. V. K. Vijaya Kumar, L. A. Romero, and J. M. Connely, Opt. Eng. 29, 994-1001, 1990).
$$H_{CTMF}(u,v) = H_R(u,v) + j\,H_I(u,v) \tag{1}$$
where $H_R(u,v)$, the real part of the filter transfer function $H_{CTMF}$, and $H_I(u,v)$, the imaginary part of $H_{CTMF}$, each take on 3 levels (namely −1, 0 and +1) at each frequency (u,v).
However, all the above filters are made from a single reference image and thus are sensitive to any distortions from this reference image. One approach to overcoming the distortion sensitivity of the MSF is to use one MSF for every view. However, this requires storing and using a large number of filters, which makes the approach impractical. The alternative is to design composite filters that can exhibit better distortion tolerance than the MSFs. Composite filters (also known as Synthetic Discriminant Function or SDF filters) (C. F. Hester and D. Casasent, Appl. Opt., 19 (1980) 1758; B. V. K. Vijaya Kumar, Appl. Opt., 31 (1992) 4773) use a set of training images to synthesize a template that yields pre-specified correlation outputs in response to training images.
If matched filters are used, many filters would be needed, approximately one filter for each view. When one thinks of possible distortions (e.g., illuminations, expressions, pose changes, etc.), this is clearly too many filters to store and use. Therefore, Hester and Casasent, supra, introduced the concept of SDF filters in 1980. The first SDF filter required that the associated composite template be a weighted sum of training images with the weights chosen so that resulting correlation output values at the origin take on pre-specified (non-zero) values. This filter proved to be unattractive as it almost always led to sidelobes that are much larger than the correlation “peak” (the correlation value at the origin is loosely referred to herein as the correlation peak). Kumar (B. V. K. Vijaya Kumar, JOSA-A, 3 (1986) 1579) introduced the minimum variance SDF (MVSDF) formulation that minimized the output noise variance from the SDF filters. The sidelobe problem was addressed by the minimum average correlation energy (MACE) filters introduced by Mahalanobis et al. (A. Mahalanobis et al., Appl. Opt., 26 (1987) 3633). Refregier (Ph. Refregier, Opt. Lett., 16 (1991) 829) showed how to optimally trade off the noise tolerance and peak sharpness attributes of correlation filters. These and many other SDF filter developments were summarized in the tutorial review paper by Kumar (B. V. K. Vijaya Kumar, Appl. Opt., 31 (1992) 4773).
Correlation filters offer several advantages including shift-invariance (i.e., no need to center the input face image during testing), closed-form solutions, graceful degradation (i.e., loss of parts of input image results in slow loss of correlation peak) and ability to design built-in tolerance to normal impairments such as expression, illumination and pose changes. Correlation filters have been used widely in the areas of signal detection and automatic target recognition. As noted before, the matched filter is known to be optimal for detecting a known signal or image in the presence of additive white Gaussian noise (AWGN). When the noisy reference image or signal is input to the matched filter, its output is the cross correlation of the noisy input image with the reference image. If the noisy input contains a replica of the reference, the correlation output will have a large value (the “correlation peak”) at a location corresponding to the location of the reference image in the input scene and small values elsewhere. The value of the correlation peak is a measure of the likelihood that the input scene contains the reference image and the location of the peak provides the location of the reference image in the input scene. Thus, matched filters are well suited for both detecting the presence of a reference image in a noisy input scene as well as locating it. However, as noted before, matched filters suffer from the problem that the correlation peak degrades significantly when the target object exhibits appearance changes due to normal factors such as illumination changes, pose variations, facial expressions, etc. Therefore, it is desirable to devise filters that are tolerant to such variability.
There are two main stages in correlation-based pattern recognition. First is the correlation filter design (also called “the enrollment stage”) and the second is the use of the correlation filters (also called “the verification stage”). This correlation-based pattern recognition process is shown schematically in FIG. 1.
In the first stage (the enrollment or training stage), training images are used to design the correlation filter. The training images reflect the expected variability in the final image to be verified. For example, in designing a correlation filter for verifying the face of a person A, the person A's face images with a few expected variations (e.g., pose, expression, illumination) are acquired during the enrollment stage. These images are used to construct a correlation filter according to a carefully chosen performance metric and rigorously derived closed-form expressions. Most advanced correlation filter designs are in frequency domain. Thus, the training images are used to construct one or a few frequency-domain arrays (loosely called correlation “filters” or “templates”) that are stored in the system. Once the filters are computed, the filter arrays are stored and there is no need to store the training images. The authentication performance of the system depends critically on these stored filters. They must be designed to produce large peaks in response to images (many not seen during training) of the authentic user, small values in response to face images of impostors, and be tolerant to noise in the input images. As some of these goals are conflicting, optimal tradeoffs may be devised.
In the second stage, the input test image (e.g., someone's face image) is presented for verification and/or identification. In verification problems, the user claims his/her identity and the task is to compare the input image to the claimed identity and decide whether they match or not. In the identification problem, the user input is matched against a database of stored images (or equivalently, filters) to see which stored template best matches the input image. In either case, the 2-D (two dimensional) fast Fourier transform (FFT) of the test input is first computed and then multiplied by the stored templates (i.e., filter arrays). Thereafter, an inverse FFT (IFFT) of that product is performed to obtain the correlation output. If the input matches the template/filter, a sharply-peaked correlation output is obtained as in FIG. 2A, and when the two do not match, a correlation output with less sharp peaks as in FIG. 2B is obtained. Thus, one can use the sharpness of the correlation peak for verification or identification: sharp correlation peaks result from images of the authentic user, whereas no large discernible peaks result from face images of impostors. It is noted here that those skilled in the art would recognize that FFT is an efficient algorithm to compute the discrete Fourier transform (DFT) and, hence, the phrases “discrete Fourier transform” or “DFT” and “fast Fourier transform” or “FFT” are used interchangeably herein.
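The verification-stage computation described above (FFT of the test input, multiplication by the stored frequency-domain array, inverse FFT) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `correlate_fft`, the random test data, and the choice of a simple matched-filter template (the conjugate spectrum of a reference image) are assumptions made for the demonstration and are not part of the disclosure.

```python
import numpy as np

def correlate_fft(test_image, filter_freq):
    """Correlation output: IFFT of (2-D FFT of the test image times the stored filter array)."""
    test_freq = np.fft.fft2(test_image)
    return np.real(np.fft.ifft2(test_freq * filter_freq))

# Illustrative data (assumption): a random reference image and a circularly shifted copy.
rng = np.random.default_rng(0)
reference = rng.standard_normal((32, 32))
shifted = np.roll(reference, shift=(5, 9), axis=(0, 1))

# Assumed template for the demo: a plain matched filter, i.e. the conjugate reference spectrum.
template = np.conj(np.fft.fft2(reference))

output = correlate_fft(shifted, template)
peak_location = np.unravel_index(np.argmax(output), output.shape)
print(peak_location)  # location of the correlation peak = shift of the reference
```

Because the correlation is shift-invariant, the peak moves with the object: here it appears at the (row, column) offset by which the reference was shifted, without any prior centering of the input.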
The following figure of merit, known as the peak-to-sidelobe ratio (PSR), is usually employed to measure the peak sharpness. First, the peak (i.e., the largest value) is located in the correlation output, and a small (e.g., of size 5×5) mask is centered at the peak. The sidelobe region is defined as the annular region between this small mask and a larger (e.g., of size 20×20) square also centered at the peak. The annular region may be rectangular or square or in any other suitable polygonal shape. The mean and standard deviation (“σ”) of the sidelobe region are computed and used to estimate the PSR using Eq. (2). PSR estimation is depicted pictorially in FIG. 3.
$$\mathrm{PSR} = \frac{\text{peak} - \text{mean}}{\sigma} \tag{2}$$
The small mask size and the larger square size are somewhat arbitrary and are usually decided through numerical experiments. The basic goal is to be able to estimate the mean and the standard deviation of the correlation output near the correlation peak, but excluding the area close to the peak. The PSR is unaffected by any uniform illumination change in the input image. Thus, for example, if the input image is multiplied by a constant “k” (e.g., uniform illumination), the resulting correlation output will also be multiplied by the same factor. Thus, the peak, mean and standard deviation are all multiplied by “k”, making the PSR invariant to “k.” This can be useful in image problems where brightness variations are present. The PSR also takes into account multiple correlation points in the output plane (not just the peak), and thus it can be considered to lead to a more reliable decision. In order for a test image to be declared to belong to the trained class, the correlation peak should not only be large, but the neighboring correlation values should be small. Thus, the final verification decision is based on examining the outputs of many inner products (the correlation region around the peak), rather than just one inner product (the correlation peak value).
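The PSR estimate of Eq. (2) can be illustrated with a short NumPy sketch. The function name `peak_to_sidelobe_ratio`, the synthetic correlation plane, and the handling of the window sizes (a Chebyshev-distance window, so the nominal 20×20 square is approximated by the nearest window centered on a pixel, and the region is simply clipped at the image borders) are assumptions made for this example.

```python
import numpy as np

def peak_to_sidelobe_ratio(corr, inner=5, outer=20):
    """PSR per Eq. (2): (peak - sidelobe mean) / sidelobe standard deviation."""
    peak_row, peak_col = np.unravel_index(np.argmax(corr), corr.shape)
    peak = corr[peak_row, peak_col]
    rows = np.arange(corr.shape[0])[:, None]
    cols = np.arange(corr.shape[1])[None, :]
    # Chebyshev distance from the peak: square windows are level sets of this distance.
    dist = np.maximum(np.abs(rows - peak_row), np.abs(cols - peak_col))
    # Sidelobe region: outside the small mask, inside the larger square (clipped at borders).
    sidelobe = corr[(dist > inner // 2) & (dist <= outer // 2)]
    return (peak - sidelobe.mean()) / sidelobe.std()

# Synthetic correlation plane (assumption): unit-variance noise with one strong peak.
rng = np.random.default_rng(0)
corr = rng.standard_normal((64, 64))
corr[40, 25] = 50.0

psr = peak_to_sidelobe_ratio(corr)
psr_scaled = peak_to_sidelobe_ratio(3.0 * corr)  # uniform scaling by k = 3
print(psr, psr_scaled)
```

The two printed values agree, which reflects the invariance of the PSR to multiplication of the correlation output by a constant.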
In the discussion below, 1-D (one dimensional) notation is used for convenience, but all equations can easily be generalized to higher dimensions. Let f1(n), f2(n), . . . , fN(n) denote the training images (each with L pixels) from the authentic class, and let F1(k), F2(k), . . . , FN(k) denote their Fourier transforms. Let H(k) denote the filter. Then the correlation output $c_i(n)$ when the input image is $f_i(n)$ is given by Eq. (3), where $j = \sqrt{-1}$.
$$c_i(n) = \frac{1}{L}\sum_{k=1}^{L} F_i^{*}(k)\,H(k)\,\exp\!\left[+j\,\frac{2\pi (k-1)\,n}{L}\right] \tag{3}$$
In 2-D, let g(m,n) denote the correlation surface produced by the template h(m,n) in response to the input image f(m,n). Strictly speaking, the entire correlation surface g(m,n) is the output of the filter. However, the point g(0,0) is often referred to as “the correlation output or the correlation peak at the origin”. By maximizing the correlation output at the origin, the real peak may be forced to be even larger. With this interpretation, the correlation peak is given by
$$g(0,0) = \sum_{m}\sum_{n} f(m,n)\,h(m,n) = \mathbf{f}^{T}\mathbf{h} \tag{4}$$
where superscript T denotes the vector transpose and where f and h are the column vector versions of f(m,n) and h(m,n), respectively. In the discussion hereinbelow, matrices are represented by upper case bold letters and vectors by lower case bold letters.
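As a quick numerical check of Eq. (4), the following sketch compares the inner product of two vectorized real arrays with the value at the origin of their FFT-based (circular) cross-correlation. The random arrays and variable names are assumptions for the illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8))  # input image (illustrative)
h = rng.standard_normal((8, 8))  # real-valued template (illustrative)

# Right-hand side of Eq. (4): inner product of the row-by-row vectorized arrays.
inner_product = f.ravel() @ h.ravel()

# Correlation surface via the frequency domain; its origin is g(0,0).
surface = np.real(np.fft.ifft2(np.fft.fft2(f) * np.conj(np.fft.fft2(h))))
origin_value = surface[0, 0]

print(inner_product, origin_value)
```

Both numbers coincide up to floating-point rounding, which is why the value at the origin can be treated as a single inner product while the full surface collects one inner product per shift.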
Composite filters are derived from several training images that are representative views of the object or pattern to be recognized. In principle, such filters can be trained to recognize any object or type of distortion as long as the distortion can be adequately represented by the training images. The objective of a composite filter is to be able to recognize the objects from one class (even non-training images), while being able to reject objects from other classes. The optimization of carefully designed performance criteria offers a methodical approach for achieving this objective.
In the early SDF filter designs, the filter was designed to yield a specific value at the origin of the correlation plane in response to each training image. The hope was that such a controlled value would also be the peak in the correlation plane. It was further theorized that the resulting filter would be able to interpolate between the training images to yield comparable output values in response to other (non-training) images from the same class. A set of linear equations describing the constraints on the correlation peaks can be written as
$$\mathbf{X}^{+}\mathbf{h} = \mathbf{u} \tag{5}$$
where h is the filter vector, superscript “+” (in $\mathbf{X}^{+}$) denotes the conjugate transpose, $\mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2\ \ldots\ \mathbf{x}_N]$ is an L×N matrix with the N training image Fourier transform vectors (each with L elements, where L is the number of pixels in the image) as its columns, and $\mathbf{u} = [u_1\ u_2\ \ldots\ u_N]^{T}$ is an N×1 column vector containing the desired peak values for the N training images. For training images from the desired class (also known as the true class), the constraint values are usually set to 1 and for images from the reject class (also known as the false class), they are usually set to 0.
However, because the number of training images N is generally much fewer than the dimension L (i.e., the number of frequencies) of the filters, the system of linear equations in Eq. (5) is under-determined. By requiring that h is a linear combination of the training images, one can obtain a unique solution known as the equal correlation peak SDF (ECP-SDF).
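A minimal sketch of how the under-determined system in Eq. (5) becomes uniquely solvable: writing h as a linear combination of the training vectors, h = Xa, turns Eq. (5) into the small N×N system (X⁺X)a = u. The random complex data and the variable names below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
L, N = 64, 3  # L frequencies, N training images (illustrative sizes)

# Columns of X stand in for training-image Fourier transforms (random, for the demo).
X = rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))
u = np.ones(N, dtype=complex)  # desired peak values for true-class images

# Restrict h to the span of the training vectors: h = X a, so (X^+ X) a = u.
a = np.linalg.solve(X.conj().T @ X, u)
h = X @ a

peaks = X.conj().T @ h  # left-hand side of Eq. (5)
print(np.round(peaks, 6))
```

The printed peak values reproduce u, confirming that the filter meets every constraint even though L is much larger than N.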
The ECP-SDF suffers from the problem of large sidelobes. In practice, it is important to ensure that the correlation peak is sharp and that sidelobes are suppressed. One way to achieve this is to minimize the energy in the correlation plane. The minimum average correlation energy (MACE) filter minimizes the average correlation energy (ACE) defined below in Eq. (6) while satisfying the correlation peak constraints in Eq. (5).
$$E_{ave} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k}\sum_{l} \left|H(k,l)\right|^{2}\left|X_i(k,l)\right|^{2} = \mathbf{h}^{+}\mathbf{D}\mathbf{h} \tag{6}$$
where D is a diagonal matrix containing the average training image power spectrum along its diagonal. This leads to the closed form solution of the MACE filter shown in Eq. (7).
$$\mathbf{h} = \mathbf{D}^{-1}\mathbf{X}\left(\mathbf{X}^{+}\mathbf{D}^{-1}\mathbf{X}\right)^{-1}\mathbf{u} \tag{7}$$
In the above equations, input images, frequency domain arrays and correlation outputs are assumed to be of size d×d and “N” is the number of training images. Further, h is a $d^2 \times 1$ column vector containing the 2-D correlation filter H(k,l) lexicographically reordered to 1-D, u is the N×1 column vector of desired peak values, and X is a $d^2 \times N$ complex matrix whose ith column contains the 2-D Fourier transform of the ith training image lexicographically reordered into a column vector. As is known in the art, in lexicographical reordering, an image is reordered by scanning it row-by-row and placing all the scanned elements in a vector (e.g., a column vector).
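The closed form in Eq. (7) can be sketched in NumPy as follows. Because D is diagonal, only its diagonal is stored and its inverse is applied by element-wise division. The function name `mace_filter`, the random training images, and the comparison against a filter restricted to the span of the training vectors are assumptions made for this illustration; the sketch also assumes that no entry of the average power spectrum is zero.

```python
import numpy as np

def mace_filter(images, u):
    """Sketch of Eq. (7): h = D^-1 X (X^+ D^-1 X)^-1 u, with D stored as a vector."""
    # Columns of X: 2-D FFT of each training image, scanned row by row.
    X = np.stack([np.fft.fft2(img).ravel() for img in images], axis=1)
    d = np.mean(np.abs(X) ** 2, axis=1)  # diagonal of D (assumed nonzero everywhere)
    Dinv_X = X / d[:, None]
    h = Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, u)
    return h, X, d

rng = np.random.default_rng(0)
images = rng.standard_normal((3, 16, 16))  # illustrative training images
u = np.ones(3, dtype=complex)

h_mace, X, d = mace_filter(images, u)
peaks = X.conj().T @ h_mace                   # should reproduce u, per Eq. (5)
ace_mace = np.vdot(h_mace, d * h_mace).real   # h^+ D h, per Eq. (6)

# Reference: another filter meeting the same constraints (h in the span of X).
h_ref = X @ np.linalg.solve(X.conj().T @ X, u)
ace_ref = np.vdot(h_ref, d * h_ref).real
print(ace_mace, ace_ref)
```

Both filters satisfy the peak constraints, but the MACE filter does so with the smaller average correlation energy, which is what suppresses the sidelobes.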
MACE filters have been shown to generally produce sharp correlation peaks. They are the first set of filters that attempted to control the entire correlation plane. However, MACE filters suffer from two main drawbacks. First, there is no built-in immunity to noise. Second, the MACE filters are often excessively sensitive to intra-class variations.
The minimum variance synthetic discriminant function (MVSDF) was developed to address the noise tolerance issue. Here, the filter h was designed to minimize the effect of additive noise on the correlation output. Let the noise be of zero mean and let C be the diagonal noise power spectral density (PSD) matrix, i.e., a $d^2 \times d^2$ diagonal matrix whose diagonal elements C(k,k) represent the noise power spectral density at frequency k. Then the output noise variance (ONV) can be shown to be $\sigma^2 = \mathbf{h}^{+}\mathbf{C}\mathbf{h}$. Minimizing the ONV ($\sigma^2$) subject to the usual linear constraints of Eq. (5) leads to the following closed form solution:
$$\mathbf{h} = \mathbf{C}^{-1}\mathbf{X}\left(\mathbf{X}^{+}\mathbf{C}^{-1}\mathbf{X}\right)^{-1}\mathbf{u} \tag{8}$$
The ECP-SDF is a special case of the MVSDF: if the noise is white, i.e., if C is equal to I, the identity matrix, then the MVSDF is the same as the ECP-SDF.
The MACE filter yields sharp peaks that are easy to detect while the MVSDF is designed to be more robust to noise. Since both attributes (namely, sharp peaks and noise tolerance) are desirable in practice, it is desirable to formulate a filter that possesses the ability to produce sharp peaks and behaves robustly in the presence of noise. Refregier, supra, showed that one can optimally trade off between these two metrics (i.e., ONV and ACE). The resulting filter, named the Optimal Trade-off SDF (OTSDF), is given as
$$\mathbf{h} = \mathbf{T}^{-1}\mathbf{X}\left(\mathbf{X}^{+}\mathbf{T}^{-1}\mathbf{X}\right)^{-1}\mathbf{u} \tag{9}$$
where $\mathbf{T} = \alpha\mathbf{D} + \sqrt{1-\alpha^{2}}\,\mathbf{C}$ and $0 \le \alpha \le 1$. It is noted here that when α=1, the optimal tradeoff filter reduces to the MACE filter given in equation (7), and when α=0, it simplifies to the noise-tolerant filter in equation (8).
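The trade-off in Eq. (9) can be explored numerically with the sketch below, which builds T from the diagonals of D and C and evaluates the ACE and ONV at the two end points and at an intermediate α. The function name `otsdf_filter`, the helper names, the random training images, and the white-noise choice C = I are assumptions for illustration only.

```python
import numpy as np

def otsdf_filter(X, u, d, c, alpha):
    """Sketch of Eq. (9) with T = alpha*D + sqrt(1 - alpha^2)*C, diagonals stored as vectors."""
    t = alpha * d + np.sqrt(1.0 - alpha ** 2) * c
    Tinv_X = X / t[:, None]
    return Tinv_X @ np.linalg.solve(X.conj().T @ Tinv_X, u)

rng = np.random.default_rng(2)
images = rng.standard_normal((3, 16, 16))  # illustrative training images
X = np.stack([np.fft.fft2(img).ravel() for img in images], axis=1)
u = np.ones(3, dtype=complex)
d = np.mean(np.abs(X) ** 2, axis=1)  # diagonal of D (average power spectrum)
c = np.ones_like(d)                  # diagonal of C (white noise, an assumption)

def ace(h):
    return np.vdot(h, d * h).real    # h^+ D h

def onv(h):
    return np.vdot(h, c * h).real    # h^+ C h

h_mace = otsdf_filter(X, u, d, c, alpha=1.0)   # reduces to Eq. (7)
h_mvsdf = otsdf_filter(X, u, d, c, alpha=0.0)  # reduces to Eq. (8)
h_mid = otsdf_filter(X, u, d, c, alpha=0.99)

for name, h in [("alpha=1", h_mace), ("alpha=0.99", h_mid), ("alpha=0", h_mvsdf)]:
    print(name, "ACE:", ace(h), "ONV:", onv(h))
```

All three filters satisfy the same peak constraints; the α=1 filter attains the lowest ACE and the α=0 filter the lowest ONV, with intermediate α values lying between these extremes on at least one of the two metrics.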
The ECP-SDF filter and its variants such as MVSDF filter and MACE filter assume that the distortion tolerance of a filter could be controlled by explicitly specifying desired correlation peak values for training images. The hard constraints in Eq. (5) may be removed because non-training images always yield different values than those specified for the training images and no formal relation appears to exist between the constraints imposed on the filter output and its ability to tolerate distortions. In fact, it is unclear that even intuitively satisfying choices of constraints (such as the Equal Correlation Peak (ECP) condition) have any significant positive impact on a filter's performance. Finally, relaxing or removing the hard constraints should increase the domain of solutions.
Removing the hard constraints in Eq. (5) led to the introduction of the unconstrained MACE (UMACE) filter (A. Mahalanobis, B. V. K. Vijaya Kumar, S. R. F. Sims and J. F. Epperson, Appl. Opt., 33, 3751-3759, 1994). Instead of constraining the peak value at the origin of the correlation output to take on a specific value, UMACE tries to maximize the peak at the origin while minimizing the average correlation energy resulting from the cross-correlation of the training images. This is done by optimizing the metric J(h) in Eq. (10).
$$J(\mathbf{h}) = \frac{\mathbf{h}^{+}\mathbf{m}\mathbf{m}^{+}\mathbf{h}}{\mathbf{h}^{+}\mathbf{D}\mathbf{h}} \tag{10}$$
which leads to the closed form solution in Eq. (11) for the UMACE filter.
$$\mathbf{h} = \mathbf{D}^{-1}\mathbf{m} \tag{11}$$
In equations (10) and (11), D is a diagonal matrix as defined earlier, and m denotes the Fourier transform of the mean training image. It is noted that both MACE and UMACE filters yield sharp correlation peaks because they are designed to minimize the average correlation energy.
Adding noise tolerance to the UMACE filter, as was done to MACE filters, yields the unconstrained optimal trade-off SDF (UOTSDF) given in Eq. (12).
$$\mathbf{h} = \left(\alpha\mathbf{D} + \sqrt{1-\alpha^{2}}\,\mathbf{C}\right)^{-1}\mathbf{m} \tag{12}$$
Varying α produces filters with optimal tradeoff between noise tolerance and discrimination. Typically, using α values close to, but not equal to, 1 (e.g., 0.99) improves the robustness of MACE filters.
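Since D and C are diagonal, the unconstrained filters of Eqs. (11) and (12) reduce to an element-wise division, as the following sketch shows. The function names `uotsdf_filter` and `peak_to_energy`, the random training images, and the white-noise choice for C are assumptions made for this illustration.

```python
import numpy as np

def uotsdf_filter(m, d, c, alpha):
    """Sketch of Eq. (12); with alpha = 1 this is the UMACE filter of Eq. (11)."""
    return m / (alpha * d + np.sqrt(1.0 - alpha ** 2) * c)

def peak_to_energy(h, m, d):
    """Metric J(h) of Eq. (10): |m^+ h|^2 / (h^+ D h)."""
    return np.abs(np.vdot(m, h)) ** 2 / np.vdot(h, d * h).real

rng = np.random.default_rng(3)
images = rng.standard_normal((4, 16, 16))  # illustrative training images
X = np.stack([np.fft.fft2(img).ravel() for img in images], axis=1)
m = X.mean(axis=1)                         # Fourier transform of the mean training image
d = np.mean(np.abs(X) ** 2, axis=1)        # diagonal of D
c = np.ones_like(d)                        # diagonal of C (white noise, an assumption)

h_umace = uotsdf_filter(m, d, c, alpha=1.0)
h_uotsdf = uotsdf_filter(m, d, c, alpha=0.99)

j_umace = peak_to_energy(h_umace, m, d)
# Compare against arbitrary filters: none should exceed the UMACE value of J.
j_random = [
    peak_to_energy(rng.standard_normal(m.size) + 1j * rng.standard_normal(m.size), m, d)
    for _ in range(20)
]
print(j_umace, max(j_random))
```

No matrix inversion or linear solve is needed here, which is one practical attraction of the unconstrained designs: the filter costs one division per frequency.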
Advances in correlation filters include considering the correlation plane as a new pattern generated by the correlation filter in response to an input image. The correlation planes may be considered as linearly transformed versions of the input image, obtained by applying the correlation filter. It can then be argued that if the filter is distortion tolerant, its output will not change much even if the input pattern exhibits some variations. Thus, the emphasis is not only on the correlation peak, but on the entire shape of the correlation surface. Based on the above, a metric of interest is the average variation in images after filtering. If gi (m,n) is the correlation surface produced in response to the ith training image, we can quantify the variation in these correlation outputs by the average similarity measure (ASM) defined in Eq. (13).
$$\mathrm{ASM} = \frac{1}{N}\sum_{i=1}^{N}\sum_{m}\sum_{n}\left[g_i(m,n) - \bar{g}(m,n)\right]^{2} \tag{13}$$
where
$$\bar{g}(m,n) = \frac{1}{N}\sum_{j=1}^{N} g_j(m,n)$$
is the average of the N training image correlation surfaces. ASM is a measure of distortions or dissimilarity (variations) in the correlation surfaces relative to an average shape. In an ideal situation, all correlation surfaces produced by a distortion invariant filter (in response to a valid input pattern) would be the same, and ASM would be zero. In practice, reducing ASM improves the filter stability.
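The ASM of Eq. (13) can be evaluated directly from the correlation surfaces, as in the sketch below. The function name `average_similarity_measure`, the synthetic images, and the use of a simple template built from a base image are assumptions for the illustration; the squared magnitude is used so that the sketch also tolerates complex-valued surfaces.

```python
import numpy as np

def average_similarity_measure(images, filter_freq):
    """ASM per Eq. (13): mean squared deviation of each correlation surface from the average."""
    surfaces = np.stack(
        [np.fft.ifft2(np.fft.fft2(img) * filter_freq) for img in images]
    )
    mean_surface = surfaces.mean(axis=0)
    return np.mean(np.sum(np.abs(surfaces - mean_surface) ** 2, axis=(1, 2)))

rng = np.random.default_rng(4)
base = rng.standard_normal((16, 16))

# Assumed filter for the demo: conjugate spectrum of the base image.
filter_freq = np.conj(np.fft.fft2(base))

identical = [base, base, base]  # no variation across the set
varied = [base + 0.5 * rng.standard_normal(base.shape) for _ in range(3)]

asm_identical = average_similarity_measure(identical, filter_freq)
asm_varied = average_similarity_measure(varied, filter_freq)
print(asm_identical, asm_varied)
```

The set of identical images gives an ASM of essentially zero, while the perturbed set gives a clearly positive value, matching the interpretation of ASM as a measure of variation among correlation surfaces.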
In addition to being distortion-tolerant, a correlation filter must yield large peak values to facilitate detection. To this end, the filter's average response to the training images is maximized. No hard constraints are imposed on the filter's response to individual training images at the origin; rather, the filter should yield a large peak on average over the entire training set. This condition is met by maximizing the average correlation height (ACH) metric defined in Eq. (14).
$$\mathrm{ACH} = \frac{1}{N}\sum_{i=1}^{N} x_i^{+} h = m^{+} h \tag{14}$$
where $m = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the mean of the training image vectors. It is desirable to reduce the effect of noise by reducing ONV. To make ACH large while reducing ASM and ONV, the filter is designed to maximize the metric in Eq. (15).
$$J(h) = \frac{\left|\mathrm{ACH}\right|^{2}}{\mathrm{ASM} + \mathrm{ONV}} \tag{15}$$
The filter that maximizes this metric is referred to as the maximum average correlation height (MACH) filter (A. Mahalanobis et al., Appl. Opt., 33 (1994) 3751).
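The metric of Eq. (15) can be evaluated numerically as in the sketch below. The sketch assumes that ASM and ONV can be written as the frequency-domain quadratic forms $h^{+}Sh$ and $h^{+}Ch$ with diagonal matrices S and C; these forms, the names `mach_criterion`, `s_diag` and `c_diag`, and the synthetic data are assumptions for illustration and are not stated in this passage. Under that assumption, the Cauchy-Schwarz inequality implies that a filter proportional to $(S + C)^{-1}m$ maximizes J(h), which the code checks against random filters.

```python
import numpy as np

def mach_criterion(h, m, s_diag, c_diag):
    """J(h) of Eq. (15), with ASM and ONV as assumed diagonal quadratic forms."""
    ach = np.vdot(m, h)                        # m^+ h, Eq. (14); vdot conjugates m
    asm = np.real(np.vdot(h, s_diag * h))      # assumed form: h^+ S h
    onv = np.real(np.vdot(h, c_diag * h))      # assumed form: h^+ C h
    return abs(ach) ** 2 / (asm + onv)

rng = np.random.default_rng(0)
N, d = 6, 32
# Synthetic stand-ins for the frequency-domain training vectors x_i
X = rng.standard_normal((N, d)) + 1j * rng.standard_normal((N, d))
m = X.mean(axis=0)                                    # mean training vector
s_diag = np.mean(np.abs(X - m) ** 2, axis=0) + 1e-3   # illustrative diagonal of S
c_diag = np.ones(d)                                   # white-noise diagonal of C

# Candidate maximizer under the assumed quadratic forms
h_candidate = m / (s_diag + c_diag)
j_candidate = mach_criterion(h_candidate, m, s_diag, c_diag)

# Eq. (14): averaging x_i^+ h over the set equals m^+ h
ach_avg = np.mean([np.vdot(x, h_candidate) for x in X])
ach_mean = np.vdot(m, h_candidate)

# No random filter should score higher than the candidate
j_random = max(
    mach_criterion(rng.standard_normal(d) + 1j * rng.standard_normal(d),
                   m, s_diag, c_diag)
    for _ in range(200)
)
```

Because S and C are taken to be diagonal, the candidate filter needs only an element-wise division, with no matrix inversion.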
The correlation filters previously described are presented as linear systems whose response to patterns of interest is carefully controlled by various optimization techniques. The correlation filters may also be interpreted as methods of applying transformations to the input data. Thus the correlation can be viewed as a linear transformation. Specifically, the filtering process can be mathematically expressed as multiplication by a diagonal matrix in the frequency domain.
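The diagonal-matrix view of filtering can be illustrated with a short numerical check. The sketch assumes one-dimensional real signals and circular (periodic) correlation; the variable names are hypothetical. It compares a directly computed circular cross-correlation with the result of multiplying the input spectrum by a diagonal matrix built from the conjugated filter spectrum.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 8
x = rng.standard_normal(L)      # input signal
h = rng.standard_normal(L)      # filter (template) in the spatial domain

# Direct circular cross-correlation: c[k] = sum_n x[(n + k) mod L] * h[n]
direct = np.array([sum(x[(n + k) % L] * h[n] for n in range(L))
                   for k in range(L)])

# Frequency domain: multiply the spectrum of x by a diagonal matrix
X = np.fft.fft(x)
D = np.diag(np.conj(np.fft.fft(h)))      # diagonal filter matrix
via_diag = np.real(np.fft.ifft(D @ X))   # inverse transform back to the plane

same = np.allclose(direct, via_diag)
```

In practice the diagonal matrix is not formed explicitly; an element-wise product of the two spectra gives the same result.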
The distance of a vector x to a reference $m_k$ under a linear transform H is given by
$$d_k = \left|Hx - Hm_k\right|^{2} = (x - m_k)^{+} H^{+} H (x - m_k) \tag{16}$$
where superscript + denotes a conjugate transpose operation.
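The two forms of Eq. (16) can be checked against each other numerically, as in the following sketch. The function names and the random complex test data are assumptions for illustration only.

```python
import numpy as np

def distance_direct(H, x, m_k):
    """Left form of Eq. (16): squared norm of Hx - Hm_k."""
    return float(np.sum(np.abs(H @ x - H @ m_k) ** 2))

def distance_quadratic(H, x, m_k):
    """Right form of Eq. (16): (x - m_k)^+ H^+ H (x - m_k)."""
    diff = x - m_k
    # np.vdot conjugates its first argument, which supplies (x - m_k)^+
    return float(np.real(np.vdot(diff, H.conj().T @ (H @ diff))))

rng = np.random.default_rng(3)
d = 4
H = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
m_k = rng.standard_normal(d) + 1j * rng.standard_normal(d)

d_direct = distance_direct(H, x, m_k)
d_quadratic = distance_quadratic(H, x, m_k)
```

Both functions return the same value up to floating-point rounding, since the quadratic form is simply the expanded squared norm.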
The filtering process transforms the input images to new images. For the correlation filter to be useful as a transform, the images of the different classes must become as different as possible after filtering. Distances can then be computed between the transformed input image and the references of the different classes, which have also been transformed in the same manner. The input is assigned to the class to which the distance is the smallest. The emphasis is shifted from using just one point (i.e., the correlation peak) to comparing the entire shape of the correlation plane. These facts, along with the simplifying properties of linear systems, lead to a realization of a distance classifier in the form of a correlation filter. In the distance classifier correlation filter (DCCF) approach (A. Mahalanobis et al., Appl. Opt., 35 (1996) 3127), a transform matrix H is found that maximally separates the classes while making all the classes as compact as possible.
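The class-assignment rule described above can be sketched as follows. The function `classify_by_distance`, the identity transform, and the two-dimensional toy references are hypothetical choices for illustration; in particular, the transform used here is not a DCCF-optimized H.

```python
import numpy as np

def classify_by_distance(H, x, references):
    """Assign x to the class whose transformed reference is nearest."""
    Hx = H @ x                                   # transformed input
    distances = [float(np.sum(np.abs(Hx - H @ m_k) ** 2))
                 for m_k in references]          # one distance per class
    return int(np.argmin(distances)), distances

H = np.eye(2)                                    # placeholder transform (assumption)
references = [np.array([0.0, 0.0]),              # reference of class 0
              np.array([10.0, 10.0])]            # reference of class 1
label, distances = classify_by_distance(H, np.array([9.0, 9.5]), references)
```

With these toy values the input lies much closer to the second reference, so `label` is 1.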
While the various filters discussed can provide good results, the physical implementation often requires complex computations and large memories. Thus, the need exists for a correlation filter that is computationally simple and that requires less memory than the existing art without compromising the results produced by the filter.