The present invention relates to pattern recognition and, more particularly, to a method of learning templates and deformation models to facilitate pattern matching.
Pattern recognition systems are employed in various areas of technology to help classify and/or match a test pattern relative to one or more known prototype patterns. Examples of pattern recognition applications include image analysis and classification, handwriting recognition, speech recognition, man and machine diagnostics, industrial inspection, medical imaging, etc.
In a pattern recognition system, it is common to store large amounts of data indicative of prototype patterns and compare them to a given example or unknown input symbol for identification. Several common algorithms may be utilized to compare or measure similarities between patterns, such as K-nearest neighbor (KNN), Parzen windows, and radial basis functions (RBF). A level of similarity may be determined by generating a distance measure. By way of example, a simple algorithm for comparing two patterns f and g is to compute the Euclidean distance between them, such as may be expressed as:
$$d_E(f, g) = \sum_{x, y} \left( f(x, y) - g(x, y) \right)^2 = (f - g)^2$$
where d_E denotes the Euclidean distance between the two patterns, and f and g are assumed to be two-dimensional patterns indexed by x and y. An extension of the Euclidean distance methodology to other dimensions is straightforward.
The usefulness of the Euclidean distance algorithm is limited, however, because if f and g are not perfectly aligned, the Euclidean distance can yield arbitrarily large values. Consider, for instance, a case where g is a translated version of f, that is, g(x, y)=f(x+1, y). In this case, the Euclidean distance could yield a very large value, even though f and g may be virtually identical except for a small amount of translation in the x-direction.
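By way of illustration only, the following Python sketch evaluates the sum-of-squared-differences expression above for a thin vertical stroke and its one-pixel translation g(x, y)=f(x+1, y). The 5x5 pattern, the zero-filled border, and the helper names are assumptions chosen for the example and are not taken from the original disclosure.

```python
def euclidean_distance(f, g):
    """Sum of squared pixel differences between two equally sized 2-D patterns."""
    return sum((fv - gv) ** 2
               for f_row, g_row in zip(f, g)
               for fv, gv in zip(f_row, g_row))

def shift_left(f):
    """Return g with g(x, y) = f(x + 1, y); the vacated last column is zero-filled."""
    return [row[1:] + [0.0] for row in f]

# Hypothetical 5x5 pattern: a one-pixel-wide vertical stroke in column 2.
f = [[1.0 if x == 2 else 0.0 for x in range(5)] for y in range(5)]
g = shift_left(f)                       # same stroke, moved to column 1
blank = [[0.0] * 5 for _ in range(5)]   # pattern with no stroke at all

d_shifted = euclidean_distance(f, g)
d_blank = euclidean_distance(f, blank)
print(d_shifted, d_blank)
```

In this toy case the translated copy lies farther from f than an entirely blank pattern does, which is the sensitivity to misalignment described above.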
One proposed comparison scheme to remedy the aforementioned shortcoming associated with the traditional Euclidean distance approach is to employ a tangent distance, such as is disclosed in U.S. Pat. No. 5,422,961. This comparison scheme is invariant with respect to a selected set of small transformations of the prototypes. The small transformations of interest are expressed by calculating a derivative of the transformed image with respect to the parameter that controls the transformation. The directional derivative is used to generate a computational representation of the transformation of interest. The transformation of interest, which corresponds to a desired invariance, can be efficiently expressed by using tangent vectors constructed relative to the surface of transformation. The tangent distance d_T may be expressed as:
$$d_T(f, g)^2 = \min_{\alpha_f, \alpha_g} \left( f + L_f \alpha_f - g - L_g \alpha_g \right)^2$$
where L_f and L_g are matrices of tangent vectors for f and g, respectively, and α_f and α_g are vectors representing the amount of deformation along the tangent plane. An advantage of tangent distance compared to the traditional Euclidean distance approach is that the tangent distance is less affected by translation than the Euclidean distance, because if L_f and L_g contain a linear approximation of the translation transformation, the tangent distance compares the translated versions of f and g. The tangent distance concept is explored in greater detail in a paper entitled "Efficient Pattern Recognition Using a New Transformation Distance," presented by Patrice Y. Simard, Yann LeCun and John Denker, Advances in Neural Information Processing Systems, Eds. Morgan Kaufmann, pp. 50-58, 1993.
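The minimization in the tangent distance expression is a linear least-squares problem in the stacked deformation coefficients, which the following Python sketch makes explicit. It is a minimal illustration under stated assumptions: NumPy is used, each tangent matrix holds a single x-translation tangent vector approximated by finite differences, and the Gaussian test patterns and all function names are hypothetical.

```python
import numpy as np

def gaussian_blob(size, cx, cy, sigma=2.0):
    """Smooth test pattern indexed [y, x] (hypothetical example data)."""
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))

def x_translation_tangent(p):
    """One-column tangent matrix: finite-difference derivative of p along x."""
    return np.gradient(p, axis=1).reshape(-1, 1)

def tangent_distance_sq(f, g, L_f, L_g):
    """Minimum over (a_f, a_g) of ||f + L_f a_f - g - L_g a_g||^2."""
    A = np.hstack([L_f, -L_g])          # stacked tangent vectors
    b = (g - f).ravel()
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = A @ alpha - b                   # equals f + L_f a_f - g - L_g a_g
    return float(r @ r)

f = gaussian_blob(16, cx=8.0, cy=8.0)
g = gaussian_blob(16, cx=8.5, cy=8.0)   # f translated by half a pixel in x

d_e = float(((f - g) ** 2).sum())       # squared-difference (Euclidean) measure
d_t = tangent_distance_sq(f, g, x_translation_tangent(f), x_translation_tangent(g))
print(d_e, d_t)
```

Because setting both coefficient vectors to zero recovers the Euclidean expression, the tangent distance computed this way can never exceed it; for a small translation of a smooth pattern it is much smaller.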
A limitation of the tangent distance approach, however, is that the transformations to which it is invariant generally must be known a priori and precisely (e.g., translation, rotation, scaling, etc.). Moreover, tangent distance has no mechanism to specify loose constraints such as small elastic displacements. Such a mechanism would be useful because in many cases, such as with speech or image patterns, it is not known which transformations should be used, but it is assumed that the problem exhibits some invariance with respect to small elastic displacements.
A desirable property of a pattern recognition machine is that its output be invariant with respect to certain small transformations of its input. That is, some transformations of a meaningful pattern, such as an alphanumeric symbol, will not affect the interpretation of the pattern by a human observer. A comparison scheme that is invariant to such transformations can operate with greater economy and speed than comparison schemes that require exhaustive sets of prototypes. By way of example, transformations of alphanumeric patterns that are of interest in this regard may include translation, rotation, scaling, hyperbolic deformations, line thickness changes, and gray-level changes. Any desired number of possible invariances can be included in any particular recognition process, provided that such invariances are known a priori, which is not always possible.
Many computer vision and image processing tasks benefit from invariances to spatial deformations in the image. Examples include handwritten character recognition, face recognition and motion estimation in video sequences. When the input images are subjected to possibly large transformations from a known finite set of transformations (e.g., translations in images), it is possible to model the transformations using a discrete latent variable and perform transformation-invariant clustering and dimensionality reduction using Expectation Maximization as in "Topographic transformation as a discrete latent variable" by Jojic and Frey, presented at Neural Information Processing Systems (NIPS) 1999. Although this method produces excellent results on practical problems, the amount of computation grows linearly with the total number of possible transformations in the input.
A tangent-based construction of a deformation field may be used to model large deformations in an approximate manner. The tangent approximation can also be included in generative models, including linear factor analyzer models and nonlinear generative models. Another approach to modeling small deformations is to jointly cluster the data and learn a locally linear deformation model for each cluster, e.g., using expectation maximization in a factor analyzer as in "Modeling the manifolds of images of handwritten digits," by Hinton et al., published in IEEE Trans. on Neural Networks, 8, 65-74. With the factor analysis approach, however, a large amount of data is needed to accurately model the deformations. Learning is also susceptible to local optima that might confuse deformed data from one cluster with data from another cluster. That is, some factors tend to "erase" parts of the image and "draw" new parts, instead of just perturbing the image.
The present invention relates to a method for learning mixtures of smooth, non-uniform deformation models to facilitate pattern recognition or matching. A generative network is created to model one or more classes of patterns for use in determining a likelihood that a pattern matches patterns modeled by the network. The model is created to be invariant to non-uniform pattern deformation.
The model is developed to describe an error pattern as a difference between first and second patterns. In accordance with an aspect of the present invention, at least the first pattern is deformed by application of a deformation field. The deformation field may be a smooth, non-uniform field, such as may be constructed from low frequency wavelet basis vectors and associated deformation coefficients. Various parameters in the model describe a set of pattern prototypes and associated levels of noise. The parameters further control the amount of deformation and correlations among the deformations in different parts of the pattern. An error pattern thus may be generated from the model by sampling according to the probability distributions associated with different components of the model.
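A hedged sketch of how an error pattern might be sampled from a model of this kind follows. It is not the disclosed implementation: a low-frequency cosine basis stands in for the low frequency wavelet basis vectors, the deformation is applied through a first-order (gradient-based) approximation, and independent zero-mean Gaussian coefficients replace whatever correlated distribution the model parameters would specify. All function and variable names are hypothetical.

```python
import numpy as np

def low_freq_basis(size, n_freq):
    """Smooth 2-D cosine modes, flattened into columns: (size*size, n_freq**2)."""
    t = (np.arange(size) + 0.5) / size
    modes = [np.cos(np.pi * k * t) for k in range(n_freq)]
    return np.stack([np.outer(my, mx).ravel() for my in modes for mx in modes], axis=1)

def deform(f, R, a_x, a_y):
    """First-order deformation: f + (df/dx) * (R a_x) + (df/dy) * (R a_y)."""
    dy, dx = np.gradient(f)                      # derivatives along y (rows), x (cols)
    field_x = (R @ a_x).reshape(f.shape)         # smooth, non-uniform x displacement
    field_y = (R @ a_y).reshape(f.shape)         # smooth, non-uniform y displacement
    return f + dx * field_x + dy * field_y

def sample_error_pattern(f, g, R, coeff_std, noise_std, rng):
    """Sample deformation coefficients and noise, then return deformed f minus g."""
    a_x = rng.normal(0.0, coeff_std, R.shape[1])
    a_y = rng.normal(0.0, coeff_std, R.shape[1])
    noise = rng.normal(0.0, noise_std, f.shape)
    return deform(f, R, a_x, a_y) + noise - g

rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:16, 0:16]
f = np.exp(-((xx - 8.0) ** 2 + (yy - 8.0) ** 2) / 8.0)   # first pattern
g = f.copy()                                            # second pattern
R = low_freq_basis(16, n_freq=3)

e_zero = sample_error_pattern(f, g, R, coeff_std=0.0, noise_std=0.0, rng=rng)
e_rand = sample_error_pattern(f, g, R, coeff_std=1.0, noise_std=0.05, rng=rng)
```

With zero deformation and zero noise, identical first and second patterns give an all-zero error pattern; nonzero coefficients perturb the first pattern smoothly rather than redrawing it.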
In accordance with an aspect of the present invention, joint and conditional likelihoods for the model may be evaluated. The model has a number of parameters that govern different probability distributions, and a number of intermediate variables that may not be observed in real applications. By way of example, the deformation coefficients are variables of this kind: a functional form of their probability distribution may be known, but the exact coefficients for each observed pattern may be unknown. To deal with the non-observed, or hidden, variables, a joint likelihood of variables in the system, given the second pattern, is evaluated assuming that the error pattern equals zero. The joint likelihood may be employed to estimate (or infer) parameters of the model that tend to maximize the joint likelihood for stored patterns. After the parameters have been estimated, a likelihood of observing a zero error pattern, given the second pattern, may be computed, such as by integrating over hidden variables in the model. In essence, this produces a likelihood value as to whether the first pattern is in accordance with the model. The likelihood value may then be used in classification by evaluating models for different classes of patterns.
To properly integrate out the hidden variables, the parameters of the associated conditional distributions need to be known. However, the parameters are typically unknown, while there are a number of labeled patterns available to train the model. As a result, the model may be optimized in accordance with an aspect of the present invention. For example, hidden variables may be integrated out in an iterative process, such as by increasing the likelihood of all observed patterns in each iteration. In each iteration, the current parameter estimates are used to infer hidden variables from the joint likelihood, which, in turn, may be utilized to re-estimate the parameters. This iterative process may be repeated for patterns until the estimated parameters converge, thereby providing substantially optimized parameters for the model.
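The alternation described above has the same shape as expectation maximization. The following sketch is a simplified stand-in, not the disclosed procedure: the only hidden variable is which prototype generated each pattern, the deformation coefficients are omitted, and the noise is assumed isotropic with a single shared variance. The function name, the initialization, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def em_prototypes(X, mu, var, n_iter=100, tol=1e-8):
    """Alternate inference of hidden prototype indices and parameter re-estimation."""
    n, d = X.shape
    pi = np.full(mu.shape[0], 1.0 / mu.shape[0])
    for _ in range(n_iter):
        # Infer the hidden prototype index from the current parameter estimates.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = np.log(pi) - 0.5 * sq / var      # shared variance: normalizer cancels
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # Re-estimate the parameters from the inferred responsibilities.
        n_k = resp.sum(axis=0)
        new_mu = (resp.T @ X) / n_k[:, None]
        sq = ((X[:, None, :] - new_mu[None, :, :]) ** 2).sum(axis=2)
        var = float((resp * sq).sum() / (n * d))
        pi = n_k / n
        converged = np.abs(new_mu - mu).max() < tol
        mu = new_mu
        if converged:                            # stop once the estimates settle
            break
    return mu, var, pi

# Synthetic training patterns: two prototypes plus Gaussian noise (std 0.5).
rng = np.random.default_rng(0)
true_mu = np.array([[0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]])
labels = np.repeat([0, 1], 100)
X = true_mu[labels] + 0.5 * rng.normal(size=(200, 4))

mu_hat, var_hat, pi_hat = em_prototypes(X, mu=X[[0, -1]].copy(), var=1.0)
```

Initializing from two training patterns keeps the example short; as with any such iterative scheme, a poor starting point can converge to a local optimum.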
Provided that parameters of the model are known (or at least estimated from a training set), the joint likelihood that the error pattern equals zero, given the second pattern, further may be employed to classify new patterns in accordance with the present invention. The joint likelihood may be computed by averaging over the hidden variables, taking the dependencies among variables into account. The estimated parameters of the model serve to properly regularize the distance among patterns, and the integration technique rewards the patterns that are easier to reach with the generative model. For example, if an observed pattern is close to several prototype patterns, the likelihood computation will naturally reward such a pattern with greater likelihood than if the observed pattern is only remotely similar to one of the prototypes.
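The effect of averaging over hidden variables can be illustrated with a toy calculation. The sketch below assumes, for illustration only, that the sole hidden variable is an equally weighted prototype index and that noise is isotropic Gaussian with a known variance; the deformation variables of the model are left out, and the class labels, prototypes, and function names are hypothetical.

```python
import numpy as np

def log_likelihood(x, prototypes, var):
    """log of (1/K) * sum_k N(x; prototype_k, var * I), via log-sum-exp."""
    k, d = prototypes.shape
    sq = ((prototypes - x) ** 2).sum(axis=1)
    log_terms = -0.5 * sq / var - 0.5 * d * np.log(2.0 * np.pi * var) - np.log(k)
    m = log_terms.max()
    return float(m + np.log(np.exp(log_terms - m).sum()))

def classify(x, class_models, var):
    """Return the label whose model assigns x the greatest likelihood, plus all scores."""
    scores = {label: log_likelihood(x, protos, var)
              for label, protos in class_models.items()}
    return max(scores, key=scores.get), scores

# Class "A" has several prototypes near the observed pattern; class "B" has one.
class_models = {
    "A": np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
    "B": np.array([[0.0, 0.0], [9.0, 9.0], [-9.0, 9.0]]),
}
x = np.array([0.0, 0.0])
label, scores = classify(x, class_models, var=1.0)
print(label, scores)
```

Both classes contain a prototype that matches x exactly, yet class "A" scores higher because its other prototypes also contribute to the average, which mirrors the reward for patterns that are close to several prototypes.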
In accordance with another aspect of the present invention, the model may be designed also to deform the second pattern. For example, a deformation field having deformation coefficients may be applied to form a deformation component that is added to the second pattern to derive a second deformed pattern. In a case where the second pattern is deformed, the deformation coefficients for the second pattern may be correlated with the deformation coefficients for the first pattern, such as being substantially opposite. A covariance matrix parameter further may be selected to capture a desired level of correlation between the respective deformation coefficients for the first and second patterns. Consequently, the resulting error pattern is the difference between the deformed first pattern and the deformed second pattern.
Another aspect of the present invention provides a method for learning mixtures of models to facilitate pattern recognition. The method includes providing a model having model parameters, the model characterizing an error pattern functionally related to a difference between two patterns. At least a first of the two patterns is deformed by application of an associated substantially smooth and non-uniform deformation field. A joint likelihood in the model is determined relative to the model parameters, given a stored pattern, assuming that the error pattern equals zero. The model parameters that tend to maximize the joint likelihood for a plurality of stored patterns are determined. The methodology may, in accordance with an aspect of the present invention, be implemented as computer-executable instructions in a computer-readable medium.
Yet another aspect of the present invention provides a method for generating a model to facilitate pattern recognition. An error pattern is modeled based on a first pattern relative to a second pattern, at least the first pattern being deformed by application of a substantially smooth deformation field. The model has at least one parameter for characterizing a set of pattern prototypes and associated noise levels and for controlling deformation of the first pattern. A likelihood that an error pattern is zero, given the second pattern, is characterized. The error pattern is functionally related to the first pattern, the second pattern, and the parameter. The parameter is estimated so as to tend to maximize the likelihood for a plurality of stored second patterns. The methodology may, in accordance with an aspect of the present invention, be implemented as computer-executable instructions in a computer-readable medium.
To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.