In object recognition, and in particular in many machine vision tasks, one is interested in recognizing a user-defined model object in an image. The object in the image may have undergone an arbitrary transformation from a certain class of geometric transformations. If the class of transformations is the class of translations, one is interested in obtaining the position of the model in the image. The class of translations is typically used if it can be ensured that the model always occurs in the same rotation and size in the image, e.g., because it is mounted at a fixed angle on an x-y stage and the camera is mounted in a fixed position perpendicular to the stage. If the class of transformations is the class of rigid transformations, the rotation of the object in the image is additionally desired. This class of transformations can, for example, be used if the camera is mounted perpendicular to the stage, but the angle of the object cannot be kept fixed. If the class of transformations is the class of similarity transformations, the size of the object in the image may additionally vary. This class of transformations can occur, for example, if the distance between the camera and the object cannot be kept fixed or if the object itself may undergo size changes. If neither the position nor the 3D rotation of the camera with respect to the object can be kept fixed, the object will undergo a general perspective transformation in the image.
If the interior orientation of the camera is unknown, a perspective projection between two planes (i.e., the surface of the object and the image plane) can be described by a 3×3 matrix in homogeneous coordinates:

$$
\begin{pmatrix} x' \\ y' \\ t' \end{pmatrix}
=
\begin{pmatrix}
p_{11} & p_{12} & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ t \end{pmatrix}
$$
The matrix and vectors are only determined up to an overall scale factor (see Hartley and Zisserman (2000) [Richard Hartley and Andrew Zisserman: Multiple View Geometry in Computer Vision. Cambridge University Press, 2000], chapters 1.1-1.4). Hence, the matrix, which determines the pose of the object, has eight degrees of freedom. If the interior orientation of the camera is known, these eight degrees of freedom reduce to the six degrees of freedom of the pose of the object with respect to the camera (three for translation and three for rotation).
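The scale ambiguity can be illustrated with a minimal Python sketch. It assumes the input point is given with t = 1 and that the image point is obtained by dividing x′ and y′ by t′. The function names and the matrix entries are hypothetical and chosen only for illustration.

```python
def apply_homography(P, x, y):
    """Map the point (x, y, 1) through the 3x3 matrix P (nested lists)
    and return the inhomogeneous image point (x'/t', y'/t')."""
    xh = P[0][0] * x + P[0][1] * y + P[0][2]
    yh = P[1][0] * x + P[1][1] * y + P[1][2]
    th = P[2][0] * x + P[2][1] * y + P[2][2]
    if th == 0:
        raise ValueError("point is mapped to infinity")
    return xh / th, yh / th


def scale_matrix(P, s):
    """Multiply every entry of P by the scalar s."""
    return [[s * v for v in row] for row in P]


# Arbitrary example matrix (illustrative values, not taken from the text).
P = [[1.0, 0.2, 5.0],
     [0.1, 1.5, -3.0],
     [0.001, 0.002, 1.0]]

p_original = apply_homography(P, 10.0, 20.0)
p_scaled = apply_homography(scale_matrix(P, 4.0), 10.0, 20.0)
```

Both calls yield the same image point up to floating-point rounding, because the common factor cancels in the division by t′. This is why only eight of the nine matrix entries are independent.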
Often, this type of transformation is approximated by a general 2D affine transformation, i.e., a transformation where the output points $(x', y')^T$ are obtained from the input points $(x, y)^T$ by the following formula:

$$
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
+
\begin{pmatrix} t_x \\ t_y \end{pmatrix}.
$$
General affine transformations can, for example, be decomposed into the following, geometrically intuitive, transformations: a scaling of the original x and y axes by different scaling factors $s_x$ and $s_y$, a skew transformation of the y axis with respect to the x axis, i.e., a rotation of the y axis by an angle θ, while the x axis is kept fixed, a rotation of both axes by an angle φ, and finally a translation by a vector $(t_x, t_y)^T$. Therefore, an arbitrary affine transformation can be written as:

$$
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix}
\cos\varphi & -\sin\varphi \\
\sin\varphi & \cos\varphi
\end{pmatrix}
\begin{pmatrix}
1 & -\sin\theta \\
0 & \cos\theta
\end{pmatrix}
\begin{pmatrix}
s_x & 0 \\
0 & s_y
\end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
+
\begin{pmatrix} t_x \\ t_y \end{pmatrix}.
$$
FIG. 1 displays the parameters of a general affine transformation graphically. Here, a square of side length 1 is transformed into a parallelogram. Similarity transformations are a special case of affine transformations in which the skew angle θ is 0 and both scaling factors are identical, i.e., $s_x = s_y = s$. Likewise, rigid transformations are a special case of similarity transformations in which the scaling factor is 1, i.e., $s = 1$. Finally, translations are a special case of rigid transformations in which φ=0. The relevant parameters of the class of geometrical transformations will be referred to as the pose of the object in the image. For example, for rigid transformations the pose consists of the rotation angle φ and the translation vector $(t_x, t_y)^T$. Object recognition hence is the determination of the poses of all instances of the model in the image.
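The decomposition into scaling, skew, rotation, and translation, together with the special cases just listed, can be sketched in Python as follows. All function names are hypothetical and angles are assumed to be in radians. The sketch only composes the transformation from the parameters named in the text.

```python
import math


def affine_matrix(sx, sy, theta, phi):
    """Return the 2x2 matrix Rotation(phi) * Skew(theta) * Scale(sx, sy)."""
    # Skew(theta) * Scale(sx, sy) = [[sx, -sy*sin(theta)], [0, sy*cos(theta)]]
    a, b = sx, -sy * math.sin(theta)
    c, d = 0.0, sy * math.cos(theta)
    cp, sp = math.cos(phi), math.sin(phi)
    # Left-multiply by the rotation matrix [[cp, -sp], [sp, cp]].
    return [[cp * a - sp * c, cp * b - sp * d],
            [sp * a + cp * c, sp * b + cp * d]]


def affine(x, y, sx, sy, theta, phi, tx, ty):
    """Apply the general affine transformation to the point (x, y)."""
    A = affine_matrix(sx, sy, theta, phi)
    return (A[0][0] * x + A[0][1] * y + tx,
            A[1][0] * x + A[1][1] * y + ty)


def similarity(x, y, s, phi, tx, ty):
    """Special case: no skew, identical scaling factors."""
    return affine(x, y, s, s, 0.0, phi, tx, ty)


def rigid(x, y, phi, tx, ty):
    """Special case of a similarity transformation with s = 1."""
    return similarity(x, y, 1.0, phi, tx, ty)


def translation(x, y, tx, ty):
    """Special case of a rigid transformation with phi = 0."""
    return rigid(x, y, 0.0, tx, ty)
```

For example, `affine(0.0, 1.0, 1.0, 1.0, math.radians(30), 0.0, 0.0, 0.0)` maps the upper-left corner of the unit square to approximately (-0.5, 0.866), i.e., the y axis is rotated by the skew angle while the x axis stays fixed.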
Several methods have been proposed in the art to recognize objects in images. Most of them suffer from the restriction that the model will not be found in the image if it is occluded or degraded by additional clutter objects. Furthermore, most of the existing methods will not detect the model if the image exhibits non-linear contrast changes, e.g., due to illumination changes.
All of the known object recognition methods generate an internal representation of the model in memory at the time the model is generated. To recognize the model in the image, in most methods the model is systematically compared to the image using all allowable degrees of freedom of the chosen class of transformations for the pose of the object (see, e.g., Borgefors (1988) [Gunilla Borgefors. Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6):849-865, November 1988], Brown (1992) [Lisa Gottesfeld Brown. A survey of image registration techniques. ACM Computing Surveys, 24(4):325-376, December 1992], and Rucklidge (1997) [William J. Rucklidge. Efficiently locating objects using the Hausdorff distance. International Journal of Computer Vision, 24(3):251-270, 1997]). For each set of parameters of the pose, a match metric is computed that gives a measure of how well the model fits to the image at the pose under consideration. To speed up the search through the space of allowable transformations, usually image pyramids are used both on the model and the image to reduce the amount of data that needs to be examined (see, e.g., Tanimoto (1981) [Steven L. Tanimoto. Template matching in pyramids. Computer Graphics and Image Processing, 16:356-369, 1981], Borgefors (1988), or Brown (1992)).
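The data reduction achieved by an image pyramid can be illustrated with a small Python sketch. It assumes that each pyramid level is obtained by averaging 2×2 pixel blocks of the level below, which is one common construction. The cited references may build their pyramids differently, and the function names are hypothetical.

```python
def downsample(img):
    """Halve width and height by averaging each 2x2 block.
    A trailing odd row or column is dropped."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)]
            for r in range(h)]


def build_pyramid(img, levels):
    """Return a list of images; index 0 is the original resolution."""
    pyramid = [img]
    for _ in range(levels - 1):
        top = pyramid[-1]
        if len(top) < 2 or len(top[0]) < 2:
            break  # cannot be reduced any further
        pyramid.append(downsample(top))
    return pyramid


# 4x4 test image with gray values 0..15 (illustrative data only).
image = [[4 * r + c for c in range(4)] for r in range(4)]
pyramid = build_pyramid(image, 3)
```

Each level contains a quarter of the pixels of the level below it. A search that starts on the highest level therefore examines far fewer positions and only needs to refine promising candidates on the lower levels.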
The simplest class of object recognition methods is based on the gray values of the model and image itself and uses normalized cross correlation as a match metric (see U.S. Pat. No. 4,972,359, U.S. Pat. No. 5,222,155, U.S. Pat. No. 5,583,954, U.S. Pat. No. 5,943,442, U.S. Pat. No. 6,088,483, and Brown (1992), for example). Normalized cross correlation has the advantage that it is invariant to linear brightness changes, i.e., the object can be recognized if it has undergone linear illumination changes. However, normalized cross correlation has several distinct disadvantages. First, it is very expensive to compute, making the methods based on this metric very slow. As a result, the class of transformations is usually restricted to translations, because otherwise the search would take too much time for real-time applications, even if image pyramids are used. Second, the metric is not robust to occlusions of the object, i.e., the object will usually not be found even if only small parts of it are occluded in the image. Third, the metric is not robust to clutter, i.e., the object will usually not be found if there are disturbances on or close to the object.
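The invariance of normalized cross correlation to linear brightness changes, and its sensitivity to occlusion, can be illustrated with the following Python sketch. It uses the usual zero-mean formulation on flat lists of gray values. The function name and the sample data are assumptions made for illustration and are not taken from the cited patents.

```python
import math


def ncc(model, window):
    """Zero-mean normalized cross correlation of two equally long
    lists of gray values; the result lies in [-1, 1]."""
    n = len(model)
    mean_m = sum(model) / n
    mean_w = sum(window) / n
    num = sum((m - mean_m) * (w - mean_w) for m, w in zip(model, window))
    den = math.sqrt(sum((m - mean_m) ** 2 for m in model) *
                    sum((w - mean_w) ** 2 for w in window))
    return num / den if den > 0 else 0.0


model = [10, 20, 30, 40, 50, 60]
# Linear brightness change: gain 2 and offset 15.
brighter = [2.0 * g + 15 for g in model]
# Same window, but the second half is covered by a dark occluder.
occluded = brighter[:3] + [0, 0, 0]

score_linear = ncc(model, brighter)
score_occluded = ncc(model, occluded)
```

Here `score_linear` equals 1 up to rounding, whereas `score_occluded` is far below any reasonable acceptance threshold. The example also hints at the cost: every candidate pose requires two means, three sums, a square root, and a division over all model pixels.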
Another class of algorithms is also based on the gray values of the model and image itself, but uses either the sum of the squared gray value differences or the sum of the absolute value of the gray value differences as the match metric (see U.S. Pat. No. 5,548,326 and Brown (1992), for example). This metric can be made invariant to linear brightness changes (Lai and Fang (1999) [Shang-Hong Lai and Ming Fang. Accurate and fast pattern localization algorithm for automated visual inspection. Real-Time Imaging, 5:3-14, 1999]). Since sums of squared or absolute differences are not as expensive to compute as normalized cross correlation, usually a larger class of transformations, e.g., rigid transformations, is allowed. This metric, however, possesses the same disadvantages as correlation-based methods, i.e., it is not robust to occlusion or clutter.
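A minimal Python sketch of the two difference-based metrics is given below. The function names and gray values are illustrative assumptions. Note that both are dissimilarity measures, so smaller values indicate a better match, and that in this plain form a mere brightness offset already changes the result.

```python
def ssd(model, window):
    """Sum of squared gray value differences (0 for a perfect match)."""
    return sum((m - w) ** 2 for m, w in zip(model, window))


def sad(model, window):
    """Sum of absolute gray value differences (0 for a perfect match)."""
    return sum(abs(m - w) for m, w in zip(model, window))


model = [10, 20, 30, 40]
identical = list(model)
# Same pattern with every gray value raised by 5.
offset = [g + 5 for g in model]
```

With these values, `ssd(model, offset)` is 100 and `sad(model, offset)` is 20 although the pattern is unchanged, which is why an additional normalization is needed to obtain invariance to linear brightness changes. Neither metric needs a square root or a division per pose, which is consistent with the lower cost noted in the text.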
A more complex class of object recognition methods does not use the gray values of the model or object itself, but uses the edges of the object for matching. During the creation of the model, edge extraction is performed on the model image and its derived image pyramid (see, e.g., Borgefors (1988), Rucklidge (1997), and U.S. Pat. No. 6,005,978). Edge extraction is the process of converting a gray level image into a binary image in which only the points corresponding to an edge are set to the value 1, while all other pixels receive the value 0, i.e., the image is actually segmented into an edge region. Of course, the segmented edge region need not be stored as a binary image, but can also be stored by other means, e.g., runlength encoding. Usually, the edge pixels are defined as the pixels in the image where the magnitude of the gradient is maximum in the direction of the gradient. Edge extraction is also performed on the image in which the model is to be recognized and its derived image pyramid. Various match metrics can then be used to compare the model to the image. One class of match metrics is based on measuring the distance of the model edges to the image edges under the pose under consideration. To facilitate the computation of the distances of the edges, a distance transform is computed on the image pyramid. The match metric in Borgefors (1988) computes the average distance of the model edges and the image edges. Obviously, this match metric is robust to clutter edges since they do not occur in the model and hence can only decrease the average distance from the model to the image edges. The disadvantage of this match metric is that it is not robust to occlusions because the distance to the nearest edge increases significantly if some of the edges of the model are missing in the image. The match metric in Rucklidge (1997) tries to remedy this shortcoming by calculating the k-th largest distance of the model edges to the image edges. 
If the model contains n points, the metric is hence robust to 100*k/n % occlusion. Another class of match metrics is based on simple binary correlation, i.e., the match metric is the average of all points in which the model and the image under the current pose both have an edge pixel set (see U.S. Pat. Nos. 6,005,978 and 6,111,984, for example). To speed up the search for potential instances of the model, in U.S. Pat. No. 6,005,978 the generalized Hough transform (Ballard (1981) [D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111-122, 1981]) is used. This match metric has the disadvantage that the alignment between the edges in the model and the edges in the image needs to be very good to return the correct value of the match metric under the pose under consideration. A complex scheme is used to make the edges in the image broader to achieve the correct match metric. Finally, edges are sometimes used to define the relevant points to use for correlation-based approaches (see U.S. Pat. Nos. 6,023,530 and 6,154,567). Obviously, these approaches have the same drawbacks as the above-mentioned correlation-based schemes since the match metric is the same or very similar. All of these match metrics have the disadvantage that they do not take into account the direction of the edges. In U.S. Pat. No. 6,005,978, the edge direction enters the method through the use of the generalized Hough transform, but is disregarded in the match metric. It is well known, however, that disregarding the edge direction information leads to many false positive instances of the model in the image, i.e., found models that are not true instances of the model (Olson and Huttenlocher (1997) [Clark F. Olson and Daniel P. Huttenlocher. Automatic target recognition by matching oriented edge pixels. IEEE Transactions on Image Processing, 6(1):103-113, January 1997]).
For this reason, some approaches integrate edge direction information into the match metric (see U.S. Pat. Nos. 5,550,933, 5,631,981, 6,154,566, and Hashimoto et al. (1992) [Manabu Hashimoto, Kazuhiko Sumi, Yoshikazu Sakaue, and Shinjiro Kawato. High-Speed Template Matching Algorithm Using Information of Contour Points. Systems and Computers in Japan, 23(9):78-87, 1992], for example). However, these approaches do not use image pyramids to speed up the search (which makes the runtime prohibitively large) and only compute the translation of the model. In all the above-mentioned approaches, since the image itself is binarized, the match metric is only invariant to a narrow range of illumination changes. If the image contrast is lowered, progressively fewer edge points will be segmented, which has the same effect as progressively larger occlusion.
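The behavior of the two distance-based edge metrics discussed above under clutter and occlusion can be illustrated with a short Python sketch. It assumes that the model edge points have already been transformed by the pose under consideration, and it computes nearest-edge distances by brute force instead of a distance transform. The function names and point sets are hypothetical.

```python
import math


def nearest_distances(model_pts, image_pts):
    """Distance from each model edge point to its nearest image edge point."""
    return [min(math.hypot(mx - ix, my - iy) for ix, iy in image_pts)
            for mx, my in model_pts]


def mean_distance(model_pts, image_pts):
    """Average model-to-image edge distance."""
    d = nearest_distances(model_pts, image_pts)
    return sum(d) / len(d)


def kth_largest_distance(model_pts, image_pts, k):
    """k-th largest model-to-image edge distance (k = 1 is the maximum)."""
    d = sorted(nearest_distances(model_pts, image_pts), reverse=True)
    return d[k - 1]


# Model: a horizontal edge of ten points (illustrative data only).
model = [(x, 0) for x in range(10)]
complete = list(model)
cluttered = complete + [(3, 5), (7, -4)]   # two extra edge points
occluded = [(x, 0) for x in range(8)]      # last two edge points missing
```

Extra edge points leave the average distance at 0, since they can only shorten the distance to the nearest image edge. The two missing points raise the average distance to 0.3, while the third-largest distance is still 0 because the largest distances are discarded by the ranking.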
Evidently, the state-of-the-art methods for object recognition possess several shortcomings. None of the approaches is robust against occlusion, clutter, and non-linear contrast changes at the same time. Furthermore, often computationally expensive preprocessing operations, e.g., distance transforms or generalized Hough transforms, need to be performed to facilitate the object recognition. In many applications it is necessary that the object recognition step is robust to the types of changes mentioned above. For example, in print quality inspection, the model image is the ideal print, e.g., of a logo. In the inspection, one is interested in determining whether the current print deviates from the ideal print. To do so, the print in the image must be aligned with the model (usually by a rigid transformation). Obviously the object recognition (i.e., the determination of the pose of the print) must be robust to missing characters or parts thereof (occlusion) and to extra ink in the print (clutter). If the illumination cannot be kept constant across the entire field of view, the object recognition obviously must also be robust to non-linear illumination changes. Hence, it is an object of the present invention to provide an improved visual recognition system and method for occlusion- and clutter-invariant object recognition. It is a further object to provide a visual recognition system and method for occlusion-, clutter-, and illumination-invariant object recognition.
These objects are achieved with the features of the claims.