The fast, robust, and accurate localization of a given 2D object template in images is the natural prerequisite for numerous computer vision and particularly machine vision applications. For example, for pick and place applications, an object recognition method must determine the location of the object that is imaged. Given its location in conjunction with the known geometry of the imaging device, the pose of the object can be calculated by methods that are well known in the art. Given this pose, a robot can grasp the object from, e.g., a conveyor belt. In various inspection tasks, extracting the location of an object allows for the un-warping of the found region in the image and facilitates optical character recognition (OCR) or a comparison with a prototype image for, e.g., detection of possible manufacturing errors.
Several methods have been proposed in the art to determine the position of an object in an image. Most of these methods evaluate a similarity measure between the image and each of a set of possible object poses. Positions at which this similarity measure exceeds a threshold and attains a local maximum are chosen as the location of the object.
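The selection rule described above can be illustrated with a minimal sketch. It assumes, for simplicity, that the pose space is reduced to 2D translation, so that the similarity values form a 2D score map; the function name `find_matches` and the 8-neighborhood are illustrative assumptions and are not taken from any of the cited methods.

```python
def find_matches(scores, threshold):
    """Return (row, col, score) for positions that exceed the threshold
    and are local maxima within their 8-neighborhood.

    Assumption: `scores` is a non-empty rectangular list of lists holding
    one similarity value per candidate translation.
    """
    rows, cols = len(scores), len(scores[0])
    matches = []
    for r in range(rows):
        for c in range(cols):
            s = scores[r][c]
            if s <= threshold:
                continue  # does not exceed the threshold
            # Collect the values of all existing neighbors (image border is clipped).
            neighbors = [
                scores[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            # Ties are accepted, so every position of a plateau is reported.
            if all(s >= n for n in neighbors):
                matches.append((r, c, s))
    return matches
```

In this sketch, a position with a high score that lies next to an even higher score is suppressed, which keeps only one candidate per peak of the similarity measure.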
Depending on the similarity measure that is used, a certain invariance against adverse imaging conditions is achieved. For instance, with normalized correlation as the similarity measure, invariance against linear gray value changes between the model image and the search image is achieved. Particularly relevant for the present invention is a similarity measure that is invariant against partial occlusion, clutter, and nonlinear contrast changes, incorporated herein by reference (U.S. Pat. No. 7,062,093, EP 1193642, and JP 3776340). The general idea of said metric is to use the dot product of the normalized directions of image and model features as the measure of similarity between a model and the image.
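The general idea of such a direction-based metric can be sketched as follows. This is a simplified illustration and not the measure as defined in the cited patents: it assumes that the i-th model direction vector has already been paired with the image direction vector at the corresponding transformed position, and that the sum is normalized by the number of model points. The function name `direction_similarity` is hypothetical.

```python
import math


def direction_similarity(model_dirs, image_dirs, eps=1e-12):
    """Mean dot product of unit-normalized direction vectors.

    Assumptions: both arguments are equally long sequences of (dx, dy)
    tuples, already brought into correspondence; vectors with (near) zero
    length, e.g. in occluded regions, contribute 0 to the sum.
    """
    if not model_dirs:
        raise ValueError("model must contain at least one direction vector")
    total = 0.0
    for (mx, my), (ix, iy) in zip(model_dirs, image_dirs):
        m_norm = math.hypot(mx, my)
        i_norm = math.hypot(ix, iy)
        if m_norm < eps or i_norm < eps:
            continue  # no usable direction at this point
        # Normalizing both vectors removes the influence of their lengths,
        # i.e. of the local contrast.
        total += (mx * ix + my * iy) / (m_norm * i_norm)
    return total / len(model_dirs)
```

Because each vector is normalized individually, rescaling the image directions by arbitrary positive factors leaves the score unchanged, and missing directions at a fraction of the model points lower the score only by roughly that fraction.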
Typically, an exhaustive search over all pose parameters is computationally very expensive and prohibitive for most real-time applications. Most of the prior art methods overcome this speed limitation by building an image pyramid from both the model and the search image (see, e.g., Tanimoto (1981) [Steven L. Tanimoto. Template matching in pyramids. Computer Graphics and Image Processing, 16:356-369, 1981], or Brown (1992) [Lisa Gottesfeld Brown. A survey of image registration techniques. ACM Computing Surveys, 24(4):325-376, December 1992]). Then the similarity measure is evaluated for the full search range only at the highest pyramid level. At lower levels, only promising match candidates are tracked until the lowest pyramid level is reached. Here, the number of pyramid levels that are used is a critical decision that directly influences the runtime of the object recognition method. Typically, the number of pyramid levels is selected based on the minimal size of the object in the highest pyramid image. If the object is very small in that image, it is hard to discriminate the object from, e.g., clutter, and too many possible match candidates must then be evaluated. If too few pyramid levels are chosen, the search on the highest pyramid level is prohibitively slow.
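The pyramid scheme described above can be sketched under simplifying assumptions: each level halves the resolution by averaging 2x2 blocks (practical systems typically smooth before subsampling), and a candidate found at one level is re-evaluated at the next finer level in a small window around the doubled coordinates. The names `downsample`, `build_pyramid`, and `refine_window` are illustrative only.

```python
def downsample(image):
    """Halve the resolution by averaging each 2x2 block.

    A trailing odd row or column is dropped in this sketch.
    """
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image, num_levels):
    """Return [level 0 (full resolution), level 1, ...] with num_levels entries."""
    pyramid = [image]
    for _ in range(num_levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def refine_window(row, col, radius=1):
    """Positions to re-evaluate on the next finer level for a candidate
    found at (row, col) on the coarser level. Clipping to the image
    bounds is omitted here and left to the caller."""
    return [
        (2 * row + dr, 2 * col + dc)
        for dr in range(-radius, radius + 1)
        for dc in range(-radius, radius + 1)
    ]
```

With this scheme, the full search range is visited only on the coarsest level, whose pixel count shrinks by a factor of four per level, while each tracked candidate costs only a constant number of evaluations on every finer level.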
Another way to speed up the search is to assume that the motion parameters of the object under inspection can be approximated by a linear affine transformation. A linear affine transformation maps input points $(x, y)^T$ to output points $(x', y')^T$ according to the formula:
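\[
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
+
\begin{pmatrix} t_x \\ t_y \end{pmatrix}.
\]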
This general formula can be decomposed further into a geometrically more meaningful parameterization:
\[
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} 1 & -\sin\theta \\ 0 & \cos\theta \end{pmatrix}
\begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
+
\begin{pmatrix} t_x \\ t_y \end{pmatrix}.
\]
The parameters then describe a scaling of the original x and y axes by different scaling factors $s_x$ and $s_y$, a skew transformation of the y axis with respect to the x axis, i.e., a rotation of the y axis by an angle $\theta$ while the x axis is kept fixed, a rotation of both axes by an angle $\varphi$, and finally a translation by a vector $(t_x, t_y)^T$. Typically, an object recognition system evaluates these parameters only for a reduced subset, e.g., only translation and rotation. Furthermore, the parameters are restricted to a certain fixed range, e.g., a reduced rotation range. This reduces the space of possible poses that an object recognition system must check on the highest pyramid level and hence speeds up the search.
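As an illustration of this parameterization, the following sketch composes the 2x2 linear part from the rotation angle, the skew angle, and the two scaling factors in the order given by the decomposition (scaling first, then skew, then rotation) and applies it to a point. The function names `affine_matrix` and `apply_affine` are hypothetical and serve only this example.

```python
import math


def affine_matrix(phi, theta, sx, sy):
    """Linear part R(phi) * Skew(theta) * Scale(sx, sy) as a 2x2 tuple of rows."""
    c, s = math.cos(phi), math.sin(phi)
    # Skew(theta) * Scale(sx, sy)
    #   = [[1, -sin(theta)], [0, cos(theta)]] * [[sx, 0], [0, sy]]
    k11, k12 = sx, -sy * math.sin(theta)
    k21, k22 = 0.0, sy * math.cos(theta)
    # Left-multiply by the rotation matrix [[c, -s], [s, c]].
    return (
        (c * k11 - s * k21, c * k12 - s * k22),
        (s * k11 + c * k21, s * k12 + c * k22),
    )


def apply_affine(a, t, point):
    """Map (x, y) to A * (x, y) + t."""
    x, y = point
    return (
        a[0][0] * x + a[0][1] * y + t[0],
        a[1][0] * x + a[1][1] * y + t[1],
    )
```

Restricting the search to translation and rotation corresponds to fixing `theta = 0` and `sx = sy = 1` in this sketch, so that only `phi` and the translation vector remain free.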
However, in various situations the object that must be found is transformed according to a more general transformation than a linear affine transformation or a subset thereof. One such transformation is the perspective transformation that describes a mapping of a planar object that is imaged from different camera positions according to the formula:
\[
\begin{pmatrix} x' \\ y' \\ t' \end{pmatrix}
=
\begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{pmatrix}
\begin{pmatrix} x \\ y \\ t \end{pmatrix}
\]
(see Hartley and Zisserman (2000) [Richard Hartley and Andrew Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2000]). The nine parameters are defined up to scale, resulting in 8 degrees of freedom.
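The effect of the scale ambiguity can be seen in a short sketch that applies such a 3x3 matrix to an image point. It assumes that the input point is given in inhomogeneous coordinates (i.e., with homogeneous coordinate 1) and that the result is converted back by dividing by the third coordinate; the function name `apply_perspective` is illustrative.

```python
def apply_perspective(p, point, eps=1e-12):
    """Apply a 3x3 perspective transformation to an inhomogeneous 2D point.

    Assumption: `p` is a 3x3 nested sequence; the input point is lifted to
    homogeneous coordinates (x, y, 1).
    """
    x, y = point
    xh = p[0][0] * x + p[0][1] * y + p[0][2]
    yh = p[1][0] * x + p[1][1] * y + p[1][2]
    th = p[2][0] * x + p[2][1] * y + p[2][2]
    if abs(th) < eps:
        raise ValueError("point is mapped to infinity")
    # Dividing by the third coordinate cancels any common scale factor of p.
    return (xh / th, yh / th)
```

Multiplying all nine entries of the matrix by the same nonzero factor leaves the returned point unchanged, which is why only 8 of the 9 parameters are independent.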
We distinguish explicitly between the case where the final task of the object recognition system is only to rectify an image and the case where the pose of the object must be determined. For the former, it is enough to determine the perspective transformation. Here, the inverted perspective transformation is used to rectify the image.
For the case that the 3D pose of the object must be determined and the internal parameters of the camera are provided, 6 degrees of freedom suffice to describe the pose (3 for the translation and 3 for the rotation). It is important to note that a perspective transformation cannot always be directly transformed into a pose, because two additional nonlinear constraints must be enforced on the 8 parameters of the perspective transformation in order to result in real poses (Berthold K. P. Horn, Projective Geometry considered Harmful, 1999). Once a valid perspective transformation is found, it can be decomposed directly into a 3D pose by methods known in the art (e.g., Olivier Faugeras, Three-dimensional computer vision: a geometric viewpoint. The MIT Press, 2001, chapter 7.5). A preferred way is to search directly for the 3D pose parameters rather than first determining a perspective transformation and then decomposing it into a pose.
Another example where a linear transformation does not suffice is when the image of the object is deformed nonlinearly. This might be due to a distortion induced by the camera lens system that cannot be corrected beforehand. A further example is when the imaging is performed through a medium that produces irregular distortions, such as hot air or water. Another source of nonlinear transformation is when the object itself is deformable, e.g., when it is printed on a surface that is bent or wrinkled. Here, not only the pose but also the deformation of the model must be determined simultaneously. A mathematical description for a non-rigid deformation is to add a warping W(x,y) so that points are transformed according to the formula:
\[
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
W(x, y)
+
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
+
\begin{pmatrix} t_x \\ t_y \end{pmatrix}.
\]
If
\[
W(x, y) = \sum_{i=1}^{n} w_i \, U\!\left( \left\| P_i - (x, y) \right\| \right)
\]
and $U(r) = r^2 \log r^2$, the well-known thin-plate-spline function (Fred L. Bookstein, "Principal Warps: Thin-plate Splines and the Decomposition of Deformations", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 6, 567-585, 1989) is obtained. Here, the warp is parameterized by anchor points $P_i$ and coefficients $w_i$. The resulting warp minimizes the curvature between the anchor points.
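The evaluation of such a warp at a single point can be sketched as follows. Two assumptions are made that are not spelled out in the text: each coefficient w_i is taken to be a 2D vector, because W(x,y) is added to a 2D point, and U(0) is set to 0, which is the limit of r^2 log r^2 as r approaches 0. The names `tps_kernel` and `tps_warp` are illustrative; estimating the coefficients from point correspondences is not shown.

```python
import math


def tps_kernel(r):
    """Radial basis function U(r) = r^2 * log(r^2), with U(0) = 0 by continuity."""
    if r == 0.0:
        return 0.0
    r2 = r * r
    return r2 * math.log(r2)


def tps_warp(point, anchors, weights):
    """Evaluate W(x, y) = sum_i w_i * U(||P_i - (x, y)||).

    Assumptions: `anchors` is a sequence of (px, py) anchor points P_i and
    `weights` an equally long sequence of 2D coefficient vectors w_i.
    Returns the displacement (wx, wy) contributed by the warp.
    """
    x, y = point
    wx = wy = 0.0
    for (px, py), (ax, ay) in zip(anchors, weights):
        u = tps_kernel(math.hypot(px - x, py - y))
        wx += ax * u
        wy += ay * u
    return (wx, wy)
```

The returned displacement would then be added to the affine part of the transformation, as in the formula above.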
Most prior art approaches for nonlinear object recognition make an assumption that even if the whole object is deformed, sufficiently small parts of the model remain fairly similar in an image, even after a deformation.
However, it is an open question how to incorporate this assumption into an efficient search method of an object recognition system. One approach (see, e.g., U.S. Pat. No. 7,239,929 or U.S. Pat. No. 7,190,834) consists of organizing the decomposed parts of the model hierarchically. Here, one part is selected as a root part of the subdivision. Starting from this root part, the other parts are organized in a tree-like structure. It is important to note that in the subsequent search this root part is detected alone. Once this root part is detected, the possible locations of the subsequent parts are narrowed down based on the assumptions about the deformation of the object. The search for the other parts is consequently simplified.
However, there are several evident problems with this prior art approach. One is that searching for a part is typically less discriminative than a search for the whole object because a part contains by definition less information. This leads to spurious matches and to a reduced search speed because more match hypotheses must be evaluated. A further limitation is that the size of a part is smaller than that of the whole model and accordingly only a smaller number of pyramid levels can be used before the relative size of the model in the image becomes too small to be used by a feature-based search method.
The aim of the present invention is a holistic approach for deformable object detection that combines the advantages of the said invariant match metric, the decomposition of the model into parts, and a search method that takes all search results for all parts into account at the same time. Despite the fact that the model is decomposed into sub-parts, the relevant size of the model that is used for the search at the highest pyramid level is not reduced. Hence, the present invention does not suffer the speed limitations of a reduced number of pyramid levels that prior art methods have.