Presently, many object recognition technologies in practical use employ the template matching technique according to the sequential similarity detection algorithm and cross-correlation coefficients. The template matching technique is effective in the special case where it is possible to assume that a detection object appears in an input image. However, the technique is ineffective in an environment where objects must be recognized from ordinary images subject to inconsistent viewpoints or illumination states.
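The cross-correlation form of template matching can be sketched as follows. This is a minimal illustration, not the implementation of any particular product: the function names `ncc` and `match_template` are hypothetical, images are assumed to be small grayscale 2D lists, and the score is assumed to be the zero-mean normalized cross-correlation coefficient.

```python
import math

def ncc(patch, template):
    """Cross-correlation coefficient of two equally sized 2D lists (range -1..1)."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    # A constant patch has no variance; treat it as uncorrelated.
    return num / den if den > 0 else 0.0

def match_template(image, template):
    """Slide the template over the image; return (best_score, (row, col))."""
    th, tw = len(template), len(template[0])
    best = (-2.0, (0, 0))
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best[0]:
                best = (score, (r, c))
    return best
```

In this sketch an exact copy of the template scores 1, whereas the same template rotated by 90 degrees can score as low as 0, which illustrates why the technique depends on a consistent viewpoint.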
Further, a shape matching technique has been proposed that finds a match between a detection object's shape feature and the shape feature of each region in an input image extracted by an image division technique. In the above-mentioned environment of recognizing ordinary objects, however, the region division yields inconsistent results, making it difficult to provide a high-quality representation of object shapes in the input image. The recognition becomes especially difficult when the detection object is partially hidden by another object.
The above-mentioned matching techniques use an overall feature of the input image or of a partial region of it. By contrast, another proposed technique extracts characteristic points (feature points) or edges from an input image, uses diagrams or graphs to represent the spatial relationship among line segment sets or edge sets comprising the extracted feature points or edges, and performs matching based on the structural similarity between the diagrams or graphs. This technique works effectively for specialized objects. However, image deformation may prevent stable extraction of the structure between feature points. This makes it especially difficult to recognize an object partially hidden by another object, as mentioned above.
Moreover, there are other matching techniques that extract feature points from an image and use a feature quantity acquired from the feature points and from image information about their local vicinities. For example, C. Schmid and R. Mohr treat corners detected by a Harris corner detector as feature points and propose a technique that uses a rotation-invariant feature quantity near the feature points (C. Schmid and R. Mohr, “Local grayvalue invariants for image retrieval”, IEEE PAMI, Vol. 19, No. 5, pp. 530-534, 1997). This document is hereafter referred to as document 1. The technique uses a local feature quantity that is invariant to partial image deformation at the feature points. Compared to the above-mentioned techniques, this matching technique can perform stable detection even if an image is deformed or a detection object is partially hidden. However, the feature quantity used in document 1 is not invariant to enlargement or reduction of images, so recognition is difficult when images are enlarged or reduced.
On the other hand, D. Lowe proposes a matching technique using feature points and feature quantities that remain unchanged when images are enlarged or reduced (D. Lowe, “Object recognition from local scale-invariant features”, Proc. of the International Conference on Computer Vision, Vol. 2, pp. 1150-1157, Sep. 20-25, 1999, Corfu, Greece). This document is hereafter referred to as document 2. The following describes the image recognition apparatus proposed by D. Lowe with reference to FIG. 1.
As shown in FIG. 1, an image recognition apparatus 400 comprises feature point extraction sections 401a and 401b. The feature point extraction sections 401a and 401b acquire images in multi-resolution representation from the images (model images or object images) from which feature points are to be extracted. The multi-resolution representation is referred to as scale-space representation (see Lindeberg T., “Scale-space: A framework for handling image structures at multiple scales”, Journal of Applied Statistics, Vol. 21, No. 2, pp. 224-270, 1999). The feature point extraction sections 401a and 401b apply a DoG (Difference of Gaussian) filter to the images with different resolutions. Output images from the DoG filter contain local points (local maximum points and local minimum points). Those local points whose positions do not change with resolution changes within a specified range are detected as feature points. In this example, the number of resolution levels is predetermined.
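The DoG-based extraction described above can be sketched roughly as follows. This is a simplified illustration under several assumptions that are not taken from document 2: the different resolutions are approximated by Gaussian blur levels without downsampling, the sigma values and the one-pixel positional tolerance `tol` are arbitrary, and all function names are hypothetical.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = int(math.ceil(3 * sigma))
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def blur(image, sigma):
    """Separable Gaussian blur of a 2D list, with clamped borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(image), len(image[0])
    clamp = lambda v, hi: max(0, min(hi, v))
    rows = [[sum(k[j + r] * image[y][clamp(x + j, w - 1)] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]
    return [[sum(k[j + r] * rows[clamp(y + j, h - 1)][x] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def local_extrema(img):
    """Strict 3x3 local maxima and minima, as a set of (row, col)."""
    pts = set()
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            nb = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                  if (dy, dx) != (0, 0)]
            if img[y][x] > max(nb) or img[y][x] < min(nb):
                pts.add((y, x))
    return pts

def stable_feature_points(image, sigmas=(1.0, 1.6, 2.56), tol=1):
    """Keep extrema of the finest DoG level that persist at every other level."""
    blurred = [blur(image, s) for s in sigmas]
    # Each DoG image is the difference of two adjacent blur levels.
    dogs = [[[b - a for a, b in zip(ra, rb)] for ra, rb in zip(lo, hi)]
            for lo, hi in zip(blurred, blurred[1:])]
    levels = [local_extrema(d) for d in dogs]
    return {p for p in levels[0]
            if all(any(abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol for q in lv)
                   for lv in levels[1:])}
```

For an image containing a single bright pixel, the pixel position is an extremum of every DoG level and is therefore retained, while a uniformly flat image yields no feature points.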
Feature quantity retention sections 402a and 402b extract and retain the feature quantity of each feature point extracted by the feature point extraction sections 401a and 401b. At this time, the feature point extraction sections 401a and 401b use canonical orientations and orientation planes for feature point neighboring regions. The canonical orientation is the direction that gives the peak value of a direction histogram accumulating Gauss-weighted gradient strengths. The feature quantity retention sections 402a and 402b retain the canonical orientation as the feature quantity. The feature quantity retention sections 402a and 402b also normalize the gradient strength information about the feature point neighboring region with respect to the canonical orientation. That is to say, directions are corrected by assuming the canonical orientation to be 0 degrees. The gradient strength information about each point in the neighboring region is categorized by gradient direction along with the positional information. For example, consider the case of categorizing the gradient strength information about points in the neighboring region into a total of eight orientation planes at 45-degree intervals. Suppose the gradient at point (x, y) on the local coordinate system of the neighboring region has a direction of 93 degrees and strength m. This information is mapped as strength m at position (x, y) on the orientation plane that has a 90-degree label and the same local coordinate system as the neighboring region. Thereafter, each orientation plane is blurred and resampled in accordance with the resolution scales. The feature quantity retention sections 402a and 402b retain a feature quantity vector whose dimension is (the number of resolutions)×(the number of orientation planes)×(size of each orientation plane) as found above.
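The canonical orientation and the assignment of a gradient to an orientation plane can be sketched as below. The function names, the 36-bin histogram, and the choice of returning the bin center are assumptions made for illustration; only the eight planes at 45-degree intervals and the 93-degree example come from the description above.

```python
import math

def canonical_orientation(gradients, center, sigma, bins=36):
    """Peak direction of a Gauss-weighted gradient-strength histogram.

    gradients: list of (x, y, strength, direction_deg) in the neighboring region.
    Returns the center of the peak bin, in degrees (an assumed convention).
    """
    hist = [0.0] * bins
    width = 360.0 / bins
    cx, cy = center
    for x, y, m, d in gradients:
        # Points far from the feature point contribute less.
        w = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
        hist[int((d % 360.0) // width) % bins] += w * m
    peak = max(range(bins), key=lambda i: hist[i])
    return peak * width + width / 2.0

def orientation_plane_label(direction_deg, canonical_deg, planes=8):
    """Label (in degrees) of the orientation plane nearest to the corrected direction."""
    step = 360.0 / planes
    # Correct the direction so that the canonical orientation becomes 0 degrees.
    rel = (direction_deg - canonical_deg) % 360.0
    return (round(rel / step) % planes) * step
```

With the canonical orientation taken as 0 degrees, `orientation_plane_label(93.0, 0.0)` gives the 90-degree plane, matching the example in the text.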
Then, a feature quantity comparison section 403 uses a k-d tree query (a nearest-neighbor query for feature spaces with excellent retrieval efficiency) to retrieve, for each object feature point, the model feature point whose feature quantity is most similar. The feature quantity comparison section 403 retains the acquired candidate-associated feature point pairs as a candidate-associated feature point pair group.
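A nearest-neighbor query over a k-d tree can be sketched as follows. This is a plain textbook k-d tree with squared Euclidean distance, offered only as an illustration; the function names and the tuple-based node layout are assumptions, and feature quantities are assumed to be short numeric tuples.

```python
def build_kdtree(points, depth=0):
    """points: list of (vector, index). Returns nested tuple nodes, or None."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return (squared_distance, index) of the stored vector nearest to query."""
    if node is None:
        return best
    (vec, idx), axis, left, right = node
    d = sq_dist(vec, query)
    if best is None or d < best[0]:
        best = (d, idx)
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best

def candidate_pairs(model_features, object_features):
    """Pair each object feature index with its nearest model feature index."""
    tree = build_kdtree([(v, i) for i, v in enumerate(model_features)])
    return [(j, nearest(tree, q)[1]) for j, q in enumerate(object_features)]
```

Each object feature point receives exactly one partner here, namely the model feature point at the shortest distance in the feature space.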
A model attitude estimation section 404 then uses the generalized Hough transform to estimate the attitude (image transform parameters for rotation angle, enlargement or reduction ratio, and linear displacement) of a model on the object image according to the spatial relationship between the model feature points and the object feature points. At this time, it is expected to use the above-mentioned canonical orientation of each feature point as an index to a parameter reference table (R table) for the generalized Hough transform. An output from the model attitude estimation section 404 is a voting result on an image transform parameter space. The parameter that scores the maximum vote provides a rough estimation of the model attitude.
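Voting on an image transform parameter space can be sketched as below. This is a simplified stand-in for the generalized Hough transform with an R table, not a reproduction of it: each candidate pair is assumed to carry a position, a canonical orientation in degrees, and a scale, from which one rotation, scale ratio, and displacement are derived and quantized. The function name `pose_vote` and the bin sizes are hypothetical.

```python
import math
from collections import defaultdict

def pose_vote(pairs, angle_bin=30.0, shift_bin=10.0):
    """pairs: list of (model_feature, object_feature), each (x, y, theta_deg, scale).

    Returns (winning_bin, indices_of_pairs_that_voted_for_it).
    """
    votes = defaultdict(list)
    for i, (m, o) in enumerate(pairs):
        mx, my, m_theta, m_scale = m
        ox, oy, o_theta, o_scale = o
        rot = (o_theta - m_theta) % 360.0   # rotation angle implied by this pair
        s = o_scale / m_scale               # enlargement or reduction ratio
        c, sn = math.cos(math.radians(rot)), math.sin(math.radians(rot))
        # Displacement that maps the rotated, scaled model point onto the object point.
        tx = ox - s * (c * mx - sn * my)
        ty = oy - s * (sn * mx + c * my)
        key = (int(rot // angle_bin), round(math.log2(s)),
               int(tx // shift_bin), int(ty // shift_bin))
        votes[key].append(i)
    best = max(votes, key=lambda k: len(votes[k]))
    return best, votes[best]
```

The list of voting indices returned alongside the winning bin is what a later selection step would use to narrow the candidate pairs. Hard bin boundaries are a known weakness of such a sketch, since a true pose near a boundary splits its votes between neighboring bins.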
A candidate-associated feature point pair selection section 405 selects only those candidate-associated feature point pairs whose member object feature points voted for that parameter, thereby narrowing the candidate-associated feature point pair group.
Finally, a model attitude estimation section 406 uses least squares estimation to estimate affine transformation parameters based on the spatial disposition of the corresponding feature point pair group. This operation is based on the restrictive condition that the model to be detected is deformed into the object image by an affine transformation. The model attitude estimation section 406 uses the affine transformation parameters to convert the model feature points of the candidate-associated feature point pair group onto the object image, and finds the positional displacement (spatial distance) of each converted point from the corresponding object feature point. The model attitude estimation section 406 excludes pairs having excessive displacements to update the candidate-associated feature point pair group. If two or fewer candidate-associated feature point pairs remain, the model attitude estimation section 406 terminates by notifying that a model cannot be detected. Otherwise, the model attitude estimation section 406 repeats this operation until a specified termination condition is satisfied. The model attitude estimation section 406 finally outputs, as the model recognition result, the model attitude determined by the affine transformation parameters in effect when the termination condition is satisfied.
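The iterative least squares estimation can be sketched as follows. The termination condition is not specified above, so this sketch assumes that iteration stops when no pair is excluded; the threshold `max_error`, the function names, and the use of Cramer's rule on the normal equations are likewise illustrative assumptions. The model points are assumed not to be collinear.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Cramer's rule (assumes a non-singular A)."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(A)
    out = []
    for k in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][k] = b[r]
        out.append(det(M) / d)
    return out

def fit_affine(pairs):
    """Least squares fit of u = a*x + b*y + tx and v = c*x + d*y + ty.

    pairs: list of ((x, y), (u, v)). Returns ((a, b, tx), (c, d, ty)).
    """
    A = [[0.0] * 3 for _ in range(3)]
    bu, bv = [0.0] * 3, [0.0] * 3
    for (x, y), (u, v) in pairs:
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                A[i][j] += row[i] * row[j]   # accumulate the normal matrix
            bu[i] += row[i] * u
            bv[i] += row[i] * v
    return tuple(solve3(A, bu)), tuple(solve3(A, bv))

def estimate_attitude(pairs, max_error):
    """Refit and exclude pairs until none is excluded; None if two or fewer remain."""
    pairs = list(pairs)
    while True:
        if len(pairs) <= 2:
            return None                      # model cannot be detected
        (a, b, tx), (c, d, ty) = fit_affine(pairs)
        kept = [((x, y), (u, v)) for (x, y), (u, v) in pairs
                if math.hypot(a * x + b * y + tx - u, c * x + d * y + ty - v) <= max_error]
        if len(kept) == len(pairs):
            return ((a, b, tx), (c, d, ty)), kept
        pairs = kept
```

With four correct pairs on the unit square and one pair displaced by 10, a threshold of 5 removes the displaced pair and recovers the identity transformation. With a threshold of 1.5 the first fit, pulled toward the displaced pair, causes every pair to be excluded, so the outcome depends strongly on how the initial estimate is affected by false pairs.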
However, there are several problems with D. Lowe's technique described in document 2.
Firstly, there is a problem with the extraction of the canonical orientation at feature points. As mentioned above, the canonical orientation is determined as the direction that gives the peak value in a direction histogram accumulating Gauss-weighted gradient strengths found from the local gradient information about feature point neighboring regions. The technique according to document 2 tends to detect feature points slightly inside an object's corners. Since two peaks appear in directions orthogonal to each other in the direction histogram near such a feature point, a plurality of competing canonical orientations may be detected. The feature quantity comparison section 403 and the model attitude estimation section 404 at the later stages are not designed for such a case and cannot solve this problem. In addition, the shape of the direction histogram varies with the parameters of the Gaussian weight function, preventing stable extraction of the canonical orientation. Because the canonical orientation is used by the feature quantity comparison section 403 and the model attitude estimation section 404 at the later stages, extracting an improper canonical orientation seriously affects the result of feature quantity matching.
Secondly, the orientation plane is used for feature quantity comparison to find a match between feature quantities according to density gradient strength information at each point in a local region. Generally, however, the gradient strength is not a feature quantity that is invariant to brightness changes. A stable match is not ensured if there is a brightness difference between the model image and the object image.
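The dependence of gradient strength on brightness can be illustrated with a small sketch. It assumes, purely for illustration, that the brightness difference acts as a multiplicative gain on pixel values and that the gradient is computed by central differences; the function name `gradient` is hypothetical.

```python
import math

def gradient(img, y, x):
    """Central-difference gradient at (y, x): returns (strength, direction_deg)."""
    gx = img[y][x + 1] - img[y][x - 1]
    gy = img[y + 1][x] - img[y - 1][x]
    return math.hypot(gx, gy), math.degrees(math.atan2(gy, gx))

model = [[10, 20, 30], [20, 40, 60], [30, 60, 90]]
# The same scene under an assumed twofold brightness gain.
brighter = [[2 * v for v in row] for row in model]

m_strength, m_dir = gradient(model, 1, 1)
b_strength, b_dir = gradient(brighter, 1, 1)
```

Under this assumption the gradient strength doubles while the gradient direction is unchanged, so a feature quantity built directly on strength differs between the two images even though the local structure is identical.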
Thirdly, for each object feature point there may be a plurality of model feature points having very short, but not the shortest, distances in the feature space, i.e., having very similar feature quantities. The real feature point pair (inlier) may be among them. In the feature quantity comparison section 403, however, each object feature point is paired only with the model feature point that provides the shortest distance in the feature space. Accordingly, such an inlier is not considered as a candidate-associated pair.
Fourthly, a problem may occur when the model attitude estimation section 406 estimates affine transformation parameters. False feature point pairs (outliers) are contained in the corresponding feature point pair group narrowed by the candidate-associated feature point pair selection section 405. The candidate-associated feature point pair group may contain many outliers, or an outlier that deviates extremely from the true affine transformation parameters. In such cases, the affine transformation parameter estimation is affected by the outliers. In some cases, the repetitive operation may gradually exclude the inliers and leave the outliers, so that an incorrect model attitude is output.