1. Field of the Invention
The present invention relates to an information processing device, information processing method, and program, and particularly to an information processing device, information processing method, and program that can further enhance the accuracy of matching between an input image and a model image.
2. Description of the Related Art
Existing image recognition schemes include matching schemes in which feature points are extracted from an image and features acquired from the image information of the feature points and their local neighborhoods are used.
For example, C. Schmid and R. Mohr propose a matching scheme in which corners detected by using the Harris corner detector are employed as feature points and rotation-invariant features of neighborhoods of the feature points are used (refer to C. Schmid and R. Mohr, “Local grayvalue invariants for image retrieval”, (USA), IEEE PAMI, 1997, vol. 19, no. 5, p. 530-534, hereinafter Non-Patent Document 1). In this matching scheme, in which local features of feature points invariant to partial image deformation are used, stable detection is possible even when an image involves deformation and even when a detection target is partially hidden. However, the features employed in this scheme of Non-Patent Document 1 do not have invariance to scaling of an image, and therefore recognition is difficult when an image involves scaling.
In contrast, D. Lowe proposes a matching scheme that employs feature points and features that are also invariant to image scaling (refer to D. Lowe, “Object recognition from local scale-invariant features”, (Greece), Proc. of the International Conference on Computer Vision, 1999 Sep., vol. 2, p. 1150-1157, hereinafter Non-Patent Document 2). An image recognition device proposed by D. Lowe will be described below with reference to FIG. 1.
In the image recognition device shown in FIG. 1, feature point extractors 1a and 1b apply a Difference-of-Gaussian (DoG) filter to each of the images of different resolutions obtained from an image as the target of feature point extraction (model image or input image) based on multi-resolution image representation (scale-space representation, refer to Lindeberg T., “Scale-space: A framework for handling image structures at multiple scales”, Journal of Applied Statistics, vol. 21, no. 2, pp. 224-270, 1994). Of the local points (local maxima and local minima) on the images output through the DoG filter, the points whose positions do not change under resolution changes within a predetermined range are detected as feature points by the feature point extractors 1a and 1b. The number of resolution levels is set in advance.
Feature storages 2a and 2b extract and hold the features of the respective feature points extracted by the feature point extractors 1a and 1b. The features used are the canonical orientation (dominant orientation) and the orientation planes of the neighboring area of each feature point. The canonical orientation refers to the orientation that gives the peak of an orientation histogram in which Gaussian-weighted gradient magnitudes are accumulated. The feature storages 2a and 2b hold the canonical orientations as features. Furthermore, the feature storages 2a and 2b normalize the gradient magnitude information of the neighboring area of a feature point by the canonical orientation, i.e., they carry out orientation correction with the canonical orientation defined as 0 deg, and classify, by gradient orientation, the gradient magnitude information of the respective points in the neighboring area together with the position information. For example, when the gradient magnitude information of the respective points in a neighboring area is classified into a total of eight orientation planes at angle increments of 45 deg, gradient information with an orientation of 93 deg and a magnitude m at a point (x, y) on the local coordinate system of the neighboring area is mapped as information of the magnitude m at the position (x, y) on the orientation plane that has a 90-deg label and the same local coordinate system as the neighboring area. After the classification, each orientation plane is subjected to blurring and resampling dependent upon the scale of the resolution. The feature storages 2a and 2b hold the thus obtained feature vectors, whose dimension is (the number of resolutions)×(the number of orientation planes)×(the size of each orientation plane).
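The two steps described above, picking the peak of a Gaussian-weighted orientation histogram and assigning a gradient to an orientation plane after orientation correction, can be sketched as follows. This is only an illustrative sketch and not the implementation of Non-Patent Document 2: the function names, the 36-bin histogram, the `sigma` parameter, and the choice of the nearest plane label are all assumptions made for the example.

```python
import math

def canonical_orientation(gradients, sigma, n_bins=36):
    """Return the lower edge (deg) of the peak bin of the orientation histogram.

    gradients: iterable of (dx, dy, magnitude, orientation_deg), where
    (dx, dy) is the offset of the sample from the feature point.
    n_bins and sigma are assumed parameters of this sketch.
    """
    hist = [0.0] * n_bins
    bin_width = 360.0 / n_bins
    for dx, dy, magnitude, orientation in gradients:
        # Gaussian weight that decreases with distance from the feature point.
        weight = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
        index = int((orientation % 360.0) // bin_width) % n_bins
        hist[index] += weight * magnitude
    peak = max(range(n_bins), key=hist.__getitem__)
    return peak * bin_width

def orientation_plane_label(orientation_deg, canonical_deg, n_planes=8):
    """Label (deg) of the orientation plane that receives a gradient.

    The gradient orientation is first corrected so that the canonical
    orientation becomes 0 deg, then assigned to the nearest plane label
    (nearest-label assignment is an assumption of this sketch).
    """
    step = 360.0 / n_planes
    corrected = (orientation_deg - canonical_deg) % 360.0
    return (round(corrected / step) % n_planes) * step

# Example from the text: eight planes at 45-deg increments, orientation 93 deg.
label = orientation_plane_label(93.0, canonical_deg=0.0)  # 90.0
```

With eight planes, a corrected orientation of 93 deg is assigned to the plane labeled 90 deg, which matches the example in the description.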
Subsequently, a feature matching unit 3 retrieves, for each object feature point, the model feature point having the most similar features by using the k-d tree method (a nearest neighbor search method on a feature space offering good retrieval efficiency), and holds the obtained match pairs as a match pair group.
A model pose coarse-estimation unit 11 in a recognition determination unit 4 estimates, by the generalized Hough transform, the pose (image transformation parameters such as a rotation angle, scaling ratio, and translation amount) of a model on the input image from the spatial positional relationship between the model feature points and the object feature points. At this time, the above-described canonical orientations of the respective feature points are used as indexes in a parameter reference table (R table) of the generalized Hough transform. The output of the model pose coarse-estimation unit 11 is equivalent to the result of voting on the image transformation parameter space. The parameter that has acquired the largest number of votes gives a coarse estimate of the model pose.
A candidate corresponding feature point pair selector 12 in the recognition determination unit 4 selects only the match pairs each having as its member an object feature point that has voted for this parameter, to thereby refine the match pair group.
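As a rough illustration of voting on the image transformation parameter space, the sketch below lets every match pair vote for a quantized (rotation, scale, translation) cell and returns the cell with the most votes together with the pairs that voted for it. It is a simplified stand-in, not the R-table formulation of the generalized Hough transform: each pair's similarity transform is computed directly from its two points, and the function name, the tuple layout `(x, y, scale, orientation_deg)`, and all bin sizes are assumptions.

```python
import math
from collections import defaultdict

def coarse_pose_by_voting(pairs, angle_bin=30.0, log_scale_bin=1.0, shift_bin=20.0):
    """Vote on a quantized (rotation, scale, tx, ty) space.

    pairs: list of (model_point, object_point), each point being
    (x, y, scale, orientation_deg). Bin sizes are assumed values.
    Returns (winning_cell, indices_of_pairs_that_voted_for_it).
    """
    votes = defaultdict(list)
    for index, (model, obj) in enumerate(pairs):
        mx, my, m_scale, m_ori = model
        ox, oy, o_scale, o_ori = obj
        rotation = (o_ori - m_ori) % 360.0
        scale = o_scale / m_scale
        theta = math.radians(rotation)
        # Model point after rotation and scaling; the remainder is translation.
        rx = scale * (mx * math.cos(theta) - my * math.sin(theta))
        ry = scale * (mx * math.sin(theta) + my * math.cos(theta))
        tx, ty = ox - rx, oy - ry
        cell = (
            int(rotation // angle_bin),
            math.floor(math.log2(scale) / log_scale_bin + 0.5),
            math.floor(tx / shift_bin),
            math.floor(ty / shift_bin),
        )
        votes[cell].append(index)
    best = max(votes, key=lambda cell: len(votes[cell]))
    return best, votes[best]
```

The returned index list corresponds to the refinement step: only the pairs whose object feature point voted for the winning parameter cell are kept for the subsequent estimation.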
Finally, a model pose estimator 13 in the recognition determination unit 4 estimates affine transformation parameters from the spatial arrangement of the corresponding feature point pair group by least-squares estimation, under the constraint that “the detected model involves image deformation due to affine transformation on the input image”. Furthermore, the model pose estimator 13 transfers the respective model feature points of the match pair group onto the input image based on the affine transformation parameters, and obtains the position shift (spatial distance) of each transferred feature point from the corresponding object feature point. In addition, the model pose estimator 13 eliminates match pairs involving an excessively large position shift to thereby update the match pair group. At this time, if the number of match pairs included in the match pair group is two or less, the model pose estimator 13 outputs an indication that “model detection is impossible” and ends its operation. Otherwise, the model pose estimator 13 repeats this operation until a predetermined end condition is satisfied, and finally outputs, as a model recognition result, the model pose determined by the affine transformation parameters obtained when the end condition is satisfied.
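The loop described above (least-squares affine fit, transfer of the model points, removal of pairs with a large position shift, repeat) can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the procedure of Non-Patent Document 2: the parameter layout `(a, b, tx, c, d, ty)` with u = a·x + b·y + tx and v = c·x + d·y + ty, the threshold `max_shift`, the iteration cap, and the end condition "no pair was removed in the last pass" are all hypothetical choices.

```python
import math

def _det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def _solve3(m, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    det = _det3(m)
    if abs(det) < 1e-12:
        raise ValueError("degenerate point configuration")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(_det3(replaced) / det)
    return solution

def fit_affine(pairs):
    """Least-squares affine fit; pairs is a list of ((x, y), (u, v))."""
    rows = [(x, y, 1.0) for (x, y), _ in pairs]
    us = [target[0] for _, target in pairs]
    vs = [target[1] for _, target in pairs]
    # Normal equations (A^T A) p = A^T t, shared by the u and v coordinates.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atu = [sum(r[i] * u for r, u in zip(rows, us)) for i in range(3)]
    atv = [sum(r[i] * v for r, v in zip(rows, vs)) for i in range(3)]
    a, b, tx = _solve3(ata, atu)
    c, d, ty = _solve3(ata, atv)
    return (a, b, tx, c, d, ty)

def transfer(params, point):
    """Map a model point onto the input image."""
    a, b, tx, c, d, ty = params
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def estimate_pose(pairs, max_shift=5.0, max_iterations=10):
    """Return affine parameters, or None when model detection is impossible."""
    current = list(pairs)
    for _ in range(max_iterations):
        if len(current) <= 2:
            return None
        params = fit_affine(current)
        kept = []
        for model_point, object_point in current:
            px, py = transfer(params, model_point)
            shift = math.hypot(px - object_point[0], py - object_point[1])
            if shift <= max_shift:
                kept.append((model_point, object_point))
        if len(kept) == len(current):  # assumed end condition
            return params
        current = kept
    return None if len(current) <= 2 else fit_affine(current)
```

Because every pair, including false ones, contributes to each fit, the result depends on how strongly the false pairs pull the first estimate. With a single mild outlier the loop recovers the true parameters, but nothing in this procedure guarantees that outcome in general.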
However, this scheme by D. Lowe described in Non-Patent Document 2 involves several problems.
First, a problem exists in the extraction of the canonical orientation of a feature point. As described above, the canonical orientation is obtained as the orientation that gives the peak of an orientation histogram in which Gaussian-weighted gradient magnitudes, obtained from local gradient information of the neighboring area of a feature point, are accumulated. The scheme of Non-Patent Document 2 tends to detect a point slightly inside a corner of an object as a feature point. In the orientation histogram of the neighborhood of such a feature point, two peaks appear at the orientations perpendicular to the edges, and therefore plural conflicting canonical orientations may be detected. However, the feature matching unit 3 and the model pose estimator 13 at the subsequent stages are not designed for such a situation and thus cannot address it. Furthermore, there is another problem that the shape of an orientation histogram changes depending on the parameters of the Gaussian weighting function, and hence stable extraction of the canonical orientations is impossible. In addition, because the canonical orientations are used in the feature matching unit 3 and the model pose estimator 13 at the subsequent stages, the extraction of improper canonical orientations has a significantly adverse effect on the result of the feature matching.
Second, in feature comparison based on the orientation planes, feature matching based on the density gradient magnitude information of the respective points in a local area is carried out. However, the gradient magnitude is in general not a feature invariant to luminosity changes. Therefore, when there is a luminosity difference between a model image and an input image, stable matching cannot be ensured.
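A small numerical illustration of this point: if a luminosity change is modeled as a uniform gain applied to every pixel (an assumption of this sketch, as are the function name and the central-difference gradient), the gradient magnitude is multiplied by the same gain while the gradient orientation stays the same.

```python
import math

def gradient(image, x, y):
    """Central-difference gradient (magnitude, orientation in deg) at (x, y).

    image is a list of rows of pixel intensities.
    """
    gx = image[y][x + 1] - image[y][x - 1]
    gy = image[y + 1][x] - image[y - 1][x]
    return math.hypot(gx, gy), math.degrees(math.atan2(gy, gx))

patch = [
    [10.0, 20.0, 30.0],
    [20.0, 40.0, 60.0],
    [30.0, 60.0, 90.0],
]
gain = 1.5  # hypothetical luminosity change
brighter = [[gain * value for value in row] for row in patch]

magnitude, orientation = gradient(patch, 1, 1)
magnitude_b, orientation_b = gradient(brighter, 1, 1)
# magnitude_b equals gain * magnitude, while orientation_b equals orientation.
```

A feature built directly on gradient magnitudes therefore differs between the two patches even though they show the same structure.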
Third, the following situation is possible: for a given object feature point, there are plural model feature points whose distance on the feature space is not the smallest but is sufficiently small, i.e., there are plural model feature points each having a sufficiently similar feature, and the model feature points of true feature point pairs (inliers) are included among them. However, in the feature matching unit 3, each object feature point is paired with only the model feature point having the smallest distance in the feature space, and therefore these inliers are not taken into consideration as candidate corresponding pairs.
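The difference between keeping only the nearest model feature and keeping every sufficiently similar one can be illustrated with the sketch below. It uses a brute-force search in place of the k-d tree for brevity, and the function names, the toy two-dimensional features, and the `max_distance` threshold are hypothetical.

```python
import math

def nearest_only(object_feature, model_features):
    """Index of the single model feature closest in feature space."""
    return min(
        range(len(model_features)),
        key=lambda i: math.dist(object_feature, model_features[i]),
    )

def similar_candidates(object_feature, model_features, max_distance):
    """Indices of all model features within max_distance in feature space."""
    return [
        i for i, feature in enumerate(model_features)
        if math.dist(object_feature, feature) <= max_distance
    ]

object_feature = (0.0, 0.0)
model_features = [(0.10, 0.0), (0.12, 0.0), (5.0, 5.0)]

best = nearest_only(object_feature, model_features)                # 0
candidates = similar_candidates(object_feature, model_features, 0.2)  # [0, 1]
```

If the true counterpart of the object feature point happens to be model feature 1, nearest-only pairing never offers it as a candidate, whereas the thresholded search does.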
Fourth, a problem may arise in the estimation of affine transformation parameters in the recognition determination unit 4. Specifically, false feature point pairs (outliers) will be included in the corresponding feature point pair group resulting from the refining by the candidate corresponding feature point pair selector 12. If a large number of outliers are included in the match pair group, or if there are outliers that depart extremely from the true affine transformation parameters, the estimation of affine transformation parameters is affected by the outliers. Furthermore, depending on the case, inliers are gradually eliminated through the repeated operation while the outliers are left, so that an erroneous model pose is output.