Owing to recent developments in robot technology, complex tasks that have traditionally been performed manually, such as assembling industrial products, are increasingly being performed by robots. Such a robot grasps components with an end effector such as a hand and assembles them. For the robot to grasp a component, the relative position and orientation between the component to be grasped and the robot (hand) must be measured. Besides component grasping, such position and orientation measurement is used for various other purposes, such as self-position estimation for autonomous robot navigation and registration between real space and virtual objects in augmented reality.
Methods that use a two-dimensional image captured by a camera are one example of approaches for measuring position and orientation. Such methods generally rely on model fitting, in which a three-dimensional shape model of an object is fitted to features detected in a two-dimensional image. For example, T. Drummond and R. Cipolla, “Real-time visual tracking of complex structures,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 932-946, 2002, discloses a method in which the position and orientation of an object are measured using edges detected in the two-dimensional image as features. In this method, the three-dimensional shape model of the object is represented by a set of line segments (a wireframe model), and, assuming that the approximate position and orientation of the object are known, the projections of the three-dimensional line segments onto the image are fitted to the edges detected in the image, thereby measuring the position and orientation of the object.
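As a rough illustration of what is being fitted, the sketch below projects the endpoints of one model line segment into the image using an approximate pose and measures the perpendicular image distance from a detected edge point to the projected segment. It assumes an ideal pinhole camera with focal length `f` and principal point `(cx, cy)` and no lens distortion; the function names `project` and `point_to_line_distance` are hypothetical and are not taken from the cited paper.

```python
import math

def project(point, R, t, f, cx, cy):
    """Project a 3-D model point into the image (ideal pinhole camera, assumed)."""
    # Model coordinates -> camera coordinates: X_c = R X + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective division, then focal length and principal point
    return (f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy)

def point_to_line_distance(p, a, b):
    """Perpendicular distance from image point p to the line through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    # |cross(b - a, p - a)| / |b - a|
    return abs(dx * (p[1] - a[1]) - dy * (p[0] - a[0])) / math.hypot(dx, dy)

# Approximate pose: no rotation, object 5 units in front of the camera
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 5.0]
a = project((-1.0, 0.0, 0.0), R, t, f=500.0, cx=320.0, cy=240.0)  # (220.0, 240.0)
b = project((1.0, 0.0, 0.0), R, t, f=500.0, cx=320.0, cy=240.0)   # (420.0, 240.0)
residual = point_to_line_distance((300.0, 243.0), a, b)           # 3.0 pixels
```

In a tracker of this kind, many such residuals are collected along the projected segments and the pose is updated to reduce them.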
In the model fitting method described above, correspondences between the measurement data and the model are searched for, and the difference between the elements of each corresponding pair is minimized. However, due to various factors such as noise in the measurement data or error in the initial estimate of the position and orientation, obviously incorrect correspondences may be detected. A class of methods called robust estimation is therefore used, in which a small weight is assigned to such an obviously incorrect correspondence (hereinafter referred to as an “outlier”), preventing the outlier from adversely affecting the estimation result. A typical robust estimation method is M-estimation.
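A minimal sketch of M-estimation, assuming the simplest possible problem: estimating a single offset from noisy observations by iteratively reweighted least squares (IRLS) with Tukey's biweight function. The tuning constant `c` is fixed by hand here rather than scaled by a robust estimate of the noise level, and the names `tukey_weight` and `m_estimate_offset` are illustrative only.

```python
def tukey_weight(r, c):
    """Tukey biweight: (1 - (r/c)^2)^2 for |r| < c, otherwise 0."""
    if abs(r) >= c:
        return 0.0
    u = r / c
    return (1.0 - u * u) ** 2

def m_estimate_offset(observations, c=3.0, iterations=20):
    """Estimate one offset mu from observations via IRLS with Tukey weights."""
    # Start from the median so that outliers do not dominate the initial guess.
    s = sorted(observations)
    n = len(s)
    mu = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    for _ in range(iterations):
        weights = [tukey_weight(x - mu, c) for x in observations]
        total = sum(weights)
        if total == 0.0:
            break
        # Weighted least-squares update: minimise sum_i w_i * (x_i - mu)^2
        mu = sum(w * x for w, x in zip(weights, observations)) / total
    return mu

obs = [1.0, 1.2, 0.8, 1.1, 0.9, 15.0]   # 15.0 plays the role of an outlier
robust = m_estimate_offset(obs)          # close to 1.0
plain = sum(obs) / len(obs)              # about 3.33, pulled by the outlier
```

The outlier's residual exceeds `c`, so its weight drops to zero and it no longer influences the estimate, whereas the ordinary mean is pulled far away from the inliers.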
There are also conventional object recognition techniques in which the type or the individual identity of an object is determined by matching a feature model of the object against features detected in an image. For example, Lowe, D. G., “Object Recognition from Local Scale-Invariant Features”, Proc. of IEEE International Conference on Computer Vision, pp. 1150-1157, 1999, discloses an object recognition method that uses, as image features, local feature points extracted by a method called SIFT (scale-invariant feature transform).
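The matching step can be pictured with the sketch below, which pairs each model descriptor with its nearest image descriptor and keeps the pair only when the nearest neighbour is clearly closer than the second nearest (a distance-ratio test). This is a commonly used heuristic, not a reproduction of the cited paper's procedure; the 0.8 threshold, the toy 2-D descriptors (real SIFT descriptors are 128-dimensional), and the name `match_descriptors` are assumptions.

```python
import math

def match_descriptors(model_desc, image_desc, ratio=0.8):
    """Nearest-neighbour descriptor matching with a distance-ratio test (assumed heuristic)."""
    matches = []
    for i, m in enumerate(model_desc):
        # Distances from this model descriptor to every image descriptor, nearest first
        dists = sorted((math.dist(m, d), j) for j, d in enumerate(image_desc))
        if len(dists) < 2:
            continue
        (d1, j1), (d2, _) = dists[0], dists[1]
        # Keep the match only if it is unambiguous
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches

model = [(0.0, 0.0), (10.0, 10.0)]
image = [(0.5, 0.0), (10.0, 10.2), (20.0, 20.0)]
pairs = match_descriptors(model, image)                       # [(0, 0), (1, 1)]
ambiguous = match_descriptors([(5.0, 5.0)], [(4.0, 5.0), (6.0, 5.0)])  # []
```

Requires Python 3.8 or later for `math.dist`.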
When an object that is the target of position and orientation estimation or object recognition is placed under illumination so that its image can be captured, part of the target object itself, or another object, may block the illumination light and cast a shadow. In an image of such a scene, image features are highly likely to be detected in the image region corresponding to the shadow, particularly at the boundary between the shadow region and the non-shadow region, because the luminance changes sharply (that is, the luminance gradient is large) across that boundary.
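The following sketch illustrates why a shadow boundary yields edge features. It applies a central-difference gradient to one hypothetical image scanline crossing lit background, a cast shadow, and the object, and thresholds the gradient magnitude as a simple edge detector would. The luminance values, the threshold, and the function names are made up for illustration.

```python
def gradient_magnitude(profile):
    """Central-difference luminance gradient magnitude along a 1-D scanline."""
    g = [0.0] * len(profile)
    for i in range(1, len(profile) - 1):
        g[i] = abs(profile[i + 1] - profile[i - 1]) / 2.0
    return g

def detect_edges(profile, threshold):
    """Indices whose gradient magnitude exceeds the threshold."""
    return [i for i, v in enumerate(gradient_magnitude(profile)) if v > threshold]

# Hypothetical scanline: lit background (200), cast shadow (80), object surface (140)
scanline = [200, 200, 200, 200, 80, 80, 80, 80, 140, 140, 140, 140]
edges = detect_edges(scanline, threshold=20)   # [3, 4, 7, 8]
```

Indices 3 and 4 mark the shadow boundary and indices 7 and 8 mark the true object boundary; in this example the shadow boundary even has the stronger gradient (60 versus 30).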
Consequently, when correspondences between features described in the model and image features are searched for in order to fit or match the object model to the image data, a model feature may be erroneously associated with a pseudo image feature caused by the shadow. Such an erroneous correspondence causes various problems, such as failure of the position and orientation estimation or object recognition processing, or reduced accuracy of the estimated position and orientation or of the object recognition.
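A minimal sketch of how such an erroneous correspondence can arise, assuming a nearest-edge search along the normal of a projected model edge (a common strategy in edge-based tracking). Positions are 1-D pixel coordinates along the normal, and the name `search_along_normal` and all values are hypothetical: the true object edge is at 30, a shadow boundary is at 22, and the approximate pose predicts the model edge at 25.

```python
def search_along_normal(edge_positions, predicted, max_range):
    """Return the detected edge nearest to the prediction within +/- max_range, or None."""
    candidates = [e for e in edge_positions if abs(e - predicted) <= max_range]
    if not candidates:
        return None
    # Nearest-edge rule: no information about shadows is used here
    return min(candidates, key=lambda e: abs(e - predicted))

detected = [22.0, 30.0]   # shadow boundary, true object edge
chosen = search_along_normal(detected, predicted=25.0, max_range=10.0)   # 22.0
```

Because the shadow boundary happens to lie closer to the prediction, it is selected instead of the true object edge.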
Applying the above-described robust estimation does not necessarily mitigate the effect of such an erroneous correspondence with a pseudo image feature. In robust estimation, the weight coefficient that determines how strongly each correspondence pair influences the processing result is adjusted based on an evaluation value that measures the distance between the elements of the pair (for example, their distance on the image). Specifically, the larger the distance evaluation value, the smaller the weight coefficient. However, the distance evaluation value does not take into account whether the corresponding image feature lies within a shadow region, so a large weight coefficient may still be assigned to a feature point within the shadow region. Accordingly, even if an erroneous corresponding point is extracted from the shadow region, robust estimation can hardly remove its adverse effect. Robust estimation excludes outliers purely by numerical computation and makes no geometric or optical judgment as to whether an image feature is affected by a shadow. For this reason, if the elements of an erroneous correspondence pair happen to be close to each other, the adverse effect of that pair is difficult to mitigate.
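The limitation can be made concrete with a small sketch: a robust weight such as Tukey's biweight is a function of the residual alone, so a shadow-induced correspondence with a small residual receives a larger weight than a correct correspondence whose residual is inflated by pose error. The residuals, the tuning constant `c=6.0`, and the `in_shadow` flag are hypothetical; the flag is deliberately never consulted, mirroring the fact that the weighting ignores shadows.

```python
def tukey_weight(residual, c):
    """Tukey biweight: depends only on the residual magnitude."""
    if abs(residual) >= c:
        return 0.0
    u = residual / c
    return (1.0 - u * u) ** 2

# Hypothetical correspondences: (label, residual in pixels, lies in shadow?)
pairs = [
    ("true object edge", 5.0, False),
    ("shadow boundary", 3.0, True),
]
# The in_shadow flag plays no role in the weight
weights = {label: tukey_weight(r, c=6.0) for label, r, _in_shadow in pairs}
# weights["shadow boundary"] is about 0.56; weights["true object edge"] is about 0.09
```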
In view of the above issues, the present invention provides a technique for reducing the effect of pseudo image features extracted from an image region corresponding to a region where a shadow is formed, thereby improving the stability and accuracy of fitting and matching.