1. Field of the Invention
The present invention relates to a technique for calculating at least one of the position and orientation of an image pickup apparatus or an observation target object on the basis of image features detected on a captured image.
2. Description of the Related Art
Recently, research on Mixed Reality (MR) techniques for seamlessly integrating real and virtual spaces has been actively conducted. Among these MR techniques, the Augmented Reality (AR) technique, which superimposes a virtual space on a real space, has received particular attention.
An image providing apparatus employing the AR technique is mainly implemented by a video see-through or optical see-through head mounted display (HMD).
In a video see-through HMD, a virtual-space image (e.g., virtual objects or text information rendered by computer graphics) generated according to the position and orientation of an image pickup apparatus such as a video camera in the HMD is superimposed on a real-space image captured by the image pickup apparatus, and the resulting synthesized image is displayed to a user. In an optical see-through HMD, a virtual-space image generated according to the position and orientation of the HMD is displayed on a transmissive-type display to allow a synthesized image of real and virtual spaces to be formed on the retina of a user.
One of the most serious problems with the AR technique is accurate registration between real and virtual spaces, and many attempts have been made to address this problem. In a video see-through HMD, the problem of registration in AR involves accurate determination of the position and orientation of the image pickup apparatus in a scene (that is, in a reference coordinate system defined in the scene). In an optical see-through HMD, the problem of registration involves accurate determination of the position and orientation of the HMD in a scene.
To solve the former problem, it is common to place artificial markers in a scene and determine the position and orientation of the image pickup apparatus in the reference coordinate system using the markers. The position and orientation of the image pickup apparatus in the reference coordinate system are determined from the correspondences between detected positions of the markers in an image captured by the image pickup apparatus and known information, namely, three-dimensional positions of the markers in the reference coordinate system.
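The relationship between a marker's known three-dimensional position and its detected image position can be illustrated with a pinhole projection. The following is a minimal sketch and not the method of any cited document; the function names, the row-major rotation matrix `R`, the translation `t`, and the intrinsic parameters `f`, `cx`, and `cy` are assumptions introduced for illustration. A candidate position and orientation is scored by the sum of squared distances between projected and detected marker positions.

```python
def transform_to_camera(R, t, point_world):
    """Map a point from the reference coordinate system into camera coordinates."""
    return tuple(
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
        for i in range(3)
    )


def project(R, t, point_world, f, cx, cy):
    """Project a 3D point onto the image with an ideal pinhole model (assumed)."""
    x, y, z = transform_to_camera(R, t, point_world)
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (f * x / z + cx, f * y / z + cy)


def reprojection_error(R, t, markers_3d, detections_2d, f, cx, cy):
    """Sum of squared image distances between projected and detected markers."""
    total = 0.0
    for point_world, (u_obs, v_obs) in zip(markers_3d, detections_2d):
        u, v = project(R, t, point_world, f, cx, cy)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total
```

A pose that explains the detections well yields a small error; the methods described in the following paragraphs differ mainly in how they search for such a pose.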
To solve the latter problem, it is common to attach an image pickup apparatus to the HMD and determine the position and orientation of the image pickup apparatus in a manner similar to that described above to determine the position and orientation of the HMD on the basis of the determined position and orientation of the image pickup apparatus.
Methods for determining the position and orientation of an image pickup apparatus on the basis of correspondences between image coordinates and three-dimensional coordinates have long been proposed in the fields of photogrammetry and computer vision.
A method for determining the position and orientation of an image pickup apparatus by solving nonlinear simultaneous equations on the basis of correspondence of three points is disclosed in R. M. Haralick, C. Lee, K. Ottenberg, and M. Nolle, “Review and Analysis of Solutions of the Three Point Perspective Pose Estimation Problem”, International Journal of Computer Vision, vol. 13, No. 3, PP. 331-356, 1994 (hereinafter referred to as “Document 1”).
A method for determining the position and orientation of an image pickup apparatus by optimizing a rough position and orientation of the image pickup apparatus through iterative calculations on the basis of correspondences between image coordinates and three-dimensional coordinates of a plurality of points is disclosed in D. G. Lowe, “Fitting Parameterized Three-Dimensional Models to Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, No. 5, PP. 441-450, 1991 (hereinafter referred to as “Document 2”).
Another serious problem in AR technology, other than registration, is the occlusion problem, which requires determining the in-front/behind relationship between real and virtual objects. For example, when a virtual object is located at a position hidden or occluded by a real object such as a hand, it is necessary to render the real object in front of the virtual object. If the occlusion effect is not taken into account, the virtual object is always rendered in front of the real object, and the resulting image looks unnatural to an observer. In Japanese Patent Laid-Open No. 2003-296759 (hereinafter referred to as “the Patent Document”), the occlusion problem is overcome by designating the color of an occluding real object (e.g., the color of a hand) in advance so that a virtual object is not rendered in a region of a captured image having the same color as the occluding real object.
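The color-based approach described above can be sketched as a per-pixel compositing rule. This is an illustrative sketch only, not the implementation of the Patent Document; the function names, the RGB tuple representation, the use of `None` for pixels where no virtual object is drawn, and the per-channel `tolerance` are all assumptions.

```python
def is_occluder_color(pixel, key_color, tolerance):
    """True when every RGB channel is within `tolerance` of the designated color."""
    return all(abs(c - k) <= tolerance for c, k in zip(pixel, key_color))


def composite(real_image, virtual_image, key_color, tolerance=30):
    """Overlay virtual pixels on the real image, except where the real pixel
    matches the designated occluder color (for example, a hand)."""
    output = []
    for real_row, virtual_row in zip(real_image, virtual_image):
        row = []
        for real_px, virtual_px in zip(real_row, virtual_row):
            # None marks a pixel where no virtual object is rendered.
            if virtual_px is None or is_occluder_color(real_px, key_color, tolerance):
                row.append(real_px)
            else:
                row.append(virtual_px)
        output.append(row)
    return output
```

Because the decision depends only on color, any background region that happens to share the designated color is also treated as an occluder.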
In N. Yokoya, H. Takemura, T. Okuma, and M. Kanbara, “Stereo vision based video see-through mixed reality,” in (Y. Ohta & H. Tamura, eds.) Mixed Reality-Merging Real and Virtual Worlds, Chapter 7, Ohmsha-Springer Verlag, 1999 (hereinafter referred to as “Document 11”), the occlusion problem is overcome by obtaining real-space depth information through stereo matching using images captured by two built-in cameras of an HMD.
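For a rectified stereo pair, depth follows from disparity as Z = f·B/d, and comparing the real-space depth at a pixel with the depth of the virtual object at that pixel settles the in-front/behind relationship. The sketch below illustrates this relationship only; it is not taken from Document 11, and the function names and the assumption of a rectified pair with focal length in pixels and baseline in meters are introduced here for illustration.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in meters of a point seen with the given disparity (rectified pair assumed)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def real_occludes_virtual(real_depth_m, virtual_depth_m):
    """True when the real surface is closer to the camera than the virtual object."""
    return real_depth_m < virtual_depth_m
```

For example, with an assumed focal length of 800 pixels and a 0.06 m baseline, a disparity of 24 pixels corresponds to a depth of about 2 m, so a virtual object placed at 3 m would be hidden at that pixel.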
With recent increases in computing speed, research on registration using features present in a scene (hereinafter referred to as “natural features”), rather than artificial markers, has been actively carried out.
Methods for determining the position and orientation of an image pickup apparatus on the basis of correspondences between image edges and a three-dimensional model of an observation target are disclosed in T. Drummond and R. Cipolla, “Real-time visual tracking of complex structures”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, No. 7, PP. 932-946, 2002 (hereinafter referred to as “Document 3”), and A. I. Comport, E. Marchand, and F. Chaumette, “A real-time tracker for markerless augmented reality”, Proceedings of the Second IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR03), PP. 36-45, 2003 (hereinafter referred to as “Document 4”).
In these methods, first, (1) a three-dimensional model is projected onto a captured image using a rough position and rough orientation of an image pickup apparatus. The rough position and rough orientation of the image pickup apparatus are, for example, the position and orientation calculated in the preceding frame. Then, (2) the line segments constituting the projected model are divided at equal intervals on the image, and, for each of the division points, a point (edge) where the intensity gradient is a local maximum in the direction perpendicular to the projected line segment is searched for as a corresponding point. Further, (3) correction values for the position and orientation of the image pickup apparatus are determined so that the distances on the image between the corresponding points found for the individual division points and the corresponding projected line segments are minimized, and the position and orientation of the image pickup apparatus are updated. The three-dimensional model is again projected onto the captured image using the updated position and orientation of the image pickup apparatus, and step (3) is iterated until the sum of the distances converges. Thus, the final position and orientation of the image pickup apparatus are obtained.
In the above step (2), erroneous detection may occur if the accuracy of the rough position and rough orientation of the image pickup apparatus is low. That is, wrong points may be detected as corresponding points. If such erroneous detection occurs, the iterative calculations in step (3) may not converge, or the accuracy of the obtained position and orientation of the image pickup apparatus may be low, resulting in low-accuracy AR registration.
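Step (2) above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Document 3 or Document 4: the image is a nested list of intensities, lookups use the nearest pixel, the gradient is a central difference along the normal, and the strongest gradient within the search range is taken in place of a true local-maximum test. All function names are hypothetical.

```python
import math


def division_points(p0, p1, count):
    """Return `count` equally spaced points from p0 to p1, endpoints included."""
    if count < 2:
        raise ValueError("count must be at least 2")
    return [(p0[0] + (p1[0] - p0[0]) * k / (count - 1),
             p0[1] + (p1[1] - p0[1]) * k / (count - 1))
            for k in range(count)]


def unit_normal(p0, p1):
    """Unit vector perpendicular to the projected segment p0 -> p1."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("segment has zero length")
    return (-dy / length, dx / length)


def intensity(image, x, y):
    """Nearest-pixel intensity lookup, clamped to the image border."""
    row = min(max(int(round(y)), 0), len(image) - 1)
    col = min(max(int(round(x)), 0), len(image[0]) - 1)
    return image[row][col]


def search_edge(image, point, normal, search_range):
    """Return the point along the normal with the strongest intensity gradient,
    or None when the searched neighbourhood is flat."""
    best_point, best_gradient = None, 0.0
    for offset in range(-search_range, search_range + 1):
        ahead = intensity(image,
                          point[0] + (offset + 1) * normal[0],
                          point[1] + (offset + 1) * normal[1])
        behind = intensity(image,
                           point[0] + (offset - 1) * normal[0],
                           point[1] + (offset - 1) * normal[1])
        gradient = abs(ahead - behind) / 2.0
        if gradient > best_gradient:
            best_gradient = gradient
            best_point = (point[0] + offset * normal[0],
                          point[1] + offset * normal[1])
    return best_point
```

The distance between each returned point and its projected line segment is the quantity that step (3) minimizes. When the rough pose is poor, the strongest gradient in the search range may belong to an unrelated edge, which is the erroneous detection discussed next.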
In Documents 3 and 4, therefore, an M-estimator, which is a robust estimation method, is used to minimize the sum of weighted errors by assigning a small weight to data having a large distance between the corresponding point and the line segment and assigning a large weight to data having a small distance. This reduces the influence of erroneous detection.
In L. Vacchetti, V. Lepetit, and P. Fua, “Combining edge and texture information for real-time accurate 3D camera tracking”, Proceedings of the Third IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR04), PP. 48-57, 2004 (hereinafter referred to as “Document 5”), a plurality of candidate points are extracted and stored in the search step (2), and the points closest to the projected line segments are selected from among the plurality of candidate points each time step (3) is repeated. This also reduces the influence of erroneous detection.
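The weighting idea behind an M-estimator can be illustrated with iteratively reweighted estimation of a single value. The sketch below is an assumption-laden stand-in: it uses Tukey's biweight function, which is one common choice and is not stated in this text to be the function used in Documents 3 and 4, and it estimates a one-dimensional location rather than a position and orientation. The names `tukey_weight`, `robust_location`, and the cutoff `c` are hypothetical.

```python
import statistics


def tukey_weight(residual, c):
    """Tukey biweight: near 1 for small residuals, exactly 0 at or beyond the cutoff c."""
    if abs(residual) >= c:
        return 0.0
    r = residual / c
    return (1.0 - r * r) ** 2


def robust_location(values, c, iterations=10):
    """Iteratively reweighted estimate of a single value in the presence of outliers."""
    estimate = statistics.median(values)  # robust starting point
    for _ in range(iterations):
        weights = [tukey_weight(v - estimate, c) for v in values]
        total = sum(weights)
        if total == 0.0:
            break  # every sample lies beyond the cutoff; keep the current estimate
        estimate = sum(w * v for w, v in zip(weights, values)) / total
    return estimate
```

With samples such as `[1.0, 1.1, 0.9, 1.0, 10.0]` and a cutoff of 3, the sample at 10.0 receives zero weight and the estimate stays close to 1.0, whereas the ordinary mean would be 2.8.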
In H. Wuest, F. Vial, and D. Stricker, “Adaptive Line Tracking with Multiple Hypotheses for Augmented Reality”, Proceedings of the Fourth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR05), PP. 62-69, 2005 (hereinafter referred to as “Document 6”), information concerning visual properties of edges near line segments on an image is held to reduce the influence of erroneous detection caused by changes in lighting or changes in point of view.
Methods for determining the position and orientation of an image pickup apparatus using point features, rather than edges, on an image are disclosed in G. Simon, A. W. Fitzgibbon, and A. Zisserman, “Markerless Tracking using Planar Structures in the Scene”, Proc. Int'l Symp. on Augmented Reality 2000 (ISAR2000), PP. 120-128, 2000 (hereinafter referred to as “Document 7”), and I. Skrypnyk and D. G. Lowe, “Scene Modelling, Recognition and Tracking with Invariant Image features”, Proc. The Third Int'l Symp. on Mixed and Augmented Reality (ISMAR04), PP. 110-119, 2004 (hereinafter referred to as “Document 8”).
Point features are features represented in terms of a position (image coordinates) on an image and the image information around that position. For example, point features are detected using the Harris operator, the Moravec operator, or the like.
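As an illustration of how an operator of this kind scores a pixel, the following sketch computes the Harris corner response, R = det(M) − k·trace(M)², from image gradients summed over a small window. It is a minimal sketch under assumptions made here: a nested-list grayscale image, central-difference gradients, an unweighted 3×3 window, and k = 0.04. The function name is hypothetical.

```python
def harris_response(image, row, col, k=0.04):
    """Harris corner response at (row, col) using a 3x3 window.

    Positive values suggest a corner, negative values an edge,
    and values near zero a flat region.
    """
    height, width = len(image), len(image[0])
    if not (2 <= row <= height - 3 and 2 <= col <= width - 3):
        raise ValueError("pixel must be at least 2 pixels from the border")
    sxx = syy = sxy = 0.0
    for r in range(row - 1, row + 2):
        for c in range(col - 1, col + 2):
            # Central-difference gradients in the horizontal and vertical directions.
            ix = (image[r][c + 1] - image[r][c - 1]) / 2.0
            iy = (image[r + 1][c] - image[r - 1][c]) / 2.0
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    determinant = sxx * syy - sxy * sxy
    trace = sxx + syy
    return determinant - k * trace * trace
```

Pixels whose response exceeds a threshold and is a local maximum would then be kept as point features; that selection step is omitted here.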
In Document 7, point features on the same plane in a three-dimensional space are tracked in successive frames, and the position and orientation of an image pickup apparatus are calculated on the basis of the relationship between the positions of these points on the plane and the image coordinates of the corresponding point features.
In Document 8, the position and orientation of an image pickup apparatus are determined using point features having feature information invariant to scale change and rotation change on an image on the basis of correspondences between image coordinates and three-dimensional coordinates of the point features. In Document 8, point features are not tracked in successive frames. Instead, matching is performed between a predetermined point feature database and point features detected in the current frame to identify the point features in every frame.
The point-feature-based methods also have a problem of erroneous detection, like the edge-based methods. In Documents 7 and 8, point features detected as outliers are removed using the Random Sample Consensus (RANSAC) algorithm. In the RANSAC-based outlier removal scheme, corresponding points are selected at random, the position and orientation of the image pickup apparatus are calculated from the selected points, and the number of corresponding points consistent with the calculated values is counted. This is repeated, and for the selection that yields the largest number of consistent corresponding points, the corresponding points not included in that consistent set are removed as outliers.
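The sample-and-count loop of RANSAC can be sketched on a deliberately simplified model. In the sketch below, a two-dimensional translation between matched points stands in for the position and orientation of the image pickup apparatus, so a single correspondence suffices as the random sample; this substitution, the function name, and the parameters `threshold`, `trials`, and `seed` are assumptions made for illustration and do not reproduce Documents 7 or 8.

```python
import math
import random


def ransac_translation(src, dst, threshold, trials, seed=0):
    """Return indices of correspondences consistent with the best translation found.

    src and dst are equal-length lists of (x, y) points; src[i] is matched to dst[i].
    """
    if not src or len(src) != len(dst):
        raise ValueError("src and dst must be non-empty and of equal length")
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(trials):
        # Minimal sample: one correspondence determines a translation.
        i = rng.randrange(len(src))
        tx, ty = dst[i][0] - src[i][0], dst[i][1] - src[i][1]
        # Count correspondences that agree with this hypothesis.
        inliers = [
            k for k in range(len(src))
            if math.hypot(src[k][0] + tx - dst[k][0],
                          src[k][1] + ty - dst[k][1]) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers
```

Correspondences whose indices are absent from the returned list are the ones that would be discarded as outliers before the final estimate is computed.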
There is a method of the related art using artificial markers, in which erroneous detection of the markers is prevented using chroma-keying. ProSet and SmartSet systems, which are virtual studio systems of Orad Hi-Tec systems Ltd., utilize traditional blue or green screen chroma-key techniques to separate a human figure from a background.
On the background, an artificial pattern for registration having a color similar to that of the background is placed, and the position and orientation of a camera are estimated using the pattern detected on the captured image. Since the registration pattern is separated from the human figure as part of the background using chroma-keying, the registration pattern is not erroneously detected on the image of the human figure. Therefore, stable estimation of the position and orientation of the camera can be achieved. Furthermore, since the registration pattern is removed as the background using chroma-keying, the registration pattern is not observed on a composite image in which a computer graphics (CG) image is rendered on the background.
The above-described technique to avoid erroneous detection of the markers, proposed by Orad Hi-Tec systems Ltd., is a technique for use in virtual studio applications. In a virtual studio, a human figure is extracted from the background, and a CG image is rendered onto a background portion so as to be combined with an image of the human figure. Therefore, a blue screen can be used as a background, and the background can be extracted using chroma-keying.
In AR systems, however, a CG image is superimposed on a real background image. Therefore, it is difficult to extract the background by performing simple processing such as chroma-keying, and it is also difficult to adopt the technique proposed by Orad Hi-Tec systems Ltd. to avoid erroneous detection of natural features on the background image.
In the related art, the process for detecting natural features for registration and the process for determining the in-front/behind relationship between a virtual object and a real occluding object such as a hand are separately performed. In an image region where the occluding object is in front of the virtual object, the natural features used for registration must not be detected. Therefore, it can be expected that erroneous detection may be prevented by using information concerning the in-front/behind relationship for image feature detection. In the related art, however, information concerning the in-front/behind relationship is not used for natural feature detection.
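The expectation stated above, that knowledge of the in-front/behind relationship could keep features from being detected on an occluding object, can be illustrated with a simple filter. This is a hypothetical sketch only: the Boolean `occluder_mask` (True where an occluding object covers the pixel), the `(x, y)` feature representation, and the function name are assumptions, and the sketch does not describe any related-art system.

```python
def features_outside_occluder(features, occluder_mask):
    """Keep only detected features that do not fall on an occluding region.

    features: list of (x, y) image coordinates.
    occluder_mask: nested list of booleans indexed as mask[row][col].
    """
    kept = []
    for x, y in features:
        row, col = int(round(y)), int(round(x))
        inside_image = 0 <= row < len(occluder_mask) and 0 <= col < len(occluder_mask[0])
        if inside_image and not occluder_mask[row][col]:
            kept.append((x, y))
    return kept
```

Features discarded by such a filter would never enter the correspondence search, so they could not be mistaken for natural features of the scene.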
A measurement device arranged to measure the position and orientation of an occluding object allows the in-front/behind relationship between an observation target object and the occluding object to be determined using measurement results of the measurement device. In the related art, however, information concerning the measured position and orientation of the occluding object is not used for natural feature detection.