1. Field of the Invention
The present invention relates to an information processing apparatus that calculates information on the position and orientation of an image capture device relative to an object captured by the image capture device, a processing method thereof, and a computer-readable storage medium.
2. Description of the Related Art
In recent years, research on AR (Augmented Reality) technology for superimposing information regarding virtual space on the real space and displaying the result has been actively conducted. As a typical information presenting device that has adopted such AR technology, a video see-through-type head-mounted display is known, for example. A camera that captures real space is provided in the video see-through-type head-mounted display. With the video see-through-type head-mounted display, a virtual object is drawn using CG (Computer Graphics) according to the position and orientation of the camera or the like. Then, the combined image obtained by superimposing the drawn virtual object on the image of real space is displayed on a display device of the head-mounted display, such as a liquid crystal panel. Thereby, a user can feel as if the virtual object exists in real space.
One of the major problems to be solved in realizing such AR technology is “alignment”. Alignment in AR establishes geometric consistency between a virtual object and real space. For the user to feel as if a virtual object exists in real space, alignment needs to be performed correctly so that the virtual object always appears at the position in real space where it is supposed to exist, and that state needs to be presented to the user.
With AR using the video see-through-type head-mounted display, generally, every time an image is inputted from the camera provided in the head-mounted display, the position and orientation of the camera in real space at the time the image was captured are measured. Then, an object is drawn using CG based on the position and orientation of the camera and parameters intrinsic to the camera, such as the focal length, and is superimposed on the image of real space. Therefore, when performing alignment in AR, the position and orientation of the camera provided in the head-mounted display need to be correctly measured. Generally, the position and orientation of a camera are measured using a physical sensor with six degrees of freedom, such as a magnetic sensor, an ultrasonic sensor, or an optical sensor.
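The role of the camera position, orientation, and intrinsic parameters in drawing can be illustrated with a minimal pinhole-projection sketch. The function name `project_point`, the world-to-camera convention x_cam = R x_world + t, and the use of a single focal length in pixels with a principal point are assumptions made for illustration only; they are not taken from the documents cited in this section.

```python
def project_point(point_world, rotation, translation, focal_length, principal_point):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    rotation is a 3x3 matrix (list of rows) and translation a 3-vector that
    map world coordinates to camera coordinates (assumed convention).
    """
    # World -> camera coordinates: x_cam = R * x_world + t.
    x_cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division, then scaling by the focal length (in pixels)
    # and shifting by the principal point.
    u = focal_length * x_cam[0] / x_cam[2] + principal_point[0]
    v = focal_length * x_cam[1] / x_cam[2] + principal_point[1]
    return (u, v)


# Hypothetical values: camera at the world origin looking along +Z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([0.1, -0.05, 2.0], identity, [0.0, 0.0, 0.0],
                      800.0, (320.0, 240.0))
```

An error in the measured rotation or translation shifts every projected point, which is why the virtual object appears misaligned when the camera pose is measured incorrectly.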
On the other hand, the video see-through-type head-mounted display can use image information from the camera provided therein for alignment. Alignment using image information can be performed more easily and at a lower cost than alignment using a physical sensor. Generally, with this alignment method, an index whose three-dimensional position in real space is known is captured with a camera, and the position and orientation of the camera are calculated based on the correspondence between the position of the index on the captured image and its three-dimensional position. As an index, for example, a marker artificially disposed in real space, or a natural feature that originally exists in real space, such as a corner point or an edge, is used. In practice, in terms of stability and calculation load, artificial markers that are easily detected and identified from image information are widely used.
Relating to such technology, “An Augmented Reality System and its Calibration based on Marker Tracking” (Kato, M. Billinghurst, Asano, and Tachibana, the Journal of the Virtual Reality Society of Japan paper magazine, Vol. 4, No. 4, pp. 607-617, 1999) (hereinafter, referred to as Document 1) discloses a method for performing alignment using, as an index, a square marker with a unique two-dimensional pattern drawn inside. Artificial markers, such as the above square marker, are easy to use and are thus widely used. However, a marker cannot be used in a case in which it is physically impossible or difficult to dispose a marker, or in a case in which disposing a marker is undesirable, for example because it would spoil the appearance of the scene.
On the other hand, as the capabilities of computers have improved in recent years, research on technology for performing alignment using natural features that originally exist in an actual scene has been actively conducted. Natural features used for alignment include features having a point shape, such as a corner point (hereinafter, a point feature), and line features such as an edge. A method for alignment using an edge is disclosed in “Real-time visual tracking of complex structures” (T. Drummond and R. Cipolla, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pp. 932-946, 2002) (hereinafter, referred to as Document 2), “A real-time tracker for markerless augmented reality” (A. I. Comport, E. Marchand, and F. Chaumette, Proc. The 2nd IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR03), pp. 36-45, 2003) (hereinafter, referred to as Document 3) and “Combining edge and texture information for real-time accurate 3D camera tracking” (L. Vacchetti, V. Lepetit, and P. Fua, Proc. The 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR04), pp. 48-57, 2004) (hereinafter, referred to as Document 4). Since edges do not change with scale or observation direction, alignment using an edge offers high accuracy. Alignment using an edge presupposes three-dimensional model data in which real space or a real object is described by a set of line segments. Alignment using edges disclosed in Documents 2, 3, and 4 is realized through the following processes (1) to (4).
(1) Based on the position and orientation of the camera for the previous frame and the intrinsic parameters of the camera that have been calibrated in advance, the three-dimensional model data described above (line segment model) is projected onto the image.
(2) Each projected line segment is divided at constant intervals on the image, and dividing points are set. Then, for each dividing point, an edge is searched for along the line segment that passes through the dividing point in the direction normal to the projected line segment (a search line), and the point on the search line at which the gradient of the luminance value takes an extremum and that is nearest to the dividing point is detected as a corresponding edge.
(3) A correction value for the position and orientation of the camera is calculated such that the sum of the distances on the image between the corresponding edge detected for each dividing point and the projected line segment is minimized, and the position and orientation of the camera are calculated based on that correction value.
(4) The process in (3) is repeated until the calculated result converges, whereby the optimizing calculation is performed.
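The dividing-point and search-line processing in (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Documents 2 to 4: the function names `dividing_points` and `search_edge`, nearest-pixel sampling, the central-difference gradient, and the `min_gradient` threshold are all hypothetical choices made here.

```python
import math


def dividing_points(p0, p1, spacing):
    """Place points at a constant pixel spacing along a projected 2D segment."""
    length = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
    count = int(length // spacing)
    return [
        (p0[0] + (p1[0] - p0[0]) * k * spacing / length,
         p0[1] + (p1[1] - p0[1]) * k * spacing / length)
        for k in range(1, count + 1)
    ]


def search_edge(image, point, normal, half_range, min_gradient):
    """Search along the normal for the gradient peak nearest to `point`.

    Returns the signed offset in pixels along the normal, or None if no
    peak reaches `min_gradient` (an assumed threshold).
    """
    height, width = len(image), len(image[0])
    offsets = list(range(-half_range, half_range + 1))
    samples = []
    for s in offsets:
        # Nearest-pixel sampling, clamped so border samples stay valid.
        x = min(max(int(round(point[0] + s * normal[0])), 0), width - 1)
        y = min(max(int(round(point[1] + s * normal[1])), 0), height - 1)
        samples.append(image[y][x])
    # Central differences approximate the luminance gradient on the search line.
    grads = [0.0] * len(samples)
    for i in range(1, len(samples) - 1):
        grads[i] = abs(samples[i + 1] - samples[i - 1]) / 2.0
    best = None
    for i in range(1, len(grads) - 1):
        is_peak = grads[i] >= grads[i - 1] and grads[i] >= grads[i + 1]
        if is_peak and grads[i] >= min_gradient:
            if best is None or abs(offsets[i]) < abs(offsets[best]):
                best = i
    return None if best is None else offsets[best]


# Synthetic image: dark columns 0-5, bright columns 6-9 (a vertical edge).
image = [[10.0] * 6 + [200.0] * 4 for _ in range(8)]
p0, p1 = (3.0, 0.0), (3.0, 7.0)  # projected segment, offset from the true edge
seg_len = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
normal = ((p1[1] - p0[1]) / seg_len, -(p1[0] - p0[0]) / seg_len)
edge_offsets = [search_edge(image, p, normal, 4, 20.0)
                for p in dividing_points(p0, p1, 2.0)]
```

Each returned offset is the signed distance on the image between a dividing point and its corresponding edge, which is the quantity whose sum process (3) seeks to minimize when correcting the camera position and orientation.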
Unlike a point feature, an edge is difficult to identify uniquely on an image. Because the search for an edge uses only the information that the gradient of the luminance value is maximal on the search line, an incorrect edge is often detected. Accordingly, in Documents 2, 3, and 4, a technique called M-estimation is used to reduce the weight of the data of edges considered to have been incorrectly detected before the optimizing calculation is performed, so that incorrectly detected edges do not adversely affect the calculation.
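The effect of M-estimation on an incorrectly detected edge can be sketched with iteratively reweighted least squares. The Tukey biweight function used here is one common choice of weight function and is an assumption of this sketch; the text above states only that M-estimation is used. The scalar `robust_offset` is a hypothetical stand-in for the six pose parameters, chosen to keep the example short.

```python
def tukey_weight(residual, c):
    """Tukey biweight: the weight falls to zero once |residual| >= c."""
    if abs(residual) >= c:
        return 0.0
    ratio = residual / c
    return (1.0 - ratio * ratio) ** 2


def robust_offset(distances, c, iterations=10):
    """Estimate a common offset by iteratively reweighted least squares."""
    # Start from the median so that a gross outlier does not bias the start.
    estimate = sorted(distances)[len(distances) // 2]
    for _ in range(iterations):
        weights = [tukey_weight(d - estimate, c) for d in distances]
        total = sum(weights)
        if total == 0.0:
            break
        # Weighted least-squares solution for a single scalar parameter.
        estimate = sum(w * d for w, d in zip(weights, distances)) / total
    return estimate


# Hypothetical edge distances in pixels; the last one is a false detection.
distances = [2.0, 2.1, 1.9, 2.0, 9.0]
plain_mean = sum(distances) / len(distances)
robust = robust_offset(distances, c=3.0)
```

With these values the unweighted mean is pulled toward the false detection, whereas the reweighted estimate stays close to the four consistent measurements because the outlier receives zero weight.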
Further, as described above, alignment using an edge requires three-dimensional model data of the line segments that constitute the real space or real object to be aligned. Conventionally, three-dimensional model data of line segments has been measured either manually or using images. For manual measurement, a tape measure, a ruler, a protractor, and the like are used. Alternatively, photogrammetry software is used, for instance: after an image of the scene or object to be measured is captured, the person measuring manually designates line segments on that image, and the software calculates three-dimensional data from the designated line segments. The person measuring searches the real space or real object for line segments that are likely to be detected as edges when alignment is performed, such as a line of intersection between two planes or a line across which the luminance changes greatly, and measures them using the above-mentioned tools or software. In addition, as described in “Structure and motion from line segments in multiple images” (C. J. Taylor and D. J. Kriegman, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 11, pp. 1021-1032, 1995) (hereinafter, referred to as Document 5), a method for measuring the three-dimensional data of a line segment using images is also known. In Document 5, the direction of a straight line in three-dimensional space and a position through which it passes are estimated based on the correspondence of line segments among a plurality of images. In Document 5, as a method for detecting an edge on an actual image, the two-dimensional edge detecting method described in “A computational approach to edge detection” (J. Canny, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 6, pp. 679-698, 1986) (hereinafter, referred to as Document 6) is used.
When performing alignment using an edge as described above, three-dimensional model data is projected, and an edge is detected through a one-dimensional search based on the projected image. However, at the time the three-dimensional model data is being measured, that data does not yet exist, so edge detection through a one-dimensional search based on a projection of the three-dimensional model data cannot be performed. In other words, it is not known in advance whether an edge that the person measuring judges to be an edge, or an edge detected using a two-dimensional edge detecting method, will actually be detected when alignment is performed. This problem is common not only to the aforementioned edge detecting method, which detects an extremum of the luminance gradient, but also to other methods that project three-dimensional model data and perform a one-dimensional search (for example, a method that performs a one-dimensional search for a corresponding edge using information on the image around an edge).
Accordingly, in the case of manual measurement, the person measuring needs to determine by eye whether or not an edge will be used for alignment, and such work requires skill. Furthermore, since even a person skilled in measuring may make a mistake in this determination, after the three-dimensional model data is measured, alignment needs to be actually performed and the determination of which line segments are unnecessary needs to be repeated. Therefore, measurement takes time and effort.
Further, when performing conventional alignment using an edge, even in a case in which the three-dimensional model includes information on a line segment that is not actually detected as an edge, detection of an edge corresponding to that line segment is still attempted, and a correspondence is still established. As a result, an edge is incorrectly detected on the image, or an incorrect correspondence arises between the detected edge and a line segment of the model, and the accuracy and stability of the alignment decrease.