Field of the Invention
The present invention relates to an information processing apparatus, and to a method of controlling the information processing apparatus, for generating an image database used to estimate the position and orientation of an imaging apparatus from a captured image.
Description of the Related Art
In recent years, there has been active research on mixed reality (hereinafter referred to as MR) technologies, in which a real space and a virtual space are blended and presented without a sense of unnaturalness. Among MR technologies, augmented reality (hereinafter referred to as AR) technologies, in which a virtual space is overlaid on a real space and presented, have attracted particular interest. One of the important problems in the MR and AR technologies is how to align the real space and the virtual space accurately in real time, and a great deal of effort has been put into this problem. In a video see-through method, the problem of alignment in MR and AR is the problem of obtaining the position and orientation of the imaging apparatus within the scene (specifically, in a reference coordinate system defined within the scene).
A representative method of realizing alignment in a video see-through method is one in which an artificial indicator whose shape information is known is arranged in the scene, the indicator is imaged and recognized by an imaging apparatus, and the position and orientation of the imaging apparatus in the reference coordinate system are obtained thereby. The position and orientation of the imaging apparatus in the reference coordinate system are obtained from a correspondence between the projected position (image coordinate) of the indicator within the image captured by the imaging apparatus and the three-dimensional coordinate of the indicator in the reference coordinate system, which is known information.
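The correspondence between a known three-dimensional coordinate and its projected image coordinate can be illustrated with a minimal sketch. It assumes an ideal pinhole camera without lens distortion; the function names, the square indicator, the focal length, and the principal point are all hypothetical values chosen for illustration and do not come from any cited method. The sketch only evaluates how well a candidate position and orientation explains the observed image coordinates; an actual system would search for the pose that minimizes this error.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation about the z axis (hypothetical helper for the example)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def project(point_ref, rotation, translation, focal, cx, cy):
    """Project a point in the reference coordinate system to image coordinates.

    Assumed model: x_cam = R * x_ref + t, followed by perspective division.
    """
    x_cam = [sum(rotation[i][j] * point_ref[j] for j in range(3)) + translation[i]
             for i in range(3)]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x_cam[0] / x_cam[2] + cx, focal * x_cam[1] / x_cam[2] + cy)

def reprojection_error(points_ref, observed, rotation, translation, focal, cx, cy):
    """Sum of squared distances between observed and predicted image coordinates."""
    total = 0.0
    for p, (u_obs, v_obs) in zip(points_ref, observed):
        u, v = project(p, rotation, translation, focal, cx, cy)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total

# Corners of a hypothetical 10 cm square indicator in the reference coordinate system.
corners = [(-0.05, -0.05, 0.0), (0.05, -0.05, 0.0), (0.05, 0.05, 0.0), (-0.05, 0.05, 0.0)]
FOCAL, CX, CY = 800.0, 320.0, 240.0          # assumed intrinsic parameters

true_R, true_t = rotation_z(0.1), [0.02, -0.01, 0.5]
observed = [project(p, true_R, true_t, FOCAL, CX, CY) for p in corners]

# The correct pose reproduces the observations; a wrong pose does not.
error_true = reprojection_error(corners, observed, true_R, true_t, FOCAL, CX, CY)
error_wrong = reprojection_error(corners, observed, rotation_z(0.0), [0.0, 0.0, 0.5],
                                 FOCAL, CX, CY)
```

Here the observations are synthesized from a known pose, so the error for that pose is zero while a different pose yields a large error.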
Alignment that uses characteristics originally existing within a scene (hereinafter referred to as natural features), without using an artificial indicator, is also being actively researched as a method of realizing alignment in the video see-through method. A method in which the position and orientation of an imaging apparatus are obtained based on a correspondence between an edge within an image and a three-dimensional model of an observation target is disclosed in T. Drummond and R. Cipolla: “Real-time visual tracking of complex structures”, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 932-946, 2002 (hereinafter referred to as D1) and in A. I. Comport, E. Marchand, and F. Chaumette: “A real-time tracker for markerless augmented reality”, Proc. The Second Int'l Symp. on Mixed and Augmented Reality (ISMAR03), pp. 36-45, 2003 (hereinafter referred to as D2). When an edge (corresponding point) corresponding to the three-dimensional model is incorrectly detected, the precision of the position and orientation of the imaging apparatus decreases, and the precision of the alignment in MR and AR decreases accordingly. For this reason, D1 and D2 use an M-estimator, which is one robust estimation method, and suppress the influence of incorrect detections by performing a weighted error minimization.
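The idea of weighted error minimization with an M-estimator can be sketched on a much simpler problem than pose estimation. The example below fits a line by iteratively reweighted least squares with a Tukey biweight function; the line model, the tuning constant `c`, the iteration count, and all function names are assumptions made for illustration and are not taken from D1 or D2, which apply the same principle to edge correspondences.

```python
def weighted_line_fit(xs, ys, ws):
    """Closed-form weighted least squares for y = a * x + b."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def tukey_weight(residual, c):
    """Tukey biweight: the weight shrinks with the residual and is 0 for |residual| >= c."""
    if abs(residual) >= c:
        return 0.0
    r = residual / c
    return (1.0 - r * r) ** 2

def m_estimate_line(xs, ys, c=3.0, iterations=10):
    """Alternate between a weighted fit and recomputing weights from the residuals."""
    ws = [1.0] * len(xs)
    for _ in range(iterations):
        a, b = weighted_line_fit(xs, ys, ws)
        ws = [tukey_weight(y - (a * x + b), c) for x, y in zip(xs, ys)]
    return a, b, ws

# Points on y = 2x + 1, with one incorrect measurement at index 4.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[4] += 8.0

a, b, weights = m_estimate_line(xs, ys)
```

With this data the incorrect measurement ends up with weight zero, so the fitted line is determined by the remaining points. The sketch assumes that at least two points keep a nonzero weight; a practical implementation would also scale `c` to the noise level of the residuals.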
Meanwhile, a method in which the position and orientation of an imaging apparatus are obtained by using point features detected by a Harris operator, a Moravec operator, or the like, rather than edges on the image, is disclosed in G. Simon, A. W. Fitzgibbon, and A. Zisserman: “Markerless tracking using planar structures in the scene”, Proc. Int'l Symp. on Augmented Reality 2000 (ISAR2000), pp. 120-128, 2000 (hereinafter referred to as D3) and in I. Gordon and D. G. Lowe: “Scene modelling, recognition and tracking with invariant features”, Proc. The Third Int'l Symp. on Mixed and Augmented Reality (ISMAR04), pp. 110-119, 2004 (hereinafter referred to as D4). The problem of incorrect detection occurs when point features are used, just as it does when edges are used. Accordingly, in D3 and D4, incorrectly detected point features are eliminated by a RANSAC (RANdom SAmple Consensus) algorithm. In incorrect detection elimination using RANSAC, the position and orientation of the imaging apparatus are repeatedly estimated from randomly chosen corresponding points, and the set of corresponding points that agree with the estimated values is obtained for each estimate. The set having the greatest number of agreeing corresponding points is retained, and any corresponding point not included in that set is eliminated as an incorrect detection.
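The RANSAC procedure described above can be sketched with a deliberately simplified model. Instead of the position and orientation of an imaging apparatus, the example estimates a two-dimensional translation between corresponding points, so that a single randomly chosen correspondence is enough to form a hypothesis. The model, the threshold, the trial count, and the function name are assumptions for illustration and are not taken from D3 or D4.

```python
import math
import random

def ransac_translation(src, dst, threshold=1.0, trials=50, seed=0):
    """Estimate a 2D translation and flag correspondences that disagree with it."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(trials):
        # Hypothesis from a minimal random sample (one correspondence).
        i = rng.randrange(len(src))
        tx, ty = dst[i][0] - src[i][0], dst[i][1] - src[i][1]
        # Correspondences that agree with the hypothesis within the threshold.
        inliers = [k for k, (p, q) in enumerate(zip(src, dst))
                   if math.hypot(q[0] - (p[0] + tx), q[1] - (p[1] + ty)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Re-estimate from the largest agreeing set; everything else is eliminated.
    n = len(best_inliers)
    tx = sum(dst[k][0] - src[k][0] for k in best_inliers) / n
    ty = sum(dst[k][1] - src[k][1] for k in best_inliers) / n
    kept = set(best_inliers)
    outliers = [k for k in range(len(src)) if k not in kept]
    return (tx, ty), best_inliers, outliers

# Eight correspondences related by the translation (5, -3); indices 2 and 6 are incorrect.
src = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0), (5.0, 15.0),
       (12.0, 8.0), (30.0, 2.0), (7.0, 22.0), (18.0, 18.0)]
dst = [(x + 5.0, y - 3.0) for x, y in src]
dst[2] = (40.0, 12.0)
dst[6] = (-4.0, 26.0)

translation, inliers, outliers = ransac_translation(src, dst)
```

In a pose estimation setting the minimal sample would contain several correspondences rather than one, and the agreement test would use the reprojection error, but the selection of the largest agreeing set follows the same pattern.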
The camera position and orientation estimation methods recited in D1 through D4 presuppose an environment in which no moving object is captured in the camera image (hereinafter referred to as a stationary environment). In D1 through D3, the camera position and orientation are estimated by tracking corresponding points between frames, so estimation accuracy decreases when a point being tracked moves within the actual environment. Estimation accuracy also decreases when points being tracked are occluded by an object whose position and orientation change over time (hereinafter referred to as a moving object), because the number of points that can be tracked decreases and incorrect tracking correspondences increase.
In D4, a subset of images selected from the entire group of captured images is registered in an image database, and the camera position and orientation are estimated by selecting from the image database, and using, the image for which the estimation error of the relative position and orientation with respect to the current image is the smallest. Consider a case in which such a database is constructed in an environment in which people or cars pass through, water flows, or the like. In such a case, images in which a moving object is captured are registered in the database, so it is no longer possible to correctly associate objects in the environment between an image registered in the image database and the current frame. For this reason, it is difficult to perform a camera position and orientation estimation that references an image database in a dynamic environment in which a moving object is captured in an image.
Meanwhile, techniques have conventionally been developed for estimating the region of a moving object captured in an image by using image processing technology together with an acceleration sensor or the like mounted on a camera. However, the region of a moving object estimated in this way has not conventionally been used to determine whether to register a camera image in the database. Also, in a case where a function for measuring the position and orientation of the moving object is available, the region of the moving object in an image can be estimated by using the measurement result of that function. However, information on the region in which a moving object appears has likewise not conventionally been used to determine whether to register a camera image in the database.