1. Technical Field
The invention is directed to a method for estimating a camera motion and for determining a three-dimensional model of an environment.
2. Background Information
Visual real-time tracking with respect to known or unknown scenes is an essential component of vision-based Augmented Reality (AR) applications. Determining the relative motion of the camera with respect to an unknown environment on end-user hardware was made possible by approaches inspired by A. J. Davison. 2003. Real-Time Simultaneous Localisation and Mapping with a Single Camera. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2 (ICCV '03), Vol. 2, pp. 1403 (“Davison”). This approach performs real-time tracking of visual features extracted from the captured images.
A feature is a salient element in an image which can be a point (often referred to as keypoint or interest point), a line, a curve, a connected region or any set of pixels. Features are usually extracted in scale space, i.e. at different scales. Therefore, each feature has a repeatable scale in addition to its two-dimensional position in the image. Also, a repeatable orientation (rotation) is usually computed from the intensities of the pixels in a region around the feature, e.g. as the dominant direction of intensity gradients. Finally, to enable comparison and matching of features, a feature descriptor is needed. Common approaches use the computed scale and orientation of a feature to transform the coordinates of the descriptor, which provides invariance to rotation and scale. Ultimately, the descriptor is an n-dimensional vector, which is usually constructed by concatenating histograms of functions of local image intensities, such as gradients, as disclosed in D. G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vision 60, 2, pp. 91-110 (“Lowe”).
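The orientation computation mentioned above can be illustrated with a minimal sketch. It is not the procedure of Lowe or of any particular detector: the function name dominant_orientation, the choice of 36 histogram bins, and the use of plain central differences without any spatial weighting are assumptions made purely for illustration. The sketch accumulates a gradient-magnitude-weighted histogram of gradient directions over a grayscale patch and returns the center of the strongest bin.

```python
import math

def dominant_orientation(patch, num_bins=36):
    """Return the dominant gradient direction of a patch, in radians.

    patch is a list of rows of grayscale intensities. Angles are measured
    in image coordinates (x to the right, y downwards) in [0, 2*pi).
    For a completely flat patch no direction is defined; the center of
    bin 0 is returned in that case.
    """
    rows, cols = len(patch), len(patch[0])
    bin_width = 2.0 * math.pi / num_bins
    hist = [0.0] * num_bins

    # Central differences on interior pixels only.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            # Clamp guards against angle rounding up to exactly 2*pi.
            index = min(int(angle / bin_width), num_bins - 1)
            hist[index] += magnitude  # stronger gradients vote more

    peak = max(range(num_bins), key=hist.__getitem__)
    return (peak + 0.5) * bin_width

# Intensity increases along x, so the dominant direction is close to 0.
ramp_x = [[float(c) for c in range(8)] for _ in range(8)]
theta_x = dominant_orientation(ramp_x)
```

Because the result is quantized to bin centers, it is accurate only to within half a bin width; practical detectors typically refine the peak further.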
The features need to be seen in many images between which the camera has moved sufficiently to allow estimating the depth and, consequently, reconstructing the 3D coordinates of the features. This is generally based on the structure-from-motion principle. In order to get correctly scaled 3D coordinates of the reconstructed points and therefore a correctly scaled camera motion, these approaches usually require an explicit manual measurement of some parts of the environment or equipping it with known objects. Another possibility to induce scale is to ask the user to perform a constrained camera motion: often the camera needs to move between two known frames such that its optical center is displaced by a translation of metrically known length.
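As a minimal sketch of how a metrically known translation can induce scale, consider a reconstruction that is correct only up to an unknown global factor. The helper names metric_scale and rescale and the numeric values are hypothetical and serve only as an illustration; the sketch assumes that the two camera centers and the reconstructed points are expressed in the same unscaled coordinate system.

```python
import math

def metric_scale(center_a, center_b, known_distance):
    """Scale factor mapping reconstruction units to metric units.

    center_a and center_b are the estimated optical centers of the two
    known frames; known_distance is the measured translation between
    them (e.g. in meters).
    """
    estimated = math.dist(center_a, center_b)
    if estimated == 0.0:
        raise ValueError("the two camera centers coincide; scale is undefined")
    return known_distance / estimated

def rescale(points, scale):
    """Apply the global scale factor to every 3D point."""
    return [tuple(scale * coord for coord in p) for p in points]

# The camera moved 0.5 m, but the reconstruction reports 2.0 units.
scale = metric_scale((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 0.5)
metric_points = rescale([(4.0, 0.0, 8.0)], scale)
```

The same factor applies to the camera translations, since structure and motion share one global scale.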
However, there are some limitations to this type of approach. Before a point can be reconstructed and added to the feature map, it needs to be tracked over multiple frames that have an estimated camera pose. This delays the participation of a newly visible physical point in the estimation of the full camera motion. Also, either the environment needs to be partially measured or pre-equipped, or the user needs to have some experience with the system in order to correctly perform the constrained camera motion that allows correct scale estimation. Lastly, since the existing approaches are mainly based on visual features (often extracted where some texture gradient is available), the online feature map obtained from them is generally sparse and cannot be used, even after post-processing and meshing, for occlusion handling or similar AR tasks that may, for example, require a meshed version of the environment.
R. A. Newcombe and A. J. Davison. 2010. Live dense reconstruction with a single moving camera. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, showed that, with a single standard hand-held video camera attached to a powerful PC and using the computational power of the Graphics Processing Unit (GPU), it is possible to obtain a dense representation of a desktop-scale, highly textured scene while performing the tracking using the PTAM method (e.g., see G. Klein and D. Murray. 2007. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR '07), pp. 1-10; “Klein”). The density of the online-created map was increased with dense stereo matching and GPU-based implementations.
Approaches exist that work on combined range-intensity data. In addition to an intensity image, they make use of a range map that contains dense depth information associated with the intensity image. The depth of a pixel refers to the distance between the optical center of the capturing device and the physical 3D surface that is imaged in that pixel.
FIG. 8 shows a scene consisting of two sets of dolls S1 and S2 (each set comprising a tall and a small doll), and a capturing device CD. A physical point PP1 of the set S1 is imaged in the pixel IP1 with the capturing device. The depth of this pixel is D1, the distance between the optical center OC of the capturing device, which defines the origin of the camera coordinate system, and the physical point PP1. Analogously, a second physical point PP2 of the set S2 is imaged in IP2 and has the depth D2. Note that an estimate of the camera intrinsic parameters (in particular focal length) allows for computing the 3D position in Cartesian coordinates of a point PP1 given its depth D1 and its pixel position on the image plane IP1.
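The computation of a 3D position from a depth and a pixel position can be sketched as follows. The sketch assumes an ideal pinhole camera without lens distortion; the function name backproject and the intrinsic parameters fx, fy (focal lengths in pixels) and cx, cy (principal point) are illustrative names, not taken from the disclosure. Because the depth is defined here as the distance from the optical center to the physical point, rather than as the coordinate along the optical axis, the viewing ray through the pixel is normalized before it is scaled by the depth.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Return the 3D point (X, Y, Z) in the camera coordinate system.

    (u, v) is the pixel position, depth is the distance between the
    optical center and the imaged physical point, and fx, fy, cx, cy
    are the assumed pinhole intrinsics.
    """
    # Direction of the viewing ray through the pixel, with Z = 1.
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    # Move along the unit ray by the measured distance.
    return (depth * x / norm, depth * y / norm, depth / norm)

# A pixel 100 columns to the right of the principal point, at distance 2.
point = backproject(420.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0)
```

If a sensor instead reports the coordinate along the optical axis, the normalization step is omitted and the unnormalized ray is multiplied by that value directly.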
V. Castaneda, D. Mateus, and N. Navab. 2011. SLAM combining ToF and high-resolution cameras. In Proceedings of the 2011 IEEE Workshop on Applications of Computer Vision (WACV '11), pp. 672-678, replaced the generally used standard hand-held video camera with a combination of a Time-of-Flight camera with a resolution of 204×204 and an RGB camera with a resolution of 640×480, and modified the measurement model and the innovation formulas of the Extended Kalman filter used by MonoSLAM (e.g., see Davison) to improve the tracking results. Since this approach is based on an Extended Kalman filter, it provides lower accuracy than keyframe-based methods. As discussed in H. Strasdat, J. Montiel and A. J. Davison. 2010. Real-time Monocular SLAM: Why Filter?. In 2010 IEEE International Conference on Robotics and Automation (ICRA), Anchorage, Ak., USA, pp. 2657-2664, in modern applications and systems, keyframe-based approaches give the best accuracy per unit of computing time.
Microsoft's end-user device Xbox 360 Kinect is a low-cost and relatively high-resolution RGB-D camera. It consists of a stereo system, composed of an infra-red structured-light projector combined with an infra-red camera, which allows pixel depth computation and to which a camera providing intensity images is registered. This device has been used directly by P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. 2010. RGB-D Mapping: Using depth cameras for dense 3D modeling of indoor environments. In Proc. of 2010 International Symposium on Experimental Robotics (ISER '10) (“Henry”) for surfel-based modeling of indoor environments. However, the proposed system does not run in real time and works on recorded videos; it does not perform any real-time or inter-frame tracking.
Therefore, it would be beneficial to provide a tracking method for simultaneously estimating a camera motion and determining a three-dimensional model of a real environment which takes account of the above-mentioned aspects.