1. Technical Field
The present invention relates to the field of imaging, and particularly to determining the pose of a camera with respect to a scene from information provided by an image of the scene captured by the camera and from the predetermined positions of known features of the scene.
2. Background Information
The problem to be solved is to estimate the pose of a physical camera based on a real-time or recorded sequence of images. The motivation for finding the physical camera pose is to be able to insert virtual objects into the image stream in real time, thereby obtaining an augmented reality effect. A typical setup is a user wearing head-mounted displays (HMDs) and a camera, where images from the camera are used to reconstruct the pose of the camera and thus the pose of the HMDs. The pose is then used to correctly align virtual objects on the screen of the HMDs, either on top of the camera image (video see-through) or on top of an optical see-through screen, so that the virtual objects or elements, that is, objects or elements not actually present in the real scene imaged by the camera, appear as real-life objects in the real surroundings of the user.
Prior art solutions to the problem are known to suffer either from not being usable in real time, because the methods or algorithms used to determine the camera pose from the camera image are slow or ineffective, or from other shortcomings such as inaccurate results (for example low repeatability), drifting, non-usable recovery of pose, or non-robust pose estimation.
2.1 Tracking Using a Known 3D Model of the Environment
Tracking or frame-to-frame tracking is described in chapter 0. Generally, frame-to-frame tracking methods suffer from drifting and (in practice) non-usable pose recovery methods. With only small camera movements, frame-to-frame tracking might work well, but in real-life scenarios where users are given control over the camera, this is a restriction that cannot be enforced in practice.
In a practical world, where unlimited processing power or unlimited frame rate is not available, frame-to-frame tracking is generally known also to suffer from losing track. If the camera moves too much, the next frame will not contain enough feature points that can be matched against feature points in the previous frame to perform pose estimation. Moving the camera fast will result in a blurred new frame (motion blur), which also reduces the number of feature points in the previous frame that are reliably matched in the new frame. If the tracking loses its pose, a pose recovery method must be implemented to reinitialize the tracking. Various methods have been suggested and published; however, most of them force the user to move and/or rotate the camera to a specific position and/or direction in order to recover the pose and to reinitialize the tracking. Even if a suitable recovery method is implemented, frame-to-frame tracking is known to suffer from drifting.
US 2002/0191862 A1 (U.S. Pat. No. 6,765,569 B1) discloses a method where the user needs to start use of the system by looking, that is, by pointing the camera, in a certain direction, and the system expands the possible working area during use. The disclosed method uses an approach similar to frame-to-frame tracking, but it does not use a prior-known 3D model, which means that it relies on triangulating feature point positions in subsequent frames to find the 3D positions of the feature points. The method stores 3D information of feature points during use, and refines the 3D position each time a feature point is detected, based on triangulation. When the refinement of the 3D position has reached a satisfactory level, the feature point is regarded as calibrated. This reduces the drifting issues somewhat, but in practice there is still severe drifting when moving far away from the starting point. An accurate determination of new feature points relies on accurate estimation of the 3D positions of previous feature points, which in turn relies on accurate estimation of previous camera poses.
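The dependence of triangulated 3D positions on the estimated camera poses can be illustrated with a minimal sketch. The Python code below is not taken from the cited publication. It assumes that each camera pose has already been reduced to a camera centre and a viewing-ray direction towards the same feature point, and it uses the midpoint method, which is only one of several possible triangulation techniques. The function name `triangulate_midpoint` and all parameter names are hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Triangulate a 3D point from two viewing rays c + t * d.

    c1, c2: camera centres; d1, d2: ray directions towards the feature point.
    Returns the midpoint of the shortest segment joining the two rays.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    w = [a - b for a, b in zip(c1, c2)]      # vector from centre 2 to centre 1
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                    # zero when the rays are parallel
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    t1 = (b * e - c * d) / denom             # parameter of closest point on ray 1
    t2 = (a * e - b * d) / denom             # parameter of closest point on ray 2
    p1 = [ci + t1 * di for ci, di in zip(c1, d1)]
    p2 = [ci + t2 * di for ci, di in zip(c2, d2)]
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))


# Two cameras observing the same feature point from different positions.
point = triangulate_midpoint((0.0, 0.0, 0.0), (1.0, 2.0, 5.0),
                             (1.0, 0.0, 0.0), (0.0, 2.0, 5.0))
```

Because the camera centres and ray directions enter the computation directly, any error in an estimated camera pose shifts the triangulated point, which is the dependency described in the paragraph above.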
It is well known that there always will be some amount of numerical and mathematical error added in every calculation/estimation of the above proposed method. This means that feature points that are far away from the starting point will have large errors due to drifting. On top of that, if the 3D position of one single feature point is regarded as “calibrated”, and this position is actually wrong, then the further tracking and calibration of feature points will be affected by an even larger error.
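The accumulation of error with distance from the starting point can be made concrete with a toy model. The sketch below is an illustration only, under strong assumptions that are not part of any cited method: the camera moves in a plane in a straight line, and every frame-to-frame estimate adds the same small heading error. The function name `accumulated_drift` and its parameters are hypothetical.

```python
import math


def accumulated_drift(num_frames, heading_bias_rad=0.01, step=1.0):
    """Position error after chaining num_frames frame-to-frame estimates.

    True motion: straight ahead along x, `step` units per frame.
    Estimated motion: same step length, but each frame adds a small,
    constant heading error (an assumption made purely for illustration).
    """
    true_x = true_y = 0.0
    est_x = est_y = est_heading = 0.0
    for _ in range(num_frames):
        true_x += step
        est_heading += heading_bias_rad      # the error is never corrected
        est_x += step * math.cos(est_heading)
        est_y += step * math.sin(est_heading)
    return math.hypot(est_x - true_x, est_y - true_y)


drift_near = accumulated_drift(10)    # close to the starting point
drift_far = accumulated_drift(100)    # far from the starting point
```

In this toy model, travelling ten times as far produces considerably more than ten times the position error, because each new estimate is built on all earlier, already erroneous estimates.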
In addition, it is required that the user moves in certain ways to ensure accurate estimation and refining of the 3D positions of the feature points, which in practice results in a cumbersome and hard-to-use system. U.S. Pat. No. 6,765,569 B1 includes a disclosure of a method for pose recovery if there are not enough feature points identified in an image, but this method still relies on having the user look in a direction where there are feature points that have been estimated satisfactorily, such that they can be considered “calibrated” feature points. The method does not use a 3D model of the environment that may be observed by the camera, which means that it computes 3D positions of the feature points while running by triangulation in subsequent frames.
2.2 Local Detection and Matching of Feature Points (Detection)
A good way of removing the problems of drifting and lost pose is to use a method where the system is trained on feature points of the environment, object or scene before the system is used at runtime. Training or classification information stored together with 3D positions makes it possible, in real time, to match feature points detected in the current frame against the ones stored in a database or classified through a classifier. When a sufficient number of feature points match, it is possible to estimate the physical camera pose through geometrical calculations. Further refinements can be obtained through numerical minimizations. This can be performed in every single frame without depending on the result of estimations in previous frames. In practice this means that “pose recovery” is performed in every single frame.
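The matching step of such a detection method can be sketched as follows. This is a minimal illustration and not the procedure of any particular publication: it assumes that each feature point is described by a numeric descriptor vector, that matching is done by nearest-neighbour search, and that ambiguous matches are rejected with a distance-ratio test. The names `match_to_database`, `max_ratio` and `min_matches` are hypothetical, and the subsequent pose computation from the 2D-3D correspondences is not shown.

```python
import math


def match_to_database(detected, database, max_ratio=0.8):
    """Match detected feature points against trained feature points.

    detected: list of (pixel_xy, descriptor) from the current frame.
    database: list of (descriptor, xyz) built in the training phase;
              it must contain at least two entries.
    Returns a list of (pixel_xy, xyz) 2D-3D correspondences.
    """
    matches = []
    for pixel_xy, descriptor in detected:
        # Rank all database entries by descriptor distance.
        ranked = sorted((math.dist(descriptor, db_desc), xyz)
                        for db_desc, xyz in database)
        (best_dist, best_xyz), (second_dist, _) = ranked[0], ranked[1]
        # Keep the match only if it is clearly better than the runner-up.
        if best_dist < max_ratio * second_dist:
            matches.append((pixel_xy, best_xyz))
    return matches


database = [((0.0, 0.0), (0, 0, 0)),
            ((1.0, 1.0), (1, 0, 0)),
            ((5.0, 5.0), (0, 1, 0))]
detected = [((10, 20), (0.1, 0.0)),    # clearly closest to the first entry
            ((30, 40), (0.5, 0.5))]    # equally close to two entries: rejected
matches = match_to_database(detected, database)

min_matches = 1                        # illustrative threshold only
enough_for_pose = len(matches) >= min_matches
```

Since the matching uses only the current frame and the trained database, nothing is carried over from previous frames, which is why drifting and lost pose do not arise in this kind of method.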
Drifting and lost pose will not be an issue in detection, as long as enough feature points are classified or matched against the contents of a keypoint database. However, this kind of method is known to suffer from false detections, meaning that feature points detected in real time are falsely classified or matched to feature points stored in the database. Some methods for estimating the camera pose can deal with a certain number of false matches (also called outliers), but either way the final estimated pose will typically suffer from low repeatability, even with few outliers. A typical scenario of such detection methods is that there are at least 15 outliers. This generally results in low repeatability, even when the camera is kept completely still. The effect of this problem in an augmented reality system is that the virtual objects are not kept in place, but appear to bounce around, jitter, and so on.
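The influence of outliers on an estimate, and one way of tolerating them, can be shown with a deliberately simplified sketch. Instead of a full camera pose, the code below estimates only a 2D shift between matched point pairs; this simplification, the exhaustive hypothesise-and-verify consensus scheme (a deterministic relative of random sample consensus), and all names and tolerance values are assumptions made for illustration.

```python
def estimate_shift_least_squares(pairs):
    """Average shift over all pairs; every outlier pulls the result."""
    n = len(pairs)
    dx = sum(q[0] - p[0] for p, q in pairs) / n
    dy = sum(q[1] - p[1] for p, q in pairs) / n
    return dx, dy


def estimate_shift_consensus(pairs, tol=1.0):
    """Use each pair as a hypothesis and keep the largest agreeing set."""
    best_inliers = []
    for p, q in pairs:
        hx, hy = q[0] - p[0], q[1] - p[1]    # shift proposed by this pair
        inliers = [(a, b) for a, b in pairs
                   if abs((b[0] - a[0]) - hx) <= tol
                   and abs((b[1] - a[1]) - hy) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return estimate_shift_least_squares(best_inliers)


# Eight correct matches with a true shift of (2, 3) and two false matches.
pairs = [((i, 2 * i), (i + 2, 2 * i + 3)) for i in range(8)]
pairs += [((0, 0), (50, -40)), ((5, 5), (55, -35))]

naive = estimate_shift_least_squares(pairs)    # pulled away from (2, 3)
robust = estimate_shift_consensus(pairs)       # recovers the true shift here
```

In this idealised example the consensus estimate is unaffected by the two false matches. With real data the correct matches are themselves noisy and the set of accepted matches changes from frame to frame, so the estimate still varies between frames, which is the repeatability problem described above.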
US 2006/0233423 A1 suggests a method for matching real-time feature points against a database, and for training/storing relevant information about the calibrated/trained feature points. In relation to what has been described above, the method suggested can be considered to be a detection method.
The disclosure “Randomized Trees for Real-Time Keypoint Recognition” by V. Lepetit, P. Lagger and P. Fua, accepted to the Conference on Computer Vision and Pattern Recognition, San Diego, Calif., June 2005, describes an earlier method, similar to that disclosed by US 2006/0233423 A1, for detecting objects in single frames. Instead of using a database, the use of “Classification trees” (see Chapter 0 below) is proposed. The classification trees are built in a training phase, and at runtime each keypoint is “dropped down” multiple trees, and finally the keypoint is matched with a certain score (confidence). This method, too, shows the strengths and weaknesses of a detection method.
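The idea of dropping a keypoint down multiple trees and obtaining a match with a score can be sketched as follows. The code is an illustration under assumptions and does not reproduce the cited paper: each internal node is assumed to compare the intensities of two pixels of the image patch around the keypoint, each leaf is assumed to hold a probability distribution over the trained keypoint classes, and the per-tree distributions are assumed to be averaged. The two hand-written trees stand in for trees that would normally be built in the training phase, and all names and numbers are hypothetical.

```python
def drop_down(tree, patch):
    """Follow the node tests from the root to a leaf; return its distribution."""
    node = tree
    while "leaf" not in node:
        i, j = node["compare"]               # indices of two pixels in the patch
        node = node["left"] if patch[i] <= patch[j] else node["right"]
    return node["leaf"]


def classify(trees, patch):
    """Average the leaf distributions of all trees; return (class, score)."""
    num_classes = len(drop_down(trees[0], patch))
    totals = [0.0] * num_classes
    for tree in trees:
        for k, prob in enumerate(drop_down(tree, patch)):
            totals[k] += prob
    averaged = [t / len(trees) for t in totals]
    best = max(range(num_classes), key=averaged.__getitem__)
    return best, averaged[best]


# Two tiny trees for two keypoint classes (class 0 and class 1).
tree_a = {"compare": (0, 1),
          "left": {"leaf": [0.8, 0.2]}, "right": {"leaf": [0.3, 0.7]}}
tree_b = {"compare": (2, 3),
          "left": {"leaf": [0.6, 0.4]}, "right": {"leaf": [0.1, 0.9]}}

patch = [10, 50, 200, 30]                    # flattened pixel intensities
keypoint_class, score = classify([tree_a, tree_b], patch)
```

A low score indicates that the trees disagree about the keypoint, and such a match could be discarded before pose estimation; the threshold to use is application dependent and is not specified here.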
2.3 Recorded Image Streams
Numerous other methods and algorithms have been published and patented for estimating the pose of a physical camera based on recorded image streams. Estimating the pose of a physical camera based on recorded image streams means that the method needs all or several images of the stream to be available “forward” and “backward” in time to be able to compute the camera pose of the “current” frame. In movie production, several such methods have been patented and published, but, as mentioned earlier, the methods in this category do not run in real time, and they rely on a complete image stream to yield a satisfactory result. Their intended use is not real-time augmented reality, but visual post-production effects for “Hollywood movies”.