The present invention relates to reconstruction of structure from motion and, in particular, it concerns a method, system and computer product for deriving three-dimensional information progressively from a streaming video sequence.
Much work has been done in the field known as “Structure From Motion” (SfM) in which a three-dimensional form is reconstructed based upon a sequence of video images from a moving camera. Where no information is provided as to the path of the camera, the problem is sometimes referred to as “Structure And Motion” (SaM) since the motion of the camera must be derived together with the three-dimensional form. Reconstruction of three-dimensional form from a video sequence may be essentially subdivided into two processes: feature tracking and model reconstruction. Feature tracking identifies trackable features which appear in two or more frames of the video sequence. Then, model reconstruction employs the parallax between positions of these trackable features in multiple frames to derive three-dimensional information regarding the shape of the three-dimensional scene viewed and/or the camera motion.
Feature tracking is implemented as a two-dimensional image-processing task in which features (pixel patterns) suitable for tracking are identified in each image and then these features are compared between adjacent frames to identify tracks, sometimes referred to as “feature traces”. Suitable features for tracking are features such as corners, whose appearance changes under translation (pixel displacement) in any direction, and which are not repeated many times in the frame. These properties are sometimes referred to as “cornerness” and “uniqueness”. Searching for corresponding features in adjacent (or nearby) frames is typically performed by pattern correlation. Where sufficient processing resources are available, the pattern correlation may optionally be enhanced to allow matching of features which have undergone unknown planar transformations (e.g., scaling, rotation and/or “warping”).
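By way of illustration only, the search for a corresponding feature in an adjacent frame may be sketched as follows. The sketch assumes grayscale frames held as lists of rows, a zero-mean normalized cross-correlation score, and a small square search window; the function names `ncc`, `extract` and `match_feature` and the parameters `size` and `radius` are hypothetical and are not taken from any cited work.

```python
import math


def ncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equal-size patches.

    Returns a score in [-1, 1]; 0.0 is returned when either patch is flat.
    """
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a flat patch carries no texture to correlate
    return sum(x * y for x, y in zip(da, db)) / denom


def extract(image, top, left, size):
    """Return the size-by-size patch whose top-left pixel is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def match_feature(prev, curr, top, left, size=3, radius=2):
    """Find the patch in `curr` that best matches the patch of `prev` at (top, left).

    Only positions within `radius` pixels of (top, left) are examined.
    Returns (score, best_top, best_left).
    """
    template = extract(prev, top, left, size)
    last_row = min(len(curr) - size, top + radius)
    last_col = min(len(curr[0]) - size, left + radius)
    best = (-2.0, top, left)  # below any attainable score
    for r in range(max(0, top - radius), last_row + 1):
        for c in range(max(0, left - radius), last_col + 1):
            score = ncc(template, extract(curr, r, c, size))
            if score > best[0]:
                best = (score, r, c)
    return best
```

A flat patch scores 0.0 against every candidate, which is one way of seeing why regions lacking “cornerness” are unsuitable for tracking. This sketch handles pure translation only; matching under scaling, rotation or warping would require searching over additional transformation parameters.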
The average length of track per trackable feature is critical to the quality of the three-dimensional reconstruction. Besides features passing out of view or being obscured, tracks are often lost prematurely due to rotation, scaling, or changing of background, lighting or other factors which can distort the texture of the region around a feature and render it difficult to associate a feature reliably across widely spaced frames. A lack of reliable feature matches between widely spaced images deprives the model reconstruction process of the high-parallax information vital for precise reconstruction.
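The importance of high-parallax information may be illustrated by a simplified numerical sketch. It assumes the special case of two views related by a pure sideways translation (a rectified pair), for which depth equals focal length times baseline divided by parallax. The function names and the one-pixel matching error are assumptions chosen for illustration only.

```python
def depth_from_parallax(focal_px, baseline, parallax_px):
    """Depth of a point seen in two views related by a pure sideways translation.

    focal_px    -- focal length in pixels
    baseline    -- distance between the two camera positions (scene units)
    parallax_px -- shift of the feature between the two images, in pixels
    """
    if parallax_px <= 0:
        raise ValueError("parallax must be positive")
    return focal_px * baseline / parallax_px


def depth_error_for_pixel_error(focal_px, baseline, true_depth, pixel_error=1.0):
    """Absolute depth error caused by mis-measuring the parallax by `pixel_error`."""
    true_parallax = focal_px * baseline / true_depth
    estimated = depth_from_parallax(focal_px, baseline, true_parallax + pixel_error)
    return abs(estimated - true_depth)


# The same one-pixel matching error, for closely and widely spaced views
# of a point at depth 10 seen with a focal length of 1000 pixels.
narrow = depth_error_for_pixel_error(1000.0, 0.1, 10.0)  # baseline 0.1
wide = depth_error_for_pixel_error(1000.0, 1.0, 10.0)    # baseline 1.0
```

Under these assumptions, the one-pixel error displaces the reconstructed point by roughly 0.91 units for the short baseline and by roughly 0.10 units for the long baseline, which is why tracks that survive across widely spaced frames contribute disproportionately to the precision of the reconstruction.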
Model reconstruction is typically performed by a process known as “bundle adjustment”, which iteratively refines an initial estimate. Bundle adjustment is computationally heavy, and its rate of convergence depends strongly on the information or initial approximation available as a starting point for the calculation.
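The refinement character of such a calculation may be illustrated by a much reduced sketch. Full bundle adjustment refines camera parameters and scene points jointly; the sketch below refines only a single point, with the camera positions assumed known, by Gauss-Newton minimization of the reprojection error. A two-dimensional world with one-dimensional images is assumed for brevity, and the names `project` and `refine_point` are hypothetical.

```python
def project(x, z, cam_x, focal=1.0):
    """1-D pinhole projection of point (x, z) into a camera at (cam_x, 0) facing +z."""
    return focal * (x - cam_x) / z


def refine_point(observations, cam_xs, x0, z0, focal=1.0, tol=1e-10, max_iter=50):
    """Refine an initial estimate (x0, z0) of a point from its observed projections.

    Each Gauss-Newton step solves the 2x2 normal equations (J^T J) d = -J^T r,
    where r holds the reprojection residuals and J their derivatives.
    Returns (x, z, iterations_used).
    """
    x, z = x0, z0
    for iteration in range(1, max_iter + 1):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for u, c in zip(observations, cam_xs):
            r = project(x, z, c, focal) - u      # reprojection residual
            jx = focal / z                       # d r / d x
            jz = -focal * (x - c) / (z * z)      # d r / d z
            a11 += jx * jx
            a12 += jx * jz
            a22 += jz * jz
            g1 += jx * r
            g2 += jz * r
        det = a11 * a22 - a12 * a12
        if det == 0.0:
            raise ValueError("degenerate geometry: normal equations are singular")
        dx = (-a22 * g1 + a12 * g2) / det
        dz = (a12 * g1 - a11 * g2) / det
        x += dx
        z += dz
        if abs(dx) + abs(dz) < tol:
            break
    return x, z, iteration


# Three camera positions observing the point (1.0, 5.0) without noise.
cams = [-1.0, 0.0, 1.0]
obs = [project(1.0, 5.0, c) for c in cams]
x, z, iterations = refine_point(obs, cams, x0=0.5, z0=4.0)
```

Starting from the approximation (0.5, 4.0), the estimate moves to the true point within a handful of steps. The number of steps required, and whether the iteration converges at all, depends on the starting approximation, which is the dependence noted above.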
The full bundle adjustment calculation is naturally suited to offline or “batch” processing scenarios where the entire video sequence is available before the calculations are performed. In such scenarios, the entire sequence is first processed to find feature tracks and an initial model approximation is derived. Bundle adjustment is then performed in parallel on some or all of the frames to refine the three-dimensional model and to derive the camera motion. Such offline techniques are well developed and documented, for example in Fitzgibbon and Zisserman (Automatic Camera Tracking, Andrew W. Fitzgibbon and Andrew Zisserman, Robotics Research Group, Department of Engineering Science, Oxford University, 2003), which is hereby incorporated by reference in its entirety.
The present invention, on the other hand, relates to “online” scenarios, where data is provided as a streaming video sequence, such as from real-time video, typically without any a priori information about the viewed scene or camera motion. Adaptation of the aforementioned techniques to online processing has proved non-trivial, particularly with regard to implementation of bundle adjustment in a real-time scenario. The common approach is to perform bundle adjustment locally on a small group (typically three) of closely spaced frames. These partial solutions are then concatenated or “stitched” together to provide an approximate solution for the entire video sequence (Preemptive RANSAC for Live Structure and Motion Estimation, David Nister, Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV 2003)). This approach suffers from accumulating errors over long sequences and from an inability to maintain registration through severe aspect changes. To address these problems, Vacchetti et al. (Stable Real-Time 3D Tracking using Online and Offline Information, L. Vacchetti, V. Lepetit and P. Fua, Computer Vision Lab, Swiss Federal Institute of Technology (EPFL)) propose an approach using a number of keyframes generated during an offline registration procedure. At each stage, the current image is processed together with the previous image and the closest-matching keyframe. Additional keyframes are generated during the online processing whenever the reliability of matching becomes too low. As a result, as newly generated keyframes get further from the original registration keyframes, drift errors will tend to accumulate. Furthermore, the need for a registration procedure with offline processing prior to starting the online processing renders the approach of Vacchetti et al. unsuited to many applications.
There is therefore a need for a method, system and computer product for deriving three-dimensional information progressively from a streaming video sequence, such as from real-time video, and without requiring any a priori information about the viewed scene or camera motion.