In the computer vision community, research on structure from motion, particularly the estimation of 3D structure from the images of a video sequence, is currently very active. The interest of the Structure from Motion branch of computer vision (also referred to as Structure and Motion, or SaM) has recently shifted to developing reliable and practical SaM algorithms and to building systems which incorporate such algorithms. Further, special interest has been devoted to developing systems which are capable of processing video images directly from an initially uncalibrated camera to automatically produce a three-dimensional graphical model. Great advances have been made towards these goals, as reflected in the number of SaM algorithms which have been developed. Examples include the algorithm disclosed by Fitzgibbon et al. in “Automatic Camera Recovery for Closed or Open Loop Image Sequences”, Proc. ECCV 1998, and the algorithm discussed by Pollefeys et al. in “Self-calibration and Metric Reconstruction in Spite of Varying and Unknown Internal Camera Parameters”, IJCV, August 1999, both of which are incorporated herein by reference.
Typical structure from motion algorithms match points between images that are projections of the same point in space. This in turn enables triangulation of depth, in the same way as the human brain performs stereo vision. The result of SaM processing is a three-dimensional textured model of the structure seen in the images. The position and calibration of the projections that produced the images can also be retrieved. However, many SaM algorithms perform best when supplied with a set of sharp, moderately interspaced still images, rather than with a raw video sequence. Therefore, choosing a subset of frames from a raw video sequence can produce a more appropriate input to these algorithms and thereby improve the final result.
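By way of illustration only, the triangulation of depth from a matched point can be sketched for the simplest stereo case. The sketch below assumes two identical, rectified pinhole views with parallel optical axes, which is a simplification and not the general uncalibrated SaM setting; the function name and all numeric values are hypothetical.

```python
def depth_from_disparity(focal_px, baseline, x_left, x_right):
    """Depth of a point matched at x_left and x_right (in pixels).

    Assumes a rectified pair of identical pinhole views: focal length
    `focal_px` in pixels, camera centres separated by `baseline`.
    The returned depth is in the same unit as `baseline`.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("a point in front of both views has positive disparity")
    return focal_px * baseline / disparity


# Hypothetical values: 800 px focal length, 0.1 m baseline, 20 px disparity.
depth = depth_from_disparity(800.0, 0.1, 420.0, 400.0)

# The same point seen with half the baseline yields half the disparity,
# so a matching error of one pixel has a larger effect on the depth.
depth_small_baseline = depth_from_disparity(800.0, 0.05, 410.0, 400.0)
```

In this simplified model, depth is inversely proportional to disparity, so views that are very close together give small disparities and a poorly conditioned depth estimate.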
One way to obtain a smaller set of views (i.e., a subset of frames) is simply to use a lower frame rate than the one produced by the camera. However, this is inadequate for several reasons. First, it can lead to unsharp frames being selected over sharp ones. Second, it typically means that an appropriate frame rate for a particular shot has to be guessed by the user or, even worse, predefined by the system. In general, the motion between frames has to be fairly small to allow automatic matching, while significant parallax and a large baseline are desirable to get a well-conditioned set of views (i.e., sequence of frames). If the frame rate is too low, matching between the images can be difficult if not impossible. However, with high frame rates (e.g., the full frame rate of a video camera), memory is quickly and sometimes needlessly consumed. Further, higher frame rates, while necessary in some situations, often result in an abundance of similar views, which can produce problems not only in terms of memory requirements, but in terms of processing requirements and/or numerical stability as well. Accordingly, with high frame rates an unnecessarily large set of views is produced, and with low frame rates the connectivity between frames is jeopardized. In fact, the appropriate frame rate depends on the motion and parallax of the camera and can therefore vary over a sequence.
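The first of these drawbacks can be illustrated with a minimal sketch of fixed-rate subsampling. The per-frame sharpness scores below are hypothetical values chosen for illustration, and no particular sharpness measure is implied; the point is only that a fixed step selects frames by position and ignores their quality.

```python
def fixed_rate_subset(num_frames, step):
    """Indices kept when only every `step`-th frame is retained."""
    return list(range(0, num_frames, step))


# Hypothetical per-frame sharpness scores (higher means sharper).
sharpness = [0.2, 0.9, 0.8, 0.1, 0.9, 0.7, 0.3, 0.8, 0.9]
step = 3

kept = fixed_rate_subset(len(sharpness), step)
kept_scores = [sharpness[i] for i in kept]

# Sharpest frame that was discarded from each group of `step` frames.
best_skipped = [max(sharpness[i + 1:i + step]) for i in kept]

for index, score, skipped in zip(kept, kept_scores, best_skipped):
    print(f"kept frame {index} (sharpness {score}), "
          f"skipped a neighbour with sharpness {skipped}")
```

With these assumed scores, every retained frame is less sharp than a neighbouring frame that was discarded, although shifting the selection by a single frame would have avoided this.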
In practice, it is impossible to know in advance the appropriate frame rate at which a handheld video sequence should be grabbed, inasmuch as the appropriate frame rate depends on the motion of the camera and the imaged structure. Furthermore, the appropriate frame rate can vary within a sequence as well as from one sequence to another. Therefore, additional mechanisms are necessary in order to incorporate raw video sequences into a fully working SaM system.