The availability of computer graphics is becoming increasingly important to the growth of entertainment-related applications, including film animation tools as well as lower-resolution games and multimedia products for the home. A few of the many other areas touched by computer graphics include education, video conferencing, video editing, interactive user interfaces, computer-aided design and computer-aided manufacturing (CAD/CAM), scientific and medical imaging, business applications, and electronic publishing.
Although there are numerous ways to categorize graphics processing, one common approach is to describe an image in terms of the dimensions of the objects in the image. For example, a graphics system may represent objects in two dimensions (“2D”), where features in the object have x and y coordinates. Alternatively, objects may be represented in three dimensions (“3D”), where features in the object have x, y, and z coordinates. Most people are familiar with viewing images in 2D since computer monitors display images in 2D. However, if the computer maintains a graphical model representing an object in three dimensions, the computer can alter the displayed image on a 2D display to illustrate a different perspective of the object in 3D space. For example, a user may view an image on a traditional computer screen from one perspective and see various lighting and shadowing changes as the user views the image from a different perspective. Thus, a user perceives the displayed objects as being in 3D.
A 3D scene that is displayed to a user may be generated from a 3D synthetic scene, which is generated by a computer, or from a real scene, which can be captured using a camera (e.g., a video camera). For either synthetic or real scenes, there are different parameters associated with the scene, such as the 3D structure associated with any object in the scene and the camera movement (e.g., rotation and translation) for a real or imaginary camera capturing the scene. Thus, it is desirable to automatically recover the structure of a 3D scene together with 3D camera positions from a sequence of images (e.g., video images) acquired by an unknown camera undergoing unknown movement.
Recovering such a 3D scene is often accomplished using structure-from-motion (SFM) algorithms. SFM has been studied extensively because of its applications in robotics, video editing, and image-based modeling and rendering. Some important aspects of SFM calculations include identifying multiple feature points in images, using a long baseline, and efficient bundle adjustment. A feature point is any point in the image that can be tracked well from one frame to another. Typically, corners of an object are easily identifiable and are considered good feature points. The baseline is associated with how a camera is moving in relation to an object depicted in an image. Bundle adjustment is a non-linear minimization process that is typically applied to all of the input frames and features of the input image stream. Essentially, bundle adjustment is a non-linear averaging of the features over the input frames to obtain the most accurate 3D structure and camera motion.
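The quantity that bundle adjustment minimizes can be illustrated with a minimal sketch. The following Python fragment is an assumption-laden illustration, not the described method: it assumes a simple pinhole camera whose motion is a rotation about the y axis plus a translation, and it only evaluates the sum of squared reprojection errors rather than performing the non-linear minimization. All names (rotate_y, project, reprojection_error) and the synthetic data are hypothetical.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point about the y axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(point, camera, focal=1.0):
    """Pinhole projection of a world point; camera = (angle, translation)."""
    angle, (tx, ty, tz) = camera
    x, y, z = rotate_y(point, angle)
    x, y, z = x + tx, y + ty, z + tz
    return (focal * x / z, focal * y / z)

def reprojection_error(points, cameras, observations):
    """Sum of squared distances between observed and predicted features.

    observations maps (frame index, point index) -> observed (u, v).
    """
    total = 0.0
    for (frame, idx), (u_obs, v_obs) in observations.items():
        u, v = project(points[idx], cameras[frame])
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total

# Synthetic scene (assumed values): three feature points seen in two frames.
true_points = [(0.0, 0.0, 5.0), (1.0, 0.5, 6.0), (-1.0, 1.0, 4.0)]
cameras = [(0.0, (0.0, 0.0, 0.0)), (0.1, (-0.5, 0.0, 0.0))]
observations = {
    (f, i): project(p, cameras[f])
    for f in range(len(cameras))
    for i, p in enumerate(true_points)
}

# The true structure reproduces the observations; a perturbed one does not.
perturbed_points = [(x + 0.1, y, z) for x, y, z in true_points]
exact = reprojection_error(true_points, cameras, observations)
perturbed = reprojection_error(perturbed_points, cameras, observations)
```

A bundle adjustment would vary both the point coordinates and the camera parameters to drive this error toward its minimum over all frames and features.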
There are, however, problems with conventional 3D reconstruction. For example, the bundle adjustment used in 3D reconstruction requires a good initial estimate of both 3D structure and camera motion. Additionally, bundle adjustment is computationally expensive because it involves all input frames and features. For example, the complexity of interleaving bundle adjustment for each iteration step is a function of mn³, where m is the number of feature points and n is the number of frames. Thus, bundle adjustment computed over all the frames is time-consuming and slows the entire 3D reconstruction. For this reason, most systems use relatively short or sparse image sequences. In practice, however, structure from motion is often applied to a long video sequence.
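The effect of the cubic dependence on the number of frames can be seen with a short calculation. The sketch below assumes the cost per iteration step is proportional to m times the cube of n, with the constant factor ignored; the function name relative_cost and the point and frame counts are hypothetical values chosen only for illustration.

```python
def relative_cost(num_points, num_frames):
    """Per-iteration cost up to a constant factor, using the m * n**3 model."""
    return num_points * num_frames ** 3

m = 500                                   # assumed number of feature points
cost_50 = relative_cost(m, 50)            # a 50-frame sequence
cost_100 = relative_cost(m, 100)          # a 100-frame sequence
ratio = cost_100 / cost_50                # doubling n multiplies the cost by 2**3
```

Under this model, doubling the number of frames multiplies the per-iteration cost by eight, which is why reducing the number of frames entering the bundle adjustment matters more than reducing the number of feature points.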
A recent paper entitled “Automatic Camera Recovery for Closed or Open Image Sequences” by Fitzgibbon and Zisserman (ECCV '98) describes a hierarchical approach that builds local structure from image triplets, which are three sequential images. For this technique to be effective, an assumption is made that the sequence is sparse and each triplet forms a long baseline. In practice, however, a dense sequence of video images is often captured, so the triplets have short baselines and the resulting 3D models are unreliable.
To overcome the shortcomings of conventional 3D reconstruction, in one aspect of the present invention, a method and apparatus divide a long sequence of frames into a number of smaller segments. A 3D reconstruction is performed on each segment individually. Then all the segments are combined through an efficient bundle adjustment to complete the 3D reconstruction.
In another aspect of the invention, the number of frames per segment is reduced by creating virtual key frames. The virtual key frames encode the 3D structure for each segment, but are only a small subset of the original frames in the segment. A final bundle adjustment is performed on the virtual key frames, rather than all of the original frames. Thus, the final bundle adjustment is two orders of magnitude faster than a conventional bundle adjustment.
A further aspect of the invention allows for efficient segmenting of the sequence of frames. Instead of dividing the sequence so that segments are of equal size (e.g., 100 frames per segment), the segments may vary in size and are divided based on the number of feature points that are in each frame in the segment. For example, any frame that has fewer than a threshold number of feature points may be moved to a different segment. This approach to segmenting balances the desire for a long baseline against the desire for a small tracking error.
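One possible way to realize such variable-size segmenting is sketched below. This is a minimal illustration under stated assumptions, not the claimed procedure: it assumes a hypothetical callback count_tracked(start, frame) that reports how many feature points from the first frame of the current segment are still tracked in a later frame, and it starts a new segment whenever that count falls below a threshold. The toy tracker and its numbers are invented for the example.

```python
def segment_frames(num_frames, count_tracked, min_features):
    """Greedily split frames 0..num_frames-1 into variable-length segments.

    count_tracked(start, frame) is a hypothetical callback returning how many
    feature points from frame `start` are still tracked in `frame`.
    Returns a list of (first_frame, last_frame) pairs, inclusive.
    """
    segments = []
    start = 0
    for frame in range(1, num_frames):
        # Too few surviving features: close the segment before this frame
        # and let this frame begin the next segment.
        if count_tracked(start, frame) < min_features:
            segments.append((start, frame - 1))
            start = frame
    if num_frames > 0:
        segments.append((start, num_frames - 1))
    return segments

def toy_tracker(start, frame):
    """Assumed decay model: 100 features at the start, 10 lost per frame."""
    return 100 - 10 * (frame - start)

segments = segment_frames(12, toy_tracker, min_features=60)
```

With the toy tracker, each segment grows until fewer than 60 features survive, so the 12 frames are split into segments covering frames 0 to 4, 5 to 9, and 10 to 11. A higher threshold yields shorter segments with smaller tracking error; a lower threshold yields longer baselines.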
In yet another aspect of the invention, a partial model is created by solving a number of long baseline two-frame SFM problems and interpolating motion parameters for in-between frames. The two-frame SFMs are scaled onto a common coordinate system. A bundle adjustment of the two-frame SFMs provides the partial model for the segment. Virtual key frames may then be created from the partial model.
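The interpolation of motion parameters for in-between frames can be sketched as follows. The interpolation scheme is not specified above, so this fragment assumes simple linear interpolation of a rotation vector and a translation vector between two frames whose motion has already been recovered; this is a reasonable approximation only for small rotations. The function name interpolate_motion and the pose values are hypothetical.

```python
def interpolate_motion(pose_a, pose_b, num_between):
    """Linearly interpolate poses for the frames between two solved frames.

    Each pose is (rotation_vector, translation), both 3-tuples.
    Linear blending of rotation vectors is an assumption of this sketch.
    """
    poses = []
    for k in range(1, num_between + 1):
        t = k / (num_between + 1)  # fraction of the way from pose_a to pose_b
        rot = tuple((1 - t) * a + t * b for a, b in zip(pose_a[0], pose_b[0]))
        trans = tuple((1 - t) * a + t * b for a, b in zip(pose_a[1], pose_b[1]))
        poses.append((rot, trans))
    return poses

# Assumed motion recovered from a long-baseline two-frame SFM.
first = ((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
last = ((0.0, 0.4, 0.0), (2.0, 0.0, 0.0))
between = interpolate_motion(first, last, 3)
```

The interpolated poses serve only as initial estimates for the in-between frames; the subsequent bundle adjustment over the segment is what refines them.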
Further features and advantages of the invention will become apparent with reference to the following detailed description and accompanying drawings.