Much research has been devoted to two-dimensional (2-D) to three-dimensional (3-D) conversion techniques for the purposes of generating 3-D models of scenes, and significant progress has been made in this area. Fundamentally, the process of generating 3-D models from 2-D images involves determining disparity values for corresponding scene points in a plurality of 2-D images captured from different camera positions.
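As a rough illustration of why disparity values carry 3-D information, the following minimal sketch converts a disparity into a depth. It assumes the simplest possible case, which is not stated in the text above: a rectified stereo pair, in which the two cameras differ only by a sideways shift. The function name and the parameters focal_px and baseline_m are hypothetical names chosen for this example.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth (in metres) of a scene point seen in a rectified stereo pair.

    Assumes the two cameras differ only by a horizontal shift of
    baseline_m metres and share a focal length of focal_px pixels.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    # Similar triangles: depth is inversely proportional to disparity.
    return focal_px * baseline_m / disparity_px


# A point that shifts by 20 pixels between two views taken 0.1 m apart.
print(depth_from_disparity(20, focal_px=1000, baseline_m=0.1))
```

Under this assumption, nearby points produce large disparities and distant points produce small ones, so a small error in a small disparity translates into a large depth error.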
Generally, methods for determining 3-D point clouds from 2-D images involve three main steps. First, a set of corresponding features in a pair of images is determined using a feature matching algorithm. One such approach is described by Lowe in the article “Distinctive image features from scale-invariant keypoints” (International Journal of Computer Vision, Vol. 60, pp. 91-110, 2004). This method involves forming a Scale Invariant Feature Transform (SIFT), and the resulting corresponding features are sometimes referred to as “SIFT features”.
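The matching step can be sketched as a nearest-neighbor search over feature descriptors, followed by a ratio test that discards ambiguous matches. This is only an illustrative sketch: it takes descriptors that have already been computed, uses toy 2-D vectors in place of real SIFT descriptors, and the function name and the default threshold of 0.8 are assumptions made for this example.

```python
import math


def match_features(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where desc_a[i] is matched to desc_b[j].

    A match is kept only if the closest descriptor in desc_b is clearly
    closer than the second closest one (ratio test).
    """
    matches = []
    for i, da in enumerate(desc_a):
        # Distances from da to every descriptor of the second image, ascending.
        ranked = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(ranked) < 2:
            continue  # the ratio test needs at least two candidates
        (best, j), (second, _) = ranked[0], ranked[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


# Toy descriptors: two features in image A, three in image B.
image_a = [(0.0, 0.0), (10.0, 10.0)]
image_b = [(0.1, 0.0), (10.0, 10.2), (5.0, 5.0)]
print(match_features(image_a, image_b))
```

The ratio test is what keeps repeated structures, such as identical windows on a facade, from producing incorrect correspondences: when two candidates are about equally close, the match is dropped.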
Next, a Structure-From-Motion (SFM) algorithm, such as that described by Snavely et al. in the article entitled “Photo tourism: Exploring photo collections in 3-D” (ACM Transactions on Graphics, Vol. 25, pp. 835-846, 2006), is used to estimate camera parameters for each image. The camera parameters generally include extrinsic parameters that provide an indication of the camera position (including both a 3-D camera location and a pointing direction) and intrinsic parameters related to the image magnification.
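The role of the two kinds of camera parameters can be illustrated with a minimal pinhole projection. This sketch is an assumption made for illustration and is not the model used by any of the cited SFM algorithms: the extrinsic parameters are represented by a rotation matrix and a translation vector, and the intrinsic parameters by a single focal length in pixels plus a principal point. All names are hypothetical.

```python
def project_point(point_world, rotation, translation, focal_px, principal_point):
    """Project a 3-D world point to pixel coordinates with a pinhole camera.

    rotation (3x3 nested lists) and translation (length 3) are the extrinsic
    parameters; focal_px and principal_point are the intrinsic parameters.
    """
    # Extrinsics: move the point from world coordinates to camera coordinates.
    xc = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective division, then scale and shift into pixels.
    u = focal_px * xc[0] / xc[2] + principal_point[0]
    v = focal_px * xc[1] / xc[2] + principal_point[1]
    return (u, v)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(project_point((1, 2, 4), identity, (0, 0, 0), 800, (320, 240)))
```

An SFM algorithm solves the inverse problem: given many observed pixel positions of matched features, it estimates the camera parameters (and sparse 3-D points) that best reproduce those observations.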
Finally, a Multi-View-Stereo (MVS) algorithm is used to combine the images, the corresponding features and the camera parameters to generate a dense 3-D point cloud. Examples of MVS algorithms are described by Goesele et al. in the article “Multi-view stereo for community photo collections” (Proc. International Conference on Computer Vision, pp. 1-8, 2007), and by Jancosek et al. in the article “Scalable multi-view stereo” (Proc. International Conference on Computer Vision Workshops, pp. 1526-1533, 2009). However, due to scalability issues with the MVS algorithms, it has been found that these approaches are only practical for relatively small datasets (see: Seitz et al., “A comparison and evaluation of multi-view stereo reconstruction algorithms,” Proc. Computer Vision and Pattern Recognition, Vol. 1, pp. 519-528, 2006).
Methods to improve the efficiency of MVS algorithms have included using parallelization of the computations as described by Micusik et al. in an article entitled “Piecewise planar city 3D modeling from street view panoramic sequences” (Proc. Computer Vision and Pattern Recognition, pp. 2906-2912, 2009). Nevertheless, these methods generally require calculating a depth map for each image, and then merging the depth map results for further 3D reconstruction. Although these methods can calculate the depth maps in parallel, the depth maps tend to be noisy and highly redundant, which results in a waste of computational effort. Micusik et al. also proposed using a piece-wise planar depth map computation algorithm, and then fusing nearby depth maps, and merging the resulting depth maps to construct the 3D model.
To further improve the scalability, Furukawa et al., in an article entitled “Towards Internet-scale multi-view Stereo” (Proc. Computer Vision and Pattern Recognition, pp. 1063-6919, 2010), have proposed dividing the 3D model reconstruction process into several independent parts, and constructing them in parallel. However, this approach is not very effective in reducing the view redundancy for a frame sequence in a video.
Pollefeys et al., in articles entitled “Visual modeling with a handheld camera” (International Journal of Computer Vision, Vol. 59, pp. 207-232, 2004) and “Detailed real-time urban 3D reconstruction from video” (Int. J. Computer Vision, Vol. 78, pp. 143-167, 2008), have described real-time MVS systems designed to process a video captured by a hand-held camera. The described methods involve estimating a depth map for each video frame, and then using fusing and merging steps to build a mesh model. However, both methods are only suitable for highly structured datasets (e.g., street-view datasets obtained by a video camera mounted on a moving van). Unfortunately, for consumer videos taken using hand-held video cameras, the video frame sequences are more disordered and less structured than the videos that these methods were designed to process. More specifically, the camera trajectories for consumer videos are not smooth, and typically include a lot of overlap (i.e., frames captured at redundant locations).
In most cases, only some of the 3-D geometry information can be obtained from monocular videos, such as a depth map (see: Zhang et al., “Consistent depth maps recovery from a video sequence,” IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 31, pp. 974-988, 2009) or a sparse 3-D scene structure (see: Zhang et al., “3D-TV content creation: automatic 2-D-to-3-D video conversion,” IEEE Trans. on Broadcasting, Vol. 57, pp. 372-383, 2011). Image-based rendering (IBR) techniques are then commonly used to synthesize new views (for example, see the article by Zitnick et al. entitled “Stereo for image-based rendering using image over-segmentation,” International Journal of Computer Vision, Vol. 75, pp. 49-65, 2006, and the article by Fehn entitled “Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV,” Proc. SPIE, Vol. 5291, pp. 93-104, 2004).
With accurate geometry information, methods like light field (see: Levoy et al., “Light field rendering,” Proc. SIGGRAPH '96, pp. 31-42, 1996), lumigraph (see: Gortler et al., “The lumigraph,” Proc. SIGGRAPH '96, pp. 43-54, 1996), view interpolation (see: Chen et al., “View interpolation for image synthesis,” Proc. SIGGRAPH '93, pp. 279-288, 1993) and layered-depth images (see: Shade et al., “Layered depth images,” Proc. SIGGRAPH '98, pp. 231-242, 1998) can be used to synthesize reasonable new views by sampling and smoothing the scene. However, most IBR methods either synthesize a new view from only one original frame using little geometry information, or require accurate geometry information to fuse multiple frames.
Existing automatic approaches unavoidably confront two key challenges. First, geometry information estimated from monocular videos is not very accurate, and therefore cannot meet the requirements of current image-based rendering (IBR) methods. Examples of IBR methods are described by Zitnick et al. in the aforementioned article “Stereo for image-based rendering using image over-segmentation,” and by Fehn in the aforementioned article “Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV.” Such methods synthesize new virtual views by fetching the exact corresponding pixels in other existing frames. Thus, they can only synthesize good virtual view images when an accurate pixel correspondence map between the virtual views and the original frames is available, which in turn requires precise 3-D geometry information (e.g., a dense depth map and accurate camera parameters). While the required 3-D geometry information can be calculated from multiple synchronized and calibrated cameras as described by Zitnick et al. in the article “High-quality video view interpolation using a layered representation” (ACM Transactions on Graphics, Vol. 23, pp. 600-608, 2004), the determination of such information from a normal monocular video is still quite error-prone.
Second, the image quality that results from the synthesis of virtual views is typically degraded due to occlusion/disocclusion problems. Because of the parallax characteristics associated with different views, holes will be generated at the boundaries of occluding/disoccluded objects when one view is warped to another view in 3-D. Lacking accurate 3-D geometry information, hole filling approaches are not able to blend information from multiple original frames. As a result, they ignore the underlying connections between frames, and generally apply smoothing-like methods to fill holes. Examples of such methods include view interpolation (see: Chen et al., “View interpolation for image synthesis,” Proc. SIGGRAPH '93, pp. 279-288, 1993), extrapolation techniques (see: Cao et al., “Semi-automatic 2-D-to-3-D conversion using disparity propagation,” IEEE Trans. on Broadcasting, Vol. 57, pp. 491-499, 2011) and median filter techniques (see: Knorr et al., “Super-resolution stereo- and multi-view synthesis from monocular video sequences,” Proc. Sixth International Conference on 3-D Digital Imaging and Modeling, pp. 55-64, 2007). In principle, these methods cannot obtain the exact information for the missing pixels from other frames, and thus it is difficult to fill the holes correctly. In practice, the boundaries of occluding/disoccluded objects will be heavily blurred, which degrades the visual experience.
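The way holes arise at object boundaries can be seen in a minimal sketch that forward-warps a single image row. This is a simplified illustration and not the method of any cited reference: it assumes integer per-pixel disparities, a purely horizontal shift between the original and virtual views, and that larger disparity means closer to the camera. The function name is hypothetical.

```python
def warp_scanline(colors, disparities):
    """Forward-warp one image row into a virtual view.

    Each pixel moves left by its disparity. When two pixels land on the same
    target, the one with the larger disparity (the closer one) wins.
    Targets that receive no pixel stay None: these are the holes.
    """
    width = len(colors)
    warped = [None] * width
    best = [-1] * width  # disparity of the pixel currently stored at each target
    for x, (color, disparity) in enumerate(zip(colors, disparities)):
        target = x - disparity
        if 0 <= target < width and disparity > best[target]:
            warped[target] = color
            best[target] = disparity
    return warped


# 'F' is a foreground object (disparity 2) in front of background 'b' (disparity 0).
row = ["b", "b", "F", "F", "b", "b"]
disp = [0, 0, 2, 2, 0, 0]
print(warp_scanline(row, disp))
```

In this toy row, the foreground object shifts left and covers background pixels (occlusion), while the positions it vacated receive no pixel at all (disocclusion). The original frame contains no information about what lies behind the object, which is why a filling method that only looks at neighboring pixels in the same frame has to guess.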