Stereoscopic videos are regarded as the next prevalent media for movies, TV programs, and video games. Three-dimensional (3-D) movies have achieved great success in providing extremely vivid visual experiences. The fast development of stereoscopic display technologies and the popularization of 3-D television have inspired people's desire to record their own 3-D videos and display them at home. However, professional stereoscopic recording cameras are very rare and expensive. Meanwhile, there is a great demand to perform 3-D conversion on legacy two-dimensional (2-D) videos. Unfortunately, specialized and complicated interactive 3-D conversion processes are currently required, which has prevented the general public from converting captured 2-D videos to 3-D videos. Thus, it is a significant goal to develop an approach to automatically synthesize stereoscopic video from a casual monocular video.
Much research has been devoted to 2-D to 3-D conversion techniques for the purpose of generating stereoscopic videos, and significant progress has been made in this area. Fundamentally, the process of generating stereoscopic videos involves synthesizing the synchronized left and right stereo view sequences based on an original monocular view sequence. Although it is an ill-posed problem, a number of approaches have been designed to address it. Such approaches generally involve the use of human interaction or other priors. According to the level of human assistance, these approaches can be categorized as manual, semiautomatic or automatic techniques. Manual and semiautomatic methods typically involve an enormous level of human annotation work. Automatic methods utilize extracted 3-D geometry information to synthesize new views for virtual left-eye and right-eye images.
Manual approaches typically involve manually assigning different disparity values to pixels of different objects, and then shifting these pixels horizontally by their disparities to produce a sense of parallax. Any holes generated by this shifting operation are filled manually with appropriate pixels. An example of such an approach is described by Harman in the article “Home-based 3-D entertainment—an overview” (Proc. International Conference on Image Processing, Vol. 1, pp. 1-4, 2000). These methods generally require extensive and time-consuming human interaction.
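The shifting operation described above, and the holes it leaves behind, can be illustrated with a minimal sketch. This is not the method of the cited article; the function name `shift_by_disparity`, the integer disparities, the sign convention (positive values shift to the right), and the rule that a larger disparity wins when two pixels land on the same target are all assumptions made for illustration.

```python
import numpy as np

def shift_by_disparity(image, disparity):
    """Forward-warp a view by shifting each pixel horizontally by its disparity.

    image: (H, W) or (H, W, C) array; disparity: (H, W) array of integer pixels.
    Returns (warped, hole_mask); hole_mask is True where no source pixel landed.
    """
    h, w = disparity.shape
    warped = np.zeros_like(image)
    hole_mask = np.ones((h, w), dtype=bool)
    best = np.full((h, w), -np.inf)  # largest disparity written to each target
    for y in range(h):
        for x in range(w):
            d = int(disparity[y, x])
            tx = x + d  # target column after the horizontal shift
            if 0 <= tx < w and d > best[y, tx]:
                # Assumption: a larger disparity means a nearer object,
                # so it overwrites a farther pixel at the same target.
                warped[y, tx] = image[y, x]
                best[y, tx] = d
                hole_mask[y, tx] = False
    return warped, hole_mask
```

Target positions that receive no source pixel remain marked in `hole_mask`; these are the holes that a manual workflow would have to paint in by hand.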
Semi-automatic approaches only require the users to manually label a sparse set of 3-D information (e.g., with user-marked scribbles or strokes) for a subset of the video frames for a given shot (e.g., the first and last video frames, or key video frames) to obtain the dense disparity or depth map. Examples of such techniques are described by Guttmann et al. in the article “Semi-automatic stereo extraction from video footage” (Proc. IEEE 12th International Conference on Computer Vision, pp. 136-142, 2009) and by Cao et al. in the article “Semi-automatic 2-D-to-3-D conversion using disparity propagation” (IEEE Trans. on Broadcasting, Vol. 57, pp. 491-499, 2011). The 3-D information for other video frames is propagated from the manually labeled frames. However, the results may degrade significantly if the video frames in one shot are not very similar. Moreover, these methods can only be applied to simple scenes that have only a few depth layers, such as foreground and background layers. Otherwise, extensive human annotations are still required to discriminate each depth layer.
Automatic approaches can be classified into two categories: non-geometric and geometric methods. Non-geometric methods directly render new virtual views from one nearby video frame in the monocular video sequence. One method of this type is the time-shifting approach described by Zhang et al. in the article “Stereoscopic video synthesis from a monocular video” (IEEE Trans. Visualization and Computer Graphics, Vol. 13, pp. 686-696, 2007). Such methods generally require the original video to be an over-captured image set. They also are unable to preserve the 3-D geometry information of the scene.
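The basic idea of time shifting can be sketched as pairing each frame with a nearby frame that stands in for the other eye. The sketch below is only an illustration of that idea, not the procedure of the cited article; the function name `time_shifted_pairs` and the fixed `offset` parameter are hypothetical.

```python
def time_shifted_pairs(num_frames, offset):
    """Pair each frame index with a nearby frame index used as the other eye.

    offset is a hypothetical fixed frame delay (an assumption of this sketch).
    Returns a list of (left_index, right_index) tuples.
    """
    pairs = []
    for t in range(num_frames):
        # Clamp so that the paired index stays inside the sequence.
        other = min(max(t + offset, 0), num_frames - 1)
        pairs.append((t, other))
    return pairs
```

Because the second view is simply another captured frame, no scene geometry is estimated or preserved, which is consistent with the limitation noted above.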
Geometric methods generally consist of two main steps: exploration of the underlying 3-D geometry information and synthesis of a new virtual view. For some simple scenes captured under stringent conditions, the full and accurate 3-D geometry information (e.g., a 3-D model) can be recovered as described by Pollefeys et al. in the article “Visual modeling with a handheld camera” (International Journal of Computer Vision, Vol. 59, pp. 207-232, 2004). Then, a new view can be rendered using conventional computer graphics techniques.
In most cases, only some of the 3-D geometry information can be obtained from monocular videos, such as a depth map (see: Zhang et al., “Consistent depth maps recovery from a video sequence,” IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 31, pp. 974-988, 2009) or a sparse 3-D scene structure (see: Zhang et al., “3D-TV content creation: automatic 2-D-to-3-D video conversion,” IEEE Trans. on Broadcasting, Vol. 57, pp. 372-383, 2011). Image-based rendering (IBR) techniques are then commonly used to synthesize new views (for example, see the article by Zitnick et al. entitled “Stereo for image-based rendering using image over-segmentation,” International Journal of Computer Vision, Vol. 75, pp. 49-65, 2006, and the article by Fehn entitled “Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV,” Proc. SPIE, Vol. 5291, pp. 93-104, 2004).
With accurate geometry information, methods like light field (see: Levoy et al., “Light field rendering,” Proc. SIGGRAPH '96, pp. 31-42, 1996), lumigraph (see: Gortler et al., “The lumigraph,” Proc. SIGGRAPH '96, pp. 43-54, 1996), view interpolation (see: Chen et al., “View interpolation for image synthesis,” Proc. SIGGRAPH '93, pp. 279-288, 1993) and layered-depth images (see: Shade et al., “Layered depth images,” Proc. SIGGRAPH '98, pp. 231-242, 1998) can be used to synthesize reasonable new views by sampling and smoothing the scene. However, most IBR methods either synthesize a new view from only one original frame using little geometry information, or require accurate geometry information to fuse multiple frames.
Existing automatic approaches unavoidably confront two key challenges. First, geometry information estimated from monocular videos is not very accurate and cannot meet the requirements of current image-based rendering (IBR) methods. Examples of IBR methods are described by Zitnick et al. in the aforementioned article “Stereo for image-based rendering using image over-segmentation,” and by Fehn in the aforementioned article “Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV.” Such methods synthesize new virtual views by fetching the exact corresponding pixels in other existing frames. Thus, they can only synthesize good virtual view images based on an accurate pixel correspondence map between the virtual views and original frames, which requires precise 3-D geometry information (e.g., a dense depth map and accurate camera parameters). While the required 3-D geometry information can be calculated from multiple synchronized and calibrated cameras as described by Zitnick et al. in the article “High-quality video view interpolation using a layered representation” (ACM Transactions on Graphics, Vol. 23, pp. 600-608, 2004), the determination of such information from a normal monocular video is still quite error-prone.
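The dependence of pixel correspondence on depth accuracy can be made concrete with a small sketch. It assumes a rectified, parallel two-camera geometry, where the disparity of a point is d = f * B / Z (focal length f in pixels, baseline B, depth Z). That geometry, the function names, and the numeric values in the usage note are illustrative assumptions and do not come from the cited articles.

```python
def depth_to_disparity(depth, focal_px, baseline):
    """Disparity in pixels for a point at the given depth: d = f * B / Z.

    Assumes a rectified, parallel camera pair (an assumption of this sketch).
    depth and baseline share one length unit; focal_px is in pixels.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline / depth

def correspondence_error(true_depth, estimated_depth, focal_px, baseline):
    """Pixel offset between the true and the estimated corresponding position."""
    return abs(depth_to_disparity(estimated_depth, focal_px, baseline)
               - depth_to_disparity(true_depth, focal_px, baseline))
```

For example, with an assumed focal length of 1000 pixels and a baseline of 0.06 m, estimating a depth of 2.2 m for a point that is actually at 2.0 m displaces the fetched pixel by roughly 2.7 pixels, so the rendering step would fetch the wrong pixel.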
Second, the image quality that results from the synthesis of virtual views is typically degraded due to occlusion/disocclusion problems. Because of the parallax characteristics associated with different views, holes will be generated at the boundaries of occlusion/disocclusion objects when one view is warped to another view in 3-D. Lacking accurate 3-D geometry information, hole-filling approaches are not able to blend information from multiple original frames. As a result, they ignore the underlying connections between frames, and generally perform smoothing-like methods to fill holes. Examples of such methods include view interpolation (see: the aforementioned article by Chen et al. entitled “View interpolation for image synthesis”), extrapolation techniques (see: the aforementioned article by Cao et al. entitled “Semi-automatic 2-D-to-3-D conversion using disparity propagation”) and median filter techniques (see: Knorr et al., “Super-resolution stereo- and multi-view synthesis from monocular video sequences,” Proc. Sixth International Conference on 3-D Digital Imaging and Modeling, pp. 55-64, 2007). Theoretically, these methods cannot obtain the exact information for the missing pixels from other frames, and thus it is difficult to fill the holes correctly. In practice, the boundaries of occlusion/disocclusion objects will be blurred greatly, which will thus degrade the visual experience.
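A minimal sketch of a smoothing-like hole filler illustrates why such methods cannot recover the true content. It is a simple row-wise extrapolation written for illustration only, not the algorithm of any of the cited articles; the function name `fill_holes_by_extrapolation` and the boolean `hole_mask` input are assumptions.

```python
import numpy as np

def fill_holes_by_extrapolation(image, hole_mask):
    """Fill each hole pixel from the nearest valid pixel in the same row.

    The nearest valid pixel to the left is used; if there is none,
    the first valid pixel to the right is used instead.
    image: (H, W) or (H, W, C) array; hole_mask: (H, W) bool, True = hole.
    """
    filled = image.copy()
    h, w = hole_mask.shape
    for y in range(h):
        valid = np.flatnonzero(~hole_mask[y])  # columns holding real data
        if valid.size == 0:
            continue  # nothing to copy from in this row
        for x in np.flatnonzero(hole_mask[y]):
            left = valid[valid < x]
            src = left[-1] if left.size else valid[0]
            filled[y, x] = image[y, src]
    return filled
```

Every filled value is a copy of a neighbor in the same warped image, so a wide hole becomes a horizontal streak. The pixels that were actually hidden in this view, and that may be visible in other frames, are never consulted.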