1. Technical Field
A “Stereoscopic Video Converter” provides various techniques for automatically converting arbitrary 2D video sequences into perceptually plausible stereoscopic or “3D” versions based on estimations of dense depth maps for every frame of the input video sequence.
2. Background Art
In recent years, 3D videos have become increasingly popular. In this context, the term “3D” implies the presentation of stereoscopic video that displays separate corresponding images to the left and right eyes to convey the sense of depth to the viewer. As 3D viewing technology becomes more widespread, various techniques have been developed for converting monoscopic (2D) videos into 3D. In general, there are two primary classes of 2D-to-3D conversion: semi-automatic and fully automatic. For semi-automatic methods, a user generally guides the conversion process, e.g., by annotating depth on various frames, drawing rough occlusion boundaries, or iteratively refining automatic estimates. Semi-automatic methods are interesting because they attempt to improve automatic results with some user intervention; however, most of this work focuses on the user interfaces for applying these methods rather than on developing the underlying conversion methods themselves.
Most automatic 2D-to-3D work has focused on obtaining a dense depth map, i.e., a relative depth measurement at each pixel, from which a stereo view can be rendered. Such a depth map can be easily converted to an inverse-depth or disparity map. One major drawback of these methods is that most require some assumptions about the scene and camera movement. Furthermore, rendering new views from depth maps remains an open research problem.
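The conversion from depth to disparity mentioned above can be illustrated with a minimal sketch. It assumes a rectified (parallel) virtual stereo pair, for which disparity in pixels is the focal length times the baseline divided by depth. The function name `depth_to_disparity` and the parameters `focal_px` and `baseline` are hypothetical and are not taken from any particular system.

```python
def depth_to_disparity(depth, focal_px, baseline):
    """Convert a dense depth map to a disparity map.

    Assumes a rectified stereo geometry, so disparity = focal_px * baseline / depth.
    depth is a list of rows of positive depths; non-positive depths are treated
    as unknown and mapped to zero disparity (an illustrative convention only).
    """
    return [
        [focal_px * baseline / z if z > 0 else 0.0 for z in row]
        for row in depth
    ]


# Nearer pixels (smaller depth) receive larger disparity.
example = depth_to_disparity([[1.0, 2.0], [4.0, 0.0]], focal_px=100.0, baseline=0.5)
```

Because disparity is proportional to inverse depth, a relative (unscaled) depth map is sufficient for rendering a plausible stereo view; the product of focal length and baseline then acts only as a user-adjustable depth-strength factor.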
Structure from Motion (SfM), a method for capturing robust 3D positions of points in images, can be used when multiple camera views of a static scene are available. SfM output is generally sparse, but methods exist for obtaining dense point-based reconstructions, and SfM methods have been used to obtain super-resolution stereoscopic videos. When a video sequence is available, various techniques have been used to synthesize dense surface (mesh) reconstructions. Using a bundle optimization framework, methods exist for obtaining temporally consistent depth maps rather than explicitly reconstructing the 3D scene. It has also been shown that graph-based depth inference from multiple views under second-order smoothness priors is tractable and leads to plausible results. In contrast, rather than estimating exact structure as with SfM, other techniques estimate rough planar geometry of a scene for reconstruction and depth estimation, given multiple photographs. With active learning, similar methods have been applied to single images.
Depth from motion parallax is another method for obtaining dense disparity. These methods typically use optical flow techniques to estimate motion parallax, which in turn can be used to hypothesize per-pixel depth. Depth from motion parallax methods can work for dynamic scenes, but are prone to tracking failures, e.g., due to noise, textureless surfaces, and sharp motion.
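The idea of hypothesizing depth from motion parallax can be sketched as follows. This is a simplified illustration only: it assumes the optical flow field has already been estimated, and that the camera translates laterally over a static scene so that flow magnitude grows with inverse depth. The function name `relative_disparity_from_flow` is hypothetical.

```python
import math


def relative_disparity_from_flow(flow):
    """Map a per-pixel optical flow field to relative disparity in [0, 1].

    flow is a list of rows of (u, v) displacement pairs. Under the stated
    assumption (lateral camera translation, static scene), larger flow
    magnitude suggests a nearer surface.
    """
    mags = [[math.hypot(u, v) for (u, v) in row] for row in flow]
    peak = max((m for row in mags for m in row), default=0.0)
    if peak == 0.0:
        # No measurable motion: parallax carries no depth cue.
        return [[0.0 for _ in row] for row in mags]
    return [[m / peak for m in row] for row in mags]
```

When the assumptions fail, for example with independently moving objects or flow errors on textureless surfaces, the hypothesized depth is unreliable, which corresponds to the tracking failures noted above.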
Regardless of how depth is determined, for generating new views (including stereo video), it is generally not enough to know scene depth. The primary issue with depth-based techniques is filling holes of unknown color that are guaranteed to appear during the depth-image-based rendering (DIBR) process of synthesizing a new view. Holes are most commonly filled using linear interpolation or even Poisson blending (note that such techniques are sometimes based on solving Laplace's equation to minimize gradients in the hole regions). Inpainting has also been used for hole filling. Further, when sufficient views exist, occlusion information may be found in other frames for use in filling holes. Unfortunately, most of these methods tend to produce unnatural artifacts near occlusion boundaries, thus degrading the appearance of the resulting 3D image or video sequence.
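The appearance of holes during DIBR, and their filling by linear interpolation, can be illustrated on a single scanline. The following is a minimal sketch under simplifying assumptions: grayscale values, integer-rounded horizontal disparity, and a new view formed by shifting each pixel left by its disparity. The function names `render_scanline` and `fill_holes_linear` are hypothetical.

```python
def render_scanline(colors, disparity):
    """Forward-warp one scanline; unfilled target pixels remain None (holes)."""
    width = len(colors)
    out = [None] * width
    best = [float("-inf")] * width  # largest disparity (nearest surface) wins
    for x, (c, d) in enumerate(zip(colors, disparity)):
        tx = x - int(round(d))
        if 0 <= tx < width and d > best[tx]:
            out[tx] = c
            best[tx] = d
    return out


def fill_holes_linear(row):
    """Fill each run of None values by interpolating between its neighbors."""
    filled = list(row)
    n = len(filled)
    x = 0
    while x < n:
        if filled[x] is not None:
            x += 1
            continue
        start = x
        while x < n and filled[x] is None:
            x += 1
        left = filled[start - 1] if start > 0 else None
        right = filled[x] if x < n else None
        if left is None and right is None:
            return filled  # the entire row is a hole; nothing to interpolate
        if left is None:
            left = right   # hole touches the left border: replicate
        if right is None:
            right = left   # hole touches the right border: replicate
        span = x - start + 1
        for i in range(start, x):
            t = (i - start + 1) / span
            filled[i] = (1 - t) * left + t * right
    return filled


# A foreground object (disparity 2) in front of a background (disparity 0).
warped = render_scanline([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], [0, 0, 2, 2, 0, 0])
# warped is [30.0, 40.0, None, None, 50.0, 60.0]: a hole opens beside the object.
filled = fill_holes_linear(warped)
```

In this example the interpolated values blend foreground and background colors across the disoccluded region, which is one source of the unnatural artifacts near occlusion boundaries described above.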
One alternative to the depth-based synthesis of a stereo view is through image warping (homographies). First, camera parameter and motion estimates are made from an input monoscopic video sequence. Then, synthetic stereo viewpoints are defined automatically, and suitable original views are selected and warped to become corresponding stereo frames. Iterative optimization has shown improved results, and real-time variants exist.
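The warping step in such approaches maps pixel coordinates through a 3x3 homography. A minimal sketch of that mapping for a single point is given below; it assumes the homography matrix has already been estimated, and the function name `apply_homography` is hypothetical.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography H given as nested lists.

    The point is lifted to homogeneous coordinates (x, y, 1), multiplied by H,
    and divided by the resulting third component.
    """
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return xh / w, yh / w


# A pure translation by (5, -2) expressed as a homography.
shift = [[1.0, 0.0, 5.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
moved = apply_homography(shift, 3.0, 4.0)
```

Warping an entire frame applies this mapping (usually its inverse, with resampling) to every pixel. Because a single homography is exact only for planar scenes or pure camera rotation, this approach avoids per-pixel depth but depends on selecting suitable original views.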
Several reconstruction methods for single images of real, unknown scenes have also been proposed. One such method creates convincing reconstructions of outdoor images by assuming that an image can be broken into a few planar surfaces. A related technique is adapted to reconstructing indoor scenes. Another technique provides a supervised learning strategy for predicting depth from a single image to create realistic reconstructions for general scenes. Better depth estimates have been achieved by incorporating semantic labels and more sophisticated learning techniques. Further, if more information about the scene is known, other single-image techniques can be applied, e.g., shape-from-shading, shape-from-texture, 2.1D sketch, etc. Also, when repetitive structures exist, various techniques can be used to acquire dense depth from a single image.
Various techniques avoid depth estimation by synthesizing a new view from a collection of known views while ensuring that the local image statistics of the synthesized view match the local properties of the observed views. Such methods are generally slow, and can take several hours to synthesize each view. Consequently, the utility of such techniques for converting entire video sequences is rather limited. Furthermore, such methods are subject to unpredictable synthesis in regions that are not visible in any of the original views.