In general, stereoscopic image conversion is a process that involves converting two-dimensional (2D) images or video into three-dimensional (3D) stereoscopic images or video. In one conventional process, a stereoscopic image can be generated by combining two monoscopic views (left and right eye perspective views) captured by two separate cameras positioned at different points, where each camera captures a 2D perspective image of a given scene, and whereby the two 2D perspective images are combined to form a 3D or stereoscopic image. In other conventional methods, 3D or stereoscopic images are generated from original 2D monoscopic images captured by a single video camera, whereby corresponding 2D monoscopic image pairs are estimated using information extracted from the original 2D images. With such methods, the original 2D image can be established as the left perspective view providing a first view of a stereoscopic pair of images, while a corresponding right perspective view, generated by processing the original 2D image, provides a second view of the stereoscopic image pair.
In one particular conventional scheme, 2D to 3D conversion systems can be configured to generate stereoscopic image pairs from a single sequence of 2D monoscopic images (e.g., a 2D video image sequence) using camera motion data that is estimated between sequential 2D images in the source image data. With such techniques, the input 2D image data is often a video segment that is captured with camera motion. After the camera motion is analyzed, the right image can then be derived from the 2D image in the input video and the inferred camera motion. 2D to 3D conversion systems can be used to convert 2D formatted image and video data (movies, home videos, games, etc.) into stereoscopic images to enable 3D viewing of the 2D formatted source image data. Together with the first (original) image sequence, the second (derived) image sequence makes it possible to view the originally two-dimensional images in three dimensions when the first and second image sequences are transmitted to the left and right eyes, respectively.
Conventional approaches for generating stereoscopic image pairs from a sequence of 2D images using camera motion use depth maps, which are computed from the video image data and the estimated camera motion, to render/generate the corresponding stereoscopic image pair. In general, these techniques involve estimating camera motion between two consecutive 2D images in a monoscopic sequence of 2D images such that they become a canonical stereo pair, followed by depth estimation to extract depth maps from the two or more consecutive images using the estimated camera motion. The estimated depth maps are then used to re-render the left eye image into the right eye image. More specifically, assuming two consecutive 2D images, $L_i$ and $L_{i+1}$, where the input 2D image sequence is deemed to provide the left eye perspective views, a depth map can be estimated from the 2D images by minimizing a cost function $F(D \mid L_i, L_{i+1})$ with respect to the depth map $D$. Assuming the optimal depth map is $\hat{D}_i$, the right image can be rendered by a rendering function: $R_i = \mathrm{Render}(L_i, \hat{D}_i)$.
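The rendering step described above can be sketched as a depth-based forward warp. This is a minimal illustration and not the specific rendering function of any conventional system. It assumes a rectified (canonical) stereo geometry in which each pixel shifts only horizontally, by a disparity equal to focal length times baseline divided by depth. The function name render_right_view and the parameters focal_px and baseline are hypothetical and introduced only for this example.

```python
import numpy as np

def render_right_view(left, depth, focal_px=500.0, baseline=0.06):
    """Forward-warp a grayscale left image into a hypothetical right view.

    Assumption: canonical stereo geometry, so a pixel at depth Z moves
    horizontally by disparity = focal_px * baseline / Z (in pixels).
    Returns the rendered image and a mask of pixels that received a value.
    """
    h, w = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    zbuf = np.full((h, w), np.inf)  # nearest depth written to each target pixel
    disparity = np.rint(focal_px * baseline / depth).astype(int)
    for y in range(h):
        for x in range(w):
            xr = x - disparity[y, x]  # points shift left in the right view
            # Keep the nearest surface when several pixels land on one target.
            if 0 <= xr < w and depth[y, x] < zbuf[y, xr]:
                zbuf[y, xr] = depth[y, x]
                right[y, xr] = left[y, x]
                filled[y, xr] = True
    return right, filled
```

Target pixels that no source pixel maps to remain unfilled in the returned mask. A practical renderer would need to fill such holes, which this sketch does not attempt.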
This conventional approach works well, theoretically, if the depth map can be accurately estimated. An advantage of this approach is that camera motion can be arbitrary. On a practical level, however, the depth estimation process is problematic and, in most cases, the depth map is corrupted by noise. As a result, the estimated depth map will contain a noise component: $\hat{D}_i = D_i + D_i^{\mathrm{error}}$, where $D_i$ is the true depth map, and $D_i^{\mathrm{error}}$ is the error component. When rendering the right eye image, the error component would be propagated and most likely be magnified by the rendering function, resulting in undesirable artifacts.
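To make the propagation of depth error concrete, the following sketch computes how far a rendered pixel is displaced when the depth estimate is wrong. It assumes the same simple pinhole stereo model in which disparity equals focal length times baseline divided by depth. That model, the function name disparity_error, and the default parameter values are illustrative assumptions and are not taken from the conventional systems described here.

```python
def disparity_error(true_depth, depth_error, focal_px=500.0, baseline=0.06):
    """Pixel displacement in the rendered view caused by a depth error.

    Assumption: disparity = focal_px * baseline / depth, so the same
    absolute depth error displaces near pixels far more than distant ones.
    """
    d_true = focal_px * baseline / true_depth
    d_noisy = focal_px * baseline / (true_depth + depth_error)
    return d_noisy - d_true

# The same 0.1-unit depth underestimate at two different true depths.
near_shift = disparity_error(1.0, -0.1)
far_shift = disparity_error(5.0, -0.1)
```

Under these assumptions the displacement is nonlinear in depth, so noise on nearby surfaces produces much larger rendering errors than the same noise on distant surfaces, which is one way the artifacts described above can arise.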
Other conventional methods based on camera motion use planar transform techniques to avoid depth map computation, but such approaches have significant limitations as applied to video data with general, arbitrary camera motion. In general, planar transformation techniques involve estimating camera motion from the input video sequence by, for example, computing a fundamental matrix using adjacent frames. The estimated camera motion parameters are then used to derive a planar transformation matrix that is used to transform the current image in the input video image sequence to the hypothetical right eye image. However, the transformation matrix can only be derived when the camera motion contains only horizontal movement. If the camera also moves in the vertical direction, vertical parallax is created, and vertical parallax cannot be removed under any planar transformation. Most depth perception (i.e., the 3D or stereo effect in viewing a scene) is obtained in a generally horizontal plane rather than in a vertical plane, because the viewer's eyes are usually spaced apart in a generally horizontal plane and the respective views are seen according to the stereo base defined by the distance between the viewer's two eyes. As such, vertical motion or disparity between a pair of sequential images can be incorrectly interpreted by a 2D to 3D conversion system as motion indicative of depth. The planar transform can, however, remove camera rotation and zooming, and can therefore create the canonical stereoscopic image pair (i.e., the left and right images are aligned to have the same focal length and parallel focal planes). Under these conditions, if camera motion is limited to horizontal translation, the input video stream of 2D images can be treated as a series of stereo image pairs with small separations.
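The behavior of the planar transform can be illustrated with a small numerical sketch. It assumes an ideal pinhole camera whose intrinsics and rotation are already known. In the conventional methods these would have to be estimated, for example from a fundamental matrix, and that estimation is not shown. All function names below are hypothetical. The sketch shows that a single 3x3 transform removes rotation and zoom, and that the residual offset is purely horizontal only when the camera translation is horizontal.

```python
import numpy as np

def intrinsics(focal_px):
    """Pinhole intrinsic matrix with the principal point at the origin."""
    return np.array([[focal_px, 0.0, 0.0], [0.0, focal_px, 0.0], [0.0, 0.0, 1.0]])

def rotation_y(angle_rad):
    """Rotation about the vertical axis (a camera pan)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, R, center, points):
    """Project N x 3 world points with a camera at `center` and rotation R."""
    cam = (points - center) @ R.T
    pix = cam @ K.T
    return pix[:, :2] / pix[:, 2:3]

def rectifying_transform(K_ref, K_cur, R_cur):
    """Planar transform that undoes the rotation and zoom of the current view."""
    return K_ref @ R_cur.T @ np.linalg.inv(K_cur)

def apply_transform(H, pixels):
    """Apply a 3x3 planar transform to N x 2 pixel coordinates."""
    homog = np.hstack([pixels, np.ones((len(pixels), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Hypothetical scene points and cameras, chosen only for illustration.
points = np.array([[0.2, 0.1, 2.0], [-0.3, 0.4, 5.0], [0.5, -0.2, 3.0]])
K_ref, K_cur, R_cur = intrinsics(500.0), intrinsics(650.0), rotation_y(0.05)
H = rectifying_transform(K_ref, K_cur, R_cur)
left = project(K_ref, np.eye(3), np.zeros(3), points)

# Horizontal translation only: rows match the left view after the transform.
horiz = apply_transform(H, project(K_cur, R_cur, np.array([0.06, 0.0, 0.0]), points))

# Added vertical translation: a depth-dependent row offset remains.
vert = apply_transform(H, project(K_cur, R_cur, np.array([0.06, 0.03, 0.0]), points))
```

In this sketch the row coordinates of horiz equal those of left, while vert retains a vertical offset that differs from point to point with depth, consistent with the statement that vertical parallax cannot be removed by a planar transformation.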