Stereo and multi-view imaging has a long and rich history stretching back to the early days of photography. Stereo cameras employ multiple lenses to capture two images of the scene from points of view that are typically horizontally displaced. Such image pairs are displayed to the left and right eyes of a human viewer to let the viewer experience an impression of three dimensions (3D). The human visual system then merges information from the pair of different images to achieve the perception of depth.
Stereo cameras can come in any number of configurations. For example, a lens and a sensor unit can be attached to a port on a traditional single-view digital camera to enable the camera to capture two images from slightly different points of view, as described in U.S. Pat. No. 7,102,686 to Orimoto et al., entitled “Image-capturing apparatus having multiple image capturing units.” In this configuration, the lenses and sensors of each unit are similar and enable the interchangeability of parts. U.S. Patent Application Publication 2008/0218611 to Parulski et al., entitled “Method and apparatus for operating a dual lens camera to augment an image,” discloses another camera configuration having two lenses and image sensors that can be used to produce stereo images.
In another line of teaching, there are situations where a stereo image (or video) is desired, but only a single-view image (or video) has been captured. The problem of forming a stereo image from conventional two-dimensional (2D) images is known as 2D-to-3D conversion, and has been addressed in the art. For example, Guttmann et al., in the article “Semi-automatic stereo extraction from video footage” (Proc. IEEE International Conference on Computer Vision, pp. 136-142, 2009), teach a semi-automatic approach (using user input in the form of scribbles) for converting each image of a video to stereo.
Hoiem et al., in the article “Automatic Photo Pop-up” (ACM Transactions on Graphics, Vol. 24, pp. 577-584, 2005), describe a method for estimating the 3D geometry from a 2D image and producing images that represent what the scene might look like from another viewpoint.
Saxena et al., in the article “Make3d: Learning 3D scene structure from a single still image” (IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 31, pp. 824-840, 2009), describe a method for estimating 3D structure from a single still image in an unconstrained environment. The method uses a Markov Random Field trained via supervised learning to model both image depth cues and the relationships between different parts of the image.
Ideses et al., in the article “Real-time 2D to 3D video conversion” (Journal of Real-Time Image Processing, Vol. 2, pp. 3-9, 2007) describe a method to extract stereo pairs from video sequences. The method makes use of MPEG motion estimation that can be obtained in the decoding stage of a video. The magnitude of the optical flow between consecutive image frames associated with MPEG motion estimation is used as a depth map, as if a parallel camera acquired the images. Next, a second view for a stereo pair is resampled from the current frame using the depth map; the pixel values of the next frame are not used to generate the second view. With this approach, abrupt rotations and small transitions of a camera, which are frequently present in general 2D videos, violate the assumption of a parallel camera and can produce undesirable results.
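The core resampling step described above can be sketched as follows: each pixel of the current frame is shifted horizontally in proportion to its depth value to synthesize the second view of the stereo pair. This is a minimal illustrative sketch, assuming a depth map normalized to [0, 1] and a hypothetical `max_disparity` parameter; the function names and details are not taken from the cited article.

```python
import numpy as np

def synthesize_second_view(frame, depth, max_disparity=8):
    """Resample a second stereo view from a single frame by shifting
    each pixel horizontally in proportion to its depth value.

    frame: (H, W, 3) array; depth: (H, W) array normalized to [0, 1].
    Larger depth values produce larger horizontal shifts.
    Illustrative sketch only; parameter names are assumptions.
    """
    h, w = depth.shape
    # Per-pixel horizontal shift (in pixels), rounded to the nearest integer.
    disparity = np.rint(max_disparity * depth).astype(int)
    cols = np.arange(w)
    # Source column for each target pixel, clipped at the image border.
    src = np.clip(cols[None, :] - disparity, 0, w - 1)
    rows = np.arange(h)[:, None]
    # Gather the shifted pixels row by row to form the second view.
    return frame[rows, src]
```

In the approach of Ideses et al., the depth map fed to such a step would come from the magnitude of the MPEG motion vectors between consecutive frames; as noted above, this parallel-camera assumption breaks down under abrupt camera rotations.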
In another line of teaching, U.S. Pat. No. 7,643,657, to Dufaux et al., entitled “System for selecting a keyframe to represent a video,” teaches a method for selecting key frames in a video sequence based on finding shot boundaries and considering other features such as spatial activity and skin detection. However, key frame extraction does not provide a method for forming a stereo image from a video sequence.