Humans perceive the world in three spatial dimensions. Unfortunately, most of the images and videos created today are 2-D in nature. If we were able to imbue these images and videos with 3-D information, not only would we increase their functionality, we could dramatically increase our enjoyment of them as well. However, imbuing 2-D images and video with 3-D information often requires completely reconstructing the scene from the original 2-D data depicted. A given set of images can be used to create a model of the observer (camera/viewpoint) together with models of the objects in the scene (to a sufficient level of detail) enabling the generation of realistic alternate perspective images of the scene. A model of a scene thus contains the geometry and associated image data for the objects in the scene as well as the geometry for the cameras used to capture those images.
A number of technologies have been proposed and, in some cases, implemented to perform a conversion of one or several two dimensional images into one or several stereoscopic three dimensional images. The conversion of two dimensional images into three dimensional images involves creating a pair of stereoscopic images for each three dimensional frame. The stereoscopic images can then be presented to a viewer's left and right eyes using a suitable display device. The image information between respective stereoscopic images differs according to the calculated spatial relationships between the objects in the scene and the viewer of the scene. The difference in the image information enables the viewer to perceive the three dimensional effect.
An example of a conversion technology is described in U.S. Pat. No. 6,477,267 (the '267 patent). In the '267 patent, only selected objects within a given two dimensional image are processed to receive a three dimensional effect in a resulting three dimensional image. In the '267 patent, an object is initially selected for such processing by outlining the object. The selected object is assigned a “depth” value that is representative of the relative distance of the object from the viewer. A lateral displacement of the selected object is performed for each image of a stereoscopic pair of images that depends upon the assigned depth value. Essentially, a “cut-and-paste” operation occurs to create the three dimensional effect. The simple displacement of the object creates a gap or blank region in the object's background. The system disclosed in the '267 patent compensates for the gap by “stretching” the object's background to fill the blank region.
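By way of illustration only, the cut-and-paste displacement and gap filling described above may be sketched on a single scanline as follows. This sketch is not the implementation of the '267 patent. The function names, the linear mapping from depth to pixel shift, the `max_shift` parameter, and the use of nearest-background replication as a crude stand-in for "stretching" are all assumptions made for the example.

```python
def nearest_background(out, is_obj, x):
    """Return the closest filled, non-object pixel value to column x (or None)."""
    width = len(out)
    for d in range(1, width):
        for nx in (x - d, x + d):
            if 0 <= nx < width and out[nx] is not None and not is_obj[nx]:
                return out[nx]
    return None


def shift_object(row, mask, shift):
    """Cut the masked object out of `row`, paste it `shift` columns away,
    and fill the exposed gap from the neighbouring background."""
    width = len(row)
    out = [None] * width
    is_obj = [False] * width
    # Keep the background; object positions are left as holes (None).
    for x in range(width):
        if not mask[x]:
            out[x] = row[x]
    # Paste the object at its laterally displaced position.
    for x in range(width):
        if mask[x] and 0 <= x + shift < width:
            out[x + shift] = row[x]
            is_obj[x + shift] = True
    # Fill what remains of the hole (assumed stand-in for background stretching).
    for x in range(width):
        if out[x] is None:
            out[x] = nearest_background(out, is_obj, x)
    return out


def stereo_pair(row, mask, depth, max_shift=2):
    """Hypothetical mapping: depth in [0, 1], 0 = nearest, gives a larger shift
    for nearer objects. The two images receive opposite displacements."""
    shift = round(max_shift * (1.0 - depth))
    return shift_object(row, mask, shift), shift_object(row, mask, -shift)


row = [10, 11, 99, 99, 14, 15]             # 99 marks the selected object
mask = [False, False, True, True, False, False]
first, second = stereo_pair(row, mask, depth=0.5)
print(first)   # [10, 11, 11, 99, 99, 15]
print(second)  # [10, 99, 99, 14, 14, 15]
```

In each output row the pixel adjacent to the displaced object is a duplicate of a background pixel, which is the kind of visible distortion that grows as the displacement grows.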
The approach of the '267 patent has a number of limitations. Specifically, the stretching operations cause distortion of the background being stretched. The distortion needs to be minimized to reduce visual anomalies. The amount of stretching also corresponds to the disparity or parallax between an object and its background and is a function of their relative distances from the observer. Thus, the relative distances of interacting objects must be kept small.
Another example of a conversion technology is described in U.S. Pat. No. 6,466,205 (the '205 patent). In the '205 patent, a sequence of video frames is processed to select objects and to create “cells” or “mattes” of selected objects that substantially only include information pertaining to their respective objects. A partial occlusion of a selected object by another object in a given frame is addressed by temporally searching through the sequence of video frames to identify other frames in which the same portion of the first object is not occluded. Accordingly, a cell may be created for the full object even though the full object does not appear in any single frame. The advantage of such processing is that gaps or blank regions do not appear when objects are displaced in order to provide a three dimensional effect. Specifically, a portion of the background or other object that would be blank may be filled with graphical information obtained from other frames in the temporal sequence. Accordingly, the rendering of the three dimensional images may occur in an advantageous manner.
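As a minimal sketch of the temporal search described above, the following example assembles a complete cell for an object from several frames in which different portions of the object are occluded. It is not the method of the '205 patent. The example assumes that the object has already been segmented and registered so that the same index refers to the same object pixel in every frame, that `None` marks an occluded pixel, and that the nearest frame in time is preferred; the function name `build_full_cell` is hypothetical.

```python
def build_full_cell(frames, target):
    """Return the object cell for frame `target`, filling occluded pixels
    (None) from the temporally nearest frame in which they are visible."""
    cell = list(frames[target])
    # Visit frames in order of increasing temporal distance from the target.
    search_order = sorted(range(len(frames)), key=lambda t: abs(t - target))
    for i, value in enumerate(cell):
        if value is not None:
            continue
        for t in search_order:
            if frames[t][i] is not None:
                cell[i] = frames[t][i]
                break
    return cell


# Three frames of the same four-pixel object; None = occluded in that frame.
frames = [
    [1, 2, None, None],
    [1, None, None, 4],
    [None, 2, 3, 4],
]
print(build_full_cell(frames, target=1))  # [1, 2, 3, 4]
```

A pixel that is occluded in every frame remains `None` in this sketch, which corresponds to a region for which no graphical information can be recovered from the sequence.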
In reconstructing these scenes, features in the 2-D images, such as edges of objects, often need to be identified and extracted, and their positions ascertained relative to the camera. Differences in the 3-D positions of various object features, coupled with differing camera positions for multiple images, result in relative differences in the 3-D to 2-D projections of the features that are captured in the 2-D images. By determining the positions of features in 2-D images, and comparing the relative locations of these features in images taken from differing camera positions, the 3-D positions of the features may be determined.
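The recovery of a 3-D position from two 2-D observations may be illustrated, under strongly simplifying assumptions, by the following sketch. It assumes two identical pinhole cameras with parallel optical axes, the second displaced from the first by a known `baseline` along the horizontal axis, a known `focal_length` in pixels, and image coordinates measured from the principal point. General camera configurations require a more complete formulation; the function name is hypothetical.

```python
def triangulate_rectified(x_left, x_right, y, focal_length, baseline):
    """Recover (X, Y, Z) in the first camera's frame from the horizontal
    image positions of one feature in two horizontally displaced cameras."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("feature must have positive disparity")
    # Depth is inversely proportional to disparity: Z = f * B / d.
    z = focal_length * baseline / disparity
    # Back-project the first camera's image coordinates at that depth.
    return (x_left * z / focal_length, y * z / focal_length, z)


# A feature observed at x = 100 px in the first image and x = 80 px in the
# second, with f = 800 px and a 0.1 unit baseline, lies about 4 units away.
point = triangulate_rectified(100.0, 80.0, 40.0, focal_length=800.0, baseline=0.1)
print(point)
```

Because depth varies inversely with disparity, small errors in the measured 2-D feature positions produce large depth errors for distant features, which is one reason adequate variation in camera perspective matters for reconstruction.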
However, fundamental problems still exist with current conversion methods. For example, a typical motion picture will have a very large and predetermined image set, which (for the purposes of camera and scene reconstruction) may contain extraneous or poorly lit images, have inadequate variations in perspective, and contain objects with changing geometry and image data. Nor can the known conversion methods take advantage of the processor-saving aspects of other applications, such as robot navigation applications that, while having to operate in real time using verbose and poor-quality images, can limit attention to specific areas of interest and have no need to synthesize image data for segmented objects.
In addition, existing methods of conversion are not ideally suited for scene reconstruction. The reasons for this include excessive computational burden, inadequate facility for scene refinement, and the fact that the point clouds extracted from the images do not fully express model-specific geometry, such as lines and planes. The excessive computational burden often arises because these methods correlate all of the extracted features across all frames used for the reconstruction in a single step. Additionally, existing methods may not provide for adequate interactivity with a user that could leverage user knowledge of scene content for improving the reconstruction.
The existing techniques are also not well suited to the 2-D to 3-D conversion of material such as motion pictures. Existing techniques typically cannot account for dynamic objects, usually use point clouds as models, which are not adequate for rendering, and do not accommodate very large sets of input images. These techniques also typically do not accommodate varying levels of detail in scene geometry, do not allow for additional geometric constraints on object or camera models, do not provide a means to exploit shared geometry between distinct scenes (e.g., same set, different props), and do not provide for interactive refinement of a scene model.