Humans perceive the world in three spatial dimensions. Unfortunately, most of the images and videos created today are 2D in nature. If we were able to imbue these images and videos with 3D information, not only would we increase their functionality, we could dramatically increase our enjoyment of them as well. However, imbuing 2D images and video with 3D information often requires completely reconstructing the depicted scene from the original 2D data. A given set of images can be used to create a model of the observer, together with models of the objects in the scene (to a sufficient level of detail), enabling the generation of realistic alternate-perspective images of the scene. A model of a scene thus contains the geometry and associated image data for the objects in the scene as well as the geometry for the cameras used to capture those images.
In reconstructing these scenes, features in the 2D images, such as edges of objects, often need to be extracted and their positions ascertained relative to the camera. Differences in the 3D positions of various object features, coupled with differing camera positions for multiple images, result in relative differences in the 3D-to-2D projections of the features that are captured in the 2D images. By determining the positions of features in 2D images, and comparing the relative locations of these features in images taken from differing camera positions, the 3D positions of the features may be determined.
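The relationship described above can be illustrated with a minimal sketch. It is not a method taken from this document: it assumes idealized pinhole cameras that all look along the +Z axis with no rotation, a shared focal length, and known camera centers. The names `project`, `back_project_ray`, and `triangulate_midpoint` are hypothetical. The sketch projects one 3D feature into two views and then recovers its position as the midpoint of the closest approach between the two viewing rays.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def project(point: Vec3, camera_center: Vec3, focal_length: float = 1.0) -> Vec2:
    """Project a 3D point into an unrotated pinhole camera looking along +Z (assumption)."""
    dx = point[0] - camera_center[0]
    dy = point[1] - camera_center[1]
    dz = point[2] - camera_center[2]
    if dz <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal_length * dx / dz, focal_length * dy / dz)


def back_project_ray(image_point: Vec2, focal_length: float = 1.0) -> Vec3:
    """Direction of the viewing ray through an image point (not normalized)."""
    return (image_point[0], image_point[1], focal_length)


def _dot(u: Vec3, v: Vec3) -> float:
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def triangulate_midpoint(c1: Vec3, x1: Vec2, c2: Vec3, x2: Vec2,
                         focal_length: float = 1.0) -> Vec3:
    """Estimate a 3D point from its 2D positions in two views.

    Finds the closest points on the two viewing rays and returns their midpoint.
    """
    d1 = back_project_ray(x1, focal_length)
    d2 = back_project_ray(x2, focal_length)
    w = (c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2])

    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays: the two views carry no depth information for this feature.
        raise ValueError("viewing rays are parallel; camera positions must differ")

    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    on_ray1 = tuple(c1[i] + s * d1[i] for i in range(3))
    on_ray2 = tuple(c2[i] + t * d2[i] for i in range(3))
    return tuple((on_ray1[i] + on_ray2[i]) / 2.0 for i in range(3))


if __name__ == "__main__":
    feature = (0.5, -0.2, 4.0)     # hypothetical 3D feature
    cam_a = (0.0, 0.0, 0.0)        # hypothetical camera centers
    cam_b = (1.0, 0.0, 0.0)
    xa = project(feature, cam_a)   # the feature lands at different 2D positions
    xb = project(feature, cam_b)   # because the camera positions differ
    print(xa, xb)
    print(triangulate_midpoint(cam_a, xa, cam_b, xb))
```

With noise-free inputs the two rays intersect and the midpoint coincides with the original feature. With real, noisy feature locations the rays generally do not intersect, which is why this sketch uses the midpoint of closest approach rather than an exact intersection.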
One technique, known as camera calibration, uses multiple 2D images of a scene captured from different camera perspectives. A set of point correspondences may then be found, which allows calculation of geometric attributes such as the position and orientation of the camera for each image. This leads to the determination of 3D coordinates for features found in the 2D images. Many current methods of camera calibration, such as those used for robot vision and satellite imaging, are geared toward full automation. M. Pollefeys, et al., “Visual Modeling with a Hand-Held Camera,” International Journal of Computer Vision, September, 2004, pages 207-232, Volume 59, Number 3, Kluwer Academic Publishers, Manufactured in The Netherlands, describes a procedure using a hand-held video camera for recreating a 3D scene. In this process, a camera operator is in control of the camera, and collects images of an object from multiple perspectives. The images in the video sequence are then processed to obtain a reconstruction of the object that is suitable for stereoscopic 3D projection.
However, fundamental problems still exist with current camera calibration methods. For example, a typical motion picture will have a very large and predetermined image set, which (for the purposes of camera and scene reconstruction) may contain extraneous or poorly lit images, have inadequate variations in perspective, and contain objects with changing geometry and image data. Nor can the known camera calibration methods take advantage of the processor-saving aspects of other applications, such as robot navigation applications that, while having to operate in real time using verbose and poor-quality images, can limit attention to specific areas of interest and have no need to synthesize image data for segmented objects.
In addition, existing methods of camera calibration are not ideally suited for scene reconstruction. The reasons for this include excessive computational burden, inadequate facility for scene refinement, and the fact that the point clouds extracted from the images do not fully express model-specific geometry, such as lines and planes. The excessive computational burden often arises because these methods correlate all of the extracted features across all frames used for the reconstruction in a single step. Additionally, existing methods may not provide for adequate interactivity with a user that could leverage user knowledge of scene content for improving the reconstruction.
The existing techniques are also not well suited to the 2D to 3D conversion of content such as motion pictures. Existing techniques typically cannot account for dynamic objects, usually use point clouds as models, which are not adequate for rendering, and do not accommodate very large sets of input images. These techniques also typically do not accommodate varying levels of detail in scene geometry, do not allow for additional geometric constraints on object or camera models, do not provide a means to exploit shared geometry between distinct scenes (e.g., same set, different props), and do not support interactive refinement of a scene model.