1. Field of the Invention
The present invention relates generally to a method, apparatus, and article of manufacture for locating one or several cameras in three-dimensional (3D) space based on two-dimensional (2D) pictures from the camera(s). Embodiments of the invention may also apply to the field of augmented reality.
2. Description of the Related Art
In 3D computer applications, it is desirable to simulate a real world location/object/image in 3D space. To create a 3D computer representation, 2D data is received/input and used to reconstruct the 3D representation. For example, 2D photographs and videos may be used as a foundation to create a 3D representation of the image depicted. To create the 3D representation, a requisite basic computation is that of determining the location/viewpoint of the camera in 3D space. Once the locations of the viewpoints are determined, the different viewpoints can be combined to reconstruct the 3D representation. Prior art methodologies are computationally intensive and fail to provide an efficient and easy mechanism for computing a viewpoint and creating a 3D representation. Such problems may be better understood with a more detailed explanation of 3D applications, representations, and prior art systems for constructing 3D representations based on 2D data.
2D images from photographs and videos can be used for a wide range of 3D applications. In augmented reality, a user may simulate a 3D environment over 2D data (e.g., for an architectural project, one may desire to create a virtual building with video footage). In photogrammetry/3D image-based modeling, a user may desire to recreate something that exists in the real world. Based on multiple images (e.g., of a single location from different viewpoints), a user may also desire to automatically produce a 3D scene, i.e., automatic multi-view 3D reconstruction (e.g., combining separate images of a left view, right view, top view, etc. of an object to recreate a single 3D view of the object). Alternatively, in an organized picture collection (such as Photosynth™ available from Microsoft™), various 2D pictures may be organized in 3D space to recreate a 3D scene that a user can view and move around in (the application performs a global computation of the locations of all of the pictures such that pictures in the collection can be rotated in a 3D oriented manner). In another example, in movie special effects, the motion of the real camera may be needed so that when a virtual object (e.g., a virtual dinosaur) is rendered, the virtual object is synchronized with the camera footage.
The basic computation used in all of the above identified applications is that of identifying the location/viewpoint of the camera in 3D space. Once the locations/viewpoints are identified, a virtual 3D world can be created (and synchronized with any video footage if desired). Prior art methods use images and attempt to match up 2D points across the images in order to estimate camera placement. As part of this process, there are several different parameters that need to be estimated so that the images are properly localized in space. Prior art methods for determining such parameters and camera placement are expensive in both time and computation.
In addition to actual commercial products, algorithms used in the research community to determine the location/viewpoint of the camera are also computationally intensive. Such algorithms rely on estimating location when the only information available is that of the image itself (e.g., no additional parameters, such as camera location, orientation, angle, etc., are known). In this regard, points of interest in the images are tracked and used to establish a correspondence across the images (between the points). For example, a point that represents a window in a corner of a screen may be matched/mapped with a pixel in the image on the screen. Signal processing then provides a number of point correspondences across images. With the points, various equations are then solved to determine where the images in 2D space correspond with a 3D environment (e.g., point X in the image in the corner of the screen corresponds to point Y in 3D space).
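As an illustration of how a point in 3D space may relate to a point in a 2D image, the following is a minimal sketch that assumes an ideal pinhole camera without lens distortion. The pinhole model, the function name `project_point`, and all numeric values are assumptions introduced for illustration only; they are not taken from any particular prior art system.

```python
def project_point(point_3d, rotation, translation, focal_length, principal_point):
    """Project a 3D world point to 2D pixel coordinates (assumed pinhole model).

    rotation is a 3x3 matrix (list of rows) and translation a 3-vector; together
    they map world coordinates into the camera's coordinate frame.
    """
    # Transform the world point into camera coordinates: X_cam = R * X + t.
    cam = [
        sum(rotation[i][j] * point_3d[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    x, y, z = cam
    if z <= 0:
        raise ValueError("point lies behind the camera or on the image plane")
    # Perspective division, then shift by the principal point (image center).
    u = focal_length * x / z + principal_point[0]
    v = focal_length * y / z + principal_point[1]
    return (u, v)


# Hypothetical camera: no rotation, no translation, 800-pixel focal length.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((1.0, 2.0, 4.0), IDENTITY, (0.0, 0.0, 0.0), 800.0, (320.0, 240.0))
print(pixel)  # (520.0, 640.0)
```

Estimating camera location amounts to the inverse problem: given many observed 2D points such as `pixel`, recover the unknown rotation, translation, and 3D points that would produce them.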
Alternatively, various parameters of the camera may be known (e.g., location of camera, orientation, focal length, distortion of lens, etc.). Using all known parameters, prior art computation of the location/viewpoint is processor intensive and relies on optimization. For example, with one hundred (100) cameras and one hundred (100) points, each camera has three (3) rotations and three (3) translations (six [6] parameters per camera), and each point has three (3) (x,y,z) coordinates. Accordingly, one hundred (100) cameras result in six hundred (600) parameters, and one hundred (100) points add three hundred (300) coordinates, for a total of nine hundred (900) parameters for a simple problem. Consequently, using all parameters is processor intensive and not feasible for high resolution data or for any type of data where real-time processing is desirable.
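The parameter count in the example above can be reproduced with a short sketch. The function name `count_unknowns` and its argument names are hypothetical and introduced only for illustration; the per-camera and per-point counts follow the example (six parameters per camera, three coordinates per point).

```python
def count_unknowns(num_cameras, num_points, params_per_camera=6, coords_per_point=3):
    """Return (camera parameters, point coordinates, total) for a joint optimization."""
    # Each camera contributes 3 rotations + 3 translations by default.
    camera_params = num_cameras * params_per_camera
    # Each 3D point contributes its (x, y, z) coordinates by default.
    point_params = num_points * coords_per_point
    return camera_params, point_params, camera_params + point_params


print(count_unknowns(100, 100))  # (600, 300, 900)
```

Because the total grows linearly with both the number of cameras and the number of points, an optimization over all parameters at once becomes large even for modest inputs.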
In view of the above limitations, attempts have been made to expedite the location determination process. One such prior art technique by Daniel Martinec introduces a global process of “structure from motion” by computing camera 3D positions and orientations from 2D point matches in images. Martinec's technique is performed in two stages by first estimating all rotations, followed by estimating all translations (see Daniel Martinec and Tomas Pajdla, “Robust Rotation and Translation Estimation in Multiview Reconstruction”, In Proceedings of the Computer Vision and Pattern Recognition conference 2007, IEEE, Minneapolis, Minn., USA, June 2007, which is incorporated by reference herein). However, Martinec's translation estimation involves a heavy non-linear optimization process that is computationally intensive and unsuitable for real-time performance.
Accordingly, what is needed is the capability to solve for the location of cameras in a computationally and time efficient manner.