1. Field of the Invention
Embodiments of the present invention generally relate to an improved method for performing video processing and, more particularly, the invention relates to a method and apparatus for aligning video to three-dimensional (3D) point clouds.
2. Description of the Related Art
In modern three-dimensional processing systems there is a need for a three-dimensional framework onto which two-dimensional video can be applied. Such a framework gives the viewer context for the two-dimensional video. For example, registering two-dimensional images onto a three-dimensional model of a scene has been recognized as an important step in many security applications, including three-dimensional modeling of urban environments. In such models, three-dimensional geometry and photometric information of the real world are recovered and registered to form virtual environments in which realistic synthetic views of existing scenes are created from a sparse set of still photographs. The goal of these applications is to quickly build photorealistic three-dimensional models with correct geometry. Existing approaches to building such a model generally involve three separate steps: 1) building a three-dimensional model using data from a three-dimensional sensor; 2) aligning two-dimensional images onto the three-dimensional model; and 3) texturing the three-dimensional model using the aligned images.
To use real images as texture of a model, alignment of video onto a given three-dimensional scene is used to determine the pose of each video frame with respect to the scene. Camera pose can be established by several well-known methods.
In one method, physical measurements are used to determine the camera's location and orientation. Such physical measurements are provided by a global positioning system (GPS) receiver and/or inertial sensors (INS) that are co-located with a moving video camera. When the camera is airborne, high pose accuracy and measurement frequency can be achieved with complex and costly airborne instrumentation in conjunction with the use of ground surveys and differential GPS base-stations. Such instrumentation is usually used in aerial LIDAR acquisition. However, in many airborne video scenarios, GPS/INS-based camera pose is not continuously available for every frame and is distorted by significant outliers, biases, and drift. Thus, GPS/INS-based camera pose measurements alone are not sufficient for accurately aligning video to three-dimensional scenes. Nevertheless, they provide important approximate information with which to bootstrap the video-based alignment and obtain a more accurate pose estimate.
In the class of two-dimensional frame to three-dimensional scene matching approaches, image appearance features such as high-contrast points and lines are matched to geometric scene features such as corners and edges. If a sufficient number of correct correspondences are identified, the pose parameters can be estimated for a frame. Most existing work for pose estimation belongs in this category.
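As an illustration of how a candidate pose can be checked against two-dimensional to three-dimensional correspondences, the following minimal sketch projects scene points through a pinhole camera and measures the mean reprojection error. The pinhole model, the function names, and the parameters (rotation R, translation t, focal length f, principal point cx, cy) are assumptions made for this example and are not taken from any of the methods described here.

```python
import math

def project(point, R, t, f, cx, cy):
    """Project a 3D scene point to pixel coordinates with a pinhole camera.

    R is a 3x3 rotation (list of rows) and t a 3-vector, so that the point
    in the camera frame is R * point + t. All names are hypothetical.
    """
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    return (f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy)

def mean_reprojection_error(points3d, points2d, R, t, f, cx, cy):
    """Average pixel distance between observed and projected features."""
    total = 0.0
    for scene_point, (u, v) in zip(points3d, points2d):
        pu, pv = project(scene_point, R, t, f, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(points3d)

# Example: one correspondence, evaluated under two candidate poses.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
scene = [(1.0, 2.0, 10.0)]
observed = [(10.0, 20.0)]
good = mean_reprojection_error(scene, observed, IDENTITY, (0, 0, 0), 100.0, 0.0, 0.0)
bad = mean_reprojection_error(scene, observed, IDENTITY, (1, 0, 0), 100.0, 0.0, 0.0)
```

Under these assumptions, a pose estimator would search for the R and t that make this error small over all correct correspondences; the error grows as the candidate pose moves away from the one that generated the observations.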
One existing method aligns far-range video to a digital elevation map (DEM). The method first recovers three-dimensional geometry from the video by applying stereoscopic analysis to a pair of adjacent frames or adjacent images, and then registers the geometry onto the DEM to obtain the camera position and heading. Based on the assumption that both the video camera and the photographic camera point downward and move in parallel to the ground, it is reasonable to represent the three-dimensional information in a two-dimensional image. This creates a compact two-dimensional representation, the so-called cliff map, which consists of edges and high-curvature points. The drawback of this approach is that reducing full three-dimensional information to a two-dimensional cliff image limits the algorithm's ability to handle oblique three-dimensional poses.
In another example of a process to register video to a three-dimensional model, two-dimensional and three-dimensional lines are registered to align video to the three-dimensional model. In addition, a bundle adjustment can be implemented to obtain accurate pose estimation for a sequence of images. Specifically, the method first projects selected three-dimensional line segments onto an image via the current camera pose. Then, the method refines the pose by maximizing the integral of the gradient energy map (computed from the image) along the projected three-dimensional line segments. Three-dimensional curves/lines can be projected onto two-dimensional images in the same way. In that case, however, the cost function becomes the sum of the distances measured from the image edge points to the nearest projected curves/lines.
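The line-based scoring step described above can be sketched as follows. This is a minimal illustration only: the pinhole projection, the midpoint sampling of the energy map, and all function and parameter names are assumptions for the example, not the implementation of the referenced method. A pose refinement procedure would vary the pose to maximize the returned score.

```python
import math

def project(point, R, t, f, cx, cy):
    """Pinhole projection of a 3D point; camera frame is R * point + t."""
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    return (f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy)

def line_energy(energy, p0, p1, samples=50):
    """Approximate the integral of an energy map along a 2D segment.

    energy is a list of rows (energy[row][col]); p0 and p1 are (x, y)
    pixel positions. Samples falling outside the image contribute zero.
    """
    height, width = len(energy), len(energy[0])
    length = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
    total = 0.0
    for k in range(samples):
        s = (k + 0.5) / samples  # midpoint of the k-th sub-interval
        col = int(round(p0[0] + s * (p1[0] - p0[0])))
        row = int(round(p0[1] + s * (p1[1] - p0[1])))
        if 0 <= row < height and 0 <= col < width:
            total += energy[row][col]
    return total * length / samples

def pose_score(energy, segments3d, R, t, f, cx, cy):
    """Sum of gradient energy collected along all projected 3D segments."""
    score = 0.0
    for a, b in segments3d:
        score += line_energy(energy, project(a, R, t, f, cx, cy),
                             project(b, R, t, f, cx, cy))
    return score

# Example: a 5x5 energy map with a strong vertical edge in column 2.
energy_map = [[1.0 if col == 2 else 0.0 for col in range(5)] for _ in range(5)]
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
segments = [((0.0, -2.0, 1.0), (0.0, 2.0, 1.0))]
aligned = pose_score(energy_map, segments, IDENTITY, (0, 0, 0), 1.0, 2.0, 2.0)
shifted = pose_score(energy_map, segments, IDENTITY, (1, 0, 0), 1.0, 2.0, 2.0)
```

In this toy case the pose that places the projected segment on the image edge collects more energy than the shifted pose, which is the property a maximization over pose parameters would rely on.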
One common limitation of these two-dimensional to three-dimensional approaches is that they cannot handle scenes lacking easily extracted lines and curves. In addition, it is not easy for these approaches to correct for large three-dimensional pose error due to significant occlusions. Finally, it is time-consuming and sometimes difficult to obtain accurate three-dimensional lines from noisy range images, since planar segmentation of the range data is a prerequisite for line extraction.
Therefore, there is a need in the art for an improved method of performing two-dimensional data to three-dimensional data alignment/registration.