Not Applicable.
1. Field of Invention
The present invention relates generally to image and video processing and, more particularly, to systems and methods for generating three-dimensional models from two-dimensional image sequences.
2. Description of the Background
This invention concerns the automatic generation of a three-dimensional (3D) description of the real-world environment. Target applications are found in several fields, including digital video, virtual reality, and robotics. In digital video, 3D models enable content-based representations suitable for powerful video editing and handling, video content addressing, very low bit rate compression, transmission of video over wireless and wireline links, and frame rate conversion. In virtual reality, 3D models of real-world objects facilitate the merging of virtual and real entities and the virtual manipulation of real-world objects. In robotics, the navigation of autonomous mobile vehicles in unknown environments demands the automatic generation of 3D models of the environment.
The problem of recovering the 3D structure (i.e., 3D shape and 3D motion) from a 2D video sequence has been widely considered by the computer vision community. Methods that infer 3D shape from a single frame are based on cues such as shading and defocus. These methods fail to give reliable 3D shape estimates for unconstrained real-world scenes. If no prior knowledge about the scene is available, the cue to estimating the 3D structure is the 2D motion of the brightness pattern in the image plane. For this reason, the problem is generally referred to as structure from motion. The two major steps in structure from motion are usually the following: compute the 2D motion in the image plane; and estimate the 3D shape and the 3D motion from the computed 2D motion.
Early approaches to structure from motion processed a single pair of consecutive frames and provided existence and uniqueness results for the problem of estimating 3D motion and absolute depth from the 2D motion in the camera plane between two frames. Two-frame algorithms are highly sensitive to image noise and, when the object is far from the camera (i.e., at a distance that is large compared to the object depth), they fail even at low levels of image noise. More recent research has been oriented towards the use of longer image sequences. These techniques require filtering algorithms that integrate a set of two-frame depth estimates over time. All of these approaches require the computationally intensive task of computing an estimate of the absolute depth as an intermediate step.
According to one known technique which does not require computing of an estimate of the absolute depth as an intermediate step, the 3D positions of the feature points are expressed in terms of Cartesian coordinates in a world-centered coordinate system, and the images are modeled as orthographic projections. The 2D projection of each feature point is tracked along the image sequence. The 3D shape and motion are then estimated by factorizing a measurement matrix whose entries are the set of trajectories of the feature point projections. The factorization of the measurement matrix, which is rank 3 in a noiseless situation, is computed by using a Singular Value Decomposition (SVD) expansion technique.
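The rank 3 property that this known factorization relies on can be checked numerically. The following is a minimal sketch, not the cited technique itself: it assumes a cube as the object, an arbitrary illustrative camera motion, and hypothetical helper names (`rotation`, `measurement_matrix`, `register`, `numerical_rank`). It builds the measurement matrix from noiseless orthographic projections, removes the translation by subtracting each row's mean, and computes the rank by Gaussian elimination rather than by SVD.

```python
import math


def rotation(yaw, pitch):
    """3x3 rotation (list of rows) built from two angles; an illustrative choice."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    return [[sum(rx[i][k] * ry[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def measurement_matrix(points, rotations, translations):
    """Rows 2f and 2f+1 hold the u and v image coordinates of all points in frame f."""
    rows = []
    for rot, (tu, tv) in zip(rotations, translations):
        rows.append([sum(rot[0][k] * p[k] for k in range(3)) + tu for p in points])
        rows.append([sum(rot[1][k] * p[k] for k in range(3)) + tv for p in points])
    return rows


def register(rows):
    """Subtract each row's mean, which removes the per-frame translation."""
    return [[x - sum(r) / len(r) for x in r] for r in rows]


def numerical_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    a = [r[:] for r in rows]
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda i: abs(a[i][col]))
        if abs(a[pivot][col]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for i in range(rank + 1, n_rows):
            factor = a[i][col] / a[rank][col]
            for j in range(col, n_cols):
                a[i][j] -= factor * a[rank][j]
        rank += 1
    return rank


# Assumed test scene: the 8 corners of a cube, observed over 5 frames.
points = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]
rotations = [rotation(0.2 * f, 0.1 * f) for f in range(5)]
translations = [(0.5 * f, -0.3 * f) for f in range(5)]

W = register(measurement_matrix(points, rotations, translations))
rank = numerical_rank(W)
print(len(W), "x", len(W[0]), "measurement matrix, numerical rank", rank)
```

Every feature trajectory adds a column to this matrix, which is why a dense shape description makes the factorization large.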
When the goal is the recovery of a dense representation of the 3D shape, the SVD factorization approach may not solve the problem satisfactorily because of two drawbacks. First, because the approach is feature-based, a large number of features must be tracked to obtain a dense description of the 3D shape. This is usually impossible because only distinguished points, such as brightness corners, can be accurately tracked. Second, even if it were possible to track a large number of features, the computational cost of the SVD involved in the factorization of the rank 3 measurement matrix would be very high. Thus, the decomposition and normalization stages involved in the factorization approach are complicated and the approach is more susceptible to noise.
Accordingly, there exists a need in the prior art for an efficient and less computationally-intensive technique to recover 3D structure from a 2D image sequence. There further exists a need for such a technique to be robust to noise. There further exists a need for such a technique to model occluded objects.
The present invention is directed to a system for generating a three-dimensional model of an object from a two-dimensional image sequence. According to one embodiment, the system includes: an image sensor for capturing a sequence of two-dimensional images of a scene, the scene including the object; a two-dimensional motion filter module in communication with the image sensor for determining from the sequence of images a plurality of two-dimensional motion parameters for the object; and a three-dimensional structure recovery module in communication with the two-dimensional motion filter module for estimating a set of three-dimensional shape parameters and a set of three-dimensional motion parameters from the set of two-dimensional motion parameters using a rank 1 factorization of a matrix. The system may also include a three-dimensional shape refinement module to refine the estimate of the three-dimensional shape using a coarse-to-fine continuation-type method.
According to another embodiment, the present invention is directed to a method for generating a three-dimensional model of an object from a two-dimensional image sequence. The method includes: capturing a sequence of images of a scene, the scene including the object; determining a plurality of two-dimensional motion parameters for the object from the sequence of images; and estimating a set of three-dimensional shape parameters and a set of three-dimensional motion parameters from the two-dimensional motion parameters using a rank 1 factorization of a matrix. The method may also include refining the estimate of the three-dimensional shape using a coarse-to-fine continuation-type method.
The present invention provides numerous advantages over prior approaches for determining structure from motion in a video sequence. For instance, in contrast to prior techniques that also estimate 3D shape directly instead of estimating the absolute depth as an intermediate step, the present invention does not rely on the tracking of pointwise features. Instead, the present invention uses a parametric description of the shape and the induced optical flow parameterization. This approach yields several advantages. First, the tracking of feature points may be unreliable when processing noisy video sequences. As a result, to alleviate this situation in the prior art, it is known to assume a very short interval between frames for easy tracking. In contrast, according to the present invention, the 3D shape and motion parameters may be estimated from a sequence of just a few flow parameters, avoiding the need to process a large set of feature trajectories. Second, the relation between the optical flow parameters and the 3D rotation and 3D shape parameters enables the recovery of the 3D structure according to the approach of the present invention by a rank 1 factorization of a matrix, instead of the factorization of a rank 3 matrix as in the prior art. Consequently, the decomposition and normalization stages involved in the factorization approach of the present invention are simpler and more robust to noise.
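To illustrate why a rank 1 problem is simpler, the sketch below factors a matrix into a single pair of vectors by power iteration, which needs only repeated matrix-vector products rather than a full SVD. This is a generic rank 1 approximation under stated assumptions, not the specific matrix construction of the invention: the function name `rank1_factor`, the fixed iteration count, and the example vectors `a` and `b` are all hypothetical.

```python
import math


def rank1_factor(matrix, iterations=100):
    """Return (u, v) such that matrix[i][j] is approximately u[i] * v[j].

    Power iteration on matrix^T * matrix; v has unit norm and u carries the scale.
    Assumes the matrix is nonzero and that the all-ones starting vector is not
    orthogonal to the dominant right singular vector.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return u, v


# Assumed example: an exactly rank 1 matrix built as the outer product of a and b.
a = [1.0, -2.0, 0.5]
b = [2.0, 1.0, 3.0, -1.0]
M = [[ai * bj for bj in b] for ai in a]

u, v = rank1_factor(M)
max_error = max(abs(u[i] * v[j] - M[i][j]) for i in range(3) for j in range(4))
print("largest reconstruction error:", max_error)
```

The factors are recovered only up to a scale shared between `u` and `v`, so a normalization step is still needed to fix that scale.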
In addition, the approach of the present invention handles generally shaped structures. It is well suited to the analysis of scenes with polyhedral surfaces. This is particularly relevant in outdoor modeling of buildings with flat walls, where the optical flow model reduces to the affine motion model. Further, the present invention provides an advantage over the prior art in that the surface-based 2D shape representations (i.e., the 3D shape) are not restricted to a sparse set of 3D points, but rather may be represented by, for example, planar patches. In addition, the factorization approach of the present invention permits the assignment of different reliability weights to different 2D motion parameters, thus improving the quality of the 3D structure recovered without additional computational cost. Additionally, the present invention permits the 3D modeling of occluded objects.
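The affine motion model mentioned above describes the 2D motion of an entire planar patch with six parameters. The sketch below uses the standard six-parameter form as an assumption, since the text does not spell out its parameterization. The names `affine_flow` and `fit_affine`, the parameter ordering, and the sample values are hypothetical. It evaluates the model at a pixel and recovers the six parameters from the flow at three non-collinear pixels.

```python
def affine_flow(params, x, y):
    """Displacement (du, dv) at pixel (x, y) under a six-parameter affine model."""
    a0, a1, a2, b0, b1, b2 = params
    return (a0 + a1 * x + a2 * y, b0 + b1 * x + b2 * y)


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, rhs):
    """Solve a 3x3 linear system by Cramer's rule (assumes det3(m) != 0)."""
    d = det3(m)
    solution = []
    for c in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][c] = rhs[r]
        solution.append(det3(mc) / d)
    return solution


def fit_affine(pixels, flows):
    """Recover the six parameters from the flow at three non-collinear pixels."""
    m = [[1.0, x, y] for x, y in pixels]
    a = solve3(m, [f[0] for f in flows])
    b = solve3(m, [f[1] for f in flows])
    return a + b


# Assumed example values: one patch, three sample pixels.
true_params = [0.5, 0.01, -0.02, -0.3, 0.03, 0.0]
pixels = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
flows = [affine_flow(true_params, x, y) for x, y in pixels]

estimated = fit_affine(pixels, flows)
print(estimated)
```

With noisy flow, a least-squares fit over many pixels of the patch would replace the exact three-point solve used here.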
These and other benefits of the present invention will be apparent from the detailed description hereinbelow.