1. Field of the Invention
The present invention relates to electronic image representation, and more particularly to a system and method for recovering the shape and texture of objects recorded on a medium such as videotape, and subsequently producing digitized, three-dimensional models of such objects.
2. Description of the Related Art
The advent of electronic equipment capable of processing digitally encoded pictorial images has substantially extended the computer's reach into traditional print and photographic media. Electronic imaging now includes the ability to digitize an observed scene into individual picture elements, or "pixels," with extremely high fidelity and resolution; to store and manipulate an image at the global or individual pixel level; to control its presentation (e.g., by inserting or withdrawing occlusive image elements, selectively enlarging or reducing selected elements, altering illumination, and translating or rotating objects in real-time); and to analyze elements of the digitally recorded scene through the use of various mathematical and graphical techniques.
"Machine vision" researchers attempt to reproduce computationally the attributes and capabilities of human vision, as well as to extend those capabilities in particular domains. One important goal of such research is to produce, in real-time, a computer model of a scene or object by analysis of its digital representation. That representation might be generated, for example, by digitizing a video segment containing a series of sequential views taken from different angles; the views collectively describe the scene or object in the sense of containing a complete record of all relevant visual elements.
Obtaining a three-dimensional model of an object from such a representation is sometimes called the "structure from motion" ("SfM") problem, since the true object structure must be deduced from successive video frames that reflect not only altered perspective, but also the effects of camera geometry and the videographer's own movements. For example, in making a video recording of a building by walking around its perimeter, the videographer will likely alter her distance from the building in the course of her excursion, and will probably rotate the camera (along any of three possible axes) as well. The SfM system must therefore interpret changing two-dimensional image information in terms both of changing three-dimensional object perspective and relative camera orientation.
To treat this problem mathematically, an arbitrary set of feature points (based, most commonly, on identification of intersecting object curves) is tracked through the sequence of image frames and analyzed numerically. The unknowns in this analysis are the depths of the feature points and the rigid-body transformations between frames. Isolating those elements of interframe inconsistency attributable to object structure rather than camera motion represents a nonlinear task, and is usually approached numerically by minimizing a nonlinear objective function.
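As an illustration only, the kind of nonlinear objective function referred to above can be sketched as a sum of squared reprojection errors for a single frame. The sketch assumes a pinhole projection u = f X/Z, v = f Y/Z and a rigid-body motion given as a rotation matrix and a translation vector; the function names `rigid_transform` and `reprojection_error` are hypothetical and are not drawn from any particular prior-art system.

```python
def rigid_transform(point, rotation, translation):
    """Apply a rigid-body motion: rotate (3x3 nested list), then translate."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def reprojection_error(points, rotation, translation, f, observations):
    """Sum of squared differences between observed and predicted (u, v).

    points       -- feature-point coordinates (X, Y, Z) in a reference frame
    rotation     -- assumed 3x3 rotation matrix for this frame
    translation  -- assumed translation (tx, ty, tz) for this frame
    f            -- focal length (assumed known in this sketch)
    observations -- tracked image locations (u, v), one per feature point
    """
    total = 0.0
    for point, (u_obs, v_obs) in zip(points, observations):
        x, y, z = rigid_transform(point, rotation, translation)
        # Assumed pinhole projection; the division by depth makes the
        # objective nonlinear in both structure and motion unknowns.
        u, v = f * x / z, f * y / z
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total
```

The error is nonlinear in the unknowns because each predicted image location divides by a depth that itself depends on the rotation, the translation, and the point coordinates. A batch or recursive estimator would adjust those unknowns to reduce this value.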
Numerical estimation techniques for recovering object structure may be "batch" or "recursive" processes. In a batch process, all data upon which the process operates are initially known, and the process analyzes the data taken as a whole. Batch processes typically minimize nonlinear objective functions on sets of images using, for example, the Levenberg-Marquardt procedure. In a recursive process, by contrast, the image frames are analyzed in a stepwise manner. Each new frame is incorporated into the estimation analysis in accordance with an assigned weight that reflects its statistical contribution. Examples of recursive processes include the Kalman filter, which provides exact linear solutions for purely orthographic images; and the extended Kalman filter ("EKF"), which linearizes, within a local region around the current estimate, the nonlinear model relating the state vector to perspective images, so that the Kalman filter can be applied to the resulting approximation.
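The distinction between batch and recursive estimation can be illustrated with a deliberately simple case, assuming a single constant scalar quantity observed with noise rather than a full structure-and-motion state. In this sketch the batch estimate is the least-squares mean of all measurements, and the recursive estimate is a scalar Kalman filter that folds in one measurement at a time. The function names and the nearly uninformative prior are assumptions chosen for illustration.

```python
def batch_estimate(measurements):
    """Batch: all data are known up front; the least-squares estimate
    of a constant is the mean of the measurements."""
    return sum(measurements) / len(measurements)


def recursive_estimate(measurements, meas_var=1.0,
                       prior_mean=0.0, prior_var=1e9):
    """Recursive: scalar Kalman filter for a constant state.

    Each measurement is weighted by the Kalman gain, which reflects its
    statistical contribution relative to the current uncertainty.
    Returns the final estimate and its variance.
    """
    x, p = prior_mean, prior_var
    for z in measurements:
        k = p / (p + meas_var)           # gain: weight of the new measurement
        x = x + k * (z - x)              # update estimate toward measurement
        p = p * meas_var / (p + meas_var)  # updated variance, equal to (1 - k) * p
    return x, p
```

With a very large prior variance, the recursive estimate approaches the batch mean once all measurements have been incorporated, but it is available after every step rather than only at the end.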
Although batch processes have been reported to exhibit substantial stability and rapid convergence to a solution, the computational procedures tend to be quite complex, and batch processing by definition precludes in-situ operation (since all the data must be gathered in advance of the numerical analysis). Unfortunately, current statistically based, recursive processes (including those based on the EKF) have failed to exhibit equivalent accuracy and robustness.
Furthermore, prevailing estimation methods--both batch and recursive--require prior knowledge of camera focal length (i.e., assume the use of a calibrated camera) to provide an estimation of metric geometry; in other words, focal length is a required input to the statistical estimation process, and cannot be recovered from that process. Existing video from an unknown camera therefore cannot be processed using these techniques. Current methods also tend to be specific to either orthographic or perspective image representation, and do not accommodate both.
In a typical implementation, the relationship between the image location (i.e., the scene as it appears in two dimensions to the camera) and the actual three-dimensional location within the recorded scene is described in terms of central-projection geometry. This geometric model, illustrated in FIG. 1, can be represented by the equation

$$\begin{pmatrix} u \\ v \end{pmatrix} = \frac{f}{Z_C} \begin{pmatrix} X_C \\ Y_C \end{pmatrix}$$

where $u, v$ specifies the location of an image feature in the image plane, $f$ is the camera focal length, and $X_C, Y_C, Z_C$ specify the location of the object in the camera reference frame (i.e., the true spatial location relative to the current camera location). In other words, the geometric coordinate system is defined with respect to the center of projection located behind the image frame. Many SfM techniques utilize three unknown parameters--i.e., $X_C, Y_C, Z_C$--for each object point. However, this representational mode inevitably leaves at least some parameters undetermined regardless of the number of feature points that are tracked. Thus, although one might intuitively assume that tracking more feature points increases the accuracy of the ultimate estimation, in fact just the opposite is the case. For $N$ tracked feature points, there exist six motion parameters (three for camera rotation and three for camera translation) and $3N$ structure parameters (the three coordinates $X_C, Y_C, Z_C$ of each point), versus $2N$ constraints (the observed $u, v$ locations) plus one arbitrarily set scale constraint. The system is therefore underdetermined for any $N$: the $6 + 3N$ unknowns exceed the $2N + 1$ constraints by $N + 5$, so that adding tracked features actually increases the degree of indeterminacy.
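The parameter counting above can be checked with a short sketch. It assumes the standard central-projection form u = f X_C / Z_C, v = f Y_C / Z_C for a single frame; the function names `project` and `indeterminacy` are hypothetical.

```python
def project(point_c, f):
    """Central projection of a camera-frame point (X_C, Y_C, Z_C) to (u, v).

    Assumes the standard pinhole form u = f * X_C / Z_C, v = f * Y_C / Z_C.
    """
    x_c, y_c, z_c = point_c
    if z_c <= 0:
        raise ValueError("point must lie in front of the camera (Z_C > 0)")
    return (f * x_c / z_c, f * y_c / z_c)


def indeterminacy(n_points):
    """Unknowns minus constraints for one frame with n_points features."""
    unknowns = 6 + 3 * n_points      # six motion + 3N structure parameters
    constraints = 2 * n_points + 1   # 2N observed (u, v) + one scale constraint
    return unknowns - constraints    # equals n_points + 5
```

For example, `indeterminacy(1)` is 6 and `indeterminacy(10)` is 15, so the shortfall grows by one with each added feature point.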
At the conclusion of the estimation process, when changes in relative distance among the tracked features have been fully analyzed, current SfM systems compute the three-dimensional geometry of the object by identifying feature planes. The user, or a suitable computer program, then "segments" these planes by identifying vertices, thereby creating a series of connected polygons that provide a skeletal outline of the object. As far as we are aware, current systems do not provide for recovery of further image detail.