Digital image processing, the processing of images by a computer-based or similar electronic system, is increasingly used in a wide range of applications, including motion picture production, television production, multimedia presentations, architectural design, and manufacturing automation. Each of these applications uses digital image processing to some degree in creating or rendering a computer model of a scene in the real world. The model not only accurately describes physical objects in the scene, such as buildings, parts, props, backgrounds, and actors, but also represents relationships between objects, such as their movement and other transformations over time.
There are presently two general categories of techniques for creating computer models of a scene. In the first, which is essentially image-based, the computer accepts a visual image stream such as that produced by a motion picture film camera or a video camera. The image stream is first converted into digital information in the form of pixels. The computer then operates on the pixels by grouping them together, comparing them with stored patterns, and applying other, more sophisticated processes to determine information about the scene. So-called "machine vision" or "image understanding" techniques are then used to extract and interpret features of the actual physical scene as represented by the captured images. Computerized abstract models of the scene are then created and manipulated using this information.
For example, Becker, S. and Bove, V. M., in "Semiautomatic 3D Model Extraction from Uncalibrated 2D Camera Views," Proceedings SPIE Visual Data Exploration and Analysis II, vol. 2410, pp. 447-461 (1995) describe a technique for extracting a three-dimensional (3D) scene model from two-dimensional (2D) pixel-based image representations as a set of 3D mathematical abstract representations of visual objects in the scene as well as cameras and texture maps.
Horn, B. K. P. and Schunck, B. G., in "Determining Optical Flow," Artificial Intelligence, Vol. 17, pp. 185-203 (1981) describe how so-called optical flow techniques may be used to detect velocities of brightness patterns in an image stream to segment the image frames into pixel regions corresponding to particular visual objects.
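The idea of recovering the velocity of a brightness pattern can be illustrated with a minimal sketch. This is not the Horn and Schunck method itself, which works on 2D images and adds a smoothness term. It is a one-dimensional simplification that assumes brightness is conserved as the pattern moves, so that the spatial gradient Ix, the temporal gradient It, and the velocity v satisfy Ix * v + It = 0. The function names, the Gaussian test pattern, and the half-pixel shift are all assumptions chosen for illustration.

```python
import math


def make_frame(num_pixels, center, width=6.0):
    """Sample a smooth 1D brightness bump centered at `center` (hypothetical test data)."""
    return [math.exp(-((i - center) ** 2) / (2.0 * width ** 2))
            for i in range(num_pixels)]


def estimate_velocity(frame_a, frame_b):
    """Least-squares solution of Ix * v + It = 0 over all interior pixels."""
    numerator = 0.0
    denominator = 0.0
    for i in range(1, len(frame_a) - 1):
        # Spatial gradient: central difference of the average of the two frames.
        ix = ((frame_a[i + 1] + frame_b[i + 1])
              - (frame_a[i - 1] + frame_b[i - 1])) / 4.0
        # Temporal gradient: brightness change between the two frames.
        it = frame_b[i] - frame_a[i]
        numerator += ix * it
        denominator += ix * ix
    if denominator == 0.0:
        raise ValueError("no brightness gradient; velocity is unobservable")
    return -numerator / denominator


# The bump moves half a pixel to the right between the two frames.
frame_a = make_frame(64, center=32.0)
frame_b = make_frame(64, center=32.5)
velocity = estimate_velocity(frame_a, frame_b)
```

With these inputs the estimate should land close to the true shift of 0.5 pixels per frame. A flat region has no gradient and therefore yields no estimate, which is one reason practical methods add further constraints.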
Finally, Burt, P. J. and Adelson, E. H., in "The Laplacian Pyramid as a Compact Image Code," IEEE Transactions on Communications, Vol. COM-31, No. 4, pp. 532-540 (1983) describe a technique for encoding a sampled image as a "pyramid" in which successive levels of the pyramid provide a successively more detailed representation of the image.
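The pyramid structure can be sketched for a one-dimensional signal as follows. This is a simplified illustration under stated assumptions rather than the encoder from the cited paper: it uses a three-tap [1, 2, 1]/4 blur and linear interpolation in place of the paper's weighting function, and it omits quantization and entropy coding. All function names are hypothetical.

```python
def reduce_level(signal):
    """Blur with a [1, 2, 1]/4 kernel (edges clamped), then keep every second sample."""
    n = len(signal)
    blurred = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        blurred.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return blurred[::2]


def expand_level(coarse, size):
    """Upsample `coarse` to `size` samples by linear interpolation."""
    last = len(coarse) - 1
    out = []
    for i in range(size):
        lo = min(i // 2, last)
        hi = min(lo + 1, last)
        frac = 0.5 if i % 2 else 0.0
        out.append((1.0 - frac) * coarse[lo] + frac * coarse[hi])
    return out


def build_laplacian_pyramid(signal, levels):
    """Return detail bands from finest to coarsest, followed by the coarsest signal."""
    pyramid = []
    current = list(signal)
    for _ in range(levels):
        coarse = reduce_level(current)
        predicted = expand_level(coarse, len(current))
        # Each band stores only what the coarser level fails to predict.
        pyramid.append([c - p for c, p in zip(current, predicted)])
        current = coarse
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Start from the coarsest level and add back one detail band at a time."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        predicted = expand_level(current, len(detail))
        current = [d + p for d, p in zip(detail, predicted)]
    return current


signal = [float((i * 7) % 13) for i in range(16)]
pyramid = build_laplacian_pyramid(signal, levels=3)
restored = reconstruct(pyramid)
```

Reconstruction proceeds from the coarsest level toward the finest, so each added band yields a more detailed version of the signal. Because every band stores the exact prediction error, the final result matches the input up to floating-point rounding.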
In a second approach to developing a scene model, which is essentially abstraction-based, the computer model is built from geometric, volumetric, or other mathematical representations of the physical objects. These types of models are commonly found in architectural, computer-aided design (CAD), and other types of computer graphics systems, as generally described in Rohrer, R., "Automated Construction of Virtual Worlds Using Modeling Constraints," The George Washington University (January 1994), and Ballard, D., et al., "An Approach to Knowledge-Directed Image Analysis," in Computer Vision Systems (Academic Press, 1978) pp. 271-281.
The goal in using either type of scene model is to create as accurate a representation of the scene as possible. For example, consider a motion picture environment where computer-generated special effects are to appear in a scene with real world objects and actors. The producer may choose to start by creating a model from digitized motion picture film using automatic image-interpretation techniques and then proceed to combine computer-generated abstract elements with the elements derived from image-interpretation in a visually and aesthetically pleasing way.
Problems can occur with this approach, however, since automatic image-interpretation processes are statistical in nature, and the input image pixels are themselves the results of a sampling and filtering process. Consider that images are sampled from two-dimensional (2D) projections (onto a camera's imaging plane) of three-dimensional (3D) physical scenes. Not only does this sampling process introduce errors, but also the projection into the 2D image plane of the camera limits the amount of 3D information that can be recovered from these images. The 3D characteristics of objects in the scene, 3D movement of objects, and 3D camera movements can typically only be partially estimated from sequences of images provided by cameras.
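The loss of 3D information under projection can be shown with a short sketch. It assumes an idealized pinhole camera at the origin looking along the positive z axis, with a hypothetical `project` function and an arbitrary focal length; lens distortion, sampling, and filtering are ignored.

```python
def project(point, focal_length=1.0):
    """Project a 3D point (x, y, z) onto the image plane of an ideal pinhole camera."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# Two different 3D points that lie on the same ray through the camera center.
near_point = (1.0, 2.0, 4.0)
far_point = (2.0, 4.0, 8.0)

near_image = project(near_point)
far_image = project(far_point)
```

Both points land on the same image coordinates, so a single image cannot tell a small nearby object from a larger one farther away. Recovering depth requires additional views or prior knowledge about the scene.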
As a result, image-interpretation processes do not always automatically converge to the correct solution. For example, even though one might think it is relatively straightforward to derive a 3D mathematical representation of a simple object such as a soda can from sequences of images of that soda can, a process for determining the location and size of a 3D cylinder needed to represent the soda can may not properly converge, depending upon the lighting, camera angles, and other conditions of the original image capture. Because of the probabilistic nature of this type of model, the end result cannot be guaranteed.
Abstraction-based models also have their limitations. While they provide a deterministic and thus predictable representation of a scene, they assume that the representation and input parameters are exactly correct. When that assumption fails, the result does not represent the real scene accurately.
For example, although an object such as a soda can might be initially modeled as a 3D cylinder, other attributes of the scene, such as lights, may not be precisely placed or described in the model. Such imprecision reveals itself when an attempt is made to use the abstraction-based model to create a shaded rendition of the soda can. In addition, the object in the actual scene may not be physically perfect; what was thought to be a perfectly cylindrical soda can may in fact be deformed in some way. Subtle curvatures, scratches, and dents may all be missing from the model of the soda can. The actual detailed geometry of the soda can's lid and pull tab may also be oversimplified or completely missing in the model.
It is therefore difficult to manually assign precise mathematical or other abstract object descriptions to every attribute of a scene.
It is also very difficult to completely distinguish arbitrary physical objects and their attributes, along with camera parameters, solely from the pixel values of captured images.