Techniques for automated recovery of estimated three-dimensional scene structure from multiple two-dimensional images of a visual scene, and the availability of general and special purpose computing engines to support the required calculations, are advancing to a level that makes them practical in a range of design visualization applications. By recovering the estimated scene structure, it is possible to treat the elements of a visual scene as abstract three-dimensional geometric and/or volumetric objects that can be processed, manipulated and combined with other objects within a computer-based system.
One application of these techniques is in media production--the process of creating media content for use in films, videos, broadcast television, television commercials, interactive games, CD-ROM titles, DVD titles, Internet or intranet web sites, and related or derived formats. These techniques can be applied in the pre-production, production and post-production phases of the overall media production process. Other areas include industrial design, architecture, and other design visualization applications.
In creating such design visualization content, it is common to combine various elements and images from multiple sources, and arrange them to appear to interact as if they were in the same physical or synthetic space. It is also common to retouch or otherwise manipulate images to enhance their aesthetic, artistic or commercial value, requiring the artist to estimate and/or simulate the three-dimensional characteristics of elements in the original visual scene. The recovery of three-dimensional scene structure, including camera path data, greatly expands the creative possibilities, while reducing the labor-intensive burdens associated with either modeling the elements of the scene by hand or manipulating the two-dimensional images to simulate three-dimensional interactions.
In these image-based scene analysis techniques, the computer accepts a visual image stream such as that produced by a motion picture, film or video camera. The image stream is first converted into digital information in the form of pixels. The computer then operates on the pixels by grouping them together, comparing them with stored patterns, and applying other more sophisticated processes that use information such as camera position, orientation, and focal length, when available, to determine information about the scene structure. So-called "machine vision" or "image understanding" techniques are then used to automatically extract and interpret the structure of the actual physical scene as represented by the captured images. Computerized abstract models of elements of the scene may then be created and manipulated using this information about the actual physical scene. A computer-generated image stream of a synthetic scene can be similarly analyzed.
For example, Becker, S. and Bove, V. M., in "Semiautomatic 3D Model Extraction from Uncalibrated 2-D Camera Views," Proceedings SPIE Visual Data Exploration and Analysis II, vol. 2410, pp. 447-461 (1995) describe a technique for extracting a three-dimensional (3-D) scene model from two-dimensional (2-D) pixel-based image representations as a set of 3-D mathematical abstract representations of visual objects in the scene as well as camera parameters and depth maps.
Horn, B. K. P. and Schunck, B. G., in "Determining Optical Flow," Artificial Intelligence, Vol. 17, pp. 185-203 (1981) describe how so-called optical flow techniques may be used to detect velocities of brightness patterns in an image stream to segment the image frames into pixel regions corresponding to particular visual objects.
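As a rough illustration of the optical flow idea, the following Python sketch iterates a simplified form of the Horn and Schunck update on two small grayscale frames. It is not the cited paper's implementation: the function name `horn_schunck`, the forward-difference derivatives, the plain four-neighbour average, and the default `alpha` and `iterations` values are all assumptions chosen for brevity. It estimates the flow field only; grouping pixels into regions of similar velocity would be a separate step.

```python
def horn_schunck(frame0, frame1, alpha=1.0, iterations=100):
    """Estimate per-pixel flow (u, v) between two grayscale frames (lists of rows)."""
    h, w = len(frame0), len(frame0[0])

    def clamp(img, y, x):
        # Replicate border pixels so every neighbour lookup is defined.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    # Brightness derivatives (assumed scheme): forward differences in x and y,
    # plain frame difference in time.
    Ix = [[clamp(frame0, y, x + 1) - frame0[y][x] for x in range(w)] for y in range(h)]
    Iy = [[clamp(frame0, y + 1, x) - frame0[y][x] for x in range(w)] for y in range(h)]
    It = [[frame1[y][x] - frame0[y][x] for x in range(w)] for y in range(h)]

    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]

    def neighbour_mean(f, y, x):
        # Simplified smoothness term: unweighted mean of the four neighbours.
        return 0.25 * (clamp(f, y, x - 1) + clamp(f, y, x + 1)
                       + clamp(f, y - 1, x) + clamp(f, y + 1, x))

    for _ in range(iterations):
        new_u = [[0.0] * w for _ in range(h)]
        new_v = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                u_bar = neighbour_mean(u, y, x)
                v_bar = neighbour_mean(v, y, x)
                # How far the smoothed flow violates brightness constancy.
                residual = Ix[y][x] * u_bar + Iy[y][x] * v_bar + It[y][x]
                denom = alpha ** 2 + Ix[y][x] ** 2 + Iy[y][x] ** 2
                new_u[y][x] = u_bar - Ix[y][x] * residual / denom
                new_v[y][x] = v_bar - Iy[y][x] * residual / denom
        u, v = new_u, new_v
    return u, v


# Synthetic test pattern: a horizontal brightness ramp that moves one pixel
# to the right between the two frames.
frame0 = [[float(x) for x in range(8)] for _ in range(6)]
frame1 = [[float(x - 1) for x in range(8)] for _ in range(6)]
u, v = horn_schunck(frame0, frame1)
```

In this synthetic example the estimated horizontal flow `u` settles near one pixel per frame and the vertical flow `v` stays at zero, matching the known motion of the ramp.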
Sawhney, H. S., in "3D Geometry from Planar Parallax", IEEE 1063-6919/94 (1994), pp. 929-934 discusses a technique for deriving 3-D structure through perspective projection using motion parallax defined with respect to an arbitrary dominant plane.
Poelman, C. J. et al., in "A Paraperspective Factorization Method for Shape and Motion Recovery," Dec. 11, 1993, Carnegie Mellon University Report (CMU-CS-93-219), elaborate on a factorization method for recovering both the shape of an object and its motion from a sequence of images, using many images and tracking many feature points.
The goal for the creator of multimedia content in using such a scene model is to create as accurate a representation of the scene as possible. For example, consider a motion picture environment where computer-generated special effects are to appear in a scene with real world objects and actors. The content creator may choose to start by creating a model from digitized motion picture film using automatic image-interpretation techniques and then proceed to combine computer-generated abstract elements with the elements derived from image-interpretation in a visually and aesthetically pleasing way.
Problems can occur with this approach, however, since automatic image-interpretation processes are statistical in nature, and the input image pixels are themselves the results of a sampling and filtering process. Consider that images are sampled from two-dimensional (2-D) projections (onto a camera's imaging plane) of three-dimensional (3-D) physical scenes. Not only does this sampling process introduce errors, but also the projection into the 2-D image plane of the camera limits the amount of 3-D information that can be recovered from these images. The 3-D characteristics of objects in the scene, 3-D movement of objects, and 3-D camera movements can typically only be partially quantified from sequences of images provided by cameras.
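The loss of depth information under projection can be seen in a minimal sketch, assuming an ideal pinhole camera with each 3-D point expressed in the camera's own coordinate frame. The function `project`, its `focal_length` parameter, and the sample points are hypothetical and introduced only for illustration.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a camera-frame 3-D point (X, Y, Z) onto the image plane."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    # Dividing by depth Z is the step that discards distance along the ray.
    return (focal_length * X / Z, focal_length * Y / Z)


near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)  # same viewing ray as `near`, twice as far away

print(project(near))  # (0.25, 0.5)
print(project(far))   # (0.25, 0.5)
```

Both points lie on the same ray through the camera center and therefore map to the same image coordinates. A single image cannot distinguish them, which is why additional views or constraints are needed to recover the 3-D structure.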
As a result, image-interpretation processes do not always automatically converge to the correct solution. For example, even though one might think it is relatively straightforward to derive a 3-D mathematical representation of a simple object such as a soda can from sequences of images of that soda can, a process for determining the location and size of a 3-D wire frame mesh needed to represent the soda can may not properly converge, depending upon the lighting, camera angles, and other conditions of the original image capture. Because of the probabilistic nature of this type of model, the end result cannot be reliably predicted.