Scene analysis is the process of using computer-based systems to derive estimated information about a visual scene from a series of images of that same visual scene. One application of these techniques is in media production--the process of creating media content for use in films, videos, broadcast television, television commercials, interactive games, CD-ROM titles, DVD titles, Internet or intranet web sites, and related or derived formats. These techniques can be applied in the pre-production, production and post-production phases of the overall media production process. Other design visualization application areas include industrial design and architecture. Scene analysis is also applied in areas such as surveillance, reconnaissance, medicine, and the creation of simulation environments for entertainment, training, education, and marketing.
By recovering the estimated scene structure, it is possible to treat the elements of a visual scene as abstract three-dimensional geometric and/or volumetric objects that can be processed, manipulated and combined with other data objects. There are multiple techniques to estimate the parameters of a visual scene from a series of two-dimensional images of that scene. Various aspects of the visual scene, including both two-dimensional and three-dimensional information, can be estimated using techniques such as:
estimating extrinsic camera parameters (such as camera path and camera orientation);
estimating intrinsic camera parameters (such as focal length and optical center);
estimating depths in the scene for pixels in the images (depth maps);
estimating geometry and/or materials properties of objects and their surfaces in the scene (estimating scene structure);
estimating areas of motion in the scene (motion detection); and
estimating paths of feature points and/or objects in the scene (2D and 3D feature tracking).
Underlying these techniques are algorithms that can be categorized into a few general classes. Methods based on dense optical flow attempt to recover the optical flow vectors across pairs of images, typically on a per-pixel basis. Methods based on feature tracking typically select visual features in one image (based on criteria such as areas of high contrast) and attempt to track the path of each selected feature across a series of related images. Most feature tracking methods track a relatively sparse array of features, ranging from a single feature to a few hundred.
For example, Horn, B. K. P. and Schunck, B. G., in "Determining Optical Flow," Artificial Intelligence, Vol. 17, pp. 185-203 (1981) describe how so-called optical flow techniques may be used to detect velocities of brightness patterns in an image stream to segment the image frames into pixel regions corresponding to particular visual objects.
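As a brief sketch of the constraint underlying optical flow methods of this kind, the following uses notation introduced here for illustration only: I(x, y, t) is image brightness, subscripts denote partial derivatives, (u, v) is the flow velocity at a pixel, and alpha is a smoothness weight. Assuming the brightness of a moving pattern stays constant, a single equation relates the two unknown flow components, so a smoothness term is added and the two are minimized together:

```latex
% Brightness constancy: one equation per pixel, two unknowns (u, v)
I_x u + I_y v + I_t = 0

% Regularized energy minimized over the flow field (alpha weights smoothness)
E(u, v) = \iint \Big[ (I_x u + I_y v + I_t)^2
          + \alpha^2 \big( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \big) \Big] \, dx \, dy
```

Because the first equation alone is under-determined at each pixel, the recovered flow in low-texture regions depends heavily on the smoothness assumption.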
Becker, S. and Bove, V. M., in "Semiautomatic 3D Model Extraction from Uncalibrated 2-D Camera Views," Proceedings SPIE Visual Data Exploration and Analysis II, vol. 2410, pp. 447-461 (1995) describe a technique for extracting a three-dimensional (3-D) scene model from two-dimensional (2-D) pixel-based image representations as a set of 3-D mathematical abstract representations of visual objects in the scene as well as camera parameters and depth maps.
Sawhney, H. S., in "3D Geometry from Planar Parallax", IEEE 1063-6919/94 (1994), pp. 929-934, discusses a technique for deriving 3-D structure through perspective projection using motion parallax defined with respect to an arbitrary dominant plane.
Poelman, C. J., et al., in "A Paraperspective Factorization Method for Shape and Motion Recovery", Dec. 11, 1993, Carnegie Mellon University Report (CMU-CS-93-219), elaborate on a factorization method for recovering both the shape of an object and its motion from a sequence of images, using many images and tracking many feature points.
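The core idea of factorization methods of this kind can be sketched as follows. The symbols are notation assumed here for illustration: F frames, P tracked feature points, a measurement matrix W of image coordinates, a motion matrix M and a shape matrix S.

```latex
% W stacks the image x- and y-coordinates of P features over F frames
W \in \mathbb{R}^{2F \times P}

% After subtracting each row's mean (registering to the feature centroid),
% the noise-free measurements factor into motion and shape, so rank <= 3
\tilde{W} = M S, \qquad M \in \mathbb{R}^{2F \times 3}, \quad S \in \mathbb{R}^{3 \times P}

% In practice the best rank-3 approximation is taken from the SVD
\tilde{W} \approx U_3 \Sigma_3 V_3^{\top}, \qquad
\hat{M} = U_3 \Sigma_3^{1/2}, \quad \hat{S} = \Sigma_3^{1/2} V_3^{\top}
```

The factors are determined only up to an invertible 3-by-3 matrix, which such methods resolve with additional metric constraints on the rows of the motion matrix.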
A new method, the dense feature array method, tracks a dense array of features across a series of images. The dense feature array method can track feature densities up to a per-pixel level, and is described in a co-pending United States patent application by Moorby, P. R., entitled "Feature Tracking Using a Dense Feature Array", serial number 09/054,866 filed on Apr. 3, 1998 and assigned to SynaPix, Inc., the assignee of the present application.
However, a key barrier to widespread adoption of these techniques is the inherent uncertainty and inaccuracy of any estimation technique. The information encoded in the images about the actual elements in the visual scene and their relationships is incomplete, and typically distorted by artifacts of the imaging process. As a result, the utility of automated scene analysis drops dramatically as the estimates become unreliable and unpredictable wherever the information encoded in the images is noisy, incomplete or simply missing.
For example, inter-object occlusions, shadows, specular highlights and imaging artifacts such as noise and distortion can all interact in the pixel values of the images. When some or all of these factors are introduced, the probability increases that the scene analysis process will produce results that are meaningless, unstable or inaccurate for some areas of the scene.
An error in a single result is often propagated within the analysis process. Errors are propagated when results for local regions of the image are correlated, interpolated or used in fitting global parameters of the scene. For example, parallax methods of deriving depths of objects in the scene perform a global fit of scene parameters based on a set of local parameters computed across the images being matched.
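The effect of one bad local result on a global fit can be illustrated with a minimal sketch. It assumes a simple least-squares line fit as a stand-in for a global scene-parameter fit; the data values and the function name `fit_line` are hypothetical and are not taken from any of the cited methods.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b over all local estimates."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical local estimates that truly follow y = 2x + 1.
xs = list(range(10))
clean = [2.0 * x + 1.0 for x in xs]

# Corrupt a single local estimate (for instance, a mismatch at an occlusion).
corrupted = list(clean)
corrupted[9] = 100.0

a_clean, b_clean = fit_line(xs, clean)
a_bad, b_bad = fit_line(xs, corrupted)

# Compare the corrupted global fit against the true value at every position.
errors = [abs((a_bad * x + b_bad) - y) for x, y in zip(xs, clean)]
affected = sum(1 for e in errors if e > 1e-6)

print("clean fit:", a_clean, b_clean)
print("fit with one bad estimate:", a_bad, b_bad)
print("positions whose fitted value is now wrong:", affected, "of", len(xs))
```

Although only one input is wrong, the fitted parameters change, so every value derived from the global fit is shifted, not just the one at the corrupted position.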
Other examples of potential error propagation can be found in feature tracking algorithms such as described in Tomasi, C., et al., "Shape and Motion from Image Streams: a Factorization Method", Technical Report CMU-CS-91-172, Carnegie Mellon University (1991), or in a scene structure recovery method based on feature tracking as in the above-referenced Poelman publication. In these types of scene analysis methods, errors are propagated when a match between local features is used to either estimate or constrain matches in neighboring local regions, or estimate or constrain matches across images.
For certain applications, the time-based sequence of input images is analyzed in order to develop a continuous sequence of interpreted images as an output result. Depending on the technique in use, these results can be visually interpreted and presented as pixel values in images associated with the original source images, or as geometric objects (such as sets of points, curves or meshes) related to the original source images.
When such visually interpreted results are themselves presented in a temporal sequence of images generated in whole or part from such results, the errors and artifacts in the results appear as anomalies in the generated images. Since the errors and uncertainties in the results can change and shift in the visual sequence, the visual anomalies in the generated images will appear to occur in either random patterns ("dancing" around in the image sequence) or discernible patterns ("rolling" through the image sequence). These visual anomalies can be confusing, visually displeasing, distracting and/or annoying to the viewer.
In some applications, such as military intelligence systems, the trained professionals using such systems are expected to tolerate some level of visual anomalies. However, in design visualization applications, the eventual viewer of the images generated from the interpreted results is expected to have a relatively low tolerance for visual anomalies.
For example, in the production (including the post-production phase) of media content, every effort must be made to avoid, correct or otherwise rectify such visual anomalies. The same is true in most design visualization systems. If such visual anomalies are difficult to filter or smooth automatically (both within each generated image and across a temporal sequence of such generated images), painstaking labor-intensive adjustments are required in order to achieve acceptable quality.
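One simple form of automatic smoothing across a temporal sequence is a per-pixel median over neighboring frames. The sketch below is an illustrative assumption, not a technique described in this text; the function name `temporal_median`, the window radius and the sample values are hypothetical.

```python
from statistics import median

def temporal_median(frames, radius=1):
    """Per-pixel median over a sliding window of frames.

    frames: list of frames, each a list of rows of numeric values
            (for example, estimated depths). The window is truncated
            at the start and end of the sequence.
    """
    n = len(frames)
    out = []
    for t in range(n):
        window = frames[max(0, t - radius):min(n, t + radius + 1)]
        height = len(frames[t])
        width = len(frames[t][0])
        out.append([
            [median(f[r][c] for f in window) for c in range(width)]
            for r in range(height)
        ])
    return out

# Five 1x3 frames of a constant estimate, with one transient spike
# in the middle frame standing in for a "dancing" anomaly.
frames = [[[5.0, 5.0, 5.0]] for _ in range(5)]
frames[2][0][1] = 50.0

smoothed = temporal_median(frames, radius=1)
print(smoothed[2][0])
```

A filter like this suppresses isolated single-frame spikes but cannot remove anomalies that persist over several frames or that drift slowly through the sequence, which is one reason manual adjustment is often still needed.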
Consider also that the producer of multimedia content typically wishes to use one or more scene models in the first place to create as accurate a representation of the scene as possible. For example, consider a motion picture environment where computer-generated special effects are to appear in a scene with real world objects and actors. The content creator may choose to start by creating models that represent various static and/or dynamic aspects of the scene. Some of these models can be derived from digitized motion picture film using automatic image-interpretation techniques. The content creator then proceeds to combine computer-generated abstract elements with the elements derived from image interpretation in a visually and aesthetically pleasing way.
Problems can occur with this approach, however, since automatic image-interpretation processes are statistical in nature, and the input image pixels are themselves the results of a sampling and filtering process. Consider that images are sampled from two-dimensional (2-D) projections (onto a camera's imaging plane) of three-dimensional (3-D) physical scenes. Not only does this sampling process introduce errors, but also the projection into the 2-D image plane of the camera limits the amount of 3-D information that can be recovered from these images. The 3-D characteristics of objects in the scene, 3-D movement of objects, and 3-D camera movements can typically only be partially quantified from sequences of images provided by cameras.
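The loss of 3-D information under projection can be seen in a minimal sketch, assuming an ideal pinhole camera with the optical center at the origin and looking along the Z axis. The function name `project` and the sample points are hypothetical.

```python
def project(point, focal_length=1.0):
    """Ideal pinhole projection of a camera-space point (X, Y, Z) to the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (focal_length * x / z, focal_length * y / z)

# Two different 3-D points on the same ray through the optical center.
near = (1.0, 0.5, 2.0)
far = tuple(4.0 * c for c in near)  # same direction, four times the distance

print(project(near))
print(project(far))
```

Both points map to the same image location, so a single image cannot distinguish a small nearby object from a larger, more distant one. Depth has to be inferred from additional views or from assumptions about the scene.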
As a result, image-interpretation processes do not always automatically converge to the correct solution. For example, even though one might think it is relatively straightforward to derive a 3-D mathematical representation of a simple object such as a soda can from sequences of images of that soda can, a process for determining the location and size of a 3-D wire frame mesh needed to represent the soda can may not properly converge, depending upon the lighting, camera angles, and so on used in the original image capture. Because of the probabilistic nature of this type of model, the end result cannot be reliably predicted.