This invention relates generally to computer vision, and more particularly, to estimating characteristics of scenes represented by images.
One general problem in computer vision is how to determine the characteristics of a scene from images representing the underlying scene. Following are some specific problems. For motion estimation, the input is usually a temporally ordered sequence of images, e.g., a xe2x80x9cvideo.xe2x80x9d The problem is how to estimate the projected velocities of various thingsxe2x80x94people, cars, balls, background moving in the video. Another problem deals with recovering real-world three-dimensional (3D) structure from a 2D image. For example, how to recover the shape of an object from a line drawing, a photograph, or a stereo pair of photographs. Yet another problem is how to recover high-resolution scene details from a low-resolution image.
Humans make these types of estimates all the time, frequently subconsciously. There are many applications for machines to be able to do this also. These problems have been studied by many workers with different approaches and varying success for many years. The problem with most known approaches is that they lack machine learning methods that can exploit the power of modern processors within a general framework.
In the prior art, methods have been developed for interpreting blocks world images. Other prior art work, using hand-labeled scenes, has analyzed local features of aerial images based on vector codes, and has developed rules to propagate scene interpretations. However, these solutions are for specific one-step classifications, and therefore, cannot be used for solving a general class of low-level vision problems. Methods to propagate probabilities have been used, but these methods have not been put in a general framework for solving vision problems.
Alternatively, optical flow can be estimated from images by using a quad-tree to propagate motion information across scale. There, a brightness constancy assumption is used, and beliefs about the velocity of the optical flow is presented as a gaussian probability distribution.
The present invention analyzes statistical properties of a labeled visual world in order to estimate a visual scene from corresponding image data. The image data might be single or multiple frames; the scene characteristics to be estimated could be projected object velocities, surface shapes, reflectance patterns, or colors. The invention uses statistical properties gathered from labeled training data to form xe2x80x9cbest-guessxe2x80x9d estimates or optimal interpretations of underlying scenes.
The invention operates in a learning phase and an inference phase. During the learning phase statistical properties about the training data are modeled as probability density functions, for example, a mixture of Gaussians. A Markov network is built. During the inference phase beliefs derived from a particular images and the density functions are propagated around the network to make estimates about a particular scene corresponding to the particular image.
Accordingly during the learning phase, training data for typical images and scenes are synthetically generated. A parametric vocabulary for both images and scenes is generated. The probability of image parameters, conditioned on scene parameters (the likelihood function), is modeled, as is the probability of scene parameters, conditioned on neighboring scene parameters. These relationships are modeled with a Markov network where local evidence is propagated to neighboring nodes during an inference phase to determine the maximum a posteriori probability of the scene estimate.
Humans perform scene interpretations in ways that are largely unknown and certainly mathematically undeterminable. We describe a visual system that interprets a visual scene by determining the probability of every possible scene interpretation for all local images, and by determining the probability of any two local scenes neighboring each other. The first probability allows the visual system to make scene estimates from local image data, and the second probability allows the these local estimates to propagate. One embodiment uses a Bayesian method constrained by Markov assumptions.
The method according to the invention can be applied to various low-level vision problems, for example, estimating high-resolution scene detail from a low-resolution version of the image, and estimating the shape of an object from a line drawing. In these applications, the spatially local statistical information, without domain knowledge, is sufficient to arrive at a reasonable global scene interpretation.
Specifically, the invention provides a method for estimating scenes from images. A plurality of scenes are generated and an image is rendered for each scene. These form the training data. The scenes and corresponding images are partitioned into patches. Each patch is quantified as a vector, and the vectors are modeled as a probability densities, for example, a mixture of gaussian distributions. The statistical relationship between patches are modeled as a Markov network. Local probability information is iteratively propagated to neighboring nodes of the network, and the resulting probability densities at each node, a xe2x80x9cbelief,xe2x80x9d is read to estimate the scene.
In one application of our invention, it is possible to estimate high-resolution details from a blurred, or lower-resolution image. A low-resolution image is the input xe2x80x9cimagexe2x80x9d data, and the xe2x80x9cscenexe2x80x9d data are the image intensities of the high-resolution details. The invention can also be used to estimate scene motion from a sequence of images. In this application, the image data are the image intensities from two successive images of the sequence, and the scene data are successive velocity maps indicating projected velocities of the visible objects at each pixel position. Another application for our invention is shading and reflectance disambiguation.
The invention can also be applied to other estimation problems where training and target data can be modeled with probability density functions, for example, in speech recognition, seismological studies, medical diagnostic signals such as EEG and EIKG. In addition, the probability representations can be either discrete or continuous in either the learning or inference phases.