This invention relates generally to processing digital signals, and more particularly, to inferring scenes from test images.
One general problem in digital signal processing is how to infer the overall characteristics of some real world signal from isolated instances or degraded signal samples. For example, in vision applications, some part of a real world scene is measured by acquiring one or more images. For motion estimation, the input is usually a temporally ordered sequence of images, e.g., a xe2x80x9cvideo.xe2x80x9d The problem is how to estimate the projected velocities of various objects, e.g., people, cars, balls, or background, moving in the video. Another vision application deals with recovering real-world three-dimensional structure from two-dimensional images. For example, how to recover the shape of an object from a line drawing, a photograph, or a stereo pair of photographs. Yet another problem is how to recover high-resolution details from low-resolution signals. For example, how to produce high-resolution details from a low-resolution image, or how to provide a high-resolution audio signal on a noisy transmission line.
Humans make these types of estimates all the time, frequently subconsciously. There are many applications for machines to be able to do this also. These problems have been studied by many workers with different approaches and varying success for many years. The problem with most known approaches is that they lack machine learning methods that can exploit the power of modem processors and memory capacities within a general framework.
In the prior art, methods have been developed for interpreting blocks of world images. Other prior art work, using hand-labeled scenes, has analyzed local features of aerial images based on vector codes, and has developed rules to propagate scene interpretations. However, these solutions are for specific one-step classifications, and therefore, cannot be used for solving a general class of low-level vision problems. Methods to propagate probabilities have been used, but these methods have not been put in a general framework for solving vision problems.
Alternatively, optical flow can be estimated from images by using a quad-tree to propagate motion information across scale. There, a brightness constancy assumption is used, and probabilities about the velocity of the optical flow is presented as a Gaussian probability distribution, both of which limit the generality of that approach.
U.S. patent application Ser. No. 09/203,108 xe2x80x9cEstimating Scenes Using Statistical Properties of Images and Scenes,xe2x80x9d by Freeman et al., uses probability densities modeled as a mixture of Gaussian distributions to make scene estimates. In a Continuation-in-Part application Ser. No. 09/236,839 xe2x80x9cEstimating Targets Using Statistical Properties of Observations of Known Targets,xe2x80x9d by Freeman et al., the estimation is done by a combination of probability densities and vectors. The problem with these prior art techniques is that Gaussian mixtures consume substantial time and storage during processing and give a poor approximation to the probability density. It is desired to improve over these prior art techniques.
The invention provides a method for inferring a scene from a test image.
During a training phase, a plurality of labeled images and corresponding scenes are acquired. Each of the images and corresponding scenes is partitioned respectively into a plurality of image patches and scene patches.
Each image patch is represented as an image vector, and each scene patch is represented as a scene vector. The probabilistic relationship between the image vectors and scene vectors is modeled as a network. The probabilistic relationship between the image vectors and scene vectors is modeled as a network, preferably a Markov network.
During an inference phase, the test image corresponding to an unknown scene is acquired. The test image is partitioned into a plurality of test image patches. Each test image patch is represented as a test image vector. Candidate scene vectors corresponding to the test image vectors are located in the training data. Compatibility matrices for candidate scene vectors at neighboring network nodes are determined, and probabilities are propagated in the network until a predetermine termination condition is reached to infer the scene from the test image. The termination condition can be some level of convergence or a fixed number of iterations.