1. Technical Field
The invention relates to electronic imaging. More particularly, the invention relates to real time stereo and motion analysis.
2. Description of the Prior Art
There are many applications for electronic vision systems. For example, robotic vehicles may be operable in both a teleoperated mode, where stereo cameras on board the vehicle provide three-dimensional scene information to human operators via stereographic displays; and a semi-autonomous mode, where rangefinders on board the vehicle provide three-dimensional information for automatic obstacle avoidance.
Stereo vision is a very attractive approach for such electronic vision applications as on-board rangefinding, in part because the necessary video hardware is already required for teleoperation, and in part because stereo vision has a number of potential advantages over other rangefinding technologies, e.g. stereo is passive, nonscanning, nonmechanical, and uses very little power.
The practicality of stereo vision has been limited by the slow speed of existing systems and a lack of consensus on basic paradigms for approaching the stereo problem. Previous stereo vision work has been grouped into categories according to which geometric model of the world was employed, which optimization (i.e. search) algorithms were employed for matching, and which constraints were imposed to enhance the reliability of the stereo matching process.
Primary approaches to geometry have used either feature-based or field-based world models:
Feature-based models typically extract two-dimensional points or line segments from each image, match these, and output the parameters of the corresponding three-dimensional primitives.
Field-based models consist of discrete raster representations, in particular a disparity field that specifies stereo disparity at each pixel in an image.
Field-based models typically perform matching by area correlation. A wide variety of search algorithms have been used, including dynamic programming, gradient descent, simulated annealing, and deterministic, iterative local support methods. Coarse-to-fine search techniques using image pyramids can be combined with most of these methods to improve their efficiency. Finally, many sources of search constraint have been used to reduce the likelihood of false matches, including multispectral images, surface smoothness models, and redundant images, such as in trinocular stereo or motion-based bootstrap strategies.
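By way of illustration only, matching by area correlation in a field-based model may be sketched as follows. The sketch assumes rectified images stored as lists of rows, a sum-of-absolute-differences cost, and an exhaustive search over integer disparities; the function name `sad_disparity` and its parameters are hypothetical and are not drawn from any of the cited systems.

```python
def sad_disparity(left, right, max_disp, half_win=1):
    """Per-pixel disparity by sum-of-absolute-differences area correlation.

    left, right: equal-sized 2-D lists of intensities (assumed rectified).
    Returns a 2-D list of integer disparities, i.e. a disparity field.
    Border pixels that cannot hold a full window are left at 0.
    """
    rows, cols = len(left), len(left[0])
    field = [[0] * cols for _ in range(rows)]
    for y in range(half_win, rows - half_win):
        for x in range(half_win, cols - half_win):
            best_cost, best_d = None, 0
            # A left-image pixel at x is compared with the right image at x - d.
            for d in range(0, min(max_disp, x - half_win) + 1):
                cost = 0
                for dy in range(-half_win, half_win + 1):
                    for dx in range(-half_win, half_win + 1):
                        cost += abs(left[y + dy][x + dx]
                                    - right[y + dy][x + dx - d])
                # Keep the first disparity with the lowest cost.
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            field[y][x] = best_d
    return field
```

The exhaustive search shown here is the simplest of the search strategies listed above; a coarse-to-fine scheme would instead run the same comparison on reduced-resolution images first and narrow the disparity range at each finer level.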
Statistical modeling and estimation methods are increasingly used in both feature-based and field-based models. The use of surface smoothness models, which is known to be effective in practice, fits image information into a statistical framework based upon a relationship to prior probabilities in Bayesian estimation. The power of coarse-to-fine search, redundant images, and active or exploratory sensing methods is well known.
A basic issue is the question of which type of feature-based or field-based model provides the most general approach to stereo vision. The roots of stereo vision lie in the use of area correlation for aerial triangulation. In the past, correlation was thought to be too slow or to be inappropriate for other reasons. As a result, methods based on edges or other types of features became popular. However, feature-based methods also have limitations due to feature instability and the sparseness of estimated range images.
Another important issue is which combination or combinations of search algorithms and constraints provide the most efficient and reliable performance. Global search algorithms, such as simulated annealing and three-dimensional dynamic programming, may give accurate results, but they are computationally very expensive. Analogously, multispectral or redundant images provide more information, but increase the hardware and computational cost of a system. It is likely that comparatively simple methods may lead to fast and usually reliable performance, as described in H. K. Nishihara, Practical Real-Time Imaging Stereo Matcher, Optical Engineering, volume 23, number 5 (September/October 1984).
U.S. Pat. No. 4,905,081 to Morton discloses a method and apparatus for transmitting and receiving three-dimensional video pictures. Transmission of video pictures containing depth information is achieved by taking video signals from two sources, showing different representations of the same scene, and correlating them to determine a plurality of peak correlation values which correspond to vectors representing depth information. The first video signal is divided into elementary areas and each block is tested, pixel by pixel, with each vector to determine which vector gives the best fit in deriving the second video signal from the first. The vectors that give the best fit are then assigned to their respective areas of the picture and constitute difference information. The first video signal and the assigned vectors are then transmitted in parallel. The first video signal can be received as a monoscopic picture, or alternatively the vectors can be used to modify the first signal to form a display containing depth information.
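The per-block vector assignment described above may be illustrated by the following sketch. It is not Morton's implementation: it assumes that the candidate vectors are purely horizontal integer shifts already obtained from the correlation peaks, that fit is measured by a sum of absolute differences, and that samples are clamped at the picture border. The name `assign_vectors` and the block size are hypothetical.

```python
def assign_vectors(first, second, candidates, block=4):
    """Assign to each block of `first` the candidate shift that best
    predicts `second` from `first` (assumed fit: sum of absolute differences).

    Returns a dict mapping (top_row, left_col) of each block to its shift.
    """
    rows, cols = len(first), len(first[0])
    assigned = {}
    for by in range(0, rows, block):
        for bx in range(0, cols, block):
            best = None
            for v in candidates:
                err = 0
                for y in range(by, min(by + block, rows)):
                    for x in range(bx, min(bx + block, cols)):
                        sx = min(max(x - v, 0), cols - 1)  # clamp at the border
                        err += abs(second[y][x] - first[y][sx])
                # Keep the first candidate with the lowest error.
                if best is None or err < best[0]:
                    best = (err, v)
            assigned[(by, bx)] = best[1]
    return assigned
```

Under these assumptions, the first picture together with the returned per-block shifts corresponds to the information that the method transmits in parallel.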
Morton discloses a method that provides a remote sensing technique for use, for example, with robots in hazardous environments. Such robots often use stereoscopic television to relay a view of their surroundings to an operator. The technique described by Morton could be used to derive and display the distance of an object from a robot to avoid the need for a separate rangefinder. For autonomous operation of the robot, however, information concerning the distance to a hazardous object in the environment of the robot must be available in near real-time.
The slow speed of prior art stereo vision systems has posed a major limitation, e.g. in the performance of semi-autonomous robotic vehicles. Semi-autonomy, in combination with teleoperation, is desired for many tasks involving remote or hazardous operations, such as planetary exploration, waste cleanup, and national security. A major need has been a computationally inexpensive method for computing range images in near real time by cross-correlating stereo images.
C. Anderson, L. Matthies, Near Real-Time Stereo Vision System, U.S. Pat. No. 5,179,441 (Jan. 12, 1993) discloses an apparatus for a near real-time stereo vision system that is used with a robotic vehicle that comprises two cameras mounted on three-axis rotation platforms, image-processing boards, and a CPU programmed with specialized stereo vision algorithms. Bandpass-filtered image pyramids are computed, stereo matching is performed by least-squares correlation, and confidence images are estimated by means of Bayes' theorem.
In particular, Laplacian image pyramids are built and disparity maps are produced from a 60×64 level of the pyramids at rates of up to 2 seconds per image pair. All vision processing is performed by the CPU board augmented with the image processing boards.
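For illustration, a band-pass image pyramid of the general kind referred to above may be sketched as follows. This is a simplified variant and not the construction of Anderson et al.: each band is taken as an image minus its low-pass version, using an assumed 5-tap binomial filter, and the low-pass image is then subsampled by two in each direction. All names are hypothetical.

```python
KERNEL = (1, 4, 6, 4, 1)  # assumed 5-tap binomial low-pass filter; taps sum to 16


def blur_rows(img):
    """Low-pass filter every row, replicating the edge pixels."""
    cols = len(img[0])
    return [
        [sum(k * row[min(max(x + i - 2, 0), cols - 1)]
             for i, k in enumerate(KERNEL)) / 16.0
         for x in range(cols)]
        for row in img
    ]


def transpose(img):
    return [list(col) for col in zip(*img)]


def blur(img):
    """Separable low-pass filter: rows first, then columns."""
    return transpose(blur_rows(transpose(blur_rows(img))))


def laplacian_pyramid(img, levels):
    """Return (bands, residual): band-pass images from fine to coarse,
    plus the final low-pass residual."""
    bands = []
    current = [[float(v) for v in row] for row in img]
    for _ in range(levels):
        low = blur(current)
        # Band-pass level: the detail removed by the low-pass filter.
        bands.append([[c - l for c, l in zip(cr, lr)]
                      for cr, lr in zip(current, low)])
        # Subsample by two in each direction for the next, coarser level.
        current = [row[::2] for row in low[::2]]
    return bands, current
```

Each level halves both image dimensions, so three halvings of a 480×512 input (an input size assumed here purely for the arithmetic) would yield a 60×64 image, which is why matching at a coarse level is far cheaper than matching at full resolution.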
Anderson et al. disclose a near real-time stereo vision apparatus for use with a robotic vehicle that comprises a first video camera, attached to mounting hardware for producing a first video output image responsive to light from an object scene; and a second video camera, also attached to the mounting hardware for producing a second video output image responsive to light from the object scene; a first digitizer for digitizing the first video image having an input connected to an output of the first video camera, and having an output at which digital representations of pixels in the first video image appear; a second digitizer for digitizing the second video image having an input connected to an output of the second video camera, and having an output at which digital representations of pixels in the second video image appear; a video processor for successively producing sequential stereo Laplacian pyramid images at left and right stereo outputs thereof from the digital representations of the first and second video images at first and second inputs connected to the outputs of the first and second digitizers; a stereo correlation means for correlating left and right stereo Laplacian pyramid images at the left and right stereo outputs of the video processor, where the stereo correlation means have an output and first and second inputs connected to the left and right inputs of the video processor; a disparity map calculator connected to the output of the stereo correlation means, for calculating a disparity map of the object scene; and means for storing an array of numerical values corresponding to the stereo disparity at each pixel of a digital representation of the object scene.
R. Zabih and J. Woodfill, Non-parametric local transforms for computing visual correspondence, 3rd European Conference on Computer Vision, Stockholm (1994) disclose the use of non-parametric local transforms as a basis for performing correlation. Such non-parametric local transforms rely upon the relative ordering of local intensity values, and not on the intensity values themselves. Correlation using such transforms is thought to tolerate a significant number of outliers. The document discusses two non-parametric local transforms, i.e. the rank transform, which measures local intensity, and the census transform, which summarizes local image structure.
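The two transforms may be illustrated by the following minimal sketch, which assumes a square window of radius `r` around an interior pixel and a row-by-row bit ordering. The function names and the bit ordering are hypothetical choices made for this illustration. Two census bit strings are compared here by their Hamming distance, i.e. the number of bit positions in which they differ.

```python
def census(img, y, x, r=1):
    """Census transform at (y, x): one bit per neighbour in the window,
    set when the neighbour is darker than the centre pixel."""
    bits = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue  # the centre pixel is not compared with itself
            bits = (bits << 1) | int(img[y + dy][x + dx] < img[y][x])
    return bits


def rank(img, y, x, r=1):
    """Rank transform at (y, x): how many neighbours are darker than the centre."""
    return bin(census(img, y, x, r)).count("1")


def hamming(a, b):
    """Number of bit positions in which two census bit strings differ."""
    return bin(a ^ b).count("1")
```

Because only the ordering of intensities enters the comparison, a change of gain or offset that preserves that ordering leaves both transforms unchanged, and a single outlying pixel in a 3×3 window alters at most one bit of the census string.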
In view of the various shortcomings associated with the prior art, as discussed above, it would be advantageous to provide a new algorithm for image-to-image comparison that requires less storage and fewer operations than other algorithms. It would be of additional advantage to provide a hardware/software electronic vision solution having an implementation that is a combination of standard, low-cost and low-power components programmed to perform such new algorithm.