The ability to perceive a real-world scene in three dimensions finds particular utility in applications like computer vision, 3D video, 3D modeling, image-based rendering, intelligent transportation, and many others. Because only limited information about the depth of objects in a scene can be obtained from a single monoscopic image of that scene, 3D perception of a scene is often obtained using two or more monoscopic images of the scene. For example, when the two or more monoscopic images are taken from different visual perspectives of the scene, the absolute depth of objects in the scene can be derived.
Each of these monoscopic images may include a plurality of pixel blocks for depicting the scene at some given pixel block resolution, where a pixel block in turn includes one or more individual picture elements (pixels). Because the monoscopic images are taken of the same scene, a pixel block in one image may correspond to the same real-world object depicted by a pixel block in the other image(s). Yet because the monoscopic images are taken from different visual perspectives of the scene, these corresponding pixel blocks may be located at different positions within the images. That is, the position of a pixel block depicting an object in one image may be offset (horizontally and/or vertically) relative to the position of a pixel block corresponding to that same object in another image. This relative offset between corresponding pixel blocks is referred to as the disparity between those pixel blocks. The collection of disparities between all corresponding pixel blocks represents a disparity map of the scene.
The disparity between corresponding pixel blocks is inversely proportional to the depth of the object that they depict (i.e., the greater the offset in the position of an object when viewed from different perspectives, the closer the object). From a disparity map of a scene, therefore, one can derive a depth map of the scene, which identifies the depth of all objects in the scene. Thus, a common approach to obtaining 3D perception of a scene involves first estimating a disparity map of a scene, and then deriving from that disparity map a depth map of the scene.
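For illustration only, the inverse relationship between disparity and depth can be sketched as follows. The sketch assumes a rectified two-camera arrangement with purely horizontal disparity, in which depth equals the focal length multiplied by the camera baseline and divided by the disparity. The function names, the focal length (in pixels), and the baseline (in meters) are hypothetical parameters introduced here and are not part of the description above.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Return depth in meters for one disparity value (assumed rectified stereo pair)."""
    if disparity_px <= 0:
        # Zero disparity corresponds to an object at (effectively) infinite depth.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


def depth_map_from_disparity_map(disparity_map, focal_length_px, baseline_m):
    """Convert a disparity map (list of rows) into a depth map of the same shape."""
    return [
        [depth_from_disparity(d, focal_length_px, baseline_m) for d in row]
        for row in disparity_map
    ]


# Hypothetical camera parameters, chosen only to make the arithmetic easy to follow.
example_disparities = [[10, 20], [35, 0]]
example_depths = depth_map_from_disparity_map(
    example_disparities, focal_length_px=700, baseline_m=0.5
)
```

In this sketch, doubling the disparity halves the derived depth, which is the inverse proportionality described above.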
However, the computational complexity involved in estimating a disparity map of a scene is generally quite high. The search for corresponding pixel blocks accounts for much of this computational complexity. For example, some conventional searching approaches define one of the monoscopic images as a reference image and search the other monoscopic image(s) for pixel blocks that correspond to those in the reference image. Notably, to find the pixel block in another image that corresponds to a given pixel block in the reference image, these conventional approaches search over a full range of candidate pixel blocks. The range is a full range in the sense that it includes all pixel blocks in the other image that might possibly correspond to the pixel block in the reference image, given some maximum possible disparity for the scene (which can be quite large, e.g., over a hundred pixels). To determine which of these candidate pixel blocks corresponds to the pixel block in the reference image, conventional approaches calculate a matching cost for each candidate pixel block that indicates how similar or dissimilar that candidate pixel block is to the pixel block in the reference image. The matching costs are then evaluated to determine which candidate pixel block has a matching cost indicating the most similarity to the pixel block in the reference image; this candidate pixel block is then taken to be the corresponding pixel block.
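A full range search of this kind can be sketched as follows, under several assumptions that are not stated above: the images are rectified so that disparity is purely horizontal, the reference image is the left image so that a candidate lies at horizontal position x minus d in the other image, and the matching cost is a sum of absolute differences (SAD), for which a lower cost indicates greater similarity. The function names and parameters are hypothetical.

```python
def sad_cost(ref, other, ref_x, other_x, y, block):
    """Sum of absolute differences between two block-by-block pixel blocks."""
    total = 0
    for dy in range(block):
        for dx in range(block):
            total += abs(ref[y + dy][ref_x + dx] - other[y + dy][other_x + dx])
    return total


def full_range_search(ref, other, x, y, block, max_disparity):
    """Evaluate every candidate disparity 0..max_disparity for one reference block.

    Returns (best_disparity, best_cost), where best_cost is the lowest SAD found.
    """
    best_d, best_cost = 0, None
    for d in range(max_disparity + 1):
        candidate_x = x - d
        if candidate_x < 0:
            break  # remaining candidates would fall outside the other image
        cost = sad_cost(ref, other, x, candidate_x, y, block)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# Tiny synthetic pair: the pattern (5, 9) sits 2 pixels further left in `other`.
ref_image = [[0, 0, 0, 5, 9, 0, 0, 0], [0, 0, 0, 5, 9, 0, 0, 0]]
other_image = [[0, 5, 9, 0, 0, 0, 0, 0], [0, 5, 9, 0, 0, 0, 0, 0]]
result = full_range_search(ref_image, other_image, x=3, y=0, block=2, max_disparity=3)
```

Every candidate in the range is scored before the best one is selected, which is the source of the computational cost discussed in this section.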
Conventional approaches perform such a full range search independently for each pixel block in the reference image, which helps to ensure that the estimated disparity map is accurate. However, this accuracy comes at the expense of significant computational complexity, since performing so many full range searches requires conventional approaches to calculate a very large number of matching costs. As a result of this computational complexity, most conventional approaches cannot be implemented on devices with limited image processing or memory storage capabilities. The computational complexity of conventional approaches also limits the speed at which they can estimate a disparity map, restricting them to applications that do not require real-time 3D perception.
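To give a rough sense of scale, the number of matching costs required by independent full range searches can be estimated as the number of pixel blocks in the reference image multiplied by the number of candidates per block. The figures below (a 640 by 480 reference image, one pixel per pixel block, and a maximum disparity of 128 pixels) are assumed purely for illustration, and the count is an upper bound that ignores candidates falling outside the image border.

```python
def full_range_cost_count(width_blocks, height_blocks, max_disparity):
    """Upper bound on matching costs when every block is searched over the full range."""
    candidates_per_block = max_disparity + 1  # disparities 0..max_disparity
    return width_blocks * height_blocks * candidates_per_block


# Assumed, illustrative figures; not taken from the description above.
cost_count = full_range_cost_count(width_blocks=640, height_blocks=480, max_disparity=128)
```

Under these assumed figures, the count is 39,628,800 matching costs for a single image pair, each of which is itself a computation over all the pixels of a block.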
Various known improvements to conventional approaches attempt to accelerate the disparity map estimation process. See, e.g., C. Zach, K. Kramer, and H. Bischof, “Hierarchical Disparity Estimation with Programmable 3D Hardware,” The 12th International Conference in Central Europe on Computer Graphics, Visualization, and Computer Vision '2004 (which describes refining disparity maps estimated at coarse pixel block resolutions to obtain disparity maps with finer resolution). Yet these known improvements sacrifice the accuracy of the estimated disparity map, since fine spatial details of the disparity map are often lost.