Dense stereo matching is an important component of many computer vision applications, such as 3D reconstruction and robot navigation, and hence has been studied for decades. Many recent methods, such as the ones described by D. Scharstein and R. Szeliski in “Taxonomy and evaluation of dense two frame stereo correspondence algorithms”, IJCV 47:7-42, 2002, are based on global or semi-global optimization algorithms. They produce high-quality results but, due to their complexity, are far from real time. Other known methods are local block-based methods, which are fast and amenable to parallelization and so were among the first to be implemented on Graphics Processing Units (GPUs). However, they typically perform poorly near occlusion boundaries and in low-textured regions, which prompted the development of various post-processing steps.
Recently, some non-local methods have been ported to the GPU, such as Dynamic Programming, described by Wang et al. in “High-quality Real-time Stereo using Adaptive Cost Aggregation and Dynamic Programming”, 3DPVT, 2006, pp 798-805, and Belief Propagation, described by Yang et al. in “Real-time global stereo matching using hierarchical belief propagation”, BMVC, 2006, pp 989-998. These implementations support only small image sizes and/or small disparity ranges. Consequently, local methods are still the popular choice for real-time stereo implementations. A class of methods that does not fit this classification into local and non-local methods comprises those built on a coarse-to-fine (CTF) architecture, as proposed by M. Sizintsev and R. P. Wildes in “Coarse-to-fine stereo with accurate 3D boundaries,” Image and Vision Computing 28, 2010, pp. 352-366. A schematic representation of this CTF architecture is illustrated in FIG. 1. Initially, image pyramids are constructed, and then stereo estimation is performed progressively from coarser to finer levels of the pyramid. Disparity estimates are propagated from coarser to finer levels, while local block-based matching is used at each level of the pyramid to perform disparity estimation about the initial guess from the coarser level. Due to their simplicity and parallelizability, these methods are very fast, and at the same time more accurate than local algorithms applied only at the input resolution. However, CTF stereo is known to perform poorly near occluding boundaries.
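As a rough illustration of the coarse-to-fine scheme described above, the following Python sketch builds image pyramids, estimates disparity at the coarsest level, and refines the estimate at each finer level by block matching about the upsampled guess. It is a minimal sketch under stated assumptions, not the cited authors' implementation: the 2×2 averaging used to build the pyramid, the sum-of-absolute-differences block cost, the nearest-neighbour upsampling, the ±1 search range, and all function names are hypothetical choices made for illustration.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (assumed pyramid filter)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def build_pyramid(img, levels):
    """Return [finest, ..., coarsest]; index 0 is the input resolution."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def block_cost(left, right, y, x, d, radius=1):
    """Sum of absolute differences between the block centered at (y, x) in
    the left image and the block centered at (y, x - d) in the right image.
    Coordinates outside the image are clamped to the border."""
    h, w = len(left), len(left[0])
    cost = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy = min(max(y + dy, 0), h - 1)
            xl = min(max(x + dx, 0), w - 1)
            xr = min(max(x + dx - d, 0), w - 1)
            cost += abs(left[yy][xl] - right[yy][xr])
    return cost


def refine_level(left, right, init, search=1):
    """Block matching restricted to +/- search around the initial guess."""
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            guess = init[y][x]
            candidates = range(max(guess - search, 0), guess + search + 1)
            disp[y][x] = min(candidates,
                             key=lambda d: block_cost(left, right, y, x, d))
    return disp


def upsample_disparity(disp, h, w):
    """Nearest-neighbour upsampling; disparities double with the resolution."""
    ch, cw = len(disp), len(disp[0])
    return [[2 * disp[min(y // 2, ch - 1)][min(x // 2, cw - 1)]
             for x in range(w)] for y in range(h)]


def ctf_stereo(left, right, levels=3, search=1):
    """Coarse-to-fine disparity estimation over an image pyramid."""
    lp, rp = build_pyramid(left, levels), build_pyramid(right, levels)
    # Start with a zero disparity guess at the coarsest level.
    disp = [[0] * len(lp[-1][0]) for _ in lp[-1]]
    for level in range(levels - 1, -1, -1):
        l_img, r_img = lp[level], rp[level]
        if level != levels - 1:
            disp = upsample_disparity(disp, len(l_img), len(l_img[0]))
        disp = refine_level(l_img, r_img, disp, search)
    return disp
```

Calling `ctf_stereo(left, right)` on two equally sized images given as lists of rows returns an integer disparity map at the input resolution. Because each level searches only a narrow range about the propagated guess, a wrong estimate at a coarse level cannot be corrected later, which is one way to see the sensitivity of CTF schemes to coarse-level errors.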
To improve performance at occlusion boundaries, an adaptive coarse-to-fine (ACTF) stereo algorithm was proposed by M. Sizintsev and R. P. Wildes in “Coarse-to-fine stereo with accurate 3D boundaries,” Image and Vision Computing 28, 2010, pp. 352-366. This ACTF stereo algorithm uses non-centered windows for matching and also adaptively upsamples coarser-level disparity estimates, as illustrated in FIG. 2. For every pixel, a centered window, e.g. the 3×3 window 20 shown in FIG. 2, is first used to choose the disparity value d with the highest correlation score. Then, a search is performed over all non-centered windows that include the reference pixel, e.g. windows 22 and 24 in FIG. 2, which are the same size as the centered window 20, to find the window with the highest correlation (C) score. This best non-centered window is the centered window for some other pixel, and the corresponding disparity value d of that pixel is selected as the disparity value for the reference pixel. The correlation C is defined as follows:
$$C = \frac{\sum_{i} L_i \times R_i}{\sum_{i} L_i^{2} \times \sum_{i} R_i^{2}}$$
where L and R refer to corresponding left and right image patches and i indexes corresponding pixels in the patches. ACTF performs well for the most part, but often makes disparity errors such as the one shown in the disparity map in FIG. 3(c), which is based on the left stereo image in FIG. 3(a) and the right stereo image in FIG. 3(b). As shown in the disparity map of FIG. 3(c), part of the head is cut off and the raised arm appears to have an abnormal shape. These failures occur because the disparity values for a pixel at a given pyramid level are drawn only from the immediately coarser level. An incorrect estimate of the disparity value at a coarser level can therefore propagate all the way to the finest level.
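The correlation score and the non-centered window selection described above can be sketched in Python as follows. This is an illustrative sketch, not the cited authors' code: the function names, the flat-list patch representation, the precomputed `score` and `disp` maps (the best centered-window correlation and its disparity for every pixel), and the decision to let the centered window itself compete in the search are all assumptions. The `correlation` function follows the formula as stated here; note that the commonly used normalized cross-correlation takes the square root of this denominator.

```python
def correlation(left_patch, right_patch):
    """Correlation score C for two equally sized patches (flat lists),
    following the formula as stated in the text."""
    num = sum(l * r for l, r in zip(left_patch, right_patch))
    den = sum(l * l for l in left_patch) * sum(r * r for r in right_patch)
    # Guard against all-zero patches (assumed convention: score 0).
    return num / den if den else 0.0


def select_noncentered(score, disp, radius=1):
    """For each reference pixel, pick the disparity of the best window
    that contains it.

    score[y][x] is the best correlation of the window centered at (y, x),
    and disp[y][x] is the disparity that achieved it. Every window of the
    same size that contains the reference pixel is the centered window of
    some pixel within `radius` of it, so the search runs over those
    neighbours (radius=1 corresponds to 3x3 windows).
    """
    h, w = len(score), len(score[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_y, best_x = y, x  # assumption: centered window competes too
            for ny in range(max(y - radius, 0), min(y + radius, h - 1) + 1):
                for nx in range(max(x - radius, 0), min(x + radius, w - 1) + 1):
                    if score[ny][nx] > score[best_y][best_x]:
                        best_y, best_x = ny, nx
            out[y][x] = disp[best_y][best_x]
    return out
```

In this formulation a pixel near an occlusion boundary can borrow the disparity of a neighbouring window that lies entirely on one side of the boundary, which is the intuition behind using non-centered windows.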
The fact that a large number of stereo matching algorithms have been proposed clearly indicates that there are tradeoffs between the speed, flexibility, accuracy and even density of stereo matching. Therefore, there is a need in the art for an improved stereo matching algorithm that overcomes the deficiencies of the prior art and computes accurate disparity maps in real time while maintaining a high execution speed.