Disparity estimation is a necessary component of stereo and 3D video processing. In a two-camera imaging system, disparity is defined as the vector difference between the positions of an imaged object point in the two images, each measured relative to the focal point. It is this disparity that allows the depth of objects in the scene to be estimated via triangulation of the point in each image. In rectified stereo, where both camera images are in the same plane, only horizontal disparity exists. In this case, multiview geometry shows that disparity is inversely proportional to actual depth in the scene.
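The inverse relationship for the rectified case can be written out as a standard pinhole-camera relation. The symbols are assumptions introduced here for illustration and do not appear in the text above: f is the focal length in pixels, B is the baseline between the two camera centers, d is the horizontal disparity in pixels, and Z is the depth of the scene point.

$$
d = x_{\text{left}} - x_{\text{right}} = \frac{f\,B}{Z}
\qquad\Longleftrightarrow\qquad
Z = \frac{f\,B}{d}
$$

Under these assumptions, doubling the depth of a point halves its disparity, so distant objects produce small disparities and small disparity errors translate into large depth errors.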
Disparity estimation has been studied extensively for still images. Existing image-based methods are ill-suited to video because applying them on a frame-by-frame basis does not guarantee temporal consistency, and doing so often leads to poor spatial and temporal consistency. Temporal consistency is the smoothness of the disparity in time. If a video disparity is not temporally consistent, then an observer will see flickering artifacts. Temporally inconsistent disparity also degrades the performance of view synthesis and 3D video coding.
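One rough way to quantify the flicker described above is to average the absolute change in disparity between consecutive frames. The following is a minimal sketch, not a metric taken from the cited literature: the function name `temporal_flicker` and the (frames, height, width) array layout are assumptions made for illustration, and the measure does not compensate for real scene motion, which also changes disparity over time.

```python
import numpy as np


def temporal_flicker(disparity_video):
    """Mean absolute frame-to-frame disparity change.

    disparity_video: array-like of shape (T, H, W), one disparity map per frame.
    Returns 0.0 for a sequence whose disparity never changes; larger values
    indicate less temporal smoothness. (Hypothetical helper for illustration.)
    """
    d = np.asarray(disparity_video, dtype=np.float64)
    if d.ndim != 3 or d.shape[0] < 2:
        raise ValueError("expected at least two frames with shape (T, H, W)")
    # Difference between each frame and the one before it, along the time axis.
    frame_deltas = np.diff(d, axis=0)
    return float(np.mean(np.abs(frame_deltas)))
```

For a static scene, a temporally consistent estimate gives a value near zero, while an estimate that jumps between neighboring disparity levels from frame to frame gives a value near the size of those jumps.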
Existing disparity estimation methods are also tuned for specific datasets such as the Middlebury stereo database. See, D. Scharstein and R. Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms,” International Journal of Computer Vision, vol. 47, pp. 7-42 (April 2002). Such methods tend to perform poorly when applied to real video sequences. Many common real video sequences have lighting conditions, color distributions and object shapes that can be very different from those of the images in the Middlebury stereo database. For methods that require training, application to real videos is at best highly impractical, and often nearly impossible, in terms of execution speed and computational complexity.
Existing image-based disparity estimation techniques may be categorized into one of two groups: local or global methods. Local methods treat each pixel (or an aggregated region of pixels) in the reference image independently and seek to infer the optimal horizontal displacement to match it with the corresponding pixel/region. Global methods incorporate assumptions about depth discontinuities and estimate disparity values by minimizing an energy function over all pixels using techniques such as Graph Cuts or Hierarchical Belief Propagation. Y. Boykov et al., “Fast Approximate Energy Minimization via Graph Cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222-1239 (February 2004); V. Kolmogorov and R. Zabih, “Computing Visual Correspondence with Occlusions via Graph Cuts,” International Conference on Computer Vision Proceedings, pp. 508-515 (2001). Local methods tend to be very fast, whereas global methods tend to be more accurate; most implementations of global methods, however, are unacceptably slow. See, D. Scharstein and R. Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms,” International Journal of Computer Vision, vol. 47, pp. 7-42 (April 2002).
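To make the local category concrete, the following is a minimal sketch of one simple local method: window-based matching with a sum-of-absolute-differences (SAD) cost and a winner-take-all choice per pixel. It is not the method of any cited reference. The function names `box_sum` and `block_match`, the `radius` and `max_disparity` parameters, and the assumption of rectified grayscale inputs with the left image as reference are all introduced here for illustration.

```python
import numpy as np


def box_sum(img, radius):
    """Sum of img over a (2*radius+1)^2 window around each pixel (edge-padded)."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.pad(padded, ((1, 0), (1, 0)), mode="constant").cumsum(0).cumsum(1)
    return ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]


def block_match(left, right, max_disparity, radius=2):
    """Winner-take-all SAD matching on a rectified grayscale pair.

    Returns an integer disparity map for the left (reference) image.
    Assumes 0 <= max_disparity < image width.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    # cost[d, y, x] is the aggregated cost of assigning disparity d to (y, x).
    cost = np.full((max_disparity + 1, h, w), np.inf)
    for d in range(max_disparity + 1):
        # Left pixel (y, x) is compared with right pixel (y, x - d).
        diff = np.abs(left[:, d:] - right[:, : w - d])
        cost[d, :, d:] = box_sum(diff, radius)
    # Each pixel independently takes the disparity with the lowest cost.
    return np.argmin(cost, axis=0)
```

Each pixel is decided independently of its neighbors, which is what makes this kind of method fast and easy to parallelize. A global method would instead add a smoothness term coupling neighboring pixels and minimize the total energy jointly, which is where the additional accuracy and the additional run time both come from.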
Attempts to solve stereo-matching problems for video have had limited success. Difficulties encountered have included the computational bottleneck of dealing with multidimensional data, the lack of any real datasets with ground truth, and the unclear relationship between optimal spatial and temporal processing for correspondence matching. Most attempts have extended existing image-based methods to video and have produced computational burdens that are impractical for most applications.
One attempt to apply the Hierarchical Belief Propagation method to video extends the matching cost representation with a 3-dimensional Markov Random Field (MRF). O. Williams, M. Isard, and J. MacCormick, “Estimating Disparity and Occlusions in Stereo Video Sequences,” in Computer Vision and Pattern Recognition Proceedings (2005). Reported algorithmic run times were as high as 947.5 seconds for a single 320×240 frame on a powerful computer, which is highly impractical.
Other approaches have used motion flow fields to attempt to enforce temporal coherence. One motion flow field technique makes use of a motion vector field. F. Huguet and F. Devernay, “A Variational Method for Scene Flow Estimation from Stereo Sequences,” in International Conference on Computer Vision Proceedings, pp. 1-7 (2007). For another, see M. Bleyer and M. Gelautz, “Temporally Consistent Disparity Maps from Uncalibrated Stereo Videos,” in Proceedings of the 6th International Symposium on Image and Signal Processing (2009).
One computationally practical method is a graphics processing unit (GPU) implementation of Hierarchical Belief Propagation that relies upon locally adaptive support weights. See, C. Richardt et al., “Realtime Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid,” in European Conference on Computer Vision Proceedings (2010); K. J. Yoon and I. S. Kweon, “Locally Adaptive Support-Weight Approach for Visual Correspondence Search,” in Computer Vision and Pattern Recognition Proceedings (2005). This method integrates temporal coherence in a similar way to Williams et al. (O. Williams, M. Isard, and J. MacCormick, “Estimating Disparity and Occlusions in Stereo Video Sequences,” in Computer Vision and Pattern Recognition Proceedings (2005)) and also provides a synthetic dataset with ground-truth disparity maps. Other practical methods require specific hardware or impose constraints on the input data. See, J. Zhu et al., “Fusion of Time-of-Flight Depth and Stereo for High Accuracy Depth Maps,” in Computer Vision and Pattern Recognition Proceedings, pp. 1-8 (2008); G. Zhang, J. Jia, T. T. Wong, and H. Bao, “Consistent Depth Maps Recovery from a Video Sequence,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 6, pp. 974-988 (2009).