Stereo vision matching is used to determine disparity values between image frames received from multiple image sensors (e.g., cameras). Disparity estimation is a process of identifying differences between corresponding points of stereo image frames (or more than two image frames), which are captured from the image sensors. Disparity estimation is used in virtual reality systems, object tracking and recognition systems, and depth-image based rendering systems to determine depths of objects in image frames.
Disparity estimation can be performed between rectified stereo images, meaning the corresponding points of the rectified stereo images reside along a same row (or in a same line in a Y-direction). FIG. 1 shows stereo images (a left image frame 100 and a right image frame 102) received from two image sensors. A point 104 of an object 106 shown in the left image frame 100 is at a different position than the same point 104 of the object 106 as shown in the right image frame 102. The point 104 is shifted to the left in an X-direction from a position X to a position X-d. A dashed line view 108 of the object 106 is shown in the right image frame 102 at a position of the object 106 as shown in the left image frame 100 to illustrate the differences in the positioning of the object between the left image frame 100 and the right image frame 102. The shift d, in terms of pixel position, is the disparity between the position of the point 104 in the right image frame 102 and the position of the same point 104 in the left image frame 100.
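Because the frames are rectified, the disparity d for a point can be found by searching along a single row of the right frame for the best match to a small window around the point in the left frame. The following is a minimal sketch of that idea, not an implementation from the source; the function name, window size, and sum-of-absolute-differences cost are illustrative assumptions.

```python
def find_disparity(left_row, right_row, x, max_d, win=2):
    """Search one rectified row for the shift d (0..max_d) that best matches
    the window around column x of the left row, using a sum of absolute
    differences (SAD) cost. Illustrative sketch only."""
    best_d, best_cost = 0, float("inf")
    patch = left_row[x - win : x + win + 1]
    for d in range(max_d + 1):
        xr = x - d  # candidate column in the right row (shifted left by d)
        if xr - win < 0:
            break  # window would fall off the left edge of the row
        cand = right_row[xr - win : xr + win + 1]
        cost = sum(abs(a - b) for a, b in zip(patch, cand))
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

# Toy example: a bright feature at column 10 of the left row appears at
# column 7 of the right row, so the estimated disparity is 3 pixels.
left = [0] * 10 + [9, 9, 9] + [0] * 7
right = [0] * 7 + [9, 9, 9] + [0] * 10
print(find_disparity(left, right, x=10, max_d=5))  # → 3
```

This mirrors the relationship in FIG. 1: a point at position X in the left frame appears at position X-d in the right frame.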
Disparity estimation processing can be challenging due to sizes of the local support regions, radiometric variations, texture-less regions, depth discontinuity regions, etc. Designing a stereo vision matching system with a good balance between accuracy and efficiency remains a challenging problem. Current disparity estimation algorithms can be classified into local algorithms and global algorithms. Local algorithms compute disparity values for each pixel within a selected local region of an image frame. Global algorithms compute disparity values for each pixel of a whole image frame.
The local algorithms perform well for well-textured simple image frames, but not as well for natural (or complex) image frames. A well-textured image frame refers to an image frame that has a large amount of luminance change across a local region of the image frame. An image frame or local region of an image frame is defined as being complex if the image frame and/or local region has a homogeneous region for which it is difficult to determine disparity values. Natural image frames are complex due to (i) different sensor response curves provided by image sensors for the environment captured, and (ii) different exposure times of the image sensors. The sensor response curves are based on location, movement and/or exposure control of the image sensors. Global algorithms perform better than local algorithms for stereo vision matching of complex images. Global algorithms treat stereo vision matching as an energy minimization problem and obtain global disparity allocation via optimization methods such as dynamic programming (DP), graph cuts (GC), and belief propagation (BP).
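The energy-minimization view can be sketched for a single scanline using dynamic programming (DP), one of the optimization methods named above. The sketch below is a simplified illustration under stated assumptions, not any particular patented or published algorithm: the energy is an absolute-difference data cost per pixel plus a smoothness penalty proportional to the disparity change between neighboring pixels.

```python
def scanline_dp(left_row, right_row, max_d, smooth=1.0):
    """Minimize, over one rectified scanline, an energy consisting of a
    per-pixel matching cost plus a smoothness penalty |d_p - d_q| between
    neighbors. Returns one disparity per pixel. Illustrative sketch only."""
    n = len(left_row)
    INF = float("inf")

    def data_cost(x, d):
        xr = x - d  # matching column in the right row
        return abs(left_row[x] - right_row[xr]) if xr >= 0 else INF

    # cost[x][d]: minimal energy of labeling pixels 0..x with pixel x at
    # disparity d; back[x][d]: the minimizing disparity of pixel x-1.
    cost = [[0.0] * (max_d + 1) for _ in range(n)]
    back = [[0] * (max_d + 1) for _ in range(n)]
    for d in range(max_d + 1):
        cost[0][d] = data_cost(0, d)
    for x in range(1, n):
        for d in range(max_d + 1):
            best, arg = INF, 0
            for dp in range(max_d + 1):
                c = cost[x - 1][dp] + smooth * abs(d - dp)
                if c < best:
                    best, arg = c, dp
            cost[x][d] = best + data_cost(x, d)
            back[x][d] = arg

    # Backtrack the globally minimal disparity assignment for the scanline.
    d = min(range(max_d + 1), key=lambda k: cost[n - 1][k])
    out = [d]
    for x in range(n - 1, 0, -1):
        d = back[x][d]
        out.append(d)
    return out[::-1]
```

Graph cuts (GC) and belief propagation (BP) minimize similar energies over the full two-dimensional image rather than one scanline, which is part of why global algorithms demand more memory and computation.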
Implementation of local algorithms is computationally less expensive than implementation of global algorithms due to the use of less memory and lower processing requirements. Recent advances in adaptive selection of local regions of image frames have improved results of using a local algorithm for a complex image frame while requiring less memory and lower processing requirements than a global algorithm. Adaptive selection can include selection of size and location of a region within an image frame. The adaptive selection of local regions allows regions that are less homogeneous and/or not homogeneous to be selected to aid in determining disparity values.
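Adaptive selection of a local region can be illustrated with a simple texture test: grow the support region around a pixel until it contains enough luminance variation to avoid being homogeneous. This is a hypothetical sketch, not the adaptive selection method of any specific system; the function name, the range-based texture measure, and the thresholds are assumptions.

```python
def adaptive_half_width(row, x, min_range=4, max_win=8):
    """Return the smallest window half-width around column x whose window
    spans at least `min_range` of luminance change (a crude texture test),
    capped at `max_win`. Illustrative sketch only."""
    for win in range(1, max_win + 1):
        lo = max(0, x - win)
        hi = min(len(row), x + win + 1)
        seg = row[lo:hi]
        if max(seg) - min(seg) >= min_range:
            return win  # window is textured enough to support matching
    return max_win  # region stayed homogeneous; fall back to the largest size

row = [5, 5, 5, 5, 5, 5, 9, 5, 5, 5]
print(adaptive_half_width(row, 2))  # homogeneous neighborhood → grows to 4
print(adaptive_half_width(row, 5))  # strong edge adjacent → stays at 1
```

Pixels in flat areas receive larger support regions that reach nearby texture, while pixels near edges keep small regions, which is the intuition behind selecting regions that are less homogeneous to aid disparity estimation.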