The correspondence problem in stereo vision has been well studied for the past several decades. While great progress has been made over the years, and especially in the past decade or so, there is room for further improvement. Impressive results have been reported in recent years for the sub-class of problems that involve only or predominantly fronto-parallel surfaces. A current snapshot of the state of the art in stereo vision algorithms can be found at the popular Middlebury web site. However, the general class of problems involving arbitrary surfaces at arbitrary orientations remains difficult to solve. Lighting changes and other environmental factors further complicate the problem.
The earliest and most straightforward techniques for stereo matching are the window-based techniques, in which a small window is held fixed around a pixel in one image and moved along the corresponding epipolar line in the other image to find the corresponding pixel according to some matching criterion. The simplest matching criterion in the literature is to minimize the sum of absolute intensity differences (SAD) between the two windows. Such window-based matching approaches have been shown to produce noisy results, due to several factors: (1) the affine nature of the motion in the image plane, (2) depth discontinuities in the scene, (3) occlusions and disocclusions, (4) specular reflections, and (5) sensor noise. One idea suggested in the literature to alleviate the noise issue is to use variably-sized windows for different pixels. This idea, however, raises the question of how to choose the correct window size for a given pixel in general scenes, an issue that has not been satisfactorily resolved.
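To make the window-based scheme concrete, the following is a minimal NumPy sketch of SAD matching along horizontal epipolar lines, assuming rectified images; the function name, window size, and disparity search range are illustrative choices, not part of any cited method.

```python
import numpy as np

def sad_disparity(left, right, window=5, max_disp=32):
    """Naive window-based stereo matching. For each pixel in the left image,
    a fixed window is compared against windows slid along the same scanline
    of the right image, and the disparity minimizing the sum of absolute
    differences (SAD) is kept. Rectified images are assumed, so epipolar
    lines are horizontal rows."""
    h, w = left.shape
    r = window // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
            best_cost, best_d = np.inf, 0
            # search the right scanline over the candidate disparity range
            for d in range(0, min(max_disp, x - r) + 1):
                cand = right[y - r:y + r + 1,
                             x - d - r:x - d + r + 1].astype(np.float64)
                cost = np.abs(patch - cand).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

As the text notes, results from such a scheme are noisy near depth discontinuities and occlusions, since the fixed window straddles surfaces at different depths.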
Recent approaches such as graph cuts and belief propagation are (or can be) cast without windows in their data term. Instead, they invoke the Markovian assumption and treat the disparity field as a Markov random field, usually with first-order cliques for simplicity. The field is then modeled with a data term and a smoothness term: the data term is, for example, a pixel-based SAD score, the smoothness term is usually the Potts model, and clever optimization strategies are devised to solve the resulting minimization problem. Because the pixel-based SAD score is noisy, robust distance metrics such as truncated distance metrics are introduced; choosing the truncation is an art. Likewise, for a given pixel in the left image, the number of pixels in the right image to be searched for a match is a user-specified parameter whose choice is an art. These schemes typically take a large amount of wall-clock compute time to produce a disparity map for a typically sized image, and they produce less satisfactory results for scenes involving non-fronto-parallel surfaces. The reason for this behavior likely lies in the smoothness term, since the data term comes from actual data: the smoothness terms currently used may over-constrain the disparity field in the vertical direction for non-fronto-parallel scenes.
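The energy that such approaches minimize can be sketched as follows, assuming a truncated absolute-difference data term and a Potts smoothness term over a 4-connected (first-order) neighborhood; the truncation value `tau` and smoothness weight `lam` are exactly the user-tuned parameters the text describes as an art.

```python
import numpy as np

def mrf_energy(left, right, disp, tau=20.0, lam=10.0):
    """Energy of a candidate disparity field under a simple first-order MRF:
    a truncated absolute intensity difference data term plus a Potts
    smoothness term over horizontal and vertical neighbor pairs.
    tau (truncation) and lam (smoothness weight) are illustrative values."""
    h, w = left.shape
    data = 0.0
    for y in range(h):
        for x in range(w):
            d = disp[y, x]
            xr = max(0, x - d)  # clamp at the image border for simplicity
            data += min(abs(float(left[y, x]) - float(right[y, xr])), tau)
    # Potts model: a unit penalty whenever neighboring disparities differ
    potts = np.count_nonzero(disp[:, 1:] != disp[:, :-1]) \
          + np.count_nonzero(disp[1:, :] != disp[:-1, :])
    return data + lam * potts
```

Graph cuts and belief propagation are different strategies for (approximately) minimizing this same energy over all disparity fields; the sketch above only evaluates it for one candidate field.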
The level set method was introduced by Osher and Sethian to solve front propagation problems. They designed the so-called numerical fluxes for the level set equation by applying ideas originally developed for hyperbolic conservation law solvers. Their design of the numerical flux implies that shock waves and expansion waves occur only at the local extrema of the level set function. Shock waves and expansion waves correspond to discontinuities in the speed function that appears in the level set equation, and they arise in several hyperbolic flow situations. In traffic flow, for example, a shock wave develops when a slow-moving vehicle is approached by fast-moving vehicles, and an expansion wave develops when fast-moving vehicles leave (or pass) a slow-moving vehicle. In optical flow and stereo vision, occlusion and disocclusion correspond to shock waves and expansion waves, respectively. In the level set method, the level set function is typically initialized to the signed distance from the initial front, and the quantity of interest is the propagation of this front over time under a known speed function. The theory behind the level set method is not, however, restricted to the signed distance function.
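For reference, a minimal 1-D sketch of the first-order upwind numerical flux of the kind described above, for the level set equation phi_t + F |phi_x| = 0: the one-sided differences are selected by the sign of the speed F so that information flows from the correct side. Periodic boundaries via `np.roll` are a simplifying assumption for brevity.

```python
import numpy as np

def level_set_step(phi, F, dx, dt):
    """One explicit time step of the 1-D level set equation
    phi_t + F * |phi_x| = 0 with an upwind (Godunov-type) numerical flux.
    phi: level set values on a uniform periodic grid; F: speed (scalar or
    per-point array); dx, dt: grid spacing and time step."""
    dminus = (phi - np.roll(phi, 1)) / dx    # backward difference D-
    dplus  = (np.roll(phi, -1) - phi) / dx   # forward difference D+
    # upwind gradient magnitudes for F > 0 and F < 0 respectively
    grad_plus  = np.sqrt(np.maximum(dminus, 0.0)**2 + np.minimum(dplus, 0.0)**2)
    grad_minus = np.sqrt(np.minimum(dminus, 0.0)**2 + np.maximum(dplus, 0.0)**2)
    grad = np.where(F > 0, grad_plus, grad_minus)
    return phi - dt * F * grad
```

With this flux, kinks in phi (the local extrema of the one-sided differences) are where shocks and expansions form, mirroring the occlusion/disocclusion interpretation in the text.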
Both optical flow and stereo vision can be characterized as inverse problems, since the objective is to find the optical flow velocity vector or the disparity map given two images. In this work, the stereo vision problem is interpreted as a 1-D optical flow problem. Though this interpretation was used earlier by Scheuing and Niemann, the present invention and algorithm are completely different from their methodology.
By considering the image intensity field as the level set function, one can apply the theoretical framework of the level set method to solving optical flow and stereo vision problems. For example, according to the level set method, discontinuities in depth can occur only at the extrema of the image intensity field. This is because the speed function in the level set method can be expressed in terms of optical flow velocity or disparity in the case of stereo vision and it is well-known that both optical flow velocity and disparity can be expressed in terms of depth. In other words, the level set method guarantees that the speed function (and thereby depth) is smooth away from the extrema of the level set function. Thus, for stereo vision, if one knows the disparity at two consecutive extrema of the image intensity field along an epipolar line, then one can obtain the disparity values at in-between pixels using interpolation since they vary smoothly at these pixels.
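The interpolation step that this argument licenses can be sketched as follows, assuming the disparities at the intensity extrema of a scanline are already known; the dictionary-based interface and linear interpolation are illustrative choices.

```python
import numpy as np

def interpolate_disparity(intensity, extrema_disp):
    """Fill in disparity along one scanline, given disparity values known
    only at the local extrema of the scanline's intensity profile.
    Per the level-set argument in the text, disparity (and hence depth)
    varies smoothly between consecutive intensity extrema, so linear
    interpolation between them is a reasonable first approximation.
    extrema_disp: dict {pixel index of extremum: disparity} (illustrative)."""
    idx = np.array(sorted(extrema_disp))
    vals = np.array([extrema_disp[i] for i in idx], dtype=float)
    x = np.arange(len(intensity))
    # np.interp clamps to the end values outside the extrema range
    return np.interp(x, idx, vals)
```

Higher-order interpolation could equally be used; the essential point from the text is only that the in-between values are smooth, so any smooth interpolant is admissible there.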
The earliest works that are conceptually similar to the present approach are those of Marr and Poggio, Baker and Binford, and Ohta and Kanade. These approaches are categorized as primitive matching approaches in the literature. The key ideas in these works are to (1) find primitives (usually edges) in both the left and right images, and (2) match the edges. Both Marr and Poggio and Baker and Binford use the zero-crossings of the second derivative of the image intensity field and use fixed-size windows to match the pixel locations of the edges along scanlines. Ohta and Kanade use the peaks and valleys of the first derivative of the image intensity field, with a matching cost defined in terms of the variance of the combined set of image intensity values from both intervals, in the left and right scanlines, between the pairs of edges under consideration. The Ohta and Kanade approach is thus an interval matching technique; since the matched intervals need not be of the same length, the matching cost must be defined using a statistic such as the variance rather than a pointwise difference.
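The variance-based interval cost can be sketched as follows; the function signature is illustrative, and this is only the cost evaluation, not Ohta and Kanade's full dynamic-programming search.

```python
import numpy as np

def interval_match_cost(left_scanline, right_scanline, li, lj, ri, rj):
    """Interval matching cost in the spirit of Ohta and Kanade: the cost of
    matching the left interval [li, lj) against the right interval [ri, rj)
    is the variance of the pooled intensity values from both intervals.
    Because the two intervals may have different lengths, a pooled statistic
    is used instead of a pointwise difference."""
    pooled = np.concatenate([left_scanline[li:lj],
                             right_scanline[ri:rj]]).astype(float)
    return pooled.var()  # population variance of the combined samples
```

A perfect match of two constant intervals with the same intensity gives cost zero; the cost grows as the pooled intensities spread apart.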
Marr and Poggio also suggest the use of interpolation to fill in the disparity values at in-between pixels once the disparities have been computed at the edges. However, interpolation is performed only in the image in which the "hidden" discontinuity is actually hidden, that is, the image in which what is visible varies smoothly. In level set terminology, if the motion is from left to right and the discontinuity is hidden in the left image, this corresponds to an expansion wave, and interpolation is permissible in such regions. Marr and Poggio thus had the correct insight and the correct physical argument about a decade before the level set method was introduced. The level set method, however, uses the extrema of the gradient of the image intensity field, which are typically found using the product of forward and backward difference operators.
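The product-of-differences test mentioned above can be sketched as follows. A sample is an extremum of a 1-D signal where the product of the forward and backward differences is negative (the slope changes sign); applied to intensity values it flags intensity extrema, and the same operator applied to gradient values flags gradient extrema. Plateau and boundary handling are omitted for brevity.

```python
import numpy as np

def local_extrema(signal):
    """Indices of strict local extrema of a 1-D signal, found where the
    product of the forward difference and the backward difference is
    negative. Boundary samples and flat plateaus are ignored in this
    minimal sketch."""
    s = np.asarray(signal, dtype=float)
    fwd = s[2:] - s[1:-1]    # forward difference at interior samples
    bwd = s[1:-1] - s[:-2]   # backward difference at interior samples
    return np.where(fwd * bwd < 0)[0] + 1  # +1 shifts back to original index
```

This single test captures both peaks (positive-then-negative slope) and valleys (negative-then-positive slope) without treating the two cases separately.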
Lowe developed the SIFT (Scale Invariant Feature Transform) approach for detecting features that are persistent across image transformations. He first identifies strict local extrema in scale space and then encodes the image region around each such extremum into a high-dimensional feature vector. He then uses a variation of the k-d tree algorithm called the best-bin-first method to determine nearest neighbors in the feature space in order to match key-points across two images. The present approach is conceptually and algorithmically simpler and computationally much faster than Lowe's method. Also, the present method is concerned with computing dense disparity fields.
Tuytelaars and Van Gool identify regions around local image intensity extrema and encode these regions into a feature vector. Then they use these feature vectors along with other feature vectors identified with other criteria such as corner detectors in an opportunistic manner to find matches across images. They use nearest neighbor with respect to the Mahalanobis-distance in the feature space as the criterion for establishing matches. They apply this approach for wide-baseline stereo and show coarse correspondences between image regions.
As discussed, Ohta and Kanade and Marr and Poggio use different aspects that are consistent with the level set method. However, in order to find image intensity extrema, Ohta and Kanade use several central difference operators of different sizes and combine their results, giving priority to the smaller-sized operators. They adopt this strategy because smaller-sized operators yield better edge localization. They include the edges detected by the larger-sized operators only if the smaller ones do not find any edge within the stencil of the larger operator. They also neglect extrema where the absolute value of the image intensity gradient is below a threshold value.
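One plausible reading of this strategy is sketched below. It is an illustrative reconstruction, not Ohta and Kanade's exact procedure: the operator half-widths, the gradient threshold, and the "no smaller-operator edge within the stencil" rule are all assumed parameters.

```python
import numpy as np

def multiscale_edges(scanline, half_widths=(1, 2, 4), thresh=5.0):
    """Sketch of a multi-operator edge strategy: central differences of
    several half-widths propose edges; smaller operators take priority,
    and a larger operator's edge is kept only if no already-kept edge lies
    within its stencil. Candidates whose gradient magnitude falls below
    thresh are discarded. All parameter values are illustrative."""
    s = np.asarray(scanline, dtype=float)
    n = len(s)
    kept = []
    for h in sorted(half_widths):          # smallest operators first
        grad = np.zeros(n)
        grad[h:n - h] = (s[2 * h:] - s[:n - 2 * h]) / (2 * h)
        mag = np.abs(grad)
        for i in range(1, n - 1):
            # local maximum of |grad| above the threshold is a candidate edge
            if mag[i] >= thresh and mag[i] >= mag[i - 1] and mag[i] > mag[i + 1]:
                # keep only if no previously kept edge sits in this stencil
                if all(abs(i - j) > h for j in kept):
                    kept.append(i)
    return sorted(kept)
```

The priority rule makes the well-localized small-operator edges authoritative, while the larger operators only contribute edges in regions the small operators left empty, matching the rationale described in the text.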
Based on the prior art, there exists a need for a method for simple, accurate, efficient, and fast front feature matching stereo vision image processing. The present invention provides such a method and system.