The present invention is directed to the automatic detection of areas of correspondence in two or more images, and more particularly to a robust technique for detecting correspondence of points which lie on occluding boundaries.
In the field of computer vision, there exist a variety of situations in which it is desirable to be able to automatically identify corresponding points in multiple images. Examples of these situations include (1) object tracking, in which the location of a given object is identified over a sequence of images, (2) image morphing, in which corresponding feature points must be identified on two images, to define the beginning and ending constraints for the morph, (3) rotoscoping, in which the outline of an object is determined over a sequence of images, so that the object can be selectively segregated from the remainder of the scene, and (4) estimation of camera motion, for special effects and the like. Initially, the identification of corresponding points, or features, in related images was done manually. It can be appreciated that the need to manually annotate each of a series of images can be quite time consuming, and therefore does not facilitate the implementation of any of the foregoing techniques in lengthy video sequences, or other such situations involving more than a few related images.
To this end, therefore, automatic techniques for locating corresponding points in different images have been developed. Many of the earlier techniques are based on the assumption that the brightness of the relevant features in the image remains constant over the various images. Techniques that are based upon this assumption perform best when tracking high-contrast regions that lie on a single surface. However, many images have visually important features that do not follow this assumption. For instance, if one desires to track complicated objects with multiple articulated surfaces, such as the human body, a technique is required which is capable of identifying corresponding points that lie on occluding boundaries, e.g., the apparent interface of features which are at different depths from the viewer.
Recently, robust estimation methods have been applied to the image correspondence problem, and have been shown to provide improved performance in cases where the points to be identified include occlusion. For instance, a robust optic flow method using redescending error norms that substantially discount the effect of outliers was described by M. Black and P. Anandan in "A Framework for Robust Estimation of Optical Flow," 4th Proc. ICCV, pp. 263-274, 1993. Methods for transparent local flow estimation are described in Shizawa et al., "Simultaneous Multiple Optical Flow Estimation," Proc. CVPR, 1990. The use of rank statistics for robust correspondence is disclosed by D. Bhat and S. Nayar in "Ordinal Measures for Visual Correspondence," Proc. CVPR, pp. 351-357, 1994. Another technique using ordering statistics, combined with spatial structure in the CENSUS transform, is described in R. Zabih and J. Woodfill, "Non-parametric Local Transforms for Computing Visual Correspondence," Proc. 3rd ECCV, pp. 151-158, 1994. Yet another approach uses methods of finding image "layers" to pool motion information over arbitrarily shaped regions of support and to iteratively refine parameter estimates. Examples of this approach are described in T. Darrell and A. Pentland, "Robust Estimation of a Multi-Layer Motion Representation," Proc. IEEE Workshop on Visual Motion, Princeton, N.J., 1991; J. Wang and E. H. Adelson, "Layered Representations for Image Sequence Coding," Proc. CVPR, 1993; and S. Ayer and H. Sawhney, "Layered Representation of Motion Video Using Robust Maximum Likelihood Estimation of Mixture Models and MDL Encoding," Proc. ICCV, 1995. These latter approaches rely upon models of global object motion to define coherence.
While the foregoing techniques provide varying degrees of acceptable performance in the case of tracking occluding features, they all rely upon the assumption that there exists sufficient contrast in the foreground object to localize a correspondence match. In many cases, however, this assumption does not apply, for instance due to uniform foreground surfaces or low-resolution video sampling. An example of this problem is illustrated in FIGS. 1A and 1B. In these examples, a foreground object 10 having no internal contrast moves from a relatively dark background 12 (FIG. 1A) to an area that provides a relatively light background 14 (FIG. 1B). This example may correspond to the situation in which a uniformly colored object, such as a human finger, moves in front of differently colored background objects. As a result of this movement, the contrast at the occlusion boundary, i.e., the edge of the object 10, changes sign between the two images. An analysis window is indicated by the reference A in FIG. 1A. When attempting to identify a corresponding area in FIG. 1B, the prior art techniques described above are equally likely to identify either of the areas B or C as corresponding to the area A, due to the lack of internal contrast within the object.
Many of the foregoing correspondence methods are not able to adequately deal with the situation in which there is no coherent foreground contrast. In these types of situations, transparent-motion analysis has been employed to detect the motion of the object. However, these techniques have not been able to provide precise spatial localization of corresponding points. In some cases, smoothing methods such as regularization or parametric motion constraints can provide approximate localization when good estimates are available in nearby image regions, but they do not provide consistent results across a variety of situations.
Many detailed image analysis/synthesis tasks require that precise correspondence be found between images. For instance, image compositing, automatic morphing and video resynthesis require accurate correspondence, and slight flaws can yield perceptually significant errors. To minimize the effect of such errors, practitioners of the foregoing prior art techniques have relied upon extreme redundancy of measurement, human-assisted tracking, substantial smoothing, and/or domain-specific feature-appearance models. However, each of these approaches further complicates the process, thereby increasing the time and effort that is required. Even then, they cannot guarantee acceptable results.
It is an objective of the present invention, therefore, to provide a technique for automatically identifying corresponding points in two or more images that is capable of providing good performance near occluding boundaries, particularly when foreground objects contain little or no contrast, and to do so without the need for prior training, smoothing, or pooling of motion estimates.
In accordance with the present invention, the foregoing objective is achieved by defining an image transform, which characterizes the local structure of an image in a manner that is insensitive to points from a different surface, but which recognizes the shape of the occlusion boundary itself. In general, matching is performed on a redundant, local representation of image homogeneity.
In accordance with the invention, a given point of interest in an image is defined by two properties, a local attribute and a neighborhood function that describes a similarity pattern. In one embodiment of the invention, the local attribute can be the color of a pixel, or an area of pixels. Since the color value is not influenced by nearby background regions of the image, it can function in certain cases as a reliable descriptor for each location. The neighborhood function distinguishes locations of similar color from one another, by capturing patterns of change in the local color. The neighborhood function highlights occlusion boundaries, while removing the effects of background pixels. In essence, it measures the similarity between the central local color and colors at nearby points, and reduces the measured similarity values that lie beyond contrast boundaries. This approach reduces the influence of background pixels, since an intervening occlusion boundary is typically characterized by a contrast boundary as well. The remapping of similarity values creates a rapid transition from high to low similarity, which further serves to highlight the occlusion boundaries.
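The neighborhood function described above can be sketched in code. The following is a minimal illustrative sketch, not the claimed implementation: the local attribute is the intensity of the center pixel, the similarity measure is a Gaussian falloff in intensity difference (the width `sigma` and the `cutoff` threshold are hypothetical parameters chosen for illustration), and similarity values at points lying beyond a contrast boundary, detected by sampling along the ray from the center, are suppressed. A final sigmoid remapping sharpens the transition from high to low similarity at the boundary.

```python
import numpy as np

def local_similarity_transform(image, cy, cx, radius=4, sigma=25.0, cutoff=0.5):
    """Illustrative sketch of the described transform for a grayscale image.

    Returns a (2*radius+1, 2*radius+1) map of similarity between the
    center pixel (cy, cx) and its neighbors, with values beyond an
    intervening contrast boundary suppressed.  Parameter values are
    assumptions for illustration, not taken from the invention.
    """
    h, w = image.shape[:2]
    center = float(image[cy, cx])

    def sim(y, x):
        # Gaussian similarity in intensity between (y, x) and the center.
        d2 = (float(image[y, x]) - center) ** 2
        return float(np.exp(-d2 / (2.0 * sigma ** 2)))

    size = 2 * radius + 1
    out = np.zeros((size, size))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            if not (0 <= y < h and 0 <= x < w):
                continue
            # Sample similarity along the ray from the center to (y, x);
            # once it falls below the cutoff, a contrast boundary has been
            # crossed, so everything beyond it is treated as background.
            n = max(abs(dy), abs(dx))
            value = sim(y, x)
            for t in range(1, n + 1):
                sy = cy + round(dy * t / n)
                sx = cx + round(dx * t / n)
                if sim(sy, sx) < cutoff:
                    value = 0.0
                    break
            # Sigmoid remapping: creates the rapid high-to-low transition
            # that highlights the occlusion boundary.
            out[dy + radius, dx + radius] = 1.0 / (1.0 + np.exp(-10.0 * (value - cutoff)))
    return out
```

Note that the suppression step is what reduces the influence of background pixels: a neighbor of similar color that happens to lie on the far side of a contrast edge receives a low value, whereas a naive color-similarity window would score it highly.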
Through the computation of such a transform for points of interest in an image, corresponding points in other images can be readily identified. The results provided by the invention are particularly useful in techniques for tracking object contours, for applications such as rotoscoping. Specific features of the invention are described hereinafter with reference to the illustrative examples depicted in the accompanying drawings.
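The matching step can likewise be sketched. In this hedged, self-contained illustration (a simplified stand-in for the transform above, using raw color similarity without boundary suppression), a candidate point in a second image is scored by the sum-of-squared differences between its transform and that of the point of interest, and the best-scoring candidate within a search window is returned. The function names, window sizes, and parameters are assumptions for illustration only.

```python
import numpy as np

def transform(img, cy, cx, radius=2, sigma=25.0):
    # Simplified stand-in for the transform: Gaussian similarity of each
    # pixel in a square neighborhood to the center pixel's intensity.
    # Assumes (cy, cx) lies at least `radius` pixels from the image border.
    patch = img[cy - radius:cy + radius + 1, cx - radius:cx + radius + 1].astype(float)
    d2 = (patch - float(img[cy, cx])) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def best_match(img1, p1, img2, search=3, radius=2):
    """Return the point in img2 whose transform best matches that of p1 in img1.

    Scans a (2*search+1) square window around p1's coordinates; assumes the
    window stays at least `radius` pixels inside img2's borders.
    """
    cy, cx = p1
    ref = transform(img1, cy, cx, radius)
    best, best_err = None, np.inf
    for y in range(cy - search, cy + search + 1):
        for x in range(cx - search, cx + search + 1):
            cand = transform(img2, y, x, radius)
            err = float(np.sum((cand - ref) ** 2))  # sum-of-squared differences
            if err < best_err:
                best, best_err = (y, x), err
    return best
```

Because the transform encodes the similarity pattern around each point rather than raw intensities, two points match when their neighborhoods exhibit the same pattern of homogeneity, which is what permits localization even when the foreground itself is uniform.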