Stereo matching refers generally to a method for processing two or more images in an attempt to recover information about the objects portrayed in the images. Since each image is only two-dimensional, it does not convey the depth of the objects portrayed in the image relative to the camera position. However, it is possible to recover this depth information by processing two or more images of the same object taken from cameras located at different positions around the object. There are two primary elements to extracting depth information: 1) finding picture elements (pixels) in each image that correspond to the same surface element on an object depicted in each image; and 2) using triangulation to compute the distance between the surface element and one of the cameras. Knowing the camera positions and the corresponding picture elements, one can trace a ray from each camera through its corresponding picture element and find the intersection point of the rays, which gives the location of the surface element in three-dimensional (3D) space. After computing this intersection point, one can then compute the distance or "depth" of the surface element relative to one of the cameras.
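The ray-tracing step described above can be sketched as follows. This is only an illustration under stated assumptions, not a method taken from the text: the function names `triangulate_midpoint` and `depth_from_camera` are hypothetical, each ray is assumed to be given as a camera center plus a direction vector, and, because measured rays rarely intersect exactly, the sketch returns the midpoint of the shortest segment joining the two rays.

```python
import math


def _dot(a, b):
    """Dot product of two 3-vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Approximate the intersection of rays c1 + s*d1 and c2 + t*d2.

    c1, c2 are camera centers and d1, d2 are ray directions (assumed inputs).
    Returns the midpoint of the shortest segment between the two rays.
    """
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")
    s = (b * e - c * d) / denom  # parameter along the first ray
    t = (a * e - b * d) / denom  # parameter along the second ray
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


def depth_from_camera(point, camera_center):
    """Euclidean distance from a camera center to a triangulated point."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, camera_center)))


# Two cameras one unit apart, both looking at a surface element at (0.5, 0, 2).
point = triangulate_midpoint([0, 0, 0], [0.5, 0, 2], [1, 0, 0], [-0.5, 0, 2])
depth = depth_from_camera(point, [0, 0, 0])
```

In this example the two rays meet exactly, so the midpoint coincides with the true surface point; with noisy correspondences the midpoint is only one of several reasonable estimates.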
The difficult part of this method is finding matching picture elements in two or more input images. In the field of computer vision, this problem is referred to as stereo matching or stereo correspondence. Finding matching picture elements or "pixels" is difficult because many pixels in each image have the same color.
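The ambiguity can be illustrated with a small sketch. Everything here is an assumption made for the example rather than something stated in the text: the function name `ssd_costs`, the use of single grayscale scanlines, the sum-of-squared-differences cost over a small window, and the convention that a pixel at column x in the left image appears at column x - d in the right image.

```python
def ssd_costs(left, right, x, half_window, max_disparity):
    """Sum-of-squared-differences cost for each candidate disparity d.

    Compares a window centered at left[x] with a window centered at
    right[x - d]. Candidates whose window falls outside either scanline
    get an infinite cost.
    """
    costs = []
    for d in range(max_disparity + 1):
        cost = 0.0
        for k in range(-half_window, half_window + 1):
            xl = x + k
            xr = xl - d
            if not (0 <= xl < len(left) and 0 <= xr < len(right)):
                cost = float("inf")
                break
            cost += (left[xl] - right[xr]) ** 2
        costs.append(cost)
    return costs


# The left scanline is the right scanline shifted by two pixels.
right = [10, 10, 10, 10, 50, 90, 10, 10, 10, 10]
left = [10, 10, 10, 10, 10, 10, 50, 90, 10, 10]

textured = ssd_costs(left, right, x=6, half_window=1, max_disparity=3)
uniform = ssd_costs(left, right, x=3, half_window=1, max_disparity=3)
# textured -> [14400.0, 9600.0, 0.0, 3200.0]: one clear minimum at d = 2
# uniform  -> [1600.0, 0.0, 0.0, inf]: d = 1 and d = 2 match equally well
```

At the textured pixel the cost has a single minimum at the true shift, while at the pixel in the uniformly colored region two candidate disparities are indistinguishable.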
In the past, researchers have studied the stereo matching problem in an attempt to recover depth maps and shape models for robotics and object recognition applications. Stereo matching is relevant to these applications because it can be used to compute the distances or "depths" of visible surface elements relative to a camera from two or more input images. These depth values are analogous to the depths of surface elements on a 3D object (sometimes referred to as the Z coordinate in an (x,y,z) coordinate system) in the field of computer graphics. Depth or "z" buffers are a common part of 3D graphics rendering systems used to determine which surface elements of 3D objects are visible while rendering a 3D scene into a two-dimensional image.
The term "disparity" is often used in the computer vision field and represents the change in position of a surface element on an object when viewed through different cameras positioned around the object. Since disparity is mathematically related to depth from the camera (for two parallel, calibrated cameras, for example, depth is inversely proportional to disparity), the two can be used interchangeably. In other words, once one has determined disparity, it is straightforward to convert it into a depth value.
A typical stereo matching algorithm will attempt to compute the disparities for visible surface elements. These disparities can be converted into depths to compute a depth map, an array of depth values representing the depth of visible surface elements depicted in an image.
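As a rough sketch of this conversion, the following assumes a rectified pair of identical pinhole cameras, for which depth equals focal length times baseline divided by disparity. The text does not fix a camera model, so the formula, the function name `disparity_to_depth`, and the parameters `focal_length_px` and `baseline_m` are assumptions made for illustration.

```python
def disparity_to_depth(disparity_map, focal_length_px, baseline_m):
    """Convert a 2D list of disparities (pixels) to depths (meters).

    Assumes rectified pinhole cameras: depth = f * B / d.
    A disparity of zero is mapped to infinite depth.
    """
    depth_map = []
    for row in disparity_map:
        depth_row = []
        for d in row:
            if d > 0:
                depth_row.append(focal_length_px * baseline_m / d)
            else:
                depth_row.append(float("inf"))
        depth_map.append(depth_row)
    return depth_map


# Hypothetical values: 400-pixel focal length, cameras 0.5 m apart.
depths = disparity_to_depth([[20, 40], [0, 10]], focal_length_px=400, baseline_m=0.5)
# depths -> [[10.0, 5.0], [inf, 20.0]]
```

Larger disparities correspond to nearer surfaces, which is why a one-pixel disparity error matters more for distant surface elements than for near ones.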
Recently, depth maps recovered from stereo images have been painted with texture maps extracted from the input images to create realistic 3D scenes and environments for virtual reality and virtual studio applications. A "texture map" is another term commonly used in computer graphics referring to a method for mapping an image to the surface of 3D objects. This type of stereo matching application can be used to compute a 3D virtual environment from a video sequence. In a game, for example, this technology could be used to create the effect of "walking through" a virtual environment and viewing objects depicted in a video sequence from different viewing perspectives using a technique called view interpolation. View interpolation refers to a method for taking one image and simulating what it would look like from a different viewpoint. In another application called z-keying, this technology can be used to extract depth layers of video objects and then insert graphical objects between the depth layers. For example, z-keying can be used to insert computer-generated animation in a live video sequence.
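The z-keying idea can be sketched as a per-pixel depth comparison between the video frame and the graphical object. This is a simplified illustration under assumptions not found in the text: the function name `z_key` is hypothetical, images are small 2D lists, smaller depth values are nearer the camera, and a graphic color of `None` marks pixels the graphical object does not cover.

```python
def z_key(video_colors, video_depths, graphic_colors, graphic_depths):
    """Insert a graphical object into a video frame using depth values.

    For each pixel, the graphic is shown only where it exists and is
    nearer to the camera than the video surface at that pixel.
    """
    output = []
    rows = zip(video_colors, video_depths, graphic_colors, graphic_depths)
    for vc_row, vd_row, gc_row, gd_row in rows:
        out_row = []
        for vc, vd, gc, gd in zip(vc_row, vd_row, gc_row, gd_row):
            if gc is not None and gd < vd:
                out_row.append(gc)  # graphic is in front of the video surface
            else:
                out_row.append(vc)  # video surface occludes the graphic
        output.append(out_row)
    return output


# One scanline: a foreground object at depth 1, background at depth 5,
# and a graphic at depth 3 covering the first two pixels.
frame = z_key(
    [["fg", "bg", "bg"]],
    [[1.0, 5.0, 5.0]],
    [["g", "g", None]],
    [[3.0, 3.0, 3.0]],
)
# frame -> [["fg", "g", "bg"]]
```

Because each pixel receives a hard in-front or behind decision, any error in the recovered depth map shows up directly in the composited frame.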
Unfortunately, the quality and resolution of most stereo algorithms are insufficient for these types of applications. Even isolated errors in the depth map become readily visible when synthetic graphical objects are inserted between extracted foreground and background video objects.
One of the most common types of errors occurs when stereo algorithms attempt to compute depth values at the boundary where a foreground object occludes a background object (the occlusion boundary). Some stereo algorithms tend to "fatten" depth layers near these boundaries, which causes errors in the depth map. Stereo algorithms based on variable window sizes or iterative evidence aggregation can in many cases reduce these types of errors (T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Trans. Patt. Anal. Machine Intel., 16(9):920-932, September 1994; D. Scharstein and R. Szeliski. Stereo matching with non-linear diffusion. In Computer Vision and Pattern Recognition (CVPR '96), pages 343-350, San Francisco, Calif., June 1996). Another problem is that stereo algorithms typically only estimate disparity values to the nearest pixel, which is often not sufficiently accurate for tasks such as view interpolation.
While pixel level accuracy is sufficient for some stereo applications, it is not sufficient for challenging applications such as z-keying. Pixels lying near occlusion boundaries will typically be "mixed" in the sense that they contain a blend of colors contributed by the foreground and background surfaces. When mixed pixels are composited with other images or graphical objects, objectionable "halos" or "color bleeding" may be visible in the final image.
The computer graphics and special effects industries have faced similar problems extracting foreground objects in video using blue screen techniques. The term blue screen generally refers to a method for extracting an image representing a foreground object from the rest of an image. A common application of this technique is to extract the foreground image and then superimpose it onto another image to create special effects. For example, a video sequence of a spaceship can be shot against a blue background so that the spaceship's image can be extracted from the blue background and superimposed onto another image (e.g., an image depicting a space scene). The key to this approach is that the background or "blue screen" consists of a known, uniform color and can therefore be easily distinguished from the foreground image.
Despite the fact that the background color is known, blue screen techniques still suffer from the same problem of mixed pixels at the occlusion boundary of the foreground object (e.g., the perimeter of the spaceship in the previous example). To address the problems of mixed pixels in blue screen techniques, researchers in these fields have developed techniques for modeling mixed pixels as combinations of foreground and background colors. However, it is insufficient to merely label pixels as foreground and background because this approach does not represent a pixel's true color and opacity.
The term "opacity" (sometimes referred to as "transparency" or "translucency") refers to the extent to which an occluded background pixel is visible through the occluding foreground pixel at the same pixel location. An image comprises a finite number of pixels arranged in a rectangular array. Each pixel, therefore, covers an area in two-dimensional screen coordinates. It is possible for sub-pixel regions of pixels at occlusion boundaries to map to surface elements at different depths (e.g., a foreground object and a background object). It is also possible for a pixel to represent a translucent surface such as a window that reflects some light and also allows light reflected from a background object to pass through it. In order for a pixel to represent the foreground and background colors accurately, it should represent the proper proportion of foreground and background colors in its final color values. The opacity value can be used to represent the extent to which a pixel is composed of colors from foreground and background surface elements.
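One common way to express this proportion is a linear blend of the foreground and background colors weighted by the opacity. The sketch below is an illustration under assumptions rather than a formula given in the text: the function name `blend_mixed_pixel` is hypothetical, colors are (R, G, B) tuples, and `opacity` is taken to be the fraction of the pixel contributed by the foreground, from 0 (fully transparent) to 1 (fully opaque).

```python
def blend_mixed_pixel(foreground, background, opacity):
    """Blend foreground and background colors by a foreground opacity.

    Each channel is computed as opacity * F + (1 - opacity) * B.
    """
    if not 0.0 <= opacity <= 1.0:
        raise ValueError("opacity must lie in the range [0, 1]")
    return tuple(
        opacity * f + (1.0 - opacity) * b
        for f, b in zip(foreground, background)
    )


# A boundary pixel covered one quarter by a red foreground surface and
# three quarters by a blue background surface.
mixed = blend_mixed_pixel((200, 0, 0), (0, 0, 100), opacity=0.25)
# mixed -> (50.0, 0.0, 75.0)
```

An opacity of 1 reproduces the foreground color and an opacity of 0 reproduces the background color, so a binary foreground or background label is the special case in which no pixel is mixed.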
As alluded to above, one way to approximate opacity is merely to assume some predefined blending factor for computing colors of mixed pixels. While this type of blending foreground and background colors can make errors at the occlusion boundaries less visible for some applications, it does not remove the errors and is insufficient for demanding applications such as z-keying. Moreover, in the context of stereo matching, the background colors are usually not known. A stereo matching method has to attempt to distinguish background and foreground colors before "mixed" pixels can be computed.