The term image segmentation denotes the hard cutting of relevant objects out of an image, in the sense that each pixel is assigned to exactly one object. This is in contrast to alpha matting, where a continuous blending function is estimated. The general image segmentation problem is ill-defined, because the definition of what an object actually is depends strongly on the task at hand. This high-level knowledge usually needs to be provided by a human operator.
A simpler scenario is bi-layer segmentation, where a foreground object merely needs to be separated from the scene background. One common solution to this task is background subtraction, where the foreground object is assumed to be moving in front of a static background. In recent years, active depth sensors have become standard components of vision systems and have received much interest from the research community. For image segmentation, the scene depth represents a strong additional cue to color information, as it is independent of the lighting conditions. Additionally, it does not suffer from ambiguous statistical distributions between foreground objects and background scene, which are typically encountered in the color data. However, depth images captured by active sensors typically have a low resolution and are affected by parallax displacement with respect to the corresponding color images, due to the physical distance between their projection centers. Therefore, depth maps and color images need to be registered, which is also a non-trivial task. In O. Wang et al.: “Automatic Natural Video Matting with Depth”, PG '07. Proceedings of the 15th Pacific Conference on Computer Graphics and Applications (2007), pp. 469-472, a super-resolution technique is used to up-sample the depth map of a time-of-flight camera to the resolution of the main color camera. Passive depth estimation techniques such as stereo matching and structure from motion do not suffer from these issues but have their own difficulties. In J. Zhu et al.: “Joint depth and alpha matte optimization via fusion of stereo and time-of-flight sensor”, CVPR 2009. IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 453-460, a sensor setup consisting of a time-of-flight camera in conjunction with a stereo camera is used to combine the robustness of the former with the resolution of the latter.
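For illustration only, the background subtraction idea mentioned above can be sketched in a few lines of Python. This is a minimal sketch and not a method taken from the cited works: it assumes a known, perfectly static background image, and the function name and the threshold value are hypothetical choices that would need tuning for real footage.

```python
import numpy as np

def background_subtraction(frame, background, threshold=30.0):
    """Return a binary foreground mask (True = foreground).

    frame, background: H x W x 3 arrays in the same color space.
    threshold: hypothetical color distance above which a pixel is
               declared foreground (an assumption, tune per scene).
    """
    diff = frame.astype(np.float64) - background.astype(np.float64)
    # Euclidean color distance between the frame and the static background.
    distance = np.linalg.norm(diff, axis=2)
    return distance > threshold

# Toy data: a uniform background and a frame with a 2 x 2 brighter object.
background = np.full((4, 4, 3), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = (200, 180, 160)

mask = background_subtraction(frame, background)
```

Such a per-pixel color test fails wherever the foreground color resembles the background color, which is the kind of ambiguity that a depth cue can help to resolve.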
Despite the resolution and alignment issues, depth maps can be used successfully to extract a rough initial segmentation fully automatically, without any user input. Furthermore, they allow for the construction of a rough trimap, to which alpha matting techniques are directly applied. However, a precise trimap typically yields much better alpha mattes, and although the employed alpha matting schemes have also been extended to exploit the available depth information, their results still suffer from the rather broad initial trimap.
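One plausible way to derive such a rough trimap from a depth map is sketched below. It is an assumption made for illustration, not the procedure of any cited work: the depth map is thresholded into a rough foreground mask, and a band of hypothetical width `band` around the mask boundary is marked as unknown. The names `trimap_from_depth`, `max_fg_depth` and `band`, as well as the label values 0, 128 and 255, are illustrative.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together all shifted copies of the mask inside the square window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def trimap_from_depth(depth, max_fg_depth, band=1):
    """Build a trimap: 255 = foreground, 0 = background, 128 = unknown."""
    rough_fg = depth < max_fg_depth          # rough depth-based segmentation
    sure_fg = ~dilate(~rough_fg, band)       # erosion of the rough foreground
    sure_bg = ~dilate(rough_fg, band)        # everything far from the foreground
    trimap = np.full(depth.shape, 128, dtype=np.uint8)
    trimap[sure_fg] = 255
    trimap[sure_bg] = 0
    return trimap

# Toy depth map: a near 3 x 3 object (1 m) in front of a far wall (5 m).
depth = np.full((7, 7), 5.0)
depth[2:5, 2:5] = 1.0
trimap = trimap_from_depth(depth, max_fg_depth=3.0, band=1)
```

The wider the band has to be chosen to absorb the resolution and registration errors of the depth map, the larger the unknown region that the alpha matting stage must resolve.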
Research on binary segmentation using color and depth may be split into two camps: feature-level fusion and decision-level fusion. In all approaches based on feature-level fusion, k-means clustering is performed on per-pixel feature vectors consisting of the color components and the spatial position, including the depth. Virtually all decision-level fusion approaches employ a graph-cuts framework, where the depth is integrated into the data term as a statistically independent additional source of information. They typically use classical Bayesian inference. However, a voting scheme has also been proposed to combine the output of three separate classifiers based on background subtraction, color statistics and depth/motion consistency.
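The role of the independence assumption in decision-level fusion can be illustrated with a small sketch. If color and depth are treated as statistically independent given the label, their likelihoods multiply, so their negative log-likelihoods add up in the data term. The Gaussian models, their parameters, the use of a single gray value instead of full color, and all function and key names below are assumptions made for this example. The sketch omits the prior and the smoothness term that a graph-cuts framework would add and simply picks the cheaper label per pixel.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, std):
    """Element-wise log of a 1-D Gaussian density."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def fused_data_term(gray, depth, fg_model, bg_model):
    """Per-pixel negative log-likelihood for foreground and background.

    fg_model / bg_model: dicts with hypothetical keys 'color' and 'depth',
    each holding a (mean, std) pair. Independence of the two cues turns
    the product of likelihoods into a sum of log-likelihoods.
    """
    cost_fg = -(gaussian_log_likelihood(gray, *fg_model["color"])
                + gaussian_log_likelihood(depth, *fg_model["depth"]))
    cost_bg = -(gaussian_log_likelihood(gray, *bg_model["color"])
                + gaussian_log_likelihood(depth, *bg_model["depth"]))
    return cost_fg, cost_bg

# Hypothetical models: overlapping color statistics, well-separated depth.
fg_model = {"color": (130.0, 30.0), "depth": (1.0, 0.3)}
bg_model = {"color": (110.0, 30.0), "depth": (4.0, 0.5)}

# Two pixels with the same ambiguous gray value but different depths.
gray = np.array([120.0, 120.0])
depth = np.array([1.2, 3.8])

cost_fg, cost_bg = fused_data_term(gray, depth, fg_model, bg_model)
labels = cost_fg < cost_bg   # True = foreground
```

In this toy case the gray value 120 is equally likely under both color models, so the decision is driven entirely by the depth term.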
For many video and image editing applications, such as clean background plate creation, background substitution, object tracking, 2D to 3D conversion, and many others, a robust segmentation of the foreground object from the background scene is required. Despite the ongoing research on automatic segmentation, this task is currently performed manually by an operator, who draws the silhouette of the segmentation target. This operation is called rotoscoping in the post-production industry.