The term image segmentation denotes the art of automatically partitioning an image into a set of connected segments, where each segment can be individually classified as a single object. When the aim reduces to identifying the closest object, the algorithm is called bilayer segmentation, highlighting the fact that a single foreground object must be segmented from the remaining background scene.
One class of algorithms, denoted as background subtraction, assumes that a foreground object moves in front of a static background and that a video sequence captured by an electro-optical camera is available as input. In this context the color distribution of the background scene can be assumed constant or slowly varying, and the temporal color variation of each pixel provides the local cue for the presence of a foreground. The most important drawback of such an approach is that a static foreground necessarily falls into the background layer.
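The idea can be illustrated with a minimal sketch, assuming a running-average background model and a fixed color-distance threshold. The function names, the learning rate, and the threshold are illustrative assumptions and are not taken from any specific published method. The last lines reproduce the drawback mentioned above: an object that stops moving is gradually absorbed into the background model.

```python
import numpy as np

def update_background(background, frame, rate=0.05):
    """Blend the new frame into the running background estimate (assumed model)."""
    return (1.0 - rate) * background + rate * frame

def foreground_mask(background, frame, threshold=30.0):
    """Flag pixels whose color is farther than `threshold` from the background model."""
    diff = np.linalg.norm(frame.astype(np.float64) - background, axis=-1)
    return diff > threshold

# Synthetic 4x4 RGB scene: uniform gray background, object covering four pixels.
background = np.full((4, 4, 3), 100.0)
frame = background.copy()
frame[1:3, 1:3] = 200.0

# The object is detected when it first appears.
mask_first = foreground_mask(background, frame)

# If the object stays still, the model slowly learns it as background.
for _ in range(100):
    background = update_background(background, frame)
mask_later = foreground_mask(background, frame)
```

Here `mask_first` marks the four object pixels, while `mask_later` is empty because the static object has been learned as background.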
In recent years active depth sensors have become available as off-the-shelf components, and the research community has turned its attention to the simultaneous exploitation of color and depth data within several vision systems. For image segmentation, and even more for bilayer segmentation, the scene depth represents a strong additional cue with several benefits. The distance from the camera is indeed the feature that defines what a foreground is, and therefore the depth data carries the richest information. Furthermore, range sensors are largely independent of the lighting conditions, and the scene depth does not suffer from ambiguous statistical distributions between foreground objects and the background scene. Both are instead typical issues encountered with color data.
However, depth images captured by active sensors typically have a low resolution and are affected by parallax displacement with respect to the corresponding color images, due to the physical distance between the cameras' projection centers. Therefore, for successful joint processing, depth maps and color images need to be registered, which is itself a non-trivial task. For example, it has been proposed to use a super-resolution technique to up-sample the depth map of a time-of-flight camera to the resolution of the main color camera. Passive depth estimation approaches, like stereo matching and structure from motion, do not suffer from these issues, but they have their own difficulties. In order to combine the advantages of both passive and active depth sensors, a sensor setup comprising a time-of-flight camera in conjunction with a stereo camera has been proposed.
Despite the resolution and alignment issues, depth maps can certainly be used to extract a rough initial segmentation in a fully automatic manner. Such a segmentation allows for the construction of a rough trimap, to which alpha matting techniques can be directly applied. However, a precise trimap, tightly aligned with the actual foreground contour, typically leads to much better alpha mattes. As a consequence, although state-of-the-art alpha matting schemes have been extended to exploit the depth information, their results still suffer from the rather broad initial trimap.
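A minimal sketch of how such a rough trimap could be derived from a depth map is given below. It assumes a single depth threshold separating foreground from background and an uncertainty band obtained by eroding and dilating the thresholded mask. The threshold, the band width, and the label values (0, 128, 255) are illustrative assumptions. Invalid depth samples are not handled.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square element, written in plain NumPy."""
    out = mask.copy()
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out

def erode(mask, iterations=1):
    """Binary erosion as the complement of dilating the complement.
    Pixels outside the image are treated as set, so the border alone never erodes."""
    return ~dilate(~mask, iterations)

def depth_trimap(depth, max_fg_depth, band=1):
    """Return 255 for sure foreground, 0 for sure background, 128 for unknown."""
    fg = depth < max_fg_depth              # assumed: foreground is closer than a threshold
    trimap = np.full(depth.shape, 128, dtype=np.uint8)
    trimap[erode(fg, band)] = 255          # shrunken mask: confidently foreground
    trimap[~dilate(fg, band)] = 0          # outside the grown mask: confidently background
    return trimap

# Synthetic depth map in meters: a 3x3 object at 1 m in front of a wall at 3 m.
depth = np.full((7, 7), 3.0)
depth[2:5, 2:5] = 1.0
trimap = depth_trimap(depth, max_fg_depth=2.0, band=1)
```

The width of the unknown band grows with `band`. On low-resolution or misaligned depth maps a wide band is needed to be sure the true contour lies inside it, which is the broad-trimap problem noted above.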
The research activities on binary segmentation using color and depth may be split into two camps, namely feature-level fusion and decision-level fusion. The approaches in the first group are typically based on a k-means clustering framework that extracts the image segments in the feature space, where the feature vectors are constructed as a weighted mixture of the color components, the pixel image location, and the depth data.
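As an illustration of feature-level fusion, the sketch below builds one weighted feature vector per pixel from color, image location, and depth, and clusters the vectors with a basic k-means loop. The weights, the farthest-point initialization, and the function names are assumptions made for this example and do not reproduce any particular published scheme.

```python
import numpy as np

def build_features(color, depth, w_color=1.0, w_space=0.1, w_depth=10.0):
    """Stack weighted color, pixel location and depth into one vector per pixel."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.concatenate(
        [w_color * color.reshape(h * w, -1).astype(np.float64),
         w_space * np.stack([xs.ravel(), ys.ravel()], axis=1),
         w_depth * depth.reshape(h * w, 1)],
        axis=1)

def kmeans(feats, k, iterations=20):
    """Plain Lloyd iterations with a deterministic farthest-point initialization."""
    centers = [feats[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(feats - c, axis=1) for c in centers], axis=0)
        centers.append(feats[d.argmax()])
    centers = np.array(centers, dtype=np.float64)
    for _ in range(iterations):
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return labels

# Synthetic scene: uniform color everywhere, near object on the left, far wall on the right.
color = np.full((4, 6, 3), 120, dtype=np.uint8)
depth = np.full((4, 6), 3.0)
depth[:, :3] = 1.0

labels = kmeans(build_features(color, depth), k=2).reshape(depth.shape)
```

In this example color alone cannot separate the two regions, so the split is driven entirely by the weighted depth component. Choosing the relative weights is the main tuning difficulty of this family of approaches.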
On the other hand, virtually all decision-level fusion approaches employ a graph-cut framework, where the depth is integrated into the data term as a statistically independent additional source of information. They typically rely on classical Bayesian inference. As an alternative, a voting scheme has been employed to combine the outputs of three separate classifiers based on background subtraction, color statistics, and depth/motion consistency.
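Under the independence assumption the joint likelihood of color and depth factorizes, so the data term becomes a sum of negative log-likelihoods. The sketch below shows this with one-dimensional Gaussian models for a grayscale intensity and a depth value. The Gaussian form and all model parameters are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_nll(x, mean, std):
    """Negative log-likelihood of x under a 1-D Gaussian model."""
    return 0.5 * np.log(2.0 * np.pi * std ** 2) + (x - mean) ** 2 / (2.0 * std ** 2)

def data_term(gray, depth, model):
    """Independent cues: the costs of the color and depth observations simply add up."""
    return gaussian_nll(gray, *model["color"]) + gaussian_nll(depth, *model["depth"])

# Hypothetical (mean, std) models: overlapping color statistics, distinct depth statistics.
foreground = {"color": (120.0, 40.0), "depth": (1.0, 0.2)}
background = {"color": (130.0, 40.0), "depth": (3.0, 0.5)}

gray, depth = 125.0, 1.1   # one pixel observation

color_cost_fg = gaussian_nll(gray, *foreground["color"])
color_cost_bg = gaussian_nll(gray, *background["color"])
cost_fg = data_term(gray, depth, foreground)
cost_bg = data_term(gray, depth, background)
```

For this pixel the two color costs are identical, so color alone is ambiguous, while the summed cost clearly favors the foreground label because of the depth observation.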
A remarkable aspect is that almost every graph-based segmentation technique integrates the depth information only in the data term of the objective function. Only in O. Arif et al.: “Visual tracking and segmentation using Time-of-Flight sensor”, 17th IEEE International Conference on Image Processing (2010), pp. 2241-2244, are the smoothness terms also computed as a function of the distance between neighboring pixels in color and space. However, this approach suffers from several drawbacks. The depth measure is simply combined with the pixel coordinates to obtain a 3D space location, but the intrinsic difference in resolution and unit of measure between the depth values and the pixel coordinates is not directly taken into account, which leads to a non-isotropic scaling of the Euclidean space. Furthermore, no theoretical motivation is provided for the actual form of the smoothness terms: the depth is arbitrarily included in both the data term and the smoothness term using heterogeneous functions. Finally, the depth is included asymmetrically, only in the data term of the foreground pixels, whereas the data terms of the background pixels are charged only with a color-based cost.
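The scaling problem can be made concrete with a small sketch. It compares the distance between two neighboring pixels when depth is stacked directly onto the pixel coordinates with the distance obtained after back-projecting both pixels to metric 3D points. The pinhole camera model and the intrinsic parameters (`f`, `cx`, `cy`) are hypothetical and serve only to illustrate the unit mismatch. They are not part of the cited approach.

```python
import numpy as np

def naive_point(u, v, depth):
    """Stack pixel coordinates and depth as if they shared one unit (they do not)."""
    return np.array([u, v, depth], dtype=np.float64)

def backproject(u, v, depth_m, f=500.0, cx=320.0, cy=240.0):
    """Pinhole back-projection to metric 3D coordinates (assumed intrinsics)."""
    return np.array([(u - cx) * depth_m / f, (v - cy) * depth_m / f, depth_m])

# Two horizontally adjacent pixels whose depths differ by 10 cm.
p = (320, 240, 2.0)   # (u, v, depth in meters)
q = (321, 240, 2.1)

# Naive distance with depth expressed in meters, then in millimeters.
naive_m = np.linalg.norm(naive_point(*p) - naive_point(*q))
naive_mm = np.linalg.norm(naive_point(p[0], p[1], p[2] * 1000.0)
                          - naive_point(q[0], q[1], q[2] * 1000.0))

# Metric distance after back-projection, independent of the pixel grid.
metric = np.linalg.norm(backproject(*p) - backproject(*q))
```

The naive distance changes by roughly two orders of magnitude with the choice of depth unit, whereas the back-projected distance stays close to the physical 10 cm separation.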
Many video and image editing applications, such as clean background plate creation, background substitution, object tracking, and 2D to 3D conversion, need a robust segmentation of the foreground object from the background scene. Currently the automatic tools are not sufficiently reliable for large-scale use in real production scenarios, and the task is commonly performed manually by an operator who draws the silhouette of the segmentation target. This operation, called rotoscoping in the post-production industry, is extremely time-consuming and expensive.
Currently the most reliable techniques available in the literature are based on graph cuts. Interestingly, the general trend of this class of algorithms is to embed the depth information only in the data term, whereas it can play a more effective role when embedded into the smoothness term.
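To illustrate what embedding depth into the smoothness term can mean, the sketch below computes a pairwise weight between horizontally neighboring pixels that is high where the depth is continuous and low across a depth discontinuity. The exponential form and the parameter `sigma` are assumptions borrowed from common contrast-sensitive smoothness terms, not a formulation taken from this text.

```python
import numpy as np

def horizontal_smoothness(depth, sigma=0.1):
    """Pairwise weight between each pixel and its right neighbor.
    Close to 1 for similar depths, close to 0 across a depth jump (assumed form)."""
    diff = depth[:, 1:] - depth[:, :-1]
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))

# Two rows with a depth jump from 1 m to 3 m between columns 1 and 2.
depth = np.array([[1.0, 1.0, 3.0, 3.0],
                  [1.0, 1.0, 3.0, 3.0]])
weights = horizontal_smoothness(depth)
```

With such weights a cut across the depth jump is cheap, while cutting through a region of constant depth is expensive, so the segmentation boundary is encouraged to follow depth discontinuities.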
Another issue, which arises in any graph-based segmentation algorithm, is the definition of the most suitable function for the cost terms. In the typical probabilistic framework, the likelihood measures of the color and depth data are used. In this case the problem shifts to the online estimation of the best statistical models for the data. This is not a trivial task, and in some cases it provides only poor discrimination capability, as the statistical distributions of the background and the foreground can overlap significantly. This happens mainly for the color data. If, however, the probabilistic framework is dropped, some empirical function needs to be defined. In this case it is not clear how to compute the function parameters or how general they can be considered.
The computational cost of graph-based segmentation techniques can also be a problem. Segmentation graphs can be naturally extended to three-dimensional graphs when the whole video sequence is processed in a single step, but in this case the overall computational cost becomes very high. Even when images are processed independently, the number of edges scales with the number of pixels, that is, with the square of the linear image resolution, and therefore for HD sequences it is difficult to reach real-time performance.
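The growth of the graph can be checked with a short calculation. The sketch below counts the neighborhood edges of a regular pixel grid, assuming a 4-connected or 8-connected neighborhood, and the two terminal links per pixel used in a typical two-label graph-cut construction. The function names are illustrative.

```python
def grid_edges(height, width, connectivity=4):
    """Number of undirected neighbor edges in a pixel grid graph."""
    edges = height * (width - 1) + width * (height - 1)   # horizontal + vertical links
    if connectivity == 8:
        edges += 2 * (height - 1) * (width - 1)           # two diagonals per 2x2 cell
    return edges

def terminal_links(height, width):
    """Links to the two terminals (foreground and background), one pair per pixel."""
    return 2 * height * width

hd_edges = grid_edges(1080, 1920)          # full HD frame, 4-connected
uhd_edges = grid_edges(2160, 3840)         # linear resolution doubled
growth = uhd_edges / hd_edges              # close to 4: quadratic in linear resolution
```

A single full HD frame already yields several million neighborhood edges, and extending the graph over time for a whole video sequence multiplies this figure by the number of frames plus the temporal links between them.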