Segmentation of video scenes into meaningful region-layers, where each region-layer groups regions or objects that share common spatio-temporal properties, has been and remains a difficult task despite considerable effort. The task becomes even more challenging if the segmentation must be performed in real time or faster, with high reliability, at moderate compute complexity, and with good quality, as required by a new generation of critical image processing and computer vision applications such as surveillance, autonomous driving, robotics, and real-time video with high quality/compression.
The state of the art in image/video segmentation cannot consistently provide good quality segmentation on general video scenes in a compute-effective manner. Unless a very large amount of compute resources is available, good quality segmentation must still be performed manually or semi-automatically. This, however, limits its use to non-real-time applications where the cost and time of manual segmentation can be justified. For real-time applications, either a tremendous amount of compute resources must be provided, alternate mechanisms must be devised, or poor quality must be tolerated.
For example, a current technique (please see J. Guo, J. Kim, C.-C. J. Kuo, “New Video Object Segmentation Technique with Color/Motion Information and Boundary Postprocessing,” Applied Intelligence Journal, March 1999) performs video segmentation based on color/motion for foreground/background segmentation of MPEG-4 video, which supports object-based video coding. It operates in the L*u*v* color space and uses that space's unique properties to derive a color feature: a gradient-based iterative color clustering algorithm, called the mean shift algorithm, segments homogeneous color regions according to the dominant colors. The color feature is then combined with a motion feature. Moving regions are detected by a motion detection method, analyzed by a region-based affine model, and tracked to increase the spatial and temporal consistency of extracted objects. The motion detection is a high-order-statistics-based algorithm, and the motion estimation is done by a region-based affine model on the spatial regions identified by the color-based feature. The process involves determining whether a feature belongs to the foreground or the background. The boundary of regions can be of variable precision to match the bit-rate needs of MPEG-4, because of its implications for coding efficiency. This approach is primarily designed for 2-level segmentation of a scene into a foreground and a background object, and for a constrained class of sequences such as those typically encountered in videoconferencing applications.
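The mean shift color clustering mentioned above can be illustrated with a minimal sketch. It is not the cited authors' implementation: the flat kernel, the `bandwidth` value, the helper names `mean_shift_modes` and `cluster_by_mode`, and the six sample L*u*v* triples are all assumptions chosen for illustration.

```python
import math

def mean_shift_modes(points, bandwidth, max_iter=50, tol=1e-3):
    """Shift each point toward the mean of its neighbors (flat kernel) until it settles."""
    modes = []
    for p in points:
        x = tuple(p)
        for _ in range(max_iter):
            # Samples within one bandwidth of the current estimate.
            near = [q for q in points if math.dist(x, q) <= bandwidth]
            mean = tuple(sum(c) / len(near) for c in zip(*near))
            shift = math.dist(x, mean)
            x = mean
            if shift < tol:
                break
        modes.append(x)
    return modes

def cluster_by_mode(points, bandwidth):
    """Give the same label to points whose modes lie within half a bandwidth of each other."""
    centers, labels = [], []
    for m in mean_shift_modes(points, bandwidth):
        for i, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers

# Hypothetical L*u*v* samples: three reddish pixels followed by three bluish pixels.
pixels = [
    (53.0, 170.0, 35.0), (54.0, 172.0, 37.0), (52.0, 168.0, 36.0),
    (32.0, -9.0, -128.0), (33.0, -10.0, -130.0), (31.0, -8.0, -127.0),
]
labels, centers = cluster_by_mode(pixels, bandwidth=10.0)
print(labels)
```

Each resulting label corresponds to one dominant color. In a full system the labels would be mapped back to pixel positions to form homogeneous color regions before the motion feature is applied.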
Another approach (please see M. Grundmann, V. Kwatra, M. Han, I. Essa, “Efficient Graph Based Video Segmentation,” CVPR 2010, IEEE Conference on Computer Vision and Pattern Recognition, pp. 2141-2148, June 2010, San Francisco, USA) is volumetric and starts by oversegmenting a graph of a volume (chunk) of video into space-time regions, with a grouping based on appearance. Following this, a region graph is created over the obtained segmentation, and the process is repeated iteratively over multiple levels until a tree of spatio-temporal segmentations is generated. It is asserted that the approach generates temporally coherent segmentations with stable region boundaries and allows a choice of different levels of granularity. Furthermore, the segmentation quality may be improved by using dense optical flow to guide the temporal connections within the initial graph. To improve scalability, two variants of the algorithm are provided: an out-of-core parallel algorithm that can process much larger volumes than the in-core algorithm, and an algorithm that temporally segments a video scene into overlapping clips and then segments them successively. The basic algorithm is computationally complex: processing a 40 sec video of CIF to SD resolution takes around 20 min, which implies that 1 sec (30 frames) of video takes 20/40 = ½ min (30 sec); thus the processing time for 1 frame of low to medium resolution is around 1 sec, or 1000 msec. A portion of the method relating to selection and tracking may be semi-automatic, as a postprocessing operation left to the user, while the main segmentation is automatic. The method may be high-delay, as performing segmentation of a volume involves first collecting all frames that make up the volume before processing. FIG. 1 illustrates the key principles of this segmentation approach including (i) an example region graph and (ii) example segmentation with and without optical flow edges and features.
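The initial oversegmentation of a video volume into space-time regions can be sketched as follows. This is a toy illustration under stated assumptions, not the cited implementation. It assumes grayscale intensity difference as the appearance measure, a 6-connected space-time grid, and a Felzenszwalb-Huttenlocher-style merge test with a hypothetical scale parameter `k`. The function name `segment_volume` and the sample volume are likewise made up. The hierarchical region-graph levels and optical flow edges are omitted.

```python
def segment_volume(video, k):
    """Oversegment a small grayscale volume video[t][y][x] into space-time regions."""
    T, H, W = len(video), len(video[0]), len(video[0][0])

    def idx(t, y, x):
        return (t * H + y) * W + x

    # Link each voxel to its right, lower, and next-frame neighbor.
    edges = []
    for t in range(T):
        for y in range(H):
            for x in range(W):
                for dt, dy, dx in ((0, 0, 1), (0, 1, 0), (1, 0, 0)):
                    t2, y2, x2 = t + dt, y + dy, x + dx
                    if t2 < T and y2 < H and x2 < W:
                        w = abs(video[t][y][x] - video[t2][y2][x2])
                        edges.append((w, idx(t, y, x), idx(t2, y2, x2)))
    edges.sort()  # process the most similar voxel pairs first

    n = T * H * W
    parent = list(range(n))
    size = [1] * n
    internal = [0.0] * n  # largest edge weight merged into each region so far

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Merge only if the edge weight does not exceed either region's
        # internal variation plus a size-dependent slack k / size.
        if w <= min(internal[ra] + k / size[ra], internal[rb] + k / size[rb]):
            parent[rb] = ra
            size[ra] += size[rb]
            internal[ra] = max(internal[ra], internal[rb], w)
    return [find(i) for i in range(n)]

# Hypothetical 2-frame, 2x4 volume: dark left half, bright right half, slight noise.
video = [
    [[10, 12, 200, 201],
     [11, 10, 199, 200]],
    [[10, 11, 200, 202],
     [12, 10, 201, 200]],
]
labels = segment_volume(video, k=20)
print(len(set(labels)))
```

Because the edges span frames as well as pixels, each region extends through time. This is also why all frames of the volume must be available before processing begins.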
Yet another approach (please see X. Bai, J. Wang, G. Sapiro, “Dynamic Color Flow: A Motion Adaptive Model for Object Segmentation in Video,” ECCV 2010, European Conference on Computer Vision, pp. 617-630, September 2010, Heraklion, Crete, Greece) for segmentation uses a scalable hierarchical graph-based algorithm. The algorithm may also use modeling of object features, including color and other features. The algorithm goes beyond color models such as the Gaussian mixture model, localized Gaussian mixture models, and pixel-wise adaptive models, as these fail in complicated scenes and lead to incorrect segmentation. The segmentation algorithm introduces a new color model called Dynamic Color Flow that incorporates motion estimation into color modeling and adaptively changes model parameters to match local properties of motion. The proposed model attempts to accurately reflect changes in a scene's appearance caused by motion and may be applied to both background and foreground layers for efficient segmentation of video. The model may provide more accurate foreground and background estimation, allowing video objects to be separated from scenes. FIG. 2 illustrates the key principles of this segmentation approach including (i) example variance adapting to local intensity across an object and (ii) example segmentation.
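The idea of a color model whose parameters adapt to local motion can be illustrated with a 1-D toy. It is only a sketch under loud assumptions and does not reproduce the cited model: the Gaussian spatial weighting, the rule that the spread equals `gain` times the motion magnitude (floored at `min_sigma`), the function name `local_color_model`, and the sample data are all hypothetical.

```python
import math

def local_color_model(colors, flow, x, min_sigma=0.5, gain=1.0):
    """Estimate the color distribution expected at position x in the next frame."""
    weights = {}
    for i, (c, v) in enumerate(zip(colors, flow)):
        # Assumption: fast-moving pixels spread their color over a wider area.
        sigma = max(min_sigma, gain * abs(v))
        d = x - (i + v)  # offset from the pixel's motion-compensated position
        w = math.exp(-0.5 * (d / sigma) ** 2) / sigma
        weights[c] = weights.get(c, 0.0) + w
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Hypothetical row of 8 pixels: a static background followed by a foreground
# object moving 2 pixels to the right per frame.
colors = ["bg"] * 4 + ["fg"] * 4
flow = [0.0] * 4 + [2.0] * 4

near_bg = local_color_model(colors, flow, x=1)
near_fg = local_color_model(colors, flow, x=8)
print(near_bg["bg"] > near_fg["bg"])
```

Static pixels contribute sharply localized color evidence, while moving pixels contribute broader evidence around their displaced positions. That is the sense in which the model parameters follow the local motion.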
Therefore, current approaches have limitations in that they incur high delay due to operating on a volume of frames, lack flexibility beyond 2-layer segmentation (such as for video conferencing scenes), require a-priori knowledge of parameters needed for segmentation, lack sufficient robustness, are not practical general purpose solutions because they require manual interaction, do not provide good quality region boundaries while offering complexity tradeoffs, exhibit complexity that scales with the number of regions segmented, or suffer from some combination of the aforementioned limitations.
As such, existing techniques do not provide fast segmentation of video scenes in real time. Such problems may become critical as segmentation of video becomes more widespread.