Segmentation of video scenes into meaningful region-layers, such that each region-layer represents a grouping of regions or objects that share a number of common spatio-temporal properties, has been and remains a difficult task despite considerable effort. The task becomes even more challenging if the segmentation needs to be performed in real time or faster, with high reliability, at moderate compute complexity, and with good quality, as is necessary in a new generation of critical image processing and computer vision applications such as surveillance, autonomous driving, robotics, and real-time video with high quality/compression.
The state of the art in image/video segmentation is not able to provide good quality segmentation consistently on general video scenes in a compute-efficient manner. Unless a very large amount of compute resources is available, good quality segmentation must still be performed manually or in a semi-automatic manner. This, however, limits its use to non-real-time applications where the cost and time of manual segmentation can be justified. For real-time applications, either a tremendous amount of compute resources has to be provided, alternate mechanisms have to be devised, or poor quality has to be tolerated.
For example, a current technique (see J. Lezama, K. Alahari, J. Sivic, “Track to the Future: Spatio-temporal Video Segmentation with Long Range Motion Cues,” CVPR 2011, IEEE Conference on Computer Vision and Pattern Recognition, pp. 3369-3376, June 2011, Colorado Springs, USA) provides a method for spatio-temporal oversegmentation of video into regions, with the goal that the resulting segmented regions respect object boundaries and, at the same time, associate object pixels over many video frames in time. In the spatio-temporal domain, long range motion cues from past and future frames, in the form of “clusters of point-tracks,” are associated with coherent motion. Furthermore, the approach uses a clustering function that includes reasoning about occlusion based on depth ordering, as well as the notion of motion similarity along the tracks. The proposed approach is thus a motion-based, graph theoretic approach to video segmentation. It relies on long range motion cues (including cues from future frames), on building long range connections, and on tracking. As a result, the approach is compute intensive for high resolution video, and its use of long range motion requires considerable memory and introduces significant delay.
Another approach (see A. Papazoglou, V. Ferrari, “Fast Object Segmentation in Unconstrained Video,” ICCV 2013, International Conference on Computer Vision, pp. 1777-1784, December 2013, Sydney, Australia) is designed for segmentation of each frame into two regions: a foreground region and a background region. The method is fast, automatic, and makes minimal assumptions about the content of the video, and thus handles unconstrained settings including fast moving background, arbitrary object appearance and motion, and non-rigid deformations of objects. It purports to outperform background subtraction techniques and point-cluster tracking type methods while being faster than those approaches. The method does not assume a particular motion model for the objects. In a first stage, it determines which pixels are inside an object based on motion boundaries in pairs of subsequent frames, and refines this initial estimate by integrating information over the whole video. A second stage instantiates an appearance model based on the initial foreground estimate and uses it to refine the spatial accuracy of the segmentation and to segment the object in frames where it does not move. While a detailed description of the approach is outside the scope of this discussion, the key principles and sample results of the approach are illustrated in FIG. 1.
Yet another approach (see D. Zhang, O. Javed, M. Shah, “Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions,” CVPR 2013, IEEE Conference on Computer Vision and Pattern Recognition, pp. 628-635, June 2013, Portland, USA) is object based: it first extracts primary object regions and then uses the primary object segments to build object models for optimized segmentation. The approach includes a layered directed acyclic graph (DAG) based framework for detection and segmentation of the primary object in video. The framework is based on objects that are spatially cohesive and have locally smooth motion trajectories, which allows it to extract the primary object from a set of available proposals based on motion, appearance, and predicted shape across frames. Furthermore, the DAG is initialized with an enhanced object proposal set, in which motion based proposal predictions from adjacent frames are used to expand the set of proposals for a frame. Lastly, the approach presents a motion scoring function for selecting object proposals that emphasizes high optical flow and proposal boundaries to differentiate moving objects from background. The approach is said to outperform both unsupervised and supervised state of the art techniques. While a detailed description of the approach is again outside the scope of this discussion, the key principles and sample results of the approach are illustrated in FIG. 2. Because it uses a layered DAG, optical flow, dynamic programming, and per pixel segmentation, the approach is complex and incurs high delay.
Therefore, current techniques have limitations in that they incur high delay due to operating on a volume of frames, lack flexibility beyond two-layer segmentation (such as for video conferencing scenes), require a-priori knowledge of parameters needed for segmentation, lack sufficient robustness, are not practical general purpose solutions as they require manual interaction, do not provide good quality region boundaries while offering complexity tradeoffs, scale in complexity depending on how many regions are segmented, or suffer from some combination of the aforementioned limitations. Furthermore, none of the techniques discussed provide real-time or faster region segmentation at HD resolution on general purpose, resource limited computing devices while achieving acceptable quality.
As such, existing techniques do not provide fast, real-time segmentation of video scenes. Such problems may become critical as segmentation of video becomes more widespread.