In recent years, techniques such as random forests, multiple instance learning, stacked auto-encoders, and deep neural networks have been combined and applied to image foreground object segmentation, also referred to as image-based salient object detection. Many powerful detection models have been trained on large-scale image datasets, and impressive progress has been made as a result.
Primary video objects are intrinsically related to image salient objects: the foreground object sequence in a video is the salient object in most of its frames. However, the two also differ fundamentally. First, the foreground object sequence is not always an image salient object in every video frame. Second, the temporal consistency between video frames provides an additional cue for separating the foreground object sequence from the background. Third, due to the various motions of the camera and the object, the foreground object sequence may fall on the video frame boundary, invalidating the background prior that is widely used in image salient object detection models.
Segmenting the primary video objects is an important step in many computer vision applications, yet it still faces major challenges. Due to the lack of large-scale training video datasets, it is hard to use machine learning methods to train sufficiently powerful spatiotemporal detection models. In addition, because of camera and subject motion, a single foreground object sequence may appear very differently across video frames, multiple foreground object sequences may appear simultaneously, or occlusion may occur against a cluttered background, making it difficult to highlight the foreground object sequence consistently throughout the whole video.
To address the problem of segmenting the primary video objects, current research offers three types of models: fully automatic segmentation models, interactive segmentation models, and semantics-guided segmentation models.
The interactive segmentation model requires manual labeling of the foreground object sequence in the first video frame or in several key frames, followed by an automatic segmentation process. The semantics-guided segmentation model, in contrast, requires the semantic category of the primary video objects to be specified before segmentation, so that it can segment the primary video objects with the help of an object detector or other tools. In general, both models can achieve good performance by relying on prior knowledge obtained through manual annotation or learned from data. However, the required interaction and semantic labeling make them difficult to extend and apply to large-scale datasets.
The fully automatic segmentation model aims to directly segment the foreground object sequence from a single video, or to separate a foreground object sequence from a set of videos. In general, such models require a definite assumption about the spatial visual attributes or the temporal motion patterns of the primary video objects. For example, Papazoglou et al. proposed at the ICCV conference in 2013 a model assuming that, in most video segments, the foreground object moves differently from its surrounding background. They first obtained a foreground probability map initialized from motion information, and then optimized the result in the spatiotemporal domain to improve the smoothness of the foreground object motion. As another example, Zhang et al. proposed at the CVPR conference in 2013 to segment the primary video objects within a framework of hierarchical directed acyclic graphs, under the assumption that the objects are spatially compact and that their shapes and positions change smoothly over time. Similar assumptions appear in many fully automatic segmentation models, and good performance is achieved on several small datasets (such as SegTrack and SegTrackV2). However, on large datasets with complex scenarios, such as Youtube-Objects and VOS, these assumptions may not hold, and such models sometimes produce failure cases. Moreover, many fully automatic segmentation models require computing optical flow for the video, or solving complex optimization problems iteratively, which significantly increases the computational overhead of segmenting the primary video objects and results in a lower segmentation speed.
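To make the motion-based initialization idea concrete, the following is a minimal sketch (not the actual method of Papazoglou et al.) of how a foreground probability map might be initialized from a dense optical flow field: pixels whose motion deviates strongly from the dominant (background) motion receive high foreground probability. The function name and the use of the per-channel median as a background motion estimate are illustrative assumptions.

```python
import numpy as np

def foreground_probability(flow):
    """Illustrative motion-based foreground probability map.

    flow: (H, W, 2) dense optical flow, assumed to come from an external
    flow estimator. Returns an (H, W) map in [0, 1].
    """
    # Estimate the dominant (background) motion as the per-channel median.
    bg = np.median(flow.reshape(-1, 2), axis=0)
    # Residual motion magnitude relative to the background motion.
    residual = np.linalg.norm(flow - bg, axis=2)
    # Normalize to [0, 1]; guard against fully static scenes.
    peak = residual.max()
    return residual / peak if peak > 0 else np.zeros_like(residual)

# Toy example: static background with one rightward-moving square.
flow = np.zeros((64, 64, 2))
flow[20:40, 20:40] = [3.0, 0.0]  # the "object" moves right
prob = foreground_probability(flow)
```

In a real pipeline this per-frame map would only serve as an initialization; the subsequent spatiotemporal optimization described above is what enforces smoothness across frames.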