Image segmentation can be used, for example, to identify related areas of an image, such as areas that together form the figure of an object. Video object segmentation, on the other hand, is generally performed to separate one or more foreground objects from the background and to output one or more masks of the foreground objects in each frame of a video stream, for applications such as video analysis, video editing, or video compression. Video object segmentation is generally more difficult than image segmentation due to, for example, the motion of the target objects. Some real-life video scenarios, such as deforming shapes, fast movements, and multiple objects occluding each other, pose significant challenges to video object segmentation. While recent work has tried to address these challenges, performance is still limited in terms of both quality and speed. For example, post-production video editing often requires a significant amount of manual interaction to achieve satisfactory results.
To temporally and spatially smooth estimated object masks, graphical model-based techniques have been proposed. While graphical models enable effective mask propagation across an entire video stream, their results are often sensitive to certain parameters of the models. Recently, deep learning-based techniques have been applied to video object segmentation. The deep learning-based techniques generally predict the segmentation mask frame-by-frame, or incorporate additional cues from a preceding frame using, for example, optical flow, semantic segmentation, or mask propagation. Most deep learning-based video object segmentation techniques are based on semi-supervised learning, where the ground-truth segmentation mask of a reference frame (e.g., the first frame) is used to segment a target object in each subsequent frame. Two example deep learning-based video object segmentation techniques are the one-shot video object segmentation (OSVOS) technique and the MaskTrack technique. Most existing deep learning-based techniques are built on one of these two techniques. The OSVOS technique is generally based on the appearance of the target object in an annotated frame, so it often fails to adapt to appearance changes and has difficulty separating multiple objects with similar appearances. The MaskTrack technique may be vulnerable to temporal discontinuities, such as occlusions and rapid motion, and can suffer from drifting once the propagation becomes unreliable. As a result, some post-processing may be required in order to achieve a desired result.
In addition, most of these approaches rely heavily on online training, where a pre-trained deep network is fine-tuned on the test video. While online training improves segmentation accuracy by letting the network adapt to the appearance of the target object, it is computationally expensive and time-consuming (e.g., it may require several minutes of GPU-powered training for each test video), which limits its practical use.
Furthermore, the annotated video datasets available for training a deep neural network for video object segmentation are very limited. Thus, it is challenging to train such a network with the limited number of available training samples.