Many practical applications rely on the availability of semantic information about the content of media such as images and videos. Semantic information is represented by metadata, which may express the type of scene, the occurrence of a specific action or activity, the presence of a specific object, and so on. Such semantic information can be obtained by analyzing the media.
Semantically segmenting objects from video remains an open challenge, with recent advances relying upon prior knowledge supplied via interactive initialization or correction. Yet fully automatic semantic video object segmentation remains useful in scenarios where a human in the loop is impractical, such as video recognition, summarization, or 3D modelling.
Semantic video object segmentation, which aims to recognize and segment objects in video according to known semantic labels, has recently made considerable progress by incorporating mid- and high-level visual information, such as object detection, which makes it possible to build an explicit semantic notion of video objects. However, these approaches typically fail to capture long-range and high-level context and may therefore introduce significant errors when object appearance changes or objects are occluded.