The identification of objects in video has various applications in medical imaging, content analysis, the film industry, transport and vehicle control. However, in order for objects to be identified, at least during training of the system, a human operator typically has to label them explicitly. Presently, even automatic object recognition algorithms require hand-labelled visual data in order to be trained. The task of manually labelling objects is tedious, in particular for videos in which a sequence contains a large number of individual images.
Label propagation is a very challenging problem because it requires tracking of object regions which lack “visual identity”. Adjacent video images in a sequence often have a large noise level, making label propagation inherently unstable. Different problems related to labelling and performing segmentation have been discussed in the literature, and solutions for facilitating these tasks have been proposed. One example is the use of an interactive approach whereby a distinct foreground object is precisely extracted from its background. In this approach, the user is closely involved in the refinement of the segmentation of the images. A problem related to label propagation is the colourisation problem: with a few coloured strokes on a greyscale video image, the user specifies how to colourise it in a realistic manner. Whereas colourisation is in widespread use, converting the produced colours into a label map is not straightforward.
Some segmentation-based techniques, used to identify regions to be labelled, address the problem of label propagation more directly. Such systems often rely on motion and colour information to propagate regions through the sequence. Comparison of such techniques is difficult in the absence of labelled ground-truth data.
Other local representations exist for segmentation and robust tracking, such as Maximally Stable Extremal Region (MSER) tracking or Scale-Invariant Feature Transform (SIFT) tracking. However, these are sparse features and, as such, are not sufficient to track entire regions.