With the rapid development and falling cost of smartphones and other digital capture devices, consumer videos are becoming ever more popular, as is evident from the large volume of videos uploaded to YouTube and viewed on the Facebook social network. This large quantity of video also makes it challenging for consumers to organize and retrieve videos. From an information perspective, video content and metadata can be associated with higher-level semantic information such as objects, scenes, activities, events, locations, people, and themes. Segmenting video and extracting content information is therefore an important capability for indexing and organizing large amounts of video data. Object-level segments in a video sequence are semantically meaningful spatiotemporal units, such as a moving person, a static tree with waving leaves, or a flowing river. Unlike two-dimensional image segmentation, the segmented semantic key-segments have to maintain both visual and motion coherence. Furthermore, a consumer video may target only a single moving object of interest against a dynamic, cluttered background, in which case a binary segmentation would better serve the consumer's needs.
In general, video segmentation techniques can be grouped into three categories: 1) spatial-first segmentation; 2) temporal-first segmentation; and 3) joint spatiotemporal segmentation. The first category is the most intuitive approach, as it directly inherits the methods used for static image segmentation. Methods in this category first perform frame-by-frame color/motion segmentation and then apply region matching between successive frames to maintain a degree of visual continuity. The second category may also be called ‘trajectory grouping’ video segmentation. It starts by tracking discrete feature points and extracting their trajectories across all frames. The trajectories belonging to the same moving object are then spatially grouped using individual appearance features. Compared with the first category, which only considers short-term motion between each pair of frames, this category focuses on long-term motion consistency across multiple successive frames. In contrast to the first two categories, the third category, ‘joint spatiotemporal segmentation’, processes all frames together as a spatiotemporal volume in which an object is a three-dimensional tube whose pixels have both location and feature coherence. It defines the grouping criterion in both the spatial and temporal domains, thereby avoiding the spatial/temporal correspondence matching step required by methods in the first two categories.
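The spatiotemporal-volume view of the third category can be illustrated with a minimal sketch. The text does not specify a grouping criterion, so the example below assumes a very simple one: face-adjacent voxels (in x, y, or t) are grouped when their intensities differ by at most a tolerance. The function name `segment_volume` and the parameter `tol` are hypothetical, and plain region growing is used here only as a stand-in for whatever joint criterion a real method would define.

```python
from collections import deque

import numpy as np


def segment_volume(volume, tol=10.0):
    """Group a (T, H, W) video volume into 3D segments by region growing.

    Two face-adjacent voxels (neighbours in t, y, or x) join the same segment
    when their intensities differ by at most `tol` (an assumed criterion).
    Returns an integer label array with the same shape as `volume`.
    """
    volume = np.asarray(volume, dtype=np.float64)
    labels = np.full(volume.shape, -1, dtype=np.int64)  # -1 means unlabeled
    neighbours = ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                  (0, -1, 0), (0, 0, 1), (0, 0, -1))
    next_label = 0
    for seed in np.ndindex(volume.shape):
        if labels[seed] != -1:
            continue
        # Start a new segment and grow it breadth-first in space AND time.
        labels[seed] = next_label
        queue = deque([seed])
        while queue:
            t, y, x = queue.popleft()
            for dt, dy, dx in neighbours:
                n = (t + dt, y + dy, x + dx)
                if not all(0 <= c < s for c, s in zip(n, volume.shape)):
                    continue
                if labels[n] == -1 and abs(volume[n] - volume[t, y, x]) <= tol:
                    labels[n] = next_label
                    queue.append(n)
        next_label += 1
    return labels


# Synthetic clip: a bright 2x2 square drifting one pixel right per frame.
clip = np.zeros((4, 8, 8))
for t in range(4):
    clip[t, 3:5, t + 1:t + 3] = 100.0
tube_labels = segment_volume(clip)
```

On this synthetic clip the moving square receives a single label across all four frames, forming one tube, without any explicit frame-to-frame region matching; the background forms the only other segment.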
Segmenting a video sequence into a number of component regions benefits many higher-level vision-based applications, such as video content retrieval, summarization, and repurposing of video content. However, single target object extraction would often be more desirable given a consumer's needs. In many cases, a consumer video sequence simply aims to capture a single object's movement, such as dancing, skiing, or running, in a specific environment. Multi-region video segmentation and target object extraction are two closely related and mutually beneficial tasks in video processing, and both can be improved when they are solved jointly by passing information from one to the other. In general, moving object detection and extraction for a static video camera is relatively straightforward: because the background barely changes, simple frame differencing is able to extract a moving foreground object. For a more cluttered background with waving trees or passing pedestrians, a pixelwise background model, such as a Gaussian model or a Bayesian model, can learn from previous frames and classify each pixel as either background or target moving object accordingly. With the growing availability of portable camera platforms, a larger percentage of video content is produced by hand-held cameras, which are no longer strictly static. Research into relaxing the static-camera assumption includes camera motion compensation, in which the pixelwise model from the previous frame is adjusted with a homography or a 2D affine transform to maintain its accuracy. However, those methods assume that the background can be approximated as a plane or that the camera motion includes only pan, tilt, or zoom.
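The two static-camera approaches mentioned above, frame differencing and a pixelwise Gaussian background model, can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular prior work: it uses a single Gaussian per pixel on grayscale frames, and the names `frame_difference_mask`, `RunningGaussianBackground`, and `synthetic_frame`, as well as the parameters `threshold`, `alpha`, `k`, `init_var`, and `min_var`, are hypothetical choices made for the example. No camera motion compensation is attempted.

```python
import numpy as np


def frame_difference_mask(prev_frame, frame, threshold=25.0):
    """Flag pixels whose intensity changed by more than `threshold`."""
    diff = np.abs(frame.astype(np.float64) - prev_frame.astype(np.float64))
    return diff > threshold


class RunningGaussianBackground:
    """One Gaussian (mean, variance) per pixel, updated with learning rate `alpha`.

    A pixel is foreground when it lies more than `k` standard deviations
    from its background mean. All parameter defaults are illustrative.
    """

    def __init__(self, first_frame, alpha=0.05, k=2.5,
                 init_var=15.0 ** 2, min_var=4.0):
        # Assumes `first_frame` shows the background without the target object.
        self.mean = first_frame.astype(np.float64)
        self.var = np.full(first_frame.shape, init_var, dtype=np.float64)
        self.alpha, self.k, self.min_var = alpha, k, min_var

    def apply(self, frame):
        """Return a boolean foreground mask, then update the model."""
        frame = frame.astype(np.float64)
        sq_dist = (frame - self.mean) ** 2
        foreground = sq_dist > (self.k ** 2) * self.var
        # Update only background pixels so the moving object is not
        # absorbed into the background statistics.
        bg = ~foreground
        self.mean[bg] += self.alpha * (frame[bg] - self.mean[bg])
        self.var[bg] += self.alpha * (sq_dist[bg] - self.var[bg])
        # Keep the variance from collapsing on perfectly constant pixels.
        np.maximum(self.var, self.min_var, out=self.var)
        return foreground


def synthetic_frame(rng, col=None, size=32):
    """Hypothetical test frame: noisy gray background, optional bright square."""
    frame = 100.0 + rng.normal(0.0, 2.0, (size, size))
    if col is not None:
        frame[10:16, col:col + 6] = 200.0
    return frame


# Learn the background from object-free frames, then classify a new frame.
rng = np.random.default_rng(0)
model = RunningGaussianBackground(synthetic_frame(rng))
for _ in range(10):
    model.apply(synthetic_frame(rng))
mask = model.apply(synthetic_frame(rng, col=8))
```

Frame differencing needs only the previous frame but responds to any change between two frames, whereas the per-pixel model adapts its threshold to how much each pixel normally fluctuates. Both rely on pixel positions staying aligned from frame to frame, which is the assumption that hand-held capture violates.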