Rapidly evolving technologies for acquiring and sharing video data make video analysis an increasingly relevant problem. Segmentation of a video into spatio-temporally consistent regions is a core concern of early vision, with many applications such as summarization, compression and scene understanding. However, it remains a significant challenge. This is partly due to the difficulty of tractably scaling image segmentation approaches to more complex video data, an area where several recent works have made important progress. Another important aspect, namely the development of better features specifically designed for video segmentation and their combination in a principled framework, has not yet been well addressed.
Temporal coherence is the key distinction between videos and static images. Conceptually, the motion field between images is the physical manifestation of temporal coherence, and optical flow is an efficient approximation to that motion field. Not only does optical flow establish a temporal connection between voxels, but a change in motion is also an important indicator of a segmentation boundary. Consequently, many video segmentation methods employ optical flow as a key cue that captures motion information. The graph-based hierarchical (GBH) segmentation method, which performs best among current methods, uses histogram features of color and optical flow.
Video segmentation inherently involves combining different feature channels, the two most evident ones being based on appearance and motion. An effective distance metric between regions combines multiple cues in a way that boosts segmentation performance beyond what any individual cue achieves. Clearly, this distance metric has an important effect on segmentation quality, and its importance increases with the number of feature channels. The framework uses a straightforward multiplicative combination of individual distances with good results.
As undersegmentation error is biased to treat small and large segments differently, the system corrects for this by proposing a normalized undersegmentation error. Our features and their combinations are evaluated over the various metrics, on several different datasets including the large-scale scene data. In each case, we observe that our learned feature combinations that include trajectory cues achieve better segmentation quality than existing systems.
A popular approach to superpixel segmentation of images initially puts each node (pixel) in its own region, with an edge between neighboring regions encoding their dissimilarity. For a region R, its internal variation Int(R) is defined as the heaviest edge weight of its minimum spanning tree. The edges are traversed in non-decreasing order. Regions Ri and Rj linked by an edge of weight wij are merged if there is no evidence of a boundary. A boundary is deemed present if
$$ w_{ij} \geq \min\left\{ \mathrm{Int}(R_i) + \frac{k}{|R_i|},\; \mathrm{Int}(R_j) + \frac{k}{|R_j|} \right\}, \tag{1} $$
where |R| denotes the size of region R and k is a parameter that roughly controls the segment size. Sorting makes the overall complexity O(m log n) for a graph with m edges and n nodes, and the subsequent segmentation is nearly O(m).
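The merging procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `DisjointSet` and `segment_graph` and the `(weight, u, v)` edge format are assumptions made for the example. It relies on the fact that, because edges are visited in non-decreasing order, the edge that merges two regions is the heaviest edge of the merged region's minimum spanning tree.

```python
class DisjointSet:
    """Union-find over graph nodes, tracking |R| and Int(R) per region."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n          # |R| for the region rooted at each node
        self.internal = [0.0] * n    # Int(R); zero for a single-node region

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b, weight):
        # a and b must be distinct roots; attach the smaller region to the larger.
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        # Edges arrive in non-decreasing order, so this edge is the
        # heaviest MST edge of the merged region.
        self.internal[a] = weight
        return a


def segment_graph(num_nodes, edges, k):
    """Return one region label per node.

    edges: iterable of (weight, u, v) tuples (hypothetical input format).
    k:     size parameter from Eq. (1).
    """
    regions = DisjointSet(num_nodes)
    for weight, u, v in sorted(edges):           # non-decreasing weight
        ru, rv = regions.find(u), regions.find(v)
        if ru == rv:
            continue
        threshold = min(
            regions.internal[ru] + k / regions.size[ru],
            regions.internal[rv] + k / regions.size[rv],
        )
        if weight < threshold:                   # Eq. (1) not met: no boundary
            regions.union(ru, rv, weight)
    return [regions.find(i) for i in range(num_nodes)]
```

For a chain of six nodes whose neighboring differences are 1, 1, 48, 1, 1, a small k keeps the heavy edge as a boundary and yields two regions, while a sufficiently large k merges everything into one region.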
The graph-based paradigm is extended to segment videos in a graph-based hierarchical (GBH) framework. At the lowest level, a graph is constructed in which each voxel is a vertex. Iteratively, the graph at each level is partitioned, and the resulting regions are used as vertices to construct the graph at the next higher level (called a region graph). The size parameter k is scaled by a constant factor s>1 for each level higher in the hierarchy.
At the lowest level, the absolute color difference (in RGB space) is used to model the dissimilarity between voxels. For higher levels, histogram-based features encode dissimilarities between regions:
Color Histogram:
This feature captures appearance information. It is defined as the χ²-distance between color histograms (in Lab color space) of two regions. Regions often appear across multiple frames in the video and the color histograms are computed using voxels in all the frames where a region appears.
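As a rough sketch of this feature, the code below pools a region's voxels over all frames into one histogram and compares two such histograms. The text does not spell out which χ² variant is used; the symmetric form 0.5 Σ (p − q)² / (p + q) on normalized histograms is assumed here because it stays within [0,1]. The function names and the pre-quantized bin-index input are hypothetical.

```python
def pooled_histogram(frames, num_bins):
    """Build one histogram from a region's voxels across all its frames.

    frames: list of lists, one per frame, holding the bin index of each
            voxel (assumes colors were already quantized into bins).
    """
    hist = [0] * num_bins
    for frame_bins in frames:
        for b in frame_bins:
            hist[b] += 1
    return hist


def chi2_distance(hist_a, hist_b):
    """Symmetric chi-square distance between two histograms, in [0, 1]."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must be non-empty")
    dist = 0.0
    for a, b in zip(hist_a, hist_b):
        p, q = a / total_a, b / total_b      # normalize to unit mass
        if p + q > 0:                        # skip bins empty in both
            dist += (p - q) ** 2 / (p + q)
    return 0.5 * dist
```

Identical histograms give a distance of 0, and histograms with no overlapping bins give 1.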
Histogram of Optical Flows:
This feature captures motion information. Optical flows are only consistent within the same frame, so a χ²-distance between flow histograms within the same frame is computed. If two regions appear in N frames, their distance is defined as the average of the χ²-distances in the N frames.

While flow histograms capture some motion information, longer range trajectories can provide a stronger cue. However, unlike color and flow, trajectories are not per-pixel entities, so it is not immediately clear how they can be encoded into histogram-based features consistent with the above features. The first contribution of this paper is to do so in a probabilistically meaningful and efficient manner.
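The per-frame averaging used for the flow-histogram feature above can be sketched as follows. This is an illustration under stated assumptions: each region is represented by a hypothetical dictionary mapping a frame index to that region's flow histogram in the frame, N is taken to be the number of frames in which both regions appear, and the χ² variant is the symmetric, normalized form (the text does not specify one).

```python
def chi2_distance(hist_a, hist_b):
    """Symmetric chi-square distance on normalized histograms (assumed form)."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must be non-empty")
    dist = 0.0
    for a, b in zip(hist_a, hist_b):
        p, q = a / total_a, b / total_b
        if p + q > 0:
            dist += (p - q) ** 2 / (p + q)
    return 0.5 * dist


def flow_distance(flow_hists_a, flow_hists_b):
    """Average per-frame chi-square distance over frames shared by both regions.

    flow_hists_a, flow_hists_b: dict of frame index -> flow histogram.
    """
    shared = sorted(set(flow_hists_a) & set(flow_hists_b))
    if not shared:
        # How regions with no common frame are handled is not stated in
        # the text; raising is a choice made for this sketch.
        raise ValueError("regions share no frame")
    per_frame = [
        chi2_distance(flow_hists_a[f], flow_hists_b[f]) for f in shared
    ]
    return sum(per_frame) / len(per_frame)
```

Histograms are only ever compared within the same frame, which mirrors the observation that optical flow is consistent only within a frame.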
To assign a single distance metric between regions, the GBH framework uses an intuitive combination:
$$ d = \left(1 - (1 - d_c)(1 - d_f)\right)^2, \tag{2} $$
where d_c and d_f are the above-mentioned distances based on color and flow histograms. This combination has some desirable properties: for instance, d is normalized within [0,1] and its value is high unless two regions are similar with respect to both cues. However, two important drawbacks are that this combination is not probabilistically meaningful and does not reflect the relative importance of each cue.
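A direct transcription of the combination in Eq. (2) makes its stated properties easy to check numerically. The function name `combined_distance` is chosen for this example only, and the inputs are assumed to be already normalized to [0,1].

```python
def combined_distance(d_color, d_flow):
    """Combine color and flow distances as in Eq. (2).

    Both inputs are assumed to lie in [0, 1]; the result does as well.
    """
    for d in (d_color, d_flow):
        if not 0.0 <= d <= 1.0:
            raise ValueError("distances must lie in [0, 1]")
    # (1 - d_color) * (1 - d_flow) is large only when both cues agree
    # that the regions are similar.
    return (1.0 - (1.0 - d_color) * (1.0 - d_flow)) ** 2
```

The result is 0 only when both cues report zero distance, and it reaches 1 as soon as either cue reports maximal distance. The expression is also symmetric in its two arguments, which illustrates the drawback noted above: neither cue can be given more weight than the other.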