The ability to segment foreground objects from the background in video images is useful in a number of applications, including video compression, human-computer interaction, and object tracking. Generating such a segmentation in a manner that is both reliable and visually pleasing requires the fusion of spatial and temporal information. This fusion requires that large amounts of information be processed, which imposes a heavy computational cost and/or requires substantial manual interaction. This heavy computational cost unfortunately limits the applicability of such segmentation.
Video matting is a classic inverse problem in computer vision research that involves the extraction, from image sequences, of foreground objects and of the alpha mattes that describe their opacity. Chuang et al. proposed a video matting method based upon Bayesian matting performed on each individual frame. (See, e.g., Y. Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin, and R. Szeliski, “Video Matting of Complex Scenes”, ACM SIGGRAPH 2002, pp. II:243-248, 2002, and Y. Y. Chuang, B. Curless, D. H. Salesin, and R. Szeliski, “A Bayesian Approach To Digital Matting”, CVPR01, pp. II:264-271, 2001). Such methods require accurate user-labeled “trimaps” that segment each image into foreground, background, and unknown regions. It is quite burdensome to periodically provide such trimap labels for long video sequences.
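By way of illustration only, the sense in which matting is an inverse problem can be sketched with the standard compositing relation C = alpha * F + (1 - alpha) * B, where C is the observed pixel color, F the foreground color, B the background color, and alpha the opacity. The short Python sketch below is not taken from any of the cited methods; the function name `composite` and the pixel values are hypothetical and chosen only to show that two different foreground/opacity pairs can yield the same observed pixel.

```python
def composite(alpha, fg, bg):
    """Blend one pixel per channel: C = alpha * F + (1 - alpha) * B."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# Hypothetical RGB values in the range [0, 1], for illustration only.
background = (0.0, 0.0, 0.0)

# A half-transparent bright foreground over a black background ...
observed_a = composite(0.5, (1.0, 0.5, 0.0), background)

# ... and a fully opaque darker foreground over the same background.
observed_b = composite(1.0, (0.5, 0.25, 0.0), background)

# Both explanations produce the same observed color, so alpha and F cannot
# be recovered from C alone without further constraints.
print(observed_a == observed_b)
```

This ambiguity is why additional constraints are needed in the unknown regions, such as the user-labeled trimaps mentioned above.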
Apostoloff and Fitzgibbon presented a matting approach for natural scenes that assumes that the camera capturing the scene is static and that the background is known. (See, e.g., N. Apostoloff and A. W. Fitzgibbon, “Bayesian Video Matting Using Learnt Image Priors”, CVPR04, pp. I:407-414, 2004).
Li et al. used a 3D graph-cut-based segmentation followed by a tracking-based local refinement to obtain a binary segmentation of video objects, and then adopted coherent matting as a prior to produce the alpha matte of the object. (See, e.g., J. Shum, J. Sun, S. Yamazaki, Y. Li and C. Tang, “Pop-Up Light Field: An Interactive Image-Based Modeling and Rendering System”, ACM Transactions on Graphics, 23(2):143-162, 2004). This method, too, suffers from a high computational cost and may need user input to fine-tune the results.
Motion-based segmentation methods perform motion estimation and cluster pixels or color segments into regions of coherent motion. (See, e.g., R. Vidal and R. Hartley, “Motion Segmentation With Missing Data Using Powerfactorization and GPCA”, CVPR04, pp. II:310-316, 2004). Layered approaches represent multiple objects in a scene with a collection of layers. (See, e.g., J. Xiao and M. Shah, “Motion Layer Extraction In the Presence Of Occlusion Using Graph Cuts”, CVPR04, pp. II:972-79, 2004; N. Jojic and B. J. Frey, “Learning Flexible Sprites in Video Layers”, CVPR01, pp. I:255-262, 2001; J. Y. A. Wang and E. H. Adelson, “Representing Moving Images With Layers”, IP, 3(5):625-638, September, 1994). Wang and Ji described a dynamic conditional random field model that combines intensity and motion cues to achieve segmentation. (See, e.g., Y. Wang and Q. Ji, “A Dynamic Conditional Random Field Model For Object Segmentation In Image Sequences”, CVPR05, pp. I:264-270, 2005). Finally, Ke and Kanade described a factorization method that performs rigid layer segmentation in a subspace, which is possible because all of the layers share the same camera motion. (See, e.g., Q. Ke and T. Kanade, “A Subspace Approach To Layer Extraction”, CVPR01, pp. I:255-262, 2001). Unfortunately, many of these methods assume that objects are rigid and/or that the camera is not moving.