A semantic video object represents a meaningful entity in a digital video clip, such as a ball, car, plane, building, cell, eye, lip, hand, head, or body. The term “semantic” in this context means that the viewer of the video clip attaches some semantic meaning to the object. For example, each of the objects listed above represents some real-world entity, and the viewer associates the portions of the screen corresponding to these entities with the meaningful objects that they depict. Semantic video objects can be very useful in a variety of new digital video applications including content-based video communication, multimedia signal processing, digital video libraries, digital movie studios, and computer vision and pattern recognition. In order to use semantic video objects in these applications, object segmentation and tracking methods are needed to identify the objects in each of the video frames.
The process of segmenting a video object refers generally to automated or semi-automated methods for extracting objects of interest in image data. Extracting a semantic video object from a video clip has remained a challenging task for many years. In a typical video clip, the semantic objects may include disconnected components, different colors, and multiple rigid/non-rigid motions. While semantic objects are easy for viewers to discern, the wide variety of shapes, colors and motions of semantic objects makes it difficult to automate this process on a computer. Satisfactory results can be achieved by having the user draw an initial outline of a semantic object in an initial frame, and then using the outline to compute the pixels that are part of the object in that frame. In each successive frame, motion estimation can be used to predict the initial boundary of an object based on the segmented object from the previous frame. This semi-automatic object segmentation and tracking method is described in co-pending U.S. patent application Ser. No. 09/054,280, by Chuang Gu and Ming Chieh Lee, entitled Semantic Video Object Segmentation and Tracking, which is hereby incorporated by reference.
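The idea of using motion estimation to predict an object's boundary from the previous frame's segmentation can be illustrated with a minimal sketch. It assumes a single, purely translational motion found by exhaustive search, which is a simplification chosen for illustration and not the motion model of the referenced application. The names `estimate_translation`, `shift_mask`, and the `radius` parameter are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale intensities
Mask = List[List[int]]   # 1 = object pixel, 0 = background


def estimate_translation(prev: Frame, cur: Frame, mask: Mask,
                         radius: int = 2) -> Tuple[int, int]:
    """Exhaustively search (dy, dx) within +/-radius for the shift that
    minimizes the sum of absolute differences over the object pixels.
    Assumption: the whole object moves by one translation."""
    h, w = len(prev), len(prev[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cost, valid = 0, True
            for y in range(h):
                for x in range(w):
                    if not mask[y][x]:
                        continue
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        valid = False  # shift pushes the object off-frame
                        break
                    cost += abs(prev[y][x] - cur[ny][nx])
                if not valid:
                    break
            if valid and cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best


def shift_mask(mask: Mask, dy: int, dx: int) -> Mask:
    """Translate the previous object mask to predict the current one."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] and 0 <= y + dy < h and 0 <= x + dx < w:
                out[y + dy][x + dx] = 1
    return out


# A 2x2 bright object moves two pixels to the right between frames.
prev_frame = [[0, 0, 0, 0, 0, 0],
              [0, 9, 9, 0, 0, 0],
              [0, 9, 9, 0, 0, 0],
              [0, 0, 0, 0, 0, 0]]
cur_frame = [[0, 0, 0, 0, 0, 0],
             [0, 0, 0, 9, 9, 0],
             [0, 0, 0, 9, 9, 0],
             [0, 0, 0, 0, 0, 0]]
prev_mask = [[1 if v else 0 for v in row] for row in prev_frame]

motion = estimate_translation(prev_frame, cur_frame, prev_mask)
predicted_mask = shift_mask(prev_mask, *motion)
```

In this toy case the predicted mask coincides with the object in the current frame. With real video the prediction is only an initial estimate of the boundary that still has to be refined.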
Object tracking is the process of computing an object's position as it moves from frame to frame. To handle more general semantic video objects, the object tracking method must be able to deal with objects that contain disconnected components and multiple non-rigid motions. While a great deal of research has focused on object tracking, existing methods still do not accurately track objects having multiple components with non-rigid motion.
Some tracking techniques use homogeneous gray scale/color as a criterion to track regions. See F. Meyer and P. Bouthemy, “Region-based tracking in an image sequence”, ECCV'92, pp. 476–484, Santa Margherita, Italy, May 1992; Ph. Salembier, L. Torres, F. Meyer and C. Gu, “Region-based video coding using mathematical morphology”, Proceedings of the IEEE, Vol. 83, No. 6, pp. 843–857, June 1995; F. Marques and Cristina Molina, “Object tracking for content-based functionalities”, VCIP'97, Vol. 3024, No. 1, pp. 190–199, San Jose, February 1997; and C. Toklu, A. Tekalp and A. Erdem, “Simultaneous alpha map generation and 2-D mesh tracking for multimedia applications”, ICIP'97, Vol. 1, pp. 113–116, Santa Barbara, October 1997.
Some employ homogeneous motion information to track moving objects. See, for example, J. Wang and E. Adelson, “Representing moving images with layers”, IEEE Trans. on Image Processing, Vol. 3, No. 5, pp. 625–638, September 1994; and N. Brady and N. O'Connor, “Object detection and tracking using an em-based motion estimation and segmentation framework”, ICIP'96, Vol. 1, pp. 925–928, Lausanne, Switzerland, September 1996.
Others use a combination of spatial and temporal criteria to track objects. See M. J. Black, “Combining intensity and motion for incremental segmentation and tracking over long image sequences”, ECCV'92, pp. 485–493, Santa Margherita, Italy, May 1992; C. Gu, T. Ebrahimi and M. Kunt, “Morphological moving object segmentation and tracking for content-based video coding”, Multimedia communication and Video Coding, pp. 233–240, Plenum Press, New York, 1995; F. Moscheni, F. Dufaux and M. Kunt, “Object tracking based on temporal and spatial information”, in Proc. ICASSP'96, Vol. 4, pp. 1914–1917, Atlanta, Ga., May 1996; and C. Gu and M. C. Lee, “Semantic video object segmentation and tracking using mathematical morphology and perspective motion model”, ICIP'97, Vol. II, pp. 514–517, Santa Barbara, October 1997.
Most of these techniques employ a forward tracking mechanism that projects the previous regions/objects to the current frame and somehow assembles/adjusts the projected regions/objects in the current frame. The major drawback of these forward techniques lies in the difficulty of either assembling/adjusting the projected regions in the current frame or dealing with multiple non-rigid motions. In many of these cases, uncertain holes may appear or the resulting boundaries may become distorted.
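The hole problem described above can be shown with a small sketch. It assumes a labeled region mask and one integer translation per region, which is a deliberately simple stand-in for the motion models used in the cited techniques. The names `forward_project` and `count_holes` are hypothetical, and the hole count here is a crude per-row measure used only for illustration.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # 0 = background, k > 0 = region label


def forward_project(prev: Mask, motion: Dict[int, Tuple[int, int]]) -> Mask:
    """Project each labeled region of the previous frame into the current
    frame using that region's own (dy, dx) translation."""
    h, w = len(prev), len(prev[0])
    cur = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            label = prev[y][x]
            if label == 0:
                continue
            dy, dx = motion.get(label, (0, 0))
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                # Where projections collide, the later write wins.
                cur[ny][nx] = label
    return cur


def count_holes(mask: Mask) -> int:
    """Count unlabeled pixels lying between labeled pixels on the same row."""
    holes = 0
    for row in mask:
        cols = [x for x, v in enumerate(row) if v != 0]
        if cols:
            holes += sum(1 for x in range(cols[0], cols[-1] + 1) if row[x] == 0)
    return holes


# Two adjacent regions of one object; region 2 moves right, region 1 stays.
prev_mask = [
    [0, 1, 1, 2, 2, 0, 0, 0],
    [0, 1, 1, 2, 2, 0, 0, 0],
]
cur_mask = forward_project(prev_mask, {1: (0, 0), 2: (0, 2)})
holes_before = count_holes(prev_mask)
holes_after = count_holes(cur_mask)
```

Because the two regions move differently, the projection leaves a gap of unassigned pixels between them. A forward technique must then decide whether those pixels belong to the object, which is the assembling and adjusting step that the text identifies as difficult.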
FIGS. 1A–C provide simple examples of semantic video objects to show the difficulties associated with object tracking. FIG. 1A shows a semantic video object of a building 100 containing multiple colors 102, 104. Methods that assume that objects have a homogeneous color do not track these types of objects well. FIG. 1B shows the same building object of FIG. 1A, except that it is split into disconnected components 106, 108 by a tree 110 that partially occludes it. Methods that assume that objects are formed of connected groups of pixels do not track these types of disconnected objects well. Finally, FIG. 1C illustrates a simple semantic video object depicting a person 112. Even this simple object has multiple components 114, 116, 118, 120 with different motion. Methods that assume an object has homogeneous motion do not track these types of objects well. In general, a semantic video object may have disconnected components, multiple colors, multiple motions, and arbitrary shapes.
In addition to dealing with all of these attributes of general semantic video objects, a tracking method must also achieve an acceptable level of accuracy to avoid propagating errors from frame to frame. Since object tracking methods typically partition each frame based on a previous frame's partition, errors in the previous frame tend to get propagated to the next frame. Unless the tracking method computes an object's boundary with pixel-wise accuracy, it will likely propagate significant errors to the next frame. As a result, the object boundaries computed for each frame are not precise, and the objects can be lost after several frames of tracking.