The invention relates to analysis of video data and, more specifically, to a method for tracking meaningful entities called semantic objects as they move through a sequence of vector images such as a video sequence.
A semantic video object represents a meaningful entity in a digital video clip, e.g., a ball, car, plane, building, cell, eye, lip, hand, head, body, etc. The term "semantic" in this context means that the viewer of the video clip attaches some semantic meaning to the object. For example, each of the objects listed above represents some real-world entity, and the viewer associates the portions of the screen corresponding to these entities with the meaningful objects that they depict. Semantic video objects can be very useful in a variety of new digital video applications including content-based video communication, multimedia signal processing, digital video libraries, digital movie studios, and computer vision and pattern recognition. In order to use semantic video objects in these applications, object segmentation and tracking methods are needed to identify the objects in each of the video frames.
The process of segmenting a video object refers generally to automated or semi-automated methods for extracting objects of interest in image data. Extracting a semantic video object from a video clip has remained a challenging task for many years. In a typical video clip, the semantic objects may include disconnected components, different colors, and multiple rigid/non-rigid motions. While semantic objects are easy for viewers to discern, the wide variety of shapes, colors and motion of semantic objects makes it difficult to automate this process on a computer. Satisfactory results can be achieved by having the user draw an initial outline of a semantic object in an initial frame, and then using the outline to compute pixels that are part of the object in that frame. In each successive frame, motion estimation can be used to predict the initial boundary of an object based on the segmented object from the previous frame. This semi-automatic object segmentation and tracking method is described in co-pending U.S. patent application Ser. No. 09/054,280, by Chuang Gu and Ming Chieh Lee, entitled "Semantic Video Object Segmentation and Tracking," which is hereby incorporated by reference.
Object tracking is the process of computing an object's position as it moves from frame to frame. In order to deal with more general semantic video objects, the object tracking method must be able to deal with objects that contain disconnected components and multiple non-rigid motions. While a great deal of research has focused on object tracking, existing methods still do not accurately track objects having multiple components with non-rigid motion.
Some tracking techniques use homogeneous gray scale/color as a criterion to track regions. See F. Meyer and P. Bouthemy, "Region-based tracking in an image sequence", ECCV'92, pp. 476-484, Santa Margherita, Italy, May 1992; Ph. Salembier, L. Torres, F. Meyer and C. Gu, "Region-based video coding using mathematical morphology", Proceedings of the IEEE, Vol. 83, No. 6, pp. 843-857, June 1995; F. Marques and Cristina Molina, "Object tracking for content-based functionalities", VCIP'97, Vol. 3024, No. 1, pp. 190-199, San Jose, February 1997; and C. Toklu, A. Tekalp and A. Erdem, "Simultaneous alpha map generation and 2-D mesh tracking for multimedia applications", ICIP'97, Vol. 1, pp. 113-116, October 1997, Santa Barbara.
Some employ homogenous motion information to track moving objects. See, for example, J. Wang and E. Adelson, "Representing moving images with layers", IEEE Trans. on Image Processing, Vol. 3, No. 5, pp. 625-638, September 1994 and N. Brady and N. O'Connor, "Object detection and tracking using an em-based motion estimation and segmentation framework", ICIP'96, Vol. I, pp. 925-928, Lausanne, Switzerland, September 1996.
Others use a combination of spatial and temporal criteria to track objects. See M. J. Black, "Combining intensity and motion for incremental segmentation and tracking over long image sequences", ECCV'92, pp. 485-493, Santa Margherita, Italy, May 1992; C. Gu, T. Ebrahimi and M. Kunt, "Morphological moving object segmentation and tracking for content-based video coding", Multimedia Communication and Video Coding, pp. 233-240, Plenum Press, New York, 1995; F. Moscheni, F. Dufaux and M. Kunt, "Object tracking based on temporal and spatial information", in Proc. ICASSP'96, Vol. 4, pp. 1914-1917, Atlanta, Ga., May 1996; and C. Gu and M. C. Lee, "Semantic video object segmentation and tracking using mathematical morphology and perspective motion model", ICIP'97, Vol. II, pp. 514-517, October 1997, Santa Barbara.
Most of these techniques employ a forward tracking mechanism that projects the previous regions/objects to the current frame and somehow assembles/adjusts the projected regions/objects in the current frame. The major drawback of these forward techniques lies in the difficulty of either assembling/adjusting the projected regions in the current frame or dealing with multiple non-rigid motions. In many of these cases, uncertain holes may appear or the resulting boundaries may become distorted.
FIGS. 1A-C provide simple examples of semantic video objects to show the difficulties associated with object tracking. FIG. 1A shows a semantic video object of a building 100 containing multiple colors 102, 104. Methods that assume that objects have a homogenous color do not track these types of objects well. FIG. 1B shows the same building object of FIG. 1A, except that it is split into disconnected components 106, 108 by a tree 110 that partially occludes it. Methods that assume that objects are formed of connected groups of pixels do not track these types of disconnected objects well. Finally, FIG. 1C illustrates a simple semantic video object depicting a person 112. Even this simple object has multiple components 114, 116, 118, 120 with different motion. Methods that assume an object has homogenous motion do not track these types of objects well. In general, a semantic video object may have disconnected components, multiple colors, multiple motions, and arbitrary shapes.
In addition to dealing with all of these attributes of general semantic video objects, a tracking method must also achieve an acceptable level of accuracy to avoid propagating errors from frame to frame. Since object tracking methods typically partition each frame based on a previous frame's partition, errors in the previous frame tend to get propagated to the next frame. Unless the tracking method computes an object's boundary with pixel-wise accuracy, it will likely propagate significant errors to the next frame. As a result, the object boundaries computed for each frame are not precise, and the objects can be lost after several frames of tracking.
The invention provides a method for tracking semantic objects in vector image sequences. The invention is particularly well suited for tracking semantic video objects in digital video clips, but can also be used for a variety of other vector image sequences. While the method is implemented in software program modules, it can also be implemented in digital hardware logic or in a combination of hardware and software components.
The method tracks semantic objects in an image sequence by segmenting regions from a frame and then projecting the segmented regions into a target frame where a semantic object boundary or boundaries are already known. Each projected region is classified as forming part of a semantic object by determining the extent to which it overlaps with a semantic object in the target frame. For example, in a typical application, the tracking method repeats for each frame, classifying regions by projecting them into the previous frame, in which the semantic object boundaries have already been computed.
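By way of illustration only, the overlap-based classification described above may be sketched as follows. The function name `classify_region`, the purely translational projection, the treatment of off-frame points as background, and the simple majority rule are all assumptions of this sketch; the text above specifies only that classification depends on the extent of overlap with a semantic object in the target frame.

```python
from collections import Counter


def classify_region(region_points, motion, target_labels, background=0):
    """Assign a region to the semantic object it overlaps most in the target frame.

    region_points: iterable of (row, col) points of one segmented region.
    motion: (d_row, d_col) translation that projects the region into the
            target frame (translational motion is an assumption here).
    target_labels: 2-D list holding the known object label of each point
                   in the target frame.
    """
    height, width = len(target_labels), len(target_labels[0])
    d_row, d_col = motion
    votes = Counter()
    for row, col in region_points:
        r, c = row + d_row, col + d_col
        if 0 <= r < height and 0 <= c < width:
            votes[target_labels[r][c]] += 1
        else:
            # Assumption: points projected outside the frame count as background.
            votes[background] += 1
    # The label covering the largest number of projected points wins.
    return votes.most_common(1)[0][0] if votes else background
```

For instance, a region whose projected points fall mostly inside the area labeled as object 1 in the target frame would be assigned to object 1, and a region falling mostly on background would be left unassigned to any object.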
The tracking method assumes that semantic objects are already identified in the initial frame. To get the initial boundaries of a semantic object, a semantic object segmentation method may be used to identify the boundary of the semantic object in an initial frame.
After the initial frame, the tracking method operates on the segmentation results of the previous frame and the current and previous image frames. For each frame in a sequence, a region extractor segments homogenous regions from the frame. A motion estimator then performs region-based matching for each of these regions to identify the most closely matching region of image values in the previous frame. Using the motion parameters derived in this step, the segmented regions are projected into the previous frame where the semantic boundary is already computed. A region classifier then classifies the regions as being part of a semantic object in the current frame based on the extent to which the projected regions overlap semantic objects in the previous frame.
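By way of illustration only, the region-based matching step may be sketched as an exhaustive search over small translations. The function name `estimate_region_motion`, the scalar image values, the translational motion model, the sum-of-absolute-differences error measure, and the `search` range are assumptions of this sketch and are not specified above; vector-valued images would call for a vector distance instead.

```python
def estimate_region_motion(region_points, current, previous, search=2):
    """Find the translation that best maps a region of the current frame
    onto image values in the previous frame.

    region_points: list of (row, col) points of one region in the current frame.
    current, previous: 2-D lists of scalar image values of equal size.
    search: maximum displacement tried in each direction (hypothetical parameter).
    Returns (d_row, d_col), the displacement with the lowest matching error.
    """
    height, width = len(previous), len(previous[0])
    best_motion, best_error = (0, 0), float("inf")
    for d_row in range(-search, search + 1):
        for d_col in range(-search, search + 1):
            error = 0
            inside = True
            for row, col in region_points:
                r, c = row + d_row, col + d_col
                if not (0 <= r < height and 0 <= c < width):
                    inside = False  # skip displacements that leave the frame
                    break
                # Sum of absolute differences over the region's points.
                error += abs(current[row][col] - previous[r][c])
            if inside and error < best_error:
                best_motion, best_error = (d_row, d_col), error
    return best_motion
```

The displacement returned for a region can then be used to project that region's points into the previous frame before the overlap with the previously computed semantic objects is measured.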
The above approach is particularly suited for operating on an ordered sequence of frames. In these types of applications, the segmentation results of the previous frame are used to classify the regions extracted from the next frame. However, it can also be used to track semantic objects between an input frame and any other target frame where the semantic object boundaries are known.
One implementation of the method employs a unique spatial segmentation method. In particular, this spatial segmentation method is a region growing process where image points are added to the region as long as the difference between the minimum and maximum image values for points in the region is below a threshold. This method is implemented as a sequential segmentation method that starts with a first region at one starting point, and sequentially forms regions one after the other using the same test to identify homogenous groups of image points.
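By way of illustration only, the sequential region growing process described above may be sketched as follows. The function name `segment_regions`, the scalar image values, the 4-connected neighborhood, the raster-scan choice of starting points, and the breadth-first growth order are assumptions of this sketch; the text above specifies only the minimum/maximum homogeneity test and the sequential formation of regions.

```python
from collections import deque


def segment_regions(image, threshold):
    """Label regions whose (max - min) image value stays below threshold.

    image: 2-D list of scalar image values.
    Returns a 2-D list of region labels starting at 1.
    """
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]  # 0 means not yet assigned
    next_label = 0
    for start_row in range(height):
        for start_col in range(width):
            if labels[start_row][start_col]:
                continue  # point already belongs to an earlier region
            next_label += 1
            low = high = image[start_row][start_col]
            labels[start_row][start_col] = next_label
            queue = deque([(start_row, start_col)])
            while queue:
                row, col = queue.popleft()
                neighbors = ((row - 1, col), (row + 1, col),
                             (row, col - 1), (row, col + 1))
                for r, c in neighbors:
                    if 0 <= r < height and 0 <= c < width and not labels[r][c]:
                        value = image[r][c]
                        new_low, new_high = min(low, value), max(high, value)
                        # Add the point only if the region stays homogenous.
                        if new_high - new_low < threshold:
                            low, high = new_low, new_high
                            labels[r][c] = next_label
                            queue.append((r, c))
    return labels
```

Because the minimum and maximum of a region can only move apart as points are added, a neighboring point rejected once by a growing region remains rejected by that region and is left available as a starting point for a later region.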
Implementations of the method include other features to improve the accuracy of the tracking method. For example, the tracking method preferably includes region-based preprocessing to remove image errors without blurring object boundaries, and post-processing on the computed semantic object boundaries. The computed boundary of an object is formed from the individual regions that are classified as being associated with the same semantic object in the target frame. In one implementation, a post processor smooths the boundary of a semantic object using a majority operator filter. This filter examines neighboring image points for each point in a frame and determines the semantic object that contains the maximum number of these points. It then assigns the point to the semantic object containing the maximum number of points.
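By way of illustration only, a majority operator filter of the kind described above may be sketched as follows. The function name `majority_filter`, the square window of the given `radius`, and the inclusion of the center point in its own window are assumptions of this sketch; the text above specifies only that each point is assigned to the semantic object containing the maximum number of its neighboring points.

```python
from collections import Counter


def majority_filter(labels, radius=1):
    """Smooth object boundaries by reassigning each point to the most
    common object label in its neighborhood.

    labels: 2-D list of semantic object labels, one per image point.
    radius: half-width of the square window (hypothetical parameter).
    Returns a new 2-D list; the input is not modified.
    """
    height, width = len(labels), len(labels[0])
    smoothed = [row[:] for row in labels]
    for row in range(height):
        for col in range(width):
            votes = Counter()
            # Count labels in the window, clipped at the frame borders.
            for r in range(max(0, row - radius), min(height, row + radius + 1)):
                for c in range(max(0, col - radius), min(width, col + radius + 1)):
                    votes[labels[r][c]] += 1
            smoothed[row][col] = votes.most_common(1)[0][0]
    return smoothed
```

Applied to a label map in which a single point inside an object has been misclassified as background, the filter reassigns that point to the surrounding object, since the object's label dominates the point's neighborhood.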