The invention relates to semantic video object extraction and tracking.
A video sequence is composed of a series of video frames, where each frame records objects at discrete moments in time. In a digital video sequence, each frame is represented by an array of pixels. When a person views a video frame, it is easy to recognize objects in the video frame, because the person can identify a portion of the video frame as being meaningful to the user. This is called attaching semantic meaning to that portion of the video frame. For example, a ball, an aircraft, a building, a cell, a human body, etc., all represent some meaningful entities in the world. Semantic meaning is defined with respect to the user's context. Although vision seems simple to people, a computer does not know that a certain collection of pixels within a frame depicts a person. To the computer, it is only a collection of pixels. However, a user can identify a part of a video frame based upon some semantic criterion (such as by applying an "is a person" criterion), and thus assign semantic meaning to that part of the frame; such identified data is typically referred to as a semantic video object.
An advantage of breaking video stream frames into one or more semantic objects (segmenting, or content-based encoding) is that, in addition to the compression efficiency inherent in coding only active objects, received data may also be more accurately reconstructed because knowledge of the object characteristics allows better prediction of its appearance in any given frame. Such object tracking and extraction can be very useful in many fields. In broadcasting and telecommunication, video compression is important due to the large bandwidth required for transmitting video data. For example, in a newscast monologue with a speaker in front of a fairly static background, bandwidth requirements may be reduced if one identifies (segments) a speaker within a video frame, removes (extracts) the speaker from the background, and then skips transmitting the background unless it changes.
Using semantic video objects to improve coding efficiency and reduce storage and transmission bandwidth has been investigated in the upcoming international video coding standard MPEG4. (See ISO/IEC JTC1/SC29/WG11, MPEG4 Video Verification Model Version 8.0, July 1997; Lee, et al., "A layered video object coding system using sprite and affine motion model," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 7, No. 1, January 1997.) In the computer domain, web technology has new opportunities involving searching and interacting with meaningful video objects in a still or dynamic scene. To do so, extraction of semantic video objects is very important. In the pattern recognition domain, accurate and robust semantic visual information extraction aids medical imaging, industrial robotics, remote sensing, and military applications. (See Marr, Vision, W. H. Freeman, New York, 1982 (hereafter Marr).)
Although useful, general semantic visual information extraction is difficult. Although human eyes see data that is easily interpreted by our brains as semantic video objects, such identification is a fundamental problem for image analysis. This problem is termed a segmentation problem, where the goal is to aid a computer in distinguishing between different objects within a video frame. Objects are separated from each other using some homogeneity criterion. Homogeneity refers to grouping data according to some similar characteristic. Different definitions of homogeneity can lead to different segmentation results for the same input data. For example, homogeneous segmentation may be based on a combination of motion and texture analysis. The criteria chosen for semantic video object extraction therefore determine the effectiveness of the segmentation process.
During the past two decades, researchers have investigated unsupervised segmentation. Some researchers proposed using homogeneous grayscale or homogeneous color as a criterion for identifying regions. Others suggested using homogeneous motion information to identify moving objects. (See Haralick and Shapiro, "Image segmentation techniques," CVGIP, Vol. 29, pp. 100-132, 1985; C. Gu, "Multi-valued morphology and segmentation-based coding," Ph.D. dissertation, LTS/EPFL, (hereafter Gu Ph.D.), http://Itswww.epfl.-ch/Staff/gu.html, 1995.)
This research in grayscale-oriented analysis can be classified into single-level methods and multi-level approaches. Single-level methods generally use edge-based detection methods, k-nearest neighbor, or estimation algorithms. (See Canny, "A computational approach to edge detection," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 8, pp. 679-698, 1986; Cover and Hart, "Nearest neighbor pattern classification," IEEE Trans. Information Theory, Vol. 13, pp. 21-27, 1967; Chen and Pavlidis, "Image segmentation as an estimation problem," Computer Graphics and Image Processing, Vol. 13, pp. 153-172, 1980.)
Unfortunately, although these techniques work well when the input data is relatively simple, clean, and fits the model well, they lack generality and robustness. To overcome these limitations, researchers focused on multi-level methods such as split and merge, pyramid linking, and morphological methods. (See Burt, et al., "Segmentation and estimation of image region properties through cooperative hierarchical computation," IEEE Trans. on Systems, Man and Cybernetics, Vol. 11, pp. 802-809, 1981.)
These techniques provide better performance than the prior single-level methods, but the results are inadequate because these methods do not properly handle video objects that contain completely different grayscales or colors. An additional drawback is that research in the motion-oriented segmentation domain assumes that a semantic object has homogeneous motion.
Well-known attempts have been made to deal with these problems. These include Hough transformation, multi-resolution region-growing, and relaxation clustering. However, each of these methods is based on optical flow estimation, a technique known to frequently produce inaccurate motion boundaries. In addition, these methods are not suitable for semantic video object extraction because they only employ homogeneous motion information, while a semantic video object can have complex motions inside the object (e.g. rigid-body motion).
In an attempt to overcome these limitations, subsequent research focused on object tracking. This is a class of methods related to semantic video object extraction, premised on estimating an object's current dynamic state based on a previous one, where the trajectory of dynamic states is temporally linked. Different features of an image have been used for tracking frame-to-frame changes, e.g., tracking points, intensity edges, and textures. But these features do not include semantic information about the object being tracked; simply tracking control points or features ignores important information about the nature of the object that can be used to facilitate encoding and decoding compressed data. Notwithstanding significant research in video compression, little of this research considers semantic video object tracking.
Recently, some effort has been invested in the semantic video object extraction problem with tracking. (See Gu Ph.D.; C. Gu, T. Ebrahimi and M. Kunt, "Morphological moving object segmentation and tracking for content-based video coding," International Symposium on Multimedia Communication and Video Coding, New York, 1995, Plenum Press.) This research primarily attempts to segment a dynamic image sequence into regions with homogeneous motions that correspond to real moving objects. A joint spatio-temporal method for representing spatial and temporal relationships between objects in a video sequence was developed using a morphological motion tracking approach. However, this method relies on the estimated optical flow, which, as noted above, generally is not sufficiently accurate. In addition, since a semantic video object can have both moving and non-moving parts, results can be even less precise.
Thus, methods for extracting semantic visual information based on homogeneous color or motion criteria are unsatisfactory, because each homogeneous criterion only deals with a limited set of input configurations, and cannot handle a general semantic video object having multiple colors and multiple motions. Processing such a restricted set of input configurations results in partial solutions for semantic visual information extraction.
One approach to overcome limited input configurations has been to detect shapes through user selected points using an energy formulation. However, a problem with this approach is that positioning the points is an imprecise process. This results in imprecise identification of an image feature (an object within the video frame) of interest.
The invention allows automatic tracking of an object through a video sequence. Initially a user is allowed to roughly identify an outline of the object in a first key frame. This rough outline is then automatically refined to locate the object's actual outline. Motion estimation techniques, such as global and local motion estimation, are used to track the movement of the object through the video sequence. The motion estimation is also applied to the refined boundary to generate a new rough outline in the next video frame, which is then refined for the next video frame. This automatic outline identification and refinement is repeated for subsequent frames.
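The refine-then-propagate cycle described above can be sketched as a simple loop. This is only an illustration of the control flow; the names `track_object`, `refine`, `estimate_motion`, and `warp` are hypothetical placeholders, not functions defined by the invention, and the caller supplies whatever refinement and motion estimation techniques are actually used.

```python
def track_object(frames, rough_outline, refine, estimate_motion, warp):
    """Propagate an outline through a sequence of frames.

    All three callables are hypothetical stand-ins supplied by the caller:
      refine(frame, rough)              -> refined outline for that frame
      estimate_motion(cur, nxt, outline) -> motion description between frames
      warp(outline, motion)             -> rough outline for the next frame
    """
    outlines = []
    rough = rough_outline                  # user-drawn outline in the key frame
    for i, frame in enumerate(frames):
        refined = refine(frame, rough)     # snap the rough outline to the object
        outlines.append(refined)
        if i + 1 < len(frames):
            motion = estimate_motion(frame, frames[i + 1], refined)
            rough = warp(refined, motion)  # new rough outline in the next frame
    return outlines
```

For instance, with toy frames that only record an object's horizontal offset, an identity `refine`, and a `warp` that shifts every point, the outline follows the object from frame to frame.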
Preferably, the user is presented with a graphical user interface showing a frame of video data, and the user identifies, with a mouse, pen, tablet, etc., the rough outline of an object by selecting points around the perimeter of the object. Curve-fitting algorithms can be applied to fill in any gaps in the user-selected points. After this initial segmentation of the object, the unsupervised tracking is performed. During unsupervised tracking, the motion of the object is identified from frame to frame. The system automatically locates similar semantic video objects in the remaining frames of the video sequence, and the identified object boundary is adjusted based on the motion transforms.
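As a minimal sketch of filling gaps between user-selected points, the following joins consecutive points with straight segments and closes the loop back to the first point. Linear interpolation is an assumption chosen for brevity; the text does not name a particular curve-fitting algorithm, and a spline could be substituted. The function name `close_outline` and the `step` parameter are hypothetical.

```python
import math

def close_outline(points, step=1.0):
    """Return a densely sampled closed outline through the selected points.

    points: list of (x, y) positions chosen around the object's perimeter.
    step:   approximate spacing, in pixels, between generated samples.
    """
    outline = []
    n = len(points)
    for i in range(n):
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]  # wrap to close the loop
        dist = math.hypot(x1 - x0, y1 - y0)
        samples = max(1, math.ceil(dist / step))
        for k in range(samples):               # end point is emitted by the next segment
            t = k / samples
            outline.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return outline
```

Four corner points of a 4-pixel square, for example, yield sixteen evenly spaced outline samples.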
Mathematical morphology and global perspective motion estimation/compensation (or an equivalent object tracking system) are used to accomplish these unsupervised steps. Using a set-theoretical methodology for image analysis (i.e. providing a mathematical framework to define image abstraction), mathematical morphology can estimate many features of the geometrical structure in the video data, and aid image segmentation. Instead of simply segmenting an image into square pixel regions unrelated to frame content (i.e. not semantically based), objects are identified according to a semantic basis and their movement is tracked throughout the video frames. This object-based information is encoded into the video data stream, and on the receiving end, the object data is used to re-generate the original data, rather than just blindly reconstruct it from compressed pixel regions. Global motion estimation is used to provide a very complete motion description for scene change from frame to frame, and is employed to track object motion during unsupervised processing. However, other motion tracking methods, e.g. block-based, mesh-based, or parametric motion estimation, and the like, may also be used.
The invention also allows for irregularly shaped objects, while remaining compatible with current compression algorithms. Most video compression algorithms expect to receive a regular array of pixels. This does not correspond well with objects in the real world, as real-world objects are usually irregularly shaped. To allow processing of arbitrarily shaped objects by conventional compression schemes, a user identifies a semantically interesting portion of the video stream (i.e. the object), and this irregularly shaped object is converted into a regular array of pixels before being sent to a compression algorithm.
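One simple way to obtain a regular array from an irregular object is to crop the object's bounding box and fill the pixels that fall outside the object with a constant. The sketch below assumes this bounding-box padding approach, which the text does not specify; the function name `object_to_regular_array` and the `fill` parameter are hypothetical.

```python
def object_to_regular_array(frame, mask, fill=0):
    """Crop an irregular object to a rectangular block of pixels.

    frame: 2-D list of pixel values.
    mask:  2-D list of 0/1 flags of the same size, 1 where the object is.
    Returns (block, (top, left)); pixels outside the object are set to `fill`.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return [], None                      # the mask contains no object
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    block = [[frame[r][c] if mask[r][c] else fill
              for c in range(left, right + 1)]
             for r in range(top, bottom + 1)]
    return block, (top, left)
```

The returned offset lets a decoder place the rectangular block back at the object's position in the frame.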
Thus, a computer can be programmed with software programming instructions for implementing a method of tracking rigid and non-rigid motion of an object across multiple video frames. The object has a perimeter, and initially a user identifies a first boundary approximating this perimeter in a first video frame. A global motion transformation is computed which encodes the movement of the object between the first video frame and a second video frame. The global motion transformation is applied to the first boundary to identify a second boundary approximating the perimeter of the object in the second video frame. By successive application of motion transformations, boundaries for the object can be automatically identified in successive frames.
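Applying a global motion transformation to a boundary can be illustrated with a 3x3 perspective (homography) matrix acting on each boundary point. The sketch assumes the matrix `H` has already been estimated by some motion estimation step, which is not shown; `warp_boundary` is a hypothetical name.

```python
def warp_boundary(boundary, H):
    """Map boundary points from one frame into the next.

    boundary: list of (x, y) points approximating the object's perimeter.
    H:        3x3 perspective matrix (nested lists), assumed already estimated.
    """
    warped = []
    for x, y in boundary:
        w = H[2][0] * x + H[2][1] * y + H[2][2]          # homogeneous scale
        xn = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
        yn = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
        warped.append((xn, yn))
    return warped
```

Calling `warp_boundary` repeatedly, each time with the transformation estimated between the current frame and the next, carries the boundary through successive frames. A pure translation is the special case where `H` is the identity matrix with the offsets in its last column.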
Alternatively, after the user identifies an initial approximate boundary near the border/perimeter of the object, an inner boundary inside the approximate boundary is defined, and an outer boundary outside the approximate boundary is defined. The inner boundary is expanded and the outer boundary contracted so as to identify an outline corresponding to the actual border of the object roughly identified in the first frame. Preferably, expansion and contraction of the boundaries utilizes a morphological watershed computation to classify the object and its actual border.
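The inner and outer boundaries can be illustrated with binary erosion and dilation of the rough object mask. In the sketch below, pixels in the uncertain band between the two boundaries are assigned to the object or the background by comparing each pixel to the mean intensity of the certain regions. This nearest-mean rule is a deliberately simplified stand-in for the morphological watershed computation, not an implementation of it, and all function names and the `margin` parameter are hypothetical.

```python
def _neighbourhood(mask, r, c):
    """Values in the 3x3 neighbourhood of (r, c), clipped at the frame edge."""
    h, w = len(mask), len(mask[0])
    return [mask[rr][cc]
            for rr in range(max(0, r - 1), min(h, r + 2))
            for cc in range(max(0, c - 1), min(w, c + 2))]

def dilate(mask):
    """Grow the mask by one pixel (3x3 structuring element)."""
    return [[int(any(_neighbourhood(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def erode(mask):
    """Shrink the mask by one pixel (3x3 structuring element)."""
    return [[int(all(_neighbourhood(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def refine_boundary(frame, rough_mask, margin=2):
    """Snap a rough mask to the object using inner and outer boundaries."""
    inner, outer = rough_mask, rough_mask
    for _ in range(margin):
        inner, outer = erode(inner), dilate(outer)   # certain object / certain extent
    h, w = len(frame), len(frame[0])
    inside = [frame[r][c] for r in range(h) for c in range(w) if inner[r][c]]
    outside = [frame[r][c] for r in range(h) for c in range(w) if not outer[r][c]]
    if not inside or not outside:
        raise ValueError("margin too large for this outline")
    in_mean = sum(inside) / len(inside)
    out_mean = sum(outside) / len(outside)
    refined = [row[:] for row in inner]
    for r in range(h):
        for c in range(w):
            if outer[r][c] and not inner[r][c]:      # uncertain band only
                closer_to_object = (abs(frame[r][c] - in_mean)
                                    <= abs(frame[r][c] - out_mean))
                refined[r][c] = int(closer_to_object)
    return refined
```

The sketch relies on the rough outline being close enough to the object that the eroded mask lies entirely inside it and the dilated mask entirely covers it; `margin` controls how much imprecision in the rough outline is tolerated.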
A motion transformation function representing the transformation between the object in the first frame and the object in the second frame can be applied to the outline to warp it into a new approximate boundary for the object in the second frame. In subsequent video frames, inner and outer boundaries are defined for the automatically generated new approximate boundary, and then snapped to the object. Note that implementations can provide for setting an error threshold on boundary approximations (e.g. by a pixel-error analysis), allowing an opportunity to re-identify the object's boundary in subsequent frames.
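An error threshold of this kind might be checked as follows. The sketch assumes the pixel-error analysis is a mean absolute difference between the motion-compensated prediction and the actual frame, measured only inside the tracked object; the function name `needs_reidentification`, the error measure, and the default threshold are all assumptions made for illustration.

```python
def needs_reidentification(predicted, actual, mask, threshold=20.0):
    """Decide whether the tracked boundary should be re-identified.

    predicted: 2-D list of pixel values predicted by motion compensation.
    actual:    2-D list of pixel values in the current frame.
    mask:      2-D list of 0/1 flags marking the tracked object.
    Returns True when the mean absolute pixel error inside the object
    exceeds `threshold`, or when the mask contains no object pixels.
    """
    errors = [abs(p - a)
              for prow, arow, mrow in zip(predicted, actual, mask)
              for p, a, m in zip(prow, arow, mrow) if m]
    if not errors:
        return True                      # nothing left to track
    return sum(errors) / len(errors) > threshold
```

When the check returns True, an implementation could prompt the user to outline the object again in that frame.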
The foregoing and other features and advantages will be more readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings.