The invention relates generally to the field of video processing, and in particular to the characterization of object motion and the interaction between video objects in a video sequence.
Rapid proliferation of multimedia applications presents a growing need for new effective representations of video sequences that allow not only compact storage but also content-based functionalities such as object-oriented editing, navigation, search, and browsing. The huge size and rich content of video data make organization, indexing and management of visual databases for efficient and effective browsing and retrieval an extremely challenging task. Most of the existing technologies in video indexing and retrieval are frame-based. In frame-based indexing, a shot is the basic unit for indexing and retrieval. As the term is used herein, a shot is ordinarily a set of consecutive frames captured by a single operation of a camera, representing a continuous action in time and space. (Accordingly, a story is a collection of semantically related shots.) Global image features such as frame-based color histograms are extracted from frames within a shot to characterize a shot. Alternatively, each shot is characterized by a representative frame and global characteristics of the representative frame (see commonly assigned U.S. patent application Ser. No. 08/902,545, filed Jul. 29, 1997, entitled “A method for content-based temporal segmentation of video” by James Warnick, et al.).
The main shortcoming of this approach is that humans do not usually process video contents in terms of frames or shots. Therefore, a frame-based or shot-based video description and indexing approach is not in agreement with a human's process for interpreting video data. Humans analyze a dynamic scene or video data in terms of the objects of interest. In other words, a scene or imagery data is processed by the human visual system to identify objects of interest and then the scene is characterized in terms of these objects, their spatial and temporal properties and interactions. In order to adopt this human visual system-based approach, an object-oriented video description and indexing approach is required. From a digital image-video processing viewpoint, an object is defined as a meaningful spatial/temporal region of an image or a video sequence.
One approach to object-oriented video description and indexing is to first segment a video sequence into shots and then to represent each shot by a representative or key frame. The next step is to identify objects of interest present in each representative frame, and then to describe and index the video sequence in terms of the identified objects (see P. Alshuth, T. Hermes, L. Voight and O. Herzog, “On Video Retrieval: Content Analysis by Image Miner”, SPIE: Storage and Retrieval for Image and Video Databases, vol. 3312, pp. 236-247, 1998). However, this approach treats a video sequence as a set of still images and thus completely ignores the time-variant or dynamic characteristics of the object. Examples of dynamic characteristics include object motion, variation in an object's shape, and interactions of an object with other objects. An object-oriented video description and indexing system should be able to characterize video data in terms of the time-variant features of the objects. Unlike the description of still images, which consists solely of spatial features such as color, texture, shape and spatial composition, the description of video content relies on temporal features such as object motion, variation of object shape, and interaction between multiple objects as key features.
Some existing approaches have integrated motion into video description and indexing. Netra-V is an object/region-based video indexing system, which employs affine motion representation for each region (see Y. Deng, D. Mukherjee and B. S. Manjunath, “Netra-V: Toward an Object-based Video Representation”, SPIE: Storage and Retrieval for Image and Video Databases, vol. 3312, pp. 202-213, 1998). Motion is the key attribute in Video Q, in which a web interface allows users to specify an arbitrary polygonal trajectory for a query object, thereby allowing objects that have similar motion trajectories to be retrieved (see S.-F. Chang, W. Chen, H. J. Meng, H. Sundaram and D. Zhong, “A Fully Automated Content-based Video Search Engine Supporting Spatiotemporal Queries”, IEEE Trans. Circuits and Systems for Video Tech., vol. 8, pp. 602-615, 1998 and commonly assigned U.S. patent application Ser. No. 09/059,817, filed Apr. 14, 1998 and entitled “A computer program product for generating an index and summary of a video” by Bilge Gunsel, et al.). These approaches are limited in the sense that temporal characterization of an object is simply in terms of its low-level motion characteristics. Other time-variant features such as changes in object shape or object interactions have been completely ignored. Also, high-level or semantic temporal characteristics of the objects are not used for description or indexing of the video sequence.
What is needed is an object-oriented description of video contents in terms of both low-level and semantic-level time-varying characteristics of an object. Each object should be described by its spatial and temporal features, with the object's temporal actions and interactions viewed as primary attributes of the object. Objects should be segmented and tracked within shots, and features related to object motion, actions, and interaction should be extracted and employed for content-based video retrieval and video summary and/or browsing.
An object of this invention is to provide an object-oriented description of video contents in terms of both low-level and semantic-level time-varying characteristics of an object.
Another object is to provide a procedure to develop the object-oriented description of video content.
The present invention is directed to overcoming one or more of the problems set forth above. Briefly summarized, according to one aspect of the present invention, an object-oriented method for describing the content of a video sequence comprises the steps of (a) establishing an object-based segment for an object of interest; (b) describing the object-based segment by describing one or more semantic motions of the object within its object-based segment; and (c) describing the object-based segment by describing one or more semantic interactions of the object with one or more other objects within its object-based segment. The semantic motions of the object may be further described in terms of the properties of elementary coherent motions within the semantic motion. Additionally, the semantic interactions of the object may be further described in terms of the properties of the elementary spatio-temporal relationships among the interacting objects.
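By way of illustration only, the hierarchy summarized above (an object-based segment, its semantic motions and their elementary coherent motions, and its semantic interactions and their elementary spatio-temporal relationships) might be laid out as a simple data structure. The following is a minimal sketch under stated assumptions, not the disclosed implementation: every class name, field name, label (such as "walking" or "approaches"), and the summarize helper is hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ElementaryMotion:
    """One elementary coherent motion; the fields are assumed properties."""
    start_frame: int
    end_frame: int
    direction_deg: float  # hypothetical property: dominant direction
    speed: float          # hypothetical property: average speed


@dataclass
class SemanticMotion:
    """A semantic motion, optionally refined by elementary coherent motions."""
    label: str
    elementary_motions: List[ElementaryMotion] = field(default_factory=list)


@dataclass
class ElementaryRelation:
    """One elementary spatio-temporal relationship among interacting objects."""
    start_frame: int
    end_frame: int
    relation: str  # hypothetical label, e.g. "distance_decreasing"


@dataclass
class SemanticInteraction:
    """A semantic interaction of the object with one or more other objects."""
    label: str
    other_object_ids: List[str]
    elementary_relations: List[ElementaryRelation] = field(default_factory=list)


@dataclass
class ObjectBasedSegment:
    """Step (a): the segment established for one object of interest."""
    object_id: str
    start_frame: int
    end_frame: int
    semantic_motions: List[SemanticMotion] = field(default_factory=list)            # step (b)
    semantic_interactions: List[SemanticInteraction] = field(default_factory=list)  # step (c)


def summarize(segment: ObjectBasedSegment) -> str:
    """Return a one-line textual description of the segment."""
    motions = ", ".join(m.label for m in segment.semantic_motions) or "none"
    interactions = ", ".join(
        f"{i.label} with {'/'.join(i.other_object_ids)}"
        for i in segment.semantic_interactions
    ) or "none"
    return (
        f"{segment.object_id} [frames {segment.start_frame}-{segment.end_frame}]: "
        f"motions={motions}; interactions={interactions}"
    )


# Hypothetical example: a person walks and approaches a car.
segment = ObjectBasedSegment(
    object_id="person_1",
    start_frame=0,
    end_frame=120,
    semantic_motions=[
        SemanticMotion("walking", [
            ElementaryMotion(0, 60, 0.0, 1.5),
            ElementaryMotion(61, 120, 90.0, 1.2),
        ])
    ],
    semantic_interactions=[
        SemanticInteraction("approaches", ["car_1"], [
            ElementaryRelation(0, 120, "distance_decreasing"),
        ])
    ],
)
print(summarize(segment))
```

In this sketch the semantic labels sit at the top level of each description and the elementary properties are nested beneath them, so that a retrieval query could match on the semantic label alone or on the lower-level properties as well.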