This invention relates generally to processing multimedia content, and more particularly, to representing and comparing multimedia content.
There exist many standards for encoding and decoding multimedia content. The content can include audio signals in one dimension, images with two dimensions in space, video sequences with a third dimension in time, text, or combinations thereof. Numerous standards exist for audio and text.
For images, the best known standard is JPEG, and for video sequences, the most widely used standards include MPEG-1, MPEG-2 and H.263. These standards are relatively low-level specifications that primarily deal with the spatial compression in the case of images, and spatial and temporal compression for video sequences. As a common feature, these standards perform compression on a frame basis. With these standards, one can achieve high compression ratios for a wide range of applications.
Newer video coding standards, such as MPEG-4, see "Information Technology - Generic coding of audio/visual objects," ISO/IEC FDIS 14496-2 (MPEG4 Visual), November 1998, allow arbitrary-shaped objects to be encoded and decoded as separate video object planes (VOP). This emerging standard is intended to enable multimedia applications, such as interactive video, where natural and synthetic materials are integrated, and where access is universal. For example, one might want to "cut-and-paste" a moving figure or object from one video to another. In this type of scenario, it is assumed that the objects in the multimedia content have been identified through some type of segmentation algorithm, see for example, U.S. patent application Ser. No. 09/326,750 "Method for Ordering Image Spaces to Search for Object Surfaces" filed on Jun. 4, 1999 by Lin et al.
The most recent standardization effort taken on by the MPEG committee is that of MPEG-7, formally called "Multimedia Content Description Interface," see "MPEG-7 Context, Objectives and Technical Roadmap," ISO/IEC N2729, March 1999. Essentially, this standard plans to incorporate a set of descriptors and description schemes that can be used to describe various types of multimedia content. The descriptors and description schemes are associated with the content itself and allow for fast and efficient searching of material that is of interest to a particular user. It is important to note that this standard is not meant to replace previous coding standards; rather, it builds on other standard representations, especially MPEG-4, because the multimedia content can be decomposed into different objects and each object can be assigned a unique set of descriptors. Also, the standard is independent of the format in which the content is stored. MPEG-7 descriptors can be attached to compressed or uncompressed data.
Descriptors for multimedia content can be used in a number of ways, see for example "MPEG-7 Applications," ISO/IEC N2728, March 1999. Most interesting, for the purpose of the description below, are database search and retrieval applications. In a simple application environment, a user may specify some attributes of a particular object. At this low level of representation, these attributes may include descriptors that describe the texture, motion and shape of the particular object. A method of representing and comparing shapes has been described in U.S. patent application Ser. No. 09/326,759 "Method for Ordering Image Spaces to Represent Object Shapes" filed on Jun. 4, 1999 by Lin et al. One of the drawbacks of this type of descriptor is that it is not straightforward to effectively combine this feature of the object with other low-level features. Another problem with such low-level descriptors, in general, is that a high-level interpretation of the object or multimedia content is difficult to obtain. Hence, there is a limitation in the level of representation.
To overcome the drawbacks mentioned above and obtain a higher level of representation, one may consider more elaborate description schemes that combine several low-level descriptors. In fact, these description schemes may even contain other description schemes, see "MPEG-7 Description Schemes (V0.5)," ISO/IEC N2844, July 1999.
As shown in FIG. 1a, a generic description scheme (DS) has been proposed to represent multimedia content. This generic audio-visual DS 100 includes a separate syntactic DS 101, and a separate semantic DS 102. The syntactic structure refers to the physical and logical signal aspects of the content, while the semantic structure refers to the conceptual meaning of the content. For a video sequence, the syntactic elements may be related to the color, shape and motion of a particular object. On the other hand, the semantic elements may refer to information that cannot be extracted from low-level descriptors, such as the time and place of an event or the name of a person in the multimedia content. In addition to the separate syntactic and semantic DSs, a syntactic-semantic relation graph DS 103 has been proposed to link the syntactic and semantic DSs.
The major problem with such a scheme is that the relations and attributes specified by the syntactic and semantic DS are independent, and it is the burden of the relation graph DS to create a coherent and meaningful interpretation of the multimedia content. Furthermore, the DSs mentioned above are either tree-based or graph-based. Tree-based representations provide an efficient means of searching and comparing, but are limited in their expressive ability; the independent syntactic and semantic DSs are tree-based. In contrast, graph-based representations provide a great deal of expressive ability, but are notoriously complex and prone to error for search and comparison.
For the task at hand, it is crucial that a representation scheme is not limited in how multimedia content can be interpreted. The scheme should also provide an efficient means of comparison. From a human perspective, it is possible to interpret multimedia content in many ways; therefore, it is essential that any representation scheme allows multiple interpretations of the multimedia content. Although the independent syntactic and semantic DSs, in conjunction with the relation graph DS, may allow multiple interpretations of multimedia content, comparisons performed on such a representation would not be efficient.
As stated above, it is possible for a DS to contain other DSs. In the same way that the generic DS includes a syntactic DS, a semantic DS, and a syntactic/semantic relation graph DS, it has been proposed that the syntactic DS 101 includes a segment DS 105, a region DS 106, and a segment/region relation graph DS 107. As shown in FIG. 1b, the segment and region DSs may be used to define the temporal and spatial tree structure of multimedia content, respectively, and the segment/region relation graph DS may be used to describe the spatio-temporal relationships between segments and regions. Similarly, as shown in FIG. 1c, the semantic DS 102 includes an event DS 108, an object DS 109, and an event/object relation graph DS 110. The event and object DSs may be used to define event and object trees that define semantic index tables for temporal events and spatial objects, respectively. The event/object relation graph DS may be used to describe any type of spatio-temporal relationship between events and objects. As with the higher-level DSs, namely the semantic and syntactic DSs, these lower-level DSs suffer from the same problems of expressiveness and computational complexity.
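The nesting of description schemes outlined above can be pictured as a small set of record types. The following Python sketch is illustrative only: the class and field names (TreeNode, RelationGraph, count_nodes, and so on) and the toy content are assumptions introduced here and do not come from the MPEG-7 proposal; the reference numerals in the comments follow FIGS. 1a-1c.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    """One node of a tree-based DS (hypothetical name)."""
    label: str
    children: List["TreeNode"] = field(default_factory=list)


@dataclass
class RelationGraph:
    """Graph-based DS stored as (source, relation, target) triples (assumed layout)."""
    edges: List[Tuple[str, str, str]] = field(default_factory=list)


@dataclass
class SyntacticDS:                      # 101
    segment: TreeNode                   # 105, temporal tree
    region: TreeNode                    # 106, spatial tree
    segment_region: RelationGraph       # 107


@dataclass
class SemanticDS:                       # 102
    event: TreeNode                     # 108
    obj: TreeNode                       # 109
    event_object: RelationGraph         # 110


@dataclass
class GenericAudioVisualDS:             # 100
    syntactic: SyntacticDS
    semantic: SemanticDS
    syntactic_semantic: RelationGraph   # 103


def count_nodes(node: TreeNode) -> int:
    """Return the number of nodes in the tree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children)


# Toy content (invented for illustration): two shots, one region, one event, one object.
description = GenericAudioVisualDS(
    syntactic=SyntacticDS(
        segment=TreeNode("video", [TreeNode("shot 1"), TreeNode("shot 2")]),
        region=TreeNode("frame", [TreeNode("region A")]),
        segment_region=RelationGraph([("shot 1", "contains", "region A")]),
    ),
    semantic=SemanticDS(
        event=TreeNode("events", [TreeNode("goal")]),
        obj=TreeNode("objects", [TreeNode("player")]),
        event_object=RelationGraph([("goal", "scored by", "player")]),
    ),
    syntactic_semantic=RelationGraph([("region A", "depicts", "player")]),
)
```

In this layout the four trees know nothing about one another; every cross-reference lives in one of the three relation graphs, which is where the burden of building a coherent interpretation falls.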
Therefore, there is a need for a method of representing syntactic and semantic attributes of multimedia content that balances the complexity of the data structures against the complexity of the methods that operate on those structures.
The present invention provides a new method of representing syntactic and semantic attributes of multimedia content. It is an object of the invention to use existing attributes that may be contained within a semantic or syntactic description scheme, within a framework that balances the restrictions on structure and expressiveness of elements with the computational complexity of operations on those elements.
The method according to the invention is based in part on directed acyclic graphs (DAG). It is well known that the DAG occupies a middle ground between tree-based and graph-based representations. In addition, the DAG provides a new functionality of composition. In other words, many structural compositions of an entity can be described by many structural compositions of its contained elements.
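To make the middle-ground property concrete, the sketch below builds a small graph in which one element has two parents, which a tree cannot express, while any edge that would close a cycle is refused, which a general graph would allow. The class name Dag, its methods, and the entity names are hypothetical and chosen only for illustration; this is not the coding procedure of the invention.

```python
from collections import defaultdict
from typing import Dict, List, Set


class Dag:
    """Directed graph that refuses any edge that would close a cycle."""

    def __init__(self) -> None:
        self.children: Dict[str, Set[str]] = defaultdict(set)

    def _reaches(self, start: str, goal: str) -> bool:
        """Depth-first search: is `goal` reachable from `start`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            # .get() avoids creating empty entries in the defaultdict.
            stack.extend(self.children.get(node, ()))
        return False

    def add_edge(self, parent: str, child: str) -> None:
        """Add parent -> child, rejecting edges that would create a cycle."""
        if self._reaches(child, parent):
            raise ValueError(f"edge {parent!r} -> {child!r} would create a cycle")
        self.children[parent].add(child)

    def parents_of(self, node: str) -> List[str]:
        """Return all direct parents of `node`, sorted by name."""
        return sorted(p for p, kids in self.children.items() if node in kids)


dag = Dag()
dag.add_edge("video", "shot 1")        # syntactic element
dag.add_edge("video", "goal event")    # semantic element
dag.add_edge("shot 1", "player")
dag.add_edge("goal event", "player")   # second parent: not possible in a tree

try:
    dag.add_edge("player", "video")    # would close a cycle
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```

Here "player" is shared by a syntactic parent and a semantic parent, which hints at how a single structure can hold both kinds of element.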
Most importantly though, the similarity between these structural compositions and the structural compositions created by another entity can be easily computed. Within this composition framework, the DAG also provides a means of combining syntactic and semantic elements so that similarity comparisons may seamlessly switch between both types of descriptions. In some sense, this can be viewed as a unification between the syntactic and semantic parts of the description scheme.
The method for representing the semantic and syntactic elements in a unified way also provides a means for unifying the spatial and temporal elements of multimedia content. The invention relies on the fact that the compositions referred to earlier are spatio-temporal compositions that contain both syntactic and semantic elements. The important points to keep in mind are that the compositions according to the invention are DAG representations, which facilitate multiple interpretations and low-complexity comparison, and that the compositions, which define spatio-temporal attributes, both syntactic and semantic, are contained within the respective content entities.
More particularly, the method generates a representation of multimedia content by first segmenting the multimedia content spatially and temporally to extract objects. Feature extraction is applied to the objects to produce semantic and syntactic attributes, relations, and a containment set of content entities. The content entities are coded to produce directed acyclic graphs of the content entities. Nodes of the directed acyclic graphs represent the content entities, and edges represent breaks in the segmentation. Each directed acyclic graph represents a particular interpretation of the multimedia content.
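One way to picture the output of these steps is a record per content entity that carries its attributes, relations, containment set, and one DAG per interpretation. The sketch below is a minimal illustration under stated assumptions: segmentation and feature extraction are taken to have already run, the ContentEntity fields and the add_dag helper are hypothetical names, each DAG is stored as a mapping from an entity name to the names it points to, and the sketch does not model what an edge means. It relies on the standard-library graphlib module (Python 3.9 or later) only to verify acyclicity.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set, Tuple


@dataclass
class ContentEntity:
    """Hypothetical record for one extracted object."""
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)   # syntactic and semantic
    relations: List[Tuple[str, str, str]] = field(default_factory=list)
    containment: List["ContentEntity"] = field(default_factory=list)
    # One DAG per interpretation, over the names of contained entities.
    dags: List[Dict[str, Set[str]]] = field(default_factory=list)

    def add_dag(self, dag: Dict[str, Set[str]]) -> None:
        """Attach one interpretation after checking it is a valid DAG."""
        known = {entity.name for entity in self.containment}
        used = set(dag) | {n for linked in dag.values() for n in linked}
        if not used <= known:
            raise ValueError(f"unknown entities: {sorted(used - known)}")
        try:
            # Consuming static_order() raises CycleError if a cycle exists.
            list(TopologicalSorter(dag).static_order())
        except CycleError as exc:
            raise ValueError("interpretation is not acyclic") from exc
        self.dags.append(dag)


# Toy entities (invented values), standing in for segmentation output.
shot1 = ContentEntity("shot 1", attributes={"duration_s": 4.0})
shot2 = ContentEntity("shot 2", attributes={"duration_s": 6.5})
player = ContentEntity("player", attributes={"role": "striker"})
video = ContentEntity("video", containment=[shot1, shot2, player])

# Two interpretations of the same content, each stored as its own DAG.
video.add_dag({"shot 1": {"shot 2"}})
video.add_dag({"shot 1": {"player"}, "player": {"shot 2"}})
```

Keeping the DAGs inside the entity that contains their nodes mirrors the statement that the compositions are contained within the respective content entities.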
In one aspect, the multimedia content is a two-dimensional image, and in another aspect, the multimedia content is a three-dimensional video sequence.
In a further aspect of the invention, representations for different multimedia contents are compared based on similarity scores obtained for the directed acyclic graphs.
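The text above does not specify how a similarity score is computed, so the following sketch substitutes a deliberately simple stand-in: the Jaccard index over the edge sets of two DAGs, with the best score taken over all pairs of interpretations. Both the scoring rule and the names edge_set, jaccard_similarity and best_match are assumptions made for illustration and should not be read as the scoring used by the invention.

```python
from typing import Dict, List, Set, Tuple

Dag = Dict[str, Set[str]]  # node name -> names of the nodes it points to


def edge_set(dag: Dag) -> Set[Tuple[str, str]]:
    """Flatten a DAG into a set of (source, target) pairs."""
    return {(src, dst) for src, targets in dag.items() for dst in targets}


def jaccard_similarity(a: Dag, b: Dag) -> float:
    """Stand-in score: shared edges divided by all edges (1.0 if both are empty)."""
    edges_a, edges_b = edge_set(a), edge_set(b)
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


def best_match(query: List[Dag], candidate: List[Dag]) -> float:
    """Compare every pair of interpretations and keep the highest score."""
    return max(
        (jaccard_similarity(q, c) for q in query for c in candidate),
        default=0.0,
    )


# Toy interpretations (invented): two contents that share one edge.
content_a = [{"shot 1": {"shot 2"}, "shot 2": {"shot 3"}}]
content_b = [{"shot 1": {"shot 2"}, "shot 2": {"credits"}}]

score = best_match(content_a, content_b)   # 1 shared edge out of 3 distinct edges
```

Taking the maximum over interpretation pairs is one plausible way to let two contents match whenever any of their interpretations agree; other aggregation rules would fit the same interface.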