I. Field of the Invention
The present invention relates to techniques for describing multimedia information, and more specifically, to techniques for describing video, image and audio information, as well as the content of such information. The techniques disclosed are for content-sensitive indexing and classification of digital data signals (e.g., multimedia signals).
II. Description of the Related Art
With the maturation of the global Internet and the widespread employment of regional networks and local networks, digital multimedia information has become increasingly accessible to consumers and businesses. Accordingly, it has become progressively more important to develop systems that process, filter, search and organize digital multimedia information, so that useful information can be culled from this growing mass of raw information.
At the time of filing the instant application, solutions exist that allow consumers and businesses to search for textual information. Indeed, numerous text-based search engines, such as those provided by yahoo.com, goto.com, excite.com and others, are available on the World Wide Web, and are among the most visited Web sites, indicating the significance of the demand for such information retrieval technology.
Unfortunately, the same is not true for multimedia content, as no generally recognized description of this material exists.
The recent proliferation of digital images and video has brought new opportunities to end-users, who now have a large number of resources when searching for content. Visual information is widely available on diverse topics, from many different sources, and in many different formats. This is an advantage, but at the same time a challenge, since users cannot review large quantities of data when searching such content. It is imperative, therefore, to allow users to efficiently browse content or perform queries based on their specific needs. In order to provide such functionalities in a digital library, however, it is essential to understand the data and index it appropriately. This indexing must be structured, and it must be based on how users will want to access such information.
In traditional approaches, textual annotations are used for indexing: a cataloguer manually assigns a set of key words or expressions to describe an image. Users can then perform text-based queries or browse through manually assigned categories. In contrast to text-based approaches, recent techniques in content-based retrieval have focused on indexing images based on their visual content. Users can perform queries by example (e.g., images that look like this one) or by user sketch (e.g., images that look like this sketch). More recent efforts attempt automatic classification of images based on their content: a system classifies each image and assigns it a label (e.g., indoor, outdoor, contains a face, etc.).
In both paradigms there are classification issues which are often overlooked, particularly in the content-based retrieval community. The main difficulty in appropriately indexing visual information can be summarized as follows: (1) there is a large amount of information present in a single image (i.e., what to index?), and (2) different levels of description are possible (i.e., how to index?). Consider, for example, a portrait of a man wearing a suit. It would be possible to label the image with the terms “suit” or “man”. The term “man”, in turn, could carry information at multiple levels: conceptual (e.g., the definition of man in the dictionary), physical (size, weight) and visual (hair color, clothing), among others. A category label, then, implies both explicit information (e.g., the person in the image is a man, not a woman) and implicit or undefined information (e.g., from that term alone it is not possible to know what the man is wearing).
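By way of illustration only, the multiple levels of description carried by a single category label can be sketched as a small data structure. The field names and values below are our own, chosen for exposition, and are not part of any standard or of the description schemes discussed herein:

```python
# Illustrative sketch (not part of any standard): the multiple levels of
# description that a single category label such as "man" can carry.
man = {
    "conceptual": "adult male human",                           # dictionary-style definition
    "physical":   {"size": "medium"},                           # physical level
    "visual":     {"hair_color": "brown", "clothing": "suit"},  # visual level
}

def explicit_levels(label_levels):
    """Return the levels at which the label makes information explicit."""
    return [level for level, value in label_levels.items() if value]
```

A query against such a structure can then distinguish, for example, visual attributes (what the man is wearing) from conceptual ones, which a single flat keyword cannot.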
In this regard, there have been past attempts to provide multimedia databases which permit users to search for pictures using characteristics such as color, texture and shape information of video objects embedded in the picture. However, at the closing of the 20th Century, it is not yet possible to perform a general search of the Internet, or of most regional or local networks, for multimedia content, as no broadly recognized description of this material exists. Moreover, the need to search for multimedia content is not limited to databases, but extends to other applications, such as digital broadcast television and multimedia telephony.
One industry-wide attempt to develop such a standard multimedia description framework has been the Moving Picture Experts Group's (“MPEG”) MPEG-7 standardization effort. Launched in October 1996, MPEG-7 aims to standardize content descriptions of multimedia data in order to facilitate content-focused applications like multimedia searching, filtering, browsing and summarization. A more complete description of the objectives of the MPEG-7 standard is contained in the International Organisation for Standardisation document ISO/IEC JTC1/SC29/WG11 N2460 (October 1998), the content of which is incorporated by reference herein.
The MPEG-7 standard has the objective of specifying a standard set of descriptors as well as structures (referred to as “description schemes”) for the descriptors and their relationships to describe various types of multimedia information. MPEG-7 also proposes to standardize ways to define other descriptors as well as “description schemes” for the descriptors and their relationships. This description, i.e. the combination of descriptors and description schemes, shall be associated with the content itself, to allow fast and efficient searching and filtering for material of a user's interest. MPEG-7 also proposes to standardize a language to specify description schemes, i.e. a Description Definition Language (“DDL”), and the schemes for binary encoding the descriptions of multimedia content.
At the time of filing the instant application, MPEG is soliciting proposals for techniques which will optimally implement the necessary description schemes for future integration into the MPEG-7 standard. In order to provide such optimized description schemes, three different multimedia-application arrangements can be considered: the distributed processing scenario, the content-exchange scenario, and the scenario which permits personalized viewing of multimedia content.
Regarding distributed processing, a description scheme must provide the ability to interchange descriptions of multimedia material independently of any platform, any vendor, and any application, which will enable the distributed processing of multimedia content. The standardization of interoperable content descriptions will mean that data from a variety of sources can be plugged into a variety of distributed applications, such as multimedia processors, editors, retrieval systems, filtering agents, etc. Some of these applications may be provided by third parties, generating a sub-industry of providers of multimedia tools that can work with the standardized descriptions of the multimedia data.
A user should be permitted to access various content providers' web sites to download content and associated indexing data, obtained by some low-level or high-level processing, and then to access several tool providers' web sites to download tools (e.g., Java applets) to manipulate the heterogeneous data descriptions in particular ways, according to the user's personal interests. An example of such a multimedia tool would be a video editor. An MPEG-7-compliant video editor will be able to manipulate and process video content from a variety of sources if the description associated with each video is MPEG-7 compliant. Each video may come with varying degrees of description detail, such as camera motion, scene cuts, annotations, and object segmentations.
A second scenario that will greatly benefit from an interoperable content description standard is the exchange of multimedia content among heterogeneous multimedia databases. MPEG-7 aims to provide the means to express, exchange, translate, and reuse existing descriptions of multimedia material.
Currently, TV broadcasters, radio broadcasters, and other content providers manage and store an enormous amount of multimedia material. This material is currently described manually, using textual information and proprietary databases. Without an interoperable content description, content users must invest manpower in manually translating the descriptions used by each broadcaster into their own proprietary schemes. Interchange of multimedia content descriptions would be possible if all content providers embraced the same content description scheme. This is one of the objectives of MPEG-7.
Finally, multimedia players and viewers that employ the description schemes must provide the users with innovative capabilities such as multiple views of the data configured by the user. The user should be able to change the display's configuration without requiring the data to be downloaded again in a different format from the content broadcaster.
The foregoing examples only hint at the possible uses for richly structured data delivered in a standardized way based on MPEG-7. Unfortunately, no prior art techniques are presently able to generically satisfy the distributed processing, content-exchange, or personalized viewing scenarios. Specifically, the prior art fails to provide a technique for capturing content embedded in multimedia information based on either generic characteristics or semantic relationships, or to provide a technique for organizing such content. Accordingly, there exists a need in the art for efficient content description schemes for generic multimedia information.
During the MPEG Seoul Meeting (March 1999), a Generic Visual Description Scheme (Video Group, “Generic Visual Description Scheme for MPEG-7”, ISO/IEC JTC1/SC29/WG11 MPEG99/N2694, Seoul, Korea, March 1999) was generated following some of the recommendations from the DS1 (still images), DS3++ (multimedia), DS4 (application), and, especially, DS2 (video) teams of the MPEG-7 Evaluation AHG (Lancaster, U.K., February 1999) (AHG on MPEG-7 Evaluation Logistics, “Report of the Ad-hoc Group on MPEG-7 Evaluation Logistics”, ISO/IEC JTC1/SC29/WG11 MPEG99/N4524, Seoul, Korea, March 1999). The Generic Visual DS has evolved in the AHG on Description Schemes to the Generic Audio Visual Description Scheme (“AV DS”) (AHG on Description Scheme, “Generic Audio Visual Description Scheme for MPEG-7 (V0.3)”, ISO/IEC JTC1/SC29/WG11 MPEG99/M4677, Vancouver, Canada, July 1999). The Generic AV DS describes the visual content of video sequences or images and, partially, the content of audio sequences; it does not address multimedia or archive content.
The basic components of the Generic AV DS are the syntactic structure DS, the semantic structure DS, the syntactic-semantic links DS, and the analytic/synthetic model DS. The syntactic structure DS is composed of region trees, segment trees, and segment/region relation graphs. Similarly, the semantic structure DS is composed of object trees, event trees, and object/event relation graphs. The syntactic-semantic links DS provide a mechanism to link the syntactic elements (regions, segments, and segment/region relations) with the semantic elements (objects, events, and event/object relations), and vice versa. The analytic/synthetic model DS specifies the projection/registration/conceptual correspondence between the syntactic and the semantic structure. The semantic and syntactic elements, which we will refer to as content elements in general, have associated attributes. For example, a region is described by color/texture, shape, 2-D geometry, motion, and deformation descriptors. An object is described by type, object-behavior, and semantic annotation DSs.
We have identified possible shortcomings in the current specification of the Generic AV DS. The Generic AV DS includes content elements and entity-relation graphs. The content elements have associated features, and the entity-relation graphs describe general relationships among the content elements. This follows the Entity-Relationship (ER) modeling technique (P. P.-S. Chen, “The Entity-Relationship Model—Toward a Unified View of Data”, ACM Transactions on Database Systems, Vol. 1, No. 1, pp. 9-36, March 1976). The current specification of these elements in the Generic AV DS, however, is too generic to become a useful and powerful tool to describe audio-visual content. The Generic AV DS also includes hierarchies and links between the hierarchies, which is typical of physical hierarchical models. Consequently, the Generic AV DS is a mixture of different conceptual and physical models. Other limitations of this DS may be the rigid separation of the semantic and syntactic structures and the lack of explicit and unified definitions of its content elements.
The Generic AV DS describes images, video sequences, and, partially, audio sequences following the classical approach to book content description: (1) definition of the physical or syntactic structure of the document, the Table of Contents; (2) definition of the semantic structure, the Index; and (3) definition of the locations where semantic notions appear. It consists of (1) the syntactic structure DS; (2) the semantic structure DS; (3) the syntactic-semantic links DS; (4) the analytic/synthetic model DS; (5) the visualization DS; (6) the meta information DS; and (7) the media information DS.
The syntactic DS is used to specify the physical structure and signal properties of an image or a video sequence, defining the table of contents of the document. It consists of (1) the segment DS; (2) the region DS; and (3) the segment/region relation graph DS. The segment DS may be used to define trees of segments that specify the linear temporal structure of the video program. A segment is a group of contiguous frames in a video sequence with associated features: time DS, meta information DS, and media information DS. A special type of segment, the shot, includes an editing effect DS, key frame DS, mosaic DS, and camera motion DS. Similarly, the region DS may be used to define a tree of regions. A region is defined as a group of connected pixels in a video sequence or an image, with associated features: geometry DS, color/texture DS, motion DS, deformation DS, media information DS, and meta information DS. The segment/region relation graph DS specifies general relationships among segments and regions, e.g., spatial relationships such as “To The Left Of”; temporal relationships such as “Sequential To”; and semantic relationships such as “Consist Of”.
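As a concrete but purely illustrative sketch, the segment tree, region tree, and segment/region relation graph just described might be modeled as follows. The class and field names are our own, chosen for exposition, and are not part of the MPEG-7 specification:

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """A group of connected pixels with associated syntactic features (sketch)."""
    name: str
    features: dict = field(default_factory=dict)   # e.g., geometry, color/texture, motion
    children: list = field(default_factory=list)   # sub-regions, forming the region tree

@dataclass
class Segment:
    """A group of contiguous frames with associated temporal features (sketch)."""
    name: str
    start_frame: int
    end_frame: int
    children: list = field(default_factory=list)   # sub-segments, forming the segment tree

# The segment/region relation graph: labeled edges among syntactic elements.
relations = [
    ("region:pitcher", "To The Left Of", "region:batter"),  # spatial relationship
    ("segment:pitch", "Sequential To", "segment:swing"),    # temporal relationship
]

# A two-level segment tree for a hypothetical shot.
shot = Segment("shot1", 0, 120,
               children=[Segment("pitch", 0, 60), Segment("swing", 61, 120)])
```

The tree structures capture the hierarchical table of contents, while the relation list captures the general graph relationships that the trees alone cannot express.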
The semantic DS is used to specify semantic features of an image or a video sequence in terms of semantic objects and events. It can be viewed as a set of indexes. It consists of (1) the event DS; (2) the object DS; and (3) the event/object relation graph DS. The event DS may be used to form trees of events that define a semantic index table for the segments in the segment DS. Events contain an annotation DS. Similarly, the object DS may be used to form trees of objects that define a semantic index table for the regions in the region DS. The event/object relation graph DS specifies general relationships among events and objects.
The syntactic-semantic links DS provides bi-directional links between the syntactic elements (segments, regions, or segment/region relations) and the semantic elements (events, objects, or event/object relations). The analytic/synthetic model DS specifies the projection/registration/conceptual correspondence between the syntactic and semantic structure DSs. The media information DS and the meta information DS contain descriptors of the storage media and the author-generated information, respectively. The visualization DS contains a set of view DSs to enable efficient visualization of a video program. It includes the following views: multi-resolution space-frequency thumbnail, key-frame, highlight, event, and alternate views. Each of these views is independently defined.
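A minimal sketch of such bi-directional links, again with illustrative identifiers of our own devising rather than anything defined by the standard, is a pair of mappings kept in sync:

```python
# Illustrative sketch of bi-directional syntactic-semantic links.
syntactic_to_semantic = {}  # e.g., segment -> events, region -> objects
semantic_to_syntactic = {}  # the reverse direction

def link(syntactic_id, semantic_id):
    """Register a link in both directions so either side can be resolved."""
    syntactic_to_semantic.setdefault(syntactic_id, []).append(semantic_id)
    semantic_to_syntactic.setdefault(semantic_id, []).append(syntactic_id)

# Hypothetical links for a baseball description.
link("segment:batting", "event:batting")
link("region:player", "object:batter")
```

Keeping both directions explicit is what allows, for instance, a query for an event to retrieve the video segments where it occurs, and a selected region to reveal the object it depicts.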
Shortcomings of Generic AV DS
The Generic AV DS includes content elements (i.e. regions, objects, segments, and events), with associated features. It also includes entity-relation graphs to describe general relationships among content elements following the entity-relationship model. A drawback of the current DS is that the features and the relationships among elements can have a broad range of values, which reduces their usefulness and expressive power. A clear example is the semantic annotation feature in the object element. The value of the semantic annotation could be a generic (“Man”), a specific (“John Doe”), or an abstract (“Happiness”) concept.
The initial goal of the development leading to the present invention was to define explicit entity-relationship structures for the Generic AV DS to address this drawback. The explicit entity-relationship structures would categorize the attributes and the relationships into relevant classes. During this process, especially during the generation of concrete examples (see the baseball example shown in FIGS. 6-9), we became aware of other shortcomings of the current Generic AV DS, this time, related to the DS's global design. We shall present these in this section. In this application, we propose complete fundamental entity-relationship models that try to address these issues.
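The kind of categorization we have in mind can be illustrated with a toy sketch in which the single free-form semantic annotation noted above is split into the three concept levels (generic, specific, abstract). The representation is ours, for exposition only, and does not reproduce any structure of the Generic AV DS:

```python
# Illustrative sketch: categorizing one free-form annotation into levels.
annotation = {
    "generic":  "Man",        # class-level concept
    "specific": "John Doe",   # named instance
    "abstract": "Happiness",  # abstract concept
}

def match(level, term):
    """Match a term only at the stated level, rather than against any value."""
    return annotation.get(level) == term
```

With the levels made explicit, a query for the specific person “John Doe” no longer collides with generic or abstract uses of the same annotation field, which is precisely the expressive power the uncategorized feature lacks.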
First, the full specification of the Generic DS could be represented using an entity-relationship model. As an example, the entity-relation models provided in FIGS. 7-9 for the baseball example in FIG. 6, include the functionality addressed by most of the components of the Generic AV DS (e.g. the event DS, the segment DS, the object DS, the region DS, the syntactic-semantic links DS, the segment/region relation graph DS, and the event/object relation graph DS) and more. The entity-relationship (E-R) model is a popular high-level conceptual data model, which is independent of the actual implementation as hierarchical, relational, or object-oriented models, among others. The current version of the Generic DS seems to be a mix of multiple conceptual and implementation data models: the entity-relationship model (e.g. segment/region relation graph), the hierarchical model (e.g. region DS, object DS, and syntactic-semantic links DS), and the object-oriented model (e.g. segment DS, visual segment DS, and audio segment DS).
Second, the separation between syntax and semantics in the current Generic DS is too rigid. For the example in FIG. 6, we have separated the descriptions of the Batting Event and the Batting Segment (see FIG. 7), as the current Generic AV DS proposes. In this case, however, it would have been more convenient to merge both elements into a unique Batting Event with semantic and syntactic features. Many groups working on video indexing have advocated the separation of the syntactic structures (Table of Contents: segments and shots) and the semantic structures (Semantic Indexes: events). In describing images or animated objects in video sequences, however, the value of separating these structures is less clear. “Real objects” are usually described by their semantic features (e.g. semantic class—person, cat, etc.) as well as by their syntactic features (e.g. color, texture, and motion). The current Generic AV DS separates the definition of “real objects” in the region and the object DSs, which may cause inefficient handling of the descriptions.
Finally, the content elements, especially the object and the event, lack explicit and unified definitions in the Generic DS. For example, the current Generic DS defines an object as having some semantic meaning and containing other objects. Although objects are defined in the object DS, event/object relation graphs can describe general relationships among objects and events. Furthermore, objects are linked to corresponding regions in the syntactic DS by the syntactic-semantic links DS. Therefore, the object has a distributed definition across many components of the Generic AV DS, which is less than clear. The definition of an event is similarly distributed and vague.
Entity-Relationship Models for Generic AV DS
The Entity-Relationship (E-R) model, first presented in P. P.-S. Chen, “The Entity-Relationship Model—Toward a Unified View of Data”, ACM Transactions on Database Systems, Vol. 1, No. 1, pp. 9-36, March 1976, describes data in terms of entities and their relationships. Both entities and relationships can be described by attributes. The basic components of the entity-relationship model are shown in FIG. 1. The entity, the entity attribute, the relationship, and the relationship attribute correspond very closely to the noun (e.g., a boy and an apple), the adjective (e.g., young), the verb (e.g., eats), and the verb complement (e.g., slowly), which are essential components for describing general data. “A young boy eats an apple slowly”, which could be the description of a video shot, is represented using an entity-relationship model in FIG. 2. This modeling technique has been used to model the contents of pictures and their features for image retrieval.
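The example of FIG. 2 can also be rendered as a small data sketch; the entity and relationship names are taken directly from the sentence, while the dictionary layout is ours and purely illustrative:

```python
# Entity-relationship sketch of "A young boy eats an apple slowly".
entities = {
    "boy":   {"age": "young"},  # entity (noun) with attribute (adjective)
    "apple": {},                # entity (noun) with no attributes
}

# Relationship (verb) carrying its own attribute (verb complement).
relationships = [
    {"subject": "boy", "verb": "eats", "object": "apple", "manner": "slowly"},
]
```

Note how each grammatical role maps to exactly one E-R component: nouns to entities, adjectives to entity attributes, the verb to a relationship, and the complement to a relationship attribute.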
In this section, we propose fundamental entity-relationship models for the current Generic AV DS to address the shortcomings discussed previously. The fundamental entity-relation models index (1) the attributes of the content elements, (2) the relationships among content elements, and (3) the content elements themselves. These models are depicted in FIG. 5. Our proposal builds on top of the conceptual framework for indexing visual information presented in A. Jaimes and S.-F. Chang, “A Conceptual Framework for Indexing Visual Information at Multiple Levels”, Submitted to Internet Imaging 2000.