Field of Invention
The present invention generally relates to audiovisual data processing. More particularly, this invention relates to the description of synthetic audiovisual content to allow such content to be searched or browsed with ease in digital libraries, Internet web sites and broadcast media.
Description of Related Art
More and more audiovisual information is becoming available from many sources around the world. Various forms of media, such as still pictures, video, graphics, 3D models, audio and speech, may represent such information. In general, audiovisual information plays an important role in our society, be it recorded in such media as film or magnetic tape or originating, in real time, from some audio or visual sensors, be it analogue or, increasingly, digital. While audio and visual information used to be consumed directly by the human being, computational systems are increasingly creating, exchanging, retrieving and re-processing this audiovisual information. Such is the case for image understanding (e.g., surveillance, intelligent vision, smart cameras); media conversion (e.g., speech to text, picture to speech, speech to picture); information retrieval (e.g., quickly and efficiently searching for various types of multimedia documents of interest to the user); and filtering, to receive only those multimedia data items in a stream of audiovisual content that satisfy the user's preferences.
For example, a code in a television program triggers a suitably programmed VCR to record that program, or an image sensor triggers an alarm when a certain visual event happens. Automatic transcoding may be performed based on a string of characters or audible information or a search may be performed in a stream of audio or video data. In all these examples, the audiovisual information has been suitably “encoded” to enable a device or a computer code to take some action.
In the infancy of web-based information communication and access systems, information is routinely transferred, searched, retrieved and processed. Presently, much of the information is represented in text form. This text-based information is accessed using text-based search algorithms. However, as web-based systems and multimedia technology continue to improve, more and more information is becoming available in a form other than text, for instance as images, graphics, speech, animation, video, audio and movies. As the volume of such information is increasing at a rapid rate, it is becoming important to be able to easily search for and retrieve a specific piece of information of interest. It is often difficult to find such information by text-only search. Thus, the increased presence of multimedia information and the need to be able to find the required portions of it in an easy and reliable manner, irrespective of the search engines employed, have spurred the drive for a standard for accessing such information.
The Moving Picture Experts Group (MPEG) is a working group under the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) in charge of the development of international standards for compression, decompression, processing and coded representation of video data, audio data and their combination.
MPEG previously developed the MPEG-1, MPEG-2 and MPEG-4 standards, and is presently developing the MPEG-7 standard, formally called “Multimedia Content Description Interface”, hereby incorporated by reference in its entirety. MPEG-7 will be a content representation standard for multimedia information search and will include techniques for describing individual media content and their combination. Thus, the goal of the MPEG-7 standard is to provide a set of standardized tools to describe multimedia content. Unlike the MPEG-1, MPEG-2 or MPEG-4 standards, the MPEG-7 standard is not a media-content coding or compression standard but rather a standard for representation of descriptions of media content.
The data representing descriptions is called metadata. Thus, irrespective of how the media content is represented, e.g., analogue, PCM, MPEG-1, MPEG-2, MPEG-4, Quicktime, Windows Media, etc., the metadata associated with this content may, in the future, be MPEG-7.
Often, the value of multimedia information depends on how easily it can be found, retrieved, accessed, filtered and managed. Although users have increasing access to this audiovisual information, searching, identifying and managing it efficiently is becoming more difficult because of the sheer volume of the information. Moreover, the question of identifying and managing multimedia content is not restricted to database retrieval applications such as digital libraries, but extends to areas such as broadcast channel selection, multimedia editing and multimedia directory services.
Although known techniques for tagging audiovisual information allow some limited access and processing based on text-based search engines, the amount of information that may be included in such tags is limited. For example, for movie videos, the tag may reflect the name of the movie or a list of actors; however, this information must apply to the entire movie and may not be sub-divided to indicate the content of individual shots and objects in such shots. Moreover, the architecture for searching and processing the information in such tags is severely limited.
Additionally, image, video, speech, audio, graphics and animation are becoming increasingly important components of multimedia information. While image and video are “natural” representations of a scene captured by a scanner or camera, graphics and animation are “synthetic” representations of a scene, often generated using parametric models on a computer. Similarly, while speech and audio are “natural” representations of sound captured using a microphone, text-to-speech and Musical Instrument Digital Interface (MIDI) are “synthetic” representations of sound generated via a model on a computer or computerized synthesizer. Synthetic audiovisual scenes may contain one or more types of synthetic objects, each represented by an underlying model and animation parameters for that model. Such models may typically be two-dimensional (2D) or three-dimensional (3D). For instance, data of a wireframe model in three dimensions and a number of texture maps may be used to represent an F-15 fighter aircraft. This means not only that a realistic rendering of the F-15 is possible using this data, but also that the data is sufficient either to synthesize any view of the F-15 or to interact with it in three dimensions. Due to the popularity of such 3D content on the web, a textual language for representation of 3D models and their animation has been standardized by ISO and is called Virtual Reality Modeling Language (VRML). However, this language is designed at a very low level and, although it is quite useful for synthesis of 3D content, it may not be fully appropriate for describing such content in the ways users might want to query for it.