1. Field of Invention
The present invention generally relates to audiovisual data processing. More particularly, this invention relates to the description of synthetic audiovisual content to allow such content to be searched or browsed with ease in digital libraries, Internet web sites and broadcast media.
2. Description of Related Art
More and more audiovisual information is becoming available from many sources around the world. Various forms of media, such as still pictures, video, graphics, 3D models, audio and speech, may represent such information. In general, audiovisual information plays an important role in our society, whether it is recorded in such media as film or magnetic tape or originates, in real time, from audio or visual sensors, and whether it is analogue or, increasingly, digital. While audio and visual information used to be consumed directly by human beings, computational systems are increasingly creating, exchanging, retrieving and re-processing this audiovisual information. Such is the case for image understanding, e.g., surveillance, intelligent vision, smart cameras, etc.; media conversion, e.g., speech to text, picture to speech, speech to picture, etc.; information retrieval, e.g., quickly and efficiently searching for various types of multimedia documents of interest to the user; and filtering, e.g., receiving only those multimedia data items that satisfy the user's preferences in a stream of audiovisual content.
For example, a code in a television program triggers a suitably programmed VCR to record that program, or an image sensor triggers an alarm when a certain visual event happens. Automatic transcoding may be performed based on a string of characters or audible information, or a search may be performed in a stream of audio or video data. In all these examples, the audiovisual information has been suitably "encoded" to enable a device or a computer code to take some action.
Since the infancy of web-based information communication and access systems, information has been routinely transferred, searched, retrieved and processed. Presently, much of that information is represented in text form, and this text-based information is accessed using text-based search algorithms. However, as web-based systems and multimedia technology continue to improve, more and more information is becoming available in forms other than text, for instance as images, graphics, speech, animation, video, audio and movies. As the volume of such information increases at a rapid rate, it is becoming important to be able to search for and retrieve a specific piece of information of interest easily. Such information is often difficult to find by text-only search. Thus, the increased presence of multimedia information, and the need to find the required portions of it in an easy and reliable manner irrespective of the search engine employed, have spurred the drive for a standard for accessing such information.
The Moving Picture Experts Group (MPEG) is a working group under the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) in charge of the development of international standards for compression, decompression, processing and coded representation of video data, audio data and their combination. MPEG previously developed the MPEG-1, MPEG-2 and MPEG-4 standards, and is presently developing the MPEG-7 standard, formally called "Multimedia Content Description Interface", hereby incorporated by reference in its entirety. MPEG-7 will be a content representation standard for multimedia information search and will include techniques for describing individual media content and their combination. The goal of the MPEG-7 standard is thus to provide a set of standardized tools to describe multimedia content. Unlike the MPEG-1, MPEG-2 or MPEG-4 standards, the MPEG-7 standard is therefore not a media-content coding or compression standard but rather a standard for the representation of descriptions of media content.
The data representing such descriptions is called meta data. Thus, irrespective of how the media content itself is represented, e.g., analogue, PCM, MPEG-1, MPEG-2, MPEG-4, QuickTime, Windows Media, etc., the meta data associated with this content may, in the future, be MPEG-7.
Often, the value of multimedia information depends on how easily it can be found, retrieved, accessed, filtered and managed. In spite of the fact that users have increasing access to this audiovisual information, searching for, identifying and managing it efficiently is becoming more difficult because of the sheer volume of the information. Moreover, the question of identifying and managing multimedia content is not restricted to database retrieval applications such as digital libraries, but extends to areas such as broadcast channel selection, multimedia editing and multimedia directory services. Although known techniques for tagging audiovisual information allow some limited access and processing based on text-based search engines, the amount of information that may be included in such tags is limited. For example, for movie videos, the tag may reflect the name of the movie or a list of actors; however, this information applies to the entire movie and cannot be sub-divided to indicate the content of individual shots and the objects in such shots. Moreover, both the amount of information that may be included in such tags and the architecture for searching and processing that information are severely limited.
Additionally, image, video, speech, audio, graphics and animation are becoming increasingly important components of multimedia information. While image and video are "natural" representations of a scene captured by a scanner or camera, graphics and animation are "synthetic" representations of a scene, often generated using parametric models on a computer. Similarly, while speech and audio are natural representations of "natural" sound captured using a microphone, text-to-speech and Musical Instrument Digital Interface (MIDI) data are representations of "synthetic" sound generated via a model on a computer or a computerized synthesizer. Synthetic audiovisual scenes may contain one or more types of synthetic objects, each represented by an underlying model and animation parameters for that model. Such models are typically two-dimensional (2d) or three-dimensional (3d). For instance, the data of a three-dimensional wireframe model and a number of texture maps may be used to represent an F-15 fighter aircraft. This means not only that a realistic rendering of the F-15 is possible using this data, but also that this amount of data is sufficient either to synthesize any view of the F-15 or to interact with it in three dimensions. Due to the popularity of such 3d content on the web, a textual language for the representation of 3d models and their animation has been standardized by ISO; it is called the Virtual Reality Modeling Language (VRML). However, this language is designed at a very low level and, although it is quite useful for the synthesis of 3d content, it may not be fully appropriate for describing that content in the ways users might want to query for it.
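By way of illustration only, consider the difference between the synthesis data itself and a high-level description of it. The following sketch shows the kind of description meta data that might be associated with the F-15 model data; the element and attribute names are purely hypothetical assumptions of this example and are not drawn from VRML or from any draft of the MPEG-7 standard:

<!-- Hypothetical description of the F-15 model data; all names are illustrative assumptions. -->
<SyntheticObjectDescription id="f15-01">
  <ObjectClass>fighter aircraft</ObjectClass>
  <ModelType dimensions="3d">wireframe</ModelType>
  <TextureMaps count="12"/>
  <Keywords>F-15, aircraft, military, jet</Keywords>
</SyntheticObjectDescription>

A query such as "find 3d models of fighter aircraft" can readily be matched against a description of this kind, whereas matching it directly against raw wireframe coordinates and texture maps would be impractical.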
Despite the increasing number of databases of such content, and despite users needing the ability to search these databases to find what they are looking for, the search and retrieval of synthetic audiovisual content remains a significant challenge that has yet to be addressed in a satisfactory manner. In actuality, a search for synthetic content can be performed either in the rendered domain or in the original (model and animation parameters) domain. Searching in the rendered domain is in fact very similar to searching for images and video and is, therefore, not the primary focus of this invention. Rather, this invention improves searching capabilities in the original domain. Within the context of the requirements of proposals for the MPEG-7 standard, this invention addresses this challenge and proposes a system and a method for the processing and description of synthetic audiovisual data that make such data easy to search and browse.
Accordingly, the present invention addresses the need for describing synthetic audiovisual information in a way that allows humans, software components or devices to easily identify, manage and categorize it. Such a manner of describing synthetic audiovisual content allows ease of search by a user who may be interested in locating a specific piece of synthetic content in a database, on the Internet, or in broadcast media. Moreover, in another application of this manner of describing synthetic audiovisual content, an automated information processing device may be able to locate the unique features of synthetic audiovisual content in a database or in a broadcast stream and take the necessary actions, such as filtering, collecting, managing or recording that content.
The exemplary embodiment of the present invention provides a system and method that describe information in such a way that the characteristics of synthetic audiovisual information are more easily searched, located and presented. More particularly, when describing an audiovisual scene, the scene may be partitioned into individual objects and their spatial and temporal relationships. Further, a description, or meta data, may be associated with the synthetic objects in the scene in accordance with this invention. The invention is therefore directed at providing a method and a system for describing synthetic audiovisual content.
The method and system of this invention make it easier to search, browse and retrieve synthetic audiovisual content. For instance, a user may wish to search for specific synthetic audiovisual objects in digital libraries, Internet web sites or broadcast media. Key characteristics of the content itself are employed to facilitate this. Since synthetic audiovisual content is usually generated using various types of 2d or 3d models and by defining animation parameters for these models, the features of these models and the animation characteristics that apply to them are used for describing the content. The synthetic audiovisual content and its descriptions may then be processed for direct human consumption, sent to devices, e.g., computers, within a system, or sent over a network, e.g., the Internet, and easily accessed and utilized.
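As a minimal sketch of this approach, again using purely hypothetical element names that are assumptions of this example rather than any standardized syntax, a description of a synthetic scene might associate model features and animation characteristics with each object and record the spatial and temporal relationships between objects:

<!-- Hypothetical scene description; all element names are illustrative assumptions. -->
<SceneDescription id="scene-01">
  <SyntheticObject id="aircraft-1">
    <ModelFeatures type="3d-wireframe" textureMaps="12"/>
    <AnimationFeatures motion="flight-path" duration="30s"/>
  </SyntheticObject>
  <SyntheticObject id="terrain-1">
    <ModelFeatures type="3d-mesh"/>
  </SyntheticObject>
  <SpatialRelation subject="aircraft-1" relation="above" object="terrain-1"/>
  <TemporalRelation subject="aircraft-1" relation="enters-after" object="terrain-1"/>
</SceneDescription>

A search engine or filtering device can then match a user query such as "animated 3d aircraft flying over terrain" against these descriptors without ever rendering the underlying models.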
The exemplary embodiment of the present invention addresses the draft requirements of MPEG-7 promulgated by MPEG at the time of the filing of this patent application. That is, the present invention provides object-oriented, generic abstraction and uses objects and events as fundamental entities for description. Thus, the present invention provides an efficient framework for the description of various types of synthetic visual data. The present invention is a comprehensive tool for describing synthetic visual data because it uses the eXtensible Markup Language (XML), which is self-describing. The present invention also provides flexibility, because parts of a description can be instantiated as needed, providing efficient organization. The present invention further provides extensibility and the ability to define relationships between data, because elements defined in a Description Scheme (DS) can be used to define new elements.
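As a hedged illustration of this extensibility, assume that a Description Scheme defines a generic SyntheticObject element; a new, more specific element can then be defined in terms of it. The following XML document type definition fragment is purely illustrative, with names that are assumptions of this example, and does not reproduce any actual MPEG-7 description scheme:

<!-- Hypothetical DTD fragment; all names are illustrative assumptions. -->
<!ELEMENT SyntheticObject (ModelFeatures, AnimationFeatures?)>
<!ELEMENT ModelFeatures (#PCDATA)>
<!ELEMENT AnimationFeatures (#PCDATA)>
<!ELEMENT AnimatedFace (SyntheticObject, FacialAnimationParams)>
<!ELEMENT FacialAnimationParams (#PCDATA)>

Here, the new AnimatedFace element is built from the previously defined SyntheticObject element, illustrating how elements defined in a DS can be used both to define new elements and to express relationships between data.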
These and other features and advantages of this invention are described in or are apparent from the following detailed description of the system and method according to this invention.