1. Field of the Invention
This invention relates to the creation and manipulation of multimedia data, and more particularly to transcoding between multimedia content data and multimedia description data.
2. Background
Digital multimedia content is increasingly available via the Internet, digital television, and so on. To support ubiquitous digital multimedia content, a key requirement is a standard representation of multimedia information. On the Internet, the hypertext markup language (HTML) and Synchronized Media Integration Language (SMIL) are common standards for representing multimedia content. HTML is Standard Generalized Markup Language (SGML) based standard defined by the World Wide Web Consortium (W3C). HTML describes a Web page as a set of media elements or resources, such as images, video, audio, and JAVA® applications, together with a presentation structure. The presentation structure includes information about the intended presentation of the media resources when the HTML web page is displayed in an Internet browser. This includes, for example, information about the layout of the different multimedia elements. HTML uses nested tags to represent the presentation structure. A more recent version of HTML called XHTML is a functionally equivalent version of HTML that is based on XML rather than SGML.
The W3C has promulgated SMIL, an XML-based language for integrating different media resources such as images, video, audio, etc. into a single presentation. SMIL contains features that allow for referencing media resources and controlling their presentation including timing and layout, and features for linking to other presentations in order to create hypermedia presentations. SMIL is strictly an integration language which does not define any representations for the media resources or objects used in a presentation. Instead, SMIL defines a set of tags that allow media resources to be integrated together into a single presentation. While some SMIL features exist in HTML, SMIL focuses on the spatial and temporal layout of media resources and provides greater control of interactivity than HTML.
Digital multimedia information is becoming widely distributed though broadcast transmission via, for example, digital television signals and interactive transmission via, for example, the Internet. The multimedia information may be still images, audio feeds, or video data streams. However, the availability of a large volume of multimedia information has led to difficulties in identifying content that is of particular interest to a user. Various organizations have attempted to deal with the problem by providing a description of the content of the multimedia information that can be used to search, filter and/or browse to locate specified content. The Moving Picture Experts Group (MPEG) has promulgated a Multimedia Content Description Interface standard, commonly referred to as MPEG-7 to standardize content descriptions for multimedia information. In contrast to preceding MPEG standards such as MPEG-4, which define how to represent coded multimedia content, MPEG-7 specifies how to describe the multimedia content.
MPEG-4 specifies how to represent units of aural, visual or audiovisual content as media objects, each of which is represented as a single elementary stream. In MPEG-4, media objects are composed together to create audiovisual scenes. An audiovisual scene represents a complex presentation of different multimedia objects in a structured fashion. Within scenes, media objects can be natural, meaning captured from the world, or synthetic, meaning generated with a computer or other device. For example, a scene containing text and an image with an audio background would be described in MPEG-4 with media objects for the text, image, and audio stream, and a scene that describes how to compose the objects. The composition is represented by a scene graph made up of nodes for the media objects and nodes for composing the media objects into a scene. MPEG-4 audiovisual scenes are composed of media objects, organized into a hierarchical tree structure, which is called a scene graph. Primitive media objects such as still images, video, and audio are placed at the leaves of the scene graph. MPEG-4 standardizes representations for many of these primitive media objects, such as video and audio, but is not limited to use with MPEG-4 specified media representation. Each media object contains information to allow the object to be included into the audiovisual scene.
The primitive media objects are found at the bottom of the scene graph as leaves of the tree. More generally, MPEG-4 scene descriptions can place media objects spatially in two-dimensional (2-D) and three dimensional (3-D) coordinate systems, apply transforms to change the presentation of the objects (e.g. a spatial transform such as a rotation), group primitive media objects to form compound media objects, and sychronize presentation of objects within a scene. The MPEG-4 scene description builds on concepts from the Virtual Reality Modeling Language (VRML). The W3C has defined an XML-based representation of VRML scenes, called Extensible 3D (X3D). While MPEG-4 scenes are encoded for transmission in an optimized binary manner, MPEG has also defined an XML-based representation for MPEG-4 scene descriptions, called the Extensible MPEG-4 Textual format (XMT). XMT represents MPEG-4 scene descriptions using an XML-based textual syntax. Additional introductory information about MPEG-4 may be obtained from Overview of the MPEG-4 Standard, Koenen, R. ed., V.18—Singapore Version, March 2001 (ISO/IEC JTC1/SC29/WG11 N4030).
XMT can interoperate with SMIL, VRML, and MPEG-4 players. The XMT format can be interpreted and played back directly by an SMIL player and easily converted to the X3D format before being played back by a X3D or VRML player. XMT can also be compiled to an MPEG-4 representation, such as the MPEG-4 file format (called MP4), which can then be played by an MPEG-4 player. XMT contains two different formats: the XMT-A format and the XMT-Ω format. XMT-A is an XML-based version of MPEG-4 content that contains a subset of X3D with extensions to X3D to allow for representing MPEG-4 specific features. XMT-A provides a one-to-one mapping between the MPEG-4 textual and binary formats. XMT-Ω is a high-level version of an MPEG-4 scene based on SMIL.
With regard to the description of content, MPEG-7 may be used to describe MPEG-4, SMIL, HTML, VRML and other multimedia content data. MPEG-7 uses a Data Definition Language (DDL) that specifies the language for defining the standard set of description tools and for defining new description tools, and provides a core set of descriptors and description schemes. The DDL definitions for a set of descriptors and description schemes are organized into “schemas” for different classes of content. The DDL definition for each descriptor in a schema specifies the syntax and semantics of the corresponding feature. The DDL definition for each description scheme in a schema specifies the structure and semantics of the relationships among its children components, the descriptors and description schemes. The DDL may be used to modify and extend the existing description schemes and create new description schemes and descriptors. The format of the MPEG-7 DDL is based on XML and XML Schema standards. The descriptors, description schemes, semantics, syntax, and structures are represented with XML elements and XML attributes.
Using a movie as an example of multimedia content, a corresponding MPEG-7 content description would contain “descriptors” (“D”), which are components that describe the features of the movie, such as scenes, titles for scenes, shots within scenes, and time, color, shape, motion, and audio information for the shots. The content description would also contain one or more “description schemes” (“DS”), which are components that describe relationships among two or more descriptors, such as a shot description scheme that relates together the features of a shot. A description scheme can also describe the relationship among other description schemes, and between description schemes and descriptors, such as a scene description scheme that relates the different shots in a scene, and relates the title feature of the scene to the shots.
The MPEG-7 content description for a particular piece of content is defined as an instance of an MPEG-7 schema; that is, it contains data that adheres to the syntax and semantics defined in the schema. The content description is encoded in an “instance document” that references the appropriate schema. The instance document contains a set of “descriptor values” for the required elements and attributes defined in the schema, and for any necessary optional elements and/or attributes. For example, some of the descriptor values for a particular movie might specify that the movie has three scenes, with scene one having six shots, scene two having five shots, and scene three having ten shots. The instance document may be encoded in a textual format using XML, in a binary format such as the binary format specified for MPEG-7 data known as “BiM,” and in a mixture of the two formats.
The instance document is transmitted through a communication channel, such as, for example, a computer network, to another system that uses the content description data contained in the instance document to search, filter and/or browse the corresponding content data stream. Typically, the instance document is compressed for faster transmission. An encoder component may both encode and compress the instance document. Further, the instance document may be generated by one system and subsequently transmitted by a different system. A corresponding decoder component at the receiving system uses the referenced schema to decode the instance document. The schema may be transmitted to the decoder separately from the instance document, may be transmitted to the decoder as part of the same transmission, or may be obtained by the receiving system from another source. Alternatively, certain schemas may be incorporated into the decoder.
With the growing popularity of digital devices such as personal computers, digital cameras, personal digital assistants (PDAs), cellular telephones, scanners and the like, multimedia data formatted according to well known standards is being created, edited and shared by all members of society, from hobbyists to neophytes to experts. Many standards govern the capturing, storage and transmission of multimedia data. The MPEG standards are widely accepted by manufactures of digital devices and are increasingly being incorporated into digital devices that allow for creation, editing and sharing of multimedia data.