1. Field of the Invention
The present invention relates to a method of decoding coded digital signals representative of audiovisual data and available in the form of a continuous bitstream, in view of the binary description of a scene to be rendered on a displaying device, said method comprising a processing operation based on an evolutive syntactic language and provided for extracting from said bitstream, in a first step, distinct elements called objects according to the structure of said scene, defining, in a second step, an individual animation of said elements of the scene, defining, in a third step, particular interactions between a user and said elements, and organizing, in a fourth step, specific relations between said scene elements and corresponding individual animations and/or user interactions according to various classes of applications. This invention will be used mainly in future MPEG-4 decoders.
2. Description of the Related Art
The most important goal of the well-known MPEG-1 and MPEG-2 standards, dealing with frame-based video and audio, was to make storage and transmission more efficient by compressing the concerned data. The future new MPEG-4 decoding standard will be fundamentally different, as it will represent the audiovisual scenes as a composition of objects rather than pixels only. Each scene is defined as a coded representation of audiovisual objects that have given relations in space and time, whatever the manner in which said given scene has been previously organized in these objects (or segmented).
Until now, natural and synthetic sources have been handled by different standardization bodies. As good three-dimensional (3D) capabilities are becoming an increasingly important part of many fields, including multimedia and World Wide Web applications that use VRML (VRML, or Virtual Reality Modelling Language, is now the standard for specifying and delivering 3D-graphics-based interactive virtual environments), the MPEG-4 standard considers jointly the natural materials (video, audio, speech) and the synthetic ones (2D and 3D graphics and synthetic sound) and tries to combine them in a standardized bitstream, in view of the presentation of such multimedia content on a terminal screen. In order to compose this audiovisual information within the scene, its spatio-temporal relationships need to be transmitted to the terminal.
The MPEG-4 standard defines a syntactic description language to describe the binary syntax of an audiovisual object's bitstream representation as well as that of the scene description information. More precisely, the MPEG-4 system Verification Model 4.0 proposes, for the description of the scenes, a binary format called the Binary Format for Scenes (BIFS). This description, constructed as a coded hierarchy of nodes with attributes and other information such as event sources and targets, is based on the assumption that the scene structure is transmitted as a parametric description (or a script) rather than as a computer program. The scene description can then evolve over time by using coded scene description updates. The node descriptions, which are conveyed in a BIFS syntax, may also be represented, for the purpose of clarity, in a textual form. Some MPEG-4 nodes and concepts are direct analogues of the VRML 2.0 nodes. Others are modified VRML 2.0 nodes, and still others are added for specific MPEG-4 requirements. Like the VRML 2.0 syntax, the BIFS has provisions for describing simple behaviors and interaction with the user through an event passing mechanism. However, some problems, explained hereunder, are not solved by this format.
The first of these problems concerns a unified description of a mixed 2D and 3D scene. There is indeed a fundamental difference between the description of a purely 3D scene, the description of a purely 2D scene, and the description of a mixed 2D/3D scene. In a 3D scene, the layering of the objects is based on the depth information. In 2D, the notion of depth is absent and the layering should be defined explicitly. Furthermore, mixing 2D and 3D objects may be accomplished in several ways:
(1) embedding of 3D objects in a 2D scene:
(a) this is, for example, the case when one tries to render 3D objects in front of a 2D background: in this case, when the user navigates in the scene, the background does not move;
(b) another example is an application in which the user interface contains 2D objects (such as buttons or text) and a 3D viewer where the scene is rendered;
(2) embedding of 2D objects in a 3D scene:
(a) this is, for example, the case when one uses a video object as a texture map on 3D objects;
(b) another example is a texture made of 2D graphic objects (a special case of this is an "active map", that is, a 2D plane in a 3D scene made of several composited 2D objects);
(3) these two schemes may be mixed recursively, for example, for embedding 3D objects in a 2D scene and using the resulting composition as a texture map on 3D objects (this may be used to simulate the reflection of a mirror);
(4) a last possibility is to view the same 3D scene simultaneously from different viewpoints.
At present, it is not possible to describe all these possibilities using a single scene graph. A scene graph is a tree that represents a scene by means of a hierarchy of objects called nodes. The scene is composed of grouping nodes and children nodes. The role of grouping nodes is to define the hierarchy and the spatial organization of the scene. Children nodes are the leaves of the tree. These nodes are used to define geometric objects and light sources, as well as various types of sensors (objects that are sensitive to user interaction). Grouping nodes have children, which may be children nodes or other grouping nodes.
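As an illustration only, the tree structure described above can be sketched in Python. The class names `GroupingNode` and `ChildNode`, the `kind` labels, and the example node names are hypothetical and are not taken from the BIFS or VRML syntax.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ChildNode:
    """A leaf of the scene graph: a geometric object, light source, or sensor."""
    name: str
    kind: str  # illustrative label such as "geometry", "light", "sensor"


@dataclass
class GroupingNode:
    """An interior node defining the hierarchy and spatial organization."""
    name: str
    children: List[Union["GroupingNode", ChildNode]] = field(default_factory=list)


def leaves(node):
    """Return the names of all leaf nodes under `node`, in depth-first order."""
    if isinstance(node, ChildNode):
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names


# A grouping node may contain children nodes or other grouping nodes.
scene = GroupingNode("root", [
    GroupingNode("room", [
        ChildNode("sphere", "geometry"),
        ChildNode("lamp", "light"),
    ]),
    ChildNode("touch_sensor", "sensor"),
])

leaf_names = leaves(scene)
```

In this sketch, walking the tree from the root visits the nested group first and then the sensor attached directly to the root.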
All nodes may have attributes which are called fields. The fields may be of any type. For example, sphere is a geometry node. It has a field that defines its radius. It is a single value field of type float (SFFloat). Children nodes of a grouping node are specified in a special field. This field is a multiple value field (a list of nodes), and each value is of type node (MFNode).
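The distinction between a single value field and a multiple value field may be sketched as follows. This is a minimal Python analogy, not the BIFS encoding: the classes `Sphere` and `Group` and the chosen values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sphere:
    """Geometry node with one single value field of type float (SFFloat)."""
    radius: float = 1.0


@dataclass
class Group:
    """Grouping node whose children are held in a multiple value field (MFNode)."""
    children: List[object] = field(default_factory=list)


# The MFNode field is a list; each of its values is itself a node.
group = Group(children=[Sphere(radius=2.5), Sphere()])
radii = [node.radius for node in group.children]
```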
Now, in order to define animations and user interaction in the scene, it is possible to make connections between fields using an event passing mechanism called routing. Routing a field A to a field B means that whenever field A changes, field B will take the same value as field A. Only fields of the same type (or the same kind) may be connected. Fields may be specialized: some may only be the destination of a route and are called eventIn; others may only be at the origin of a route and are called eventOut; others may be both the origin and the destination of routes and are called exposedField; and, finally, others may not be connected and are simply called field.
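The routing rule described above can be sketched in a few lines of Python. The `Field` class, its `can_send` and `can_receive` flags, and the example field names are hypothetical; the sketch propagates values immediately and does not guard against routing loops, which a real decoder would have to handle.

```python
class Field:
    """A typed value that may be the origin and/or the destination of routes."""

    def __init__(self, type_name, value, can_send=True, can_receive=True):
        self.type_name = type_name
        self.value = value
        self.can_send = can_send        # False for an eventIn or a plain field
        self.can_receive = can_receive  # False for an eventOut or a plain field
        self._routes = []

    def route_to(self, target):
        """Connect this field to `target`, enforcing the rules stated above."""
        if not self.can_send or not target.can_receive:
            raise ValueError("access type does not allow this route")
        if self.type_name != target.type_name:
            raise TypeError("only fields of the same type may be connected")
        self._routes.append(target)

    def set(self, value):
        """Change the value and forward it along every outgoing route."""
        self.value = value
        for target in self._routes:
            target.set(value)


# An eventOut of a sensor routed to an exposedField of a sphere.
sensor_fraction = Field("SFFloat", 0.0, can_receive=False)  # eventOut
sphere_radius = Field("SFFloat", 1.0)                       # exposedField
sensor_fraction.route_to(sphere_radius)
sensor_fraction.set(0.75)
```

After the last call, `sphere_radius.value` holds the value emitted by the sensor field, which is the behavior the routing definition requires.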
In VRML, four nodes (Viewpoint, Background, Fog and NavigationInfo) play a special role in the sense that only one of each may be active at a given time. These nodes are said to be bindable nodes.
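The "only one active at a time" rule can be illustrated with a small stack, in which the most recently bound node is the active one. The `BindingStack` class and the viewpoint names below are assumptions made for this sketch and do not reproduce the exact VRML binding semantics.

```python
class BindingStack:
    """Keeps bindable nodes of one type; only the top of the stack is active."""

    def __init__(self):
        self._stack = []

    def bind(self, node):
        """Make `node` the active one, moving it to the top if already present."""
        if node in self._stack:
            self._stack.remove(node)
        self._stack.append(node)

    def unbind(self, node):
        """Remove `node`; the previously bound node, if any, becomes active."""
        if node in self._stack:
            self._stack.remove(node)

    @property
    def active(self):
        return self._stack[-1] if self._stack else None


viewpoints = BindingStack()
viewpoints.bind("front_view")
viewpoints.bind("top_view")
active_after_bind = viewpoints.active      # the last bound viewpoint
viewpoints.unbind("top_view")
active_after_unbind = viewpoints.active    # falls back to the earlier one
```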
There are many reasons to try to integrate both 2D and 3D features in one coherent framework:
it is possible to use the same event passing mechanism for the whole 2D/3D scene;
the representation of content can be more compact;
the implementation can be optimized because 2D and 3D specifications have been designed to work together.
In order to fulfill these requirements, one needs to be able to compose, in a 2D space, 2D and 3D layers representing the result of the rendering of a 2D or a 3D scene, as well as using the result of rendering of a 2D or 3D scene as an input to other nodes in the scene graph.
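A minimal sketch of composing layers in a 2D space is given below, assuming each layer is the already rendered result of a 2D or 3D scene and carries an explicit drawing order. The function `compose`, the dictionary keys, and the character-based "pixels" are illustrative assumptions only.

```python
def compose(layers, width, height, background="."):
    """Paint layers onto a 2D canvas in increasing explicit order."""
    canvas = [[background] * width for _ in range(height)]
    for layer in sorted(layers, key=lambda item: item["order"]):
        x0, y0 = layer["origin"]
        for dy, row in enumerate(layer["pixels"]):
            for dx, ch in enumerate(row):
                x, y = x0 + dx, y0 + dy
                # A space is treated as transparent; off-canvas pixels are clipped.
                if ch != " " and 0 <= x < width and 0 <= y < height:
                    canvas[y][x] = ch
    return ["".join(row) for row in canvas]


layers = [
    {"order": 1, "origin": (1, 0), "pixels": ["33", "33"]},      # rendered 3D view
    {"order": 0, "origin": (0, 0), "pixels": ["2222", "2222"]},  # 2D background
    {"order": 2, "origin": (3, 1), "pixels": ["B"]},             # 2D button on top
]
result = compose(layers, 4, 2)
```

Because 2D has no depth information, the stacking here depends only on the explicit `order` value, so changing that value changes which layer appears in front.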
Other problems, not yet solved, must also be considered, especially the following:
(1) interactivity with the 2D objects: it may be necessary to be able to interact with the objects, change the layering, add or remove objects, which is not possible without a method to set the depth of a 2D object that is compatible with the event passing mechanism of VRML 2.0;
(2) single event routing mechanism, in order to be provided with interactivity and simple behavior capabilities: an example of this could be the display of a 2D map in a walk through application, the map being used to navigate, which requires the capacity to route a user triggered event from a 2D object (the map) to the 3D scene (the view point);
(3) global hierarchy of the scene: while a scene graph representation involves a hierarchical organization of the scene, 2D or 3D layers should not be considered as other graphic objects, and mixed with the global scene graph (moreover, layers may be hierarchical, as illustrated for instance in the layer graph of FIG. 1, explained later);
(4) interactivity with video objects: one of the features of MPEG-4 video is an object level interaction, i.e., the description of video as a set of objects rather than a set of pixels, which allows the interaction with the content of the video (such as cut and paste of an object within a video) and needs to be defined for each application by the content creator (said interaction, being not a feature of the terminal itself, may be described by means of BIFS, but, for this, the composition of the various video objects has to be described in the BIFS itself).
It is therefore an object of the invention to provide an enhancement of the BIFS in order to fully describe the composition of complex scenes built from both 2D and 3D objects. This enhancement allows a unified representation of the complete scene and its layout, as well as event passing not only within the 3D scene (as in VRML 2.0) but also between 2D and 3D nodes, and also allows the definition of specific user interfaces that may be transmitted with the scene, rather than the use of a default user interface provided by the terminal.
To this end, the invention relates to a method as described in the preamble of the description and which is further characterized in that said processing operation also includes an additional step for describing a complex scene, built from any kind of bidimensional and tridimensional objects, according to a framework integrating both bidimensional and tridimensional features and unifying the composition and representation mechanisms of the scene structure.
More precisely, said framework may be characterized in that said additional description step comprises a first main sub-step for defining a hierarchical representation of said scene according to a tree structure organized both in grouping nodes, that indicate the hierarchical connections giving the spatial composition of the concerned scene, and in children nodes, that constitute the leaves of the tree, and a second auxiliary sub-step for defining possible transversal connections between any kind of nodes.
In an advantageous embodiment of the proposed method, the nodes of the tree structure comprise at least bidimensional and tridimensional objects, and the auxiliary definition sub-step comprises a first operation for embedding at least one of said bidimensional objects within at least one of said tridimensional objects, an optional second operation for defining transversal connections between said tridimensional and bidimensional objects, and an optional third operation for controlling the definition step of at least one individual animation and/or at least one particular interaction both in the embedded bidimensional object(s) and in the corresponding original one(s).
In another advantageous embodiment of the method, the nodes of the tree structure comprise at least bidimensional and tridimensional objects, and the auxiliary definition sub-step comprises a first operation for embedding at least one of said tridimensional objects within at least one of said bidimensional objects, an optional second operation for defining transversal connections between said bidimensional and tridimensional objects, and an optional third operation for controlling the definition step of at least one individual animation and/or at least one particular interaction both in the embedded tridimensional object(s) and in the corresponding original one(s).
In another advantageous embodiment of the method, the nodes of the tree structure comprise at least tridimensional objects, and the auxiliary definition sub-step comprises a first operation for embedding at least one of said tridimensional objects within at least one of any of said tridimensional objects, an optional second operation for defining transversal connections between said tridimensional objects, and an optional third operation for controlling the definition step of at least one individual animation and/or at least one particular interaction both in the embedded tridimensional object(s) and in the corresponding original one(s).
In either of these last two embodiments, it can be noted that said auxiliary definition sub-step may also comprise an additional operation for controlling the simultaneous rendering of at least one single tridimensional scene from various viewpoints while maintaining the third operation for controlling the definition step of the individual animation(s) and/or the particular interaction(s).
The invention relates not only to the previously described method, with or without the optional operations, but also to any signal obtained by implementing such a method in any one of its variants. It is clear, for instance, that the invention relates to a signal obtained after having extracted from the input bitstream, in a first step, distinct elements called objects according to the structure of a scene, defined, in a second step, an individual animation of said elements of the scene, defined, in a third step, particular interactions between a user and said elements, organized, in a fourth step, specific relations between said scene elements and corresponding individual animations and/or user interactions according to various classes of applications, and carried out an additional step for describing a complex scene, built from any kind of bidimensional and tridimensional objects, according to a framework integrating both bidimensional and tridimensional features and unifying the composition and representation mechanisms of the scene structure.
Such a signal makes it possible to describe, together, bidimensional and tridimensional objects, and to organize a hierarchical representation of a scene according to a tree structure, itself organized in grouping nodes defining the hierarchical connections and in children nodes, said nodes together forming a single scene graph constituted of a 2D scene graph, a 3D scene graph, a layers scene graph, and transversal connections between nodes of this scene graph.
Such a signal also makes it possible to define 2D or 3D scenes already composed or that have to be composed on a screen, with a representation of their depth, or to define 3D scenes in which other scenes already composed of 2D or 3D objects will be embedded, or also to define textures for 3D objects themselves composed of other 3D or 2D objects. In fact, such a signal makes it possible to interact with any 2D or 3D object of the scene and to organize any kind of transmission of data between all these objects of the scene. Obviously, the invention also relates to a storage medium for memorizing said signal, whatever its type or its composition. Finally, the invention also relates to a device for displaying or delivering in any other manner graphic scenes on the basis of signals such as described above, in order to reconstruct any kind of scene including bidimensional and tridimensional objects.