1. Field of the Invention
The present invention relates to the provisioning of highly interactive video/audio services, e.g. remote gaming, that are subject to hard real-time conditions and require reactive, realistic dynamic visualization. More particularly, the present invention relates to a method, an action streaming service, an action streaming client, an action streaming server, an action streaming system, an action stream, an action streaming session, an action streaming protocol, and computer software products for generating an interactive virtual reality.
The invention is based on a priority application, EP 02360239.4, which is hereby incorporated by reference.
2. Background
The real-time video/audio processing for electronic gaming and other interactive virtual reality based entertainment requires specialized and performant local devices like high-end personal computers or game consoles.
Multiple games are available for personal computers and consoles that allow a plurality of players to participate in a (shared) game. The devices use access network technology to share a virtual world, e.g. by using the Internet to exchange and align the virtual worlds. To minimize the consumed network resources, a commonly used technique is to parameterize such a virtual world.
For instance, the virtual world of a soccer game is identified by the playing teams and the location. The visualization of the location, i.e. the playground, might be a part of the local game software itself. Hence the short string “WORLD CUP 2002 FINAL” completely specifies the players and the playground graphics. The state of the game could be specified by the orientation and position of the players and the ball. The classical distributed game architecture is to align these states via a network, e.g. the Internet, and to generate the virtual reality, meaning the video and audio, locally at a game console, which handles perspectives, models, and rendering. This approach avoids exchanging large amounts of data across the network.
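The parameterization described above can be illustrated with a minimal sketch. The record layout, the field names, and the `EntityState` class are assumptions made only for this illustration; the text does not prescribe any wire format. The sketch shows why aligning game states is cheap compared with transmitting rendered video: each player or ball is reduced to an identifier, a position, and an orientation.

```python
import struct
from dataclasses import dataclass


@dataclass
class EntityState:
    entity_id: int   # hypothetical numeric id for a player or the ball
    x: float         # position on the playground
    y: float
    heading: float   # orientation in degrees


# Assumed wire format: unsigned 16-bit id followed by three 32-bit floats,
# network byte order, no padding.
_FORMAT = "!Hfff"
ENTITY_SIZE = struct.calcsize(_FORMAT)


def encode_state(entities):
    """Pack the state of all entities into one compact byte string."""
    return b"".join(
        struct.pack(_FORMAT, e.entity_id, e.x, e.y, e.heading) for e in entities
    )


def decode_state(payload):
    """Rebuild the entity list from a byte string produced by encode_state."""
    return [
        EntityState(*struct.unpack_from(_FORMAT, payload, offset))
        for offset in range(0, len(payload), ENTITY_SIZE)
    ]
```

Under these assumptions, 22 players and a ball fit into a few hundred bytes per update, whereas the rendering of the scene from that state remains a local task of the game console.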
The above architecture has been shaped by limited network resources, namely insufficient bandwidth and high delay. In the future the situation will become slightly different. Digital video and audio is an emerging technology, deploying digitally encoded audio and video streams. To support this kind of network application, the European Telecommunications Standards Institute (ETSI) designed a standard platform, the Multimedia Home Platform.
Multimedia Home Platform
The Multimedia Home Platform (MHP) defines a generic interface between interactive digital applications and the terminals on which those applications execute. This interface de-couples different providers' applications from the specific hardware and software details of different MHP terminal implementations. It enables digital content providers to address all types of terminals ranging from low-end to high-end set top boxes, integrated digital TV sets and multimedia PCs. The MHP extends the existing, successful Digital Video Broadcast (DVB) standards for broadcast and interactive services in all transmission networks including satellite, cable, terrestrial, and microwave.
The architecture of the MHP is defined in terms of three layers: resources, system software and applications. Typical MHP resources are MPEG processing, I/O devices, CPU, memory and a graphics system. The system software uses the available resources in order to provide an abstract view of the platform to the applications. Implementations include an application manager (also known as a “navigator”) to control the MHP and the applications running on it.
The core of the MHP is based around a platform known as DVB-J. This includes a virtual machine as defined in the Java Virtual Machine specification from Sun Microsystems. A number of software packages provide generic application program interfaces (APIs) to a wide range of features of the platform. MHP applications access the platform only via these specified APIs. MHP implementations are required to perform a mapping between these specified APIs and the underlying resources and system software.
The main elements of the MHP specification are:
    MHP architecture (as introduced above),
    definition of enhanced broadcasting and interactive broadcasting profiles,
    content formats including PNG, JPEG, MPEG-2 Video/Audio, subtitles and resident and downloadable fonts,
    mandatory transport protocols including DSM-CC object carousel (broadcast) and IP (return channel),
    DVB-J application model and signaling,
    hooks for HTML content formats (DVB-HTML application model and signaling),
    DVB-J platform with DVB defined APIs and selected parts from existing Java APIs, JavaTV, HAVi (user interface) and DAVIC APIs,
    security framework for broadcast application or data authentication (signatures, certificates) and return channel encryption (TLS),
    graphics reference model.
The MHP specification provides a consistent set of features and functions required for the enhanced broadcasting and interactive broadcasting profiles. The enhanced broadcasting profile is intended for broadcast (one way) services, while the interactive broadcasting profile supports in addition interactive services and allows MHP to use the world-wide communication network provided by the Internet.
The MHP therefore is simply a common Application Program Interface (API) that is completely independent of the hardware platform it is running on. Enhanced broadcasts, interactive broadcasts and Internet content from different providers can be accessed through a single device, e.g. a set top box or an integrated digital TV set (IDTV), that uses this common DVB-MHP API.
It will enable a truly horizontal market in the content, applications and services environment over multiple delivery mechanisms (Cable, Satellite, Terrestrial, etc.).
Encoding Audio and Video Streams
Encoding and decoding are crucial for deploying interactive audio/video streaming. In this area MPEG (pronounced M-peg), which stands for Moving Picture Experts Group, is the name of a family of standards used for coding audio-visual information, e.g. movies, video, music, in a digital compressed format. MPEG uses very sophisticated compression techniques.
MPEG-1 is a coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s. It addresses the problem of combining one or more data streams from the video and audio parts of the MPEG-1 standard with timing information to form a single stream. This is an important function because, once combined into a single stream, the data are in a form well suited to digital storage or transmission.
It specifies a coded representation that can be used for compressing video sequences (both 625-line and 525-line) to bit rates around 1.5 Mbit/s. It was developed to operate principally from storage media offering a continuous transfer rate of about 1.5 Mbit/s. Nevertheless it can be used more widely than this because the approach taken is generic.
A number of techniques are used to achieve a high compression ratio. The first is to select an appropriate spatial resolution for the information. The algorithm then uses block-based motion compensation to reduce the temporal redundancy. Motion compensation is used for causal prediction of the current picture from a previous picture, for non-causal prediction of the current picture from a future picture, or for interpolative prediction from past and future pictures. The difference signal, the prediction error, is further compressed using the discrete cosine transform (DCT) to remove spatial correlation and is then quantised. Finally, the motion vectors are combined with the DCT information, and coded using variable length codes.
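The motion compensation step described above can be sketched as an exhaustive block-matching search. This is an illustration only, not the MPEG-1 reference algorithm: the block size, the search range, the sum-of-absolute-differences cost, and all function names are assumptions, and the subsequent DCT, quantisation, and variable length coding stages are omitted. Frames are plain lists of pixel rows.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n)
        for i in range(n)
    )


def best_motion_vector(cur, ref, bx, by, n=4, search=2):
    """Try every displacement within +/- `search` pixels and return
    (cost, dx, dy) for the best-matching block of the reference picture."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall outside the reference picture.
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def shift_right(frame, pixels):
    """Return a copy of `frame` moved right by `pixels`, zero-filled on the left."""
    return [[0] * pixels + row[:len(row) - pixels] for row in frame]


# Synthetic 8 x 8 reference picture and a current picture in which the
# content has moved one pixel to the right.
reference = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
current = shift_right(reference, 1)
cost, dx, dy = best_motion_vector(current, reference, bx=2, by=2)
# A cost of 0 means the prediction error for this block is empty, so only
# the motion vector (dx, dy) would remain to be coded.
```

In this synthetic case the search finds the displacement that points one pixel to the left in the reference picture, with a zero prediction error. Real video rarely matches exactly, which is why the remaining difference signal is transformed and quantised.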
MPEG-1 specifies a coded representation that can be used for compressing audio sequences—both mono and stereo. Input audio samples are fed into the encoder. The mapping creates a filtered and sub-sampled representation of the input audio stream. A psycho-acoustic model creates a set of data to control the quantiser and coding. The quantiser and coding block creates a set of coding symbols from the mapped input samples. The block ‘frame packing’ assembles the actual bit-stream from the output data of the other blocks, and adds other information, e.g. error correction if necessary.
MPEG-2 describes a generic coding of moving pictures and associated audio information. Its systems part addresses the combining of one or more elementary streams of video and audio, as well as other data, into single or multiple streams which are suitable for storage or transmission. This is specified in two forms: the Program Stream and the Transport Stream. Each is optimized for a different set of applications. The Program Stream is similar to the MPEG-1 Systems Multiplex. It results from combining one or more Packetized Elementary Streams (PES), which have a common time base, into a single stream. The Program Stream is designed for use in relatively error-free environments and is suitable for applications which may involve software processing. Program Stream packets may be of variable and relatively great length.
The Transport Stream combines one or more Packetized Elementary Streams (PES) with one or more independent time bases into a single stream. Elementary streams sharing a common time-base form a program. The Transport Stream is designed for use in environments where errors are likely, such as storage or transmission in lossy or noisy media.
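As a brief illustration of why the Transport Stream suits error-prone media, the following sketch parses the fixed 4-byte header of a Transport Stream packet. The packet length of 188 bytes, the sync byte 0x47, and the header bit layout come from the MPEG-2 Systems standard (ISO/IEC 13818-1) rather than from the description above; the function name and the returned dictionary keys are illustrative assumptions.

```python
TS_PACKET_SIZE = 188   # fixed packet length defined by MPEG-2 Systems
SYNC_BYTE = 0x47       # first byte of every Transport Stream packet


def parse_ts_header(packet: bytes) -> dict:
    """Extract selected fields from the 4-byte Transport Stream packet header."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a Transport Stream packet must be 188 bytes long")
    if packet[0] != SYNC_BYTE:
        raise ValueError("sync byte 0x47 not found, receiver must resynchronize")
    return {
        # Set by a demodulator when it could not correct an error.
        "transport_error": bool(packet[1] & 0x80),
        # Marks the packet in which a new PES packet or section starts.
        "payload_unit_start": bool(packet[1] & 0x40),
        # 13-bit packet identifier selecting the elementary stream or table.
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        # 4-bit counter per PID, used to detect lost packets.
        "continuity_counter": packet[3] & 0x0F,
    }
```

The short fixed-length packets, the recurring sync byte, and the continuity counter let a receiver detect loss and resynchronize quickly, in contrast to the long variable-length packets of the Program Stream.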
MPEG-2 builds on the powerful video compression capabilities of MPEG-1 to offer a wide range of coding tools. These have been grouped in profiles to offer different functionalities.
MPEG-2 Digital Storage Media Command and Control (DSM-CC) is the specification of a set of protocols which provides the control functions and operations specific to managing MPEG-1 and MPEG-2 bit-streams. These protocols may be used to support applications in both stand-alone and heterogeneous network environments. In the DSM-CC model, a stream is sourced by a Server and delivered to a Client. Both the Server and the Client are considered to be Users of the DSM-CC network. DSM-CC defines a logical entity called the Session and Resource Manager (SRM) which provides a (logically) centralized management of the DSM-CC Sessions and Resources.
MPEG-4 builds on the three fields: Digital television, Interactive graphics applications (synthetic content), and Interactive multimedia (World Wide Web, distribution of and access to content). MPEG-4 provides the standardized technological elements enabling the integration of the production, distribution and content access paradigms of the three fields.
The following sections illustrate the MPEG-4 functionalities described above, using the audiovisual scene depicted in FIG. 2.
Coded Representation of Media Objects
MPEG-4 audiovisual scenes are composed of several media objects, organized in a hierarchical fashion. At the leaves of the hierarchy, one finds primitive media objects, such as:
    Still images, e.g. as a fixed background,
    Video objects, e.g. a talking person (without the background),
    Audio objects, e.g. the voice associated with that person, background music.
MPEG-4 provides a number of such primitive media objects, capable of representing both natural and synthetic content types, which can be either 2- or 3-dimensional. In addition to the media objects mentioned above and shown in FIG. 1, MPEG-4 defines the coded representation of objects such as text and graphics, talking synthetic heads and associated text used to synthesize the speech and animate the head; animated bodies to go with the faces, or synthetic sound.
A media object in its coded form consists of descriptive elements that allow handling the object in an audiovisual scene as well as of associated streaming data, if needed. It is important to note that in its coded form, each media object can be represented independent of its surroundings or background.
The coded representation of media objects is as efficient as possible while taking into account the desired functionalities. Examples of such functionalities are error robustness, easy extraction and editing of an object, or having an object available in a scaleable form.
Composition of Media Objects
FIG. 2 explains the way in which an audiovisual scene in MPEG-4 is described as composed of individual objects. The figure contains compound media objects that group primitive media objects together. Primitive media objects correspond to leaves in the descriptive tree while compound media objects encompass entire sub-trees. As an example: the visual object corresponding to the talking person and the corresponding voice are tied together to form a new compound media object, containing both the aural and visual components of that talking person. Such grouping allows authors to construct complex scenes, and enables consumers to manipulate meaningful (sets of) objects.
More generally, MPEG-4 provides a way to describe a scene, allowing for example to:
    place media objects anywhere in a given coordinate system;
    apply transforms to change the geometrical or acoustical appearance of an object;
    group primitive media objects in order to form compound media objects;
    apply streamed data to media objects, in order to modify their attributes (e.g. sound or animation parameters driving a synthetic face);
    change, interactively, the user's viewing and listening points anywhere in the scene.
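The hierarchical composition described above can be sketched as a small tree structure. The class `MediaObject`, its fields, and the `move` operation are hypothetical names chosen for this illustration and do not reproduce the MPEG-4 scene description syntax. As a simplification, positions are stored as absolute coordinates, whereas a real scene graph would typically hold transforms relative to the parent node.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MediaObject:
    name: str
    kind: str = "compound"   # e.g. "image", "video", "audio", or "compound"
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    children: List["MediaObject"] = field(default_factory=list)

    def leaves(self):
        """Return the primitive media objects, i.e. the leaves of the sub-tree."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found

    def move(self, dx, dy, dz):
        """Translate this object together with everything grouped under it."""
        x, y, z = self.position
        self.position = (x + dx, y + dy, z + dz)
        for child in self.children:
            child.move(dx, dy, dz)


# The talking person groups a video object and the associated voice.
person = MediaObject("talking person", children=[
    MediaObject("person video", "video"),
    MediaObject("person voice", "audio"),
])
scene = MediaObject("scene", children=[
    MediaObject("background", "image"),
    person,
])

# Moving the compound object moves picture and voice together,
# while the background stays where it is.
person.move(1.0, 0.0, 0.0)
```

The design choice illustrated here is that an operation applied to a compound media object propagates to its whole sub-tree, which is what allows a consumer to manipulate the talking person as one meaningful unit.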
The scene description builds on several concepts from the Virtual Reality Modeling language (VRML) in terms of both its structure and the functionality of object composition nodes.
Description and Synchronization of Streaming Data for Media Objects
Media objects may need streaming data, which is conveyed in one or more elementary streams. An object descriptor identifies all streams associated to one media object. This allows handling hierarchically encoded data as well as the association of meta-information about the content (called ‘object content information’) and the intellectual property rights associated with it.
Each stream itself is characterized by a set of descriptors for configuration information, e.g., to determine the required decoder resources and the precision of encoded timing information. Furthermore the descriptors may carry hints to the Quality of Service (QoS) it requests for transmission; e.g., maximum bit rate, bit error rate, priority, etc.
Synchronization of elementary streams is achieved through time stamping of individual access units within elementary streams. The synchronization layer manages the identification of such access units and the time stamping. Independent of the media type, this layer allows identification of the type of access unit; e.g., video or audio frames, scene description commands in elementary streams, recovery of the media object's or scene description's time base, and it enables synchronization among them. The syntax of this layer is configurable in a large number of ways, allowing use in a broad spectrum of systems.
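A minimal sketch of timestamp-based synchronization follows, assuming each elementary stream delivers its access units in increasing timestamp order. The `AccessUnit` type, the millisecond time base, and the `interleave` function are assumptions for illustration and are not the syntax of the MPEG-4 synchronization layer.

```python
import heapq
from typing import NamedTuple


class AccessUnit(NamedTuple):
    timestamp_ms: int   # time stamp on a common, assumed millisecond time base
    stream: str         # which elementary stream the unit belongs to
    payload: bytes


def interleave(*streams):
    """Merge several time-ordered elementary streams into presentation order."""
    return list(heapq.merge(*streams, key=lambda unit: unit.timestamp_ms))


video = [AccessUnit(t, "video", b"frame") for t in (0, 40, 80)]
audio = [AccessUnit(t, "audio", b"samples") for t in (12, 36, 60)]
timeline = interleave(video, audio)
```

Because every access unit carries its own time stamp, the receiver can restore the correct presentation order regardless of how the units of different streams were interleaved during transport.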
Delivery of Streaming Data
The synchronized delivery of streaming information from source to destination, exploiting different QoS as available from the network, is specified in terms of the synchronization layer and a delivery layer containing a two-layer multiplexer.
The first multiplexing layer is managed according to the DMIF specification (DMIF stands for Delivery Multimedia Integration Framework). This multiplex may be embodied by the MPEG-defined FlexMux tool, which allows grouping of Elementary Streams (ESs) with a low multiplexing overhead. Multiplexing at this layer may be used, for example, to group ESs with similar QoS requirements, or to reduce the number of network connections or the end-to-end delay.
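The idea of grouping elementary streams with similar QoS requirements can be sketched as follows. The `StreamDescriptor` fields and the rule of treating equal priority as "similar QoS" are simplifying assumptions for this illustration; they are not taken from the DMIF or FlexMux specifications.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamDescriptor:
    es_id: int              # hypothetical elementary stream identifier
    max_bitrate_kbps: int   # QoS hint: maximum bit rate
    priority: int           # QoS hint: higher means more important


def group_by_qos(descriptors):
    """Assign streams with equal priority to the same multiplexed channel."""
    channels = defaultdict(list)
    for descriptor in descriptors:
        channels[descriptor.priority].append(descriptor.es_id)
    return dict(channels)


streams = [
    StreamDescriptor(es_id=1, max_bitrate_kbps=1500, priority=2),  # video
    StreamDescriptor(es_id=2, max_bitrate_kbps=128, priority=3),   # audio
    StreamDescriptor(es_id=3, max_bitrate_kbps=16, priority=3),    # scene updates
    StreamDescriptor(es_id=4, max_bitrate_kbps=900, priority=2),   # second video
]
channels = group_by_qos(streams)
```

In this sketch four elementary streams end up on two channels, which reduces the number of network connections that have to be set up and managed.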
The “TransMux” (Transport Multiplexing) layer offers transport services matching the requested QoS. Only the interface to this layer is specified by MPEG-4 while the concrete mapping of the data packets and control signaling must be done in collaboration with the bodies that have jurisdiction over the respective transport protocol. Any suitable existing transport protocol stack such as (RTP)/UDP/IP, (AAL5)/ATM, or MPEG-2's Transport Stream over a suitable link layer may become a specific TransMux instance. It is possible to:
    Identify access units, transport timestamps and clock reference information and identify data loss.
    Optionally interleave data from different elementary streams into FlexMux streams.
    Convey control information to:
        indicate the required QoS for each elementary stream and FlexMux stream;
        translate such QoS requirements into actual network resources;
        associate elementary streams to media objects.
    Convey the mapping of elementary streams to FlexMux and TransMux channels.
Interaction with Media Objects
In general, the user observes a scene that is composed following the design of the scene's author. Depending on the degree of freedom allowed by the author, however, the user has the possibility to interact with the scene. Operations a user may be allowed to perform include:
    change the viewing/listening point of the scene, e.g. by navigation through a scene;
    drag objects in the scene to a different position;
    trigger a cascade of events by selecting a specific object, e.g. starting or stopping a video stream;
    select the desired language when multiple language tracks are available.
The multimedia content delivery chain encompasses content creation, production, delivery and consumption. To support this, the content has to be identified, described, managed and protected. The transport and delivery of content will occur over a heterogeneous set of terminals and networks within which events will occur and require reporting. Such reporting will include reliable delivery, the management of personal data and preferences taking user privacy into account and the management of (financial) transactions.
The MPEG-21 multimedia framework identifies and defines the key elements needed to support the multimedia delivery chain as described above, the relationships between them, and the operations supported by them. Within MPEG-21, MPEG will elaborate the elements by defining the syntax and semantics of their characteristics, such as interfaces to the elements. MPEG-21 will also address the necessary framework functionality, such as the protocols associated with the interfaces, and mechanisms to provide a repository, composition, conformance, etc.
The seven key elements defined in MPEG-21 are:
    Digital Item Declaration (a uniform and flexible abstraction and interoperable schema for declaring Digital Items);
    Digital Item Identification and Description (a framework for identification and description of any entity regardless of its nature, type or granularity);
    Content Handling and Usage (provide interfaces and protocols that enable creation, manipulation, search, access, storage, delivery, and (re)use of content across the content distribution and consumption value chain);
    Intellectual Property Management and Protection (the means to enable content to be persistently and reliably managed and protected across a wide range of networks and devices);
    Terminals and Networks (the ability to provide interoperable and transparent access to content across networks and terminals);
    Content Representation (how the media resources are represented);
    Event Reporting (the metrics and interfaces that enable Users to understand precisely the performance of all reportable events within the framework).
Problem
Content and service providers as well as end users demand remote provisioning (at the provider's facilities) of high-quality entertainment services. State-of-the-art video gaming and future virtual reality based applications will generate requirements for highly dynamic, interactive, and high-resolution audio/video. Real-time video/audio processing for electronic gaming and other interactive virtual reality based entertainment requires specialized and performant local resources, e.g. PCs or game consoles.
The problem to be solved by the invention is the provisioning of highly interactive video/audio services for end users, e.g. remote gaming, with reactive requirements and hard real-time conditions. The challenges are the real-time response to user commands and the required reactive and realistic dynamic visualization.
The solution should fit into the existing environment, i.e. a remotely hosted service, e.g. a video game, should be based on the standard broadcast TV distribution concepts and should therefore be designed with an additional control path for the user interaction, as in MHP.
Currently there are no adequate solutions for individual interactive virtual reality services, because the response time does not seem to allow realistic dynamic behavior and because extensive motion in the video stream exhausts the available bandwidth.
Remote hosted simple video games based on the standard broadcast TV distribution path and an additional control path are known, but they provide no adequate solutions for individual interactive services, because the response time does not allow realistic dynamic behavior.