The invention relates to streaming multimedia files via a network. The invention relates in particular to enabling the emulation of streaming graphics or video animation over the Internet within a broadcast context. The invention relates in particular to enabling an end-user to interact with the environment created through the graphics or video animation.
The term "streaming" refers to transferring data from a server to a client so that it can be processed as a steady and continuous stream at the receiving end. Streaming technologies are becoming increasingly important with the growth of the Internet because most users do not have fast enough access to download large multimedia files comprising, e.g., graphics animation, audio, video, or a combination thereof, etc. Streaming, however, enables the client's browser or plug-in to start processing the data before the entire file has been received. For streaming to work, the client side receiving the file must be able to collect the data and send it as a steady stream to the application that is processing the data. This means that if the client receives the data faster than required, the excess data needs to be buffered. If the data does not arrive in time, on the other hand, the presentation of the data will not be smooth.
The term "file" is used herein to indicate an entity of related data items available to a data processing system and capable of being processed as an entity. Within the context of the invention, the term "file" may refer to data generated in real-time as well as data retrieved from storage.
Among the technologies that are currently available or under development for the communication of graphics data via the Internet are VRML 97 and MPEG-4. VRML 97 stands for "Virtual Reality Modeling Language", and is an International Standard (ISO/IEC 14772) file format for describing interactive 3D multimedia content on the Internet. MPEG-4 is an ISO/IEC standard being developed by MPEG (Moving Picture Experts Group). In both standards, the graphical content is structured in a so-called scene graph. A scene graph is a family tree of coordinate systems and shapes that collectively describe a graphics world. The top-most item in the scene family tree is the world coordinate system. The world coordinate system acts as the parent for one or more child coordinate systems and shapes. Those child coordinate systems are, in turn, parents to further child coordinate systems and shapes, and so on.
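The parent/child structure of a scene graph described above can be illustrated with a minimal sketch. This is not VRML or MPEG-4 syntax; the `Node` class, its `offset` field and the node names are hypothetical, and each coordinate system is assumed to be related to its parent by a translation only (real scene graphs also carry rotation and scale).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(eq=False)
class Node:
    """A coordinate system or shape in a simplified scene graph (hypothetical)."""
    name: str
    offset: Vec3 = (0.0, 0.0, 0.0)  # assumed: translation relative to the parent
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child coordinate system or shape and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def world_position(self) -> Vec3:
        """Accumulate offsets from this node up to the world coordinate system."""
        x, y, z = self.offset
        node = self.parent
        while node is not None:
            x += node.offset[0]
            y += node.offset[1]
            z += node.offset[2]
            node = node.parent
        return (x, y, z)


# The world coordinate system is the top-most item of the tree.
world = Node("world")
stadium = world.add(Node("stadium", (100.0, 0.0, 50.0)))
player = stadium.add(Node("player", (10.0, 0.0, 5.0)))

print(player.world_position())
```

Moving the hypothetical `stadium` node in this sketch moves every child placed in its coordinate system, which is the property that makes the tree structure convenient for describing a graphics world.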
VRML is a file format for describing objects. VRML defines a set of objects useful for doing 3D graphics, multi-media, and interactive object/world building. These objects are called nodes, and contain elemental data that is stored in fields and events. Typically, the scene graph comprises structural nodes, leaf nodes, interpolation nodes and sensor nodes. The structural nodes define the spatial relationship of objects within a scene. The leaf nodes define the physical appearance of the objects. The interpolation nodes define animations. The sensor nodes define user interaction for particular user input modalities. VRML does not directly support streaming of data from a server into a client. Facilities such as synchronization between streams and time stamping that are essential in streaming do not exist in VRML. However, VRML has a mechanism that allows external programs to interact with VRML clients. This has been used in sports applications to load animation data into the client. See, for example, "VirtuaLive Soccer" of Orad Hi-Tec Systems, Ltd. at <http://www.virtualive.com>. This web document discusses a process for producing realistic, animated, three-dimensional graphic clips that simulate actual soccer match highlights for being sent via the Internet. The system generates content that complements television sports coverage with multimedia-rich Web pages in near real time. In this example, the process works in two steps. First the graphics models of the stadium and of the soccer players are downloaded along with an external program, in this case a Java Applet. The user can then interact with the external program to request a particular animation. The data for this animation is then downloaded into the client and interacted with by the user: the user can view scenes of the match from different points of view in the animation and in slow motion if desired.
In terms of node type, this process first downloads the structural and leaf nodes, and thereupon the interpolation nodes. By changing the set of interpolation nodes, it is possible to run a different animation sequence. The process used in this example is somewhat equivalent to a single step process in which the user can choose the complete VRML file that contains all the models (structural nodes) and all the animation data (interpolator nodes). This approach leads to long download times before any content can be played on the client. Users find this frustrating, especially when compared to TV broadcast, where content is available instantly.
The other technology introduced above, MPEG-4, defines a binary description format for scenes (BIFS) that has a wide overlap with VRML 97. MPEG-4, on the other hand, has been designed to support streaming of graphics as well as video. MPEG-4 defines two server/client protocols for updating and animating scenes: BIFS-Update and BIFS-Anim. Some of the advantages of MPEG-4 over VRML are the coding of the scene description and of the animation data as well as the built-in streaming capability. The user does not have to wait for the complete download of the animation data. For example, in the soccer match broadcast application mentioned earlier, the animation can start as soon as the models of the players and the stadium are downloaded. MPEG-4 further has the advantage that it is more efficient owing to its BIFS transport protocol that uses a compressed binary format.
Within the context of streaming, the known technologies mentioned above have several limitations with regard to bandwidth usage, packet-loss concealment or recovery and multi-user interactivity, especially in a broadcast to large numbers of clients.
As to bandwidth, the complete animation is generated at the server. This results in a large amount of data that needs to be transported over the network, e.g., the Internet, connecting the client to the server. For example, in the soccer broadcast application mentioned above, the 22 soccer players need to be animated. Each animation data point per individual player comprises a position in 3D space and a set of, say, 15 joint rotations to model the player's posture. This represents 63 floating-point values. If it is assumed that the animation update rate is 15 data points per second, a bit-rate of 665 Kbps is required. This bit-rate can be reduced through compression. Typically, using BIFS reduces the bit-rate by a factor of 20, giving a bit-rate of about 33 Kbps. However, this number does not take into account the overhead required for the Internet protocols (RTP, UDP and IP) and for additional data types, such as audio. Moreover, typical modems currently commercially available on the consumer market have a capacity of 28.8 Kbps or 33.6 Kbps. It is clear that streaming animation causes a problem at the end user due to bandwidth limitations. In the case of a broadcast to a large number of clients, say 100,000 clients, the data stream will need to be duplicated at several routers. A router on the Internet determines the next network point to which a packet should be forwarded on its way toward its final destination. The router decides which way to send each information packet based on its current understanding of the state of the networks it is connected to. A router is located at any juncture of networks or gateway, including each Internet point-of-presence. It is clear that the broadcast could lead to an unmanageable data explosion across the Internet. To prevent that from happening, the actual bandwidth needs to be limited to much lower than 28.8 Kbps.
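The bit-rate estimate above can be reproduced with a short calculation. The sketch below assumes 32-bit floating-point values, which the text does not state explicitly but which matches the quoted 665 Kbps. The figure of 63 values per player is taken from the text as given; it is consistent with, for example, 3 position values plus 15 rotations of 4 values each, but that breakdown is also an assumption.

```python
# Worked version of the bit-rate estimate in the text.
PLAYERS = 22
FLOATS_PER_DATA_POINT = 63    # per player, as stated in the text
BITS_PER_FLOAT = 32           # assumption: single-precision floating point
UPDATES_PER_SECOND = 15       # animation update rate from the text
BIFS_COMPRESSION_FACTOR = 20  # typical reduction quoted in the text


def raw_bit_rate_bps(players: int, floats_per_point: int,
                     bits_per_float: int, updates_per_second: int) -> int:
    """Uncompressed animation bit-rate in bits per second."""
    return players * floats_per_point * bits_per_float * updates_per_second


raw_bps = raw_bit_rate_bps(PLAYERS, FLOATS_PER_DATA_POINT,
                           BITS_PER_FLOAT, UPDATES_PER_SECOND)
compressed_bps = raw_bps / BIFS_COMPRESSION_FACTOR

print(f"raw:        {raw_bps / 1000:.1f} Kbps")
print(f"compressed: {compressed_bps / 1000:.1f} Kbps")
```

Under these assumptions the compressed rate alone already sits at the capacity of a 33.6 Kbps modem, before any protocol overhead or audio is added.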
As to packet loss concealment, VRML-based systems utilize reliable protocols (TCP). Packet losses are not an issue here. In the case of MPEG-4, BIFS uses RTP/UDP/IP. A packet loss recovery mechanism is therefore required. In a point-to-point application, re-transmission of lost packets can be considered. In a broadcast situation, however, this is much more complex. In both cases, however, MPEG reliability requires either higher bandwidth usage (redundancy) or higher latency (retransmission).
As to multi-user interactivity, both VRML and MPEG-4 are essentially based on a server-client communication. No provisions exist to enable communication among multiple clients.
For more information on VRML see, for example, "Key Concepts", Mar. 5, 1996, at: <http://sgi.felk.cvut.cz/~holecek/VRML/concepts.html>, or "Internetwork Infrastructure Requirements for Virtual Environments", D. P. Brutzman et al., Jan. 23, 1996, publicly available at: <http://www.stl.nps.navy.mil/~brutzman/vrml/vrml13 95.html>.
For more information on MPEG-4 see, for example, "Overview of the MPEG-4 Standard", ISO/IEC JTC1/SC29/WG11 N2323 ed. Rob Koenen, July 1998, publicly available at <http://drogo.cselt.stet.it/mpeg/standards/mpeg-4/mpeg-4.htm>.
It is therefore an object of the invention to provide a technology that enables a client to process multimedia data as if it were a steady and continuous stream. It is another object to enable the continuous processing at a large number of clients in a broadcast over the Internet. It is noted that the problems identified above become rather acute in a broadcast application. It is another object to use this technology for creating an interactive software application that enables the user to navigate in a continuously evolving electronic virtual environment.
To this end, the invention provides a method of emulating streaming a multimedia file via a network to a receiving station connected to the network. Respective state information descriptive of respective states of the file is supplied. The receiving station is enabled to receive the respective state information via the network and is enabled to locally generate the multimedia file under control of the respective state information. In a broadcast for animation, the invention relates to a method of supplying data via a network for enabling displaying graphics animation. Respective state information is supplied over the network descriptive of successive respective states of the animation. The respective state information is received via the network. The receiving station is enabled to generate the animation under control of the respective state information upon receipt. More particularly, the invention provides a method of enabling a user to navigate in a continuously evolving electronic virtual environment. The method comprises: providing to the user a world model of the environment; sending state changes of the world model representative of the evolving; enabling the user to provide input for control of a position of an object relative to the virtual environment; and creating the environment from the world model and in response to the state changes and the user input.
In the invention the multimedia file (animation, video or audio file) is described as a succession of states. It is this state information that gets transmitted to the clients rather than the animation data itself. The term "emulating" therefore emphasizes that the information communicated to the client need not be streamed. The client generates the data for play-out locally and based on the state information received. Accordingly, the user perceives a continuous and steady stream of data during play-out as if the data were streamed over the network (under optimal conditions).
In a preferred embodiment, a shared-object protocol is used to accomplish the emulation. Both a server and a client have copies of a collection of objects. An object is a data structure that holds state information. Within the context of the virtual soccer match, an object is, for example, a graphics representation of one of the soccer players. The server receives a streamed video file and monitors the objects. It is noted that MPEG-4 enables the creation of video objects that are processed as an entity. If the server changes the state of this object, the shared object protocol causes the copy at the client to change accordingly. This is explained in more detail with reference to the drawings.
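The shared-object idea described above can be sketched in a few lines. This is an illustrative, in-process model only, not the protocol of the preferred embodiment: the `Server`, `Client` and `SharedObject` classes, the callback-based delivery and the object identifier `player_7` are all hypothetical, and a real system would send the changes over the network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedObject:
    """A data structure that holds state information (hypothetical layout)."""
    object_id: str
    state: Dict[str, object] = field(default_factory=dict)


class Client:
    """Holds local copies of the shared objects."""

    def __init__(self) -> None:
        self.objects: Dict[str, SharedObject] = {}

    def on_update(self, object_id: str, changes: Dict[str, object]) -> None:
        # Apply only the changed fields to the local copy.
        obj = self.objects.setdefault(object_id, SharedObject(object_id))
        obj.state.update(changes)


class Server:
    """Holds the master copies and propagates state changes to subscribers."""

    def __init__(self) -> None:
        self.objects: Dict[str, SharedObject] = {}
        self.subscribers: List[Callable[[str, Dict[str, object]], None]] = []

    def subscribe(self, deliver: Callable[[str, Dict[str, object]], None]) -> None:
        self.subscribers.append(deliver)

    def set_state(self, object_id: str, **changes: object) -> None:
        obj = self.objects.setdefault(object_id, SharedObject(object_id))
        obj.state.update(changes)
        # Stand-in for the network: hand each subscriber its own copy of the changes.
        for deliver in self.subscribers:
            deliver(object_id, dict(changes))


server = Server()
client = Client()
server.subscribe(client.on_update)

server.set_state("player_7", position=(10.0, 0.0, 25.0), action="running")
server.set_state("player_7", action="jumping")

print(client.objects["player_7"].state)
```

Only the changed fields travel from server to client in this sketch; the second update carries the new action but not the unchanged position.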
This state information is at a higher level of abstraction than the animation data itself. For example, in the soccer match broadcast application mentioned above, the state information comprises the current positions of the 22 players in the field and parameters specifying their current action (e.g., "running", "jumping", etc.). The use of higher level information has several advantages, in particular in a broadcast application where animation is streamed over the Internet to a large audience. The content of the state information as communicated over the Internet is very compact, thus requiring lower bandwidth than if the animation data itself were streamed. The animation is generated locally from a few parameters. In addition, the update rate of animation data points is lower because the state of the animation changes at a slower rate than the animation data itself. This contributes to further lowering bandwidth requirements. Furthermore, the invention provides enhanced possibilities for packet loss recovery or concealment and for network latency jitter masking. It is easy to interpolate or extrapolate between states and to implement dead reckoning concepts. User interaction with the animation is more easily programmable because of this higher level of abstraction. Another advantage is that multi-user interaction is feasible if clients are enabled to share state information. Still another advantage is the fact that clients are enabled to convert the state information into animation based on their individual processing power that might differ from client to client. The resources available at the client may be different per client or groups of clients.
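The interpolation, extrapolation and dead reckoning mentioned above can be illustrated as follows. This is a minimal sketch under stated assumptions: the `PlayerState` record, its 2D field coordinates and in particular its `velocity` field are hypothetical, since the text specifies only positions and action parameters as state information.

```python
from dataclasses import dataclass
from typing import Tuple

Vec2 = Tuple[float, float]


@dataclass
class PlayerState:
    """One received state update for a player (hypothetical layout)."""
    timestamp: float  # seconds
    position: Vec2    # position on the field
    velocity: Vec2    # assumed to be sent, or estimated from earlier states


def extrapolate(state: PlayerState, t: float) -> Vec2:
    """Dead reckoning: predict the position at time t from the last known state."""
    dt = t - state.timestamp
    return (state.position[0] + state.velocity[0] * dt,
            state.position[1] + state.velocity[1] * dt)


def interpolate(a: PlayerState, b: PlayerState, t: float) -> Vec2:
    """Linear interpolation between two received states, clamped to [a, b]."""
    span = b.timestamp - a.timestamp
    if span <= 0:
        return b.position
    alpha = min(max((t - a.timestamp) / span, 0.0), 1.0)
    return (a.position[0] + alpha * (b.position[0] - a.position[0]),
            a.position[1] + alpha * (b.position[1] - a.position[1]))


s0 = PlayerState(timestamp=0.0, position=(0.0, 0.0), velocity=(2.0, 1.0))
s1 = PlayerState(timestamp=1.0, position=(4.0, 2.0), velocity=(0.0, 0.0))

# If the update after s0 is lost or late, the client keeps animating by prediction.
print(extrapolate(s0, 0.5))
# Once s1 has arrived, frames between the two states are filled in locally.
print(interpolate(s0, s1, 0.25))
```

A client could use extrapolation to conceal a lost or delayed update and interpolation to render frames at its own rate between sparse state updates, in line with the advantages listed above.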
Within the context of the invention, reference is made to U.S. patent application Ser. No. 09/053,448 (PHA 23,383) of same Assignee, titled "Group-wise video conferencing uses 3D-graphics model of broadcast event" and incorporated herein by reference. This document addresses a TV broadcast service to multiple geographically distributed end users. The broadcast service is integrated with a conferencing mode. Upon a certain event in the broadcast, specific groups of end users are switched to a conference mode under software control so that the group is enabled to discuss the event. The conference mode is enhanced by a 3D graphics model of the video representation of the event that is downloaded to the groups. The end users are capable of interacting with the model to discuss alternatives to the event.