Most packet-based communication networks, especially Internet Protocol (IP) networks without guaranteed quality of service, suffer from a variable amount of packet losses or errors. Those losses can stem from many sources, for example router or transmission segment overload, or bit errors in packets that lead to their deletion. It should be understood that packet losses are a common operating point in most packet network architectures, and not a network failure. Media transmission, especially the transmission of compressed video, suffers greatly from packet losses.
Annoying artifacts in a media presentation resulting from errors in a media transmission can be avoided by many different means during the media coding process. However, adding redundancy bits during a media coding process is not possible for pre-coded content, and is normally less efficient than optimal protection mechanisms in the channel coding using forward error correction (FEC).
Forward Error Correction works by calculating a number of redundant bits over the to-be-protected bits in the various to-be-protected media packets, adding those bits to FEC packets, and transmitting both the media packets and the FEC packets. At the receiver, the FEC packets can be used to check the integrity of the media packets and to reconstruct media packets that may be missing. Henceforth, the media packets and the FEC packets protecting those media packets will together be called an FEC frame. Examples of the FEC frame are shown in FIG. 1. As shown in FIG. 1, a media GOP stream 300 comprises a media GOP 310 and a media GOP 320 separated by a boundary 315. The FEC structure 500 comprises an FEC frame 510 and an FEC frame 520 separated by a boundary 515. In addition to the media packets 514, the FEC frame 510 also contains an FEC packet 512 and two padding packets 516. Likewise, the FEC frame 520 contains an FEC packet in addition to the media packets 524. As such, the FEC frames 510, 520 are generally longer than the media GOPs and are therefore not aligned with them.
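The principle described above can be illustrated with a minimal sketch. It assumes the simplest possible code, a single byte-wise XOR parity packet over equally sized media packets; the text does not name a particular FEC code, and the function names `xor_parity` and `recover` are hypothetical.

```python
from typing import List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """Build one FEC packet as the byte-wise XOR of equally sized media packets."""
    size = len(packets[0])
    if any(len(p) != size for p in packets):
        raise ValueError("media packets must be padded to equal size first")
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover(received: List[Optional[bytes]], fec: bytes) -> List[bytes]:
    """Rebuild at most one missing media packet (marked None) from the FEC packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair only one loss")
    if not missing:
        return list(received)
    # XOR of the parity with all surviving packets yields the lost packet.
    rebuilt = bytearray(fec)
    for packet in received:
        if packet is not None:
            for i, byte in enumerate(packet):
                rebuilt[i] ^= byte
    repaired = list(received)
    repaired[missing[0]] = bytes(rebuilt)
    return repaired


media = [b"abcd", b"efgh", b"ijkl"]
fec_packet = xor_parity(media)
restored = recover([media[0], None, media[2]], fec_packet)
```

Note that the receiver needs every surviving packet of the FEC frame before it can repair the loss, and that the equal-size requirement is one reason an FEC frame may carry padding.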
Most FEC schemes intended for error protection allow the number of to-be-protected media packets and the number of FEC packets to be chosen adaptively, so as to select the strength of the protection and to meet the delay constraints of the FEC subsystem.
Packet-based FEC in the sense discussed above requires a synchronization of the receiver to the FEC frame structure, in order to take advantage of the FEC. That is, a receiver has to buffer all media and FEC packets of an FEC frame before error correction can commence.
Video coding schemes, and increasingly some audio coding schemes, use so-called predictive coding techniques. Such techniques predict the content of a later video picture or audio frame from previous pictures or audio frames, respectively. In the following, video pictures and audio frames will both be referred to as “pictures”, in order to distinguish them from FEC frames. By using predictive coding techniques, the compression scheme can be very efficient, but it also becomes increasingly vulnerable to errors the longer the prediction chain becomes. Hence, so-called key pictures, or the equivalent non-predictively coded audio frames, both referred to as key pictures hereinafter, are inserted from time to time. This technique re-establishes the integrity of the prediction chain by using only non-predictive coding techniques. It is not uncommon for a key picture to be 5 to 20 times bigger than a predictively coded picture. Each encoded picture may correspond, for example, to one to-be-protected media packet.
Following the conventions of MPEG-2 visual, the picture sequence starting with a key picture and followed by zero or more non-key pictures is henceforth called a Group of Pictures (GOP). In digital TV, a GOP normally consists of no more than six pictures. In streaming applications, however, GOP sizes are often chosen to be much bigger. Some GOPs can have hundreds of pictures in order to take advantage of the better coding efficiency of predictively coded pictures. For that reason, the “tune in” to such a sequence can take several seconds.
FEC schemes can be designed to be more efficient when FEC frames are big in size, for example, when they comprise some hundred packets. Similarly, most media coding schemes gain efficiency when choosing larger GOP sizes, since a GOP contains only one single key picture which is, statistically, much larger than the other pictures of the GOP. However, both large FEC frames and large GOPs require the receiver to synchronize to their respective structures. For FEC frames this implies buffering of the whole FEC frame as received, and correcting any correctable errors. For media GOPs this implies the parsing and discarding of those media packets that do not form the start of a GOP (the key picture).
In U.S. Patent Application Publication No. 2006/0107189 A1, it is stated that, in order to reduce a buffer delay at a decoding end, the FEC frames should be aligned with the groups of media packets. To that end, the encoder should be able to determine, for a group of coded media packets contained in an FEC frame, the number of next subsequent groups of coded media packets which fit completely into that FEC frame, and to select all coded media packets associated with the group or groups of coded media packets so determined for that FEC frame. For alignment purposes, it is possible to equalize the size of selected packets by adding predetermined data to some of them. Examples of aligned FEC frames and the groups of media packets are shown in FIG. 2. As shown in FIG. 2, a media GOP stream 400 comprises a media GOP 410 and a media GOP 420 separated by a boundary 415. The FEC structure 600 comprises an FEC frame 610 and an FEC frame 620 separated by a boundary 615. Although the FEC frames 610 and 620 also contain FEC packets in addition to the media packets, they can be aligned with the GOPs.
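The selection of whole groups for an FEC frame can be sketched as a greedy packing step. This is only an illustration under stated assumptions, not the method of the cited publication: GOP sizes are counted in media packets, each FEC frame has a fixed media packet capacity, and the function name `align_gops_to_fec_frames` and the parameter `max_media_packets` are hypothetical. Size equalization by padding is omitted.

```python
from typing import List


def align_gops_to_fec_frames(gop_sizes: List[int],
                             max_media_packets: int) -> List[List[int]]:
    """Assign whole GOPs (by index) to FEC frames so that no GOP is split.

    A frame is closed as soon as the next GOP no longer fits completely.
    """
    frames: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, size in enumerate(gop_sizes):
        if size > max_media_packets:
            # Assumption: a GOP larger than one FEC frame is treated as an error.
            raise ValueError(f"GOP {index} does not fit into one FEC frame")
        if current and used + size > max_media_packets:
            frames.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        frames.append(current)
    return frames


# GOPs of 4, 3, 5, 2 and 6 packets; at most 8 media packets per FEC frame.
layout = align_gops_to_fec_frames([4, 3, 5, 2, 6], 8)
# layout == [[0, 1], [2, 3], [4]]
```

Every FEC frame boundary in the resulting layout coincides with a GOP boundary, which is the alignment property shown in FIG. 2.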
FEC can be applied to rich media content. Rich media content generally refers to content that is graphically rich and contains compound (or multiple) media, including graphics, text, video and audio, preferably delivered through a single interface. Rich media dynamically changes over time and can respond to user interaction.
Streaming of rich media content is becoming more and more important for delivering visually rich content for real-time transport, especially within the Multimedia Broadcast/Multicast Services (MBMS) and Packet-switched Streaming Services (PSS) architectures in 3GPP. PSS provides a framework for Internet Protocol (IP) based streaming applications in 3G networks, especially over point-to-point bearers. MBMS streaming services facilitate resource-efficient delivery of popular real-time content to multiple receivers in a 3G mobile environment. Instead of using different point-to-point (PtP) bearers to deliver the same content to different mobiles, a single point-to-multipoint (PtM) bearer is used to deliver the same content to different mobiles in a given cell. The streamed content may consist of video, audio, XML (eXtensible Markup Language) content such as Scalable Vector Graphics (SVG), timed-text and other supported media. The content may be pre-recorded or generated from a live feed. SVG allows for three types of graphic objects: vector graphic shapes, images and text. Graphic objects can be grouped, transformed and composed from previously rendered objects. SVG content can be arranged in groups such that each of them can be processed and displayed independently from groups that are delivered later in time. Groups are also referred to as scenes.
Until recently, applications for mobile devices were text based with limited interactivity. However, as more wireless devices are coming equipped with color displays and more advanced graphics rendering libraries, consumers will demand a rich media experience from all their wireless applications. A real-time rich media content streaming service is imperative for mobile terminals, especially in the area of MBMS, PSS, and Multi-Media Streaming (MMS) services. Rich media applications particularly in the Web services domain include XML based content such as:
SVGT 1.2: a language for describing two-dimensional graphics in XML. SVG allows for three types of graphic objects: vector graphic shapes (e.g., paths consisting of straight lines and curves), multimedia (such as raster images and video), and text. SVG drawings can be interactive (using the DOM event model) and dynamic. Animations can be defined and triggered either declaratively (i.e., by embedding SVG animation elements in SVG content) or via scripting. Sophisticated applications of SVG are possible by use of a supplemental scripting language which accesses the SVG Micro Document Object Model (μDOM), which provides complete access to all elements, attributes and properties. A rich set of event handlers can be assigned to any SVG graphical object. Because of its compatibility with and leveraging of other Web standards (such as CDF), features like scripting can be done on XHTML (Extensible HyperText Markup Language) and SVG elements simultaneously within the same Web page.
SMIL 2.0: The Synchronized Multimedia Integration Language (SMIL) enables simple authoring of interactive audiovisual presentations. SMIL is typically used for “rich media”/multimedia presentations which integrate streaming audio and video with images, text or any other media type.
CDF: The Compound Document Formats (CDF) working group is producing recommendations on combining separate component languages (e.g. XML-based languages, elements and attributes from separate vocabularies), like XHTML, SVG, MathML, and SMIL, with a focus on user interface markups. When combining user interface markups, specific problems have to be resolved that are not addressed by the individual markup specifications, such as the propagation of events across markups, the combination of rendering, or the user interaction model with a combined document. The Compound Document Formats working group will address this type of problem. This work is divided into phases and covers two technical solutions: combining by reference and combining by inclusion.
In the current 3GPP DIMS (Dynamic Interactive Multimedia Scenes) activity, the streaming of DIMS content has been recognized as an important component of a dynamic rich media service for enabling real time, continuous realization of content at the client. A DIMS content stream typically consists of a series of RTP (Real-time Transport Protocol) packets whose payloads are SVG scenes, SVG scene update(s), and coded video and audio packets. These RTP packets are encapsulated by UDP (User Datagram Protocol)/IP headers and transmitted over the 3G networks. The packets may be lost due to transmission errors over the wireless links or buffer overflows at the intermediate routers of the 3G networks.
3GPP SA4 defined some media-independent packet loss recovery mechanisms at the transport layer and above in the MBMS and PSS frameworks. In MBMS, application layer FEC is used for packet loss recovery for both streaming and download services. In PSS, RTP layer retransmissions are used for packet loss recovery. For unicast download delivery, TCP (Transmission Control Protocol) takes care of the reliable delivery of the content.
For rich media based MBMS streaming services, it is very likely that the users tune in to the service at arbitrary instants during the streaming session. The clients start receiving the packets as soon as they tune in to the service and may have to wait for a certain time period before they can start decoding/rendering the received rich media content. This time period is also called “tune-in delay”. For a good user experience, it is desirable that the clients start rendering the content as soon as possible from the time they receive the content. Thus one requirement of DIMS is to allow for efficient and quick tune-in of DIMS clients to the broadcast/multicast streaming service. Quick tune-in can be enabled by media level solutions, transport level solutions or a combination of the two.
When streaming rich media (DIMS) content over broadcast/multicast channels of the 3G wireless networks, it is essential to protect the content from packet losses by using an application layer forward error correction (AL-FEC) mechanism. An AL-FEC algorithm is typically applied over a source block of media RTP packets to generate redundant FEC RTP packets. As mentioned earlier and illustrated in FIGS. 1 and 2, the media and the associated FEC packets are collectively referred to as an “FEC frame”. The FEC frame is transmitted over the lossy network. A receiver would be able to recover any lost media RTP packets if it receives a sufficient total number of media and FEC RTP packets from that FEC frame. Currently, the length of the above-mentioned source block is configurable. AL-FEC is more effective if large source blocks are used. On the other hand, the tune-in delay is directly proportional to the length of the source block.
In a typical rich media streaming session that involves SVG, audio and video media, at the sender side, source RTP packets of each medium are bundled together to form a source block for FEC protection. One or more FEC RTP packets are generated from this source block using an FEC encoding algorithm. The source RTP packets of different media along with the FEC RTP packets are transmitted as separate RTP streams, as shown in FIG. 3. As shown in FIG. 3, the DIMS RTP stream contains a plurality of FEC frames 6101, 6102 and 6103, for example. These FEC frames may contain the source blocks for different DIMS media or the same medium. The FEC frame 6101 comprises a source block 6141 of source RTP packets and an FEC RTP packet 6121. On the receiver side, the client buffers the received RTP packets (both source and FEC) for a sufficient duration and tries to reconstruct the above-mentioned source block. If any source RTP packets are missing, then it tries to recover them by applying the FEC decoding algorithm.
The length of the FEC source block is a critical factor in determining the tune-in delay. The client has to buffer for the duration of an entire FEC source block. If a client starts receiving data in the middle of the current FEC source block, then it may have to discard the data from the current source block, and wait to receive the next source block from the beginning to the end. On average, half of the current source block remains to be discarded at the moment of tune-in, and one full source block then has to be buffered. Hence, on average, the client has to wait for 1.5 times the FEC source block duration.
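The factor of 1.5 can be checked numerically with a short sketch. It assumes that tune-in instants are uniformly distributed over the source block and that the client always discards the remainder of the current block; the function name `mean_fec_wait` and its parameters are hypothetical.

```python
def mean_fec_wait(block_duration: float, samples: int = 1000) -> float:
    """Average wait over evenly spaced tune-in offsets within one source block.

    For a tune-in at `offset`, the client discards the remaining
    (block_duration - offset) and then buffers one complete source block.
    """
    total = 0.0
    for i in range(samples):
        offset = (i + 0.5) / samples * block_duration
        total += (block_duration - offset) + block_duration
    return total / samples


# For a 2-second source block the average wait is about 3 seconds (1.5 x 2).
average_wait = mean_fec_wait(2.0)
```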
After FEC decoding, the packets are sent to various media decoders at the receiver. The media decoders may not be able to decode from arbitrary points in the compressed media bit stream. If the FEC frames and the media GOPs are not aligned, then on average the decoder may have to discard one half of the current media GOP data. The average tune-in delay is then

Tune-in delay = 1.5*(FEC source block duration) + 0.5*(media GOP duration)    (1)

where FEC source block duration is the buffering delay of the FEC frame (in isochronous networks this is proportional to the size of the FEC frame), and media GOP duration is the buffering delay of the media GOP. The worst case buffer sizes have to be chosen such that a complete FEC frame and a complete GOP, respectively, fit into the buffer of an FEC decoder and the buffer of a media decoder, respectively.
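Equation (1) can be evaluated directly to compare configurations. The sketch below assumes that both durations are given in seconds; the function name `tune_in_delay` and the example values are hypothetical and chosen only for illustration.

```python
def tune_in_delay(fec_block_duration: float, gop_duration: float) -> float:
    """Average tune-in delay per equation (1), for unaligned FEC frames and GOPs."""
    return 1.5 * fec_block_duration + 0.5 * gop_duration


# Example: a 4-second FEC source block and a 2-second media GOP.
delay = tune_in_delay(4.0, 2.0)  # 1.5 * 4 + 0.5 * 2 = 7.0 seconds
```

Because the FEC term carries the larger coefficient, shortening the source block reduces the average delay more than shortening the GOP by the same amount, at the cost of less effective FEC protection.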