The quality of transmission on most packet networks, for example IP networks, suffers from erasures. Erasures can have many different causes, such as link errors, transmission segment or router overload, environmental factors (especially when a wireless link segment is involved) and so forth. It should be noted that transmission errors cannot be seen as a failure of a network element; instead, they are a normal operating condition, and network protocols and elements need to be designed to cope with such conditions. To simplify later discussions relating to the present invention, two forms of erasures are defined. An impulse erasure is a loss of a single packet (both the packet before and the packet after the packet in question, in transmission order, are received successfully). Burst erasures, in contrast, encompass at least two adjacent packets in transmission order.
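By way of illustration only, the two erasure forms defined above can be told apart by scanning a reception record in transmission order. The following is a minimal sketch; the function name classify_erasures and the boolean reception record are assumptions made for this example and are not part of any protocol. A single loss at the very start or end of the record is counted as an impulse erasure here, although its missing neighbour cannot be checked.

```python
from typing import List, Tuple


def classify_erasures(received: List[bool]) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Split lost packets into impulse erasures and burst erasures.

    received[i] is True when packet i (in transmission order) arrived.
    Returns (impulses, bursts): impulses holds the indices of isolated
    losses, bursts holds (first_index, length) for runs of two or more
    adjacent losses.
    """
    impulses: List[int] = []
    bursts: List[Tuple[int, int]] = []
    i = 0
    while i < len(received):
        if received[i]:
            i += 1
            continue
        start = i
        # Walk to the end of this run of consecutive losses.
        while i < len(received) and not received[i]:
            i += 1
        run = i - start
        if run == 1:
            impulses.append(start)      # neighbours (where present) were received
        else:
            bursts.append((start, run))  # at least two adjacent packets lost
    return impulses, bursts


impulses, bursts = classify_erasures([True, False, True, False, False, False, True])
print(impulses, bursts)
```

In this record, packet 1 is lost on its own and packets 3 to 5 are lost together, so the first is reported as an impulse erasure and the second as one burst erasure of length three.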
To combat the negative impact on the perceived quality of reproduced media data that was subject to erasures, many different schemes are known. Feedback based retransmission uses reports about lost packets or lost media entities (for example lost coded pictures), to trigger some form of reaction by the sender in a closed loop. Source coding based media tools, such as Intra picture or Intra macroblock refresh in coded video, make the media itself more robust. Forward channel coding techniques, such as forward error correction or redundancy coding, improve the packet reception rate at the receiver in a media-unaware fashion. All these mechanisms add a certain amount of additional bit rate and a certain amount of latency. As a general rule, in terms of bit rate efficiency, feedback based tools are better than forward channel coding techniques, and these are better than source coding based techniques. However, in terms of delay, the ranking is just the opposite: source coding based techniques add the lowest additional delay (sometimes zero), channel coding techniques add somewhat higher delay, and feedback based techniques typically add very high delay.
For many media compression schemes, one can assign a category of importance to individual bit strings of the coded media, henceforth called priority. In coded video, for example, non-predictively coded information (Intra pictures) has a higher priority than predictively coded information (Inter pictures). Of the Inter pictures, those which are used for the prediction of other Inter pictures (reference pictures) have a higher priority than those which are not used for future prediction (non-reference pictures). Some audio coding schemes require the presence of codebook information before the playback of the content can start, and here the packets carrying the codebook have a higher priority than the content packets. When using MIDI, instrument definitions have a higher priority than the actual real-time MIDI stream. A person skilled in the art should easily be able to identify different priorities in media coding schemes based on the examples presented.
Priority can also be established based on “soft” criteria. For example, when a media stream encompasses audio and video packets, one can, in most practical cases, assume that the audio information is, from the point of view of user perception, of higher importance than the video information. Hence, the audio information carries a higher priority than the video information. Based on the needs of an application, a person skilled in the art should be able to assign priorities to different media types that are transported in a single media stream.
The loss of packets carrying predictively coded media normally has a negative impact on the reproduced quality. Missing data not only leads to annoying artifacts for the media frame the packet belongs to, but the error also propagates to future frames due to the predictive nature of the coding process. Most of the media compression schemes mentioned above implement a concept of independent decoder refresh information (IDR). IDR information has, by its very nature, the highest priority of all media bit strings. Independent decoder refresh information is defined as information that completely resets the decoder to a known state. In older video compression standards, such as ITU-T H.261, an IDR picture is identical to an Intra picture. Modern video compression standards, such as ITU-T H.264, contain reference picture selection. In order to break all prediction mechanisms and reset the reference picture selection mechanism to a known state, those standards include a special picture type called IDR picture. For the mentioned audio and MIDI examples, an IDR consists of all codebook/instrument information necessary for the future decoding. An IDR period is defined herein to contain media samples from an IDR sample (inclusive) to the next IDR sample (exclusive), in decoding order. No coded frame following an IDR frame can reference a frame prior to the IDR frame.
A sequence of coded pictures from an IDR picture, inclusive, to the next IDR picture, exclusive, in decoding order, is henceforth called a Group of pictures (GOP) in this application. Pictures can be either reference pictures or non-reference pictures. It is also possible to encode video streams such that they contain so-called sub-sequences, which have a hierarchical dependency structure.
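By way of illustration only, the GOP definition above (from an IDR picture, inclusive, to the next IDR picture, exclusive, in decoding order) can be expressed as a simple partitioning step. The sketch below assumes that each picture is represented by a text label and that the label "IDR" marks an IDR picture; the labels "ref" and "nonref" and the function name split_into_gops are hypothetical. Pictures that precede the first IDR picture, as may happen when reception starts in mid-stream, are collected in a leading partial group.

```python
from typing import List


def split_into_gops(picture_types: List[str]) -> List[List[str]]:
    """Partition pictures, given in decoding order, into GOPs.

    A new group starts at every picture labelled "IDR". Pictures seen
    before the first IDR picture form a leading partial group.
    """
    gops: List[List[str]] = []
    for picture in picture_types:
        if picture == "IDR" or not gops:
            gops.append([])          # open a new group
        gops[-1].append(picture)     # the IDR picture belongs to the group it opens
    return gops


stream = ["IDR", "ref", "nonref", "ref", "IDR", "ref", "nonref"]
for gop in split_into_gops(stream):
    print(gop)
```

Each returned group other than a leading partial group begins with an IDR picture, so that, per the definition above, no picture in the group references a picture of an earlier group.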
For example, packet-based Forward Error Correction (FEC) can be used to combat transmission errors. In order to allow FEC to effectively protect packets against erasures, a so-called “Matrix” or FEC block approach is commonly used. Two examples of FEC computations in a FEC matrix are illustrated in FIGS. 2a and 2b. The examples are described in greater detail later in this application, but a brief introduction is given here to introduce the field of invention. Media payloads are placed into the FEC matrix in a deterministic way. For example, in FIG. 2a each row corresponds to one transport packet, and transport packets are padded with stuffing bits to make their sizes equal. The media data in the FEC matrix is also referred to as the source block. FEC coding uses a certain scan order over the media data. For example, in FIG. 2a, media data is scanned column by column, and for each column a certain number of FEC repair symbols is created. The repair symbols are placed into the FEC matrix and packetized into transport packets in a pre-defined manner.
Video Compression
In hybrid video coders, an input video picture frame is divided, for processing purposes, into blocks of, for example, 16×16 pixels (pels), called macro-blocks. Each macro-block comprises, for example, blocks carrying processed sample values of one of three components: one luminance component Y, and two chrominance components Cb and Cr. One or more macro-blocks can be combined to form a slice. The concept of slicing was developed to enable encoders to fit video data into the Maximum Transmission Unit (MTU) of a transmission channel. The use of slices breaks the in-frame prediction commonly used in hybrid video coders.
Reduction of temporal redundancies in video is achieved by predicting the current to-be-coded frame from previous or future picture frames. A frame used for this kind of prediction is called a reference frame. Some of the coded frames in a sequence can, as a matter of the encoder's choice, not be used for prediction. These frames are called non-reference frames. Slices belonging to non-reference frames are called non-reference slices, and slices belonging to reference frames are called reference slices henceforth. FIG. 1 illustrates the Reference and Non-Reference Pictures in a simplified manner. In previous research, the use of non-reference frames has been shown to improve compression efficiency, as well as to provide a mechanism of temporal scalability.
In modern video compression standards, more than one reference frame can be used to predict macroblocks of the slice to be coded. It has been shown that the use of more than one reference frame can improve the compression efficiency of the codec and also make the coded video more robust to errors.
Wireless Networks
Due to the huge popularity of, and the growing demand for, IP based services, most current wireless data networks are migrating from circuit switched networks to packet switched networks. This allows the wireless networks to provide most or all of the services available on the Internet. Moving towards this goal, new protocol architectures like GPRS and UMTS have been standardized, or are in the process of standardization.
The 3rd Generation Partnership Project (3GPP) produces a complete set of globally applicable technical specifications and reports for 3rd generation systems based on the evolved Global System for Mobile Communication (GSM) core networks and the Universal Terrestrial Radio Access (UTRA) networks. Packet based air interfaces like CDMA2000, EDGE, and WCDMA are the result of the standardization efforts of 3GPP/3GPP2.
In the following, some problems of previous solutions are briefly discussed.
Stream Synchronization and Initial Buffering Delay
FIG. 6 presents an example showing a part of an audio-video stream in Multimedia Broadcast/Multicast Service (MBMS) streaming delivery. Decoding and transmission order within a stream goes from left to right. It is further assumed that the presentation order of media packets 805, 806 is the same as their decoding order and that the location of the media samples in the streams 801, 802 depicted in FIG. 6 also indicates the approximate presentation time. The media packets of the streams 801, 802 are divided into FEC blocks 803, 804. The FEC blocks 803, 804 comprise media packets 805, 806 and repair packets 807, 808.
To maximize the probability of correct reception of media samples a and c, the receiver should delay the decoding of the corresponding FEC block 803, 804 until all the repair packets 807 of the FEC block 803, 804 are received. Similarly, to maximize the probability of correct reception of media sample d, the receiver should delay the decoding of the corresponding FEC block until all the repair packets of the FEC block are received.
Audio media frame c is supposed to be played out simultaneously with video picture d. Therefore, media decoding and rendering of the corresponding audio FEC block 803 must actually be delayed until the video FEC block 804 containing sample d is completely received.
The initial buffering delay before media decoding and rendering is derived as follows: the maximum difference of reception time of the last packet of FEC block B and the first packet of FEC block A is calculated for any such pair of FEC blocks (A,B) in streams 1 and 2 respectively, in which the smallest RTP timestamp in FEC block A is within the range of RTP timestamps of FEC block B, and the last packet of FEC block A is received later than the last packet of FEC block B.
For two streams in an MBMS streaming session, the additional initial buffering delay is the sum of the maximum differences between streams 1 and 2 and between streams 2 and 1.
Tune-in Delay
A receiver in multicast/broadcast may not start reception from the first packet of a FEC block. If packets are transmitted in decoding order and if each packet is predictively coded (e.g. P pictures in video coding), then decoding of media data can only start once the synchronization to the FEC block structure is achieved. In addition, to produce correct output samples, the decoding process of the media decoder has to be reset, e.g. with an IDR picture of H.264/AVC. The tune-in delay into the middle of a broadcast/multicast therefore consists of the following parts:
First, there is the delay until the first packet of a FEC block is received. After that, it takes some time to receive a complete FEC block (reception duration). The size variation of FEC blocks also needs to be compensated for, as does the synchronization between the streams of the MBMS streaming session. Finally, the tune-in delay is also affected by the delay until a media decoder is refreshed to produce correct output samples.
Unequal Protection
Predictively coded media, and especially predictively coded video, is notorious for not being gracefully degradable. That is, when the channel conditions (as perceived by the video decoder) deteriorate, the quality of video is reduced remarkably. In contrast to this, an analog TV picture just gets noisier but is still usable, though annoying. It would be more desirable to have a perfect-quality picture in good channel conditions and an acceptable-quality picture in bad channel conditions.
Many methods for graceful degradation in multicast/broadcast streaming are based on scalable video coding. In H.264/AVC, both non-reference pictures and sub-sequences can be used to achieve temporal scalability while compression efficiency improves or stays unchanged compared to non-scalable coding. When a non-reference picture or a sub-sequence in layer 1 or above is lost or corrupted, the impact of the error does not propagate in time. Therefore, at least some graceful degradation can be achieved when the “base layer” is protected such that it is always received and the “enhancement layer” (the layer 1 or above) is protected such that it is received when the channel conditions are sufficiently good.
The computation of the FEC as discussed above is normally performed using a so-called FEC matrix. Two examples of FEC computations in a FEC matrix are illustrated in FIGS. 2a and 2b. In FIG. 2a, each row corresponds to one transport packet. FEC coding is performed vertically. Transport packets are padded with stuffing bits to make their sizes equal. The stuffing bits are removed after the parity packets are generated, and so the stuffing bits are not transmitted. Each source packet is protected by an (n, k) code, where n is the total number of transport packets (including the FEC packets), and k is the number of media packets that the FEC code protects. This method of FEC parity packet generation is described in detail in Adam Li, “An RTP Payload Format for Generic FEC”, Work in Progress, draft-ietf-avt-ulp-10.txt, July 2004. In FIG. 2b, one row corresponds to one source transport packet, one column corresponds to one modified transport packet, and FEC coding is performed horizontally. The source bits and parity bits for every source packet are distributed into many different modified transport packets, respectively. M. Wagner, J. Pandel, W. Weng, “An RTP Payload Format for Erasure-Resilient Transmission of Progressive Multimedia Streams”, Work in Progress, draft-ietf-avt-uxp-07.txt, October 2004, uses a similar kind of FEC protection mechanism.
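By way of illustration only, the vertical computation of FIG. 2a can be sketched for the simplest case of a single exclusive-or parity packet, i.e. an (n, k) code with n = k + 1 that can repair one erasure per FEC block. The function names xor_parity and recover_missing are assumptions made for this example, and the length of the lost packet is passed in directly, whereas a real payload format would have to convey it. Zero-valued stuffing is assumed, so the padding never needs to be stored or transmitted.

```python
from typing import List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """XOR the packets column by column, as if zero-padded to equal size."""
    width = max(len(p) for p in packets)
    parity = bytearray(width)
    for packet in packets:
        # Padding bytes are zero and leave the parity unchanged,
        # so only the real bytes of each packet are visited.
        for col, byte in enumerate(packet):
            parity[col] ^= byte
    return bytes(parity)


def recover_missing(received: List[Optional[bytes]], parity: bytes,
                    missing_len: int) -> bytes:
    """Rebuild the one missing packet (marked None) of a FEC block."""
    if sum(p is None for p in received) != 1:
        raise ValueError("a single parity packet repairs exactly one erasure")
    present = [p for p in received if p is not None]
    # XOR of the parity with every received packet leaves the lost packet.
    rebuilt = xor_parity(present + [parity])
    return rebuilt[:missing_len]  # strip the stuffing again


source = [b"abc", b"de", b"fghij"]
parity = xor_parity(source)
print(recover_missing([source[0], None, source[2]], parity, missing_len=2))
```

Codes that create more than one repair symbol per column, such as Reed-Solomon codes, follow the same matrix layout but tolerate more erasures per block.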
RFC2733 allows selective FEC, i.e. FEC packets include a bit-mask that signals the media packets over which the FEC is calculated. The mask field is 24 bits. If bit i in the mask is set to 1, then the media packet with sequence number N+i is associated with this FEC packet, where N is the sequence number (SN) Base field in the FEC packet header. The least significant bit corresponds to i=0, and the most significant to i=23. The SN base field is set to the minimum sequence number of those media packets protected by FEC. This allows the FEC operation to extend over any string of at most 24 packets.
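By way of illustration only, the relationship between the SN base field, the 24-bit mask and the protected media packets described above can be sketched as follows. The function names build_mask and protected_packets are assumptions made for this example, and the sketch ignores the wrap-around of 16-bit RTP sequence numbers.

```python
from typing import List, Tuple

MASK_BITS = 24  # width of the mask field described above


def build_mask(protected_sns: List[int]) -> Tuple[int, int]:
    """Return (sn_base, mask) for the given media sequence numbers."""
    if not protected_sns:
        raise ValueError("at least one media packet must be protected")
    sn_base = min(protected_sns)      # SN base is the minimum sequence number
    mask = 0
    for sn in protected_sns:
        i = sn - sn_base
        if i >= MASK_BITS:
            raise ValueError("protected packets span more than 24 sequence numbers")
        mask |= 1 << i                # bit i (LSB is i=0) marks packet sn_base + i
    return sn_base, mask


def protected_packets(sn_base: int, mask: int) -> List[int]:
    """List the sequence numbers signalled by a given SN base and mask."""
    return [sn_base + i for i in range(MASK_BITS) if mask & (1 << i)]


sn_base, mask = build_mask([100, 102, 105])
print(sn_base, format(mask, "024b"))
print(protected_packets(sn_base, mask))
```

Because bit 0 always corresponds to the SN base itself, any selection of media packets within a window of 24 consecutive sequence numbers can be signalled.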
The publication Adam Li, “An RTP Payload Format for Generic FEC”, Work in Progress, draft-ietf-avt-ulp-10.txt, July 2004, like RFC2733, specifies a payload format for generic Forward Error Correction (FEC) for media data encapsulated in RTP. It is also based on the exclusive-or (parity) operation, but builds on RFC2733 with a generalized algorithm that includes Uneven Level Protection (ULP). The payload format described in this draft allows end systems to apply protection using arbitrary protection lengths and levels, in addition to using arbitrary protection group sizes. It enables complete recovery or partial recovery of the critical payload and RTP header fields depending on the packet loss situation. Uneven levels of protection can be applied to different parts of packets, i.e. the best protection for the first A bytes of each media packet, a weaker level of protection for the next B bytes in the packets, and no protection for the remaining bytes in the packets. This scheme is completely backward compatible with non-FEC capable hosts. Those receivers that do not use FEC can simply ignore the protection data.
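By way of illustration only, the idea of protecting the first A bytes more strongly than the next B bytes can be sketched with exclusive-or parities computed over byte ranges. This is not the wire format of the cited draft. The function names range_parity and uneven_parities are hypothetical, and the choice of expressing stronger protection as a smaller protection group is an assumption made for this example. Bytes beyond A+B receive no parity at all.

```python
from typing import List, Tuple


def range_parity(packets: List[bytes], start: int, end: int) -> bytes:
    """XOR bytes [start, end) of each packet; short packets count as zero-padded."""
    parity = bytearray(end - start)
    for packet in packets:
        for col, byte in enumerate(packet[start:end]):
            parity[col] ^= byte
    return bytes(parity)


def uneven_parities(packets: List[bytes], a: int, b: int,
                    strong_group: int, weak_group: int
                    ) -> Tuple[List[bytes], List[bytes]]:
    """Compute parities for two protection levels.

    The first `a` bytes get one parity per `strong_group` packets, and the
    next `b` bytes get one parity per `weak_group` packets. A smaller group
    means each parity has fewer packets to cover, hence stronger protection.
    """
    strong = [range_parity(packets[i:i + strong_group], 0, a)
              for i in range(0, len(packets), strong_group)]
    weak = [range_parity(packets[i:i + weak_group], a, a + b)
            for i in range(0, len(packets), weak_group)]
    return strong, weak


media = [bytes([1, 2, 3, 4, 5, 6]), bytes([7, 8, 9, 10, 11, 12]),
         bytes([13, 14, 15, 16]), bytes([17, 18, 19, 20, 21, 22])]
strong, weak = uneven_parities(media, a=2, b=2, strong_group=2, weak_group=4)
print(len(strong), len(weak))
```

With these example parameters, the first two bytes of every packet can be repaired after one loss per pair of packets, whereas the next two bytes can only be repaired after one loss among all four packets.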
The publication M. Wagner, J. Pandel, W. Weng, “An RTP Payload Format for Erasure-Resilient Transmission of Progressive Multimedia Streams”, Work in Progress, draft-ietf-avt-uxp-07.txt, October 2004, uses Reed-Solomon codes together with an appropriate interleaving scheme for adding redundancy, but allows for finer granularity in the structure of the progressive media stream. It provides mechanisms typical for mobile channels, where long message blocks like IP packets are split up into segments of desired lengths, which can be multiplexed onto link layer packets of fixed size. It uses a matrix structure of L by N where L is the number of rows of N octets. The incoming RTP packet data bytes are filled such that more important data, usually at the beginning of the RTP packet, occupy fewer columns in the matrix and the less significant data occupy more columns in the matrix. RS parity code words are then computed across each of the N columns, each row then forming a valid code word of the chosen RS code.
Both of the above-mentioned documents appear to address the same problem, but with some essential differences in the methodology used. The main difference between the two approaches is that while ULP preserves the structure of the packets which have to be protected and provides the redundancy in extra packets, the unequal erasure protection (UXP) scheme described in draft-ietf-avt-uxp-07.txt interleaves the info stream which has to be protected, inserts the redundancy information, and thus creates a totally new packet structure.
Another difference concerns multicast compatibility: It cannot be assumed that all future terminals will be able to apply UXP/ULP. Therefore, backward compatibility could be an issue in some cases. Since ULP does not change the original packet structure, but only adds some extra packets, it is possible for terminals which do not support ULP to discard the extra packets. In case of UXP, however, two separate streams with and without erasure protection have to be sent, which increases the overall data rate.
When IP multicast is used, each receiver can select the number of multicast groups it wants to receive. Multicast groups are ordered such that the first multicast group provides a basic-quality decoded stream and each subsequent multicast group, in numbered order, enhances the quality. It is known that layers of scalable coded media can be streamed on different IP multicast groups and that a multicast group may contain FEC to improve the quality of data in that multicast group or any “lower” multicast group. More information on this issue can be found e.g. in Philip Chou et al., U.S. Pat. No. 6,594,798.