In a multimedia streaming service, there are three participants involved: a streaming server, a streaming client and a transmission channel or an underlying network. Usually it is the transmission channel that is the bottleneck of the service, both in terms of throughput and in terms of reliability (at least when no throughput bitrate guarantee is assumed), but throughput limitations can also occur at the client and/or at the server.
In a real-time streaming system, due to the dynamically changing throughput characteristics of the channel, client and server, the streaming delivery needs to be adaptive in order to maintain a real-time playback experience for the user. The server should adapt the transmission rate to the varying throughput of the system. An example of such a rate adaptation system can be found in Haskell et al. (U.S. Pat. No. 5,565,924, “Encoder/Decoder Buffer Control for Variable Channel”).
The streaming client provides receiver buffering for storing incoming data before passing them to the media decoder for playout. The receiver buffer is used to compensate for the difference between source encoding rate (also referred to as sampling rate) and transmission rate (pre-decoder buffering). It is also used to compensate for the packet transfer delay variation over the channel (jitter buffering). In general, these two functions are assumed to be combined in a single receiver buffer. However, they can also be implemented with two separate buffers in a receiver, although such an implementation is not optimum from a delay point of view. Receiver buffering can also smooth out the adaptation inaccuracies (i.e. if the system throughput is not matched exactly by the server output).
If the receiver buffer becomes empty (i.e. buffer underflow), which means that the decoder is running out of data to decode, the client needs to pause playout and re-buffer incoming data before resuming. On the other hand, if the incoming data rate is faster than the playout rate, then the receiver buffer space can be exhausted (i.e., buffer overflow), which can result in dropping packets from the buffer in order to make room for new incoming packets. When the packets are dropped, the video quality is degraded. To ensure a smooth and flawless playout, the receiver buffer of the client should be kept within a certain fullness range. In order to guarantee that the receiver buffer will not underflow or overflow, the bitrate for transmission and sampling at the server and that for reception and playout at the client must be adequately controlled.
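The underflow and overflow conditions described above can be illustrated with a minimal sketch. The function name `track_fullness`, the tick-based timing, and the packet-count units are all illustrative assumptions; a real client would track bytes and playout time rather than whole packets per tick.

```python
def track_fullness(arrivals, playout_per_tick, capacity, start=0):
    """Report underflow/overflow events for a simple receiver buffer model.

    arrivals: packets arriving at each tick (hypothetical input).
    playout_per_tick: packets the decoder consumes at each tick.
    capacity: maximum number of packets the buffer can hold.
    """
    fullness = start
    events = []
    for tick, arrived in enumerate(arrivals):
        fullness += arrived
        if fullness > capacity:
            # Overflow: the excess packets are dropped to make room.
            events.append((tick, "overflow", fullness - capacity))
            fullness = capacity
        if fullness < playout_per_tick:
            # Underflow: the decoder runs out of data and playout must pause.
            events.append((tick, "underflow", playout_per_tick - fullness))
            fullness = 0
        else:
            fullness -= playout_per_tick
    return events


events = track_fullness([5, 5, 0, 0, 0, 20], playout_per_tick=3, capacity=10)
```

In this toy run the arrival gap produces two underflow events and the final burst produces one overflow event, which is the pair of failure modes that rate adaptation is meant to prevent.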
3GPP rate adaptation signaling as defined in 3GPP TS 26.234 is based on feedback sent from the receiver to the sender in the form of an RTCP APP (Application-Defined Real Time Control Protocol) packet. This packet includes the sequence number (SN) of the oldest packet in the receiver buffer. This SN is referred to as OBSN (oldest buffered sequence number).
The signaling of the OBSN allows the sender to perform the necessary adaptation. Yet, if the decoding order and the display order are different, the sender may not be able to derive the status of the buffer and the purpose of the signaling would be defeated. With the PSS (Packet Switched Streaming Service) video codecs supported in Release 5, this is not a problem as their packet transmission order is equal to the decoding order.
In Release 6, H.26L (also known as H.264) will be added to the list of the PSS codecs. With H.26L, the transmission order and the decoding order could be different because of interleaved packetization at the payload level (as specified in the IETF H.26L payload format draft).
The same property also exists for the frame-interleaved transmission of many audio and speech codecs, such as AMR-NB, AMR-WB, AMR-WB+, AAC and AACPlus (for the latter, the interleaving method defined in RFC 3640 is used).
The problem is illustrated below, assuming that the server transmits a series of packets whose RTP sequence numbers are denoted x, x+1, x+2, x+3, . . . The DON (decoding order number) defined by the H.26L payload format maps these sequence numbers to a decoding order y. The decoding order y is defined as follows: if a packet has a decoding order y, it is the yth packet to be decoded. That is, when the current packet has a decoding order y, (y−1) packets have already been decoded by the time the current packet is given to the decoder. Although the y value is derived from the DON value, these two values are not always the same.
The following example illustrates the differences between the sequence numbers of packets and their corresponding decoding orders:
Sequence number    Decoding order
x                  y
x + 1              y + 1
x + 2              y + 2
x + 3              y + 3
x + 4              y + 100
x + 5              y + 101
x + 6              y + 4
x + 7              y + 5
. . .              . . .
x + 101            y + 99
x + 102            y + 102
In the above-given example, the decoding order is equal to the sequence number (SN) from packet x to packet x+3. However, the decoding order and the sequence number are not the same for packets x+4 and x+5. Packets x+4 and x+5, for example, may be two packets of a frame that will be decoded only in the future.
Let us now look at the evolution of the receiver buffer and assume that, at a certain time, the receiver has received packets x, x+1, x+2, x+3. In this situation, the oldest sequence number in the buffer (OBSN) is x, and the highest received sequence number (HSN) signaled in RTCP RR reports is x+3. As time progresses, the packet with SN x is decoded and the packet with SN x+4 is received. Accordingly, the client will signal to the server OBSN=x+1 (the new “oldest” sequence number in the buffer) and HSN=x+4 (the new “most recent” SN received).
Time progresses further to the point where, for example, x+1, x+2 and x+3 have been played and x+5, x+6 and x+7 have been received. At that point, the buffer holds x+4, x+5, x+6 and x+7. Accordingly, the client will signal to the server OBSN=x+4 and HSN=x+7. The problem arises around this time because the decoding order of the packets that follow x+5 (x+6, x+7, etc.) is lower than the decoding order of x+4, so those packets are decoded and removed from the buffer while x+4 remains in it. Accordingly, the current rate adaptation signaling OBSN will remain at x+4 until the packet x+102 is received, at which time the OBSN will be updated. The server will thus lose track of the receiver buffer status because OBSN is not updated according to the decoding and the removal of packets from the receiver buffer.
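The stalled OBSN can be reproduced with a small simulation. This is only a sketch under stated assumptions: sequence numbers and decoding orders are expressed as offsets from x and y, the rows elided in the example are assumed to follow the same offset-minus-two pattern as their neighbours, and the receiver is assumed to decode one packet and receive one packet per step. The names `decoding_order` and `simulate` are hypothetical.

```python
def decoding_order(k):
    """Decoding-order offset for sequence-number offset k, per the example.

    Assumption: the elided rows follow the pattern of the rows around them
    (x+6 -> y+4, ..., x+101 -> y+99).
    """
    if k in (4, 5):
        return k + 96          # x+4 -> y+100, x+5 -> y+101
    if 6 <= k <= 101:
        return k - 2
    return k                   # x .. x+3 and x+102 onwards


def simulate(total_packets=104, prefill=4):
    """Return (OBSN, HSN, packets_decoded) offsets recorded at each step."""
    buffer = set(range(prefill))   # sequence-number offsets held by the client
    next_sn = prefill              # next packet to arrive
    next_decode = 0                # decoding order the decoder needs next
    history = []
    while buffer:
        history.append((min(buffer), next_sn - 1, next_decode))
        # Decode the packet whose decoding order is next, if it has arrived.
        ready = [k for k in buffer if decoding_order(k) == next_decode]
        if ready:
            buffer.remove(ready[0])
            next_decode += 1
        # Receive the next packet in transmission order.
        if next_sn < total_packets:
            buffer.add(next_sn)
            next_sn += 1
    return history


history = simulate()
stalled = [entry for entry in history if entry[0] == 4]
```

Under these assumptions the reported OBSN stays at x+4 for the whole span in which the decoder consumes the packets with decoding orders y+4 through y+99, so the feedback says nothing about how much data has left the buffer during that time.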
For AMR-NB and AMR-WB, RFC 3267 defines how interleaving can be used. For AMR-WB+, the same interleaving rules defined for AMR-WB apply. There are two relevant parameters signaled inside the payload headers: ILL and ILP. Moreover, the number of frames per AMR packet is fixed to a certain number, denoted N. These three values deterministically define the order of the frames carried in the RTP payload of an AMR RTP packet.
It can be seen that, unlike H.26L, AMR has no explicitly signaled DON, since each frame has a deterministic decoding order based on the ILL, ILP and N values signaled in the RTP payload header. For AMR, the client and server derive the equivalent of the DON from the first RTP sequence number signaled in the RTSP PLAY response together with the (ILL, ILP, N) triplet. The problem described above for the H.26L case also applies to interleaved streaming of AMR-NB, AMR-WB and AMR-WB+.
In sum, the prior art method of rate adaptation signaling is based on the oldest packet currently in the receiver playout buffer, allowing the sender to estimate both the number of bytes in the receiver buffer and the duration of the playout buffer. This information is used by the sender to perform adaptation so as to avoid receiver underflow (playout interruption) or receiver overflow (packet loss). However, because the decoding order and the transmission order are not the same in some cases, the sender may lose track of the receiver buffer.
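The sender-side estimate summarized above can be sketched as follows. The sketch assumes that the sender keeps a record of the size of each packet it has sent, keyed by RTP sequence number, and that every packet between OBSN and HSN is still in the receiver buffer, which holds only when transmission order equals decoding order. The names `seq_distance`, `estimate_buffer` and `sent_sizes` are hypothetical.

```python
SEQ_MODULO = 65536  # RTP sequence numbers are 16 bits wide and wrap around


def seq_distance(newer, older):
    """Number of sequence-number steps from older to newer, with wraparound."""
    return (newer - older) % SEQ_MODULO


def estimate_buffer(sent_sizes, obsn, hsn):
    """Estimate packet count and byte count held by the receiver.

    sent_sizes: mapping of sequence number -> payload size in bytes,
                kept by the sender (an assumed bookkeeping structure).
    """
    count = seq_distance(hsn, obsn) + 1
    total_bytes = sum(sent_sizes[(obsn + i) % SEQ_MODULO] for i in range(count))
    return count, total_bytes


sizes = {65534: 100, 65535: 200, 0: 300, 1: 400}
packets, size_in_bytes = estimate_buffer(sizes, obsn=65535, hsn=1)
```

When packets are interleaved, the range from OBSN to HSN may include packets that have already been decoded and removed, so an estimate of this kind overstates the buffer contents.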
A typical RTP packet is shown in FIG. 1. The RTP packet includes a multi-time aggregation packet of type MTAP16 and two multi-time aggregation units. The RTP header in the first row of the packet is shown in FIG. 2. As shown in FIG. 2, the sequence number (SN) of the packet is carried in the first row of the RTP header. As shown in FIG. 1, the aggregation packet aggregates multiple Network Abstraction Layer (NAL) units into a single RTP payload. In particular, in MTAP16s, the NAL unit payload starts with a 16-bit unsigned decoding order number (DON) base, or DONB (see second row of the packet). DONB contains the value of DON of the first NAL unit, so that the value of DON of every other NAL unit can be expressed as DOND, that is, the difference between the value of DON of that NAL unit and DONB.
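The relation between DONB, DOND and DON can be written as a one-line computation. This is a minimal sketch: the function name `don_values` is hypothetical, and the wraparound at 2^16 is assumed here from the 16-bit width of the DON field; the payload format itself should be consulted for the normative rule.

```python
DON_MODULO = 65536  # assumed from the 16-bit width of the DON field


def don_values(donb, donds):
    """Recover the DON of each aggregated NAL unit from DONB and its DOND."""
    return [(donb + dond) % DON_MODULO for dond in donds]


# Two aggregation units, as in the packet of FIG. 1, with illustrative values.
dons = don_values(donb=65535, donds=[0, 2])
```

With these illustrative values the second NAL unit wraps past the end of the 16-bit range, which is why any comparison of DON values has to be done with modular arithmetic.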
The RTP payload format for H.264 codec can be found in the IETF Audio Visual Transport Working Group Internet Draft draft-ietf-avt-rtp-h264-05 (April 2004).