In conversational packet-switched multimedia systems, e.g., in IP-based video conferencing systems, different types of media are normally carried in separate packets. Moreover, packets are typically carried on top of a best-effort network protocol that cannot guarantee a constant transmission delay, but rather the delay may vary from packet to packet. Consequently, packets having the same presentation (playback) time-stamp may not be received at the same time, and the reception interval of two packets may not be the same as their presentation interval (in terms of time). Thus, in order to maintain playback synchronization between different media types and to maintain the correct playback rate, a multimedia terminal typically buffers received data for a short period (e.g. less than half a second) in order to smooth out delay variation. Herein, this type of buffer is referred to as a delay jitter buffer. In conversational packet-switched multimedia systems buffering can take place before and/or after media data decoding.
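The role of a delay jitter buffer described above can be illustrated with a minimal sketch. The names `Packet`, `playout_schedule` and `buffer_delay_ms`, as well as the packet timings, are hypothetical and chosen only for illustration; the sketch assumes that the first packet is held for a fixed period and that all later packets are scheduled from their presentation time-stamps.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    timestamp_ms: int  # presentation time-stamp assigned by the sender
    arrival_ms: int    # local clock time at which the packet was received


def playout_schedule(packets, buffer_delay_ms):
    """Return (timestamp, playout time, arrived in time) for each packet.

    The first packet is held for buffer_delay_ms. Every later packet is
    scheduled relative to it using the presentation time-stamps, so the
    playback interval equals the presentation interval even though the
    reception interval varies.
    """
    if not packets:
        return []
    first = packets[0]
    # Offset that maps presentation time-stamps onto the local clock.
    base = first.arrival_ms + buffer_delay_ms - first.timestamp_ms
    schedule = []
    for p in packets:
        playout = base + p.timestamp_ms
        schedule.append((p.timestamp_ms, playout, p.arrival_ms <= playout))
    return schedule


# Assumed example: packets stamped 20 ms apart arrive with varying delay.
packets = [Packet(0, 100), Packet(20, 135), Packet(40, 142), Packet(60, 230)]
for row in playout_schedule(packets, buffer_delay_ms=50):
    print(row)
```

With a buffering period of 50 ms the first three packets arrive before their scheduled playout times, while the fourth arrives too late; a longer buffering period would absorb that delay at the cost of added latency.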
Delay jitter buffering is also applied in streaming systems. Due to the fact that streaming is a non-conversational application, the delay jitter buffer required may be considerably larger than in conversational applications. When a streaming player has established a connection to a server and requested a multimedia stream to be downloaded, the server begins to transmit the desired stream. The player typically does not start playing the stream back immediately, but rather it buffers the incoming data for a certain period, typically a few seconds. Herein, this type of buffering is referred to as initial buffering. Initial buffering provides the ability to smooth out transmission delay variations in a manner similar to that provided by delay jitter buffering in conversational applications. In addition, it may enable the use of link, transport, and/or application layer retransmissions of lost protocol data units (PDUs). Buffering allows the player to decode and play data from the buffer while allowing the possibility for lost PDUs to be retransmitted. If the buffering period is sufficiently long the retransmitted PDUs are received in time to be decoded and played at the scheduled moment.
Initial buffering in streaming clients provides a further advantage that cannot be achieved in conversational systems; it allows the data rate of the media transmitted from the server to vary. In other words, media packets can be temporarily transmitted faster or slower than their playback rate, as long as the receiver buffer does not overflow or underflow. The fluctuation in the data rate may originate from two sources. The first source of fluctuation is due to the fact that the compression efficiency achievable in some media types, such as video, depends on the contents of the source data. Consequently, if a stable quality is desired, the bit-rate of the resulting compressed bit-stream varies. Typically, a stable audio-visual quality is subjectively more pleasing than a varying quality. Thus, initial buffering enables a more pleasing audio-visual quality to be achieved compared with a system without initial buffering, such as a video conferencing system.
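The condition that the receiver buffer must neither overflow nor underflow can be sketched as a simple occupancy simulation. The function name `simulate_buffer`, its parameters and the numeric values are assumptions made for this illustration only; time is divided into equal ticks and playback is assumed to consume a fixed amount of data per tick.

```python
def simulate_buffer(arrivals_bits, playback_bits_per_tick, initial_ticks, capacity_bits):
    """Track receiver buffer occupancy tick by tick.

    arrivals_bits[i] is the amount of data received during tick i.
    Playback starts only after initial_ticks ticks of initial buffering
    and then drains playback_bits_per_tick in every tick.
    Returns (occupancy history, status), where status is "ok",
    "underflow" or "overflow".
    """
    level = 0
    history = []
    for tick, received in enumerate(arrivals_bits):
        level += received
        if level > capacity_bits:
            return history + [level], "overflow"
        if tick >= initial_ticks:
            if level < playback_bits_per_tick:
                return history + [level], "underflow"
            level -= playback_bits_per_tick
        history.append(level)
    return history, "ok"


# Assumed example: the stream is temporarily sent slower, then faster,
# than the playback rate of 100 bits per tick.
arrivals = [100, 100, 40, 40, 160, 160]
print(simulate_buffer(arrivals, 100, initial_ticks=2, capacity_bits=400))
print(simulate_buffer(arrivals, 100, initial_ticks=0, capacity_bits=400))
```

In this example two ticks of initial buffering are enough to ride out the slow period, whereas starting playback immediately leads to an underflow in the third tick.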
Considering the example of video data in more detail, different frames of a video sequence may be represented by very different amounts of data. This results from the use of predictive encoding techniques. Typically, video encoding standards define at least two types of frame. The principal frame types are INTRA or I-frames and INTER or P-frames. An INTRA frame is encoded on the basis of information contained within the image itself, while an INTER frame is encoded with reference to at least one other frame, usually a frame occurring earlier in the video sequence. Due to the significant temporal redundancy between successive frames of a digital video sequence, it is possible to encode an INTER frame with a significantly smaller amount of data than that required to represent an INTRA frame. Thus, INTRA frames are used comparatively infrequently in an encoded video sequence.
Typically, an encoded sequence starts with an INTRA frame (as there is no previous frame available to be used as a reference in the construction of an INTER frame). INTRA frames may be inserted into the sequence periodically, e.g. at regular intervals, in order to limit errors that may otherwise accumulate and propagate through a succession of predicted (INTER) frames. INTRA frames are also commonly used at scene cuts, where the image content of consecutive frames changes so much that predictive coding does not provide effective data reduction. Thus, a typical encoded video stream generally starts with an INTRA coded frame and comprises a sequence of INTER frames interspersed with occasional INTRA frames, the amount of data required to represent an INTRA frame being several (e.g. 5-10) times greater than that required to represent an INTER coded frame. The amount of data required to represent each INTER frame also varies according to the level of similarity/difference with its reference frame and the amount of detail in the image.
This means that the information required to reconstruct a predictively encoded video sequence is not equally distributed amongst the transmitted data packets. In other words, a larger number of data packets is required to carry the data related to an INTRA frame than is required to carry the data for an INTER frame. Furthermore, as the amount of data required to represent consecutive INTER frames also varies depending on image content, the number of data packets required to carry INTER frame data also varies.
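The unequal distribution of frame data over packets can be illustrated numerically. The frame sizes and the payload size of 1400 bytes below are assumed values chosen only for this example, and `packets_per_frame` is a hypothetical helper; the sketch further assumes that each packet carries data from a single frame.

```python
def packets_per_frame(frame_sizes_bytes, payload_bytes):
    """Number of packets needed for each frame, one frame per packet run.

    Uses integer ceiling division: a partly filled last packet still
    counts as a whole packet.
    """
    return [-(-size // payload_bytes) for size in frame_sizes_bytes]


# Assumed example: INTRA frames several times larger than INTER frames.
frames = [("INTRA", 12000), ("INTER", 1500), ("INTER", 2600),
          ("INTER", 900), ("INTRA", 11000)]
counts = packets_per_frame([size for _, size in frames], payload_bytes=1400)
for (frame_type, size), n in zip(frames, counts):
    print(f"{frame_type:5s} {size:6d} bytes -> {n} packets")
```

Under these assumptions the INTRA frames occupy 9 and 8 packets, while the INTER frames occupy only 1 or 2 packets each.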
The second source of fluctuation is that packet losses in fixed IP networks tend to occur in bursts. In order to avoid bursty errors and high peak bit- and packet-rates, well-designed streaming servers schedule the transmission of packets carefully, and packets may not be sent precisely at the rate at which they are played back at the receiving end. Typically, network servers are implemented in such a way that they try to achieve a constant rate of packet transmission. A server may also adjust the rate of packet transmission in accordance with prevailing network conditions, for example reducing the packet transmission rate when the network becomes congested and increasing it if network conditions allow. In TCP (transmission control protocol), this typically occurs by adjusting the advertised window of the acknowledgement message.
Considering this inherent property of network servers, and in connection with the previously described video encoding system, not only is the information required to reconstruct a predictively encoded video sequence unequally distributed between the transmitted data packets, but the data packets themselves may also be transmitted from the server at a varying rate. This means that a decoder in, for example, a receiving client terminal experiences a variable delay in receiving the information that it requires to construct consecutive frames in a video sequence, even if the transmission delay through the network is constant. It should be noted that the term client terminal refers to any end-user electronic device, such as handheld devices (PDAs), wireless terminals, desktop and laptop computers, and set top boxes. This variation in delay, which arises due to encoding, packetisation and packet transmission from a server, can be termed an “encoding” or “server-specific” delay variation. It is independent of, or in addition to, delay jitter that arises due to variations in transmission time within the network.
Hence, initial buffering enables the accommodation of fluctuations in transmitted data rate arising from the aforementioned sources, i.e. encoding or server-specific delay variation and network transmission related delay variation. Initial buffering helps to provide a more stable audio-visual quality and to avoid network congestion and packet losses.
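The encoding or server-specific delay variation can be illustrated with a short sketch. It assumes a server that sends one packet every 10 ms and frames that occupy 9, 2, 2, 1 and 8 packets respectively; these figures and the helper names `frame_completion_times` and `inter_frame_gaps` are hypothetical and serve only as an illustration. Network delay is taken to be constant and is therefore omitted.

```python
from itertools import accumulate


def frame_completion_times(packet_counts, packet_interval_ms):
    """Send time of the last packet of each frame under constant-rate sending.

    Packet k (counting from 1) is assumed to leave the server at
    k * packet_interval_ms, so a frame is complete once the cumulative
    packet count up to and including that frame has been sent.
    """
    return [n * packet_interval_ms for n in accumulate(packet_counts)]


def inter_frame_gaps(times):
    """Time between the completion of consecutive frames."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


packet_counts = [9, 2, 2, 1, 8]  # assumed packets per frame
times = frame_completion_times(packet_counts, packet_interval_ms=10)
gaps = inter_frame_gaps(times)
print(times)  # [90, 110, 130, 140, 220]
print(gaps)   # [20, 20, 10, 80]
```

Although packets leave the server at a perfectly constant rate, complete frames become available at intervals ranging from 10 ms to 80 ms, whereas their presentation interval is constant. A receiver therefore needs buffering to absorb this variation even over a network with constant delay.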
Initial buffering may also be performed after decoding of the received media data. This has the disadvantage that the dimensions of the buffer must be relatively large, as the buffering is performed on decoded data. The combined effect of encoding, server-specific and network transmission delay variations also tends to increase the initial buffering requirement.
Furthermore, the encoding of media data and the way in which encoded data is encapsulated into packets and transmitted from a server causes a decoder in a receiving client terminal to experience a variable delay in receiving the information it requires to reconstruct the media data, even if the transmission delay through the network is constant. Thus, a post-decoder buffer does not provide a means of absorbing this form of delay variation prior to decoding.
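The size penalty of buffering decoded rather than encoded data, noted above, can be estimated with simple arithmetic. All figures below are assumptions chosen for illustration: a 176x144 picture stored as YUV 4:2:0 (1.5 bytes per pixel), 15 frames per second, an encoded bit-rate of 64 kbit/s and a buffering period of 2 seconds. The function names are hypothetical.

```python
def coded_buffer_bytes(bitrate_bps, seconds):
    """Bytes needed to hold `seconds` of encoded data at a given bit-rate."""
    return bitrate_bps * seconds // 8


def decoded_buffer_bytes(width, height, fps, seconds):
    """Bytes needed to hold `seconds` of decoded YUV 4:2:0 video.

    One luma byte per pixel plus two chroma planes at quarter
    resolution gives width * height * 3 / 2 bytes per frame.
    """
    frame_bytes = width * height * 3 // 2
    return frame_bytes * fps * seconds


pre = coded_buffer_bytes(bitrate_bps=64_000, seconds=2)
post = decoded_buffer_bytes(width=176, height=144, fps=15, seconds=2)
print(pre)   # 16000
print(post)  # 1140480
print(round(post / pre, 1))
```

Under these assumptions a post-decoder buffer would need to be roughly 70 times larger than a pre-decoder buffer covering the same period.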