1. Technical Field
The present invention relates to a system and method suitable for streaming audio and video content over IP (Internet Protocol) networks. In particular, the present invention is suitable for use where the available bit-rate is inherently variable due to physical network characteristics and/or contention with other traffic. For example, the present invention is suitable for multimedia streaming to mobile handheld terminals, such as PDAs (Personal Digital Assistants), via GPRS (General Packet Radio Service) or 3G networks.
2. Related Art
New data network access technologies such as cable and ADSL (Asymmetric Digital Subscriber Line) modems, together with advances in compression and the availability of free client software, are driving the growth of video streaming over the Internet. The use of this technology is growing exponentially, possibly doubling in size every six months, with an estimated half billion streams being served in 2000. However, user perception of Internet streaming is still colored by experiences of congestion and large start-up delays.
Current IP networks are not well suited to the streaming of video content as they exhibit packet loss, delay and jitter (delay variation), as well as variable achievable throughput, all of which can detract from the end-user's enjoyment of the multimedia content.
Real-time video applications require all packets to arrive in a timely manner. If packets are lost, then the synchronization between encoder and decoder is broken, and errors propagate through the rendered video for some time. If packets are excessively delayed, they become useless to the decoder, which must operate in real-time, and are treated as lost. Packet loss, and its visual effect on the rendered video, is particularly significant in predictive video coding systems, such as H.263. The effect of packet loss can be reduced by introducing error protection into the video stream, but it has been found that such resilience techniques can only minimize, rather than eliminate, that effect.
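The treatment of late packets as lost can be illustrated with a minimal Python sketch. The `Packet` fields, the per-packet playout deadline and the function name are assumptions made for illustration only and are not taken from any particular streaming system.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int            # sequence number (hypothetical field)
    arrival_ms: float   # time the packet reached the receiver
    playout_ms: float   # latest time the decoder can still use it


def usable_and_lost(packets, expected_seqs):
    """Split expected sequence numbers into usable and effectively lost.

    A packet that never arrives and a packet that arrives after its
    playout deadline are both counted as lost.
    """
    on_time = {p.seq for p in packets if p.arrival_ms <= p.playout_ms}
    usable = [s for s in expected_seqs if s in on_time]
    lost = [s for s in expected_seqs if s not in on_time]
    return usable, lost
```

In this sketch a packet with sequence number 1 that arrives 10 ms after its deadline is reported in the same list as a packet that never arrived at all.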
In the case of a sustained packet loss, indicating a long-term drop in throughput, the streaming system needs to be able to reduce its long-term requirements. This commonly means that the bit-rate of the streamed media must be reduced.
Standard compression technologies, such as H.263 and MPEG-4, can be managed to provide a multimedia source that is capable of changing its encoding rate dynamically. A video source having such properties is described herein as an elastic source, i.e., one that is capable of adapting to long-term variations in network throughput. This is commonly achieved by providing a continuously adaptive video bit-rate. This is possible because, unlike audio codecs, video compression standards do not specify an absolute operating bit-rate.
Video streaming systems may be designed to provide an encoded stream with varying bit-rate, where the bit-rate adapts instantly to the available network bandwidth in response to client feedback. Such a system could be made to be network-friendly, by controlling the transmission rate such that it reduces rapidly in the case of packet loss, and increases slowly at other times.
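One common way to obtain a rate that reduces rapidly on packet loss and increases slowly otherwise is additive increase with multiplicative decrease. The following Python sketch assumes that rule; the text above does not specify one, and the function name, step size, decrease factor and rate limits are all illustrative values.

```python
def next_rate(rate_kbps, loss_seen, *, min_kbps=16.0, max_kbps=384.0,
              increase_kbps=4.0, decrease_factor=0.5):
    """Return the transmission rate for the next feedback interval.

    On reported packet loss the rate is cut multiplicatively (fast
    reduction); otherwise it grows by a small fixed step (slow increase).
    All default values are assumptions chosen for illustration.
    """
    if loss_seen:
        return max(min_kbps, rate_kbps * decrease_factor)
    return min(max_kbps, rate_kbps + increase_kbps)
```

With these assumed parameters a single loss report halves the rate, whereas recovering the lost rate takes many loss-free intervals.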
However, this solution is not practical for two reasons. Firstly, real-time video encoding usually requires a large amount of processing power, thus preventing such a solution from scaling to support many users. Secondly, the end-user perception of the overall quality will be adversely affected by rapid variations in instantaneous quality.
For uni-directional streaming applications, the delay between the sender and receiver is only perceptible at start-up. Therefore, common techniques trade delay for packet loss and jitter. Provided the average throughput requirements of the video stream match the average available bandwidth, the receiver buffer size can be dimensioned to contain the expected variation in delay.
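As a rough illustration of such dimensioning, the Python sketch below sizes the buffer to hold the amount of media that would be played out over the observed spread of packet delays. Using the difference between the largest and smallest observed delay as the variation to be absorbed is an assumption, as are the function and parameter names.

```python
def required_buffer_bytes(delays_ms, mean_rate_kbps):
    """Estimate a receiver buffer size that absorbs the delay variation.

    delays_ms      -- observed one-way packet delays in milliseconds
    mean_rate_kbps -- average stream rate in kilobits per second

    kbit/s multiplied by ms gives bits, so dividing by 8 gives bytes.
    """
    spread_ms = max(delays_ms) - min(delays_ms)
    return mean_rate_kbps * spread_ms / 8
```

For example, delays between 80 ms and 400 ms on a 64 kbit/s stream give a spread of 320 ms, corresponding to 2560 bytes of buffered media.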
Market-leading streaming systems are believed to use significant client-side buffering to reduce the effects of jitter that may be encountered in the Internet. While this helps, it also introduces large start-up delays, typically between 5 and 30 seconds, as the buffer fills. These systems also include technologies that allow the client to adapt to variations in available bandwidth. Although the details of these techniques are not publicly available, it is suspected that they generally use multi-data rate encoding within single files (SNR scalability), and intelligent transmission techniques such as server-side reduction of the video picture rate to maintain audio quality. Such large amounts of buffering could conceivably allow a significant proportion of packets to be resent, although these re-transmissions themselves are subject to the same network characteristics. The decision to resend lost data is conditional on this and several other factors. Such techniques are generally only applicable to unicast transmissions. Multicast transmission systems are typically better served by forward error correction or receiver-based scalability such as RLM (S. McCanne, ‘Receiver driven layered multicast,’ Proceedings of SIGCOMM 96, Stanford, Calif., August 1996) and RLC (L. Vicisano, L. Rizzo and J. Crowcroft, ‘TCP-like congestion control for layered multicast data transfer,’ Infocom '98).
The use of a buffer as described above allows a system to overcome packet loss and jitter. However, it does not overcome the problem of there being insufficient bit-rate available from the network. If the long-term average bit-rate requirements of the video material exceed the average bit-rate available from the network, the client buffer will eventually be drained and the video renderer will stop until the buffer is refilled. The degree of mismatch between available network bit-rate and the rate at which the content was encoded determines the frequency of pausing to refill the buffer.
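The relationship between rate mismatch and pausing can be made concrete with a simplified model, sketched below in Python. It assumes constant encoded and network rates, a buffer that is filled to a fixed number of seconds of media before playback starts, and playback that pauses completely while the buffer refills; the function and parameter names are hypothetical.

```python
def stall_cycle(encoded_kbps, network_kbps, buffer_s):
    """Return (play_s, refill_s) for one play/pause cycle, or None.

    play_s   -- how long playback runs before the buffer is drained
    refill_s -- how long playback pauses while the buffer refills

    None is returned when the network rate is at least the encoded
    rate, in which case the buffer does not drain on average.
    """
    if network_kbps >= encoded_kbps:
        return None
    ratio = network_kbps / encoded_kbps
    # While playing, buffered media shrinks by (1 - ratio) s per second.
    play_s = buffer_s / (1 - ratio)
    # While paused, buffered media grows by ratio s per second.
    refill_s = buffer_s / ratio
    return play_s, refill_s
```

Under these assumptions, content encoded at 100 kbit/s and delivered at 50 kbit/s with a 10 second buffer plays for 20 seconds and then pauses for 20 seconds; the closer the network rate is to the encoded rate, the longer the playing interval between pauses.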
As described above, most video compression algorithms, including H.263 and MPEG-4, can be implemented to provide a continuously adaptive bit-rate. However, once video and audio have been compressed, they become inelastic, and need to be transmitted at the encoded bit-rate.
While network jitter and short-term variations in network throughput can be absorbed by operating a buffer at the receiver, elasticity is achieved only when long-term variations in the network throughput can also be absorbed.
Layered encoding is a well-known technique for creating elastic video sources. Layered video compression uses a hierarchical coding scheme, in which quality at the receiver is enhanced by the reception and decoding of higher layers, which are sequentially added to the base representation. At any time, each client may receive any number of these video layers, depending on its current network connectivity to the source. In its simplest implementation, this provides a coarse-grain adaptation to network conditions, which is advantageous in multicast scenarios. Layered video compression has also been combined with buffering at the client, to add fine-grain adaptation to network conditions. However, it has been shown that layered encoding techniques are inefficient, and will typically require significantly more processing at the client. This causes particular problems for mobile devices, which are likely to have reduced processing capability.
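The coarse-grain adaptation offered by layering can be sketched in Python as follows. The sketch assumes that layers must be received in order, starting from the base layer, and that each layer has a fixed bit-rate; the function name and the example rates are illustrative only.

```python
def layers_to_receive(layer_rates_kbps, available_kbps):
    """Return how many layers, base layer first, fit the available rate.

    layer_rates_kbps -- bit-rate of each layer, base layer first
    available_kbps   -- the client's current available throughput

    Layers are cumulative: a higher layer is only useful when every
    layer below it is also received, so selection stops at the first
    layer that does not fit.
    """
    total = 0.0
    count = 0
    for rate in layer_rates_kbps:
        if total + rate > available_kbps:
            break
        total += rate
        count += 1
    return count
```

With assumed layer rates of 32, 32 and 64 kbit/s, a client with 100 kbit/s available would receive two layers, and the adaptation moves in steps of a whole layer, which is why it is described as coarse-grain.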
Transcoding is another well-known technique for creating elastic video sources. It has been shown that video transcoding can be designed to have much lower computational complexity than video encoding. However, that complexity is not negligible, so transcoding would not lead to a scalable architecture for video streaming.