Efficient and reliable delivery of video data is becoming increasingly important as the Internet continues to grow in popularity. Video is very appealing because it offers a much richer user experience than static images and text. It is more interesting, for example, to watch a video clip of a winning touchdown or a Presidential speech than it is to read about the event in stark print. Unfortunately, video data is significantly larger than other data types commonly delivered over the Internet. As an example, one second of uncompressed video may consume one or more megabytes. Delivering such large amounts of data over error-prone networks, such as the Internet and wireless networks, presents difficult challenges in terms of both efficiency and reliability.
To promote efficient delivery, video data is typically encoded prior to delivery to reduce the amount of data actually being transferred over the network. Image quality is lost as a result of the compression, but such loss is generally tolerated as necessary to achieve acceptable transfer speeds. In some cases, the loss of quality may not even be detectable to the viewer.
Video compression is well known. One common type of video compression is a motion-compensation-based video coding scheme, which is used in such coding standards as MPEG-1, MPEG-2, MPEG-4, H.261, and H.263.
One particular type of motion-compensation-based video coding scheme is fine-granularity layered coding. Layered coding is a family of signal representation techniques in which the source information is partitioned into sets called “layers”. The layers are organized so that the lowest, or “base layer”, contains the minimum information for intelligibility. The other layers, called “enhancement layers”, contain additional information that incrementally improves the overall quality of the video. With layered coding, lower layers of video data are often used to predict one or more higher layers of video data.
The quality at which digital video data can be served over a network varies widely depending upon many factors, including the coding process and transmission bandwidth. “Quality of Service”, or simply “QoS”, is the term used to generally describe the various quality levels at which video can be delivered. Layered video coding schemes offer a range of QoSs that enable applications to adapt to different video qualities. For example, applications designed to handle video data sent over the Internet (e.g., multi-party video conferencing) must adapt quickly to continuously changing data rates inherent in routing data over the many heterogeneous sub-networks that form the Internet. The QoS of video at each receiver must be dynamically adapted to whatever the current available bandwidth happens to be. Layered video coding is an efficient approach to this problem because it encodes a single representation of the video source into several layers that can be decoded and presented at a range of quality levels.
Apart from coding efficiency, another concern for layered coding techniques is reliability. In layered coding schemes, a hierarchical dependence exists for each of the layers. A higher layer can typically be decoded only when all of the data for lower layers or the same layer in the previous prediction frame is present. If information at a layer is missing, any data for the same or higher layers is useless. In network applications, this dependency makes the layered encoding schemes very intolerant of packet loss, especially at the lower layers. If the loss rate is high in layered streams, the video quality at the receiver is very poor.
FIG. 1 depicts a conventional layered coding scheme 20, known as “fine-granularity scalable” or “FGS”. Three frames are shown, including a first or intraframe 22 followed by two predicted frames 24 and 26 that are predicted from the intraframe 22 and the previous frame 24. The frames are encoded into four layers: a base layer 28, a first layer 30, a second layer 32, and a third layer 34. The base layer typically contains the video data that, when played, is minimally acceptable to a viewer. Each additional layer contains incrementally more components of the video data to enhance the base layer. The quality of video thereby improves with each additional layer. This technique is described in more detail in an article by Weiping Li, entitled “Fine Granularity Scalability Using Bit-Plane Coding of DCT Coefficients”, ISO/IEC JTC1/SC29/WG11, MPEG98/M4204 (December 1998).
With layered coding, the various layers can be sent over the network as separate sub-streams, where the quality level of the video increases as each sub-stream is received and decoded. The base-layer video 28 is transmitted in a well-controlled channel to minimize error or packet loss. In other words, the base layer is encoded to fit in the minimum channel bandwidth. The goal is to deliver and decode at least the base layer 28 to provide minimal quality video. The enhancement layers 30-34 are delivered and decoded as network conditions allow to improve the video quality (e.g., display size, resolution, frame rate, etc.). In addition, a decoder can be configured to choose and decode a particular portion or subset of these layers to get a particular quality according to its preference and capability.
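As a minimal sketch of the layer selection described above, the following Python function counts how many layers, starting with the base layer, fit within an available bandwidth. The function name `select_layers` and the per-layer bit rates are hypothetical and chosen only for illustration; they are not taken from the cited coding schemes.

```python
def select_layers(layer_rates_kbps, bandwidth_kbps):
    """Return how many layers (base layer first) fit in the bandwidth.

    layer_rates_kbps: bit rate of each layer, ordered base layer first.
    bandwidth_kbps:   currently available channel bandwidth.
    Layers are hierarchical, so selection stops at the first layer that
    does not fit; a higher layer is never chosen without the lower ones.
    """
    total = 0
    count = 0
    for rate in layer_rates_kbps:
        if total + rate > bandwidth_kbps:
            break
        total += rate
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical rates: one base layer and three enhancement layers.
    rates = [128, 128, 128, 128]
    for bandwidth in (100, 256, 300, 512):
        print(bandwidth, "kbits/s ->", select_layers(rates, bandwidth), "layer(s)")
```

With these assumed rates, a 256 kbits/s channel carries the base layer and one enhancement layer, while a channel below 128 kbits/s cannot carry even the base layer, which is why the base layer is encoded to fit the minimum channel bandwidth.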
One characteristic of the illustrated FGS coding scheme is that the enhancement layers 30-34 are predictively coded from the base layer 28 in the reference frames. As shown in FIG. 1, each of the enhancement layers 30-34 in the predicted frames 24 and 26 can be predicted from the base layer of the preceding frame. In this example, the enhancement layers of predicted frame 24 can be predicted from the base layer of intraframe 22. Similarly, the enhancement layers of predicted frame 26 can be predicted from the base layer of preceding predicted frame 24.
The FGS coding scheme provides good reliability in terms of error recovery from occasional data loss. By predicting all enhancement layers from the base layer, loss or corruption of one or more enhancement layers during transmission can be remedied by reconstructing the enhancement layers from the base layer. For instance, suppose that frame 24 experiences some error during transmission. In this case, the base layer 28 of preceding intraframe 22 can be used to predict the base layer and enhancement layers of frame 24. Unfortunately, the FGS coding scheme has a significant drawback in that the scheme is very inefficient from a coding or compression standpoint since the prediction is always based on the lowest quality base layer.
FIG. 2 depicts another conventional layered coding scheme 40 in which three frames are encoded using a technique introduced in an article by James Macnicol, Michael Frater and John Arnold, entitled “Results on Fine Granularity Scalability”, ISO/IEC JTC1/SC29/WG11, MPEG99/m5122 (October 1999). The three frames include a first frame 42, followed by two predicted frames 44 and 46 that are predicted from the first frame 42 and the previous frame 44. The frames are encoded into four layers: a base layer 48, a first layer 50, a second layer 52, and a third layer 54. In this scheme, each layer in a frame is predicted from the same layer of the previous frame. For instance, the enhancement layers of predicted frame 44 can be predicted from the corresponding layer of previous frame 42. Similarly, the enhancement layers of predicted frame 46 can be predicted from the corresponding layer of previous frame 44. The coding scheme illustrated in FIG. 2 suffers from a serious drawback in that it cannot easily recover from data loss. Once an error or packet loss occurs in the enhancement layers, it propagates to the end of a GOP (group of predicted frames) and causes serious drifting in higher layers in the prediction frames that follow. This propagation is a simple example of what is called drifting error.
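The difference in error recovery between the schemes of FIG. 1 and FIG. 2 can be illustrated with a toy model. The sketch below is not an implementation of either cited scheme: it assumes each frame is a single integer sample, the base layer is a coarse quantization of that sample, and one enhancement layer carries an exact residue. The function name `decode_errors`, the scheme labels, and the sample values are all hypothetical.

```python
def decode_errors(frames, received, scheme, step=8):
    """Absolute decoding error per frame in a scalar two-layer toy model.

    frames:   original sample value of each frame (integers).
    received: received[t] is True if the enhancement layer of frame t arrives.
    scheme:   "base_reference" (enhancement predicted from the previous base
              layer, as in FIG. 1) or "same_layer" (enhancement predicted
              from the previous enhancement layer, as in FIG. 2).
    The base layer is assumed to always arrive.
    """
    base = [step * (x // step) for x in frames]  # coarse base-layer value
    errors = []
    prev_dec = None  # decoder's previous enhancement-quality reconstruction
    for t, x in enumerate(frames):
        if t == 0:
            # Intraframe: the enhancement refines this frame's own base layer.
            enc_ref = dec_ref = base[0]
        elif scheme == "base_reference":
            enc_ref = dec_ref = base[t - 1]
        elif scheme == "same_layer":
            enc_ref = frames[t - 1]  # encoder has the full-quality frame
            dec_ref = prev_dec       # decoder has whatever it reconstructed
        else:
            raise ValueError("unknown scheme: " + scheme)
        residue = x - enc_ref
        # Without the enhancement layer the decoder falls back to the base layer.
        rec = dec_ref + residue if received[t] else base[t]
        errors.append(abs(rec - x))
        prev_dec = rec
    return errors


if __name__ == "__main__":
    frames = [10, 13, 19, 22, 27]
    received = [True, True, False, True, True]  # enhancement of frame 2 is lost
    print(decode_errors(frames, received, "base_reference"))  # [0, 0, 3, 0, 0]
    print(decode_errors(frames, received, "same_layer"))      # [0, 0, 3, 3, 3]
```

In this model, a single lost enhancement layer affects only the frame in which it occurs when prediction uses the base layer, whereas with same-layer prediction the mismatch between encoder and decoder references carries into every later frame.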
With the steady increase in access bandwidth, more and more new applications are streaming audio and video content using techniques described in articles by A. Luthra, titled “Need for simple streaming video profile”, published in ISO/IEC JTC1/SC29/WG11, MPEG doc M5800, Noordwijkerhout, Netherlands, March 2000, and by J. Lu, titled “Signal processing for Internet video streaming: A review”, published in SPIE in Image and Video Communication and Processing 2000, vol 3974, 246-258 (2000). These Internet streaming applications have to deal with network bandwidth that fluctuates over a wide range from one user to another and from time to time. The objective of traditional video coding techniques is typically to optimize the video quality at a given bit rate. Therefore, the bit-stream generated with those methods does not adapt well to channel bandwidth fluctuations.
In the FGS scheme, mentioned above, DCT residues between the original/predicted DCT coefficients and dequantized DCT coefficients of the base layer form the enhancement bit-stream using the bit plane technique. Since the bit plane technique provides an embedded bit-stream and fine granularity scalable capability, the FGS enhancement bit-stream can be decoded at any bit rate. Therefore, the FGS scheme can easily adapt to the channel bandwidth fluctuations. However, since its motion prediction is always based on the lowest quality base layer, the coding efficiency of the FGS scheme is not as good as, and sometimes much worse than, the traditional SNR scalable scheme. Compared with the non-scalable video coding scheme, the PSNR of the FGS scheme may drop 2.0 dB or more at the same bit rate.
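The embedded property of bit-plane coding can be sketched as follows. This simplified Python example splits integer residues into bit planes, most significant plane first, and rebuilds them from only the planes that were received. It is an illustration under stated assumptions, not the FGS bit-stream syntax: signs are kept in a separate list, no entropy coding is applied, and the function names `to_bit_planes` and `from_bit_planes` are hypothetical.

```python
def to_bit_planes(residues):
    """Split integer residues into sign list and magnitude bit planes.

    Returns (signs, planes), where planes[0] is the most significant plane
    and each plane holds one bit per residue.
    """
    magnitudes = [abs(r) for r in residues]
    num_planes = max(magnitudes, default=0).bit_length()
    planes = [
        [(m >> p) & 1 for m in magnitudes]
        for p in range(num_planes - 1, -1, -1)
    ]
    signs = [-1 if r < 0 else 1 for r in residues]
    return signs, planes


def from_bit_planes(signs, planes, received):
    """Rebuild residues using only the first `received` bit planes."""
    num_planes = len(planes)
    magnitudes = [0] * len(signs)
    for i, plane in enumerate(planes[:received]):
        shift = num_planes - 1 - i  # weight of this plane's bits
        for j, bit in enumerate(plane):
            magnitudes[j] |= bit << shift
    return [s * m for s, m in zip(signs, magnitudes)]


if __name__ == "__main__":
    signs, planes = to_bit_planes([5, -3, 0, 6])
    for k in range(len(planes) + 1):
        print(k, "plane(s):", from_bit_planes(signs, planes, k))
```

For the residues `[5, -3, 0, 6]`, decoding one plane gives `[4, 0, 0, 4]`, two planes give `[4, -2, 0, 6]`, and all three planes restore the original values. Truncating the stream after any plane still yields a usable, coarser approximation, which is what lets the enhancement bit-stream follow the available bit rate.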
A general framework for effectively implementing fine granularity scalability, called Progressive Fine Granularity Scalable (PFGS) video coding, has been proposed in articles authored by F. Wu, S. Li and Y.-Q. Zhang, titled “DCT-prediction based progressive fine granularity scalability coding”, published in ICIP 2000, Vancouver, Canada, vol 3, 556-559 (Sep. 10-13, 2000), and authored by F. Wu, S. Li and Y.-Q. Zhang, titled “A framework for efficient progressive fine granularity scalable video coding”, published in IEEE Trans. Circuits and Systems for Video Technology, special issue on streaming video, vol 11, no 3, 332-344 (2001), hereinafter collectively and individually referred to as the “Wu et al. Publications”. In the PFGS framework, a high quality reference is used in the enhancement layer coding.
FIG. 3 is a prediction architecture of a PFGS layered coding scheme 300 implemented by the video encoder. FIG. 3 shows arrows with solid lines between two adjacent frames which represent temporal prediction. The arrows with dashed lines in FIG. 3 are for prediction in the transform domain, and the gray rectangular boxes denote those layers to be constructed as references. Scheme 300 encodes frames of video data into multiple layers, including a base layer 3002 and multiple enhancement layers: the first enhancement layer 302, the second enhancement layer 304, the third enhancement layer 306, and a fourth enhancement layer 308. An example of a low quality enhancement layer reference is seen at second enhancement layer 304 in the frames 2 and 4. An example of a high quality enhancement layer reference is seen at third enhancement layer 306 in the frames 3 and 5.
As can be seen in FIG. 3, each frame at the base layer is always predicted from the previous frame at the base layer, whereas each frame at an enhancement layer is predicted from the previous frame at an enhancement layer. Since the quality of an enhancement layer is always higher than that of the base layer, the PFGS scheme provides more accurate motion prediction than the FGS scheme, thus improving the coding efficiency. Experimental results of the PFGS scheme show that the coding efficiency of the PFGS scheme can be up to 1.0 dB higher in average PSNR than that of the FGS scheme at moderate or high bit rates.
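The PSNR figures quoted in this section follow the standard definition, PSNR = 10·log10(peak² / MSE), where MSE is the mean squared error between the original and decoded samples. The Python sketch below computes it for flat lists of sample values; the function name `psnr` and the default peak value of 255 (8-bit samples) are assumptions made for illustration.

```python
import math


def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    original = [52, 55, 61, 66]
    decoded = [53, 54, 62, 65]  # every sample is off by one, so MSE = 1
    print(round(psnr(original, decoded), 2))
```

Because the scale is logarithmic, a 1.0 dB gain corresponds to a mean squared error roughly 21% lower, and a 2.0 dB drop to a mean squared error about 1.58 times higher.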
Just as in the FGS scheme, the PFGS scheme generates two bit-streams: a base layer bit-stream and an enhancement layer bit-stream. In general, the bit rate of the base layer is low enough to fit in the minimum network bandwidth. Therefore, it can be assumed that the base layer is always available in the decoder. However, since the high quality references always comprise part of the DCT coefficients encoded in the enhancement layer, more bandwidth is needed to transmit them to the decoder. When network bandwidth drops, the decoder may partially or completely lose the high quality references. In this case, the decoder has to use the corrupted high quality references or use the low quality references instead. This introduces errors into the enhancement layer because the encoder and the decoder use different references. Unfortunately, these kinds of errors can be propagated from one frame to another through motion compensation. In the worst case, the enhancement bit-streams in successive frames are completely dropped due to network congestion. Once the decoder receives the enhancement bit-stream again, the errors that occurred in previous frames can accumulate and affect the frames that follow within the same Group of Pictures (GOP). Hence, the decoded quality of the enhancement layer deteriorates rapidly as the frame number increases.
FIG. 4 shows a simple example wherein the conventional MPEG-4 test sequence, known as the Foreman sequence, is encoded with the FGS scheme and the PFGS scheme. The PSNR curves of both the FGS scheme and the PFGS scheme are drawn in FIG. 4 as a graph showing the drifting phenomenon at a low enhancement bit rate. The bit rate of the base layer is 128 kbits/s. The high quality references are reconstructed from the second or third bit plane in the PFGS scheme, so that the total bit rate for high quality references is more than 384 kbits/s. When the PFGS bit-stream is transmitted over a network with a bandwidth of 256 kbits/s, the high quality references are always incompletely transmitted to the decoder. As the frame number increases, the decoded quality of the PFGS scheme can drop by more than 2.0 dB compared with that of the FGS scheme. Moreover, the PSNR curve of the PFGS scheme clearly drifts toward the low end. Consequently, these kinds of errors are also called drifting errors. The cause of drifting errors is that the high quality references cannot be correctly and completely transmitted to the decoder.
To eliminate the drifting errors in the PFGS scheme, the Wu et al. Publications proposed that the high quality reference be reconstructed alternately from the previous base layer and from the previous enhancement layer. When the high quality reference is reconstructed from the previous base layer, the encoder and decoder always obtain the same temporal prediction, so the drifting errors propagated from the previous frames are effectively eliminated. This method also reduces the coding efficiency of the PFGS scheme, however, because the high quality reference does not always attain the best quality available. Moreover, since the choice of temporal references is frame-based, the original PFGS scheme does not provide a good trade-off between high coding efficiency and low drifting errors. The following section briefly reviews the existing techniques to terminate or reduce the drifting errors.
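The effect of periodically rebuilding the high quality reference from the base layer can be sketched with a scalar toy model. This is not the method of the Wu et al. Publications: it assumes one integer sample per frame, a coarsely quantized base layer that always arrives, a single enhancement layer carrying an exact residue, and a fixed period at which the reference is taken from the previous base layer. The name `drift_errors` and its `base_ref_period` parameter are hypothetical.

```python
def drift_errors(frames, received, base_ref_period=None, step=8):
    """Absolute decoding error per frame in a scalar two-layer toy model.

    frames:          original sample value of each frame (integers).
    received:        received[t] is True if frame t's enhancement arrives.
    base_ref_period: if set, every frame t with t % base_ref_period == 0
                     predicts its enhancement from the previous base layer;
                     all other frames use the previous enhancement layer.
    """
    if not frames:
        return []
    base = [step * (x // step) for x in frames]  # base layer always arrives
    errors = []
    prev_dec = base[0]
    for t, x in enumerate(frames):
        use_base = t == 0 or (
            base_ref_period is not None and t % base_ref_period == 0
        )
        if use_base:
            # Encoder and decoder share the same base-layer reference.
            enc_ref = dec_ref = base[t - 1] if t > 0 else base[0]
        else:
            enc_ref = frames[t - 1]  # encoder's high quality reference
            dec_ref = prev_dec       # decoder's possibly degraded reference
        rec = dec_ref + (x - enc_ref) if received[t] else base[t]
        errors.append(abs(rec - x))
        prev_dec = rec
    return errors


if __name__ == "__main__":
    frames = [10, 13, 19, 22, 27, 30]
    received = [True, False, True, True, True, True]  # frame 1 loses enhancement
    print(drift_errors(frames, received))                     # [0, 5, 5, 5, 5, 5]
    print(drift_errors(frames, received, base_ref_period=2))  # [0, 5, 0, 0, 0, 0]
```

In this model the drift stops at the first frame whose reference comes from the base layer. The toy model uses exact residues and therefore does not show the accompanying cost, namely that a base-layer reference is a poorer predictor and needs more enhancement bits for the same quality.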