This invention relates to systems and methods for coding video data, and more particularly, to motion-compensation-based video coding schemes that employ fine-granularity layered coding.
Efficient and reliable delivery of video data is becoming increasingly important as the Internet continues to grow in popularity. Video is very appealing because it offers a much richer user experience than static images and text. It is more interesting, for example, to watch a video clip of a winning touchdown or a Presidential speech than it is to read about the event in stark print. Unfortunately, video data is significantly larger than other data types commonly delivered over the Internet. As an example, one second of uncompressed video data may consume one or more Megabytes of data. Delivering such large amounts of data over error-prone networks, such as the Internet and wireless networks, presents difficult challenges in terms of both efficiency and reliability.
To promote efficient delivery, video data is typically encoded prior to delivery to reduce the amount of data actually being transferred over the network. Image quality is lost as a result of the compression, but such loss is generally tolerated as necessary to achieve acceptable transfer speeds. In some cases, the loss of quality may not even be detectable to the viewer.
Video compression is well known. One common type of video compression is a motion-compensation-based video coding scheme, which is used in such coding standards as MPEG-1, MPEG-2, MPEG-4, H.261, and H.263.
One particular type of motion-compensation-based video coding scheme is fine-granularity layered coding. Layered coding is a family of signal representation techniques in which the source information is partitioned into sets called “layers”. The layers are organized so that the lowest, or “base layer”, contains the minimum information for intelligibility. The other layers, called “enhancement layers”, contain additional information that incrementally improves the overall quality of the video. With layered coding, lower layers of video data are often used to predict one or more higher layers of video data.
The quality at which digital video data can be served over a network varies widely depending upon many factors, including the coding process and transmission bandwidth. “Quality of Service”, or simply “QoS”, is the moniker used to generally describe the various quality levels at which video can be delivered. Layered video coding schemes offer a range of QoSs that enable applications to adapt to different video qualities. For example, applications designed to handle video data sent over the Internet (e.g., multi-party video conferencing) must adapt quickly to continuously changing data rates inherent in routing data over many heterogeneous sub-networks that form the Internet. The QoS of video at each receiver must be dynamically adapted to whatever the current available bandwidth happens to be. Layered video coding is an efficient approach to this problem because it encodes a single representation of the video source to several layers that can be decoded and presented at a range of quality levels.
Apart from coding efficiency, another concern for layered coding techniques is reliability. In layered coding schemes, a hierarchical dependence exists for each of the layers. A higher layer can typically be decoded only when all of the data for lower layers or the same layer in the previous prediction frame is present. If information at a layer is missing, any data for the same or higher layers is useless. In network applications, this dependency makes the layered encoding schemes very intolerant of packet loss, especially at the lower layers. If the loss rate is high in layered streams, the video quality at the receiver is very poor.
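The hierarchical dependence described above can be illustrated with a minimal sketch. The function name `decodable_layers` and the list-of-booleans representation are assumptions made for illustration only; the sketch considers a single frame and ignores the dependence on the previous prediction frame.

```python
def decodable_layers(received):
    """Return how many layers of one frame can be decoded.

    `received` is a hypothetical list of booleans, lowest layer first
    (index 0 is the base layer). Decoding stops at the first missing
    layer, because every higher layer depends on the layers below it.
    """
    count = 0
    for present in received:
        if not present:
            break  # a missing layer makes all higher layers useless
        count += 1
    return count
```

Under these assumptions, a frame whose second enhancement layer is lost decodes only up to the first enhancement layer, even if the third enhancement layer arrived intact.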
FIG. 1 depicts a conventional layered coding scheme 20, known as “fine-granularity scalable” or “FGS”. Three frames are shown, including a first or intraframe 22 followed by two predicted frames 24 and 26 that are predicted from the intraframe 22. The frames are encoded into four layers: a base layer 28, a first layer 30, a second layer 32, and a third layer 34. The base layer typically contains the video data that, when played, is minimally acceptable to a viewer. Each additional layer contains incrementally more components of the video data to enhance the base layer. The quality of video thereby improves with each additional layer. This technique is described in more detail in an article by Weiping Li, entitled “Fine Granularity Scalability Using Bit-Plane Coding of DCT Coefficients”, ISO/IEC JTC1/SC29/WG11, MPEG98/M4204 (December 1998).
With layered coding, the various layers can be sent over the network as separate sub-streams, where the quality level of the video increases as each sub-stream is received and decoded. The base-layer video 28 is transmitted in a well-controlled channel to minimize error or packet loss. In other words, the base layer is encoded to fit in the minimum channel bandwidth. The goal is to deliver and decode at least the base layer 28 to provide minimal quality video. The enhancement layers 30-34 are delivered and decoded as network conditions allow to improve the video quality (e.g., display size, resolution, frame rate, etc.). In addition, a decoder can be configured to choose and decode a particular portion or subset of these layers to get a particular quality according to its preference and capability.
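One way a sender might decide how many sub-streams to deliver is sketched below. The per-layer bit rates, the function name `select_layers`, and the greedy lowest-layer-first policy are all assumptions for illustration; the text above does not prescribe a particular selection rule.

```python
def select_layers(layer_rates_kbps, available_kbps):
    """Return how many layers to send, lowest layer first.

    `layer_rates_kbps` is a hypothetical list of per-layer bit rates,
    with the base layer at index 0. The base layer is always sent,
    since it is assumed to fit the minimum channel bandwidth.
    Enhancement layers are added in order while the budget allows.
    """
    if not layer_rates_kbps:
        return 0
    total = layer_rates_kbps[0]
    count = 1
    for rate in layer_rates_kbps[1:]:
        if total + rate > available_kbps:
            break  # higher layers depend on this one, so stop here
        total += rate
        count += 1
    return count
```

With assumed rates of 64, 32, 32, and 64 kbps and 130 kbps available, this sketch sends the base layer and the first two enhancement layers.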
One characteristic of the illustrated FGS coding scheme is that the enhancement layers 30-34 are predictively coded from the base layer 28 in the reference frames. As shown in FIG. 1, each of the enhancement layers 30-34 in the predicted frames 24 and 26 can be predicted from the base layer of the preceding frame. In this example, the enhancement layers of predicted frame 24 can be predicted from the base layer of intraframe 22. Similarly, the enhancement layers of predicted frame 26 can be predicted from the base layer of preceding predicted frame 24.
The FGS coding scheme provides good reliability in terms of error recovery from occasional data loss. By predicting all enhancement layers from the base layer, loss or corruption of one or more enhancement layers during transmission can be remedied by reconstructing the enhancement layers from the base layer. For instance, suppose that frame 24 experiences some error during transmission. In this case, the base layer 28 of preceding intraframe 22 can be used to predict the base layer and enhancement layers of frame 24.
Unfortunately, the FGS coding scheme has a significant drawback in that the scheme is very inefficient from a coding or compression standpoint since the prediction is always based on the lowest quality base layer. Accordingly, there remains a need for a layered coding scheme that is efficient without sacrificing error recovery.
FIG. 2 depicts another conventional layered coding scheme 40 in which three frames are encoded using a technique introduced in an article by James Macnicol, Michael Frater and John Arnold, which is entitled “Results on Fine Granularity Scalability”, ISO/IEC JTC1/SC29/WG11, MPEG99/m5122 (October 1999). The three frames include a first frame 42, followed by two predicted frames 44 and 46 that are predicted from the first frame 42. The frames are encoded into four layers: a base layer 48, a first layer 50, a second layer 52, and a third layer 54. In this scheme, each layer in a frame is predicted from the same layer of the previous frame. For instance, the enhancement layers of predicted frame 44 can be predicted from the corresponding layer of previous frame 42. Similarly, the enhancement layers of predicted frame 46 can be predicted from the corresponding layer of previous frame 44.
The coding scheme illustrated in FIG. 2 has the advantage of being very efficient from a coding perspective. However, it suffers from a serious drawback in that it cannot easily recover from data loss. Once there is an error or packet loss in the enhancement layers, it propagates to the end of a GOP (group of predicted frames) and causes serious drifting in higher layers in the prediction frames that follow. Even if sufficient bandwidth becomes available later on, the decoder is not able to recover to the highest quality until another GOP starts.
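The difference in error propagation between the schemes of FIG. 1 and FIG. 2 can be sketched as follows. The scheme labels `"fgs"` and `"same_layer"`, the function name, and the frame indexing are assumptions for illustration; the sketch models only the loss of an enhancement layer within a single GOP.

```python
def affected_frames(scheme, lost_frame, num_frames):
    """List the frames in one GOP degraded by an enhancement-layer loss.

    `scheme` is a hypothetical label: "fgs" predicts every enhancement
    layer from the base layer of the preceding frame (FIG. 1), while
    "same_layer" predicts each layer from the same layer of the
    preceding frame (FIG. 2). Frames are indexed from 0.
    """
    if scheme == "fgs":
        # Later frames reference only the base layer, so the damage
        # stays in the frame where the loss occurred.
        return [lost_frame]
    if scheme == "same_layer":
        # Each later frame references the damaged layer of its
        # predecessor, so the error drifts to the end of the GOP.
        return list(range(lost_frame, num_frames))
    raise ValueError("unknown scheme: " + scheme)
```

In a three-frame GOP with a loss in the second frame, the sketch reports one degraded frame for the FIG. 1 scheme and two for the FIG. 2 scheme.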
Accordingly, there remains a need for an efficient layered video coding scheme that adapts to bandwidth fluctuation and also exhibits good error recovery characteristics.
A video encoding scheme employs progressive fine-granularity scalable (PFGS) layered coding to encode video data frames into multiple layers, including a base layer of comparatively low quality video and multiple enhancement layers of increasingly higher quality video. Some of the enhancement layers in a current frame are predicted from at least one same or lower quality layer in a reference frame, whereby the lower quality layer is not necessarily the base layer.
In one described implementation, a video encoder encodes frames of video data into multiple layers, including a base layer and multiple enhancement layers. The base layer contains minimum quality video data and the enhancement layers contain increasingly higher quality video data. Layers in a prediction frame are predicted from both the base layer and one or more enhancement layers.
Residues resulting from the image frame prediction are defined as the difference between the original image and predicted image. When using a linear transform, such as Discrete Cosine Transform (DCT), the coefficients of the predicted residues equal the differences between the DCT coefficients of the original image and the DCT coefficients of the predicted image. Since the PFGS coding scheme uses multiple reference layers for the prediction, the coding scheme produces multiple sets of predicted DCT coefficients. The predicted DCT coefficients range in quality depending upon what reference layer is used for the prediction. Lower quality predicted DCT coefficients (or “LQPD”) are produced by using lower quality reference layers, such as the base layer. Higher quality predicted DCT coefficients (or “HQPD”) are produced by using higher quality enhancement layers as reference.
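The linearity property stated above can be checked numerically with a minimal sketch. It uses an unnormalized one-dimensional DCT-II on short sample rows; the function names and the sample values are assumptions for illustration, and a practical coder would typically apply a two-dimensional block transform instead.

```python
import math

def dct_1d(x):
    """Unnormalized 1-D DCT-II of a list of samples."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def residue_coeffs(original, predicted):
    """DCT coefficients of the residue, computed in the DCT domain.

    Because the DCT is linear, subtracting the coefficients of the
    predicted samples from those of the original samples gives the
    same result as transforming the sample-domain difference.
    """
    return [o - p for o, p in zip(dct_1d(original), dct_1d(predicted))]

# Hypothetical sample row and a prediction of it.
original = [52, 55, 61, 66]
predicted = [48, 50, 60, 70]

# Transform of the sample-domain difference.
direct = dct_1d([o - p for o, p in zip(original, predicted)])
# Difference of the two transforms.
via_dct = residue_coeffs(original, predicted)

# The two agree up to floating-point rounding.
assert all(abs(a - b) < 1e-9 for a, b in zip(direct, via_dct))
```

Running the same comparison with a prediction formed from a lower quality reference and one formed from a higher quality reference would yield the two coefficient sets that the text calls LQPD and HQPD.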
The expectation is that the HQPD coefficients will produce lower DCT residues in comparison to the LQPD coefficients because the reference layer is of higher quality and hence closer to the original image. Lower DCT residues translate into fewer coding layers, thereby resulting in better coding efficiency. While the expectation is valid from a mean value perspective, the various qualities of DCT residues tend to fluctuate due to the motion between frames and other reasons. In some instances, individual DCT residues in the HQPD coefficients actually increase in comparison to DCT residues produced by referencing a lower quality layer (i.e., residues in the LQPD coefficients). The undesired fluctuations and increases result in less efficient coding.
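The link between residue magnitude and the number of coding layers can be sketched under the assumption that residues are integers coded by bit planes, one plane per layer, as in the bit-plane approach named in the Li article cited earlier. The function name and the sample residue values are hypothetical.

```python
def bit_planes_needed(residues):
    """Number of bit planes needed for the largest residue magnitude.

    `residues` is a hypothetical list of integer DCT residues. Sign
    bits are assumed to be coded separately, so only the magnitude of
    the largest residue determines the plane count.
    """
    peak = max((abs(r) for r in residues), default=0)
    return peak.bit_length()
```

With assumed residues of 3, -2, 4, and -3 from a lower quality reference, three planes suffice. If a higher quality reference gives 1, 0, 9, and -1, the mean magnitude is lower, yet the single residue of 9 raises the count to four planes, which is the kind of fluctuation described above.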
Ideally, to eliminate the fluctuations in the DCT coefficients caused by using multiple prediction references of different quality, the HQPD coefficients should be fully or partially encoded into the base layer and low enhancement layers. However, in practice, only the lower quality LQPD coefficients are encoded in the base layer and low enhancement layers.
The video encoding scheme described herein efficiently eliminates these fluctuations by predicting HQPD coefficients from the LQPD coefficients encoded in the base layer and low quality enhancement layer. These predicted HQPD coefficients, or high quality residues derived therefrom, can be calculated in both the encoder and the decoder. Except for any residues from the HQPD prediction that still exceed the maximum, the bitstream containing the base layer and low quality enhancement layer need not be modified. The use of predicted HQPD coefficients improves coding efficiency by eliminating large fluctuations prior to encoding.