1. Field of the Invention
The present invention relates to processing of compressed visual data, and in particular the processing of compressed visual data in order to reduce data storage requirements or data transmission bandwidth at the expense of decreased quality.
2. Background Art
It has become common practice to compress audio/visual data in order to reduce the capacity and bandwidth requirements for storage and transmission. One of the most popular audio/video compression techniques is MPEG. MPEG is an acronym for the Moving Picture Experts Group, which was set up by the International Organization for Standardization (ISO) to work on compression. MPEG provides a number of different variations (MPEG-1, MPEG-2, etc.) to suit different bandwidth and quality constraints. MPEG-2, for example, is especially suited to the storage and transmission of broadcast quality television programs.
For the video data, MPEG provides a high degree of compression (up to 200:1) by encoding 8×8 blocks of pixels into a set of discrete cosine transform (DCT) coefficients, quantizing and encoding the coefficients, and using motion compensation techniques to encode most video frames as predictions from or between other frames. In particular, the encoded MPEG video stream is comprised of a series of groups of pictures (GOPs), and each GOP begins with an independently encoded (intra) I frame and may include one or more following P frames and B frames. Each I frame can be decoded without information from any preceding and/or following frame. Decoding of a P frame requires information from a preceding frame in the GOP. Decoding of a B frame requires information from both a preceding and a following frame in the GOP. To minimize decoder buffer requirements, transmission orders differ from presentation orders for some frames, so that all the information of the other frames required for decoding a B frame will arrive at the decoder before the B frame.
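The difference between presentation order and transmission order can be illustrated with a minimal sketch. The frame labels and the function name transmission_order are hypothetical and chosen only for illustration; the sketch assumes that each B frame depends on the nearest preceding and following I or P frame.

```python
def transmission_order(display_frames):
    """Reorder frame labels from presentation order to a transmission order
    in which each B frame follows both of its reference (I or P) frames."""
    out = []
    pending_b = []  # B frames waiting for their following reference frame
    for frame in display_frames:
        if frame[0] == "B":
            pending_b.append(frame)
        else:
            # Send the reference frame first, then the B frames that
            # precede it in presentation order.
            out.append(frame)
            out.extend(pending_b)
            pending_b = []
    # Assumption: trailing B frames with no following reference in this
    # list are simply sent last.
    out.extend(pending_b)
    return out


display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(transmission_order(display))
```

In this sketch, P3 is sent ahead of B1 and B2, so the decoder already holds both I0 and P3 when those B frames arrive.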
In addition to the motion compensation techniques for video compression, the MPEG standard provides a generic framework for combining one or more elementary streams of digital video and audio, as well as system data, into single or multiple program transport streams (TS) which are suitable for storage or transmission. The system data includes information about synchronization, random access, and management of buffers to prevent overflow and underflow; time stamps for video frames and audio packetized elementary stream packets, embedded in the video and audio elementary streams; and program description, conditional access, and network related information, carried in other independent elementary streams. The standard specifies the organization of the elementary streams and the transport streams, and imposes constraints to enable synchronized decoding from the audio and video decoding buffers under various conditions.
The MPEG-2 standard is documented in ISO/IEC International Standard (IS) 13818-1, “Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Systems,” ISO/IEC IS 13818-2, “Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Video,” and ISO/IEC IS 13818-3, “Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Audio,” which are incorporated herein by reference. A concise introduction to MPEG is given in “A guide to MPEG Fundamentals and Protocol Analysis (Including DVB and ATSC),” Tektronix Inc., 1997, incorporated herein by reference.
MPEG-2 provides several optional techniques that allow video coding to be performed in such a way that the coded MPEG-2 stream can be decoded at more than one quality simultaneously. In this context, the word “quality” refers collectively to features of a video signal such as spatial resolution, frame rate, and signal-to-noise ratio (SNR) with respect to the original uncompressed video signal. These optional techniques are known as MPEG-2 scalability techniques. In the absence of the optional coding for such a scalability technique, the coded MPEG-2 stream is said to be nonscalable. The MPEG-2 scalability techniques are varieties of layered or hierarchical coding techniques, because the scalable coded MPEG-2 stream includes a base layer that can be decoded to provide low quality video, and one or more enhancement layers that can be decoded to provide additional information that can be used to enhance the quality of the video information decoded from the base layer. Such a layered coding approach is an improvement over a simulcast approach in which a coded bit stream for a low quality video is transmitted simultaneously with an independently coded bit stream for high quality video. The use of video information decoded from the base layer for reconstructing the high quality video permits the scalable coded MPEG-2 stream to have a lower bit rate and data storage requirement than a comparable simulcast data stream.
The MPEG-2 scalability techniques are useful for addressing a variety of applications, some of which do not need the high quality video that can be decoded from a nonscalable coded MPEG stream. For example, applications such as video conferencing, video database browsing, and windowed video on computer workstations do not need the high quality provided by a nonscalable coded MPEG-2 stream. For applications where the high quality video is not needed, the ability to receive, store, and decode an MPEG-2 base-layer stream having a reduced bit rate or data storage requirement may provide a more efficient bandwidth versus quality tradeoff, and a more efficient complexity versus quality tradeoff. A scalable coded MPEG-2 stream provides compatibility for a variety of decoders and services. For example, a reduced complexity decoder for standard television could decode a scalable coded MPEG-2 stream produced for high definition television. Moreover, the base layer can be coded for enhanced error resilience and can provide video at reduced quality when the error rate is high enough to preclude decoding at high quality.
The MPEG scaling techniques are set out in sections 7.7 to 7.11 of the MPEG-2 standard video encoding chapter 13818-2. They are further explained in Barry G. Haskell et al., Digital Video: An Introduction to MPEG-2, Chapter 9, entitled “MPEG-2 Scalability Techniques,” pp. 183-229, Chapman & Hall, International Thomson Publishing, New York, 1997, incorporated herein by reference. The MPEG scalability techniques include four basic techniques, and a hybrid technique that combines at least two of the four basic techniques. The four basic techniques are called data partitioning, signal-to-noise ratio (SNR) scalability, spatial scalability, and temporal scalability.
Data partitioning is a method of partitioning a single layer coded bit-stream into two classes, including a base layer “partition 0” and an enhancement layer “partition 1”. Partition 0 contains all high level header information as well as some low frequency discrete cosine transform (DCT) coefficients. Partition 1 contains all remaining higher frequency DCT coefficients and end-of-block (EOB) markers. Some syntax elements belonging to partition 0 are redundantly copied to partition 1 to facilitate error recovery. This duplicated information includes the sequence_header, GOP_header, picture_header, sequence_end_code, sequence_extension, picture_extension, and sequence_scalable_extension. This duplication ensures that there is proper synchronization and recovery following a bit-stream error in the low priority enhancement layer (partition 1) and introduces very little overhead. With respect to the single layer coded bit-stream, the separation point between the syntax elements to be included in the base and enhancement layers is indicated by a priority breakpoint (PBP) marker. The PBP can be adjusted at every picture slice. The PBP marker partitioning granularity is at the (run, level) DCT event level of the coded block data. Data partitioning is especially useful for error resilient video transmission over asynchronous transfer mode (ATM) networks and other networks where data prioritization is possible. Data partitioning has a number of shortcomings, including limited flexibility for PBP adjustment (in terms of partitioning granularity and update frequency), and the accumulation of drift errors over P pictures due to partially available coefficient information from a damaged enhancement layer.
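The partitioning at the (run, level) event level can be illustrated with a simplified sketch. It assumes that a block is represented as a list of (run, level) tuples and that the breakpoint is a plain count of events kept in partition 0; the actual PBP syntax and the header handling of the MPEG-2 standard are not modeled, and the names partition_block and merge_block are hypothetical.

```python
EOB = "EOB"  # stand-in for the end-of-block marker


def partition_block(events, breakpoint):
    """Split one block's (run, level) events at a hypothetical breakpoint.

    Partition 0 receives the first `breakpoint` (low frequency) events;
    partition 1 receives the remaining events and the EOB marker."""
    partition0 = list(events[:breakpoint])
    partition1 = list(events[breakpoint:]) + [EOB]
    return partition0, partition1


def merge_block(partition0, partition1):
    """Recombine the partitions. If partition 1 was lost (None), only the
    low frequency events of partition 0 are available to the decoder."""
    if partition1 is None:
        return list(partition0)
    return list(partition0) + [e for e in partition1 if e != EOB]


events = [(0, 45), (1, -7), (0, 3), (4, 1), (9, -1)]
p0, p1 = partition_block(events, breakpoint=2)
print(p0)                     # low frequency events only
print(merge_block(p0, p1))    # all events restored
print(merge_block(p0, None))  # degraded block after loss of partition 1
```

The last call shows the error-resilience property described above: a decoder that receives only partition 0 still reconstructs a coarse version of the block.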
SNR scalability is a method of generating a multiplex of bit-streams representing individual layers including a base layer which contains DCT coefficients quantized at a basic moderate quality level, and one or more SNR enhancement layers that contain DCT refinement coefficients intended to enhance the precision of quantized DCT coefficients reconstructed based on the content of all lower layers. Consequently, SNR scalability is also referred to as “Quantization Noise Scalability.” The layers in SNR scalability are all at the same spatial and temporal resolutions but cumulatively produce increasing quality levels starting with the lowest quality at the base layer. The base layer includes all high level header information, all motion compensation and macroblock (MB) type information, and coarse quantized DCT coefficient information. The enhancement layers include quantized DCT refinement coefficient information, and some amount of overhead information. The slice structure should be the same for all layers. Use of different quantization matrices in the base and enhancement layers is allowed. The overhead required by SNR scalability results in a decreased bandwidth utilization efficiency compared to data partitioning. SNR scalability is especially useful for simultaneous distribution of standard definition television and high-definition television, error-resilient video services over ATM and other networks, and multi-quality Video On Demand (VOD) services. SNR scalability has a number of shortcomings, including increased complexity and overhead as compared to data partitioning, inflexibility in bandwidth distribution among the layers primarily due to the fact that all motion information has to be carried in the base layer, and the shortcoming that no single SNR scalable codec can eliminate drift errors and also be reliable under lossy enhancement layer transmission.
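The relationship between the coarse base-layer coefficients and the refinement coefficients can be sketched as a two-stage quantization. The step sizes (16 and 4), the sample coefficient values, and the function names are assumptions made for illustration; a uniform scalar quantizer stands in for the quantization matrices of the standard.

```python
def quantize(value, step):
    """Uniform scalar quantizer: return the integer quantization index."""
    return round(value / step)


def encode_snr_layers(coeffs, base_step=16, enh_step=4):
    """Return (base indices, refinement indices) for a list of DCT coefficients."""
    base = [quantize(c, base_step) for c in coeffs]
    # The enhancement layer codes what the coarse base quantizer left over.
    residual = [c - b * base_step for c, b in zip(coeffs, base)]
    enh = [quantize(r, enh_step) for r in residual]
    return base, enh


def decode_snr_layers(base, enh=None, base_step=16, enh_step=4):
    """Reconstruct from the base layer alone, or from base plus refinement."""
    recon = [b * base_step for b in base]
    if enh is not None:
        recon = [r + e * enh_step for r, e in zip(recon, enh)]
    return recon


coeffs = [-203, 57, -21, 9, 0]
base, enh = encode_snr_layers(coeffs)
print(decode_snr_layers(base))       # base layer only: coarse reconstruction
print(decode_snr_layers(base, enh))  # base plus enhancement: finer reconstruction
```

Both reconstructions have the same number of coefficients, consistent with all layers sharing the same spatial and temporal resolution; only the precision of each coefficient improves.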
There are two variations to SNR scalability, namely, chroma simulcast and frequency domain SNR (FDSNR) scalability. Chroma simulcast provides a means for simultaneous distribution of video services that use 4:2:0 and 4:2:2 chroma subsampling formats. The associated bit-stream structure has three layers, including a base layer, an enhancement layer, and a simulcast layer. The base layer is a distribution of video in the 4:2:0 format. The enhancement layer provides SNR enhancement for the luminance component of the base layer. The simulcast layer includes chrominance components of the 4:2:2 format.
Frequency domain SNR scalability provides a transform domain method to achieve spatial resolution scalability. The base layer is intended for display at reduced spatial resolution and includes video encoded by a quantization matrix that allows a proper subset of normal size DCT transform coefficients to be selected and included in the base layer for use in conjunction with a smaller size DCT at the base layer decoder. The enhancement layer is the set of remaining normal size DCT transform coefficients.
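The division of normal size DCT coefficients between the two layers can be sketched as follows. The sketch assumes an 8×8 coefficient block held as a list of rows and a 4×4 low frequency subset for the base layer; it selects the subset directly rather than through a quantization matrix, and the function names are hypothetical.

```python
def split_low_frequency(block, keep=4):
    """Split an NxN coefficient block into a keep x keep low frequency
    base layer and an enhancement layer holding the remaining coefficients."""
    n = len(block)
    base = [row[:keep] for row in block[:keep]]
    enhancement = [
        [0 if (r < keep and c < keep) else block[r][c] for c in range(n)]
        for r in range(n)
    ]
    return base, enhancement


def merge_layers(base, enhancement):
    """Restore the full-size block from the two layers."""
    keep = len(base)
    merged = [list(row) for row in enhancement]
    for r in range(keep):
        for c in range(keep):
            merged[r][c] = base[r][c]
    return merged


# Placeholder coefficient values: block[r][c] = r * 8 + c
block = [[r * 8 + c for c in range(8)] for r in range(8)]
base, enh = split_low_frequency(block)
print(base)                              # 4x4 subset for a smaller size DCT
print(merge_layers(base, enh) == block)  # full block is recoverable
```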
Spatial scalability provides an ability to decode video at different spatial resolutions without first having to decode an entire (full-size) frame and then decimating it. The base layer carries the lowest spatial resolution version of the video obtained by decimating the original (full-size) video. Enhancement layers carry the differential information required to generate successively higher spatial resolution versions of the video. Spatial scalability supports interoperability between different video resolutions and formats, such as support for simultaneous transmission of high definition television and standard definition television, and backward compatibility of MPEG-2 with different standards such as H.261 or MPEG-1. Spatial scalability supports error-resilient video transmission on ATM and other networks. Decoder complexity can scale with channel bandwidth. Spatial scalability has the advantages of a high degree of flexibility in video resolution and formats to be used for each layer, and a high degree of flexibility in achieving bandwidth partitioning between layers. There are no decoder drift problems because there are independent coding loops that are only loosely coupled. Spatial scalability, however, requires significantly increased complexity as compared to data partitioning and SNR scalability.
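The base layer and differential enhancement layer of spatial scalability can be sketched on a small image. The sketch assumes decimation by dropping every other sample, upsampling by pixel replication, even image dimensions, and lossless layers; an actual codec would filter, code each layer lossily, and use motion compensation. All function names are hypothetical.

```python
def decimate(image):
    """Halve the resolution by keeping every other row and column."""
    return [row[::2] for row in image[::2]]


def upsample(image):
    """Double the resolution by replicating each pixel 2x2."""
    out = []
    for row in image:
        wide = [p for p in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out


def encode_layers(image):
    """Return (base layer, enhancement residual) for an even-sized image."""
    base = decimate(image)
    predicted = upsample(base)
    residual = [[o - p for o, p in zip(orow, prow)]
                for orow, prow in zip(image, predicted)]
    return base, residual


def decode_full(base, residual):
    """Rebuild the full-size image from the base layer and the residual."""
    predicted = upsample(base)
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(predicted, residual)]


image = [[10, 12, 20, 22],
         [11, 13, 21, 23],
         [30, 32, 40, 42],
         [31, 33, 41, 43]]
base, residual = encode_layers(image)
print(base)                                 # low resolution version
print(decode_full(base, residual) == image)
```

A low resolution decoder uses only the base layer and never handles a full-size frame, which is the property described at the start of the preceding paragraph.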
Temporal scalability provides an ability to decode video at different frame rates without first having to decode every single frame. The base layer carries the lowest frame rate version of the video coded by itself at the basic temporal rate. This version of the video is obtained from the original full frame rate version by a temporal down-sampling operation. The enhancement layers carry the information to construct the additional frames required to generate successively higher temporal resolution versions of the video. Additional frames in each enhancement layer are coded with temporal prediction relative to the frames carried by lower layers. Temporal scalability provides simultaneous support for different frame rates in the form of downward compatibility with lower-rate services, such as migration from first generation interlaced high definition television to high temporal resolution progressive high-definition television. Temporal scalability supports error-resilient video transmission on ATM and other networks. Decoder complexity can scale with channel bandwidth. Temporal scalability has the advantage of providing flexibility in achieving bandwidth partitioning between layers. There are no decoder drift problems because there are independent coding loops that are only loosely coupled. Temporal scalability has less complexity and higher efficiency than spatial scalability. Temporal scalability, however, provides a bandwidth partitioning flexibility that is more limited than spatial scalability because temporal scalability uses the same spatial resolution in all layers.
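The temporal down-sampling that produces the base layer, and the interleaving that restores the full frame rate, can be sketched as follows. Frames are represented by hypothetical string labels, the down-sampling factor of 2 is an assumption, and the temporal prediction used to code the enhancement frames is not modeled.

```python
def split_temporal(frames, factor=2):
    """Base layer keeps every `factor`-th frame; the enhancement layer
    keeps the frames that were dropped by the down-sampling."""
    base = frames[::factor]
    enhancement = [f for i, f in enumerate(frames) if i % factor != 0]
    return base, enhancement


def merge_temporal(base, enhancement, factor=2):
    """Interleave the layers to restore the full frame rate sequence."""
    merged = []
    remaining = iter(enhancement)
    for frame in base:
        merged.append(frame)
        for _ in range(factor - 1):
            extra = next(remaining, None)
            if extra is not None:
                merged.append(extra)
    return merged


frames = ["f0", "f1", "f2", "f3", "f4", "f5", "f6"]
base, enhancement = split_temporal(frames)
print(base)         # reduced frame rate version
print(enhancement)  # additional frames for the higher frame rate
print(merge_temporal(base, enhancement) == frames)
```

A reduced frame rate decoder can present the base layer directly without ever decoding the enhancement frames.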
Hybrid scalability combines two scalabilities at a time from among SNR, spatial and temporal scalabilities. A base layer carries a basic quality, spatial and temporal resolution version of the intended video content. A first enhancement layer carries differential information required to implement one of the two intended enhancements on the base layer. A second enhancement layer carries differential information required to implement the second intended enhancement on the combination of the base and the first enhancement layers. Hybrid scalability is useful in more demanding applications requiring scalability in two video quality aspects within three or more bit-stream layers.