Video coding has conventionally focused on improving video quality at a particular bit rate. With the rapid growth of network video applications, such as Internet streaming video, there is a desire to improve the video quality over a range of bit rates. Further, because of the wide variety of video servers and varying channel connections, there has been an interest in determining the bit rate at which the video quality should be optimized. Several approaches have been developed to overcome the problem of variations in transmission bandwidth.
Fine Granularity Scalability (FGS) was developed based on the traditional DCT-based video coder. With FGS, a single bitstream can produce continuously varying bit rates and qualities within a preset range, in contrast to a discrete set of bit rates and qualities. Because FGS has good compatibility with conventional DCT-based encoders and a good balance between scalability and reasonable complexity, it has been adopted by the MPEG-4 standard for streaming video applications.
The principal idea of FGS is bitplane coding. In a traditional (i.e., non-FGS) encoder, quantized DCT coefficients are encoded with run-length coding followed by variable length coding (VLC), which is essentially a “coefficient by coefficient” encoding. In an FGS encoder, quantized DCT coefficients are first converted to their binary representations. All the bits with the same significance are grouped together and called a “bit plane”. Starting with the most significant bit plane, the encoder codes the coefficients “plane by plane”. Run-length coding and VLC are still used when encoding each bit plane. The FGS property comes from the fact that even when only a subset of the bit planes is transmitted, received, or decoded, decodable video is still obtained, only at lower quality.
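The effect of receiving only a subset of the bit planes can be illustrated with a minimal Python sketch. The function names and the coefficient values below are hypothetical and chosen only for illustration; the sketch handles absolute values only and omits the run-length and VLC stages.

```python
def to_bitplanes(values):
    """Split non-negative integers into bit planes, most significant first."""
    n_planes = max(values, default=0).bit_length()
    return [[(v >> p) & 1 for v in values] for p in range(n_planes - 1, -1, -1)]


def from_bitplanes(received, total_planes, length):
    """Rebuild values from the leading planes only.

    Planes that were not received contribute nothing, so the result is a
    coarser approximation of the original values.
    """
    values = [0] * length
    for i, plane in enumerate(received):
        weight = 1 << (total_planes - 1 - i)  # significance of this plane
        values = [v + bit * weight for v, bit in zip(values, plane)]
    return values


# Hypothetical quantized coefficient magnitudes (illustration only).
coeffs = [10, 0, 6, 0, 0, 3, 0, 2]
planes = to_bitplanes(coeffs)

for kept in range(1, len(planes) + 1):
    approx = from_bitplanes(planes[:kept], len(planes), len(coeffs))
    print(kept, approx)
```

Each additional plane refines the approximation, and the values are exact only once all planes have been decoded.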
The use of FGS encoding and decoding for streaming video is described in ISO/IEC JTC1/SC 29/WG 11 N2502, International Organisation for Standardisation, “Information Technology-Generic Coding of Audio-Visual Objects—Part 2: Visual, ISO/IEC FDIS 14496-2, Final Draft International Standard,” Atlantic City, October 1998, and ISO/IEC JTC1/SC 29/WG 11 N3518, International Organisation for Standardisation, “Information Technology-Generic Coding of Audio-Visual Objects—Part 2: Visual, Amendment 4: Streaming video profile, ISO/IEC 14496-2:1999/FPDAM 4, Final Proposed Draft Amendment (FPDAM 4),” Beijing, July 2000, the contents of which are incorporated by reference herein.
As described in an article by Li et al. entitled “Fine Granularity Scalability in MPEG-4 Streaming Video,” Proceedings of the 2000 IEEE International Symposium on Circuits and Systems (ISCAS), Vol. 1, Geneva, 2000, the contents of which are incorporated by reference herein, an encoder generates a base layer and an enhancement layer that may be truncated to any number of bits within a video object plane (VOP). The enhancement layer preferably improves the quality of the VOP. In other words, receiving more FGS enhancement bits typically results in better quality in the reconstructed video. Thus, by using FGS coding, a single bit rate need not be provided; rather, a bit rate range can be provided to the FGS encoder. The FGS encoder preferably generates the base layer to meet the lower bound of the bit rate range and the enhancement layer to meet the upper bound of the bit rate range.
In a traditional communication system, the encoder compresses the input video signal into a bit rate that is less than, and usually close to, the channel capacity, and the decoder reconstructs the video signal using all the bits received from the channel. In such a model, two basic assumptions are typically made. The first assumption is that the encoder has knowledge regarding the channel capacity. The second assumption is that the decoder is able to decode all the bits received from the channel fast enough to reconstruct the video.
However, these two basic assumptions are not necessarily true in Internet streaming video applications. First, due to the server 12 used between the encoder 10 and the channel 14, as shown in FIG. 1, plus the varying channel capacity, the encoder 10 does not have knowledge regarding the channel capacity and does not know at which bit rate the video quality should be optimized. Second, many applications use a client/decoder 16 that shares the computational resources with other operations on the user terminal. The client/decoder 16 may not be able to decode all the bits received from the channel fast enough for reconstruction of the video signal. Therefore, a goal of video coding for Internet streaming video is to improve the video quality over a given bit rate range instead of at a given bit rate. The bitstream should be partially decodable at any bit rate within the bit rate range to reconstruct a video signal with improved quality at that bit rate.
Scalable video coding also has been a recent topic of interest. Once a given bit rate is chosen, a conventional, nonscalable coding technique tries to achieve optimal quality. However, if the channel bit rate is lower than the video-coding bit rate, a “digital cutoff” phenomenon occurs and the received video quality becomes very poor. On the other hand, if the channel bit rate is higher than the video-coding bit rate, the received video quality is no better. In MPEG-2 and MPEG-4, several layered scalability techniques, namely, SNR scalability, temporal scalability, and spatial scalability, have been implemented. In such a layered scalable coding technique, a video sequence is coded into a base layer and an enhancement layer. The enhancement layer bitstream is similar to the base layer bitstream in the sense that it must be completely received and decoded; otherwise, it does not enhance the video quality.
FIG. 2 illustrates an SNR scalability decoder 20 defined in the MPEG-2 video-coding standard. The base-layer bitstream is decoded first by the base-layer variable-length decoder (VLD) 22. The inverse quantizer 24 in the base layer produces reconstructed DCT coefficients. The enhancement-layer bitstream is decoded by the VLD 26 in the enhancement layer, and the enhancement residues of the DCT coefficients are produced by the inverse quantizer 28 in the enhancement layer. A higher accuracy DCT coefficient is obtained by adding the base-layer reconstructed DCT coefficient and the enhancement-layer DCT residue in adder 30. The DCT coefficients with a higher accuracy are provided to the inverse DCT (IDCT) unit 32 to produce reconstructed image domain residues that are to be added to the motion-compensated block from the previous frame in adder 34.
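The coefficient refinement performed by the two inverse quantizers and adder 30 can be sketched for a single DCT coefficient as follows. This is a minimal illustration that assumes plain uniform quantizers with hypothetical step sizes; the actual MPEG-2 quantization (weighting matrices, dead zones, and so on) is not modeled, and the function names are not taken from any standard.

```python
def encode_two_layers(coeff, base_step, enh_step):
    """Quantize a coefficient coarsely, then quantize what is left more finely."""
    base_level = round(coeff / base_step)       # base-layer quantization index
    residue = coeff - base_level * base_step    # error left by the base layer
    enh_level = round(residue / enh_step)       # enhancement-layer index
    return base_level, enh_level


def decode(base_level, enh_level, base_step, enh_step):
    """Reconstruct the coefficient; pass enh_level=None for base layer only."""
    base_recon = base_level * base_step         # role of inverse quantizer 24
    if enh_level is None:
        return base_recon
    enh_residue = enh_level * enh_step          # role of inverse quantizer 28
    return base_recon + enh_residue             # role of adder 30


# Hypothetical coefficient value and step sizes (illustration only).
coeff, base_step, enh_step = 37, 16, 4
base_level, enh_level = encode_two_layers(coeff, base_step, enh_step)

print(decode(base_level, None, base_step, enh_step))       # base layer only
print(decode(base_level, enh_level, base_step, enh_step))  # base + enhancement
```

With these illustrative numbers the reconstruction error shrinks from 5 (base layer only) to 1 once the enhancement residue is added.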
Temporal scalability is a technique to code a video sequence into two layers at the same spatial resolution, but different frame rates. The base layer is coded at a lower frame rate. The enhancement layer provides the missing frames to form a video with a higher frame rate. Coding efficiency of temporal scalability is high and very close to nonscalable coding. FIG. 3 illustrates temporal scalability. Only P-type prediction is used in the base layer. The enhancement-layer prediction can be either P-type or B-type from the base layer or P-type from the enhancement layer.
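The division of a sequence into two temporal layers can be sketched as follows. The sketch assumes a fixed pattern in which every `base_every`-th frame goes to the base layer; that parameter and the function names are hypothetical, and prediction between frames is not modeled.

```python
def split_temporal_layers(frames, base_every=2):
    """Assign every base_every-th frame to the base layer, the rest to enhancement."""
    base = frames[::base_every]
    enhancement = [f for i, f in enumerate(frames) if i % base_every != 0]
    return base, enhancement


def merge_temporal_layers(base, enhancement, base_every=2):
    """Interleave both layers again to recover the full frame rate."""
    base_iter, enh_iter = iter(base), iter(enhancement)
    total = len(base) + len(enhancement)
    return [
        next(base_iter) if i % base_every == 0 else next(enh_iter)
        for i in range(total)
    ]


frames = list(range(7))  # frame indices stand in for pictures
base, enhancement = split_temporal_layers(frames)
print(base)         # lower frame rate, decodable on its own
print(enhancement)  # the missing frames
print(merge_temporal_layers(base, enhancement))
```

A decoder that receives only the base layer simply plays `base` at the lower frame rate.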
Spatial scalability is a technique to code a video sequence into two layers at the same frame rate, but different spatial resolutions. The base layer is coded at a lower spatial resolution. The reconstructed base-layer picture is up-sampled to form the prediction for the high-resolution picture in the enhancement layer. FIG. 4 illustrates a single-loop spatial scalability decoder 40. An advantage of single-loop spatial scalability is its simplicity. If the spatial resolution of the base layer is the same as that of the enhancement layer, i.e., the up-sampling factor being 1, the spatial scalability decoder 40 can be considered as an SNR scalability decoder also. Unlike the SNR scalability decoder 20 in MPEG-2, the spatial scalability decoder 40 does not include the enhancement-layer information into the prediction loop. Therefore, if the corresponding encoder does not include the enhancement layer information into the prediction loop either, base-layer drift does not exist. Coding efficiency of the enhanced video using such an “open-loop” scalable coding method suffers from the fact that the enhancement information of the previous frame is not used in the prediction for the current frame.
The spatial scalability decoders defined in MPEG-2 and MPEG-4 use two prediction loops, one in the base layer and the other in the enhancement layer. The MPEG-2 spatial scalable decoder uses as its prediction a weighted combination of an up-sampled reconstructed frame from the base layer and the previously reconstructed frame in the enhancement layer, while the MPEG-4 spatial scalable decoder allows a “bi-directional” prediction using the up-sampled reconstructed frame from the base layer as the “backward reference” and the previously reconstructed frame in the enhancement layer as the “forward reference”. Currently, FGS in the MPEG-4 standard does not support spatial scalability.
In conventional DCT coding, the quantized DCT coefficients are coded using run-level coding. The number of consecutive zeros before a nonzero DCT coefficient is called a “run”, and the absolute value of the nonzero DCT coefficient is called a “level”. If a so-called “2-D” VLC table is used, the (run, level) symbol is coded and a separate “EOB” symbol is used to signal the end of the DCT block. If a “3-D” VLC table is used, the (run, level, eob) symbol is coded, where “eob” signals the end of the DCT block.
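The two symbol formats can be sketched as follows, where the run is the count of zeros preceding a nonzero coefficient and the level is that coefficient's magnitude. The function names, the `"EOB"` marker string, and the input values are hypothetical; sign handling and the VLC table lookup itself are omitted.

```python
def run_level_2d(coeffs):
    """Form (run, level) symbols followed by a separate EOB marker."""
    symbols, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1                    # extend the current run of zeros
        else:
            symbols.append((run, c))    # zeros before c, then its magnitude
            run = 0
    symbols.append("EOB")               # trailing zeros are implied by EOB
    return symbols


def run_level_3d(coeffs):
    """Form (run, level, eob) symbols; eob = 1 on the last nonzero coefficient.

    An all-zero block yields no symbols here; how such a block is signaled
    is outside the scope of this sketch.
    """
    pairs = run_level_2d(coeffs)[:-1]   # drop the separate EOB marker
    last = len(pairs) - 1
    return [(run, level, int(i == last)) for i, (run, level) in enumerate(pairs)]


# Hypothetical zigzag-ordered magnitudes (illustration only).
block = [5, 0, 0, 3, 0, 1, 0, 0]
print(run_level_2d(block))
print(run_level_3d(block))
```

The 3-D form folds the end-of-block signal into the last symbol, so no separate EOB code is needed.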
The major difference between a bitplane coding method and a run-level coding method is that the bitplane coding method considers each quantized DCT coefficient as a binary number of several bits instead of a decimal integer of a certain value. For each 8×8 DCT block, the 64 absolute values are zigzag ordered into an array. A bitplane of the block is defined as an array of 64 bits, one taken from each absolute value of the DCT coefficients at the same significant position. For each bitplane of each block, (RUN, EOP) symbols are formed and variable-length coded to produce the output bitstream. Starting from the most significant bitplane (MSB-plane), 2-D symbols are formed of two components: 1) the number of consecutive zeros before a 1 (RUN) and 2) whether there are any ones left on this bitplane, i.e., end-of-plane (EOP). If a bitplane contains all zeros, a special symbol, ALL-ZERO, is needed to represent it.
The following example illustrates bitplane coding. It is assumed that the absolute values and the sign bits after zigzag ordering are given as follows:
10, 0, 6, 0, 0, 3, 0, 2, 2, 0, 0, 2, 0, 0, 1, 0, . . . , 0, 0 (absolute value)
0, x, 1, x, x, 1, x, 0, 0, x, x, 1, x, x, 0, x, . . . , x, x (sign bits)
The maximum value in this block is found to be 10 and the number of bits to represent 10 in the binary format (1010) is four. Therefore, four bitplanes are used in forming the (RUN, EOP) symbols. Writing every value in the binary format, the four bitplanes are as follows:
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, . . . , 0, 0 (MSB)
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, . . . , 0, 0 (MSB-1)
1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, . . . , 0, 0 (MSB-2)
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, . . . , 0, 0 (MSB-3)
Converting the four bitplanes into (RUN, EOP) symbols results in:
(0, 1) (MSB)
(2, 1) (MSB-1)
(0, 0), (1, 0), (2, 0), (1, 0), (0, 0), (2, 1) (MSB-2)
(5, 0), (8, 1) (MSB-3)
Therefore, ten (RUN, EOP) symbols are formed in this example. These symbols are coded using variable-length code together with the sign bits, as shown below.
VLC(0, 1), 0 (MSB)
VLC(2, 1), 1 (MSB-1)
VLC(0, 0), VLC(1, 0), VLC(2, 0), 1, VLC(1, 0), 0, VLC(0, 0), 0, VLC(2, 1), 1 (MSB-2)
VLC(5, 0), VLC(8, 1), 0 (MSB-3)
Each sign bit is put into the bitstream only once right after the VLC code that contains the MSB of the nonzero absolute value associated with the sign bit. For example, no sign bit follows the second VLC code of the MSB-2 plane because the sign bit has been coded after the VLC code in the MSB-1 plane.
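The symbol formation and sign-bit placement described above can be reproduced with a short Python sketch. The function name and data layout are hypothetical, the VLC table lookup is omitted (each tuple stands for one VLC code), and `None` stands for the "x" entries, i.e., coefficients that are zero and carry no sign bit.

```python
ALL_ZERO = "ALL-ZERO"


def bitplane_symbols(abs_vals, signs):
    """Return, per bitplane (MSB first), a list of (RUN, EOP) tuples and sign bits.

    A sign bit (a bare 0 or 1) follows the symbol that carries the most
    significant 1 of its coefficient, and is emitted only once.
    """
    n_planes = max(abs_vals, default=0).bit_length()
    significant = [False] * len(abs_vals)   # has a 1 already been coded here?
    planes_out = []
    for p in range(n_planes - 1, -1, -1):
        plane = [(v >> p) & 1 for v in abs_vals]
        if not any(plane):
            planes_out.append([ALL_ZERO])
            continue
        last_one = max(i for i, bit in enumerate(plane) if bit)
        symbols, run = [], 0
        for i, bit in enumerate(plane[: last_one + 1]):
            if bit == 0:
                run += 1
                continue
            symbols.append((run, int(i == last_one)))  # EOP = 1 on the last 1
            if not significant[i]:
                symbols.append(signs[i])               # first 1 of this coefficient
                significant[i] = True
            run = 0
        planes_out.append(symbols)
    return planes_out


# Values from the example above, padded with zeros to 64 entries.
abs_vals = [10, 0, 6, 0, 0, 3, 0, 2, 2, 0, 0, 2, 0, 0, 1, 0] + [0] * 48
signs = [0, None, 1, None, None, 1, None, 0, 0, None, None, 1,
         None, None, 0, None] + [None] * 48

result = bitplane_symbols(abs_vals, signs)
for label, symbols in zip(["MSB", "MSB-1", "MSB-2", "MSB-3"], result):
    print(label, symbols)
```

In the printed MSB-2 line, the first two tuples have no sign bit after them because those coefficients became significant in earlier planes.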
However, conventional bitplane coding suffers from the following:
- Run-length coding is not efficient when the run is short.
- Encountering “1” in a bitplane makes a corresponding coefficient “significant” for all subsequent bitplane coding. If a coefficient is significant in a certain bitplane, the bit of that coefficient in that bitplane has approximately equal probability of being 1 or 0.
- When coding a certain bitplane, the probability of an insignificant coefficient becoming a significant coefficient (“flip probability”) is much lower than 0.5.
- The “significant” bits will interfere with the run-length coding because they have different statistical properties from the others.