Video compression enables storing, transmitting, and processing of visual information with fewer storage, network, and processor resources. The most widely used video compression standards include MPEG-1 for storage and retrieval of moving pictures, MPEG-2 for digital television, and H.263 for video conferencing, see ISO/IEC 11172-2:1993, “Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media up to about 1.5 Mbit/s—Part 2: Video,” D. LeGall, “MPEG: A Video Compression Standard for Multimedia Applications,” Communications of the ACM, Vol. 34, No. 4, pp. 46-58, 1991, ISO/IEC 13818-2:1996, “Information Technology—Generic Coding of Moving Pictures and Associated Audio Information—Part 2: Video,” 1994, ITU-T SG XV, DRAFT H.263, “Video Coding for Low Bitrate Communication,” 1996, and ITU-T SG XVI, DRAFT13 H.263+Q15-A-60 rev.0, “Video Coding for Low Bitrate Communication,” 1997.
These standards are relatively low-level specifications that deal primarily with spatial compression of images or frames, and the spatial and temporal compression of a sequence of frames. As a common feature, these standards perform compression on a per frame basis. The standards achieve high compression ratios for a wide range of applications.
For transmission of a video over a communications channel with a fixed bandwidth, the video is often encoded with a constant bit-rate (CBR). To account for minor fluctuations in the bits produced for each frame, output bits of an encoder are first sent to a storage buffer. Subsequently, the buffer releases the output bits at a constant bit-rate to the channel.
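The buffering step described above can be sketched as a simple simulation. This is a minimal illustration under stated assumptions, not part of any cited standard: the function name, the per-frame drain model, and the choice to drop excess bits on overflow are all hypothetical.

```python
def simulate_cbr_buffer(frame_bits, channel_bits_per_frame, buffer_size):
    """Track buffer occupancy when variable-size frames feed a constant-rate channel.

    All names and the drain model are illustrative assumptions.
    Returns (occupancy after each frame interval, overflow flag, underflow flag).
    """
    occupancy = 0
    history = []
    overflow = False
    underflow = False
    for bits in frame_bits:
        occupancy += bits  # the encoder writes one coded frame into the buffer
        if occupancy > buffer_size:
            overflow = True
            occupancy = buffer_size  # assumption: excess bits are simply dropped
        drained = min(occupancy, channel_bits_per_frame)
        if drained < channel_bits_per_frame:
            underflow = True  # the channel had capacity left but the buffer ran empty
        occupancy -= drained  # the channel removes bits at a constant rate
        history.append(occupancy)
    return history, overflow, underflow
```

With a channel that drains 1000 bits per frame interval, frames of 1200, 800, 1000, 1100, and 900 bits are absorbed by a 2000-bit buffer without overflow or underflow, which is the sense in which the buffer smooths minor fluctuations.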
CBR coded video has many advantages; however, CBR encoding also has certain drawbacks. One drawback is that the perceived picture quality fluctuates due to fluctuating distortion within the bitstream. As distortion increases, the perceived quality of a picture decreases. As another drawback, CBR encoding does not provide an efficient means of transmitting a video over time-varying heterogeneous networks. Such a network is characterized by varying bandwidth, by sessions that are established based on available bit-rate (ABR) among many users, or by a combination of the two. In both cases, variable bit-rate (VBR) encoding is often considered, either to provide constant or maximum quality video, or to fully utilize the capacity of the communications channel.
In U.S. Pat. No. 6,198,878, “Method and apparatus for encoding and decoding digital video data,” issued on Mar. 6, 2001 to Blawat et al., a buffer control strategy for VBR coded video stored on a fixed capacity medium was described. Blawat et al. increased quality in a first part of the entire sequence, e.g., 80% of total playing time, while maintaining a negligible loss in quality for a second part of the sequence, e.g., 20% of total playing time. Although a VBR coded video was produced, no method was described which would guarantee constant quality. Rather, their focus was on minimizing the distortion in the reconstructed video.
In U.S. Pat. No. 6,205,174, “Variable bit-rate video coding method and corresponding video coder,” issued on Mar. 20, 2001 to Fert et al., a VBR video coding method that includes an analysis pass, a prediction pass, and picture re-arrangement was described. They improved over previous VBR coders in that data from the first pass impacted the final quantization step size, as well as the location of picture types, i.e., I, P and B-frames, which was referred to as group of frames (GOP) allocation. They required multiple iterations to achieve reasonably constant quality, and indicated that a larger number of iterations would further improve the quality. However, each iteration consumed additional processing power and increased delay.
U.S. Pat. No. 5,978,029, “Real-time encoding of video sequence employing two encoders and statistical analysis,” issued on Nov. 2, 1999 to Boice, et al., described an encoding sub-system for analyzing a sequence of video frames and for deriving information. The sub-system included a control processor for analyzing gathered information and for producing a set of control parameters. A second encoding sub-system encoded each frame using a corresponding set of control parameters. They overcame the delay associated with many prior VBR encoders by gathering statistics in a first pass, and then using the statistics to perform the encoding in a second pass. Although, in principle, the encoder system described was not very different from prior multi-pass encoders, they did describe the means by which the two encoders could be coupled to ensure real-time operation.
In summary, the prior art methods primarily describe VBR coders that minimize distortion when fluctuations in the bit-rate are not a major concern.
It is evident from the prior art that extracting data from coded bitstreams during a first stage, and using the extracted data in a second stage of encoding, is a common technique. That is further described by Lin et al. in “Bit-rate control using piece-wise approximated rate-distortion characteristics,” IEEE Trans. Circuits and Systems for Video Technology, August 1998. They encode a video using a large set of quantization scales. Corresponding rate-quantizer data and distortion-quantizer data were also recorded. Using that recorded data, a curve was interpolated via linear or cubic interpolation methods. The curve was finally used to select a set of quantization scales that minimized the average distortion or distortion variation for a given rate constraint. However, their method is computationally expensive in calculating the rate-quantizer data, and furthermore, a complex search for the optimal quantization scales is needed. Consequently, this method cannot be used for real-time applications, particularly for low bit-rate streaming data.
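The idea of interpolating recorded rate-quantizer and distortion-quantizer samples can be illustrated with a much simplified sketch. It handles a single frame with piecewise-linear interpolation and an exhaustive scan, whereas the cited method selects a set of scales across frames. The function names, sample values, and the quantizer range of 1 to 31 are assumptions made for illustration only.

```python
def interp_linear(x, xs, ys):
    """Piecewise-linear interpolation through samples (xs, ys); xs must be increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)


def pick_quantizer(q_samples, rate_samples, dist_samples, rate_budget,
                   q_range=range(1, 32)):
    """Return (q, distortion) with the lowest interpolated distortion whose
    interpolated rate fits the budget, or None if no scale fits.
    Hypothetical helper, not the search used in the cited paper."""
    best = None
    for q in q_range:
        rate = interp_linear(q, q_samples, rate_samples)
        dist = interp_linear(q, q_samples, dist_samples)
        if rate <= rate_budget and (best is None or dist < best[1]):
            best = (q, dist)
    return best
```

Even in this reduced form, every sample point requires an actual encoding pass to measure rate and distortion, which is where the computational expense noted above comes from.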
FIG. 1 shows the underlying concept of most prior art VBR encoders. In one branch of an encoder 100, source-coding statistics 111 are extracted from the input video 101 by a statistics generator 110. A special case of the statistics generator 110 is a video encoder that extracts actual rate-distortion (R-D) statistics 111, possibly from many rate-distortion samples using a large set of quantization parameters. The R-D statistics 111 are sent to a statistical analyzer 120 where R-D parameters 121 for coding are determined. The R-D parameters 121 are used to perform single-layer VBR coding 130 on a copy of the input video 101 that has been delayed 140. The result is a VBR coded bitstream 150 that can be stored or transmitted over a network.
FIG. 2 shows a statistical multiplexing application of VBR coding as described in U.S. Pat. No. 6,167,084, “Dynamic bit-allocation for statistical multiplexing of compressed and uncompressed digital video signals,” issued on Dec. 26, 2000 to Wang et al. A dynamic bit-allocation method 200 allocates rates to multiple programs 201 transmitted over a CBR channel 262. Each program (video) 201 is encoded 210 in the form of either a compressed or uncompressed bitstream, possibly stored on a disk 220.
Hierarchical dynamic bit-allocation is performed using a rate control processor 240. The rate control processor first allocates bits at a super group-of-frames (GOP) level, then ultimately down to the frame level. The rate control processor 240 uses rate-distortion parameters 241 that are extracted by multiple single-layer VBR transcoders 231-232 and encoders 233-234. A target number of bits is determined in the rate control processor 240 according to frame types and program priority. Constraints on the target bit-rates are also considered in the rate control processor to prevent overflow and underflow in a buffer 260. To that end, a signal on line 261 indicates the “fullness” of the buffer 260. The target number of bits for each video program is sent to each of the multiple encoders to produce multiple single-layer VBR bitstreams 235-238. Those bitstreams are multiplexed 250, buffered 260 and typically transmitted over the CBR channel 262.
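A frame-level allocation of this general kind might look like the following sketch. It is not the method of the cited patent: the frame-type weights, the proportional split, the buffer model, and all names are hypothetical choices made only to show how priority, frame type, and buffer fullness could interact.

```python
# Hypothetical relative bit cost of each frame type (an assumption, not from the source).
FRAME_TYPE_WEIGHT = {"I": 4.0, "P": 2.0, "B": 1.0}


def allocate_targets(programs, budget_bits, buffer_fullness, buffer_size, drain_bits):
    """Split a bit budget among programs for one frame interval.

    The budget is first clamped so that fullness + budget - drain stays
    within [0, buffer_size], then shared in proportion to
    priority * frame-type weight.
    """
    max_in = buffer_size - buffer_fullness + drain_bits  # more would overflow
    min_in = max(0, drain_bits - buffer_fullness)        # less would underflow
    budget = min(max(budget_bits, min_in), max_in)
    weights = [p["priority"] * FRAME_TYPE_WEIGHT[p["frame_type"]] for p in programs]
    total = sum(weights)
    return [budget * w / total for w in weights]
```

For example, four programs with weights 8, 2, 1, and 1 share a 1200-bit budget as 800, 200, 100, and 100 bits when the buffer constraint is not active.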
For video transmission, e.g., from a video server to receivers such as televisions or computers, external bandwidth fluctuation is a major concern. The fluctuation not only impacts the quality, but it also affects delay and jitter during transmission. U.S. Pat. No. 6,085,221, “File server for multimedia file distribution,” issued on Jul. 4, 2000 to Graf described a method for transmitting multimedia files from file servers. VBR coders were used to compress the multimedia. Graf did not elaborate on the details of his VBR encoding. He simply assumed that constant perceived quality could be achieved irrespective of the coding format. However, he did describe a method of scheduling a video transmission. Also, there was no mention of methods that could be used to optimize the perceived quality of the reconstructed video.
For the most part, the methods described above have two implicit assumptions. First, a single layer encoding scheme was assumed, and second there was a limited set of parameters that could be adjusted to meet rate or distortion constraints, e.g., for MPEG-2, only quantization parameters and GOP structures, i.e., frame type and location, are considered.
Video coding standards, such as MPEG-4 for multimedia applications, see ISO/IEC 14496-2:1999, “Information technology—coding of audio/visual objects, Part 2: Visual,” provide several new coding tools, including tools to improve the coding efficiency, and tools that support object-based coding and error-resilience.
One of the main problems in delivering video content over networks is adapting the content to meet particular constraints imposed by users and networks. Users require playback with minimal variation in perceived quality. However, dynamic network conditions often make this difficult to achieve.
Fine granular scalable (FGS) coding has been adopted by the MPEG-4 standard. The tools that support FGS coding are specified in an amendment of the MPEG-4 standard, ISO/IEC 14496-2:1999/FDAM4, “Information technology—coding of audio/visual objects, Part 2: Visual.” An overview of FGS coding is described by Li in “Overview of Fine Granularity Scalability in MPEG-4 video standard,” IEEE Trans. Circuits and Systems for Video Technology, March 2001.
FGS coding is a radical departure from traditional scalable coding. With traditional scalable coding, the content was coded into a base layer bitstream and possibly several enhancement layer bitstreams, where the granularity was only as fine as the number of enhancement layer bitstreams that were formed. The resulting rate-distortion curve resembles a step-like function.
In contrast, FGS coding provides an enhancement layer bitstream that is continuously scalable. Providing a continuously scalable enhancement layer bitstream is accomplished by a bit-plane coding method that uses discrete cosine transform (DCT) coefficients. Bit-plane coding allows the enhancement layer bitstream to be truncated at any point. In that way, the quality of the reconstructed video is proportional to the number of bits of the enhancement layer bitstream that are decoded.
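Why truncation degrades quality gracefully can be seen in a toy bit-plane example. The sketch below works on non-negative integer coefficient magnitudes only; sign handling and the variable length coding of each plane used by actual FGS coders are omitted, and the function names are hypothetical.

```python
def to_bit_planes(coeffs):
    """Split non-negative integer magnitudes into bit-planes, most significant first."""
    n_planes = max(coeffs).bit_length()
    return [[(c >> p) & 1 for c in coeffs] for p in range(n_planes - 1, -1, -1)]


def from_bit_planes(planes, n_planes):
    """Rebuild magnitudes from the leading planes of an n_planes representation.

    Planes that were truncated away are treated as all zeros.
    """
    values = [0] * len(planes[0]) if planes else []
    for i, plane in enumerate(planes):
        shift = n_planes - 1 - i  # weight of this plane
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return values
```

For the magnitudes 13, 6, 1, 0 there are four planes. Decoding only the first two yields 12, 4, 0, 0, and each additional plane reduces the remaining error until the values are recovered exactly.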
FIG. 3 shows a conventional FGS encoder 300. An input video 301 is provided to a typical base layer encoder 310. The base layer encoder includes DCT 311, Quantization (Q) 312, motion compensation (MC) 318, inverse quantization (Q−1) 313, inverse DCT (IDCT) 314, motion estimation 317, clipping 315, frame memory 316, and variable length coder (VLC) 319 components. The output of the base layer encoder 310 is a base layer bitstream 302 having some predetermined minimum constant bit-rate. Typically, the CBR is very low, for example, 20 Kbps or less. Thus, the base layer bitstream can be transmitted over high and low bandwidth channels.
The enhancement layer bitstream is generated by subtracting 321 reconstructed frames of the base layer bitstream 302 from the input video 301. This yields an FGS residual signal 322 in the spatial domain. Enhancement layer encoding is then applied to the residual signal 322. The enhancement encoding includes a DCT 330, followed by bit-plane shifting 340, a maximum operation 350, and bit-plane VLC coding 360 to produce the enhancement-layer bitstream 303.
FIG. 4 shows an FGS decoder 400 that can be applied to the base layer bitstream 302 and the enhancement layer bitstream 303 to produce a reconstructed base layer video 491 and a reconstructed enhancement layer video 492. The decoder 400 includes a variable length decoder (VLD) 410, inverse quantizer 415, inverse DCT 420, motion compensation 425, frame memory 430, and clipping 435 components. An FGS residual signal 456 is reconstructed by passing the enhancement-layer bitstream 303 through bit-plane VLD 445, bit-plane shift 450 and IDCT 455 components. The FGS residual signal 456 can then be added 457 to the reconstructed base layer signal 436 to yield the enhancement video 492. The combined signal is clipped 460 to ensure that the signal is bounded, i.e., 8-bit pixel values must be in the range [0, 255].
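The final add-and-clip step of the decoder can be written in a few lines. This sketch assumes flat lists of integer pixel values and a hypothetical function name; it covers only the addition 457 and clipping 460, not the decoding stages that precede them.

```python
def reconstruct_enhanced(base_pixels, residual):
    """Add the decoded FGS residual to the base layer and clip to the 8-bit range."""
    return [min(255, max(0, b + r)) for b, r in zip(base_pixels, residual)]
```

Without the clip, a base pixel of 250 with a residual of 10 would give 260, which cannot be represented as an 8-bit value.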
A selective enhancement method to control the bit-plane shifting in the enhancement layer of the FGS coded video bitstream was described in U.S. Pat. No. 6,263,022, “System and method for fine granular scalable video with selective quality enhancement,” issued on Jul. 17, 2001 to Chen, et al. There, the quantization parameter used for coding the base layer video also determined the corresponding shifting factor. The bit-planes associated with macroblocks that were deemed more visually important were shifted higher.
A key point to note is that the bit rate of the base layer bitstream is some predetermined minimum. The enhancement layer bitstream covers the range of rates and distortions from the minimum to near lossless reconstruction. Also, after the enhancement layer bitstream has been generated, it can be stored and re-used many times. Depending on, e.g., network characteristics, an appropriate number of bits can be allocated to each frame and transmitted over the network, taking current network conditions into consideration. It is important to note, however, that there is no quantization parameter to adjust in that scheme.
The standard does not specify how rate allocation, or equivalently, the truncation of bits on a per frame basis, is to be done. The standard only specifies how the scalable bitstream is decoded. Additionally, traditional methods that have been used to model the rate-distortion (R-D) characteristics, e.g., based on quantization parameters, no longer hold with the bit-plane coding scheme used by FGS coding. As a result, the quality of the reconstructed video can vary noticeably.
Because differential sensitivity is key to human visual perception, it is important to minimize the variation in perceived quality rather than overall distortion. Optimal rate allocation can be done by minimizing a cost for an exponential R-D model. This leads to constant quality among frames, see Wang, et al., “A new rate allocation scheme for progressive fine granular scalable coding,” Proc. International Symposium on Circuits and Systems, 2001. However, this prior art model-based approach does not work well on low bit-rate signals.
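One way to see why an exponential model leads to constant quality is the following sketch. It assumes a generic model D_i(R_i) = a_i * exp(-b * R_i) with a decay constant b shared by all frames, which is not necessarily the model of the cited paper. Minimizing the sum of D_i subject to a fixed total rate requires equal slopes, and since each slope is -b * D_i, the distortions come out equal. The closed form below ignores the constraint that rates be non-negative, and all names are hypothetical.

```python
import math


def allocate_rates(a, b, total_rate):
    """Closed-form rates that minimize sum(a_i * exp(-b * R_i)) for a fixed sum of R_i.

    Assumes a shared decay constant b and does not enforce R_i >= 0.
    """
    n = len(a)
    mean_log_a = sum(math.log(x) for x in a) / n
    return [total_rate / n + (math.log(x) - mean_log_a) / b for x in a]


def distortion(a_i, b, r_i):
    """Distortion of one frame under the assumed exponential model."""
    return a_i * math.exp(-b * r_i)
```

Frames with larger a_i, i.e., frames that are harder to code, receive more bits, and every frame ends up at the same modeled distortion.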
Therefore, there is a need for a scalable coder that can provide an output bitstream that has a constant quality. Furthermore, it is desired to provide techniques that can measure R-D characteristics in a bit-plane coded bitstream so that rates can be adjusted to meet quality requirements in real-time.