1. Field of the Invention
The invention is broadly directed to methods and apparatus for adapting variable-rate bitstreams to fixed-rate channels. In particular, the invention pertains to output rate control for bitstreams that are compression-coded in real time.
2. Discussion of Related Art
Feed-forward rate control is advantageous for adapting the encoded variable-rate signals output by video compressors to the fixed rates required by video transmission channels. This feed-forward control sets the transmitted bit rate a priori, rather than relying on feedback signals reporting the condition of an output buffer from which the codec's signal is transmitted. However, feed-forward rate control has seldom been used in practical devices, because conventional feed-forward rate control is implemented as a bit-allocation optimization process that uses computationally costly linear-programming methods.
The advantage of a priori rate setting is the greater accuracy it provides, eliminating the slippage produced by the lag inherent in after-the-fact output buffer occupancy determinations. This is particularly advantageous for codecs that provide semantically-determined differential compression of various areas within a video image. Not only does the compressor's current bit rate vary to provide the greater definition required by semantic compression in some areas of an image, the moving average of that bit rate output from that compressor is also likely to vary. Feed-forward control responds more quickly to such rate variations. Thus feed-forward control can eliminate rate errors before the rate errors induced by the variability of differential-compression bit rates can accumulate.
The output or "rate" buffer that holds the encoded signal that is being transmitted by the codec smooths the bit rate of the transmitted signal. Cumulative rate errors impair the codec's ability to maintain an occupancy level in the rate buffer that can provide the desired global bit rate. This rate is an average defined by dividing the channel bit rate by the nominal video frame rate of the video signal. The buffer overflows that occur when image semantics cause the bit rate to momentarily exceed the desired bit rate by too much, or to exceed it by only a little but for too long, disrupt image continuity.
Semantic compression also implicitly imposes additional image-quality constraints on a video-compression codec. In particular, errors in high-definition areas of the image become more critical, because better definition has been provided in those areas. For example, when videophone units process the facial areas of "talking head" images, the finer quantization of those facial areas that is needed for perceptual clarity unfortunately also makes the rate errors caused by feedback lag more objectionable in those very same, very important areas.
Compression Standards
Recommendation H.263 of the International Telecommunication Union (ITU-T) specifies a compression format for transmitting video images over low bit-rate channels, that is, channels providing only bit rates of 64 kbps or less. Discrete Cosine Transform (DCT) codes are used to encode the pixel luminance and chrominance data for each image frame, and "motion vectors" compensate for translational motion between adjacent frames. Both are also used in the well-known Moving Picture Experts Group (MPEG) standard.
The compression encoding format specified by H.263 is based on segmentation of individual image frames in the video-frame sequence of the incoming video signal into 16×16-pixel (pel) motion-compensation macroblocks. The area of each motion-compensation macroblock coincides with four 8×8-pel DCT-coded image blocks, that is, four 8×8-pel luminance (Y) blocks, and with 8×8-pel (U) and (V) chrominance blocks, each of which corresponds to the entire area of the 16×16-pel macroblock.
The H.263 standard permits the use of Syntax-based Arithmetic Coding (SAC), a more compact alternative to variable-length RLA encoding, and of "PB-frames", which are a hybrid of the sequentially-predicted P-frames and the bidirectionally-predicted B-frames defined by the MPEG standards and which require a "delta" vector in addition to the one or four motion vectors transmitted for each macroblock. The H.263 standard also defines a table of DQUANT values, a limited set of step sizes between scalar image quantizers applied to the DCT coefficients of each 8×8 coding block, both the AC block-prediction residuals and the DC coefficients of the intraframe luminance-coding blocks. These quantizer step size values or "deltas" define the difference between coding levels used for pixels in macroblocks belonging to different semantic regions in an image frame. The quantized block values are Huffman "entropy" coded for compression, and these Huffman codes and the quantizer step size used for scaling the DCT values of the macroblock are transmitted in the resulting compressed bitstream.
It is widely accepted that Adaptive Vector Quantization (AVQ) using updatable differential scalars in accordance with the H.263 standard gives H.263 better compression performance than the basic scalar quantization of H.263 or that of CCITT Recommendation H.261. AVQ is superior not only at the very low PSTN bit rates of the public "switched telephone" networks, but even at ISDN rates of 128 kbps and higher on the "integrated data services" networks. According to Shannon rate-distortion theory, such vector-based quantization will always be superior to scalar quantization, so long as the vector-quantizer algorithm used is well-suited to the statistical profile of the signal source. For videophone units, the characteristic source material is "talking heads", and adaptive vector coding is particularly well-suited to the semantics of this relatively homogeneous material.
The "Conferencing Codec"
The video encoder/decoder platform (codec) disclosed in the commonly-owned U.S. patent application, Ser. No. 08/727,862, filed Oct. 8, 1996, and the commonly-owned U.S. patent application, Ser. No. 08/500,672, filed Jul. 10, 1995, both incorporated herein by reference, is directed toward the compression requirements of a video-conferencing system (the "conferencing codec"). Thus it was not designed for compression-coding video images at the very low PSTN-type bit rates that must be accommodated by 2D codecs used in consumer videophone communications devices, that is, bit rates below 25.6 kbps. Instead, this "conferencing codec" and other 3D codecs used for video-conferencing are designed to transmit over the 128 kbps sub-band channels of the type available on ISDN networks.
The "conferencing codec" uses chrominance tracking and ellipse cross-correlation, supplemented by symmetry detection, for implementing a differential, finer quantization in the "face and hands" areas of each image frame. This "semantic" differential coding process, shown schematically in FIG. 1, is partially frame-independent in that the semantic structure of previous frames is used to provide default values for later frames.
The "conferencing codec" uses a hybrid AVQ/scalar encoding scheme as an alternative to the purely scalar quantizer scheme defined in H.263. This AVQ encoding scheme uses a fixed-size adaptive codebook of 64 different, dimensional codewords to encode macroblocks. The codebook used by the decoder is updated to better match the statistical profile of subsequent frames in a frame sequence using the codes transmitted for each frame. For each frame, the "conferencing codec" selects, from among the current codebook entries, the index of the codeword that is closest to the value of each 8×8 DCT-code block, as calculated using the sum of absolute differences (SAD) of the respective pixel values in the 8×8 block. However, the AVQ residual error is then quantized by the "conferencing codec", using the same step size as the H.263 scalar quantizers use, and entropy coded in the same manner. Thus an equivalent degree of distortion is introduced whenever the scalar and vector residuals are equivalent.
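The least-distance codeword selection described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the codec's actual implementation: the names sad and nearest_codeword are hypothetical, blocks and codewords are treated as flat lists of coefficients, and ties are resolved in favor of the lowest codebook index.

```python
from typing import List, Sequence, Tuple


def sad(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of absolute differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_codeword(
    block: Sequence[int], codebook: Sequence[Sequence[int]]
) -> Tuple[int, List[int]]:
    """Return (index, residual) for the codeword with the smallest SAD.

    Assumption: the residual is the element-wise difference between the
    block and the chosen codeword; it would then be scalar-quantized.
    """
    if not codebook:
        raise ValueError("codebook must not be empty")
    # min() keeps the first minimum, so ties go to the lowest index.
    best = min(range(len(codebook)), key=lambda i: sad(block, codebook[i]))
    residual = [x - y for x, y in zip(block, codebook[best])]
    return best, residual


# Toy 4-element vectors stand in for 64-coefficient 8x8 blocks.
toy_codebook = [[0, 0, 0, 0], [10, 10, 10, 10], [5, 5, 5, 5]]
index, residual = nearest_codeword([9, 11, 10, 8], toy_codebook)
```

In this toy case the second codeword (index 1) is selected, and only the small residual remains to be quantized and entropy coded.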
Furthermore, to satisfy Shannon's criterion for efficient VQ coding, each codeword vector should be the centroid of its respective space in the codebook relative to the distribution of the values being coded. Thus, in general, the "conferencing codec" replaces the most infrequently used codeword vectors and the most distorting ones, i.e., codeword vectors having the largest average residual value in their respective frames, with the transmitted scalar-quantizer values. A quality quotient or code-space centroid is independently calculated by each encoder and decoder for each codeword in the current codebook, and used to update the codebook so that it adapts to changes in image semantics. The codebook index of the selected codeword and the scalar quantized value of the residual error value are then entropy encoded.
In the quality-quotient AVQ system, the codebooks are updated by replacing infrequently used codewords having a large average distortion with the scalar-quantized DCT values. The frequency of use and the average distortion of each codeword are recorded. The distortion is the size of, that is, the "energy" in, the residue of each vector-quantized value transmitted with each transmitted codebook index value. The usage iteration value is a predetermined value that is assigned to each codeword when it is inserted in the codebook and is decremented each time the codeword is not used in subsequent frames. In this AVQ system, the codeword having the lowest quotient when that codeword's current usage value is divided by its current average distortion value is replaced.
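The quality-quotient replacement rule can be illustrated with a short Python sketch. The class and function names are hypothetical, and two details are assumptions not stated in the text: the usage value is not decremented below zero, and a codeword with zero average distortion is treated as having an infinite quotient so that it is never chosen for replacement.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class CodewordStats:
    usage: int             # preset on insertion, counts down when unused
    avg_distortion: float  # mean residual "energy" recorded for the codeword


def age_unused(stats: List[CodewordStats], used: Set[int]) -> None:
    """Decrement the usage value of every codeword not used in this frame."""
    for i, s in enumerate(stats):
        if i not in used:
            s.usage = max(0, s.usage - 1)  # assumption: floor at zero


def quality_quotient(s: CodewordStats) -> float:
    """Usage value divided by average distortion."""
    if s.avg_distortion == 0:
        return float("inf")  # assumption: a perfect codeword is kept
    return s.usage / s.avg_distortion


def replacement_index(stats: List[CodewordStats]) -> int:
    """Index of the codeword with the lowest quality quotient."""
    return min(range(len(stats)), key=lambda i: quality_quotient(stats[i]))


stats = [CodewordStats(8, 2.0), CodewordStats(2, 4.0), CodewordStats(5, 1.0)]
victim = replacement_index(stats)   # quotients are 4.0, 0.5 and 5.0
age_unused(stats, used={0})         # codewords 1 and 2 were not used
```

The rarely used, high-distortion codeword at index 1 has the lowest quotient and would be the one replaced by the scalar-quantized DCT values.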
Alternatively, in the code-space AVQ system, the value of each codeword is replaced by the average of the sum of the current value and each of its residuals in a given frame each time that codeword is used in a frame. Thus, for each image frame, each codeword maintained is the current centroid of its partition of the codebook's range of values. However, in either updating system, the codeword selected from the current codebook is the one that is the "least distance" away from the DCT block currently being coded. That "least distance" is the sum of the absolute differences between the respective sets of corresponding coefficients for the DCT code and the codeword vector. The residual of that vector quantization (VQ), that same difference between sets, is then scalar-coded by quantization.
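One possible reading of the code-space (centroid) update is sketched below in Python. It is an interpretation, not a statement of the codec's actual arithmetic: the new codeword is taken to be the mean of (codeword + residual) over every use of that codeword in the frame, which equals the mean of the blocks that were mapped to it. The function name update_centroid is hypothetical.

```python
from typing import List, Sequence


def update_centroid(
    codeword: Sequence[float], residuals: Sequence[Sequence[float]]
) -> List[float]:
    """Move a codeword to the centroid of the blocks it encoded this frame.

    Each residual is assumed to be (block - codeword), so the mean of
    (codeword + residual) is the codeword plus the mean residual.
    An unused codeword (no residuals) is returned unchanged.
    """
    if not residuals:
        return list(codeword)
    n = len(residuals)
    return [
        c + sum(r[k] for r in residuals) / n
        for k, c in enumerate(codeword)
    ]


# A 2-element codeword used twice in one frame.
updated = update_centroid([10, 20], [[2, -4], [4, 0]])
```

Here the mean residual is (3, -2), so the codeword moves from (10, 20) to (13, 18).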
Finally, to optimize the overall performance of the video compressor for the current frame in the video sequence, the number of bits used for the H.263 entropy-encoded AVQ vector index and its VQ residual is compared to the number of bits used by the H.263 entropy-encoded scalar-quantized DCT. Bits that encode the pixels in the given block are thus dynamically allocated so that the entropy-coded quantization result requiring the fewest bits is transmitted. However, since the vector's residual value and the scalar, DCT-coded value are both quantized using the same Q_p value, an equivalent distortion is introduced into the vector and the scalar coding paths of this hybrid system.
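The per-block choice between the two coding paths reduces to a bit-count comparison, sketched here in Python. The function name is hypothetical, the bit counts are assumed to come from the entropy coder, and resolving a tie in favor of the scalar path is an assumption made only for this illustration.

```python
from typing import Tuple


def choose_coding_path(
    vq_index_bits: int, vq_residual_bits: int, scalar_bits: int
) -> Tuple[str, int]:
    """Pick the path whose entropy-coded output needs the fewest bits.

    The VQ path costs the codebook index plus its quantized residual;
    the scalar path costs the quantized DCT block alone.
    """
    vq_total = vq_index_bits + vq_residual_bits
    if vq_total < scalar_bits:
        return ("vq", vq_total)
    return ("scalar", scalar_bits)  # assumption: ties go to scalar


path, bits = choose_coding_path(6, 30, 40)
```

Because both paths use the same quantizer, the comparison affects only the bit count, not the distortion introduced.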
MFRC Frame Rate Control
The "conferencing codec" also optimizes frame rates so that the image quality provided by the H.263 options is further improved by allowing increased detail in selected frames of the compressed video image, in a hybrid feed-forward CFR/VFR rate control scheme referred to as Model-assisted Frame Rate Control (MFRC).
In CFR bit-rate control, a fixed bit budget "B" for each frame, which is stated as bps/fps, is allocated to the various areas of the frame by a global COUNTBITS routine. COUNTBITS is an iterative embedded-loop function that determines the number of bits "C" required to encode a frame having respective predicted quantization parameters Q_i and Q_e assigned to the facial and background regions of the current video frame.
To determine suitable values for the predicted Q_i and Q_e quantizers, these parameters are initially assigned low nominal values. If the result is not C ≤ B, the outer loop first increases Q_i by one or two units and makes Q_e greater than Q_i by a DQUANT value Q_d, so that Q_d = Q_e - Q_i. If this pair of quantizers does not satisfy C ≤ B, COUNTBITS then increases Q_e by one or two units iteratively until the largest quantizer value Q_max is reached. If these values of Q_e do not produce a result C that meets the budget B, the outer loop then increases Q_i by one or two units once again, sets Q_e = Q_i + Q_d, and tests for C ≤ B. The inner and outer loops iterate until the bit-budget constraint B is satisfied or until Q_i + Q_d = Q_max, the value that defines the maximum acceptable degree of quantization coarseness.
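The nested search described above might be sketched in Python as follows. This is an illustration only: the function name search_quantizers and its parameters are hypothetical, the bit counter is passed in as a callable standing in for COUNTBITS, and the toy bit model used in the example is invented purely to exercise the loops.

```python
from typing import Callable, Optional, Tuple


def search_quantizers(
    count_bits: Callable[[int, int], int],
    budget: int,
    q_i_start: int,
    q_e_start: int,
    q_d: int,
    q_max: int,
    step: int = 1,
) -> Optional[Tuple[int, int]]:
    """Find (Q_i, Q_e) with count_bits(Q_i, Q_e) <= budget, or None.

    Outer loop: coarsen the facial quantizer Q_i and reset Q_e = Q_i + Q_d.
    Inner loop: coarsen the background quantizer Q_e up to Q_max.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    q_i, q_e = q_i_start, q_e_start
    if count_bits(q_i, q_e) <= budget:      # low nominal values suffice
        return q_i, q_e
    while q_i + step + q_d <= q_max:        # stop once Q_i + Q_d hits Q_max
        q_i += step
        q_e = q_i + q_d
        while q_e <= q_max:
            if count_bits(q_i, q_e) <= budget:
                return q_i, q_e
            q_e += step
    return None                             # CFR alone cannot meet budget


def toy_bits(q_i: int, q_e: int) -> int:
    """Invented model: bits fall as either quantizer gets coarser."""
    return 6000 // q_i + 4000 // q_e


found = search_quantizers(toy_bits, budget=1000, q_i_start=2,
                          q_e_start=2, q_d=2, q_max=31)
```

A return value of None corresponds to the case in which the coarseness limit is reached before the budget is met, so that some form of frame-rate modification is needed.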
The CFR process provides the fixed frame rates needed for maintaining good lip-sync in compressed images, particularly at bit rates below 25.6 kbps, such as those used for videophone applications. The resolution needed in complex-foreground and moving-background frames, however, may make it impossible to reduce the sum C of Q_i and Q_e sufficiently to satisfy the bit budget constraint B merely by using this CFR algorithm, because Q_i + Q_d > Q_max where 2Q_i + Q_d < B. If Q_i + Q_d reaches the limit value Q_max before the budget constraint value B is satisfied, some type of Variable Frame Rate (VFR) is required.
For the more complex foregrounds, and for moving backgrounds typical of mobile video applications, the global CFR rate control process must be supplemented by a variable frame rate control (VFR) routine that reduces the number of frames actually transmitted in the compressed, nominally 7.5 or 5 fps signal. Frame skipping allows the compressor to accommodate more bits of detail in some transmitted frames than are allowed by the CFR bit budget for those frames. However, a throughput of "x" bps can only be changed in steps x/30 bits wide where the input frame rate is 30 fps.
Because the frame rate can only be changed by whole frames, the VFR-modified video-image signal must be transmitted through an output buffer that stores more than one frame at once. This permits the buffer's output bit rate to be somewhat independent of its input bit rate, smoothing the coarse x/30 bit adjustment steps. The output buffer's smoothing step adds, at most, a 33-ms delay when the input frame rate is 30 fps.
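The arithmetic behind the bit budget, the x/30 adjustment step and the 33-ms figure can be checked with a few lines of Python. The function names are hypothetical, and the 24 kbps channel rate in the example is an illustrative value chosen for round numbers; it does not come from the text.

```python
def frame_bit_budget(channel_bps: float, frame_rate_fps: float) -> float:
    """Per-frame bit budget B, stated as bps/fps."""
    return channel_bps / frame_rate_fps


def skip_step_bits(channel_bps: float, input_fps: float = 30.0) -> float:
    """Bits freed by skipping one input frame: x/30 at a 30 fps input."""
    return channel_bps / input_fps


def frame_interval_ms(input_fps: float = 30.0) -> float:
    """Duration of one input frame in milliseconds."""
    return 1000.0 / input_fps


budget = frame_bit_budget(24000, 7.5)   # bits per transmitted frame
step = skip_step_bits(24000)            # granularity of a frame skip
delay = frame_interval_ms()             # one 30 fps frame period
```

At this illustrative rate the budget is 3200 bits per frame at 7.5 fps, each skipped input frame shifts 800 bits, and one frame period at 30 fps is about 33.3 ms, consistent with the delay bound stated above.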
The VFR routine both modifies the frame-skipping pattern used to reduce the frame rate of the compressed signal and maintains a target buffer occupancy level for the codec such that there is always some unused buffer space and there are always bits available to be transmitted. This margin in the storage space and data available in the buffer prevents discontinuities in the video image, as noted above with reference to differential compression.
Model-assisted Frame Rate Control (MFRC) in the "conferencing codec" uses a constant frame rate whenever the relation between the need for fine quantization in the image and the value of the background coarseness constraint Q_max alone satisfies the budget constraint C ≤ B, so that frame rate modification is unnecessary.
Unlike conventional schemes in which the quantization level for each macroblock is simply a response to the sensed occupancy level in the output buffer, the MFRC quantizers are assigned to each frame globally. Also, the quantizers vary within that frame according to semantic values, rather than allowing such local quantizer changes to be determined by feedback from the buffer.
Videophone Systems and Image Quality Issues
Video-image frames having a standard rate of 30 fps are compressed and output by the encoders of videophone codecs at a fixed rate of 7.5 or 5 fps. In addition to accommodating these current, very low PSTN bit rates, the consumer-oriented videophone codecs made by different manufacturers must be very flexible, so as to be compatible with each other. This means that they must be operable in real time at a variety of coding rates and input resolution levels, depending on the characteristics of the equipment at the other end of a given transmission line and the communications channels available to the consumer for a given transmission.
Videophone codecs should accommodate the standard CIF (352×288), SIF (360×240), QCIF (176×144), QSIF (180×120) and sub-QCIF (128×96) digital formats and both constant (CFR) and variable frame rate (VFR) coding schemes. To process such diverse signal formats, the codec must be able to select quantizers for VFR global bit-rate allocation consistent with the requirements of a given CFR system.
The constant frame rates of CFR coding in some ways are preferable for coding simple, static background "talking heads" video material at rates above 16 kbps, for the sake of simplicity and efficiency in codec design and operation. However, even at these higher speeds, image quality is degraded by the CFR coding mode when the scene becomes more complex. At very low bit rates the DCT-based compression schemes used by H.263 codecs suffer from two notable types of image degradation caused by coarse quantization: block artifacts and mosquito noise. The block artifacts are square patterns that represent a discontinuity in average intensity across the boundaries between neighboring DCT blocks. These are most objectionable in "flat" areas, areas where there is little detail to camouflage the boundaries of these blocks.
In contrast to the block artifact problem, mosquito noise smears sharp edges and fine detail by overfiltering them. This occurs in systems where images are compressed for low bit-rate transmission without sufficient regard for the semantics-driven variation in the optimal filtering level within each image. The result is artificial-looking facial expressions that lack detail around the eyes and are generally flat in visual texture. To make videophone technology attractive to consumers the system specified by H.263 must be extended to address code-boundary artifacts and the semantics of image content.
The ITU-T H.263 standard for low bit-rate channels, unlike CCITT H.261, limits the difference between the quantizers in adjacent blocks (i) and (i-1), referred to as the DQUANT value, to a list of predetermined values "D":

    (Q_p[i] - Q_p[i-1]) ∈ D
A set of four positive and negative DQUANT values is provided in the H.263 standard operating mode. As shown in FIG. 2, a selectable alternative mode may also be provided to implement proprietary enhancements, for example, additional DQUANT values.
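The DQUANT constraint can be checked mechanically, as in the Python sketch below. Two assumptions are made: the set D is taken to be {-2, -1, +1, +2}, the four values of the standard H.263 mode, and a difference of zero (quantizer unchanged between adjacent blocks) is treated as always permitted. The function name dquant_violations is hypothetical, and an alternative mode with additional values could be modeled by passing a larger set.

```python
from typing import FrozenSet, List, Sequence

# Assumption: the four DQUANT values of the standard H.263 mode.
STANDARD_DQUANT: FrozenSet[int] = frozenset({-2, -1, 1, 2})


def dquant_violations(
    quantizers: Sequence[int],
    allowed: FrozenSet[int] = STANDARD_DQUANT,
) -> List[int]:
    """Return each index i where Q_p[i] - Q_p[i-1] is not permitted.

    A zero difference means the quantizer is unchanged and is accepted.
    """
    bad = []
    for i in range(1, len(quantizers)):
        delta = quantizers[i] - quantizers[i - 1]
        if delta != 0 and delta not in allowed:
            bad.append(i)
    return bad


# Differences are +2, 0, -1 and +4; only the last step is too large.
violations = dquant_violations([8, 10, 10, 9, 13])
```

A rate controller that assigns different quantizers to facial and background regions would need every step between adjacent blocks to pass such a check.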
The conferencing codec produces a constant bit rate and constant frame rate, so long as that can be achieved, using the maximum quantizer step size consistent with that H.263 constraint on the DQUANT value. This strategy maximizes image detail in areas receiving fine quantization. The conferencing codec also provides semantic prefiltering of the image being coded, and semantic postfiltering on the receiving end, to reduce the smearing and block edge artifacts encountered when images are compressed for transmission.