Video encoding, with compression, enables storing, transmitting, and processing audio-visual information with fewer storage, network, and processor resources. The most widely used video compression standards include MPEG-1 for storage and retrieval of moving pictures, MPEG-2 for digital television, and MPEG-4 and H.263 for low-bit rate video communications, see ISO/IEC 11172-2:1991, “Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbps,” ISO/IEC 13818-2:1994, “Information technology—generic coding of moving pictures and associated audio,” ISO/IEC 14496-2:1999, “Information technology—coding of audio/visual objects,” and ITU-T, “Video Coding for Low Bitrate Communication,” Recommendation H.263, March 1996.
These standards are relatively low-level specifications that primarily deal with a spatial compression of images or frames, and the spatial and temporal compression of sequences of frames. As a common feature, these standards perform compression on a per-image basis. With these standards, one can achieve high compression ratios for a wide range of applications.
Interlaced video is a commonly used scan format for television systems. In interlaced video, each frame of the video is divided into a top-field and a bottom-field. The two interlaced fields represent odd- and even-numbered rows or lines of picture elements (pixels) in the frame. The two fields are sampled at different times to enhance the temporal smoothness of the video during playback. Compared to a progressive video scan format, interlaced video has different characteristics and provides more encoding options.
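The field structure described above can be illustrated with a minimal sketch. It assumes a frame is held as a list of pixel rows and that rows are counted from zero, so that the top field takes rows 0, 2, 4, and so on; the function names split_fields and merge_fields are hypothetical and are not taken from any standard.

```python
def split_fields(frame):
    """Split a frame (a list of pixel rows) into its two interlaced fields.

    Assumption: rows are indexed from 0 and the top field holds rows 0, 2, 4, ...
    """
    top_field = frame[0::2]
    bottom_field = frame[1::2]
    return top_field, bottom_field


def merge_fields(top_field, bottom_field):
    """Re-interleave two fields into a single frame (inverse of split_fields)."""
    frame = []
    for i in range(len(top_field) + len(bottom_field)):
        source = top_field if i % 2 == 0 else bottom_field
        frame.append(source[i // 2])
    return frame


frame = [[0, 0], [1, 1], [2, 2], [3, 3]]
top, bottom = split_fields(frame)
```

Because the two fields are sampled at different times, an encoder may treat them either jointly (frame mode) or separately (field mode), which is the source of the additional encoding options.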
At the macro block level, a variety of modes can be used to encode a video, depending on the coding standard. For example, in order to support interlaced video sequences, the MPEG-2 standard has several different macro block coding modes, including intra mode, no motion compensation (MC) mode, frame/field motion compensation inter mode, forward/backward/interpolate inter mode, and frame/field DCT mode. As an advantage, the multiple modes provide better coding efficiencies due to their inherent adaptability.
The encoding tools included in the MPEG-2 standard are described by Puri et al., “Adaptive Frame/Field Motion Compensated Video Coding,” Signal Processing: Image Communications, 1993, and Netravali et al., “Digital Pictures: Representation Compression and Standards,” Second Edition, Plenum Press, New York, 1995.
In the MPEG-2 standard, after the picture-level coding mode, i.e., frame-picture or field-picture, is determined, each macro block (MB) in the P- or B-frame can be coded by several different modes. Each mode corresponds to a specified motion estimation strategy, and either a field-based DCT transform or a frame-based DCT transform is applied. In the TM5 reference encoder, the MB mode decision is based only on the sum of absolute differences (SAD) of the motion estimation and the corresponding variance in texture.
FIG. 1 shows the MB mode decision in a TM5 encoder for a P-type frame picture. Here, the input modes 101 depend on the picture structure type (P or B), and picture mode (frame or field). A “best inter mode” is selected 110 according to a sum 115 of absolute differences (SAD). For example, for a P-type frame picture, there are three inter modes: field 111, frame 113, and dual motion vector (DMV) 112. If the SAD of field mode is the smallest of the three, then field mode is selected as the best inter mode 118. The best inter mode is then compared with intra mode 121 and a mode that just copies the co-positional MB of the previous frame (MV=0) 122 as the prediction. Based on the texture variance and some empirical equations 130, a final mode 140 is selected. In the TM5 encoder, the difference in motion vector coding rate between modes is not considered. Depending on the size of the motion search window and the picture type, the rate difference of the motion vectors corresponding to different modes can be tens of bits, which is significant.
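The first stage of this decision, selecting the inter mode with the smallest SAD, can be sketched as follows. This is only an illustration under stated assumptions: blocks are lists of pixel rows, the predictions for each mode are supplied by the caller, and the names sad and best_inter_mode are hypothetical. The later comparison against intra and MV=0 modes using texture variance is not modeled here.

```python
def sad(block, prediction):
    """Sum of absolute differences between a block and its prediction."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block, prediction)
        for a, b in zip(row_a, row_b)
    )


def best_inter_mode(block, predictions):
    """Return the mode name whose prediction has the smallest SAD.

    `predictions` maps a mode name to the motion-compensated prediction
    produced by that mode (hypothetical inputs for illustration).
    """
    return min(predictions, key=lambda mode: sad(block, predictions[mode]))


block = [[10, 10], [10, 10]]
predictions = {
    "frame": [[10, 12], [10, 10]],  # SAD = 2
    "field": [[10, 10], [10, 11]],  # SAD = 1
    "dmv": [[0, 0], [0, 0]],        # SAD = 40
}
chosen = best_inter_mode(block, predictions)
```

Note that the selection depends on the prediction error alone: the number of bits needed to code each mode's motion vectors never enters the comparison, which is the limitation pointed out above.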
After all of the MB modes are determined, the DCT type of each MB is estimated based on the spatial difference between the top and bottom field parts of each MB. For the field picture, the DCT type is fixed to the field type. For the frame picture, the DCT type can be either field DCT or frame DCT. In the TM5 encoder, two parameters of the top and bottom field parts are extracted. These are the sum of pixel values and the sum of the squares of pixel values. The two parameters of both top and bottom field parts of each MB are combined to estimate the DCT type of the MB. However, the optimal mode decision should be based on both the rate and distortion (RD) information.
Because different modes have different motion vectors, which correspond to different coding rates, the MB mode decision in the prior art TM5 encoder is not optimal. In a conventional rate control method such as TM5, rate control is obtained by adjusting the quantization scales based on buffer fullness and localized texture variance. It is independent of the mode and DCT type decision, and therefore is not optimal either. Moreover, it can be shown that the TM5 DCT type estimation method is not accurate. Hence, an effective rate control method combined with MB mode decision is desired.
U.S. Pat. No. 5,909,513 “Bit allocation for sequence image compression” issued to Liang et al. on Jun. 1, 1999 describes a method and system for allocating bits for representing blocks that are transmitted in an image compression system. There, the bit allocation is obtained by minimizing a cost function cost=D+λR, where D is the total distortion for a frame, R is a desired total number of bits for the frame, and λ is a Lagrange multiplier obtained by a bisection-based exhaustive search method. The Lagrange multiplier value λ can be adjusted block by block by a feedback technique.
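As a rough illustration of how a bisection search over λ can drive a cost of the form D+λR toward a bit budget, consider the sketch below. It is a generic textbook-style example and not the method of the cited patent: each block is assumed to offer a small list of (rate, distortion) operating points, and the names allocate, total_rate, and bisect_lambda, as well as the search bounds, are hypothetical.

```python
def allocate(blocks, lam):
    """For each block, pick the (rate, distortion) point minimizing D + lam * R."""
    return [min(options, key=lambda rd: rd[1] + lam * rd[0]) for options in blocks]


def total_rate(choice):
    """Total number of bits used by a list of chosen (rate, distortion) points."""
    return sum(rate for rate, _ in choice)


def bisect_lambda(blocks, budget, lo=0.0, hi=1e6, iterations=50):
    """Find approximately the smallest lam whose allocation fits the bit budget.

    Relies on the total rate being non-increasing as lam grows, and assumes
    the allocation at the initial `hi` already fits the budget.
    """
    if total_rate(allocate(blocks, lo)) <= budget:
        return lo
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if total_rate(allocate(blocks, mid)) > budget:
            lo = mid   # too many bits: penalize rate more heavily
        else:
            hi = mid   # fits the budget: try a smaller multiplier
    return hi


# Two blocks, three hypothetical operating points each: (bits, distortion).
blocks = [
    [(10, 1.0), (5, 4.0), (2, 9.0)],
    [(8, 2.0), (4, 5.0), (1, 12.0)],
]
lam = bisect_lambda(blocks, budget=9)
chosen = allocate(blocks, lam)
```

A larger λ penalizes bits more strongly, so each block moves toward a cheaper, more distorted operating point; the bisection narrows in on the multiplier at which the frame just meets its budget.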
U.S. Pat. No. 5,691,770 “Device and method for coding video pictures” issued to Keesman et al. on Nov. 25, 1997 describes a method to improve an MPEG-coded video signal by modifying selected coefficients after conventional quantization. The modification is such that a Lagrangian cost cost=D+λR is minimal for a given value of the Lagrange multiplier λ. The value of λ is calculated by means of a statistical analysis of the picture to be coded. The statistical analysis includes estimation of the RD curve on the basis of the amplitude histogram distribution of the DCT coefficients. The sought value of λ is the derivative of this curve at the desired bit rate. In that method for optimal quantization scale selection, the focus is on the determination of the Lagrange multiplier λ. Macro block mode decision is not considered.
In U.S. Pat. No. 6,226,327, “Video coding method and apparatus which select between frame-based and field-based predictive modes,” issued on May 1, 2001 to Igarashi et al, a picture is considered as a mosaic of areas. Each area is encoded using either frame-based motion compensation of a previously encoded area, or field-based motion compensation of a previously encoded area, depending on which will result in the least amount of motion compensation data. Each area is orthogonally transformed using either a frame-based transformation or a field-based transformation, depending on which will result in the least amount of motion compensation data.
U.S. Pat. No. 6,037,987, “Apparatus and method for selecting a rate and distortion based coding mode for a coding system,” issued to Sethuraman on Mar. 14, 2000 describes a macro block mode decision scheme. In that method, a coding mode that has a distortion measure that is nearest to an expected distortion level is selected. After an initial coding mode is selected, the method applies a trade-off operation. The trade-off operation is actually a simplified cost comparison among the optional modes. The best coding mode after the trade-off operation is selected as the coding mode for the current macro block. In that method, it is assumed that the suitable quantization scale and rate constraint for each macro block can be obtained by a rate-control strategy.
U.S. Pat. No. 6,414,992 “Optimal encoding of motion compensated video,” issued to Sriram et al. on Jul. 2, 2002 involves a system and method for optimizing video encoding. For each mode, both distortion and the amount of data required are taken into account. The optimal selection is obtained by comparing all the optional modes in the video encoder. As a rate-distortion based method, the macro block is encoded and decoded in each mode to obtain the rate and distortion information of that mode. For example, if there are seven optional modes, seven encoding and decoding passes are required.
A similar strategy has been adopted by the Joint Video Team (JVT) reference code, see ISO/IEC JTC1/SC29/WG11 and ITU-T VCEG (Q.6/SG16), “Detailed Algorithm Technical Description for ITU-T VCEG Draft H.26L Algorithm in Response to Video and DCinema CJPs.” In that complexity mode decision method, the macro block mode decision is done by minimizing the Lagrangian function

$$J(s, c, \mathrm{MODE} \mid \mathrm{QP}, \lambda_{\mathrm{MODE}}) = \mathrm{SSD}(s, c, \mathrm{MODE} \mid \mathrm{QP}) + \lambda_{\mathrm{MODE}} \cdot R(s, c, \mathrm{MODE} \mid \mathrm{QP}),$$

where QP is the macro block quantizer, $\lambda_{\mathrm{MODE}}$ is the Lagrange multiplier for mode decision, MODE indicates a mode chosen from the set of potential prediction modes, R is the number of bits required to code the macro block in that mode, and SSD is the sum of the squared differences between the original block s and its reconstruction c. In this method, QP is fixed and $\lambda_{\mathrm{MODE}}$ is estimated based on the value of QP.
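The Lagrangian mode decision just described can be sketched in a few lines. The example assumes that the SSD and the bit count of every candidate mode have already been measured at a fixed QP, and that the multiplier is simply passed in; the mapping from QP to the multiplier is deliberately left out. The names ssd and choose_mode and the numeric values are hypothetical.

```python
def ssd(original, reconstruction):
    """Sum of squared differences between a block and its reconstruction."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(original, reconstruction)
        for a, b in zip(row_a, row_b)
    )


def choose_mode(candidates, lam):
    """Pick the mode minimizing J = SSD + lam * R.

    `candidates` maps a mode name to a (ssd, rate_in_bits) pair measured
    at a fixed QP (hypothetical values for illustration).
    """
    costs = {mode: d + lam * r for mode, (d, r) in candidates.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


candidates = {
    "intra": (100.0, 200),
    "inter_frame": (150.0, 80),
    "inter_field": (140.0, 120),
}
mode, cost = choose_mode(candidates, lam=1.0)
```

With the multiplier set to zero the decision degenerates to picking the lowest-distortion mode regardless of rate, which shows why the choice of multiplier matters as much as the measurement of SSD and R.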
None of the above prior art methods for optimal mode decision considers the selection of a quantization scale.
Systems and methods for optimally selecting a macro block coding mode based on a quantization scale selected for the macro block are described in U.S. Pat. No. 6,192,081, “Apparatus and method for selecting a coding mode in a block-based coding system,” issued to Chiang et al. on Feb. 20, 2001, and Sun, et al., “MPEG coding performance improvement by jointly optimizing coding mode decisions and rate control,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 3, June 1997.
FIG. 2 shows a typical prior art system and method 200 for jointly optimizing the coding mode and the quantizer. That system 200 basically uses a brute force, trial-and-error method. The system 200 includes a quantization selector 210, a mode selector 220, a MB predictor 230, a discrete cosine transform (DCT) 240, a quantizer 250, a variable length coder (VLC) 260, and a cost function 270 to select an optimal quantization and mode 280. The optimal quantization and mode 280 are achieved by an iterative procedure for searching through a trellis to find a path that has a lowest cost. As the quantization selector 210 changes its step size, e.g., 1 to 31, the mode selector 220 responds by selecting each mode for each macro block, e.g., intra 221, no MC 222, MC frame 223, and MC field 224.
Each macro block is predicted 230 according to the decoded picture type. Then, the forward DCT 240 is applied to each macro block of a predictive residual signal to produce DCT coefficients. The DCT coefficients are quantized 250 with each step size in the quantization parameter set. The quantized DCT coefficients are entropy encoded using the VLC 260, and a bit rate 261 is recorded for later use. In parallel, a distortion calculation by means of mean-square-error (MSE) is performed over pixels in the macro block, resulting in a distortion value 251.
Next, the resulting bit rate 261 and distortion 251 are received into the rate-distortion module for cost evaluation 270. The rate-distortion function is constrained by a target frame budget imposed by a rate constraint Rpicture 271. The cost evaluation 270 is performed on each value q in the quantization parameter set. The quantization scale and coding mode with the lowest cost are selected for each macro block.
In that system, it is assumed that distortion is unchanged for different modes as long as the quantization scale value q is the same. Thus, uniform distortion is used as a constraint, and the minimization of the objective function is equivalent to minimizing the resulting bit rates. If Q denotes the set of all admissible quantization scales, and M denotes the set of all admissible coding modes, then the complexity of the system is proportional to |Q|×|M|, the number of (quantization scale, mode) combinations. Because a single loop for each quantization scale value involves DCT transformation, quantization, distortion and bit count calculation for each macro block, the double loop for joint mode decision and quantization scale selection in that system makes its complexity extremely high.
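The cost of the double loop can be made concrete with a small sketch that counts how many trial encodings a single macro block requires. Everything in it is a hypothetical stand-in: toy_encode replaces the real predict, DCT, quantize, and VLC chain with a made-up rate and distortion model in which distortion depends only on q (mirroring the assumption above), and a fixed multiplier replaces the frame-level rate constraint and trellis search.

```python
QUANT_SCALES = range(1, 32)                          # step sizes 1 to 31
MODES = ("intra", "no_mc", "mc_frame", "mc_field")   # the four modes of FIG. 2


def toy_encode(mode, q):
    """Hypothetical stand-in for predict + DCT + quantize + VLC.

    Returns (rate, distortion). Distortion depends only on q, so it is
    identical across modes for a given quantization scale.
    """
    base_bits = {"intra": 400.0, "no_mc": 260.0, "mc_frame": 180.0, "mc_field": 200.0}
    rate = base_bits[mode] / q
    distortion = q * q / 12.0
    return rate, distortion


def exhaustive_search(lam):
    """Try every (q, mode) pair and keep the one with the lowest cost."""
    best = None
    evaluations = 0
    for q in QUANT_SCALES:
        for mode in MODES:
            rate, distortion = toy_encode(mode, q)
            evaluations += 1
            cost = distortion + lam * rate
            if best is None or cost < best[0]:
                best = (cost, q, mode)
    return best, evaluations


(best_cost, best_q, best_mode), evaluations = exhaustive_search(lam=1.0)
```

Even in this toy setting, 31 quantization scales and 4 modes require 124 trial encodings per macro block, and in a real encoder each trial involves a transform, quantization, entropy coding, and a distortion measurement.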
Therefore, there is a need to provide a system and method for encoding video that achieves a solution for coding mode decision and quantization scale selection with less complexity than the prior art.