The invention relates generally to video compression techniques and, more particularly, to perceptual-based video signal coding techniques resulting in reduced complexity of video coders implementing same.
The age of digital video communications is arriving more slowly than many had anticipated. Picturephone (1970s) and videophone (1990s) were not commercial successes because they did not provide full color, full motion video and were not cost effective. Desktop video, using windows in computer monitors or TV screens, requires special purpose chips or graphics accelerators to perform the encoding operations. The chips usually come mounted on video boards that are expensive and whose installation intimidates most users. One main reason that these video processors are necessary is the use of block-based motion compensation, although two-dimensional block-based transforms and lossless compression of quantized transform coefficients add to the computational burden. Motion compensation accounts for over 60% of the computational effort in most video compression algorithms. Although there are algorithms that avoid motion compensation, such as Motion-JPEG, they tend to consume some ten times more transmission bandwidth or storage space because they fail to capitalize on the interframe correlation between successive frames of video. This penalty is especially critical in video conferencing and distance learning applications, rendering these algorithms uncompetitive in such applications.
Sources such as speech and video are highly correlated from sample to sample. This correlation can be used to predict each sample based on a previously reconstructed sample, and then encode the difference between the predicted value and the current sample. A main objective of motion compensation is to reduce redundancy between adjacent pictures. There are two kinds of redundancy well known in video compression: (i) spatial (intra-frame) redundancy; and (ii) temporal (inter-frame) redundancy. Temporal correlation usually can be reduced significantly via forward, backward, or interpolative prediction based on motion compensation. The remaining spatial correlation in the temporal prediction error images can be reduced via transform coding. In addition to spatial and temporal redundancies, perceptual redundancy has begun to be considered in video processing technology, e.g., N. S. Jayant et al., “Signal Compression Based on Models of Human Perception,” Proceedings of IEEE, Volume 10, October 1993.
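The sample-to-sample prediction described above can be sketched as a closed-loop differential coder. This is a minimal illustration only: the function names, the uniform step size, and the choice of the previous reconstructed sample as the predictor are assumptions made for the example and are not taken from any particular codec.

```python
def dpcm_encode(samples, step=4):
    """Predict each sample from the previous *reconstructed* sample and
    quantize the prediction error (step size is an assumed parameter)."""
    prediction = 0
    levels = []
    reconstructed = []
    for s in samples:
        diff = s - prediction              # prediction error
        level = round(diff / step)         # quantized error sent to the decoder
        levels.append(level)
        prediction += level * step         # encoder tracks the decoder's state
        reconstructed.append(prediction)
    return levels, reconstructed


def dpcm_decode(levels, step=4):
    """Rebuild the samples by accumulating the dequantized errors."""
    prediction = 0
    out = []
    for level in levels:
        prediction += level * step
        out.append(prediction)
    return out


if __name__ == "__main__":
    samples = [100, 102, 105, 104, 104, 110]
    levels, recon = dpcm_encode(samples)
    print(levels, recon, dpcm_decode(levels))
```

Because the encoder predicts from its own reconstruction rather than from the original samples, the decoder stays in step with it and the quantization error does not accumulate.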
FIG. 1 illustrates a block diagram of a widely used video encoder 10 for encoding video signals for transmission, storage, and/or further processing. The encoder 10 includes a motion estimator 12 and a signal subtractor 14, both coupled to the input of the encoder. The encoder 10 also includes a transformer (e.g., a discrete cosine transform or DCT generator) 16 coupled to the signal subtractor 14, a quantizer 18 coupled to the transformer 16, and an entropy encoder 20 coupled to the quantizer 18 and the output of the encoder 10. An inverse transformer (e.g., an inverse DCT generator) 22 is also included and coupled between the quantizer 18 and the entropy encoder 20. The encoder 10 also includes a signal combiner 24 coupled to the inverse transformer 22, a delay 26 coupled to the signal combiner 24, and a motion compensator 28 coupled to the delay 26, the signal subtractor 14, the signal combiner 24, and the motion estimator 12. Also included in the encoder 10 is a rate control processor 30 coupled to the quantizer 18.
It is known that motion estimation and motion compensation, as described in detail in Y. Nakaya et al., “Motion Compensation Based on Spatial Transformations,” IEEE Transactions on Circuits and Systems for Video Technology, Volume 4, Number 3, Pages 339-356, June 1994, can be used to improve the inter-frame prediction by exploiting the temporal redundancy in a sequence of frames. The motion estimator 12 performs n×n (n typically equals 16) block-based matching of the kth input frame $F_k$ using the (k−1)st decompressed frame $\hat{F}_{k-1}$ (generated by delay 26) as the reference. The matching criterion usually employed is mean absolute error (MAE), although mean square error (MSE) may alternatively be employed. For the ith macroblock, the error measure $en_i(d)$ for the displacement vector d between $F_k$ and $\hat{F}_{k-1}$ is:

$$en_i(d) = \sum_{(x,y)\,\in\,B} \left\| F_k(x,y) - \hat{F}_{k-1}(x-d,\; y-d) \right\| \qquad (1)$$
where B is the measurement block being predicted. In equation (1), the norm is ∥x∥ = x² when the motion vector is obtained based on MSE and ∥x∥ = |x| when it is obtained based on MAE. MAE is usually used, rather than MSE, because MAE is free of multiplications and provides similar results in terms of predictive error. The offset between each block in $F_k$ and the block in $\hat{F}_{k-1}$ that best matches it is called the motion vector for that block. That is, the motion vector $mv_i$ for macroblock i is:

$$mv_i = \arg\min_{d\,\in\,S} \; en_i(d) \qquad (2)$$
where S is the search area. Interpolation schemes allow the motion vectors to achieve fractional-pel accuracy, as described in ITU-T Recommendation H.263, “Video Coding For Low Bit Rate Communication,” December 1995. Motion estimation is computationally demanding in that both signals, $F_k$ and $\hat{F}_{k-1}$, entering the motion estimator 12 are high rate and, thus, the operations that have to be performed on them are computationally intensive even if the search for the best-matching block is performed only hierarchically rather than exhaustively. The result of the motion estimation is the set of motion vectors $M_k$ for the kth frame.
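Equations (1) and (2) can be illustrated with an exhaustive integer-pel search. The sketch below is not the motion estimator 12 itself: the function names, the representation of frames as lists of rows, the split of the displacement d into components (dx, dy), the ±search window standing in for S, and the skipping of displacements that point outside the reference frame are all assumptions made for the example. The error is left as an unnormalized sum, as in equation (1).

```python
def block_error(cur, ref, bx, by, dx, dy, n, metric="mae"):
    """Equation (1): sum of |diff| (MAE criterion) or diff**2 (MSE criterion)
    over the n x n block whose top-left corner is (bx, by)."""
    total = 0
    for y in range(by, by + n):
        for x in range(bx, bx + n):
            diff = cur[y][x] - ref[y - dy][x - dx]
            total += abs(diff) if metric == "mae" else diff * diff
    return total


def find_motion_vector(cur, ref, bx, by, n=16, search=7, metric="mae"):
    """Equation (2): exhaustive search for the displacement with minimum error."""
    height, width = len(ref), len(ref[0])
    best_d, best_err = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates whose reference block falls outside the frame.
            if not (0 <= bx - dx and bx - dx + n <= width
                    and 0 <= by - dy and by - dy + n <= height):
                continue
            err = block_error(cur, ref, bx, by, dx, dy, n, metric)
            if best_err is None or err < best_err:
                best_d, best_err = (dx, dy), err
    return best_d, best_err


if __name__ == "__main__":
    # Reference frame with a unique value per pixel; the current frame is the
    # reference shifted right by 2 and down by 1.
    ref = [[100 * y + x for x in range(16)] for y in range(16)]
    cur = [[ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0
            for x in range(16)] for y in range(16)]
    print(find_motion_vector(cur, ref, 4, 4, n=4, search=3))
```

The nested loops make the cost visible: each macroblock evaluates (2·search + 1)² candidate displacements, each requiring n² pixel differences, which is why the text describes motion estimation as computationally demanding.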
The $M_k$ are usually losslessly compressed and then conveyed to the transmission channel for immediate or eventual access by the decoder. Also, the $M_k$ are fed back to the motion compensator 28 in the prediction loop of the encoder. The $M_k$ constitute a recipe for building a complete frame, herein referred to as $\tilde{F}_k$, by translating the blocks of $\hat{F}_{k-1}$. The motion compensated frame $\tilde{F}_k$ is subtracted pixel-wise from the current input frame $F_k$, in signal subtractor 14, to produce a difference frame $D_k$, often referred to as the displaced frame difference (DFD), as further described in T. Ebrahimi et al., “New Trends in Very Low Bit Rate Video Coding,” Proceedings of the IEEE, Volume 83, Number 6, Pages 877-891, June 1995; and W. P. Li et al., “Vector-based Signal Processing and Quantization For Image and Video Compression,” Proceedings of the IEEE, Volume 83, Number 2, Pages 317-335, February 1995. The remaining spatial correlation in $D_k$ is reduced by the transformer 16 and the quantizer 18. The transformer may, for example, be a discrete cosine transform (DCT) generator which generates DCT coefficients for macroblocks of frames. The quantizer then quantizes these coefficients. A conventional video encoder such as that shown in FIG. 1 generally attempts to match the bit rate of the compressed video stream to a desired transmission bandwidth. The quantization parameter (QP) used in the quantizer 18 generally has a substantial effect on the resultant bit rate: a large QP performs coarse quantization, reducing the bit rate and the resulting video quality, while a small QP performs finer quantization, which leads to a higher bit rate and higher resulting image quality. The rate control processor 30 thus attempts to find a QP that is high enough to restrain the bit rate while yielding the best possible resulting image quality.
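The effect of QP described above can be illustrated with a toy uniform quantizer. This is a sketch only: the step size of 2·QP, truncation toward zero, the sample coefficient values, and the use of the nonzero-level count as a rough stand-in for bit rate are all assumptions made for illustration, not details of the quantizer 18.

```python
def quantize(coefficients, qp):
    """Map coefficients to integer levels with an assumed step of 2 * qp,
    truncating toward zero."""
    step = 2 * qp
    return [int(c / step) for c in coefficients]


def dequantize(levels, qp):
    """Reconstruct approximate coefficient values from the levels."""
    step = 2 * qp
    return [level * step for level in levels]


def summarize(coefficients, qp):
    """Return (nonzero level count, total absolute reconstruction error)."""
    levels = quantize(coefficients, qp)
    recon = dequantize(levels, qp)
    nonzero = sum(1 for level in levels if level != 0)
    distortion = sum(abs(c - r) for c, r in zip(coefficients, recon))
    return nonzero, distortion


if __name__ == "__main__":
    coefficients = [120, -45, 18, -7, 3, 0, -2, 1]  # hypothetical values
    for qp in (2, 16):
        print(qp, summarize(coefficients, qp))
```

With the larger QP, fewer levels survive as nonzero (less to entropy code) while the reconstruction error grows, which is the trade-off the rate control processor 30 has to balance.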
In general, it is desirable to maintain consistent image quality throughout a video sequence, rather than having the image quality vary widely from frame to frame. Both the MPEG-2 simulation model and the H.263 test model suggest rate control techniques for selecting the QP; however, other rate control techniques are known to those of ordinary skill in the art.
Next, the lossy version of $D_k$, denoted as $\hat{D}_k$ and generated by inverse transformer 22, and the motion compensated frame $\tilde{F}_k$ are used in the compressor/feedback loop to reconstruct the reference frame $\hat{F}_k$ for the next input frame $F_{k+1}$. Finally, the Huffman (or arithmetic) coded lossy compressed version of $D_k$, generated by the entropy encoder 20, is transmitted to the decoder. It is to be appreciated that FIG. 1 represents a generic coder architecture described in the current video codec (coder/decoder) standards of H.261, H.263, MPEG-1, and MPEG-2. Further details on these standards are respectively described in: M. Liou, “Overview of the P*64 Kbit/s Video Coding Standard,” Communications of the ACM, Volume 34, Number 4, Pages 59-63, April 1991; ITU-T Recommendation H.263, “Video Coding For Low Bit Rate Communication,” December 1995; D. LeGall, “MPEG: A Video Compression Standard for Multimedia Applications,” Communications of the ACM, Volume 34, Number 4, April 1991; and B. Haskell et al., “Digital Video: An Introduction to MPEG-2,” Chapman and Hall, 1997.
Picture quality, coding bit rate, computational complexity, and latency are the four aspects of video codecs that can be traded off when designing a video codec system. This is further discussed in N. S. Jayant, “Signal Compression: Technology Targets and Research Direction,” IEEE Journal on Selected Areas in Communications, Volume 10, Number 5, Pages 796-818, June 1992. A main objective of a video codec is to represent the original signal with minimal bit rate while maintaining acceptable picture quality, delay, and computational complexity. From the above-mentioned rationale of motion estimation, the motion vectors are obtained as the displacement having the minimal error metric. Although this achieves the minimum MAE in the residual block, it does not necessarily result in the best perceptual quality since MAE is not always a good indicator of video quality. In low bit rate video coding, the overhead in sending the motion vectors becomes a significant proportion of the total data rate. The minimum-MAE motion vector may not achieve the minimum joint entropy for coding the residual block and motion vector, and thus may not achieve the best compression efficiency. Another problem occurs in smooth, still backgrounds, where zero-displaced motion vectors may not be selected based strictly on minimum-MAE criteria. In this case, the zero-displaced motion vector is a better candidate than the minimum-MAE motion vector because the codeword for the zero-displaced motion vector is usually shorter and, thus, the zero-displaced motion vector will generate lower combined data rates for DCT coefficients and motion vectors without any loss of picture quality. If it can be determined at the outset that the zero-displaced motion vector is the suitable one to select, a large computational effort can be saved by avoiding motion estimation for these macroblocks.
Accordingly, it is to be appreciated that since motion estimation/compensation imposes such a significant computational load on the resources of an encoder and a corresponding decoder, it would be highly desirable to develop encoding techniques that segment frames into portions that should be motion compensated and those that do not need to be motion compensated.
The invention provides video encoding apparatus and methodologies which improve the computational efficiency and compression ratio associated with encoding a video signal. This is accomplished by providing perceptual preprocessing in a video encoder that takes advantage of the insensitivity of the human visual system (HVS) to mild changes in pixel intensity in order to segment video into regions according to perceptibility of picture changes. Then, the regional bit rate and complexity are reduced by repeating regions which have changed an imperceptible amount from the preceding frame. In addition, the invention accurately models the motion in areas with perceptually significant differences to improve the coding quality. Depending on picture content, perceptual preprocessing achieves varied degrees of improvement in computational complexity and compression ratio without loss of perceived picture quality.
In an illustrative embodiment of the invention, a method of encoding a video sequence including a sequence of video images is provided. The inventive method includes comparing elements of a portion of a first video image (e.g., pixels of a macroblock of a current frame) with elements of a portion of a second video image (e.g., corresponding pixels of a macroblock of a previous frame) to generate respective intensity difference values for the element comparisons. Then, a first value is assigned to the intensity difference values that are at least above a visually perceptible threshold value and a second value is assigned to the intensity difference values that are not at least above the visually perceptible threshold value. In one embodiment of the invention, the visually perceptible threshold value is a function of a quantization parameter associated with the bit rate of the encoding process such that quality-adaptive thresholding is realized. Next, the method includes dividing the portion of the first video image into sub-portions (e.g., four 4×4 blocks of an 8×8 macroblock) and summing the first and second values associated with each corresponding sub-portion to generate respective sums. If a respective sum is at least greater than a decision value, a variable associated with that sub-portion is set to a first value. If a respective sum is not at least greater than the decision value, the variable associated with that sub-portion is set to a second value. The values associated with the variables are then added. Depending on the result of the addition, the portion of the first video image is either motion compensated or not.
It is to be appreciated that the present invention takes advantage of the realization that an intensity difference between an isolated pixel or pixels of succeeding video frames, which is at least greater than some visually perceptible threshold value, is nonetheless difficult to detect by the HVS. That is, despite the fact that there may be a number of pixels in a macroblock in a current frame that, when compared to corresponding pixels in a previous frame, result in an intensity difference value at least greater than the visually perceptible threshold value, the macroblock may still not need to be motion compensated and can merely be repeated at the decoder if such individual pixels are isolated from one another in the macroblock. Advantageously, the evaluation of thresholding results with respect to sub-blocks of a macroblock permits a determination as to where the individual pixels of interest are located within the macroblock. In this manner, the present invention implements a randomness-adaptive decision process with respect to deciding whether or not to encode a macroblock.
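The thresholding and sub-block decision steps summarized above can be sketched as follows. This is a hedged illustration rather than the claimed method: the function name, the use of 1 and 0 as the first and second values, the use of the absolute intensity difference, the fixed numeric threshold (the summary states only that it may be a function of the quantization parameter), and the final rule that a macroblock is motion compensated when at least one sub-block is flagged are all assumptions made for the example.

```python
def needs_motion_compensation(cur_block, prev_block, threshold,
                              decision_value, sub=4):
    """Return (decision, per-sub-block flags) for one square macroblock.

    threshold      -- assumed visually perceptible intensity difference
    decision_value -- assumed count of perceptible pixels a sub-block must
                      exceed before it is flagged
    """
    n = len(cur_block)
    # Assign 1 to perceptible pixel differences and 0 to imperceptible ones.
    mask = [[1 if abs(cur_block[y][x] - prev_block[y][x]) > threshold else 0
             for x in range(n)] for y in range(n)]
    # Sum the values in each sub-block and compare with the decision value.
    flags = []
    for sy in range(0, n, sub):
        for sx in range(0, n, sub):
            count = sum(mask[y][x]
                        for y in range(sy, sy + sub)
                        for x in range(sx, sx + sub))
            flags.append(1 if count > decision_value else 0)
    # Add the sub-block variables; assumed rule: any flagged sub-block
    # means the macroblock is motion compensated, otherwise it is repeated.
    return sum(flags) > 0, flags


if __name__ == "__main__":
    prev = [[100] * 8 for _ in range(8)]
    isolated = [row[:] for row in prev]
    for y, x in [(0, 0), (1, 6), (6, 1), (7, 7)]:   # one change per sub-block
        isolated[y][x] += 20
    clustered = [row[:] for row in prev]
    for y, x in [(0, 0), (0, 1), (1, 0), (1, 1)]:   # four changes together
        clustered[y][x] += 20
    print(needs_motion_compensation(isolated, prev, 5, 2))
    print(needs_motion_compensation(clustered, prev, 5, 2))
```

Both test macroblocks contain four pixels that exceed the threshold, yet only the clustered one is flagged, which mirrors the observation that isolated perceptible differences need not trigger motion compensation.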
It is to be further appreciated that the invention is fully compatible and, thus, may be implemented with video standards such as, for example, H.261, H.263, Motion-JPEG, MPEG-1, and MPEG-2.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.