The present invention relates generally to improvements in video encoding, for example, the encoding employed in such standards as MPEG-1, MPEG-2, H.261, and H.263, and to motion estimation. More particularly, it relates to advantageous techniques for applying frequency domain analysis to motion estimation.
The moving pictures expert group (MPEG) video compression standards, MPEG-1 (ISO 11172-2) and MPEG-2 (ISO 13818-2), employ image processing techniques at multiple levels. Of interest to the present invention is the processing of 16×16 macroblocks and 8×8 blocks. In the terminology used by the MPEG standards, a “frame” is an X by Y image of pixels, or picture elements. Each pixel represents the smallest discrete unit in an image. The “pixel”, in MPEG usage, consists of three color components, one luminance and two chrominance values, Y, Cb, and Cr, respectively. Each frame is subdivided into 16×16 “macroblocks” of pixels. A grouping of macroblocks is called a “slice”. Each macroblock is further sub-divided into 8×8 “blocks” of pixels. A macroblock is typically comprised of four luminance (Y) and two or more chrominance (Cb and Cr) blocks. A more detailed description of luminance and chrominance is included in the MPEG-1 and MPEG-2 specifications. A sequence of frames ultimately makes up a video sequence.
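The subdivision described above can be illustrated with a minimal Python sketch. It handles a single luminance plane only, assumes the frame dimensions are exact multiples of 16, and uses hypothetical helper names (`split_into_macroblocks`, `split_into_blocks`) that do not come from the MPEG specifications.

```python
def split_into_macroblocks(frame, mb_size=16):
    """Yield (top, left, macroblock) for each mb_size x mb_size tile of a 2D list.

    Assumption: the frame height and width are exact multiples of mb_size.
    """
    height, width = len(frame), len(frame[0])
    for top in range(0, height, mb_size):
        for left in range(0, width, mb_size):
            tile = [row[left:left + mb_size] for row in frame[top:top + mb_size]]
            yield top, left, tile


def split_into_blocks(macroblock, block_size=8):
    """Return the block_size x block_size blocks of one macroblock in raster order."""
    size = len(macroblock)
    return [
        [row[left:left + block_size] for row in macroblock[top:top + block_size]]
        for top in range(0, size, block_size)
        for left in range(0, size, block_size)
    ]


# A synthetic 32x32 luminance plane: four macroblocks, each holding four 8x8 blocks.
frame = [[(y * 32 + x) % 256 for x in range(32)] for y in range(32)]
macroblocks = list(split_into_macroblocks(frame))
blocks = split_into_blocks(macroblocks[0][2])
```

In a real encoder the two chrominance planes would be tiled as well, at a resolution that depends on the chroma format; that detail is omitted here.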
One of the key compression methods used in MPEG is the discrete cosine transform (DCT) or the two dimensional discrete cosine transform (2D-DCT) coupled with quantization. During the encoding process, each block is transformed from its spatial-domain representation or its actual pixel values to a frequency-domain representation utilizing a two-dimensional 8×8 DCT. The quantization has the effect of de-emphasizing or eliminating visual components of the block with high spatial frequencies not normally visible to the human visual system, thus reducing the volume of data needed to represent the block. The quantization values used by the MPEG protocols are in the form of a quantization scale factor, included in the encoded bitstream, and the quantization tables. There are default tables included in the MPEG specification. However, these can be replaced by quantization tables included in the encoded bitstream. The decision as to which scale factors and tables to use is made by the MPEG encoder.
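As an illustration of the transform-plus-quantization step, the following Python sketch applies a direct, unoptimized 8×8 2D-DCT with orthonormal scaling and then divides each coefficient by a table entry and a scale factor. The table `EXAMPLE_TABLE` is invented for this example and is not the MPEG default table, and the rounding rule is a simplification of what the standards prescribe.

```python
import math

N = 8


def dct_2d(block):
    """Forward 8x8 2D DCT-II with orthonormal scaling (direct O(N^4) form)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * total
    return out


def quantize(coeffs, table, scale=1):
    """Divide each coefficient by table entry * scale and round to an integer."""
    return [[round(coeffs[u][v] / (table[u][v] * scale)) for v in range(N)]
            for u in range(N)]


# Hypothetical table: the step size grows with spatial frequency.
EXAMPLE_TABLE = [[8 + 2 * (u + v) for v in range(N)] for u in range(N)]

flat_block = [[128] * N for _ in range(N)]               # constant block
ramp_block = [[10 * x for x in range(N)] for _ in range(N)]  # horizontal ramp

flat_coeffs = dct_2d(flat_block)
ramp_quantized = quantize(dct_2d(ramp_block), EXAMPLE_TABLE)
```

The constant block concentrates all of its energy in the single DC coefficient, and the horizontal ramp produces non-zero values only in the first row of the coefficient matrix, which is why smooth image content quantizes to very few non-zero values.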
One of the fundamental methods used by the MPEG protocol is a mechanism whereby a macroblock within a single frame within a sequence of frames is represented in a motion vector (MV) encoded format. An MV represents the spatial location difference between that macroblock and a reference macroblock from a different, but temporally proximate, frame. Note that whereas DCT compression is performed on a block basis, the MVs are determined for macroblocks.
MPEG classifies frames as being of three types: I-frame (Intra-coded), P-frame (Predictive-coded), and B-frame (Bidirectionally predictive-coded). I-frames are encoded in their entirety. All of the information to completely decode an I-frame is contained within its encoding. I-frames can be used as the first frame in a video sequence, as the first frame of a new scene in a video sequence, as reference frames described further below, as refresh frames to prevent excessive error build-up, or as error-recovery frames, for example, after incoming bitstream corruption. They can also be convenient for special features such as fast forward and fast reverse.
P-frames depend on one previous frame. This previous frame is called a reference frame, and may be the previous I-frame, or P-frame, as shown below. An MV associated with each macroblock in the P-frame points to a similar macroblock in the reference frame. During reconstruction, or decoding, the referenced macroblock is used as the starting point for the macroblock being decoded. Then a difference macroblock, preferably small, may be applied to the referenced macroblock. To understand how this reference-difference macroblock combination works, consider the encoding process of a P-frame macroblock. Given a macroblock in the P-frame, a search is performed in the previous reference frame for a similar macroblock. Once a good match is found, the reference macroblock pixel values are subtracted from the current macroblock pixel values. This subtraction results in a difference macroblock. Also, the position of the reference macroblock relative to the current macroblock is recorded as an MV. The MV is encoded and included in the encoder's output. This processing is followed by the DCT computation and quantization of the blocks comprising the difference macroblock. To decode the P-frame macroblock, the macroblock in the reference frame indicated by the MV is retrieved. Then, the difference macroblock is decoded and added to the reference macroblock. The result is the original macroblock values, or values very close thereto. Note that the MPEG encoding and decoding processes are categorized as lossy compression and decompression, respectively.
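The reference-difference mechanism can be sketched in a few lines of Python. This is a simplified illustration: it omits the DCT and quantization of the difference macroblock, so the round trip is exact rather than lossy, it uses a 4×4 tile in place of a 16×16 macroblock, and the function names and the (row, column) MV convention are assumptions made for the example.

```python
def extract(frame, top, left, size):
    """Copy a size x size tile whose top-left corner is at (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def encode_p_macroblock(current_mb, reference_frame, mb_pos, ref_pos):
    """Return (mv, difference) for a macroblock at mb_pos matched to ref_pos.

    The search that chooses ref_pos is assumed to have happened already.
    """
    size = len(current_mb)
    ref_mb = extract(reference_frame, ref_pos[0], ref_pos[1], size)
    mv = (ref_pos[0] - mb_pos[0], ref_pos[1] - mb_pos[1])
    diff = [[c - r for c, r in zip(c_row, r_row)]
            for c_row, r_row in zip(current_mb, ref_mb)]
    return mv, diff


def decode_p_macroblock(mv, diff, reference_frame, mb_pos):
    """Fetch the macroblock the MV points at and add the difference to it."""
    size = len(diff)
    ref_mb = extract(reference_frame, mb_pos[0] + mv[0], mb_pos[1] + mv[1], size)
    return [[r + d for r, d in zip(r_row, d_row)]
            for r_row, d_row in zip(ref_mb, diff)]


# Toy data: the current tile equals the reference content at (2, 1), brightened by 1.
reference = [[y * 8 + x for x in range(8)] for y in range(8)]
current_mb = [[value + 1 for value in row] for row in extract(reference, 2, 1, 4)]

mv, diff = encode_p_macroblock(current_mb, reference, mb_pos=(0, 0), ref_pos=(2, 1))
decoded = decode_p_macroblock(mv, diff, reference, mb_pos=(0, 0))
```

Here the difference macroblock is a constant 1 everywhere, which is far cheaper to encode than the original pixel values.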
The idea is that the encoding of the MV and the difference information for a given macroblock will result in a smaller number of bits in the resulting bitstream than the complete encoding of the macroblock by itself. Note that the reference frame for a P-frame is usually not the immediately preceding frame. A sample ordering is given below.
B-frames depend on two reference frames, one in each temporal direction. Each MV points to a similar macroblock in each of the two reference frames. In the case of B-frames, the two referenced macroblocks are averaged together before any difference information is added in the decoding process. Per the MPEG standard, a B-frame is not used as a reference frame. The use of B-frames normally results in a more compact representation of each macroblock.
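A minimal Python sketch of the averaging step follows. The two referenced macroblocks are assumed to have been fetched already using their MVs, the rounding convention (round half up) is an assumption of this example, and the function names are hypothetical.

```python
def predict_b_macroblock(forward_ref_mb, backward_ref_mb):
    """Average the two referenced macroblocks pixel by pixel (round half up)."""
    return [[(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(forward_ref_mb, backward_ref_mb)]


def decode_b_macroblock(forward_ref_mb, backward_ref_mb, diff):
    """Add the decoded difference information to the averaged prediction."""
    prediction = predict_b_macroblock(forward_ref_mb, backward_ref_mb)
    return [[p + d for p, d in zip(p_row, d_row)]
            for p_row, d_row in zip(prediction, diff)]


# 2x2 tiles stand in for 16x16 macroblocks to keep the example readable.
forward_ref = [[10, 20], [30, 40]]
backward_ref = [[20, 20], [31, 44]]
difference = [[1, 0], [-1, 2]]

prediction = predict_b_macroblock(forward_ref, backward_ref)
decoded = decode_b_macroblock(forward_ref, backward_ref, difference)
```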
A typical ordering of frame types would be I1, B2, B3, P4, B5, B6, P7, B8, B9, I10, and so on. Note that the numeric suffixes refer to the temporal ordering of the frames. This temporal ordering is also the display ordering produced by the MPEG decoder. The encoded ordering of these frames, found in an MPEG bitstream, is typically different: I1, P4, B2, B3, P7, B5, B6, I10, B8, B9, and so forth. The first frame is always an I-frame. As mentioned above, an I-frame has no temporal dependencies upon other frames; therefore, an I-frame contains no MVs. Upon completion of the decoding of this frame, it is ready for display. The second frame to be decoded is P4. It consists of MVs referencing I1 and differences to be applied to the referenced macroblocks. After completion of the decoding of this frame, it is not displayed, but first held in reserve as a reference frame for decoding B2 and B3, then displayed, and then used as a reference frame for decoding B5 and B6. The third frame to be decoded is B2. It consists of pairs of MVs for each macroblock that reference I1 and P4 as well as any difference information. Upon completion of the decoding of B2, it is ready for display. The decoding then proceeds to B3. B3 is decoded in the same manner as B2; B3's MVs reference I1 and P4. B3 is then displayed, followed by the display of P4. P4 then becomes the backward-reference frame for the next set of frames. Decoding continues in this fashion until the entire set of frames, or video sequence, has been decoded and displayed.
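The relationship between the display ordering and the encoded ordering can be expressed as a simple rule: each run of B-frames is emitted after the reference frame (I or P) that follows it in display order. The Python sketch below applies that rule to frame labels; the label format and the function name are assumptions made for illustration.

```python
def display_to_coded_order(display_order):
    """Reorder frame labels so each B-frame follows both of its references."""
    coded, pending_b = [], []
    for frame in display_order:
        if frame.startswith("B"):
            pending_b.append(frame)      # hold until the next reference frame
        else:
            coded.append(frame)          # I- or P-frame goes out first
            coded.extend(pending_b)      # then the B-frames that precede it
            pending_b = []
    # Trailing B-frames with no later reference frame are appended unchanged.
    coded.extend(pending_b)
    return coded


display = ["I1", "B2", "B3", "P4", "B5", "B6", "P7", "B8", "B9", "I10"]
coded = display_to_coded_order(display)
# coded == ["I1", "P4", "B2", "B3", "P7", "B5", "B6", "I10", "B8", "B9"]
```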
A video sequence generally approximates the appearance of smooth motion. In such a sequence, a given block of pixels in a given frame will be similar in content to one or more spatially proximate blocks in a range of temporally proximate frames. Given smooth real motion within a scene represented by such a sequence, and smooth apparent motion caused by changes in the orientation, point of view, and characteristics such as field width, for example, of the recorder of such a sequence, the positions of blocks that exhibit the greatest similarity across a number of temporally adjacent frames are very likely to be approximately spatially linear with respect to a fixed reference such as the common origin of the frames. The process of identifying the positions of such similar blocks across a range of frames is referred to as motion estimation. The spatial relationship among such blocks is referred to as the motion vector.
Historically, the measure of similarity between blocks has been represented by the pixel-wise sum or mean of the absolute differences (SAD or MAD, respectively) between the given macroblock and the reference macroblock or macroblocks. The SAD is defined as the sum of the absolute value of the differences between the spatially collocated pixels in the given macroblock and the reference macroblock. The MAD can be determined by computing the SAD, then dividing by the number of pixels in the given macroblock, for example, 256 in a 16×16 macroblock. To differentiate between current techniques and the techniques of the present invention, the prior art spatial domain mean of absolute differences will be referred to as SD-MAD and the prior art spatial domain sum of absolute differences will be referred to as SD-SAD.
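The two definitions translate directly into code. The following Python sketch computes the SD-SAD and SD-MAD of two equally sized blocks given as lists of rows; the function names are chosen for this example only.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between spatially collocated pixels."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def mad(block_a, block_b):
    """Mean of absolute differences: the SAD divided by the pixel count."""
    pixel_count = len(block_a) * len(block_a[0])
    return sad(block_a, block_b) / pixel_count


# A completely white and a completely black 16x16 macroblock.
white = [[255] * 16 for _ in range(16)]
black = [[0] * 16 for _ in range(16)]

worst_sad = sad(white, black)   # 255 * 256 = 65280
worst_mad = mad(white, black)   # 255.0
```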
Much of the computational effort expended by the typical MPEG encoder is used in locating macroblocks of pixels, within a window of macroblocks, in the reference frame or frames that yield the least SD-SAD or the least SD-MAD for a given macroblock. Large search window sizes are needed to compress fast motion such as might be found in a video sequence of a sporting event.
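To make the cost of this search concrete, the Python sketch below performs an exhaustive (full) search over a square window and keeps the candidate with the least SD-SAD. It is one common textbook form of block matching, not a description of any particular encoder; the function names, the (row, column) displacement convention, and the small tile size are assumptions of the example.

```python
def sad_at(current_mb, reference_frame, top, left):
    """SD-SAD between current_mb and the reference tile at (top, left)."""
    return sum(abs(pixel - reference_frame[top + y][left + x])
               for y, row in enumerate(current_mb)
               for x, pixel in enumerate(row))


def full_search(current_mb, reference_frame, mb_pos, search_range):
    """Return (best_sad, (dy, dx)) over all displacements within search_range."""
    size = len(current_mb)
    height, width = len(reference_frame), len(reference_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            top, left = mb_pos[0] + dy, mb_pos[1] + dx
            # Skip candidates that fall partly outside the reference frame.
            if top < 0 or left < 0 or top + size > height or left + size > width:
                continue
            cost = sad_at(current_mb, reference_frame, top, left)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


# Toy data: a 12x12 reference frame and a 4x4 tile copied from position (3, 5).
reference = [[y * 12 + x for x in range(12)] for y in range(12)]
current_mb = [row[5:9] for row in reference[3:7]]

best_sad, best_mv = full_search(current_mb, reference, mb_pos=(2, 2), search_range=4)
```

With a search range of R the window holds up to (2R + 1)² candidates, and each candidate for a 16×16 macroblock costs 256 absolute differences, which is why widening the window to follow fast motion is expensive.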
The MPEG protocol represents an image in the frequency domain using DCT processing with quantization for compression reasons, yet motion estimation is typically performed in the spatial domain. For example, implementations of block matching algorithms are readily found in the literature. These algorithms typically use an SD-MAD or an SD-SAD computation. In the following discussion of both existing algorithms and the new invention, the MAD statistic is used, but can readily be substituted by the SAD. The two differ only by a constant factor, namely the number of values being considered, such as 256 for an MPEG macroblock. Spatial-domain similarity analysis has as a basic assumption that the SD-MAD of two pixel macroblocks correlates with the volume of data required to represent the 2D-DCT of the difference between the blocks. While this assumption may be valid, it is not the only possible correlation. Consider, as an extreme case, two frames with the first frame being completely white (i.e., all luminance values are equal to 255), and the second frame being completely black with all values equal to zero. Assuming that the white frame is an I-frame being used as a reference for the black P-frame, it is necessary to try to match the black blocks of the new frame against the white blocks of the I-frame. In the spatial domain, the SD-MAD of any pair of black and white blocks is 255, the worst possible value, making them, prima facie, poor candidates as a reference-difference pair. A typical spatial-domain motion estimator would not consider them. In fact, the quantized 2D-DCT of the difference between these blocks is:
which contains exactly one non-zero quantity. Due to the characteristics of MPEG variable-length coding, the DCT of the difference can be expressed very compactly, actually making these blocks a good reference-difference pair, even though the blocks have an extremely poor SD-MAD.
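The white-versus-black case can be checked numerically. The Python sketch below computes the SD-MAD of an all-white and an all-black 8×8 block and then the quantized 2D-DCT of their difference. The orthonormal DCT scaling and the uniform quantizer step `STEP` are assumptions of this example, so the magnitude of the surviving coefficient will not necessarily match the value produced by an actual MPEG quantizer; only the count of non-zero coefficients is the point.

```python
import math

N = 8
STEP = 16  # hypothetical uniform quantizer step, not an MPEG table value

# Orthonormal DCT-II basis: BASIS[u][x] = c(u) * cos((2x + 1) * u * pi / (2N))
BASIS = [[(math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N))
          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
          for x in range(N)] for u in range(N)]


def matmul(a, b):
    """Plain N x N matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def dct_2d(block):
    """Separable 2D DCT computed as BASIS * block * BASIS^T."""
    basis_t = [list(col) for col in zip(*BASIS)]
    return matmul(matmul(BASIS, block), basis_t)


white = [[255] * N for _ in range(N)]
black = [[0] * N for _ in range(N)]
difference = [[b - w for b, w in zip(b_row, w_row)]
              for b_row, w_row in zip(black, white)]

sd_mad = sum(abs(d) for row in difference for d in row) / (N * N)
quantized = [[round(c / STEP) for c in row] for row in dct_2d(difference)]
nonzero_positions = [(u, v) for u in range(N) for v in range(N) if quantized[u][v] != 0]
```

The SD-MAD is 255, yet the only non-zero quantized coefficient sits at position (0, 0), the DC term, because a constant difference has no spatial-frequency content above zero.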
As a more complex example, FIGS. 1A and 1B show a pair of pixel blocks: a reference block 10 and a sample block 12. These blocks 10 and 12 are represented by the values below:
A spatial-domain difference for these blocks 10 and 12
quantifies the obvious, that there is little spatial-domain similarity between them. The SD-MAD is 94. The zigzag ordered, quantized 2D-DCT of the difference, however,
is a great deal more promising for compression.
As the foregoing analysis demonstrates, spatial-domain similarity, such as SD-MAD, is not always the best criterion from which to determine good reference-difference block pairs for motion estimation. While pairs that exhibit great spatial-domain similarity can very likely yield minimal difference blocks under variable length coding, such analysis can miss pairs that exhibit far better compression. The present invention recognizes that the spatial-domain measurement of the prior art is not necessarily ideal, and it provides an advantageous alternative criterion and a method of implementation that typically achieves better results than an SD-MAD or an SD-SAD approach. As further addressed below, the approach of the present invention is also significantly less computationally intensive.
One aspect of a motion estimation and compensation process and apparatus in accordance with the present invention is the minimization of the volume of data in the frequency domain, as contrasted with the spatial domain, needed to describe the difference between two blocks. Additionally, in accordance with the present invention, inspection can be performed at the level of the quantized 2D-DCTs of the blocks and not at the level of the pixel blocks. Moreover, a smaller number of values need be inspected in the frequency domain, whereas all of the spatial domain values must be included in the typical spatial domain analysis. Blocks that will be missed by spatial-domain analysis will be identified. Better compression and faster computation thereby may be achieved.
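One reason inspection at the level of the transformed blocks is feasible is that the DCT is linear: the transform of the difference of two blocks equals the difference of their transforms. The Python sketch below verifies this numerically and then counts the non-zero values that survive quantization of the frequency-domain difference. That count, the uniform step `STEP`, and all function names are assumptions made purely for illustration; this passage does not specify the criterion actually used by the invention.

```python
import math

N = 8
STEP = 16  # hypothetical uniform quantizer step


def dct_2d(block):
    """Forward 8x8 2D DCT-II with orthonormal scaling (direct form)."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    return [[c(u) * c(v) * sum(block[y][x]
                               * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                               * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                               for y in range(N) for x in range(N))
             for v in range(N)] for u in range(N)]


def subtract(a, b):
    """Element-wise difference of two N x N matrices."""
    return [[p - q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def nonzero_after_quantization(coeffs, step):
    """Illustrative cost: how many coefficients remain non-zero after rounding."""
    return sum(1 for row in coeffs for value in row if round(value / step) != 0)


# Two blocks with the same gradient texture but a brightness offset of 40.
block_a = [[3 * x + 5 * y for x in range(N)] for y in range(N)]
block_b = [[3 * x + 5 * y + 40 for x in range(N)] for y in range(N)]

freq_diff = subtract(dct_2d(block_a), dct_2d(block_b))   # difference of transforms
diff_freq = dct_2d(subtract(block_a, block_b))           # transform of difference
max_error = max(abs(p - q) for row_p, row_q in zip(freq_diff, diff_freq)
                for p, q in zip(row_p, row_q))

sd_mad = sum(abs(d) for row in subtract(block_a, block_b) for d in row) / (N * N)
cost = nonzero_after_quantization(freq_diff, STEP)
```

In this example the SD-MAD is 40, suggesting a mediocre match, while the frequency-domain difference has a single non-zero quantized value, and only that handful of values needs to be inspected rather than all 64 pixels.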