The present invention relates generally to a method of compressing or coding digital video with bits and, specifically, to an effective method for estimating and encoding motion vectors in motion-compensated video coding.
In classical motion estimation the current frame to be encoded is decomposed into image blocks of the same size, typically blocks of 16×16 pixels, called “macroblocks.” For each current macroblock, the encoder searches for the block in a previously encoded frame (the “reference frame”) that best matches the current macroblock. The coordinate shift between a current macroblock and its best match in the reference frame is represented by a two-dimensional vector (the “motion vector”) of the macroblock. Each component of the motion vector is measured in pixel units.
For example, if the best match for a current macroblock happens to be at the same location, as is the typical case in stationary background, the motion vector for the current macroblock is (0,0). If the best match is found two pixels to the right and three pixels up from the coordinates of the current macroblock, the motion vector is (2,3). Such motion vectors are said to have integer pixel (or “integer-pel” or “full-pel”) accuracy, since their horizontal X and vertical Y components are integer pixel values. In FIG. 1, the vector V1=(1,1) represents the full-pel motion vector for a given current macroblock.
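The block-matching search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular encoder: the function names `sad` and `full_pel_search`, the default `search_range`, and the use of the sum of absolute differences as the matching cost are all choices made for this example. Frames are plain lists of rows, and the vertical component is taken as positive upward so that the result agrees with the (2,3) example in the text.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    at column bx, row by and the block of `ref` at column rx, row ry."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[by + r][bx + c] - ref[ry + r][rx + c])
    return total


def full_pel_search(cur, ref, bx, by, size=16, search_range=7):
    """Exhaustive integer-pel search for the block at column bx, row by.

    Returns ((dx, dy), cost). dx is positive to the right and dy is positive
    upward, so a match two pixels right and three pixels up is (2, 3).
    Assumes `cur` and `ref` have the same dimensions.
    """
    height, width = len(ref), len(ref[0])
    # Start from the zero vector so that ties favour a stationary block.
    best_mv, best_cost = (0, 0), sad(cur, ref, bx, by, bx, by, size)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by - dy  # rows grow downward, so "up" is -dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

Calling `full_pel_search(cur, ref, bx, by)` on a macroblock of stationary background returns the vector (0, 0), since the search starts from the zero vector and only replaces it when a strictly cheaper candidate is found.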
Moving objects in a video scene do not move in integer pixel increments from frame to frame. True motion can take any real value along the X and Y directions. Consequently, a better match for a current macroblock can often be found by interpolating the previous frame by a factor N×N and then searching for the best match in the interpolated frame. The motion vectors can then take values in increments of 1/N pixel along X and Y and are said to have 1/N pixel (or “1/N-pel”) accuracy.
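As an illustration of interpolating a frame by a factor N×N, the sketch below up-samples a frame with bilinear interpolation. The choice of a bilinear filter, the function name `upsample_bilinear`, and the convention that input sample (r, c) lands at output position (N·r, N·c) are assumptions made for this example; other interpolation filters could be substituted.

```python
def upsample_bilinear(frame, n):
    """Up-sample `frame` (a list of rows) by n in each direction using
    bilinear interpolation. Input sample (r, c) lands at output (n*r, n*c),
    so one output step corresponds to 1/n pixel of the original frame."""
    h, w = len(frame), len(frame[0])
    out_h, out_w = (h - 1) * n + 1, (w - 1) * n + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for row in range(out_h):
        r0, fr = divmod(row, n)        # integer row and sub-pel phase
        r1 = min(r0 + 1, h - 1)        # clamp at the bottom edge
        a = fr / n                     # vertical weight in [0, 1)
        for col in range(out_w):
            c0, fc = divmod(col, n)    # integer column and sub-pel phase
            c1 = min(c0 + 1, w - 1)    # clamp at the right edge
            b = fc / n                 # horizontal weight in [0, 1)
            top = (1 - b) * frame[r0][c0] + b * frame[r0][c1]
            bottom = (1 - b) * frame[r1][c0] + b * frame[r1][c1]
            out[row][col] = (1 - a) * top + a * bottom
    return out
```

A displacement of k positions in the up-sampled frame corresponds to a motion vector component of k/N pixels, so searching the up-sampled frame yields vectors with 1/N-pel accuracy.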
In “Response to Call for Proposals for H.26L,” ITU-Telecommunications Standardization Sector, Q.15/SG16, doc. Q15-F-11, Seoul, Nov. 98, and “Enhancement of the Telenor proposal for H.26L,” ITU-Telecommunications Standardization Sector, Q.15/SG16, doc. Q15-G-25, Monterey, Feb. 99, Gisle Bjontegaard proposed using ⅓-pel accurate motion vectors and cubic-like interpolation for the H.26L video coding standard (the “Telenor encoder”). To do this, the Telenor encoder interpolates or “up-samples” the reference frame by 3×3 using a cubic-like interpolation filter. This interpolated version requires nine times more memory than the reference frame. At a given macroblock, the Telenor encoder estimates the best motion vector in two steps: it first searches for the best integer-pel vector V1 and then searches for the best ⅓-pixel accurate vector V1/3 near V1. Using FIG. 1 as an example, a total of eight blocks (of 16×16 pixels) in the 3×3 interpolated reference frame are checked to find the best match which, as shown, is the block associated with the motion vector V1/3=(VX, VY)=(1+⅓,1). The Telenor encoder has several problems. First, it uses a sub-optimal fast-search strategy and a complex cubic filter (at all stages) to compute the ⅓-pel accurate motion vectors. As a result, the computed motion vectors are not optimal and the memory and computation requirements are very high. Further, the accuracy of the effective rate-distortion criterion used by the Telenor encoder is fixed at ⅓-pixel and, therefore, the encoder does not adapt to select better motion accuracies. Similarly, the Telenor encoder variable-length code (“VLC”) table has an accuracy fixed at ⅓-pixel and, therefore, is not adapted and interpreted differently for different accuracies.
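The second step of the two-step search described above can be sketched as follows. This is a hedged illustration only and is not taken from the Telenor proposal: the names `block_cost` and `refine_sub_pel` are hypothetical, the cost is assumed to be a sum of absolute differences, the integer-pel vector is assumed to remain a candidate alongside its eight sub-pel neighbours, and `ref_up` is assumed to be a reference frame already up-sampled by `n` so that integer position (r, c) lies at (n·r, n·c). The vertical component is taken as positive upward.

```python
from fractions import Fraction


def block_cost(cur, ref_up, bx, by, ux, uy, size, n):
    """SAD between the current block and the block of the up-sampled
    reference whose top-left sample is at column ux, row uy (1/n-pel units)."""
    total = 0.0
    for r in range(size):
        for c in range(size):
            total += abs(cur[by + r][bx + c] - ref_up[uy + n * r][ux + n * c])
    return total


def refine_sub_pel(cur, ref_up, bx, by, v1, size=16, n=3):
    """Test the eight 1/n-pel neighbours of the integer-pel vector
    v1 = (dx, dy) and return ((vx, vy), cost) for the cheapest candidate.
    Assumes the block addressed by v1 itself lies inside `ref_up`."""
    up_h, up_w = len(ref_up), len(ref_up[0])
    span = n * (size - 1)                        # extent of a block in ref_up
    cx, cy = n * (bx + v1[0]), n * (by - v1[1])  # v1 in up-sampled coordinates
    best_mv = (Fraction(v1[0]), Fraction(v1[1]))
    best_cost = block_cost(cur, ref_up, bx, by, cx, cy, size, n)
    for sy in (-1, 0, 1):
        for sx in (-1, 0, 1):
            if sx == 0 and sy == 0:
                continue                         # v1 was already evaluated
            ux, uy = cx + sx, cy - sy            # "up" means a smaller row
            if ux < 0 or uy < 0 or ux + span >= up_w or uy + span >= up_h:
                continue                         # outside the up-sampled frame
            cost = block_cost(cur, ref_up, bx, by, ux, uy, size, n)
            if cost < best_cost:
                best_mv = (v1[0] + Fraction(sx, n), v1[1] + Fraction(sy, n))
                best_cost = cost
    return best_mv, best_cost
```

With `n=3` and `v1=(1, 1)`, a best match one sub-pel step to the right is reported as `(Fraction(4, 3), Fraction(1, 1))`, which corresponds to the vector (1+⅓,1) in the FIG. 1 example. Because only the neighbours of V1 are examined, the result can miss a better vector farther away, which is the sub-optimality the text attributes to fast-search strategies.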
Most known video compression methods estimate and encode motion vectors with ½-pixel accuracy, because early studies suggested that higher or adaptive motion accuracies would increase computational complexity without providing additional compression gains. These early studies, however, did not estimate the motion vectors using optimized rate-distortion criteria, did not exploit the convexity properties of such criteria to reduce computational complexity, and did not use effective strategies to encode the motion vectors and their accuracies.
One such early study was Bernd Girod's “Motion-Compensating Prediction with Fractional-Pel Accuracy,” IEEE Transactions on Communications, Vol. 41, No. 4, pp. 604-612, April 1993 (the “Girod work”). The Girod work is the first fundamental analysis on the benefits of using sub-pixel motion accuracy for video coding. Girod used a simple, hierarchical strategy to search for the best motion vector in sub-pixel space. He also used simple mean absolute difference (“MAD”) criteria to select the best motion vector for a given accuracy. The best accuracy was selected using a formula that is not useful in practice since it is based on idealized assumptions, is very complex, and restricts all motion vectors to have the same accuracy within a frame. Finally, Girod focused only on prediction error energy and did not address how to use bits to encode the motion vectors.
Another early study was Smita Gupta's and Allen Gersho's “On Fractional Pixel Motion Estimation,” Proc. SPIE VCIP, Vol. 2094, pp. 408-419, Cambridge, November 1993 (the “Gupta work”). The Gupta work presented a method for computing, selecting, and encoding motion vectors with sub-pixel accuracy for video compression. The Gupta work disclosed a formula based on mean squared error (“MSE”) and bilinear interpolation, used this formula to find an ideal motion vector, and then quantized that vector to the desired motion accuracy. The best motion vector for a given accuracy was found using the sub-optimal MSE criterion, and the best accuracy was selected using the largest decrease in difference energy per distortion bit, which is a greedy (sub-optimal) criterion. A given motion vector was coded by first encoding that vector with ½-pel accuracy and then encoding the higher accuracy with refinement bits. Such coarse-to-fine coding tends to require significant bit overhead.
In “On the Optimal Motion Vector Accuracy for Block-Based Motion-Compensated Video Coders,” Proc. IST/SPIE Digital Video Compression: Algorithms and Technologies, pp. 302-314, San Jose, February 1996 (the “Ribas work”), Jordi Ribas-Corbera and David L. Neuhoff modeled the effect of motion accuracy on bit rate and proposed several methods to estimate the optimal accuracies that minimize bit rate. The Ribas work set forth a full-search approach for computing motion vectors for a given accuracy and considered only bilinear interpolation. The best motion vector was found by minimizing MSE, and the best accuracy was selected using formulas derived from a rate-distortion optimization. The motion vectors and accuracies were encoded with frame-adaptive entropy coders, which are complex to implement in real-time applications.
In “Proposal for a new core experiment on prediction enhancement at higher bitrates,” ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, MPEG 97/1827, Sevilla, February 1997 and “Performance Evaluation of a Reduced Complexity Implementation for Quarter Pel Motion Compensation,” ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, MPEG 97/3146, San Jose, January 1998, Ulrich Benzler proposed using ¼-pel accurate motion vectors for the video sequence and more advanced interpolation filters for the MPEG-4 video coding standard. Benzler, however, used Girod's fast-search technique to find the ¼-pel motion vectors. Benzler did consider different interpolation filters, but proposed a complex filter at the first stage and a simpler filter at the second stage and interpolated one macroblock at a time. This approach does not require much cache memory, but it is computationally expensive because of its complexity and because all motion vectors are computed with ¼-pel accuracy for all the possible modes in a macroblock (e.g., 16×16, four-8×8, sixteen-4×4, etc.) before the best mode is determined. Benzler used the MAD criterion to find the best motion vector, which was fixed to ¼-pel accuracy for the whole sequence, and hence he did not address how to select the best motion accuracy. Finally, Benzler encoded the motion vectors with a VLC table that could be used for encoding ½- and ¼-pixel accurate vectors.
The references discussed above do not estimate the motion vectors using optimized rate-distortion criteria and do not exploit the convexity properties of such criteria to reduce computational complexity. Further, these references do not use effective strategies to encode motion vectors and their accuracies.