Recent advances in digital technology have led to new communication media in which video information plays a significant role. Digital television, high definition TV (HDTV), video-conferencing, video-telephony, medical imaging, and multi-media are but a few examples of emerging video information applications.
When compared with text or audio media, video media require a much larger bandwidth, and therefore benefit more from compression that removes redundancies in the data. In the framework of video coding (encoding and decoding), statistical redundancies can be characterized as spatial or temporal. Due to differences in the spatial and temporal dimensions, the two kinds of redundancy are usually handled separately.
Coding that reduces spatial redundancies is referred to as intraframe coding, whereas interframe coding reduces temporal redundancies. Compared to static images, where only spatial redundancies need to be considered, coding of a sequence of images over time requires a more efficient method.
In any case, the compressed bitstream that is produced by the encoding takes less memory to store, and less time to transport. A decoder can later be used to recover the original image sequence. Together encoders and complementary decoders are known as codecs.
As stated above, encoding is done by reducing temporal and spatial redundancies in the image sequence. A number of standards are known for video coding, e.g., MPEG-1, MPEG-2, MPEG-4, and H.263. However, these standards only define the syntax and semantics of the compressed bitstream. The methods used to produce the bitstream are not specified. In other words, the above standards specify how the bitstream should appear so that decoders will operate properly, but not the details of how the bitstream is actually produced in the first place.
One frequently used technique in video coding partitions the pixels of video images or "frames" into "blocks." The optical flow or "motion" of the pixels in the blocks is analyzed to estimate motion information. Compression is achieved, for example, by sending a block once, and then sending the motion information that indicates how the block "moves" in following frames.
The known standards, e.g., MPEG-1, MPEG-2, MPEG-4 and H.263, constrain the motion information to a half-pixel accuracy translation vector per macroblock or block of pixels. A macroblock is 16×16 pixels, and a block is 8×8 pixels; however, the standards do not specify how to estimate the translation vector for the 16×16 macroblocks or 8×8 blocks.
Block matching is the classical method for estimating translational motion in video coding; see Dufaux et al. "Motion estimation technique for digital TV: a review and a new contribution," Proc. of the IEEE, Vol. 83, No. 6, pp. 858-876, June 1995. There, a macroblock in the current image is matched with a macroblock in the previous reference image to minimize a disparity measure expressed as a prediction error signal.
More specifically, using the notation I(r, t) for an image I at pixel r and time t, W for the measurement window, e.g., all the pixels in a macroblock, and S for the search window, a translation vector d is obtained by:
$$d = \arg\min_{d \in S} \sum_{r \in W} \left\| I(r, t) - I(r - d, t - 1) \right\|$$
where the most widely used distance measures are the quadratic norm $\|x\| = x^2$ and the absolute value $\|x\| = |x|$. The latter is usually preferred due to its lower computational complexity.
In full-search block matching, an exhaustive search of all discrete candidate displacements within a maximum displacement range is performed.
This method is guaranteed to reach the global minimum for the matching criterion at the cost of high computational complexity.
Indeed, the maximum displacement for normal video sequences is typically ±15 or ±31 pixels, hence requiring the evaluation of the matching criterion at (2·15+1)² = 961 or (2·31+1)² = 3969 positions. Furthermore, although the resulting motion vectors minimize the prediction error signal, they may not represent the true motion in the sequence of images.
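The full-search procedure described above can be sketched as follows. This is a minimal illustration and not the method of any particular standard or test model. It assumes grayscale images stored as lists of rows, the absolute value as the norm, and the convention that a pixel r of the current image is compared with pixel r - d of the reference image. The names sad, num_candidates and full_search are hypothetical.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between the block of `cur` at (top, left)
    and the block of `ref` displaced by (dy, dx).
    Returns None when the displaced block leaves the reference image."""
    ry, rx = top - dy, left - dx  # assumed convention: cur(r) vs. ref(r - d)
    if ry < 0 or rx < 0 or ry + size > len(ref) or rx + size > len(ref[0]):
        return None
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[top + y][left + x] - ref[ry + y][rx + x])
    return total


def num_candidates(max_disp):
    """Number of integer displacements in a +/- max_disp search window."""
    return (2 * max_disp + 1) ** 2


def full_search(cur, ref, top, left, size=16, max_disp=15):
    """Exhaustively test every integer displacement in the search window and
    return (best_vector, best_cost), where best_vector is (dy, dx)."""
    best_cost, best_vec = None, None
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            cost = sad(cur, ref, top, left, dy, dx, size)
            if cost is not None and (best_cost is None or cost < best_cost):
                best_cost, best_vec = cost, (dy, dx)
    return best_vec, best_cost
```

With max_disp set to 15 or 31, num_candidates gives the 961 and 3969 positions mentioned above, which is the source of the high computational cost.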
Because it takes fewer bits to transmit a zero motion vector, the displacement (0, 0) is usually favored during the estimation process. More precisely, when computing the disparity of the zero displacement, the disparity measure is reduced by a fixed amount, e.g., 100 when the absolute value is used as the norm.
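The zero-vector preference can be illustrated with a small sketch. It assumes the disparity of every candidate displacement has already been computed and stored in a dictionary keyed by (dy, dx). The helper names biased_cost and select_vector are hypothetical, and the default bias of 100 is simply the example value quoted above.

```python
ZERO_BIAS = 100  # example value from the text for the absolute-value norm


def biased_cost(cost, displacement, bias=ZERO_BIAS):
    """Reduce the disparity of the zero displacement by a fixed amount."""
    return cost - bias if displacement == (0, 0) else cost


def select_vector(costs, bias=ZERO_BIAS):
    """Pick the displacement with the lowest biased disparity.
    `costs` maps (dy, dx) to the raw disparity of that displacement.
    On a tie, the zero displacement wins (an assumption of this sketch)."""
    return min(costs, key=lambda d: (biased_cost(costs[d], d, bias), d != (0, 0)))
```

For example, with raw disparities of 1050 for (0, 0) and 980 for (0, -2), the biased disparity of the zero vector is 950, so the zero vector is selected even though its raw prediction error is larger.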
The above method yields motion vectors with one-pixel accuracy. However, by interpolating the reference image at half-pixel locations, the method can straightforwardly be extended to half-pixel accuracy motion vectors. In practice, one-pixel accuracy motion vectors are estimated first; they are then refined to half-pixel precision by searching the eight closest half-pixel locations.
The MPEG-2 Test Model and the MPEG-4 Verification Model are both based on the above full-search block matching technique with half-pixel refinement; see, respectively, ISO-IEC/JTC1/SC29/WG11, "MPEG-2 Test Model 4," 1993, and ISO-IEC/JTC1/SC29/WG11, "MPEG-4 Verification Model 9," 1998.
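The half-pixel refinement step can be sketched as below. This is an illustration under stated assumptions: vectors are expressed in half-pixel units, the reference image is interpolated by plain bilinear averaging without the rounding rules a particular standard may prescribe, and the names sample_half, sad_half and refine_half_pixel are hypothetical.

```python
def sample_half(ref, y2, x2):
    """Value of `ref` at the half-pixel position (y2 / 2, x2 / 2), obtained by
    averaging the one, two or four surrounding integer pixels."""
    y0, x0 = y2 // 2, x2 // 2
    y1, x1 = y0 + y2 % 2, x0 + x2 % 2
    return (ref[y0][x0] + ref[y0][x1] + ref[y1][x0] + ref[y1][x1]) / 4.0


def sad_half(cur, ref, top, left, dy2, dx2, size):
    """Sum of absolute differences for a displacement (dy2, dx2) given in
    half-pixel units. Returns None if the block leaves the reference image."""
    y_min, y_max = 2 * top - dy2, 2 * (top + size - 1) - dy2
    x_min, x_max = 2 * left - dx2, 2 * (left + size - 1) - dx2
    if y_min < 0 or x_min < 0:
        return None
    if (y_max + 1) // 2 >= len(ref) or (x_max + 1) // 2 >= len(ref[0]):
        return None
    total = 0.0
    for y in range(top, top + size):
        for x in range(left, left + size):
            total += abs(cur[y][x] - sample_half(ref, 2 * y - dy2, 2 * x - dx2))
    return total


def refine_half_pixel(cur, ref, top, left, int_vec, size=16):
    """Refine a one-pixel accuracy vector (dy, dx) by testing the eight closest
    half-pixel locations. Returns (vector in half-pixel units, cost)."""
    cy, cx = 2 * int_vec[0], 2 * int_vec[1]
    best_vec = (cy, cx)
    best_cost = sad_half(cur, ref, top, left, cy, cx, size)
    for oy in (-1, 0, 1):
        for ox in (-1, 0, 1):
            if (oy, ox) == (0, 0):
                continue  # the center was evaluated above
            cost = sad_half(cur, ref, top, left, cy + oy, cx + ox, size)
            if cost is not None and (best_cost is None or cost < best_cost):
                best_cost, best_vec = cost, (cy + oy, cx + ox)
    return best_vec, best_cost
```

A returned vector of (0, -1), for instance, denotes a displacement of zero pixels vertically and minus one half pixel horizontally.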
Fast search techniques have been proposed to reduce the computational complexity of the full-search technique, see Jain et al., "Displacement measurement and its application in interframe image coding," IEEE Trans. Commun., Vol. COM-29, pp. 1799-1808, December 1981, Koga et al., "Motion compensated interframe coding of video conferencing," Proc. Nat. Telecommun. Conf., New Orleans, La., December 1981, pp. G5.3.1-G5.3.5, Srinivasan et al., "Predictive coding based on efficient motion estimation," IEEE Trans. Commun., Vol. COM-33, pp. 888-896, August 1985, and Liu et al., "New fast algorithm for the estimation of block motion vectors," IEEE Trans. Circ. and Syst. for Video Tech., Vol. CSVT-3, No. 2, pp. 148-157, April 1993. However, using these techniques, convergence toward the global minimum is no longer guaranteed.
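A coarse-to-fine "three-step" style search illustrates the trade-off made by such fast techniques. The sketch below is a generic illustration and is not claimed to reproduce any of the cited algorithms exactly. It assumes the absolute value as the norm, the convention that cur(r) is compared with ref(r - d), and hypothetical function names.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences for displacement (dy, dx); None if the
    displaced block leaves the reference image."""
    ry, rx = top - dy, left - dx  # assumed convention: cur(r) vs. ref(r - d)
    if ry < 0 or rx < 0 or ry + size > len(ref) or rx + size > len(ref[0]):
        return None
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[top + y][left + x] - ref[ry + y][rx + x])
    return total


def three_step_search(cur, ref, top, left, size=16, first_step=4):
    """Test the eight neighbors of the current best position at a given step
    size, move to the best one, halve the step, and repeat.
    Returns (vector, cost, number of positions evaluated)."""
    center = (0, 0)
    best_cost = sad(cur, ref, top, left, 0, 0, size)
    evaluations = 1
    step = first_step
    while step >= 1:
        best = center
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                if (oy, ox) == (0, 0):
                    continue  # the center cost is already known
                cand = (center[0] + oy, center[1] + ox)
                cost = sad(cur, ref, top, left, cand[0], cand[1], size)
                if cost is None:
                    continue
                evaluations += 1
                if cost < best_cost:
                    best_cost, best = cost, cand
        center = best
        step //= 2
    return center, best_cost, evaluations
```

With a first step of 4, the search covers displacements up to ±7 pixels using at most 25 evaluations instead of the 225 needed by a full search over the same range. Because each step only follows the locally best candidate, the result can be a local minimum, which is the loss of guarantee noted above.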
The above methods deal with images at a single resolution scale. To reduce computational complexity, and to take into account the multi-scale nature of the motion in a scene, hierarchical and multigrid block matching techniques for block-based motion estimation have also been proposed, see Bierling, "Displacement estimation by hierarchical block matching," SPIE Proc. Visual Commun. and Image Process. '88, Cambridge, Mass., November 1988, Vol. 1001, pp. 942-951.
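The hierarchical idea can be sketched with a simple image pyramid: the vector is estimated at the coarsest level with a small search, then doubled and refined at each finer level. This is a generic illustration rather than the cited algorithm. It assumes 2x2 averaging for downsampling, a block position and size divisible by 2 raised to the number of levels, and hypothetical function names.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 group of pixels."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]) - 1, 2)]
            for y in range(0, len(img) - 1, 2)]


def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences; None if the block leaves the image."""
    ry, rx = top - dy, left - dx  # assumed convention: cur(r) vs. ref(r - d)
    if ry < 0 or rx < 0 or ry + size > len(ref) or rx + size > len(ref[0]):
        return None
    return sum(abs(cur[top + y][left + x] - ref[ry + y][rx + x])
               for y in range(size) for x in range(size))


def local_search(cur, ref, top, left, size, center, radius):
    """Full search restricted to +/- radius around `center`.
    Returns (cost, vector) or None if no candidate is valid."""
    best = None
    for dy in range(center[0] - radius, center[0] + radius + 1):
        for dx in range(center[1] - radius, center[1] + radius + 1):
            cost = sad(cur, ref, top, left, dy, dx, size)
            if cost is not None and (best is None or cost < best[0]):
                best = (cost, (dy, dx))
    return best


def hierarchical_search(cur, ref, top, left, size=16, levels=2,
                        coarse_radius=2, refine_radius=1):
    """Estimate at the coarsest level, then double and refine the vector at
    each finer level. Returns (vector, cost) at full resolution."""
    pyramid = [(cur, ref)]
    for _ in range(levels):
        c, r = pyramid[-1]
        pyramid.append((downsample(c), downsample(r)))
    vec, cost = (0, 0), None
    for level in range(levels, -1, -1):
        c, r = pyramid[level]
        scale = 2 ** level
        radius = coarse_radius if level == levels else refine_radius
        found = local_search(c, r, top // scale, left // scale,
                             size // scale, vec, radius)
        if found is not None:
            cost, vec = found
        if level > 0:
            vec = (2 * vec[0], 2 * vec[1])  # propagate to the next finer level
    return vec, cost
```

With the default parameters, a displacement range of roughly ±11 pixels is covered with 25 + 9 + 9 evaluations, most of them on small, low-resolution blocks.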
While block matching motion estimation techniques are the most widely used in the field of video coding, other methods have been proposed for image sequence analysis. Notably, gradient techniques are widely used in computer vision, see Horn et al., "Determining optical flow," Artif. Intell., Vol. 17, pp. 185-203, 1981, and Lucas et al., "An iterative image registration technique with application to stereo vision," Proc. Image Understanding Workshop, pp. 121-130, 1981. Although these methods estimate the motion in the scene effectively, they do not always perform well in minimizing the prediction error signal.
Therefore, it is desired to provide a method for producing motion estimates that is computationally efficient and yields high visual quality, while at the same time reducing prediction errors.
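For comparison with block matching, a single least-squares step of a gradient technique can be sketched as follows. This is a simplified illustration in the spirit of the cited gradient methods and not a reproduction of either one. It assumes small, sub-pixel motion, central differences for the spatial gradients, a window that stays at least one pixel away from the image border, and the hypothetical function name lucas_kanade_step.

```python
def lucas_kanade_step(prev, cur, top, left, size):
    """Solve, in the least-squares sense over a window, the linearized
    constraint  Ix * dx + Iy * dy + It = 0  for a single vector (dy, dx).
    Returns None when the 2x2 system is singular (e.g., a flat window)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for y in range(top, top + size):
        for x in range(left, left + size):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # horizontal gradient
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # vertical gradient
            it = cur[y][x] - prev[y][x]                   # temporal difference
            a11 += ix * ix
            a12 += ix * iy
            a22 += iy * iy
            b1 += ix * it
            b2 += iy * it
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        return None
    # Solve [[a11, a12], [a12, a22]] * [dx, dy]^T = -[b1, b2]^T
    dx = (-a22 * b1 + a12 * b2) / det
    dy = (a12 * b1 - a11 * b2) / det
    return dy, dx
```

The result is a real-valued vector obtained without any search, but it minimizes a linearized brightness constraint rather than the block prediction error itself, which is consistent with the limitation noted above.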