Motion estimation is an essential part of modern digital video encoding systems. Such standards as ISO MPEG-1 (formally known as ISO/IEC 11172), ISO MPEG-2 (formally known as ISO/IEC 13818), ITU-T H.261, ITU-T H.263, ISO MPEG-4 and many other video coding technologies utilize motion compensation as a mean for reducing data redundancy by transmitting components of motion vector (MV) for a set of rectangular pixel blocks (macroblocks in MPEG-1 and MPEG-2 terminology).
The process of searching the best suitable motion vectors for all macroblocks is called motion estimation (ME). It is generally known to be the most computationally complex element of video encoding systems. It is often the most expensive and power-consuming part of digital video encoding systems and the most processor intensive part of software video encoding and video conferencing applications. Similar procedure to ME is used in some other applications that involve searching for data correlation in images or 2-dimensional arrays, for example: image recognition, robotic vision, still image data compression and so on.
When the motion estimation procedure is completed, the digital video encoder stores the motion vector components for every pixel block of the video frame in the output data stream so that the video decoder will be able to apply motion compensation when reconstructing the video frame sequence.
The ME algorithm of is not a normative part of video encoding standard specifications, so the developer has some freedom in choosing a ME method. The purpose is to find a particular macroblock on the current video frame that is most similar to the macroblock on the previous video frame. As is known, a macroblock which is often used to refer to the pel data, contains the luminance data a chroma format. For example, a macroblock may comprise the four 8 by 8 blocks of luminance data and the two (for 4:2:0 chroma format), four (for 4:2:2 chroma format) or eight (for 4:4:4 chroma format) corresponding 8 by 8 blocks of chrominance data coming from a 16 by 16 section of the luminance component of the picture.
Often the search area is limited by some fixed maximum value of motion vector components. In many video encoding methods (for example MPEG-1 and MPEG-2), from a set of equally good macroblocks and motion vectors, which represent said macroblock's position, the one with the minimum length motion vector is preferable because it requires fewer bits to encode the motion vector components. A motion vector search requires a criterion specifying the best outcome. Usually mean squared error (MSE) or mean absolute distortion (MAD) is used as a block matching criterion. Some other criteria for block matching have been proposed. For the subject of the present invention the choice of block matching criterion is not significant.
Full or exhaustive spiral search (calculating matching criteria for every mackroblock of the current frame carried out in a spiral order within a limited search area on the previous frame) is usually assumed to give the best possible quality of ME. Unfortunately, full spiral search algorithms require a large number of calculations and are not suitable for many applications. For example, full spiral search ME for a picture size of 360.times.240 pixels within search area of +/-64 pixels and with a macroblock size of 16.times.16 pixels requires on the order of 10.sup.10 integer arithmetic operations.
Many different approaches have been proposed to reduce the computational cost of ME. The proposed "fast ME" algorithms can be categorized into several groups, described below. All of them are based on assumptions about a particular type of continuity in the dependency of matching criterion from motion vector components, or some temporal or spatial correlation between the best possible motion vector values. Such assumptions do not always hold, as we will describe later. Such assumptions underly all known fast ME methods and limit their usefulness.
Fast ME with Unimodal Error Surface Assumption
Many kinds of fast ME algorithms restrict the number of tested MV by assuming that matching error increases monotonically when the point of the ME search moves away from the best possible ME position. However, this assumption does not usually hold for real input video signals. A real input video signal always contains some level of noise in its pixel values and often contains an unpredictable amount of small and large visible objects in the frame. As a result the search is often trapped to a local minimum position far from the global minimum of a targeting matching criterion.
One example of such an algorithm is the two-dimensional logarithmic search that described by T.Koga, K.Iinuma, A.Hirano etc. in "Motion compensated interframe coding for video conferencing", Proc.Nat.Telecommunication Conf., pp. G5.3.1-5.3.5, Nov, 29-Dec. 3 1981. Another example of a method that is related to unimodal error surface assumption described in U.S. Pat. No. 5,623,313 to Naveen, entitled "Fractional pixel motion estimation of video signal".
Fast ME with Pixel Subsampling
One class of fast ME algorithms, sometimes referred to as Fast ME with Pixel Subsampling, is based on limiting the number of pixels used in the calculation of matching criterion. It is based on the assumption that if the error in some subset of pixels in the macroblock is minimized, the targeting matching criterion is minimized as well. This assumption usually does not hold for real input video signals. In addition, this algorithm provides for a relatively small reduction in the number of required calculations, usually from 4 to 16 times, and further reduction is desired.
Hierarchical and Multiresolution Methods of Fast ME
Another group of fast ME methods is based on doing a preliminary search of motion vectors on a coarse-resolution frame and refining the predicted motion vector in the frame with fine resolution. This is known as a hierarchical method and uses several levels of search with the same image size, but different block sizes at every level, as proposed in M.Bierling, "Displacement estimation by hierarchical block matching", Proc. SPIE Visual Communications and Image Processing 1988, vol. 1001, pp 942-951. This approach is based on the assumption that the motion vector obtained for the large block size is a good starting point for the search for smaller block sizes. However, this assumption is not always true and may not give a good initial motion vector.
A multiresolution method (also known as a pyramidal method) uses several coarse levels of image created from original image by reducing image resolution. In this approach, the search is performed first on the coarsest-resolution image for the proportionally reduced size of macroblock. Then, the best resulting motion vector is interpolated to obtain an initial value of motion vector for the next level. The motion vector is then refined, by searching in a small area on the next level.
A variation of this method was described in K.M.Uz, M.Verittiand D.LeGall, "Interpolative multiresolution coding of advanced television with comparable subchannels", IEEE Trans. Circuits Syst. Video Technol., vol.1, pp.86-99, March 1991., and in S.Zafar, Y.Q.Zhang, B.Jabbari,"Multiscale video representation using multiresolution motion compensation and wavelet decomposition", IEEE J.Select.Areas Commun., vol.11, pp.24-35, January 1993. Another example of hierarchical ME is described in U.S. Pat. No. 5,635,994 to Drexler et al, entitled "Method of making a hierarchical estimate of image motion in a television signal." This class of ME methods is based on the assumption that the best motion vector for a given coarse resolution image level and proportionally reduced size of macroblock is a good estimation for the next finer level. This may work for smooth, relatively monotonous images, but not for images with fine details that are invisible at coarse resolutions. This approach is also less effective for input with significant levels of noise in the signal.
Fast ME Based on Assumption of Spatial or Temporal Correlation of MV
Another class of fast ME methods, known as Fast ME based on assumption of spatial or temporal correlation of MV is based on the assumption of a correlation between values of optimum motion vectors for spatially neighboring macroblocks on the same frame, or temporally neighboring macroblocks for the same location on the sequential frames. One widely known example of this method is a telescopic search, which assumes that the motion vector for a particular macroblock can be a good initial approximation for the motion vector search on sequential frames.
Another algorithm based on assumption of spatial or temporal correlation, was described by S.Zafar, Y.-Q.Zhang, J.S.Baras, "Predictive block-matching motion estimation for TV coding", IEEE Trans.Broadcast., vol. 37, pp. 97-105, September 1991. A hybrid approach using spatial/temporal correlation and multiresolution, was described by J. Chalidabhongse. C.-C.Kuo, in "Fast motion vector estimation using multiresolution-spatio-temporal correlations", IEEE Trans. On Circ. and Syst. for video technology, vol.7, No3, June 1997.
The assumption of spatial or temporal correlation between values of optimum motion vector for spatially neighboring macroblocks may not hold for real input video signals. Although it may be valid for estimating the motion of large objects without noise in the video signal, it will often not work well on video containing many small moving objects and high level of noise.
Fast MA Based on Reducing Data set for Block Matching
Another class of fast ME methods, referred to as Fast ME based on reducing data set for block matching replaces initial video frame data with a reduced data set before performing ME. An example of this method is described in Xiaobing Lee, Ya-Qin Zhang, "A fast hierarchical motion-compensation scheme for video coding using block feature matching", IEEE Trans. On Circ. and Syst. for video technology, vol.6, No. 6, December 1996. This approach proposes replacing the initial image data by a sign truncated feature (STF) representation consisting of a mean values vector and sign-truncated vector, which represent every pixel by one bit. The value of this bit depends on the difference in sign between actual pixel data value and mean value. The ME search is then implemented on the STF data set.
Another example of this method was described in U.S. Pat. No. 5,612,746 to Slavin, entitled "Block matching for picture motion estimation using offset quantized pixels." This patent proposed replacing initial image data by quantized pixel values and measuring macroblock correlation by comparing the quantized pixel values between macroblocks. This essentially introduces new block matching criteria, not new ME methods. Although these new matching criteria can be calculated faster, there is no reason to expect better resultant values than those calculated using standard mean square error (MSE) or mean absolute distortion (MAD) criteria.
Selecting Macroblock by Means of a Set of Characteristics
A method based on calculating a set of characteristics for every pixel block, was described by D. R. McGregor, R. J. Fryer, P. Cockshot, P Murray,"Faster Fractal Compression", DDJ, pp.34-40, January 1996. In that paper the method was applied not to the motion estimation, but to the problem of fast still-image fractal compression. A set of gross characteristics was calculated for every pixel block in the image. The results of this gross characteristic calculation are then sorted in the form of a tree structure. When looking for the best match for a particular block of pixels, the first K number of best similar blocks (using the selected set of gross characteristics) are extracted from the tree structure and compared with the current block. This method is based on the assumption that the best matching block must be among the first K blocks extracted from said tree structure. This assumption may or may not be true. In this regard, the organizing data for all blocks from the frame in a tree structure is not suitable for the task of ME in video sequences because it does not easily allow the restriction of a search by a maximum of motion vector component values. It also does not provide means for selecting blocks with shorter motion vectors. Finally, sorting all of the data into said tree structure may itself take significant computation time.
As discussed above, the known fast methods and algorithms for motion estimation are generally based on one or more assumptions about expected continuity in the image frames of input video data: either the continuity in dependency of matching criterion from MV or some temporal or spatial correlation between the best possible MV values. Even in simple cases, when such assumptions are true for most of a given image area, there is some percentage of pixel areas in every frame in which those assumptions are not valid. For this reason, methods of fast ME that are based on these types of assumptions will fail to find good matching block for macroblocks in these areas. This results in dramatically reduced image quality in such image areas or will require higher data rates for transmitting said image areas with acceptable quality.