The present invention relates generally to digital video compression, and more particularly, to a hardware-efficient, high-performance motion estimation algorithm that has particular utility in H.261 digital video encoders.
Many different video compression algorithms have been developed for digitally encoding (“compressing”) video data in order to minimize the bandwidth required to transmit the digitally-encoded video data (“digital video data”) for a given picture quality. Several multimedia specification committees have established and proposed standards for encoding/compressing audio and video data. The most widely known and accepted international standards have been proposed by the Moving Picture Experts Group (MPEG), including the MPEG-1 and MPEG-2 standards. Officially, the MPEG-1 standard is specified in the ISO/IEC 11172-2 standard specification document, which is herein incorporated by reference, and the MPEG-2 standard is specified in the ISO/IEC 13818-2 standard specification document, which is also herein incorporated by reference. These MPEG standards for moving picture compression are used in a variety of current video playback products, including digital versatile (or video) disk (DVD) players, multimedia PCs having DVD playback capability, and satellite broadcast digital video.
Although the MPEG standards typically provide high picture quality, the data rate/bandwidth requirements are far too great for some applications. Videoconferencing is a particular application that typically does not require the coding resolution afforded by MPEG because the picture content does not normally vary a great deal from picture-to-picture, e.g., most of the motion is confined to a diamond-shaped region in the picture where the head and shoulders of the conferee are located. In short, because there is so little motion in a sequence of moving pictures in a videoconferencing application, there is a great deal of redundancy from picture-to-picture, and consequently, the degree of video data compression which is possible for a given picture quality is much greater. Moreover, the available bandwidth for many videoconferencing systems is less than 2 Mbits/second, which is far too low for MPEG transmissions.
Accordingly, a collaboration of telecommunications operators and manufacturers of videoconferencing equipment developed the H.320 videoconferencing standards for videoconferencing over circuit-switched media like ISDN (Integrated Services Digital Network) and switched-56 connections. H.261 is the video coding component of this standard. It is also known as the P×64 standard since it describes video coding and decoding rates of p×64 kbits/second, where p is an integer from 1 to 30. Thus, the H.261 video coding algorithm compresses video data at data rates ranging from 64 kbits/second to 1,920 kbits/second. The H.320 standard was ratified in Geneva in December of 1990. This standard is herein incorporated by reference.
Like MPEG, the H.261 encoding algorithm uses a combination of DCT (Discrete Cosine Transform) coding and differential coding. However, only I-pictures and P-pictures are used. An I-picture is coded using only the information contained in that picture, and hence, is referred to as an “Intra-coded” or “Intra” picture. A P-picture is coded using motion compensated prediction (or “motion estimation”) based upon information from a past reference (or “anchor”) picture, and hence, is referred to as a “Predictive” or “Predicted” picture. In accordance with the H.261 standard, the compressed digital video data stream is arranged hierarchically in four layers: picture, group of blocks (GOB), macroblock (MB), and block. A picture is the top layer. Each picture is divided into groups of blocks (GOBs). A GOB is either one-twelfth of a CIF (Common Intermediate Format) picture or one-third of a QCIF (Quarter CIF) picture. Each GOB is divided into 33 macroblocks. Each macroblock consists of a 16×16 pixel array.
In short, just like MPEG, H.261 uses motion estimation to code those parts of sequential pictures that vary due to motion, where possible. More particularly, H.261 uses “motion vectors” (MVs) that specify the location of a “macroblock” within the current picture relative to its original location within the anchor picture, based upon a comparison between the pixels of the current macroblock and the corresponding array of pixels in the anchor picture within a given N×N-pixel search range. In accordance with the H.261 standard, the minimum search range is +/−7 pixels, and the maximum search range is +/−15 pixels. It will be appreciated that using the maximum search range in all H.261 applications will not necessarily improve the quality of the compressed signal. In this regard, since H.261 applications can operate at various bit rates, ranging from 64 kbits/second to 1,920 kbits/second, the actual search range employed may vary. For example, at high bit rates, the temporal distance between adjacent pictures is smaller, and thus, a smaller search range can be used to achieve a given picture quality. At low bit rates, the situation is reversed, and a larger search range is required in order to achieve a given picture quality.
Once the motion vector for a particular macroblock has been determined, the pixel values of the closest-matching macroblock in the anchor picture identified by the motion vector are subtracted from the corresponding pixels of the current macroblock, and the resulting differential values are then transformed using a Discrete Cosine Transform (DCT) algorithm, the resulting coefficients of which are each quantized and Huffman-encoded (as is the motion vector and other information pertaining to and identifying that macroblock). If during the motion estimation process no adequate macroblock match is detected in the anchor picture (i.e., the differential value exceeds a predetermined threshold metric), or if the current picture is an I-picture, the macroblock is designated an “Intra” macroblock and the macroblock is coded accordingly.
The H.261 standard does not specify any particular implementation of the motion estimation algorithm employed. Otherwise stated, the H.261 standard leaves open the details of implementation of the motion estimation algorithm to the manufacturers of the videoconferencing systems. In general, various measures or metrics have been utilized and proposed to compute the location of the pixel array within the anchor picture that constitutes the closest match (i.e., minimum difference/error) relative to the current macroblock, and various motion estimation algorithms have been utilized and proposed to search for and locate the closest-matching macroblock in the anchor picture. These motion estimation (M.E.) algorithms are typically performed by software running on a processor, e.g., a TriMedia processor manufactured and sold by Philips Semiconductors, that is tasked with the encoding of the video data in the videoconferencing system. The overarching goal is to locate the closest-matching macroblock in the anchor picture as quickly as possible, while minimizing the load on the processor to execute the algorithm, and maintaining an acceptable level of error/inaccuracy. The hardware/software that actually executes the motion estimation search algorithm is sometimes termed the “search engine”. In terms of the search engine, the overarching goal is to optimize its performance while minimizing the resources required to execute the motion estimation algorithm. Simply stated, the basic goal is to minimize compute effort and compute time.
Among the best-known criteria or metrics for evaluating the quality of a match are the Sum of the Absolute Differences (SAD) and the Sum of the Squared Differences (SSD). The SAD metric constitutes the sum of the absolute values of the differences of each of the N pixels in the current macroblock (N=256 for the case of a 16×16 macroblock) and the respective ones of the corresponding pixels of the comparison macroblock in the anchor picture under evaluation. The SSD metric constitutes the sum of the squares of the above pixel differences. During a given motion estimation search sequence, the candidate macroblock in the anchor picture that yields the smallest SAD or SSD value (whichever criterion/metric is used) is selected as the “best match”. The horizontal and vertical position (i.e., x,y position) of this macroblock relative to the current macroblock (i.e., the x,y “offset”), or a derivative thereof, is specified as the “motion vector” for the current macroblock. If the SAD or SSD value (whichever is used) is larger than a predetermined threshold value, it is determined that a suitable match does not exist for the current macroblock, and it is coded as an Intra macroblock. In general, the SAD metric is easier and faster to compute, but less accurate, than the SSD metric. Otherwise stated, the SSD metric calculations require greater processor exertions than do SAD metric calculations, and thus, can be considered to be more “expensive”, from a “cost function” standpoint.
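By way of illustration only, the two metrics described above may be sketched as follows. The function names `sad` and `ssd`, the representation of a block as a list of pixel rows, and the 2×2 sample values are assumptions made for this example and do not appear in the standard.

```python
def sad(current, candidate):
    """Sum of the Absolute Differences between two equally sized pixel blocks."""
    return sum(abs(c - r)
               for row_c, row_r in zip(current, candidate)
               for c, r in zip(row_c, row_r))


def ssd(current, candidate):
    """Sum of the Squared Differences between two equally sized pixel blocks."""
    return sum((c - r) ** 2
               for row_c, row_r in zip(current, candidate)
               for c, r in zip(row_c, row_r))


# Hypothetical 2x2 blocks (a real macroblock would be 16x16, i.e. N = 256).
current = [[10, 12], [14, 16]]
candidate = [[11, 12], [10, 19]]

sad_value = sad(current, candidate)  # |−1| + 0 + 4 + |−3| = 8
ssd_value = ssd(current, candidate)  # 1 + 0 + 16 + 9 = 26
```

The SSD requires one multiplication per pixel in addition to the subtraction and accumulation, which is the extra cost referred to above.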
In the H.261 domain, assuming a search range of +/−15 pixels, 961 (i.e., 31×31) candidate motion vectors must be evaluated, i.e., there are a total of 961 different macroblock-sized pixel arrays within the given search range of the anchor picture that are candidates for being the “best match” with the current macroblock being evaluated. Each motion vector evaluated will yield a different mean square error (MSE) difference value. The motion vector having the minimum MSE value is the true “best match”. Since each motion vector evaluation requires a large number of subtractions and additions, it is completely impractical for the motion estimation search engine to compute the MSE value for each of the 961 different motion vectors within the given search range. This theoretical “full search algorithm” always produces the true “best match”. However, because it is impractical from an implementation standpoint, it is only used as a reference or benchmark to enable comparison of different more practical motion estimation algorithms that evaluate only a subset of the full set of motion vectors within a given search range, a technique sometimes referred to as “subsampling”. Motion estimation algorithms that use this subsampling technique are sometimes referred to as “fast search algorithms”, because they can be executed far faster and with far fewer computations than a “full search algorithm”.
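The full search benchmark described above may be sketched as follows. This is a minimal illustration under stated assumptions: the SAD is used as the error measure in place of the MSE, the 48×48 synthetic picture, the block position, and the names `block_sad` and `full_search` are hypothetical, and the search window is assumed to lie wholly inside the reference picture.

```python
import random


def block_sad(cur, ref, bx, by, dx, dy, size):
    """SAD between the current block at (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, size=16, search_range=15):
    """Evaluate every candidate vector in the +/- search_range window and keep the best."""
    best_vec, best_err, evaluated = None, None, 0
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            err = block_sad(cur, ref, bx, by, dx, dy, size)
            evaluated += 1
            if best_err is None or err < best_err:
                best_vec, best_err = (dx, dy), err
    return best_vec, best_err, evaluated


# Synthetic data (an assumption for the example): a random reference picture and a
# current picture whose content is the reference displaced by a known vector.
rng = random.Random(1)
SIZE = 48
ref = [[rng.randrange(256) for _ in range(SIZE)] for _ in range(SIZE)]
TRUE_DX, TRUE_DY = 3, -2
cur = [[ref[(y + TRUE_DY) % SIZE][(x + TRUE_DX) % SIZE] for x in range(SIZE)]
       for y in range(SIZE)]

vector, error, evaluated = full_search(cur, ref, bx=16, by=16)
```

With a 16×16 block, each of the 961 evaluations touches 256 pixels, so a single macroblock costs roughly a quarter of a million subtract-and-accumulate steps, which is why the full search serves only as a benchmark.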
Generally speaking, there exists an inherent trade-off between the speed of the motion estimation search, on the one hand, and the accuracy (and thus, the resultant picture quality of the encoded digital video data) of the motion estimation search, on the other hand. Moreover, the performance of the search engine is directly related to its ability to minimize the data set that it produces. In this regard, a motion estimation algorithm that reduces the MSE between the current macroblock and the selected “best match” macroblock in the reference picture by a factor of n will approximately improve performance by the reciprocal of n. Thus, the overarching goal is to devise a motion estimation algorithm (search strategy) that optimizes performance while minimizing the required compute effort and compute time. In this regard, motion estimation can be considered mathematically equivalent to an optimization problem to find a minimum of a cost function.
In order to facilitate a better understanding of the principles underlying the present invention, a review of the theoretical framework of motion estimation searching follows. In overview, the array of MSE differences (961 MSE differences in the H.261 domain) may be visualized as an “error surface” with a height proportional to error. Since most televideoconferencing scenes contain predominantly low spatial frequency data, the error surface is also normally low in spatial frequency and smoothly undulating, with one or only a few “valleys” surrounding the minimum error. These characteristics of the error surface in the H.261 domain allow the use of severe subsampling in the search for a minimum error value for motion estimation. In particular, the error value associated with any single motion vector candidate evaluation provides information about an entire region of the search space, and the comparison of two error values associated with two motion vector candidate evaluations provides further information about the slope of the error surface between the two candidates. Pathological cases may arise, as in imaging a chess board. In such an image there will be a depression in the error surface in every case where the white squares match the white squares, and it would be challenging to locate the true minimum error value where the bishops and knights are also aligned. Thus, it can be appreciated that subsampling can lead to erroneous determinations, and that, all else being equal, the “success” (accuracy/resolution) of any search strategy is related to the number of samples evaluated (i.e., the density of sampling). Moreover, it can be appreciated that the success of a search strategy that relies upon subsampling is at least to some degree dependent upon the continuity of the error surface.
There are also vector correlations from one macroblock to spatially adjacent macroblocks and from a macroblock in one frame to the same macroblock in the following frame. For example, if a conferee's elbow moves three pixels northeast in frame n, it can reasonably be inferred that the conferee's hand in the adjacent macroblock will have the same sort of motion in frame n, and that both macroblocks will have a similar vector in frame n+1. These spatial and temporal correlations are imperfect, but too probable to be ignored. However, a search strategy that relies exclusively on these spatial and temporal correlations can only provide a fraction of the benefit available from motion compensation/motion coding.
Some previously proposed motion estimation algorithms have depended too heavily on the simplicity of the error surface, thereby greatly reducing the probability of correctly interpreting it. For example, one category of motion estimation algorithms that are particularly efficient are those that operate in a “dimension sequential” manner. Although the motion estimation algorithms within this category vary in detail, the following description of an exemplary motion estimation algorithm that operates in a “dimension sequential” manner should serve to illustrate the above-noted weakness with this category of motion estimation algorithms. More particularly, in accordance with this exemplary algorithm, a first series of evaluations is performed along the horizontal axis to locate the first minimum error point along the horizontal axis. A second series of evaluations is performed along a vertical line passing through the above-identified first minimum error point, and a second minimum error point along this vertical line is identified. Since the search is alternately performed in the horizontal and vertical dimensions of the search space, it can be thought of as a “dimension sequential” process. The spacing between candidates is then reduced, and the dimension sequential process is repeated, locating a row of candidates at each minimum error point identified. Finally, if time permits, the immediate neighborhood of the last “winner” (i.e., the candidate having the minimum error value) may be evaluated. Although the total number of evaluations required with this dimension sequential approach is minimal, the probability of missing the true minimum error in the entire error surface is quite high if the error surface contains more than one “valley”.
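The dimension sequential process, and its weakness on an error surface having two valleys, may be illustrated with the following sketch. The candidate spacings (4, 2, 1), the seven candidates per row, the function name `dimension_sequential_search`, and both synthetic error surfaces are assumptions chosen for the example; the optional final neighborhood evaluation is omitted.

```python
def dimension_sequential_search(error_at, steps=(4, 2, 1), half_width=3, max_range=15):
    """Alternate horizontal and vertical rows of candidates, reducing the spacing each pass."""
    cx, cy = 0, 0
    evaluations = 0
    for step in steps:
        # Horizontal row of candidates through the current minimum error point.
        row = [cx + k * step for k in range(-half_width, half_width + 1)
               if abs(cx + k * step) <= max_range]
        evaluations += len(row)
        cx = min(row, key=lambda x: error_at(x, cy))
        # Vertical row of candidates through the horizontal minimum just found.
        column = [cy + k * step for k in range(-half_width, half_width + 1)
                  if abs(cy + k * step) <= max_range]
        evaluations += len(column)
        cy = min(column, key=lambda y: error_at(cx, y))
    return (cx, cy), error_at(cx, cy), evaluations


def one_valley(dx, dy):
    """Smooth surface with a single minimum at (5, -3)."""
    return (dx - 5) ** 2 + (dy + 3) ** 2


def two_valleys(dx, dy):
    """A shallow valley at (-8, 0) with error 20 and the true minimum (error 0) at (10, 10)."""
    return min(20 + (dx + 8) ** 2 + dy ** 2, (dx - 10) ** 2 + (dy - 10) ** 2)


single_result = dimension_sequential_search(one_valley)
double_result = dimension_sequential_search(two_valleys)
```

On the single-valley surface the sketch reaches the minimum in 42 evaluations. On the two-valley surface the first horizontal pass never comes near the deeper valley, so the search settles at (-8, 0) with an error of 20 even though a candidate with zero error exists at (10, 10).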
Most motion estimation algorithms operate two dimensionally and in several levels of increasing resolution (i.e., decreasing scope/range of search), and thus, can be thought of as being multi-level or hierarchical. An exemplary hierarchical algorithm doubles the resolution in each of four hierarchical levels. At each level, eight vectors, spaced 45° apart, are each evaluated, as is illustrated in FIG. 1. If the vector sizes are equally probable, that would represent an optimal solution. However, this strategy does not match the highly peaked distribution of vectors in a videoconferencing data set. In this regard, most of the benefit of motion estimation is derived from small vectors, but the above-mentioned exemplary hierarchical algorithm spends most of its time searching for large vectors. It is also entirely non-adaptive in its behavior, as it ignores the error surface data that suggest modifying the search sequence. The “northwest” candidate is evaluated first, and the “southwest” candidate is evaluated eighth at every level, irrespective of the shape of the error surface determined along the way, i.e., the search sequence is not adapted based upon determinations made during the search sequence.
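A minimal sketch of such a fixed, four-level, eight-candidate search follows. The step sizes of 8, 4, 2, and 1 pixels, the evaluation of the zero vector as a starting point, the order of the six candidates between northwest and southwest, and the synthetic error surface are all assumptions made for illustration.

```python
# Unit offsets (dx, dy), with y increasing downward so that "north" is dy = -1.
# Northwest is first and southwest is eighth; the order in between is assumed.
RING = [(-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 0), (-1, 1)]


def hierarchical_search(error_at, steps=(8, 4, 2, 1)):
    """Evaluate the same eight candidates in the same order at every level."""
    best, best_err = (0, 0), error_at(0, 0)
    evaluations = 1
    for step in steps:
        cx, cy = best  # the ring for this level is centred on the previous level's winner
        for ux, uy in RING:
            cand = (cx + ux * step, cy + uy * step)
            err = error_at(*cand)
            evaluations += 1
            if err < best_err:
                best, best_err = cand, err
    return best, best_err, evaluations


def error_surface(dx, dy):
    """Hypothetical smooth error surface with its minimum at (5, -3)."""
    return (dx - 5) ** 2 + (dy + 3) ** 2


result = hierarchical_search(error_surface)
```

The sketch always spends 33 evaluations, 16 of them at offsets of 4 pixels or more, regardless of where the minimum lies, which illustrates both the bias toward large vectors and the non-adaptive ordering noted above.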
If the desire is to reduce the processing burden, which invariably means reducing the number of evaluations, the most obvious strategy would be to sample more coarsely. In this connection, the 56 Kbs codec manufactured and sold by Compression Labs, Inc. (CLI) under the brand name “Rembrandt” in the late 1980s employed a motion estimation algorithm that evaluated four vectors, spaced 90° apart, at each level, rotating the pattern 45° in alternate levels, as is illustrated in FIG. 2. Offsets were N, E, W, and S in even levels, and NE, SE, SW, and NW in odd levels. While this motion estimation algorithm worked well for the applications for which it was designed, in videoconferencing applications, this search scheme not uncommonly misses the optimal vector due to the sparseness of the sampling.
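The alternating four-candidate pattern may be sketched as follows. Only the offsets per level (N, E, W, S in even levels and NE, SE, SW, NW in odd levels) are taken from the description above; the number of levels, the step sizes of 8, 4, 2, and 1 pixels, the starting evaluation of the zero vector, and the synthetic error surfaces are assumptions, and the sketch is not a reconstruction of the actual codec.

```python
# Unit offsets (dx, dy), with y increasing downward so that "north" is dy = -1.
EVEN_LEVEL = [(0, -1), (1, 0), (-1, 0), (0, 1)]    # N, E, W, S
ODD_LEVEL = [(1, -1), (1, 1), (-1, 1), (-1, -1)]   # NE, SE, SW, NW


def four_point_search(error_at, steps=(8, 4, 2, 1)):
    """Evaluate four candidates per level, rotating the pattern 45 degrees on alternate levels."""
    best, best_err = (0, 0), error_at(0, 0)
    evaluations = 1
    for level, step in enumerate(steps):
        pattern = EVEN_LEVEL if level % 2 == 0 else ODD_LEVEL
        cx, cy = best
        for ux, uy in pattern:
            cand = (cx + ux * step, cy + uy * step)
            err = error_at(*cand)
            evaluations += 1
            if err < best_err:
                best, best_err = cand, err
    return best, best_err, evaluations


def surface_a(dx, dy):
    """Hypothetical surface with its minimum at (5, -3)."""
    return (dx - 5) ** 2 + (dy + 3) ** 2


def surface_b(dx, dy):
    """Hypothetical surface with its minimum at (5, -4)."""
    return (dx - 5) ** 2 + (dy + 4) ** 2


found = four_point_search(surface_a)
missed = four_point_search(surface_b)
```

Both runs cost only 17 evaluations. On the first surface the minimum is found, but on the second the final level offers only diagonal neighbors of (4, -4), so the adjacent optimum at (5, -4) is never sampled and the search stops one pixel short.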
Based upon the above and foregoing, there presently exists a need in the art for a motion estimation algorithm that overcomes the drawbacks and shortcomings of the presently available technology. The present invention fulfills this need in the art with a simple and highly versatile implementation that reduces the required compute time and compute effort.
The present invention encompasses a method for identifying an optimum motion vector for a current block of pixels in a current picture in a process for performing motion estimation. The method is implemented by evaluating a plurality of motion vector candidates for the current block of pixels by, for each motion vector candidate, calculating an error value that is representative of the differences between the values of the pixels of the current block of pixels and the values of a corresponding number of pixels in a reference block of pixels. While evaluating each motion vector candidate, the error value is checked, preferably at several points, while calculating the error value, and the evaluation is aborted for that motion vector candidate upon determining that the error value for that motion vector candidate exceeds a prescribed threshold value. The motion vector candidate that has the lowest calculated error value is selected as the optimum motion vector for the current block of pixels.
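The aborted evaluation described above may be sketched as follows. The sketch assumes that the error value is a SAD, that the prescribed threshold is the lowest error value found so far, and that the partial sum is checked after every fourth pixel; none of these choices is stated in the description above, and the names `sad_with_abort` and `select_best` are hypothetical.

```python
def sad_with_abort(current, candidate, threshold, check_every=4):
    """Accumulate the SAD, checking the partial sum periodically.

    Returns (error, pixels_examined); error is None if the evaluation was aborted.
    """
    total = 0
    for count, (c, r) in enumerate(zip(current, candidate), start=1):
        total += abs(c - r)
        if count % check_every == 0 and total > threshold:
            return None, count  # this candidate cannot win, so stop early
    return total, len(current)


def select_best(current, candidates):
    """Pick the candidate with the lowest error, aborting hopeless evaluations."""
    best_vec, best_err, pixels_examined = None, float("inf"), 0
    for vec, block in candidates:
        err, used = sad_with_abort(current, block, best_err)
        pixels_examined += used
        if err is not None and err < best_err:
            best_vec, best_err = vec, err
    return best_vec, best_err, pixels_examined


# Hypothetical flattened 4x4 blocks keyed by candidate vector.
current = [10] * 16
candidates = [
    ((0, 0), [12] * 16),          # SAD 32, evaluated in full
    ((1, 0), [50] * 16),          # partial sum 160 after 4 pixels, aborted
    ((0, 1), [10] * 15 + [11]),   # SAD 1, the eventual winner
]

best_vec, best_err, pixels_examined = select_best(current, candidates)
```

In this small example 36 pixel differences are computed instead of 48, because the poorly matching second candidate is abandoned after its first checkpoint.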
The motion vector candidates are preferably evaluated in two distinct phases, including a first phase that includes evaluating a subset of the motion vector candidates that have an intrinsically high probability of being the optimum motion vector candidate, and a second phase that includes performing a spatial search within a prescribed search region of a reference picture in order to identify a different reference block of pixels within the prescribed search region for each respective motion vector candidate evaluation.
The subset of motion vector candidates preferably includes a first motion vector candidate that corresponds to a location of the reference block of pixels in a reference picture that is the same as the location of the current block of pixels in the current picture, a second motion vector candidate that corresponds to a location of the reference block of pixels in a previous picture that is the same as the location of the current block of pixels in the current picture, and a third motion vector candidate that constitutes an optimum motion vector previously determined for a preceding block of pixels in the current picture. Preferably, no further evaluations are performed if it is determined that the error value for the first motion vector candidate is below a prescribed motion estimation termination threshold value.
The spatial search is preferably performed in a plurality of different search levels. In a presently preferred embodiment, the spatial search at each search level is reentrant, and is preferably performed, at each search level, by re-centering the spatial search on the best motion vector candidate identified to that point in the spatial search, the best motion vector candidate being the one that has an error value lower than the lowest error value found to that point in the spatial search. Preferably, the search pattern of the spatial search is adaptively varied in a heuristic manner based upon the results of the evaluations made during the spatial search.
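The reentrant, re-centering behavior may be illustrated with the following sketch. It assumes an eight-candidate ring at each search level, two search levels with spacings of 2 and 1 pixels, re-centering after each complete ring rather than after each individual candidate, and a synthetic error surface; the adaptive, heuristic variation of the search pattern is not modeled.

```python
# Unit offsets (dx, dy), with y increasing downward so that "north" is dy = -1.
RING = [(-1, -1), (0, -1), (1, -1), (-1, 0), (1, 0), (-1, 1), (0, 1), (1, 1)]


def recentering_search(error_at, steps=(2, 1), max_range=15):
    """Repeat each search level around the best candidate until no improvement is found."""
    best, best_err = (0, 0), error_at(0, 0)
    evaluations = 1
    for step in steps:
        improved = True
        while improved:            # re-enter the same level after every improvement
            improved = False
            cx, cy = best          # re-centre on the best candidate found so far
            for ux, uy in RING:
                cand = (cx + ux * step, cy + uy * step)
                if max(abs(cand[0]), abs(cand[1])) > max_range:
                    continue       # stay inside the prescribed search region
                err = error_at(*cand)
                evaluations += 1
                if err < best_err:
                    best, best_err, improved = cand, err, True
    return best, best_err, evaluations


def error_surface(dx, dy):
    """Hypothetical smooth error surface with its minimum at (5, -3)."""
    return (dx - 5) ** 2 + (dy + 3) ** 2


result = recentering_search(error_surface)
```

Without re-centering, spacings of 2 and 1 could reach no farther than 3 pixels from the starting point; with it, the sketch follows the slope of the error surface out to (5, -3) in 41 evaluations while still sampling densely near the origin, where most vectors are expected to lie.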
The method is executed by software that operates a software-implemented state machine. The software preferably includes source code that defines a search sequence, and a function that builds the state machine in a prescribed memory space. The source code is preferably only executed once, at initialization. The search sequence is preferably an adaptive heuristic search sequence. The source code, at any point in the search sequence, identifies the appropriate x,y positions for the next motion vector candidate to be evaluated by making a single read from the prescribed memory space. Further, the source code, at any point in the search sequence, identifies the next motion vector candidate to be evaluated by reading one of two locations in the memory space, depending upon the polarity of a test bit that indicates either that a new best motion vector candidate has been identified or the results of a comparison between the two most recent evaluations.
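One possible table-driven arrangement consistent with the above description is sketched below. The memory layout (a pair of words per state, each word holding an x,y offset and the address of the next pair), the two-level search sequence, the restart-the-level rule on a new best candidate, and the names `build_state_machine` and `run_search` are all assumptions made for illustration, and the test bit here reflects only whether a new best candidate was found.

```python
def build_state_machine(levels):
    """Run once at initialization: lay out the search sequence as a flat table.

    Word format: (dx, dy, base address of the pair to read after this evaluation).
    In each pair, address base + 0 is read when the test bit is 0 (no new best)
    and address base + 1 when it is 1 (new best found).
    """
    cands = []  # (dx, dy, index of the first candidate of the same level)
    for offsets in levels:
        first = len(cands)
        for dx, dy in offsets:
            cands.append((dx, dy, first))

    def word(j):
        if j >= len(cands):
            return None  # end of the search sequence
        dx, dy, _ = cands[j]
        return (dx, dy, 2 * (j + 1))

    memory = [word(0), word(0)]  # entry pair: both words start the sequence
    for j, (_, _, first) in enumerate(cands):
        memory.append(word(j + 1))  # bit 0: proceed to the next candidate
        memory.append(word(first))  # bit 1: restart this level around the new best
    return memory


def run_search(memory, error_at, start=(0, 0)):
    best, best_err = start, error_at(*start)
    base, bit, evaluations = 0, 0, 1
    while True:
        entry = memory[base + bit]  # a single read yields the offsets and the next state
        if entry is None:
            return best, best_err, evaluations
        dx, dy, base = entry
        cand = (best[0] + dx, best[1] + dy)
        err = error_at(*cand)
        evaluations += 1
        bit = 1 if err < best_err else 0
        if bit:
            best, best_err = cand, err


# Hypothetical sequence: N, E, S, W at 2 pixels, then N, E, S, W at 1 pixel.
LEVELS = [[(0, -2), (2, 0), (0, 2), (-2, 0)], [(0, -1), (1, 0), (0, 1), (-1, 0)]]
memory = build_state_machine(LEVELS)
result = run_search(memory, lambda dx, dy: (dx - 3) ** 2 + (dy + 2) ** 2)
```

Because all sequencing decisions are resolved when the table is built, the inner loop contains no search logic of its own: each step costs one table read, one error evaluation, and one comparison.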