Motion Estimation (ME) and Compensation is an important technique to exploit the temporal correlations among successive frames in a video sequence. Almost all current video compression standards such as MPEG-1/2/4 and H.26x employ a hybrid of block-based motion compensated prediction and transform coding for representing variations in picture content due to moving objects. In block-based motion estimation, a current frame is divided into rectangular blocks and an attempt is made to match each current block with a block from a reference frame, which would serve as the predictor of the current block. The difference between this predictor block and the current block is then encoded and transmitted. The (x,y) offset of the current block from the predictor block is characterized as a motion vector. A significant improvement in compression efficiency is achieved since usually the ‘difference block’ has a much lower energy or information content than the original block.
The improvement in compression efficiency, however, comes at the cost of a significant increase in complexity, since matching a current block with a predictor block almost always involves a search algorithm. The best possible match for the current block is sought in the reference frame, within a search window located around the position of the block in the current frame. For each search location, some metric, typically the Sum of Absolute Differences (SAD) or the Sum of Squared Differences (SSD) between the pixels of the two blocks, is calculated. The block that produces the smallest value of the metric is then selected as the predictor block. A full search strategy tests all the available blocks in the search range, leading to a high computational complexity. The complexity of the search algorithm thus depends on, among other things, the size of the search area.
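By way of illustration only, the full search described above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any standard or encoder: frames are assumed to be lists of rows of integer luma samples, the names `sad` and `full_search` are hypothetical, and the motion vector is taken here as the displacement from the current block position to the predictor block in the reference frame.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of Absolute Differences between the n x n block of `cur` whose
    top-left corner is (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, radius):
    """Evaluate every displacement within +/-radius (assumed square window)
    and return (best_motion_vector, best_sad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidate blocks that fall outside the reference frame.
            if bx + dx < 0 or bx + dx + n > width:
                continue
            if by + dy < 0 or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            # Keep the first candidate with the strictly smallest metric.
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

Every candidate in the window is evaluated independently of the others, so the work grows with the number of positions in the window rather than with the content of the frames.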
The algorithms aimed at reducing the number of calculations for motion estimation can be classified as pel-recursive, block-based or object-based. The pel-recursive methods require a significant number of operations per frame, as calculations have to be done on every pixel. The object-based methods involve separate operations for object recognition, which adds computational complexity. It has been observed that the computational complexity could be reduced if efficient block-based search techniques could be designed.
Many attempts at reducing the complexity of ME have centered on Fast Motion Estimation (FME) algorithms, which seek to reduce the number of search candidates required to find a ‘good match’ while causing minimal degradation in the predicted video quality as compared to the exhaustive search. Several block-based motion estimation algorithms that are computationally faster than the full search have been investigated and developed. The three-step search (TSS), new three-step search (NTSS), four-step search (4SS), block-based gradient descent search (BBGDS), diamond search (DS), hexagon-based search (HEXBS), Unsymmetrical-cross Multi-Hexagon-grid Search (UMHexagonS), Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) and Enhanced Predictive Zonal Search (EPZS) are a few such FME algorithms. In addition, various FME methods are also disclosed in U.S. Pat. Nos. 6,668,020, 6,542,547, 6,414,997, 6,363,117, 6,269,174, 6,259,737, 6,128,047, 5,778,190, 5,706,059, and 5,557,341. In general, these methods are carried out in the spatial domain and rely on the shape and size of the search pattern and on an efficient choice of the search center to increase the speed of the motion vector search. However, the disadvantage is that these techniques may fall into a local distortion minimum and fail to identify the best predictor block. Also, the reduction in the number of search points depends on the shape of the search pattern.
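As one example of how such a method reduces the number of search candidates, the classic three-step search can be sketched as below. This is a simplified sketch under stated assumptions: the function name `three_step_search` is hypothetical, the matching metric is passed in as a callable `cost(dx, dy)` (for instance a SAD evaluation), and the initial step size of 4 is assumed so that the search reaches displacements of up to +/-7 pixels.

```python
def three_step_search(cost, step=4):
    """Coarse-to-fine search: at each stage, test the 8 neighbors of the
    current center at distance `step`, move the center to the best point,
    then halve the step. `cost(dx, dy)` returns the matching metric."""
    cx, cy = 0, 0
    best = cost(0, 0)
    while step >= 1:
        best_x, best_y = cx, cy
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue  # the center was already evaluated
                c = cost(cx + dx, cy + dy)
                if c < best:
                    best, best_x, best_y = c, cx + dx, cy + dy
        # The next stage depends on the outcome of this one.
        cx, cy = best_x, best_y
        step //= 2
    return (cx, cy), best
```

With the assumed initial step of 4, this sketch evaluates 1 + 3 × 8 = 25 candidates, against the 15 × 15 = 225 candidates of a full search over the same +/-7 range. Because each stage only explores the neighborhood of the previous best point, the result on a distortion surface with several minima may be a local minimum, which is the disadvantage noted above.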
While FME algorithms can significantly reduce the complexity of the ME process, they nonetheless suffer from the fact that—like the full search algorithm—their complexity is proportional to the size of the search area. This is a major concern for real-time encoders as high resolution video—which is becoming ever more prevalent—requires larger search areas (typically +/−64 pixels around the center of the search area for D1 and higher resolution video).
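To give a sense of scale, the arithmetic behind the growth in search effort can be sketched as follows. The helper names are hypothetical, a square window and a 16×16 block are assumed, and clipping at frame boundaries is ignored, so the figures are upper bounds for an exhaustive search rather than measurements of any particular encoder.

```python
def full_search_candidates(radius):
    """Number of candidate positions in a square +/-radius search window."""
    return (2 * radius + 1) ** 2


def full_search_pixel_ops(radius, block=16):
    """Absolute-difference operations needed to match one block x block
    block exhaustively over the window (assumed 16x16 by default)."""
    return full_search_candidates(radius) * block * block


# A +/-64 window holds 129 * 129 = 16641 candidate positions per block,
# compared with 33 * 33 = 1089 for a +/-16 window.
print(full_search_candidates(64), full_search_candidates(16))
print(full_search_pixel_ops(64))
```

Doubling the search radius roughly quadruples the number of candidates, which is why larger windows weigh so heavily on real-time encoders.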
A common characteristic of all of the algorithms mentioned above (with the exception of the full search algorithm) is that they are less amenable to parallel processing architectures. In most of these algorithms, the choice of motion vector candidates to be evaluated depends on the results of the previous iteration. In the case of more advanced techniques such as the UMHexagonS, PMVFAST and EPZS algorithms, the situation is exacerbated because the initial set of predictors and the criteria for early termination of the searches depend on the encoding results of the preceding, neighboring macroblocks. Consequently, macroblocks have to be processed sequentially. The recent emergence of chips with multiple Digital Signal Processor (DSP) and/or General Processor (GP) cores, as well as the availability of powerful Field Programmable Gate Arrays (FPGAs), promises to enable real-time, high-resolution H.264 encoding at a low cost, but only if the underlying algorithms are amenable to high degrees of parallel processing. There is therefore a need for an alternative mechanism that can perform motion estimation at much lower complexity and take full advantage of parallel processing-based hardware architectures, without sacrificing compression efficiency.