In general, video transmission involves sending a very rapid succession of image frames over wire, by radio signal, or otherwise. In the modern world, video transmission increasingly involves transmission of digital video. Each frame of a video stream is a separate image that, taken alone, comprises a substantial amount of data. Taken collectively, a stream of digital images making up a video represents an enormous amount of data that would tax the capacities of even the most modern transmission system. Accordingly, much effort has been devoted to compressing digital video streams by, inter alia, removing redundancies from images.
Although there are other compression techniques that can be and are used to reduce the sizes of the digital images making up a video stream, the technique of motion estimation has evolved into perhaps the most useful technique for reducing digital video streams to manageable proportions.
The basic idea of motion estimation is to look for portions of a “current” frame (during the process of coding a stream of digital video frames for transmission and the like) that are the same or nearly the same as portions of previous frames, albeit in different positions on the frame because the subject of the frame has moved. If such a block of basically redundant pixels is found in a preceding frame, the system need only transmit a code that tells the reconstruction end of the system where to find the needed pixels in a previously received frame.
Thus, motion estimation is the task of finding predictive blocks of image samples (pixels) within reference images (reference frames, or just references) that best match a similar-sized block of samples (pixels) in the current image (frame). It is a key component of video coding technologies, and is one of the most computationally complex processes within a video encoding system. This is especially true for an ITU-T H.264/ISO MPEG-4 AVC based encoder, considering that motion estimation may need to be performed using multiple references or block sizes. It is therefore highly desirable to consider fast motion estimation strategies so as to reduce encoding complexity while simultaneously having minimal impact on compression efficiency and quality.
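For illustration only, the block-matching task described above can be sketched as a brute-force search that tests every integer displacement in a window. This is a minimal sketch under stated assumptions, not code from any cited encoder: the names `sad` and `full_search`, the frame layout (lists of rows of integer samples), and the synthetic test frames are all hypothetical, and the displacement sign follows the SAD definition given later in this section (the reference sample is read at x − mx, y − my).

```python
import random


def sad(cur, ref, x0, y0, mx, my, B):
    """Sum of absolute differences between the BxB block of `cur` at (x0, y0)
    and the block of `ref` displaced by the motion vector (mx, my)."""
    total = 0
    for y in range(B):
        for x in range(B):
            total += abs(cur[y0 + y][x0 + x] - ref[y0 + y - my][x0 + x - mx])
    return total


def full_search(cur, ref, x0, y0, B, search_range):
    """Test every integer displacement within +/-search_range and return
    (best_sad, (mx, my)). Candidates that fall outside `ref` are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = x0 - mx, y0 - my  # top-left corner of the reference block
            if rx < 0 or ry < 0 or rx + B > width or ry + B > height:
                continue
            cost = sad(cur, ref, x0, y0, mx, my, B)
            if best is None or cost < best[0]:
                best = (cost, (mx, my))
    return best


# Synthetic example (assumption): the current frame is the reference frame
# displaced by a known motion vector, so the search should recover it exactly.
rng = random.Random(0)
H = W = 24
ref = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
true_mv = (2, -1)
cur = [[ref[(y - true_mv[1]) % H][(x - true_mv[0]) % W] for x in range(W)]
       for y in range(H)]

result = full_search(cur, ref, 8, 8, 8, 4)
print(result)  # (0, (2, -1))
```

With a search range of ±4 this sketch evaluates up to 81 candidate positions for a single block, which illustrates why an exhaustive search becomes costly once larger ranges, several references, or several block sizes are involved.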
Predictive motion estimation algorithms have become quite popular in several video coding implementations and standards, such as MPEG-2, MPEG-4 ASP, H.263, and others, due to their very low coding complexity and high efficiency compared to the brute-force Full Search (FS) algorithm. Such algorithms are disclosed in, for example, H. Y. Cheong, A. M. Tourapis, and P. Topiwala, “Fast Motion Estimation within the JVT codec,” ISO/IEC JTC1/SC29/WG11 and ITU-T Q6/SG16, document JVT-E023, October '02; H. Y. Cheong, A. M. Tourapis, “Fast motion estimation within the H.264 codec,” Proc. of the Intern. Conf. on Mult. and Expo (ICME '03), Vol. 3, pp. 517-520, July '03; and A. M. Tourapis, O. C. Au, and M. L. Liou, “Highly efficient predictive zonal algorithms for fast block-matching motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, Iss. 10, pp. 934-47, October '02. The efficiency of these algorithms comes mainly from initially considering several highly likely predictors and from introducing very reliable early-stopping criteria.
In addition, simple yet quite efficient checking patterns have been employed to further optimize and improve the accuracy of the estimation. For example, the Predictive Motion Vector Field Adaptive Search Technique (PMVFAST), Tourapis, Au, and Liou, cited above, initially examined a six-predictor set including the three spatially adjacent motion vectors also used within the motion vector prediction, the median predictor, (0, 0), and the motion vector of the co-located block in the previous frame. It also employed adaptively calculated early-stopping thresholds that were based on correlations between adjacent blocks. If the minimum distortion after examining this set of predictors was lower than the threshold, the search was immediately terminated. Otherwise, an adaptive two-stage diamond pattern centered on the best predictor was used to refine the search further. Due to its high efficiency (on average more than 200 times faster than FS in terms of checking points examined using a search area of ±16), the algorithm was also accepted within the MPEG-4 Optimization Model, “Optimization Model Version 1.0”, ISO/IEC JTC1/SC29/WG11 MPEG2000/N3324, Noordwijkerhout, Netherlands, March 2000, as a recommendation for motion estimation. The Advanced Predictive Diamond Zonal Search (APDZS) (Tourapis, Au, and Liou, cited above) used the same predictors and concepts on adaptive thresholding as PMVFAST, but employed a multiple-stage diamond pattern, mainly to avoid local distortion minima, thus achieving better visual quality at an insignificant cost in speed-up compared to PMVFAST.
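The general shape of such a predictive search (examine a few likely predictors, stop early if one is good enough, otherwise refine with a small pattern) can be sketched as follows. This is a simplified illustration and not PMVFAST itself: the threshold is a fixed parameter rather than an adaptively calculated one, the refinement uses only a single small diamond rather than an adaptive two-stage pattern, and all names (`predictive_search`, `median_mv`, the argument layout) are hypothetical. For scale, a full search over ±16 examines (2·16 + 1)² = 1089 positions, so a 200-fold reduction corresponds to roughly five checking points per block on average.

```python
def sad(cur, ref, x0, y0, mv, B):
    """SAD between the BxB block of `cur` at (x0, y0) and `ref` displaced by mv."""
    mx, my = mv
    return sum(abs(cur[y0 + y][x0 + x] - ref[y0 + y - my][x0 + x - mx])
               for y in range(B) for x in range(B))


def median_mv(a, b, c):
    """Component-wise median of three motion vectors."""
    return (sorted((a[0], b[0], c[0]))[1], sorted((a[1], b[1], c[1]))[1])


def predictive_search(cur, ref, x0, y0, B, neighbours, colocated,
                      threshold, search_range):
    """Return (best_sad, best_mv, points_checked)."""
    height, width = len(ref), len(ref[0])
    visited = {}  # motion vector -> SAD, or None if the candidate is invalid

    def cost(mv):
        # Evaluate each candidate at most once; reject out-of-range vectors.
        if mv not in visited:
            mx, my = mv
            inside = (abs(mx) <= search_range and abs(my) <= search_range
                      and 0 <= x0 - mx and x0 - mx + B <= width
                      and 0 <= y0 - my and y0 - my + B <= height)
            visited[mv] = sad(cur, ref, x0, y0, mv, B) if inside else None
        return visited[mv]

    def points_checked():
        return sum(1 for value in visited.values() if value is not None)

    # Step 1: six predictors (median, zero, three spatial neighbours, co-located).
    left, top, top_right = neighbours
    predictors = [median_mv(left, top, top_right), (0, 0),
                  left, top, top_right, colocated]
    best, best_mv = None, None
    for mv in predictors:
        c = cost(mv)
        if c is not None and (best is None or c < best):
            best, best_mv = c, mv

    # Step 2: early termination if a predictor is already good enough.
    if best < threshold:
        return best, best_mv, points_checked()

    # Step 3: small-diamond refinement around the best point found so far.
    while True:
        cx, cy = best_mv
        improved = False
        for mv in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            c = cost(mv)
            if c is not None and c < best:
                best, best_mv, improved = c, mv, True
        if not improved:
            return best, best_mv, points_checked()


# Synthetic example (assumption): a horizontal ramp displaced by (2, 0).
ref = [[10 * x for x in range(24)] for _ in range(24)]
cur = [[ref[y][(x - 2) % 24] for x in range(24)] for y in range(24)]

zero = (0, 0)
refined = predictive_search(cur, ref, 8, 8, 8, (zero, zero, zero), zero, 64, 4)
early = predictive_search(cur, ref, 8, 8, 8, ((2, 0), (2, 0), (1, 0)), zero, 64, 4)
print(refined)  # (0, (2, 0), 11)
print(early)    # (0, (2, 0), 3)
```

In the second call one of the predictors already matches the true displacement, so the search stops after three distinct candidates; in the first call the diamond refinement reaches the same vector after eleven, against 81 positions for an exhaustive ±4 search.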
In Cheong, Tourapis, and Topiwala, cited above, the authors introduced the Enhanced Predictive Zonal Search (EPZS) algorithm, which employed a simpler, single-stage pattern (diamond or square). EPZS achieved better performance than the above-mentioned algorithms in terms of both encoding complexity and quality, mainly due to the consideration of additional predictors and better thresholding criteria. A 3-dimensional version of EPZS was also introduced, with the main focus on multi-reference fast motion estimation, as is the case in the H.264/MPEG4 AVC standard. Considering the low complexity and high efficiency of these algorithms, it would be highly desirable to implement such an algorithm within the H.264/MPEG4 AVC standard and adapt it to that standard.
The H.264/MPEG4 AVC standard, apart from the multiple reference consideration discussed above, has some additional distinctions compared to previous standards that considerably affect the performance and complexity of motion estimation. In particular, unlike the MPEG-4 and H.263/H.263++ standards, which consider only block types of 16×16 and 8×8, H.264 considers five additional block types: 16×8, 8×16, 8×4, 4×8, and 4×4. These must be considered within a fast motion estimation implementation in an effort to achieve best performance within an H.264 type encoder. Furthermore, considering that the current H.264 reference software (JM) implementation, JVT reference software version JM9.6, http://iphome.hhi.de/suehring/tml/download/, employs a Rate Distortion Optimization (RDO) method for both motion estimation and mode decision, it is imperative that this also be taken into account.
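To give a rough sense of how the additional block types add to the motion estimation workload, the short sketch below counts how many blocks of each type tile one 16×16 macroblock. It is a simple tiling count under the assumption that every block type is searched independently for one reference; the names `BLOCK_TYPES` and `blocks_per_macroblock` are hypothetical.

```python
# The seven block types named above, as (width, height) in samples.
BLOCK_TYPES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def blocks_per_macroblock(width, height, mb_size=16):
    """Number of width x height blocks needed to tile one macroblock."""
    return (mb_size // width) * (mb_size // height)


mv_counts = {size: blocks_per_macroblock(*size) for size in BLOCK_TYPES}
for (w, h), n in mv_counts.items():
    print(f"{w}x{h}: {n} block(s) per macroblock")

total_searches = sum(mv_counts.values())
print(total_searches)  # 41
```

Under this assumption a single macroblock can require 41 block searches per reference, compared with 5 when only the 16×16 and 8×8 types are considered.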
In particular, within the current JM software the best predictor is found by minimizing:

$$J(m, \lambda_{\mathrm{MOTION}}) = \mathrm{SAD}(s, c(m)) + \lambda_{\mathrm{MOTION}} \cdot R(m - p) \qquad (1)$$
with $m = (m_x, m_y)^T$ being the motion vector, $p = (p_x, p_y)^T$ being the prediction for the motion vector, and $\lambda_{\mathrm{MOTION}}$ being the Lagrange multiplier. The rate term $R(m - p)$ represents the motion information only and is computed by a table lookup. The SAD (Sum of Absolute Differences) is computed as:
$$\mathrm{SAD}(s, c(m)) = \sum_{x=1,\,y=1}^{B,\,B} \left| s[x, y] - c[x - m_x,\, y - m_y] \right|, \qquad B = 16,\ 8 \text{ or } 4 \qquad (2)$$
with s being the original video signal and c being the coded video signal. A good motion estimation scheme needs to consider, if feasible, both Equation 1 and the value of λMOTION in an effort to achieve best performance according to RD optimized encoding designs.
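As a hedged illustration of how the Lagrangian cost of Equation 1 can change which candidate is selected, the sketch below combines a SAD value with a motion vector rate. The source states only that the rate is obtained by table lookup; here, as an assumption, the rate is approximated by the bit length of a signed Exp-Golomb code for each component of the difference m − p. The function names, the SAD values, and the λ values are hypothetical.

```python
def se_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code for the integer v."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def motion_rate(mv, pred):
    """Assumed stand-in for the table lookup R(m - p): bits for both components."""
    return se_golomb_bits(mv[0] - pred[0]) + se_golomb_bits(mv[1] - pred[1])


def motion_cost(sad_value, mv, pred, lam):
    """Equation 1: J = SAD + lambda * R(m - p)."""
    return sad_value + lam * motion_rate(mv, pred)


pred = (0, 0)
candidate_a = (100, (6, -4))  # lower SAD, but far from the predictor
candidate_b = (120, (0, 0))   # higher SAD, equal to the predictor

for lam in (0, 4):
    j_a = motion_cost(candidate_a[0], candidate_a[1], pred, lam)
    j_b = motion_cost(candidate_b[0], candidate_b[1], pred, lam)
    print(lam, j_a, j_b)
# prints "0 100 120" then "4 156 128"
```

With λ = 0 the candidate with the lower SAD wins, whereas with λ = 4 the cheaper-to-code vector wins despite its higher distortion, which is why a search that ignores the rate term can settle on a vector the rate-distortion optimized encoder would not choose.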