Motion Estimation (ME) is an important part of any video encoding system since it can significantly affect the output quality of an encoded sequence. Unfortunately, this feature requires a significant part of the encoding time especially when using the straightforward Full Search (FS) Algorithm. For this reason, various fast motion estimation algorithms have been proposed which manage to reduce computational complexity considerably with little, if any, loss in coding efficiency compared to the FS Algorithm. A rather popular set of such fast motion estimation algorithms are the predictive algorithms, which initially consider a set of adaptive predictors and thresholds, select the best one (or more) from this set, and refine the selection using predefined search patterns. Such algorithms include the Enhanced Predictive Zonal Search (EPZS), the Predictive Motion Vector Field Adaptive Search Technique (PMVFAST), the Adaptive Predictive Diamond Zonal Search (APDZS), and so forth. Nevertheless, although complexity is reduced considerably using these algorithms, for certain architectures or implementations this may not be sufficient and further reduction in complexity may be desirable.
Block Matching Motion Estimation is an essential part of several video-coding standards such as, for example, the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group-2 standard (hereinafter the “MPEG-2 standard”), the International Telecommunication Union, Telecommunication Sector (ITU-T) H.263 recommendation (hereinafter the “H.263 recommendation”), and the ISO/IEC Moving Picture Experts Group-4 (MPEG-4) Part 10 Advanced Video Coding (AVC) standard/ITU-T H.264 recommendation (hereinafter the “MPEG-4 AVC standard”). By using motion estimation (ME) and motion compensation (MC), we are able to exploit the temporal correlation and reduce the redundancy that exists between frames of video sequences, which leads to high compression efficiency.
In block matching motion estimation, an image is partitioned into indexed regions, in particular square or orthogonal blocks of pixels, and the best match for these blocks is found inside a reference frame. To locate this best match we essentially perform a search inside a previously coded frame and select a best matching block therein using a criterion (whether it is a predetermined criterion or otherwise). The best match is then used to predict the current block, whereas the displacement between the two blocks defines a motion vector (MV), which is associated with the current block. It is only necessary in the encoder to send the motion vector and a residue block, defined as the difference between the current block and the predictor, in order to recover the original block. This can require significantly fewer bits than the direct coding of the original.
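The relationship between the predictor, the motion vector, and the residue block can be illustrated with a minimal sketch. The function names and the toy 4×4 frame are assumptions made for illustration only; frames are indexed as `frame[row][column]`, that is, `frame[y][x]`.

```python
def predicted_block(ref, x, y, mv, size):
    """Return the size x size block of `ref` displaced by mv = (vx, vy)
    from position (x, y)."""
    vx, vy = mv
    return [[ref[y + vy + n][x + vx + m] for m in range(size)]
            for n in range(size)]


def residue(cur_block, pred):
    """Residue block: current block minus predictor, sample by sample."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(cur_block, pred)]


def reconstruct(pred, res):
    """Recover the current block from the predictor and the residue."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(pred, res)]


# Hypothetical reference frame and current 2x2 block located at (0, 0).
ref = [[10, 11, 12, 13],
       [20, 21, 22, 23],
       [30, 31, 32, 33],
       [40, 41, 42, 43]]
cur_block = [[22, 24],
             [32, 33]]

mv = (2, 1)                                # displacement (vx, vy)
pred = predicted_block(ref, 0, 0, mv, 2)   # best match in the reference
res = residue(cur_block, pred)             # mostly zeros, cheap to code
assert reconstruct(pred, res) == cur_block
```

Only `mv` and `res` need to be transmitted; when the match is good, the residue is mostly zero and costs far fewer bits than the original samples.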
The most common distortion measure is the mean absolute error (MAE) or mean absolute difference (MAD), or the equivalent sum of absolute differences (SAD), which requires no multiplication and gives performance similar to that of the mean square error (MSE). The MAD or SAD of a block A of size M×M located at (x,y) inside the current frame compared to a block B located at a displacement of (vx,vy) relative to A in a previous frame is defined as follows:
$$\mathrm{MAD}(v_x, v_y) = \frac{1}{M^2} \sum_{m,n=0}^{M-1} \left| I_t(x+m,\, y+n) - I_{t-i}(x+v_x+m,\, y+v_y+n) \right|, \qquad (1)$$

$$\mathrm{SAD}(v_x, v_y) = M^2 \cdot \mathrm{MAD}(v_x, v_y), \qquad (2)$$

where $I_t$ is the current frame and $I_{t-i}$ is a previously coded frame.
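Equations (1) and (2) translate directly into code. The following is a minimal sketch in plain Python, assuming frames are stored as lists of rows (`frame[y][x]`) and that the displaced block lies entirely inside the reference frame; the function names are illustrative.

```python
def sad(cur, ref, x, y, vx, vy, m_size):
    """Sum of absolute differences between the m_size x m_size block of
    `cur` at (x, y) and the block of `ref` displaced by (vx, vy)."""
    total = 0
    for n in range(m_size):
        for m in range(m_size):
            total += abs(cur[y + n][x + m] - ref[y + vy + n][x + vx + m])
    return total


def mad(cur, ref, x, y, vx, vy, m_size):
    """Mean absolute difference: SAD normalized by the block area M^2."""
    return sad(cur, ref, x, y, vx, vy, m_size) / (m_size * m_size)


# Toy frames: the 2x2 block at (0, 0) in `cur` reappears at (1, 1) in `ref`.
cur = [[1, 2, 0],
       [3, 4, 0],
       [0, 0, 0]]
ref = [[0, 0, 0],
       [0, 1, 2],
       [0, 3, 4]]

print(sad(cur, ref, 0, 0, 1, 1, 2))  # perfect match at displacement (1, 1)
print(sad(cur, ref, 0, 0, 0, 0, 2))  # mismatch at zero displacement
```

Because SAD and MAD differ only by the constant factor M², they rank candidate displacements identically, which is why implementations typically skip the division.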
If a maximum displacement of W pixels in a frame is allowed, there are $(2W+1)^2$ locations to search for the best match of the current block. Unfortunately, the Full Search (FS) Algorithm, which essentially examines all possible locations to find the block with the minimum distortion, is too computationally intensive for many architectures. For example, for a frame of size P×Q and a frame rate of T fps, the amount of computation in terms of operations is as follows:
$$T \cdot \left( \frac{P}{M} \cdot \frac{Q}{M} \right) \cdot (2W+1)^2 \, (2M^2 - 1) \cong 8\,TPQW^2 \cong 1.09 \times 10^{10}, \qquad (3)$$

for a possible combination of T=30, P=288, Q=360, and W=21, and only if a single reference and block sizes of size 16×16 are considered.
Unfortunately, these numbers become considerably larger when multiple references or block sizes are considered for motion estimation and motion compensation.
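The cost of the Full Search can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not a reference implementation: frames are lists of rows, candidates that fall outside the reference frame are simply skipped, and the function names are hypothetical. It also evaluates both sides of the operation count in equation (3) for the quoted parameters.

```python
def sad(cur, ref, x, y, vx, vy, size):
    """Sum of absolute differences for one candidate displacement."""
    return sum(abs(cur[y + n][x + m] - ref[y + vy + n][x + vx + m])
               for n in range(size) for m in range(size))


def full_search(cur, ref, x, y, size, w):
    """Examine every displacement in [-w, w] x [-w, w] and return
    (best_sad, (vx, vy)). Out-of-frame candidates are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for vy in range(-w, w + 1):
        for vx in range(-w, w + 1):
            if not (0 <= x + vx and x + vx + size <= width and
                    0 <= y + vy and y + vy + size <= height):
                continue
            cost = sad(cur, ref, x, y, vx, vy, size)
            if best is None or cost < best[0]:
                best = (cost, (vx, vy))
    return best


def full_search_ops(t, p, q, w, m):
    """Left-hand side of the operation count in equation (3)."""
    return t * (p / m) * (q / m) * (2 * w + 1) ** 2 * (2 * m * m - 1)


def full_search_ops_approx(t, p, q, w):
    """Approximation 8 * T * P * Q * W^2 from equation (3)."""
    return 8 * t * p * q * w * w


# Toy data: `cur` is `ref` shifted so that cur(x, y) = ref(x + 1, y + 2).
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [[ref[r + 2][c + 1] if r + 2 < 8 and c + 1 < 8 else 0
        for c in range(8)] for r in range(8)]

print(full_search(cur, ref, 2, 2, 2, 2))          # zero SAD at (vx, vy) = (1, 2)
print(full_search_ops(30, 288, 360, 21, 16))      # about 1.15e10
print(full_search_ops_approx(30, 288, 360, 21))   # about 1.10e10
```

The two operation counts agree to within a few percent, and both are on the order of 10^10 operations per second for a single reference and a single block size.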
Predictive motion estimation algorithms have become quite popular in several video coding implementations and standards, such as the H.263 recommendation and the MPEG-2 and MPEG-4 AVC standards, due to their very low encoding complexity and high efficiency compared to the brute force Full Search (FS) Algorithm.
The efficiency of such predictive motion estimation algorithms comes mainly from initially considering several highly likely predictors and from the introduction of early termination criteria. These schemes utilize simple yet efficient checking patterns to further optimize and improve the accuracy of the estimation. For example, the Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) initially examines a six-predictor set that includes three spatially adjacent motion vectors (MVs), the median predictor, (0,0), and the motion vector of the co-located block in the previous frame. PMVFAST also employs early stopping criteria, which are adaptively calculated based on correlations between adjacent blocks. If these criteria are satisfied, the motion estimation (ME) terminates immediately after these predictors are examined. Otherwise, an adaptive two-stage diamond pattern centered on the best predictor is used to refine the search further. This process allows a considerable reduction in the complexity of the ME. Turning to FIGS. 1A-1B, small diamond patterns used in the Enhanced Predictive Zonal Search (EPZS) are indicated generally by the reference numerals 100 and 120, respectively. The numbers 1 and 2 disposed within some of the patterns indicate a first search and a second search, respectively, where the first search finds the minimum cost which is used as a center for the second search.
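The general shape of a predictive search (examine a few predictors, stop early if the best one is good enough, otherwise refine with a small diamond) can be sketched as follows. This is a simplified illustration and not PMVFAST or EPZS themselves: the predictor list and the fixed `threshold` are supplied by the caller, whereas the algorithms described above derive them adaptively, and all names are hypothetical.

```python
def sad(cur, ref, x, y, vx, vy, size):
    """Sum of absolute differences for one candidate displacement."""
    return sum(abs(cur[y + n][x + m] - ref[y + vy + n][x + vx + m])
               for n in range(size) for m in range(size))


def predictive_search(cur, ref, x, y, size, predictors, threshold,
                      max_steps=16):
    """Return (best_mv, best_cost, points_checked).

    Assumes `predictors` is a non-empty list of (vx, vy) candidates and
    `threshold` is a fixed early-termination bound (an assumption here).
    """
    height, width = len(ref), len(ref[0])
    cache = {}  # candidate -> cost, so no point is evaluated twice

    def cost(mv):
        if mv not in cache:
            vx, vy = mv
            inside = (0 <= x + vx and x + vx + size <= width and
                      0 <= y + vy and y + vy + size <= height)
            cache[mv] = (sad(cur, ref, x, y, vx, vy, size)
                         if inside else float("inf"))
        return cache[mv]

    # Step 1: examine the predictor set and keep the best candidate.
    best_mv = min(predictors, key=cost)
    best_cost = cost(best_mv)

    # Step 2: early termination if the best predictor is good enough.
    if best_cost <= threshold:
        return best_mv, best_cost, len(cache)

    # Step 3: small-diamond refinement, re-centered on each improvement.
    for _ in range(max_steps):
        cx, cy = best_mv
        improved = False
        for dx, dy in ((0, -1), (-1, 0), (1, 0), (0, 1)):
            cand = (cx + dx, cy + dy)
            if cost(cand) < best_cost:
                best_mv, best_cost, improved = cand, cost(cand), True
        if not improved:
            break
    return best_mv, best_cost, len(cache)


# Toy data: the true displacement of the block at (2, 2) is (1, 2).
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [[ref[r + 2][c + 1] if r + 2 < 8 and c + 1 < 8 else 0
        for c in range(8)] for r in range(8)]

mv, best, checked = predictive_search(cur, ref, 2, 2, 2,
                                      [(0, 0), (1, 1)], threshold=0)
print(mv, best, checked)
```

With a good predictor in the set, the search checks only a handful of points instead of the $(2W+1)^2$ positions visited by the Full Search; with a perfect predictor it stops after step 2.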
Diamond patterns were also employed by other similar algorithms, with either fixed or increasing step sizes. In one prior art approach, it is illustrated that the square pattern can be more reliable than the diamond pattern since it can more effectively avoid local minima and, unlike the diamond pattern, can also consider diagonal motion. The introduction of more predictors, leading to the Enhanced Predictive Zonal Search (EPZS), could in the end yield better quality than PMVFAST, regardless of the pattern (square or diamond) used. Turning to FIGS. 2A-2C, square patterns used in the Enhanced Predictive Zonal Search (EPZS) are indicated generally by the reference numerals 200, 220, and 240, respectively. The numbers 1 and 2 disposed within some of the patterns indicate a first search and a second search, respectively, where the first search finds the minimum cost which is used as a center for the second search.
The EPZS algorithm was further enhanced to improve performance under a Rate Distortion Optimization (RDO) framework, and to better consider multiple block sizes and references, as used within the MPEG-4 AVC standard. In particular, using H.264 software, Joint Video Team (JVT) Reference Software version JM8.4 (hereinafter referred to as the “JVT Reference Software”), the best motion vector for a given block size is found by minimizing:

$$J(m, \lambda_{MOTION}) = \mathrm{SAD}(s, c(m)) + \lambda_{MOTION} \cdot R(m - p), \qquad (4)$$

where $m = (m_x, m_y)^T$ is the current motion vector being considered, $p = (p_x, p_y)^T$ is the motion vector used as the prediction during the motion vector coding process, and $\lambda_{MOTION}$ is a Lagrangian multiplier. The rate term $R(m - p)$ represents the motion rate information only and is computed by a table lookup. The SAD (Sum of Absolute Differences) is computed as follows:
$$\mathrm{SAD}(s, c(m)) = \sum_{x=1,\, y=1}^{B_1,\, B_2} \left| s[x, y] - c[x - m_x,\, y - m_y] \right|, \qquad (5)$$

with s and c being the original and the coded video signals, and $B_1$ and $B_2$ being the vertical and horizontal dimensions of the examined block type, each of which can be equal to 16, 8, or 4. Because of the Lagrangian term, a search scheme that is not good enough could easily be trapped at a local minimum, thereby reducing efficiency.
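The effect of the Lagrangian term in equation (4) can be illustrated with a small sketch. The rate model is an assumption made for illustration: each component of the motion vector difference m − p is charged the bit length of its signed Exp-Golomb codeword, standing in for the table lookup mentioned above. The candidate SAD values and the value of λ are hypothetical.

```python
def se_golomb_bits(v):
    """Bit length of the signed Exp-Golomb codeword for integer v.
    Used here as an assumed stand-in for the rate lookup table."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def motion_rate(m, p):
    """R(m - p): bits needed to code the motion vector difference."""
    return se_golomb_bits(m[0] - p[0]) + se_golomb_bits(m[1] - p[1])


def motion_cost(sad_value, m, p, lam):
    """J(m, lambda) = SAD + lambda * R(m - p), as in equation (4)."""
    return sad_value + lam * motion_rate(m, p)


p = (3, -1)          # motion vector predictor (hypothetical)
lam = 4              # Lagrangian multiplier (hypothetical)

# Candidate A sits on the predictor; candidate B has lower distortion
# but lies two samples away horizontally.
cost_a = motion_cost(100, (3, -1), p, lam)   # rate = 1 + 1 bits
cost_b = motion_cost(90, (5, -1), p, lam)    # rate = 5 + 1 bits
print(cost_a, cost_b)
```

Here candidate A wins on total cost even though candidate B has the lower SAD. The rate penalty pulls the search toward the predictor, which is one way a weak search scheme can settle on a local minimum of J.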
Therefore, for EPZS, improved prediction is achieved through the consideration of more elaborate and reliable search patterns, adaptive dual refinement of the prediction, and, most importantly, a larger adaptive set of initial predictors, which together allow for better avoidance of local minima. Such predictors are dependent on distortion, window, block size, reference, and so forth, and could still lead to significant speed improvement without any sacrifice in quality of performance. Unfortunately, this also implies a rather significant overhead increase in terms of checked points compared to previous implementations. Turning to FIGS. 3A-3D, extended patterns used in the extended Enhanced Predictive Zonal Search (extEPZS) are indicated generally by the reference numerals 300, 320, 340, and 360, respectively. The numbers 1 and 2 disposed within some of the patterns indicate a first search and a second search, respectively, where the first search finds the minimum cost which is used as a center for the second search.
Turning to FIG. 4, a video encoder without pre-processing elements is indicated generally by the reference numeral 400. The video encoder 400 includes a combiner 410 having an output connected in signal communication with an input of a transformer 415. An output of the transformer 415 is connected in signal communication with an input of a quantizer 420. An output of the quantizer 420 is connected in signal communication with a first input of a variable length coder (VLC) 460 and an input of an inverse quantizer 425. An output of the inverse quantizer 425 is connected in signal communication with an input of an inverse transformer 430. An output of the inverse transformer 430 is connected in signal communication with a first non-inverting input of a combiner 435. An output of the combiner 435 is connected in signal communication with an input of a loop filter 440. An output of the loop filter 440 is connected in signal communication with an input of a frame buffer 445. A first output of the frame buffer 445 is connected in signal communication with a first input of a motion compensator 455. A second output of the frame buffer 445 is connected in signal communication with a first input of a motion estimator 450. A first output of the motion estimator 450 is connected in signal communication with a second input of the variable length coder (VLC) 460. A second output of the motion estimator 450 is connected in signal communication with a second input of the motion compensator 455. A second output of the motion compensator 455 is connected in signal communication with a second non-inverting input of the combiner 435 and with an inverting input of the combiner 410. A non-inverting input of the combiner 410 and a second input of the motion estimator 450 are available as inputs to the encoder 400. An output of the variable length coder (VLC) 460 is available as an output of the encoder 400.