Moving pictures such as video are composed of a number of consecutive frames of still pictures. In the conventional NTSC (National Television System Committee) television system each second includes 30 frames or 60 fields. Consecutive frames are generally similar except for changes caused by moving objects. Video coding experts call this similarity temporal redundancy. In digital video compression, exploiting temporal redundancy enables a major improvement in coding efficiency. Thus digital television can transmit 4 to 6 channels over an equivalent analog channel of the same capacity. The temporal redundancy reduction in digital video compression is achieved by motion compensation (MC). Using motion compensation, the current picture can be modeled as a translation of prior pictures.
In the MPEG video coding standard employed in most of today's digital video applications, each picture is divided into two-dimensional macroblocks of M horizontal by N vertical pixels. In the MPEG video coding standard M and N are both set to 16. Each macroblock in the current frame is predicted from a previous or future frame called the reference frame by estimating the amount of the motion in the macroblock during the frame time interval. The MPEG video coding syntax specifies how to represent the motion information for each macroblock in vectors. This standard does not specify how these motion vectors are to be computed.
Due to the block-based motion representation, many implementations of MPEG video coding use block matching techniques. The motion vector is obtained by minimizing a cost function measuring the mismatch between the reference and the current macroblocks. The most widely-used cost function is the sum of absolute difference values (AE) defined as:
$$AE_{\kappa,d} \equiv \sum_{j=0}^{N-1} \sum_{i=0}^{M-1} \left| f_{t+\tau}(x_\kappa + d_h + i,\; y_\kappa + d_v + j) - f_t(x_\kappa + i,\; y_\kappa + j) \right| \qquad \text{Eq. 1}$$
In this equation: d is the displacement $(d_h, d_v)$ for the macroblock whose upper-left corner pixel is denoted by $f_t(x_\kappa, y_\kappa)$; $f_{t+\tau}(h, v)$ is the pixel at coordinates (h, v) in the reference frame; and τ is the frame distance between the current frame and the reference frame. An alternate cost function is the sum of squared error values (SE). This is defined as:
$$SE_{\kappa,d} \equiv \sum_{j=0}^{N-1} \sum_{i=0}^{M-1} \left( f_{t+\tau}(x_\kappa + d_h + i,\; y_\kappa + d_v + j) - f_t(x_\kappa + i,\; y_\kappa + j) \right)^2 \qquad \text{Eq. 2}$$
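The two cost functions of equations 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `block_costs`, the representation of a frame as a list of pixel rows, and the requirement that the caller keeps the displaced block inside the reference frame are all hypothetical choices, not part of the MPEG standard or of any particular encoder.

```python
def block_costs(cur, ref, x, y, dh, dv, M=16, N=16):
    """Return (AE, SE) for one macroblock and one candidate displacement.

    cur, ref : frames stored as lists of rows, so pixel (h, v) is frame[v][h]
    (x, y)   : upper-left corner of the macroblock in the current frame
    (dh, dv) : candidate displacement d into the reference frame
    Assumption: the caller keeps the displaced block inside `ref`.
    """
    ae = 0
    se = 0
    for j in range(N):          # vertical index, 0 .. N-1
        for i in range(M):      # horizontal index, 0 .. M-1
            diff = ref[y + dv + j][x + dh + i] - cur[y + j][x + i]
            ae += abs(diff)     # term of Eq. 1
            se += diff * diff   # term of Eq. 2
    return ae, se
```

Both sums share the same pixel difference, so computing them together costs little more than computing either one alone.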
FIG. 1 illustrates the block matching process. Current frame 100 includes macroblock 101 having a size M by N. Reference frame 110 includes macroblock to be predicted 111 which is displaced by motion vector d from the corresponding position of macroblock 101.
Finding the motion vector d within the motion vector search window, denoted by Wh×Wv, that minimizes the absolute difference for each macroblock is called motion estimation (ME). Using the motion vector d, the motion-compensated residual signals $g(x_\kappa + i, y_\kappa + j)$, where 0≦i≦M−1 and 0≦j≦N−1, which are subsequently coded through a transform coding process such as the Discrete Cosine Transform (DCT), are expressed as:

$$g(x_\kappa + i,\; y_\kappa + j) \equiv f_t(x_\kappa + i,\; y_\kappa + j) - f_{t+\tau}(x_\kappa + d_h + i,\; y_\kappa + d_v + j) \qquad \text{Eq. 3}$$

From equation 2 the best match minimizes the number of significant, i.e. non-zero, signals to be coded. This leads to the best coding gain among all possible matches.
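As a small illustration of equation 3, the residual block can be formed pixel by pixel once a motion vector has been chosen. The function name `residual_block` and the list-of-rows frame layout are assumptions made for this sketch only.

```python
def residual_block(cur, ref, x, y, dh, dv, M=16, N=16):
    """Return the M x N motion-compensated residual g of Eq. 3 as a list of rows.

    cur, ref : frames stored as lists of rows, so pixel (h, v) is frame[v][h]
    (x, y)   : upper-left corner of the macroblock in the current frame
    (dh, dv) : motion vector d selected by motion estimation
    Assumption: the displaced block lies inside `ref`.
    """
    return [
        [cur[y + j][x + i] - ref[y + dv + j][x + dh + i] for i in range(M)]
        for j in range(N)
    ]
```

With a perfect match every entry is zero, so nothing significant remains to be transform coded; a poor match leaves large residual values.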
Video coding standards such as MPEG do not specify how the motion estimation should be performed. The system designer chooses among many possible implementations. A common prior art technique employs a full search (FS) over a wide 2-dimensional area, which yields the best matching results in most cases. This assurance comes at a high computational cost to the encoder. In fact motion estimation is usually the most computationally intensive portion of the video encoder.
FIG. 2 illustrates the flowchart 200 of the prior art full search plain block matching. This block matching determines which candidate motion vector d provides the best match between the current macroblock and the reference frame. The process begins with start block 201. Block 202 initializes a variable AE_MIN corresponding to the cost function minimum to a saturated value, the maximum possible value. Block 203 selects the next candidate motion vector d. Block 204 computes the cost function for the current macroblock at the current candidate motion vector d. This is typically the absolute difference (AE) of equation 1. Decision block 205 tests to determine if the new absolute difference AE is less than the prior cost function minimum AE_MIN. If this is the case (Yes at decision block 205), then the current candidate motion vector d yields a better cost function than the previous best. Thus block 206 stores the current candidate motion vector d as the best motion vector and replaces the prior cost function minimum AE_MIN with the current cost function AE. Decision block 207 tests to determine if there are no more candidate motion vectors. If there are additional candidate motion vectors (No at decision block 207), process flow returns to block 203. Block 203 begins a repeat for the next candidate motion vector d. If the new absolute difference AE is not less than the prior cost function minimum (No at decision block 205), then the current candidate motion vector d does not yield a better cost function than the previous best. Process 200 branches ahead to decision block 207. If there are no additional candidate motion vectors (Yes at decision block 207), then the best motion vector d for the current macroblock has been found. Block 208 confirms the stored best motion vector d as the motion vector for the current macroblock. Process 200 ends at end block 209.
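The loop of flowchart 200 can be sketched as follows. This is a minimal illustration, not an implementation taken from any encoder: the names `sad` and `full_search`, the symmetric `search_range` parameter, and the rule of skipping candidates that fall outside the reference frame are assumptions. Comments map each step to the flowchart blocks described above.

```python
def sad(cur, ref, x, y, dh, dv, M, N):
    """AE of Eq. 1 for one candidate; frames are lists of rows."""
    return sum(
        abs(ref[y + dv + j][x + dh + i] - cur[y + j][x + i])
        for j in range(N)
        for i in range(M)
    )


def full_search(cur, ref, x, y, M, N, search_range):
    """Test every displacement in [-search_range, +search_range] in both axes."""
    height, width = len(ref), len(ref[0])
    ae_min = float("inf")                    # block 202: saturated initial value
    best = None
    for dv in range(-search_range, search_range + 1):
        for dh in range(-search_range, search_range + 1):   # block 203
            # Assumption: candidates leaving the reference frame are skipped.
            if x + dh < 0 or y + dv < 0:
                continue
            if x + dh + M > width or y + dv + N > height:
                continue
            ae = sad(cur, ref, x, y, dh, dv, M, N)          # block 204
            if ae < ae_min:                                 # block 205
                ae_min, best = ae, (dh, dv)                 # block 206
    return best, ae_min                                     # block 208
```

Because the comparison is a strict "less than", ties are resolved in favor of the candidate visited first, which matches the behavior of decision block 205.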
The computational complexity of motion estimation is usually expressed in units of summation of absolute difference (SAD). One match computation between a current macroblock and one candidate reference macroblock, each having M by N pixels, requires M×N SAD. Let SADmb denote the SAD for a macroblock with a search window denoted by Wh×Wv, which is represented as:

$$SAD_{mb} = M \times N \times W_h \times W_v \qquad \text{Eq. 4}$$

Then the SAD for a frame, denoted by SADframe, is expressed as:
$$\begin{aligned} SAD_{frame} &= SAD_{mb} \times \text{number of macroblocks} \\ &= M \times N \times W_h \times W_v \times (P_h \times P_v)/(M \times N) \\ &= W_h \times W_v \times (P_h \times P_v) \end{aligned} \qquad \text{Eq. 5}$$
This SADframe calculation assumes only one prediction mode and one prediction direction. However, in many cases there are two or three prediction modes and both forward and backward prediction are employed. For SDTV (Standard Definition TV) quality service, the full search motion estimation requires 100 GOPS (Giga Operations Per Second, Giga: 10^9) to 200 GOPS of SAD. Meanwhile all the encoder modules except the motion estimation together take only about 1 GOPS, or about 1% as much processing. Thus much effort has been made to reduce this SAD number to a practical level.
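The SAD counts of equations 4 and 5 can be worked through numerically. In the sketch below, Ph and Pv are taken to be the frame width and height in pixels, which is an interpretation since the text does not define them explicitly. The 720 by 480 frame size and the 32 by 32 search window are illustrative assumptions, not figures from the text, and the result covers a single prediction mode and direction only.

```python
def sad_per_macroblock(M, N, Wh, Wv):
    """Eq. 4: SAD operations to search one macroblock over a Wh x Wv window."""
    return M * N * Wh * Wv


def sad_per_frame(M, N, Wh, Wv, Ph, Pv):
    """Eq. 5: per-macroblock cost times the number of macroblocks in the frame.

    Assumes Ph and Pv are multiples of M and N so the division is exact.
    """
    macroblocks = (Ph * Pv) // (M * N)
    return sad_per_macroblock(M, N, Wh, Wv) * macroblocks


# Assumed example: 720 x 480 frame, 16 x 16 macroblocks, 32 x 32 window.
per_frame = sad_per_frame(16, 16, 32, 32, 720, 480)
per_second = per_frame * 30   # 30 frames per second, as in NTSC
print(per_frame, per_second)
```

As equation 5 predicts, the macroblock size cancels out: the per-frame count equals Wh×Wv×Ph×Pv regardless of M and N. Larger search windows, additional prediction modes and bidirectional prediction multiply this figure further.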
Several algorithms have been proposed to reduce the number of candidate motion vectors that must be considered.
The Q-step search algorithm first evaluates the cost function at the center and eight surrounding locations of a certain area. This area is typically a 32 pixel by 32 pixel block. The location that produces the smallest cost function becomes the center of the next stage. The search range is then reduced, generally by half, and the search is repeated. This sequence is repeated Q times. Typically 2≦Q≦4.
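One possible reading of the Q-step search is sketched below. The function names, the initial step size of 2^(Q−1) pixels, the halving schedule, and the treatment of out-of-frame candidates as infinitely costly are assumptions made for illustration; actual implementations differ in these details.

```python
def sad(cur, ref, x, y, dh, dv, M, N):
    """AE of Eq. 1; returns infinity when the candidate leaves the frame."""
    if x + dh < 0 or y + dv < 0:
        return float("inf")
    if x + dh + M > len(ref[0]) or y + dv + N > len(ref):
        return float("inf")
    return sum(
        abs(ref[y + dv + j][x + dh + i] - cur[y + j][x + i])
        for j in range(N)
        for i in range(M)
    )


def q_step_search(cur, ref, x, y, M, N, Q=3):
    """Evaluate a 3 x 3 grid of candidates per stage, halving the spacing each time."""
    center = (0, 0)
    best_cost = sad(cur, ref, x, y, 0, 0, M, N)
    step = 2 ** (Q - 1)                 # assumed initial spacing, e.g. 4 for Q = 3
    for _ in range(Q):
        ch, cv = center                 # grid center is fixed during a stage
        for sv in (-step, 0, step):
            for sh in (-step, 0, step):
                cost = sad(cur, ref, x, y, ch + sh, cv + sv, M, N)
                if cost < best_cost:
                    best_cost, center = cost, (ch + sh, cv + sv)
        step //= 2                      # reduce the search range by half
    return center, best_cost
```

With Q = 3 this evaluates at most 27 candidates while reaching displacements up to ±7 pixels, compared with 225 candidates for a full search over the same range. The price is that the search follows the locally best candidate at each stage and can therefore miss the true minimum.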
In a sub-sampling based search both the current and reference frames are sub-sampled with an adequate decimation factor. This decimation factor is usually 2 or 4 in the horizontal and vertical directions. In a first iteration, the computation of the cost function is performed in that sub-sampled domain. This yields a coarse motion vector. In successive iterations, the coarse motion vector is refined by conducting the matching over a domain with a smaller decimation factor.
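A two-level version of the sub-sampling based search might look like the sketch below. Several simplifications are assumed here and are not taken from the text: decimation simply drops pixels without low-pass filtering, only one refinement level is used, and the macroblock position and size are assumed to be multiples of the decimation factor k. All function names are hypothetical.

```python
def sad(cur, ref, x, y, dh, dv, M, N):
    """AE of Eq. 1; returns infinity when the candidate leaves the frame."""
    if x + dh < 0 or y + dv < 0:
        return float("inf")
    if x + dh + M > len(ref[0]) or y + dv + N > len(ref):
        return float("inf")
    return sum(
        abs(ref[y + dv + j][x + dh + i] - cur[y + j][x + i])
        for j in range(N)
        for i in range(M)
    )


def window_search(cur, ref, x, y, M, N, center, radius):
    """Exhaustive search of all displacements within `radius` of `center`."""
    best, best_cost = center, float("inf")
    for dv in range(center[1] - radius, center[1] + radius + 1):
        for dh in range(center[0] - radius, center[0] + radius + 1):
            cost = sad(cur, ref, x, y, dh, dv, M, N)
            if cost < best_cost:
                best, best_cost = (dh, dv), cost
    return best, best_cost


def decimate(frame, k):
    """Keep every k-th pixel in both directions (no filtering, for brevity)."""
    return [row[::k] for row in frame[::k]]


def subsampled_search(cur, ref, x, y, M, N, search_range, k=2):
    """Coarse search in the decimated domain, then refinement at full resolution."""
    coarse, _ = window_search(
        decimate(cur, k), decimate(ref, k),
        x // k, y // k, M // k, N // k, (0, 0), search_range // k,
    )
    scaled = (coarse[0] * k, coarse[1] * k)   # map back to full-resolution pixels
    return window_search(cur, ref, x, y, M, N, scaled, k - 1)
```

The coarse stage covers the same physical range with fewer candidates and fewer pixels per candidate, and the refinement stage only needs to resolve the ±(k−1) pixel uncertainty introduced by decimation.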
A telescopic search exploits the motion information in adjacent frames to reduce the computational cost. The rationale behind this approach is that the movement of objects in video is continuous, so the motion information in adjacent frames is correlated. Thus the motion vector of the previous frame provides information relevant to the motion vector of the current frame. Among various implementations a simple instantiation is to use the motion vector of the previous frame as an offset, that is, the center of the search window. This helps find the best matches with a relatively small search window.
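The simple instantiation described above, using the previous frame's motion vector as the center of a small search window, can be sketched as follows. The names `telescopic_search`, `prev_mv` and `radius` are hypothetical, and how the previous vector is obtained and stored is left outside this sketch.

```python
def sad(cur, ref, x, y, dh, dv, M, N):
    """AE of Eq. 1; returns infinity when the candidate leaves the frame."""
    if x + dh < 0 or y + dv < 0:
        return float("inf")
    if x + dh + M > len(ref[0]) or y + dv + N > len(ref):
        return float("inf")
    return sum(
        abs(ref[y + dv + j][x + dh + i] - cur[y + j][x + i])
        for j in range(N)
        for i in range(M)
    )


def telescopic_search(cur, ref, x, y, M, N, prev_mv, radius=2):
    """Search a small window centered on the previous frame's motion vector."""
    best, best_cost = prev_mv, float("inf")
    for dv in range(prev_mv[1] - radius, prev_mv[1] + radius + 1):
        for dh in range(prev_mv[0] - radius, prev_mv[0] + radius + 1):
            cost = sad(cur, ref, x, y, dh, dv, M, N)
            if cost < best_cost:
                best, best_cost = (dh, dv), cost
    return best, best_cost
```

With a radius of 2 only 25 candidates are evaluated, yet a displacement well away from the origin can still be found as long as it lies within 2 pixels of the previous vector. If the motion changes abruptly between frames, the small window may not contain the best match.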
Many digital video encoders use one of these three algorithms or their variants. Some use a mixture of these, such as a sub-sampling based search together with a telescopic search. It has been empirically found that well-tuned motion estimation algorithms take only 2% to 3% of the computation that the full search algorithm requires. This benefit typically sacrifices little visual quality. However, these tailored methods are complicated and tend to require additional resources such as a memory buffer. Even with such significant complexity reduction, motion estimation is still the most computationally intensive part of video coding, often requiring 5 times as many operations as all the other modules combined. Therefore further reduction of computational complexity is desired while preserving visual quality and increasing implementation complexity as little as possible.