Imaging and video capabilities have become the trend in consumer electronics. Digital cameras, digital camcorders, and video cellular phones are common, and many other new gadgets are evolving in the market. Advances in high-resolution CCD/CMOS sensors, coupled with the availability of low-power digital signal processors (DSPs), have led to the development of digital cameras with both high-resolution image and short audio/visual clip capabilities. The high resolution (e.g., a sensor with a 2560×1920 pixel array) provides quality comparable to that offered by traditional film cameras.
More generally, applications for digital video have transcended into the domain of necessary survival equipment for today's digital citizens. In fact, applications involving digital video are so pervasive that virtually all facets of modern life—business, education, entertainment, healthcare, security, and even religion—have been affected by their presence. Aiding in their proliferation, multiple international standards have been created with new ones under development. In the 1990s, low bit-rate applications designed for limited bandwidth video telephony and conferencing motivated early standards like MPEG-1 and H.261. These standards provide picture quality comparable to a movie on VHS tape. As more bandwidth became available, MPEG-2, MPEG-4, and H.263 arrived to provide improvements in compression efficiency and DVD movie quality. The latest video coding standards, like WMV9/VC-1 and H.264/MPEG-4 Part 10 (AVC), make use of several advanced video coding tools to provide compression performance that can exceed MPEG-2 by a factor of two but at the expense of much higher complexity.
Common to all of these coding standards is the compression of video in both space and time. However, on closer inspection, even video encoders of the same standard can be very different. In fact, encoders often use proprietary strategies to improve compression efficiency, which translates directly to better picture quality at a given bit-rate. As video-enabled products continue to be commoditized, picture quality is quickly becoming a distinguishing feature that can foretell success or failure in the marketplace. To build competitive solutions, it is especially imperative that these strategies provide good economy, i.e., better quality for minimal complexity.
Encoders deploy many different tools to reduce both the spatial redundancy of content in each frame and the temporal redundancy between frames. Prediction is the primary facility for eliminating redundancy. If the prediction is better, the coding efficiency is higher, along with the video quality. The initial frame in a video sequence is independently compressed similar to a JPEG image using spatial prediction, i.e., intra-prediction. The subsequent frames are predicted from frames that have already been encoded, i.e., inter-prediction. When block-based motion-compensated prediction is used to model change from frame-to-frame, only the differences between the current and predicted frames need to be encoded. This approach has been used in most modern video coders since the early 1980s.
To track visual differences from frame-to-frame, each frame is tiled into macroblocks. Block-based motion estimation algorithms generate a set of vectors to describe block motion flow between frames, thereby constructing the motion-compensated prediction. The vectors are determined using block-matching procedures that try to match each block in the current frame with the most similar block in a previously encoded frame. Block matching techniques assume that an object in a scene undergoes a displacement in the x- and y-directions between successive frames. This translational displacement defines the components of a two-dimensional motion vector.
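As a rough illustration of how a set of block vectors yields a motion-compensated prediction, the sketch below copies each block from a displaced position in the reference frame and then forms the residual that would remain to be encoded. The function names, the 4×4 default block size, and the edge-clamping rule are assumptions made for this example only; they are not taken from any particular standard.

```python
def predict_frame(reference, vectors, block=4):
    """Build a motion-compensated prediction from one vector per block.

    reference: 2-D list of luma samples, indexed [row][column].
    vectors:   dict mapping (block_row, block_col) -> (x, y) displacement.
    Blocks without an entry use the zero vector. Samples that would fall
    outside the reference are clamped to its edge (an assumption made
    here for simplicity).
    """
    height, width = len(reference), len(reference[0])
    predicted = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            x, y = vectors.get((row // block, col // block), (0, 0))
            src_row = min(max(row + y, 0), height - 1)
            src_col = min(max(col + x, 0), width - 1)
            predicted[row][col] = reference[src_row][src_col]
    return predicted


def residual(current, predicted):
    """Per-sample difference between the current and predicted frames."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, predicted)]
```

When the vectors describe the true translation, the residual is zero everywhere and only the vectors themselves carry information; with poorer vectors, more of the frame content ends up in the residual.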
Motion estimation can be performed within each component of a frame, but is typically only done for luma. In this case, the chroma vectors assume the same vector coordinates as the luma vectors, although some scaling may be required depending on chroma format. Macroblocks are formed by N×M pixels, where N=M=16 for H.264/AVC. In general, the search for the best match between frames is determined by minimizing the image distortion D, which can be calculated using various metrics. The sum of absolute differences (SAD) between pixel-wise values in a block in the current frame and a block from the reference frame is commonly used to determine D. That is, the SAD associated with the motion vector v=(x,y) is given by
$$D = \mathrm{SAD}(v) = \mathrm{SAD}(x, y) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| F_c(i, j) - F_r(x + i,\; y + j) \right|$$

where Fc(i,j) is a pixel in the ith column and jth row of a macroblock in the current frame and Fr(x+i,y+j) is the co-located pixel in a reference frame with horizontal offset x and vertical offset y.
Alternatively, the distortion can be represented as the sum of squared error (SSE, also called the sum of squared differences, SSD) such that
$$D = \mathrm{SSE}(v) = \mathrm{SSE}(x, y) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( F_c(i, j) - F_r(x + i,\; y + j) \right)^2.$$
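The two distortion measures can be sketched in Python as follows. Frames are assumed to be 2-D lists indexed [row][column], and the `top` and `left` parameters, which locate the macroblock within the frame, are additions made for this example; the formulas themselves use macroblock-relative coordinates only. The caller is assumed to keep the displaced block inside the reference frame.

```python
def sad(current, reference, top, left, x, y, n=16, m=16):
    """Sum of absolute differences for the N-column by M-row block whose
    top-left sample is at (top, left), displaced by vector (x, y)."""
    return sum(
        abs(current[top + j][left + i] - reference[top + y + j][left + x + i])
        for j in range(m)      # j indexes rows
        for i in range(n)      # i indexes columns
    )


def sse(current, reference, top, left, x, y, n=16, m=16):
    """Sum of squared errors over the same block and displacement."""
    return sum(
        (current[top + j][left + i] - reference[top + y + j][left + x + i]) ** 2
        for j in range(m)
        for i in range(n)
    )
```

Because SSE squares each difference, it penalizes a few large errors more heavily than SAD does, while SAD avoids the multiplications.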
Motion estimation algorithms usually define a sequence of ordered steps to methodically search for the vector $\hat{v}$ that minimizes the image distortion for each macroblock. Depending on each step's search pattern and search range, the distortion D is evaluated over P locations such that v spans an area A = {v1, v2, ..., vP}. In practice, A should be large enough to cover the anticipated block displacement. Measuring the distortion at fractional distances by interpolating between integer pixel locations is also commonly used to reduce block error, which produces vectors with fractional components. Generally, increasing the search range, i.e., the size of set A, can improve the likelihood that the global minimum is discovered; however, the search range is often constrained to keep complexity manageable, especially in resource-constrained embedded devices. The vector $\hat{v}$ that minimizes D over area A for a given macroblock is selected such that
$$\hat{v} = \arg\min_{v \in A} \{ \mathrm{SSE}(v) \} \quad \text{or} \quad \hat{v} = \arg\min_{v \in A} \{ \mathrm{SAD}(v) \}.$$

However, using D alone does not guarantee optimal coding efficiency.
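To make the search over A concrete, the following minimal sketch evaluates SAD at every integer displacement within a square window and keeps the minimizer. Exhaustive (full) search is only one possible search pattern, chosen here because it is the simplest to state; the function names, the square block, and the rule of skipping candidates that fall outside the reference frame are assumptions of this example.

```python
def block_sad(current, reference, top, left, x, y, size):
    """SAD between a size x size block of the current frame and the block
    displaced by (x, y) in the reference frame."""
    return sum(
        abs(current[top + j][left + i] - reference[top + y + j][left + x + i])
        for j in range(size)
        for i in range(size)
    )


def full_search(current, reference, top, left, size=16, search_range=4):
    """Return (best_vector, best_sad) over all integer displacements with
    |x| and |y| no larger than search_range."""
    height, width = len(reference), len(reference[0])
    best_vector, best_cost = None, None
    for y in range(-search_range, search_range + 1):
        for x in range(-search_range, search_range + 1):
            # Skip candidates whose block would leave the reference frame.
            if top + y < 0 or top + y + size > height:
                continue
            if left + x < 0 or left + x + size > width:
                continue
            cost = block_sad(current, reference, top, left, x, y, size)
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (x, y), cost
    return best_vector, best_cost
```

The number of distortion evaluations grows with the square of `search_range`, which is why practical encoders on constrained devices tend to restrict the window or visit only a subset of its locations.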
In codecs with more advanced tools for motion estimation, a single N×M macroblock may be partitioned into a variety of smaller blocks. Block-matching using multiple, smaller sub-blocks can reduce a macroblock's overall image distortion. The macroblock's inter-prediction mode u indicates the sub-block partition configuration selected from the W different arrangements allowed. For example, when W=4, u=0 represents no partitioning, u=1 represents two N×M/2 sub-blocks, u=2 represents two N/2×M sub-blocks, and u=3 represents four N/2×M/2 sub-blocks. The inter-prediction mode for no partitioning may be referred to as the single-partition inter-prediction mode and the inter-prediction modes for multiple partitions may be referred to as multiple-partition inter-prediction modes. Each partition is represented by its own motion vector although partitions in a macroblock can have the same vector coordinates. Ideally, the mode u is selected to minimize the overall bit-rate required to describe the macroblock's prediction error, e.g.,
$$\text{Optimal Inter-Prediction Mode } \hat{u} = \arg\min_{u \in W} \left\{ \sum_{k=0}^{K_u - 1} \left[ v_u(k) + m_u(k) + d_u(k) \right] \right\}, \quad 0 \le u < W,$$

where each of the Ku sub-blocks requires vu bits to represent the vector v, mu overhead bits for mode u, and du bits to represent the block distortion residual.
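Given the three bit counts for every sub-block of every candidate mode, the selection itself is a simple minimization, as the sketch below shows. The function name and the input layout are assumptions of this example, and the numbers used in the usage note are invented for illustration; obtaining the real bit counts is the difficult part.

```python
def select_inter_mode(costs_by_mode):
    """Pick the inter-prediction mode with the smallest total bit cost.

    costs_by_mode[u] holds one (v_bits, m_bits, d_bits) tuple per
    sub-block k of mode u, so len(costs_by_mode[u]) plays the role of K_u.
    Returns (best_mode, totals), where totals[u] is the summed cost of mode u.
    """
    totals = [sum(v + m + d for v, m, d in parts) for parts in costs_by_mode]
    best_mode = min(range(len(totals)), key=totals.__getitem__)
    return best_mode, totals
```

For example, with hypothetical costs of `[(6, 1, 120)]` for u=0, `[(6, 2, 40), (7, 2, 45)]` for u=1, `[(6, 2, 60), (6, 2, 58)]` for u=2, and `[(6, 3, 20), (7, 3, 22), (6, 3, 25), (8, 3, 18)]` for u=3, the totals are 127, 102, 134, and 124 bits, so mode u=1 would be selected.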
In cases where parts of the frame contain lots of visual texture under motion, partitioning a macroblock into smaller independent sub-blocks may prove advantageous. While more bits are required to describe multiple partitions, e.g., vu bits tend to increase, the rate-to-distortion ratio may still be more favorable than using a single block if the overall image distortion is smaller, e.g.,
$$\min_{u \in W,\; u \ne 0} \left\{ \sum_{k=0}^{K_u - 1} \left[ v_u(k) + m_u(k) + d_u(k) \right] \right\} < \left. \left[ v_u(0) + m_u(0) + d_u(0) \right] \right|_{u=0}.$$

In other cases where the image texture is smooth or uniform, a single vector per macroblock may provide a more attractive rate-to-distortion ratio.
However, solving for the values of vu(k), mu(k) or du(k) when k≠0 is a complex optimization problem requiring iterative calculations that are generally not well-suited for real-time or low-cost applications.
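The distortion side of the partitioning trade-off can be illustrated with a small sketch that compares the best single-vector SAD for a macroblock against the combined SAD of two independently searched N×M/2 sub-blocks (the u=1 arrangement). It models only distortion, not the vector, mode, or residual bit counts, and the function names, exhaustive search, and boundary handling are assumptions of this example.

```python
def region_sad(current, reference, top, left, x, y, width, height):
    """SAD for a width x height region displaced by (x, y)."""
    return sum(
        abs(current[top + j][left + i] - reference[top + y + j][left + x + i])
        for j in range(height)
        for i in range(width)
    )


def best_sad(current, reference, top, left, width, height, search_range):
    """Smallest SAD over all in-bounds integer displacements in the window."""
    rows, cols = len(reference), len(reference[0])
    best = None
    for y in range(-search_range, search_range + 1):
        for x in range(-search_range, search_range + 1):
            if top + y < 0 or top + y + height > rows:
                continue
            if left + x < 0 or left + x + width > cols:
                continue
            cost = region_sad(current, reference, top, left, x, y, width, height)
            if best is None or cost < best:
                best = cost
    return best


def compare_partitions(current, reference, top, left, size, search_range):
    """Return (single_block_sad, two_halves_sad) for a size x size macroblock."""
    single = best_sad(current, reference, top, left, size, size, search_range)
    half = size // 2
    upper = best_sad(current, reference, top, left, size, half, search_range)
    lower = best_sad(current, reference, top + half, left, size, half, search_range)
    return single, upper + lower
```

When the upper and lower halves of a macroblock move in different directions, the second value can be far smaller than the first. Note that each additional partition requires its own search, which hints at why evaluating every mode exhaustively becomes costly.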