Video compression is employed to allow the creation of digital video files of a manageable size. Although storage capacity and bandwidth have increased since the demand first arose for video compression, so too has the amount of digital video data available. Accordingly, the design of new encoding techniques has progressed to provide both better compression rates and higher perceptual quality. Typically, compression techniques can be classified as either lossy or lossless compression. Lossless compression, also referred to as entropy compression, seeks to remove unneeded redundancy in the data and replace the redundant data with a marker allowing full retrieval of the original data. Although this is ideal for completely accurate reproduction of a data stream or data file, it is typically insufficient for video data, as the data is often not structured enough to allow for an adequate reduction in data file size. Lossy compression discards data in a manner that prevents full recovery of the original file. If lossy compression is employed properly, the discarded data can be perceptually restored by a viewer. A data file or stream cannot be reduced by lossless compression to a size less than the entropy of the source. Lossy compression can be employed to reduce a file below the entropy level, but results in a distortion that can be measured as a rate distortion.
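By way of illustration only, the entropy bound mentioned above can be sketched as follows. The sketch assumes a memoryless source whose symbol probabilities equal the observed symbol frequencies, which is a simplification that real video data does not satisfy; the function name `empirical_entropy_bits` and the sample string are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the observed symbol frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical source: 'a' half the time, 'b' a quarter, 'c' and 'd' an eighth each.
source = "aaaabbcd"
bits_per_symbol = empirical_entropy_bits(source)

# Under the memoryless assumption, no lossless code can average fewer bits
# per symbol than this value; a fixed-length code would need 2 bits per symbol.
lower_bound_bits = bits_per_symbol * len(source)
print(bits_per_symbol, lower_bound_bits)
```

Under these assumptions, a lossless coder can approach but not beat the computed bound, whereas a lossy coder may go below it at the cost of distortion.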
H.264, also standardized as MPEG-4 Part 10, is a standard compression methodology that has proved superior to its predecessor encoders in coding efficiency (e.g., it shows a more than 50% rate reduction in comparison to the popular MPEG-2). However, although H.264 provides compression advantages, new encoding techniques that refine H.264 encoding can be employed to further increase the compression rates. It should be noted that radical changes to the methodology of compression cannot easily be implemented, as a standard decoder has been defined that must be able to decode the resulting data stream if compatibility with H.264 is to be maintained.
Based on an assumption of four types of redundancy—temporal, spatial, psychovisual, and statistical redundancy—video compression generally utilizes a hybrid structure, as shown in FIG. 1. A data stream is provided as an input to encoder 100. Upon receipt of the data stream, a motion compensation factor is applied to the data stream by motion compensator 112. After having been motion compensated, a video frame is considered to be a residual as, in the ideal, data corresponding to information in a previous frame has been removed. The residual is provided to a transform processor 102 that applies a transform (typically a discrete cosine transform (DCT)), the output of which is provided to a quantizer 104. The quantizer 104 applies lossy compression to each frame using an algorithm designed to co-operate with the other functional blocks in the system based on defined quantization steps. H.264 quantizers have been implemented as “hard decision” quantizers that make quantization decisions based solely upon the quantization levels defined. The quantized version of the transformed residual is provided to an entropy encoder 106 that removes the statistical redundancy in the data stream, and provides as an output an H.264 encoded bit stream. The output of the quantizer 104 is also used to determine the motion compensation that will be applied to a subsequent frame. The quantizer output is provided to de-quantizer 108, whose output is provided to an inverse transform processor 110. This provides a reconstruction of the quantized residual, and is used by motion compensator 112 to derive a motion prediction that is then combined with a subsequent frame to provide the next residual. 
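By way of illustration only, the closed prediction loop and the hard decision quantizer described above can be sketched in a greatly simplified form. The sketch assumes one-dimensional "frames", zero-motion prediction from the previous reconstruction, and a uniform step size; the transform and entropy coding stages are omitted, and all function names are hypothetical.

```python
def hard_quantize(values, step):
    """Hard decision: map each value to the index of the nearest multiple of step.

    Python's round() resolves exact ties to the even index.
    """
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Map level indices back to reconstruction values."""
    return [u * step for u in indices]


def encode(frames, step):
    """Toy closed-loop encoder: predict each frame from the previous reconstruction."""
    reference = [0] * len(frames[0])  # prediction used for the first frame
    coded = []
    for frame in frames:
        residual = [x - r for x, r in zip(frame, reference)]
        indices = hard_quantize(residual, step)
        coded.append(indices)
        # Rebuild the frame exactly as a decoder would, and predict from that.
        reference = [r + d for r, d in zip(reference, dequantize(indices, step))]
    return coded


def decode(coded, step):
    """Toy decoder: accumulate dequantized residuals onto the running reference."""
    reference = [0] * len(coded[0])
    frames = []
    for indices in coded:
        reference = [r + d for r, d in zip(reference, dequantize(indices, step))]
        frames.append(reference)
    return frames


frames = [[9, 21, 31], [11, 20, 33]]
coded = encode(frames, step=4)
decoded = decode(coded, step=4)
print(coded, decoded)
```

Because the encoder predicts from its own reconstruction rather than from the original frame, the quantization error in this sketch stays within half a step per sample instead of accumulating from frame to frame.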
In general, motion compensation deals with temporal redundancy (removal of information unchanged from a previous part of the data stream); transform handles spatial redundancy (decoupling correlation of information in an adjoining area); quantization is based on psychovisual redundancy (removing information not needed for the viewer's perception of the image); and entropy coding is designed for removing statistical redundancy (reduction of information through the use of lossless compression). Because the quantization part introduces permanent information loss to video data, video compression is categorized as lossy data compression. Rate distortion theory, as will be understood by those skilled in the art, indicates that the best coding efficiency a lossy compression method can achieve for coding a given information source is characterized by a rate distortion function, or equivalently distortion rate function, of the source. The four coding parts in the hybrid structure all contribute to the rate distortion function and there is no easy way to quantitatively separate their contributions. Therefore, the fundamental trade-off in the design of a video compression system is its overall rate distortion performance. The performance of each component has been subject to numerous optimization attempts, many of which use rate distortion (RD) methods.
RD methods for video compression can be classified into two categories. The first category computes the theoretical rate distortion function based on a given statistical model for video data. In general, there is always a problem of model mismatch due to the non-stationary nature of video data. The second category uses an operational rate distortion function, which is computed based on the data to be compressed. There exist two main problems with operational rate distortion methods. First, the formulated optimization problem is restricted and the rate distortion cost is optimized only over motion compensation and quantization step sizes. Second, there is no simple way to solve the restricted optimization problem if the actual rate distortion cost is used. Because hard decision quantization is used, there is no simple analytic formula to represent the actual rate distortion cost as a function of motion compensation and quantization step sizes. Hence, typical solutions to the restricted optimization problem involve a brute force approach that is computationally expensive. For this reason, an approximate rate distortion cost is often used in the restricted optimization problem in many operational rate distortion methods. For example, some implementations optimize motion compensation based on the prediction error instead of the actual distortion, which is the quantization error.
Most video compression standards, from the early MPEG-1 and H.261 to the newest H.264 (which is also referred to as MPEG-4 Part 10), utilize the well-known hybrid coding structure shown in FIG. 1. The motion compensation design in H.264 has been significantly improved over previous standards. It allows various block sizes from 4×4 to 16×16. While a large block size is desirable for homogeneous regions, a small block size makes it possible to process details effectively. It also uses higher-pixel prediction accuracy.
For the transform processor function, H.264 uses the discrete cosine transform (DCT) with a block size of 4×4 while most other standards for video and image coding usually use the 8×8 DCT transform. Specifically, the transform matrix is
\[
\hat{w} = \begin{pmatrix}
1/2 & 1/2 & 1/2 & 1/2 \\
1/\sqrt{2.5} & 0.5/\sqrt{2.5} & -0.5/\sqrt{2.5} & -1/\sqrt{2.5} \\
1/2 & -1/2 & -1/2 & 1/2 \\
0.5/\sqrt{2.5} & -1/\sqrt{2.5} & 1/\sqrt{2.5} & -0.5/\sqrt{2.5}
\end{pmatrix}
\]
To facilitate fast implementation with integer operations, a simplified transform matrix is obtained as
\[
w = \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 1/2 & -1/2 & -1 \\
1 & -1 & -1 & 1 \\
1/2 & -1 & 1 & -1/2
\end{pmatrix}
\]
by extracting a factor f from ŵ as
\[
f = \begin{pmatrix}
1/4 & 1/\sqrt{10} & 1/4 & 1/\sqrt{10} \\
1/\sqrt{10} & 2/5 & 1/\sqrt{10} & 2/5 \\
1/4 & 1/\sqrt{10} & 1/4 & 1/\sqrt{10} \\
1/\sqrt{10} & 2/5 & 1/\sqrt{10} & 2/5
\end{pmatrix}
\]
with $\hat{w} Y \hat{w}^T = (w Y w^T) \otimes f$ for any 4×4 matrix Y, where the symbol $\otimes$ denotes element-wise multiplication.
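By way of illustration only, the relation between ŵ, w, and f stated above can be checked numerically. The sketch below assumes the square-root entries 1/√2.5 in ŵ and 1/√10 in f, under which ŵ is orthonormal; the helper names and the sample block `Y` are hypothetical.

```python
import math

S = math.sqrt(2.5)
R = 1 / math.sqrt(10)

W_HAT = [
    [0.5, 0.5, 0.5, 0.5],
    [1 / S, 0.5 / S, -0.5 / S, -1 / S],
    [0.5, -0.5, -0.5, 0.5],
    [0.5 / S, -1 / S, 1 / S, -0.5 / S],
]
W = [
    [1, 1, 1, 1],
    [1, 0.5, -0.5, -1],
    [1, -1, -1, 1],
    [0.5, -1, 1, -0.5],
]
F = [
    [0.25, R, 0.25, R],
    [R, 0.4, R, 0.4],
    [0.25, R, 0.25, R],
    [R, 0.4, R, 0.4],
]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def hadamard(a, b):
    """Element-wise product of two 4x4 matrices."""
    return [[a[i][j] * b[i][j] for j in range(4)] for i in range(4)]


def max_abs_diff(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))


# Arbitrary sample block (hypothetical values).
Y = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]

lhs = matmul(matmul(W_HAT, Y), transpose(W_HAT))       # w_hat Y w_hat^T
rhs = hadamard(matmul(matmul(W, Y), transpose(W)), F)  # (w Y w^T) (x) f
print(max_abs_diff(lhs, rhs))
```

The two sides agree to floating-point precision, which is what allows the scaling by f to be moved out of the transform and merged into the quantization step.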
Quantization in H.264 is simply achieved by a scalar quantizer. It is defined by 52 step sizes based on an index parameter p = 0, 1, ..., 51. The quantization step size for a given p is specified as
\[
q[p] = h[p_{rem}] \cdot 2^{p_{quo}} \qquad (1)
\]
where $p_{rem} = p \bmod 6$ and $p_{quo} = \lfloor p/6 \rfloor$ are the remainder and quotient of p divided by 6, and $h[i] \in \{10/16, 11/16, 13/16, 14/16, 16/16, 18/16\}$ for $0 \le i < 6$. For the purpose of fast implementation, quantization and transform in H.264 are combined together. Specifically, the factor matrix f is combined with the quantization step size. Suppose that the decoder receives the quantized transform coefficients u and the quantization parameter p for a 4×4 block. Then the following process is defined in H.264 for the decoding,
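By way of illustration only, the step size rule of (1) can be sketched as follows. Exact fractions are used so that the values are free of rounding error; the names `H` and `step_size` are hypothetical.

```python
from fractions import Fraction

# h[i] for i = 0..5, as listed with equation (1).
H = [Fraction(n, 16) for n in (10, 11, 13, 14, 16, 18)]


def step_size(p):
    """Quantization step size q[p] = h[p mod 6] * 2^(p div 6) for p in 0..51."""
    if not 0 <= p <= 51:
        raise ValueError("quantization parameter must be in 0..51")
    p_quo, p_rem = divmod(p, 6)
    return H[p_rem] * 2 ** p_quo


for p in (0, 4, 6, 51):
    print(p, step_size(p))
```

Since p_quo increases by one each time p increases by 6, the step size doubles for every increment of 6 in the quantization parameter.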
\[
\begin{aligned}
T^{-1}(Q^{-1}(u)) &= \hat{w}^T \left( u \cdot q[p] \right) \cdot \hat{w} \\
&= w^T \left( \left( u \cdot h[p_{rem}] \cdot 2^{p_{quo}} \right) \otimes f \right) \cdot w \\
&= w^T \left( u \otimes \left( dq[p_{rem}] \right) \cdot 2^{p_{quo}} \right) \cdot w \cdot \frac{1}{64}
\end{aligned}
\qquad (2)
\]
where $dq[i] = f \cdot h[i] \cdot 64$ with $0 \le i < 6$ are constant matrices defined in the standard. It is clear that the computation of (2) can be conducted using only integer operations.
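By way of illustration only, the agreement between the first and last expressions of (2) can be checked numerically. The sketch assumes the square-root entries 1/√2.5 and 1/√10 in ŵ and f, builds both from the row scales relating ŵ to w, and uses the unrounded product f·h[i]·64 for dq; the integer tables of the standard are not reproduced here, and all names are hypothetical.

```python
import math

H = [10 / 16, 11 / 16, 13 / 16, 14 / 16, 16 / 16, 18 / 16]
W = [[1, 1, 1, 1], [1, 0.5, -0.5, -1], [1, -1, -1, 1], [0.5, -1, 1, -0.5]]

# Row scales relating the two matrices: w_hat = diag(D) * w and f[i][j] = D[i] * D[j].
D = [0.5, 1 / math.sqrt(2.5), 0.5, 1 / math.sqrt(2.5)]
W_HAT = [[D[i] * W[i][j] for j in range(4)] for i in range(4)]
F = [[D[i] * D[j] for j in range(4)] for i in range(4)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def decode_direct(u, p):
    """First expression of (2): w_hat^T (u * q[p]) w_hat."""
    q = H[p % 6] * 2 ** (p // 6)
    scaled = [[u[i][j] * q for j in range(4)] for i in range(4)]
    return matmul(matmul(transpose(W_HAT), scaled), W_HAT)


def decode_factored(u, p):
    """Last expression of (2): w^T (u (x) dq[p mod 6] * 2^(p div 6)) w / 64."""
    dq = [[F[i][j] * H[p % 6] * 64 for j in range(4)] for i in range(4)]  # unrounded
    scaled = [[u[i][j] * dq[i][j] * 2 ** (p // 6) for j in range(4)] for i in range(4)]
    out = matmul(matmul(transpose(W), scaled), W)
    return [[v / 64 for v in row] for row in out]


# Hypothetical quantized coefficients for one 4x4 block.
u = [[12, -3, 1, 0], [4, 2, 0, 0], [-1, 0, 0, 0], [0, 0, 0, 0]]
a = decode_direct(u, 28)
b = decode_factored(u, 28)
print(max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4)))
```

In the factored form, all irrational scale factors are confined to the constant matrix dq, so the remaining matrix products involve only the simple entries of w.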
H.264 supports two entropy coding methods for residual coding, i.e., context adaptive variable length coding (CAVLC) in the baseline profile and context adaptive binary arithmetic coding (CABAC) in the main profile. CAVLC is based on variable length coding tables, while CABAC uses advanced arithmetic coding methods. Arithmetic coding is generally considered to be superior to variable length coding because it can adapt to symbol statistics and assign a non-integer number of bits to code a symbol. However, the complexity of arithmetic coding is much higher. Overall, CAVLC provides a baseline solution for applications with limited computational resources, while CABAC targets better coding performance.
As discussed above, each individual coding part in the hybrid structure of H.264 has been well designed to achieve good coding performance using state-of-the-art technologies. Optimization of an individual part in H.264 alone is unlikely to provide remarkable performance improvement. Further improvement of the coding performance largely depends on the design of the whole structure, for which rate distortion methods are studied. A joint optimal design of the whole encoding structure is possible because the standard only specifies a syntax for the coded bit stream, leaving the details of the encoding process open to each designer. This allows for a number of different encoder implementations, with a standard decoder being able to decode a data stream from any H.264-compliant encoder.
Rate distortion methods for video compression in general can be roughly classified into two categories: methods based on source modeling and methods based on an operational rate distortion cost. The first category uses the theoretical rate distortion function, which characterizes the optimal rate distortion performance of any lossy coding method. The challenge of this approach is to model the data statistics. In general, a model mismatch typically results in inefficiencies in encoding. Because all theoretical results in rate distortion theory are based on a given statistical model, the model mismatch problem constantly exists as a gap between the simplified theoretical models and the complicated real world data.
The second category of rate distortion methods is based on an operational rate distortion function. An operational rate distortion framework for efficiently distributing the bit budget among temporal and spatial coding methods for MPEG video compression has been proposed. Typically these solutions result in exponential complexity, which is tackled by utilizing a monotonicity property of operational rate distortion curves for dependent blocks/frames. The monotonicity property is based on an assumption that rate distortion performance for coding one frame is monotonic in the effectiveness of prediction, which depends on the reproduction quality of reference frames. A pruning rule can then be applied to reduce search complexity based on the monotonicity property. Generally speaking, the above assumption implies a linear relationship between distortion and the coding rate. This assumption is valid to a large extent for early standards such as MPEG-1 and MPEG-2. However, the total coding rate includes more than just the rate for coding residuals. Motion vectors from motion compensation also need to be transmitted. For early standards, motion compensation is based on a large block size of 16×16, leading to a small number of motion vectors to be transmitted. As such, motion vector transmission consumes relatively few bits, and can largely be ignored. Thus, it is acceptable to apply the above assumption to simplify the rate distortion problem. However, when small block sizes, such as those in H.264, are allowed for motion compensation, motion vectors consume a significant portion of the total coding bits. Thus, this method cannot be directly applied to H.264.
Using the generalized Lagrangian multiplier method, a simple, effective operational RD method for H.264 video compression, particularly for the optimization of motion compensation, has been proposed. Motion compensation is optimized based on the following operational rate distortion cost,
\[
v = \arg\min_{v} \left[ d(x, p(m, v)) + \lambda \cdot r(v) \right] \qquad (3)
\]
where x stands for the original image block, p(m, v) is the prediction with given prediction mode m and motion vector v, d(·) is a distance measure, r(v) is the number of bits for coding v, and λ is the Lagrangian multiplier. Empirical evidence indicates that a good Lagrangian multiplier λ can be represented as
\[
\lambda = 0.85 \cdot 2^{(p-12)/3} \qquad (4)
\]
where p = 0, 1, ..., 51 is the quantization parameter in (1). Clearly, the optimization here is not conducted based on the actual rate distortion cost. In order to avoid the expensive computation for residual coding, the distortion is approximated by the prediction error and the residual coding rate is not computed for motion compensation. Therefore, the optimization here is largely separated from residual coding.
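By way of illustration only, the selection rule of (3) with the multiplier of (4) can be sketched as follows. The sketch assumes the sum of absolute differences as the distance measure d and a made-up bit count for r(v) that is not the H.264 motion vector syntax; the function names and the candidate table are hypothetical.

```python
def lagrange_multiplier(p):
    """Equation (4): lambda = 0.85 * 2^((p - 12) / 3) for quantization parameter p."""
    return 0.85 * 2 ** ((p - 12) / 3)


def sad(block, prediction):
    """Sum of absolute differences, used here as the distance measure d (an assumption)."""
    return sum(abs(a - b) for a, b in zip(block, prediction))


def mv_bits(v):
    """Hypothetical rate model r(v): longer vectors cost more bits."""
    return 1 + 2 * (abs(v[0]) + abs(v[1]))


def best_motion_vector(block, candidates, p):
    """Equation (3): choose v minimizing d(x, p(m, v)) + lambda * r(v).

    candidates maps each motion vector to the prediction block it yields.
    """
    lam = lagrange_multiplier(p)
    return min(candidates, key=lambda v: sad(block, candidates[v]) + lam * mv_bits(v))


block = [10, 12, 14, 16]
candidates = {
    (0, 0): [10, 12, 15, 18],  # cheap vector, prediction error 3
    (2, 1): [10, 12, 14, 16],  # exact match, but a costlier vector
}
print(best_motion_vector(block, candidates, p=0))
print(best_motion_vector(block, candidates, p=12))
```

With a small quantization parameter the multiplier is small and the exact match is preferred, whereas a larger parameter weights the motion vector bits more heavily and favors the cheaper vector. As in the method described above, the cost uses the prediction error only and does not account for the residual coding rate.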
It is, therefore, desirable to provide a method of encoding video data using H.264, as well as other encoding standards, making use of joint optimization across different elements in a hybrid encoder.