A video signal typically comprises a series of frames, each showing an image at one instant in time. When the frames are displayed in quick succession they give the impression of a moving image. In order to reduce the data rate required to store or send a video signal, compression algorithms (commonly known as ‘codecs’) are used to encode the data. Such compression algorithms typically divide each frame into a number of smaller blocks, each of which is encoded.
Color video images typically comprise several color planes. For example, an RGB image comprises red, green and blue color planes, which when overlaid or mixed make up the full color image. Video applications commonly use a color scheme in which the planes do not correspond to specific colors. Instead one plane corresponds to luminance (the brightness of each pixel in the color image), and the other planes (usually two of them) contain certain color information (chrominance). When the chrominance information is combined with the luminance information, the color image can be derived and displayed, either directly or by first converting the information into separate RGB levels. The reason that a luminance-chrominance system is commonly used is that human perception is much more sensitive to differences in luminance than to differences in chrominance. Therefore video compression algorithms typically encode the chrominance information at a lower resolution than the luminance information, in order to reduce the amount of data needed without unduly affecting image quality. Such blocks of data with differing resolutions of luminance and chrominance data are called ‘macroblocks’. A typical macroblock may, for example, have two planes of chrominance data at half the vertical and half the horizontal resolution of the luminance data. However, in this patent specification the term ‘macroblock’ is used to mean any block of image data that has chrominance data at a lower resolution than its luminance data.
Data in the blocks of the video frame is typically encoded by use of a transform, which transforms the data into frequency space. A Discrete Cosine Transform (DCT) is often used for this purpose, but other types of transform may be used instead. The human eye is less sensitive to information contained in the high frequency components and therefore some information relating to the higher frequencies may be discarded or encoded using fewer bits, in order to reduce the amount of data. Once this is done the transformed block may be quantized, by scaling the transform coefficients and rounding each one to the nearest of a number of predetermined values. For example, if the transform coefficients are between −1 and 1, then scaling each coefficient by 20 and rounding to the nearest integer quantizes the coefficient to the nearest of 41 quantization points (from −20 to +20, including 0).
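The quantization example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `quantize` and `dequantize` and the sample coefficient values are hypothetical, and a real codec derives its scaling from a quantization parameter rather than a fixed factor of 20.

```python
def quantize(coefficient, scale=20):
    """Map a coefficient in [-1, 1] to the nearest integer level in [-scale, scale].

    Note: Python's round() resolves exact .5 ties to the nearest even integer.
    """
    return round(coefficient * scale)


def dequantize(level, scale=20):
    """Approximate the original coefficient from its quantization level."""
    return level / scale


# Hypothetical transform coefficients, all within [-1, 1].
coefficients = [-1.0, -0.337, 0.0, 0.012, 0.48, 1.0]

levels = [quantize(c) for c in coefficients]      # [-20, -7, 0, 0, 10, 20]
restored = [dequantize(q) for q in levels]        # lossy approximation

# Small coefficients such as 0.012 collapse to level 0 and are lost.
print(levels)
print(restored)
```

The difference between `coefficients` and `restored` is the information discarded by quantization, which is the source of the distortion discussed later in this section.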
After quantization, the number of bits required to encode the data is reduced further by taking advantage of certain statistical properties of the quantized data. This process is called ‘entropy encoding’. For example, after quantization, many of the coefficients may have a value of zero; a type of entropy encoding called run-length coding takes account of consecutive zero coefficients and encodes the length of each such ‘run’, rather than encoding each zero value separately. Other types of entropy encoding which take advantage of the statistical properties of the data, for example variable length encoding (VLC) or arithmetic coding, may also be used. The above describes simple encoding methods in which each block in each frame is encoded independently of the other frames. This method of encoding is still used; however, most modern compression algorithms allow a variety of different block encoding modes, any of which may be used to encode a particular block.
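As an illustration of run-length coding of zero coefficients, the following Python sketch encodes a sequence as (zero run, value) pairs. The pair format, the use of `None` to mark trailing zeros, and the function names are assumptions made for this example; they are not the syntax of any particular codec.

```python
def run_length_encode(coefficients):
    """Encode a coefficient sequence as (zero_run, value) pairs.

    Each pair gives the number of zeros that precede a non-zero value.
    A run of trailing zeros is stored as a final (zero_run, None) pair.
    """
    pairs = []
    run = 0
    for value in coefficients:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    if run:
        pairs.append((run, None))
    return pairs


def run_length_decode(pairs):
    """Rebuild the original coefficient sequence from (zero_run, value) pairs."""
    coefficients = []
    for run, value in pairs:
        coefficients.extend([0] * run)
        if value is not None:
            coefficients.append(value)
    return coefficients


# Hypothetical quantized coefficients of one 4x4 block, in scan order.
quantized = [12, 0, 0, -3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

encoded = run_length_encode(quantized)
print(encoded)  # [(0, 12), (2, -3), (4, 1), (7, None)]
```

Sixteen coefficients are represented by four pairs here; the saving grows as quantization forces more coefficients to zero.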
An intra encoding mode is a mode in which each block is encoded on the basis of data held within that block (the source block) and on the basis of data in other blocks (reference blocks) in the same frame. The encoding process may work as follows. The contents of the source block are predicted on the basis of one or more reference blocks in the same frame (this is called intra prediction). The difference between the predicted block and the source block is called a residual block. The residual block is encoded by applying an image transform, quantizing and entropy encoding, as explained in the paragraphs above. The encoded residual block is stored together with coding data identifying the reference blocks and identifying the encoding mode used for the intra prediction. During decoding the predicted block is computed from the coding data and the source block is reconstructed by adding the (decoded) residual block to the predicted block. There may be several different possible intra encoding modes based on different block sizes or different positions of the reference block(s) relative to the source block.
An inter encoding mode makes use of the fact that in a video signal there are often substantial similarities between successive frames, for example, in areas of the image in which there is no movement, or areas relating to a moving object which translates in position between successive frames. An inter encoding mode ‘predicts’ the content of a particular block (a source block) on the basis of another block (called a reference block) in a different frame (which may be one or more frames before or after the frame containing the block being predicted). This is called inter-prediction, as the prediction is based on blocks in other frames. The residual block is the difference between the predicted block and the source block. The residual block is encoded by using an image transform, quantizing and entropy encoding. The encoded residual block is stored together with coding data identifying the reference block used and the particular inter-prediction mode used. The coding data may, for example, comprise a motion vector relating the reference block and the predicted block. During decoding the predicted block is computed from the coding data and the source block is reconstructed by adding the predicted block to the (decoded) residual block. There may be several different possible inter-prediction modes, each based on different block sizes, different reference blocks or different frames relative to the source block.
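The prediction, residual and reconstruction steps of inter encoding can be sketched as follows. This is a simplified illustration, not the procedure of any specific codec: the frame contents, block position and motion vector are made-up values, the helper names are hypothetical, and the transform, quantization and entropy coding of the residual are omitted, so the reconstruction here is exact rather than approximate.

```python
def fetch_block(frame, top, left, size):
    """Copy a size x size block whose top-left corner is at (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def subtract_blocks(a, b):
    """Element-wise difference a - b of two equally sized blocks."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add_blocks(a, b):
    """Element-wise sum a + b of two equally sized blocks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


SIZE = 4
# Hypothetical 8x8 reference frame (an earlier, already decoded frame).
reference_frame = [[10 * r + c for c in range(8)] for r in range(8)]

# Encoder side. The source block sits at (2, 2) in the current frame and the
# motion vector (dy, dx) points to the best-matching block in the reference.
source_top, source_left = 2, 2
motion_vector = (-1, 1)
ref_top = source_top + motion_vector[0]
ref_left = source_left + motion_vector[1]

predicted = fetch_block(reference_frame, ref_top, ref_left, SIZE)

# Hypothetical source block: the reference content, slightly brighter,
# with one pixel that differs more strongly.
source = [[v + 1 for v in row] for row in predicted]
source[0][0] += 4

residual = subtract_blocks(source, predicted)  # small values, cheap to encode

# Decoder side. Only the motion vector and the residual are needed.
decoded_prediction = fetch_block(reference_frame, ref_top, ref_left, SIZE)
reconstructed = add_blocks(decoded_prediction, residual)

print(residual)
print(reconstructed == source)
```

In a real encoder the residual would be transformed and quantized before storage, so the reconstructed block would differ slightly from the source block.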
A skip mode is a special case of an inter encoding mode. It relates a source block directly to a reference block in another frame (i.e. the two are predicted to be identical). Thus, the source block is predicted to have exactly the same contents as the reference block. The source block may then be encoded as data indicating that it is a skip mode and data indicating the identity of the reference block. Decoding is carried out by finding the identity of the reference block and copying its data to form the reconstructed block.
It can be useful to know the distortion caused by encoding a block of video or image data; for example, if it is desired to encode an image but retain a given image quality. The distortion is typically measured as the sum of squared differences between coefficients of the original source block and the coefficients of the reconstructed block. Knowing the distortion is also useful when deciding which encoding mode to use, as will be explained below.
It is important to select the best mode for encoding each block, as this is an important factor in the performance of the compression algorithm. There are two principal considerations when selecting the block encoding mode: the first is the distortion which results from the encoding (i.e. the difference between the source image and the reconstructed image after decoding) and the second is the number of bits required to encode the block. Sometimes the latter consideration is referred to as ‘bit rate’, which is the number of bits per second required to transmit the image at a given resolution. The bit rate is related to the overall number of bits required to encode the block. It is necessary not only to select between inter, intra and skip modes, but also to select the best type of inter encoding or best type of intra encoding.
One known theoretical method of choosing the best block encoding mode is to compute the rate-distortion cost of all the possible modes. The rate-distortion cost is a parameter that takes account of both the distortion caused by the encoding and the number of bits required to encode the block.
It is possible to encode and decode each block to find the distortion and bit rate for each mode directly. For example, in the H.264/AVC encoding process, the best macroblock encoding mode may be selected by computing the rate-distortion cost of all possible modes. The best mode is typically the one with minimum rate-distortion cost. The rate distortion cost for a given mode may be defined as:
$$J_{RD} = \mathrm{SSD}(S, C) + \lambda \cdot R \qquad \text{(EQUATION 1)}$$
where JRD represents the rate-distortion cost, λ is the Lagrange multiplier, R is the number of bits required to encode the block according to that mode, and SSD(S,C) is the sum of the squared differences (SSD) between the original block S and the reconstructed block C when that encoding mode is used. The sum of squared differences can be expressed as:
$$\mathrm{SSD}(S, C) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( s_{ij} - c_{ij} \right)^2 = \lVert S - C \rVert_F^2 \qquad \text{(EQUATION 2)}$$
where sij and cij are the (i,j)th elements of the current original block S and the reconstructed block C, respectively. Moreover, N is the image block size (N=4 in the H.264/AVC standard) and ∥·∥F is the Frobenius norm. We shall call SSD(S,C) a spatial-domain SSD, since the distortion computation is performed on spatial-domain pixel values. The inventors have found that the computation of a spatial-domain SSD is very time-consuming, since for each possible mode the reconstructed block must be obtained through transformation, quantization, inverse quantization, inverse transformation and pixel reconstruction. The above method of finding the best mode by calculating SSD(S,C) and the bit rate directly for each mode is called Rate Distortion Optimization (RDO). It can find the best mode accurately, but takes a lot of time and processing power.
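A minimal Python sketch of mode selection by rate-distortion cost, following Equations 1 and 2, is given below. It assumes that the reconstructed block and bit count for each candidate mode have already been obtained by fully encoding and decoding the block, which is the expensive step. The pixel values, bit counts, mode names and the Lagrange multiplier value are all hypothetical.

```python
def ssd(source, reconstructed):
    """Sum of squared differences between two equally sized blocks (Equation 2)."""
    return sum(
        (s - c) ** 2
        for row_s, row_c in zip(source, reconstructed)
        for s, c in zip(row_s, row_c)
    )


def rd_cost(source, reconstructed, bits, lagrange_multiplier):
    """Rate-distortion cost J = SSD(S, C) + lambda * R (Equation 1)."""
    return ssd(source, reconstructed) + lagrange_multiplier * bits


def select_best_mode(source, candidates, lagrange_multiplier):
    """Return the mode name with the minimum rate-distortion cost.

    `candidates` maps a mode name to a (reconstructed_block, bits) tuple.
    """
    return min(
        candidates,
        key=lambda mode: rd_cost(source, *candidates[mode], lagrange_multiplier),
    )


# Hypothetical 4x4 source block.
source = [
    [52, 55, 61, 66],
    [63, 59, 55, 90],
    [62, 59, 68, 113],
    [63, 58, 71, 122],
]

# Mode A reconstructs the block exactly but needs many bits.
recon_a = [row[:] for row in source]

# Mode B is slightly distorted (two pixels off by 2) but needs few bits.
recon_b = [row[:] for row in source]
recon_b[0][0] += 2
recon_b[3][3] -= 2

candidates = {"mode_a": (recon_a, 40), "mode_b": (recon_b, 12)}
LAMBDA = 2  # hypothetical Lagrange multiplier

cost_a = rd_cost(source, recon_a, 40, LAMBDA)  # 0 + 2 * 40 = 80
cost_b = rd_cost(source, recon_b, 12, LAMBDA)  # 8 + 2 * 12 = 32
best = select_best_mode(source, candidates, LAMBDA)
print(cost_a, cost_b, best)
```

With these values the slightly distorted mode wins because its bit saving outweighs its distortion; a smaller Lagrange multiplier would shift the balance towards the exact reconstruction.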
To accelerate the coding process, the JVT reference software version JM 6.1d estimates the rate-distortion cost by using a fast SAD-based cost function instead:
$$J_{SAD} = \begin{cases} \mathrm{SAD}(S, P) + \lambda_1 \cdot 4K & \text{if intra } 4 \times 4 \text{ mode} \\ \mathrm{SAD}(S, P) & \text{otherwise} \end{cases} \qquad \text{(EQUATION 3)}$$
where SAD(S,P) is the sum of absolute differences between the original block S and the predicted block P, λ1 is an approximately exponential function of the quantization parameter (QP) which is almost the square root of λ, and K is equal to 0 for the most probable mode and 1 for the other modes. The SAD(S,P) is expressed by:
$$\mathrm{SAD}(S, P) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| s_{ij} - p_{ij} \right| \qquad \text{(EQUATION 4)}$$
where sij and pij are the (i,j)th elements of the current original block S and the predicted block P, respectively. This SAD-based cost function saves a lot of computation, as the distortion part is based on the differences between the original block and the predicted block instead of the reconstructed block. However, this reduction in computation usually comes with a significant degradation of coding efficiency. To achieve better rate-distortion performance, JM 6.1d also provides an alternative SATD-based cost function:
$$J_{SATD} = \begin{cases} \mathrm{SATD}(S, P) + \lambda_1 \cdot 4K & \text{if intra } 4 \times 4 \text{ mode} \\ \mathrm{SATD}(S, P) & \text{otherwise} \end{cases} \qquad \text{(EQUATION 5)}$$
where SATD(S,P) is the sum of absolute Hadamard-transformed differences between the original block S and the predicted block P, which is given by:
$$\mathrm{SATD}(S, P) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| h_{ij} \right| \qquad \text{(EQUATION 6)}$$
where hij is the (i,j)th element of the Hadamard-transformed block H, which is the transform of the difference between the original block S and the predicted block P. The Hadamard-transformed block H is defined as:
$$H = T_H \, (S - P) \, T_H^{T} \qquad \text{(EQUATION 7)}$$
With:
$$T_H = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} \qquad \text{(EQUATION 8)}$$
Experimental results show that the JSATD can achieve better rate-distortion performance than the JSAD, but its overall rate-distortion performance is still lower than the optimized JRD (found by computing the rate distortion of each mode directly). Thus, neither SAD nor SATD-based functions can predict the real distortion accurately, and therefore they lead to selection of sub-optimum encoding modes which have a higher bit rate or higher distortion than the optimum.
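The SAD-based and SATD-based cost functions of Equations 3 to 8 can be sketched in Python as follows. This is an illustrative reading of the equations as written above, not the JM reference software itself: the function names, the sample blocks and the value chosen for λ1 are hypothetical, and no normalization beyond what the equations show is applied.

```python
# 4x4 Hadamard matrix T_H of Equation 8.
T_H = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def difference(source, predicted):
    """Element-wise difference S - P."""
    return [[s - p for s, p in zip(rs, rp)] for rs, rp in zip(source, predicted)]


def sad(source, predicted):
    """Sum of absolute differences (Equation 4)."""
    return sum(abs(v) for row in difference(source, predicted) for v in row)


def satd(source, predicted):
    """Sum of absolute Hadamard-transformed differences (Equations 6 and 7)."""
    h = matmul(matmul(T_H, difference(source, predicted)), transpose(T_H))
    return sum(abs(v) for row in h for v in row)


def fast_cost(distortion, lambda_1, is_intra_4x4, is_most_probable_mode):
    """Cost of Equations 3 and 5: distortion + lambda_1 * 4K for intra 4x4 modes."""
    if is_intra_4x4:
        k = 0 if is_most_probable_mode else 1
        return distortion + lambda_1 * 4 * k
    return distortion


# Hypothetical blocks: the prediction is flat, the source differs in one pixel.
predicted = [[100] * 4 for _ in range(4)]
source = [row[:] for row in predicted]
source[0][0] += 4

LAMBDA_1 = 3  # hypothetical value of lambda_1

print(sad(source, predicted))   # 4
print(satd(source, predicted))  # 64: one pixel spreads over all 16 coefficients
print(fast_cost(sad(source, predicted), LAMBDA_1, True, False))  # 4 + 3 * 4 = 16
```

Neither function needs the reconstructed block, which is why both are much cheaper than the spatial-domain SSD, and also why neither measures the distortion actually introduced by quantization.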
A rate-distortion performance comparison of H.264/AVC using RDO-based, SAD-based and SATD-based cost functions for different QPs (quantization parameters) and three well-known test sequences, in terms of PSNR and bit rate, is shown in the table in FIG. 1. As can be seen from the table, compared with an RDO-based encoder, the SAD-based and SATD-based cost functions are not good at selecting the mode having the best (lowest) rate-distortion cost.
In summary, computing the rate-distortion cost (hereinafter also referred to as rate-distortion) of each mode directly from the source and reconstructed blocks takes a lot of processing power and is not practical to carry out in real time without high-end computing hardware. Meanwhile, the SAD and SATD functions are not good at predicting the real rate-distortion caused by the encoding process and may result in sub-optimum modes being selected.