Video coding typically consists of two main parts: prediction and coding of the prediction error, also referred to as the residual. Prediction can be performed from previously coded pixels of the current frame, referred to as intra prediction, or from a previously coded frame, referred to as inter prediction. The residual is typically coded by a spatial 2D block transform, such as the Discrete Cosine Transform (DCT) or a DCT-like transform.
Spatial 2D transforms are typically used in image and video coding in order to exploit spatial correlation in image and video signals. In video coding, such transforms are applied to intra-frame or inter-frame prediction errors, i.e. residuals.
The size of the transform can vary. For example, the H.264 video standard employs transforms of different sizes, such as 4×4 and 8×8. The spatial 2D transform is commonly used to decorrelate the signal and to improve the compression efficiency. The spatial 2D transform 110 is applied to the blocks of the original image or to the blocks of a residual image to produce transform coefficients
c_i = \sum_{x=0}^{K-1} \sum_{y=0}^{L-1} r(x,y) \cdot b_i(x,y)

where r(x,y) is the residual at position (x,y), c_i is the transform coefficient for basis image i, b_i(x,y) is the 2D transform basis image i, and K×L is the size of the residual block.
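The forward transform above can be sketched in plain Python. The sketch below is illustrative only; it assumes an orthonormal 4×4 DCT-II whose 2D basis images are separable products of 1D basis vectors (the particular basis and block size are assumptions for the example, not mandated by the text):

```python
import math

def dct_basis_1d(K):
    """Orthonormal 1D DCT-II basis vectors f_u(x), u = 0..K-1."""
    basis = []
    for u in range(K):
        scale = math.sqrt(1.0 / K) if u == 0 else math.sqrt(2.0 / K)
        basis.append([scale * math.cos(math.pi * (2 * x + 1) * u / (2 * K))
                      for x in range(K)])
    return basis

def forward_transform(residual, K):
    """c_i = sum_x sum_y r(x,y) * b_i(x,y), with separable
    2D basis images b_{u,v}(x,y) = f_u(x) * f_v(y)."""
    f = dct_basis_1d(K)
    coeffs = []
    for u in range(K):
        for v in range(K):
            c = sum(residual[x][y] * f[u][x] * f[v][y]
                    for x in range(K) for y in range(K))
            coeffs.append(c)
    return coeffs

# A constant residual block maps all of its energy to the DC coefficient,
# illustrating the decorrelating, energy-compacting role of the transform.
block = [[8.0] * 4 for _ in range(4)]
coeffs = forward_transform(block, 4)
```

For a constant 4×4 block, only the first (DC) coefficient is non-zero; all 15 higher-frequency coefficients vanish.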
Then, the transform coefficients of the spatial 2D transform are quantized and scanned by a quantization and scanning unit 120 and entropy coded by an entropy encoder 130 as illustrated in FIG. 1. The decoder performs the inverse operations as illustrated in FIG. 2. The received bit stream is entropy decoded by an entropy decoder 140 to obtain the transform coefficients. Then a scaling and inverse scanning unit 150 performs scaling and inverse scanning, and an inverse transformer 160 performs an inverse transform to obtain the reconstructed residual block. It should be noted that scaling is the inverse operation to quantization. The residual block can be decoded by a weighted summation of the 2D transform basis images of the transform as described below:
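The quantization step and its inverse, scaling, can be illustrated with a minimal sketch. The uniform quantizer with a single step size below is an assumption made for clarity; actual codecs use more elaborate scaling matrices and rounding offsets:

```python
def quantize(coeffs, step):
    """Uniform quantization: map each transform coefficient to an
    integer level by dividing by the step size and rounding."""
    return [round(c / step) for c in coeffs]

def scale(levels, step):
    """Scaling, the inverse operation to quantization: recover an
    approximation of each coefficient from its integer level."""
    return [q * step for q in levels]

coeffs = [32.5, -7.2, 0.4, 3.9]
levels = quantize(coeffs, step=2.0)
rec = scale(levels, step=2.0)
# Quantization is lossy, but the reconstruction error of each
# coefficient is bounded by half the step size.
```

With step size 2.0 the reconstructed coefficients are [32.0, -8.0, 0.0, 4.0]; each differs from its original by at most 1.0.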
\hat{r}(x,y) = \sum_{i=0}^{N-1} \hat{c}_i \cdot \hat{b}_i(x,y) \qquad (Eq. 1)

where \hat{r}(x,y) is the reconstructed residual at position (x,y), \hat{c}_i are the inverse quantized transform coefficients for basis image i, \hat{b}_i(x,y) is the 2D transform basis image i, and N is the number of transform basis images. A 4×4 transform typically has 16 transform basis images.
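Eq. 1 can be demonstrated with a short round-trip sketch. Quantization is skipped here, so \hat{c}_i = c_i and the reconstruction is exact; the orthonormal separable DCT-II basis is an assumption chosen for the example:

```python
import math

def dct_basis_1d(K):
    """Orthonormal 1D DCT-II basis vectors."""
    basis = []
    for u in range(K):
        s = math.sqrt(1.0 / K) if u == 0 else math.sqrt(2.0 / K)
        basis.append([s * math.cos(math.pi * (2 * x + 1) * u / (2 * K))
                      for x in range(K)])
    return basis

def forward(residual, f, K):
    """Coefficients c_i as inner products with the N = K*K basis images."""
    return [sum(residual[x][y] * f[u][x] * f[v][y]
                for x in range(K) for y in range(K))
            for u in range(K) for v in range(K)]

def reconstruct(coeffs, f, K):
    """Eq. 1: weighted summation of the basis images, each weighted
    by its (here unquantized) coefficient."""
    r_hat = [[0.0] * K for _ in range(K)]
    for i, c in enumerate(coeffs):
        u, v = divmod(i, K)  # basis image index i -> separable pair (u, v)
        for x in range(K):
            for y in range(K):
                r_hat[x][y] += c * f[u][x] * f[v][y]
    return r_hat

K = 4
f = dct_basis_1d(K)
residual = [[float(x + y) for y in range(K)] for x in range(K)]
recon = reconstruct(forward(residual, f, K), f, K)
```

Because the basis is orthonormal and all 16 coefficients are retained, the weighted summation reproduces the residual block exactly.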
One of the spatial 2D transforms that can be used for image coding is the Karhunen-Loève Transform (KLT), which is optimal among all unitary transforms with respect to its energy compaction properties, i.e. the KLT packs most of the energy into the smallest number of coefficients. The KLT is basically a transform that is optimal for a given signal; the KLT therefore looks different for different signals. In this specification, different KLT-like transforms are derived for different types of intra-prediction residual, e.g. vertical, horizontal or diagonal. Hence, the KLT is derived by training on the original signal.
Accordingly, the KLT basis functions depend on the statistics of the original signal, and the parameters of the KLT should be communicated to the decoder. Alternatively, a significant amount of computation is needed to calculate the KLT based on the image statistics. These are limiting factors for using the KLT in image and video coders.
The Discrete Cosine Transform (DCT) and other transforms approximating the DCT are very popular in image and video coding due to the good energy compaction properties of the DCT and the existence of fast DCT algorithms. It has been found that the compression efficiency of the DCT is close to that of the KLT when the signal to be compressed is slowly changing, i.e. when the signal to be compressed has a correlation coefficient close to 1. Therefore, the DCT is considered a good approximation of the KLT for signals with positive correlation between samples, such as natural images, which often exhibit high spatial correlation.
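This closeness can be checked numerically. The sketch below is an illustration, not part of the original text: it assumes a first-order autoregressive (AR(1)) signal model with correlation coefficient ρ = 0.95 and shows that the DCT nearly diagonalizes its correlation matrix, i.e. the DCT acts almost like the KLT and packs most of the energy into the lowest coefficient:

```python
import math

K, rho = 8, 0.95

# AR(1) correlation matrix: R[i][j] = rho^|i-j| (assumed signal model)
R = [[rho ** abs(i - j) for j in range(K)] for i in range(K)]

# Orthonormal 1D DCT-II basis; rows are basis vectors
B = []
for u in range(K):
    s = math.sqrt(1.0 / K) if u == 0 else math.sqrt(2.0 / K)
    B.append([s * math.cos(math.pi * (2 * x + 1) * u / (2 * K))
              for x in range(K)])

# Transform-domain covariance T = B R B^T; the KLT would make T
# exactly diagonal, so near-diagonal T means the DCT is close to the KLT.
T = [[sum(B[u][x] * R[x][y] * B[v][y] for x in range(K) for y in range(K))
      for v in range(K)] for u in range(K)]

diag_energy = sum(T[u][u] ** 2 for u in range(K))
total_energy = sum(T[u][v] ** 2 for u in range(K) for v in range(K))
```

For ρ = 0.95 the diagonal of T carries almost all of the energy, and the DC variance T[0][0] dominates the other transform coefficients, confirming the energy compaction claim.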
Another transform that can be used in image and video coding is the Discrete Sine Transform (DST), which approximates the KLT when encoding signals with negative correlation.
There are some other KLT approximations that have been used in video coding, e.g. in the Mode-Dependent Directional Transforms (MDDT) tool in the Key Technical Area (KTA) software, which is VCEG software based on H.264. In MDDT, a set of KLT-like transforms is used to encode the intra-prediction residual. Each transform is optimized to fit the statistics of a particular type of image residual, namely the residuals corresponding to different intra-prediction directions. The transforms are optimized beforehand, and the transform is chosen based on the intra prediction mode of the current block. It has been shown that using MDDT instead of the DCT for encoding intra-coded blocks can provide better compression efficiency.
A transform consists of several basis images corresponding to different frequencies. Each basis image has a value at each position of the block. The simplest basis image corresponds to the lowest frequency, which represents the average of the signal. The lowest frequency ("DC") basis image of the spatial 2D DCT is flat, i.e. it has a constant level.
In contrast, the lowest frequency basis images of the discrete sine transform (DST) and of KLT-like transforms are often not flat and might have a curved or concave shape. FIG. 6 illustrates the values of the lowest-frequency basis function of the KLT, and FIG. 5 illustrates the "DC" basis image of the DST, which also has a concave form.
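The contrast between the two lowest-frequency basis vectors can be made concrete. The sketch below compares a 4-point orthonormal DCT-II with a DST-I; the DST-I variant is an assumption chosen here because its lowest basis vector has the concave, bell-like shape described (the text does not specify which DST variant is meant):

```python
import math

K = 4

# Lowest-frequency (u = 0) DCT-II basis vector: constant, i.e. flat "DC"
dct_dc = [math.sqrt(1.0 / K) for x in range(K)]

# Lowest-frequency DST-I basis vector: a half-period sine,
# rising toward the middle of the block and falling again (concave)
dst_dc = [math.sqrt(2.0 / (K + 1)) * math.sin(math.pi * (x + 1) / (K + 1))
          for x in range(K)]
```

The DCT vector is identical at every position, while the DST vector is symmetric about the block center with clearly larger values in the middle, which is the non-flat shape that matters in the artefact discussion below.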
For some image blocks, multiple coefficients for a block are sent to the decoder. The basis images corresponding to these coefficients are combined by a weighted addition, where each weight is given by the corresponding coefficient, producing an image approximating the block being encoded. However, it is common in video compression that only the lowest frequency coefficient is sent for the block. In the case of DCT-like transforms, such an image would represent just a constant level, or a level offset in the case of residual image coding. In the case of the DST or some other KLT-like transform with a curved lowest-frequency basis image, the bell-shaped basis image of the DST or KLT transform may result in reconstruction artefacts, especially in smooth or slowly changing areas of an image. Examples of such artefacts are shown in FIGS. 3 and 4. The artefacts are long vertical stripes that are absent in the original image or video and arise because of a combination of vertical intra prediction and the non-flat shape of the corresponding 16×16 MDDT lowest-frequency basis image of FIG. 6b.