Transmission and storage of video sequences are employed in many applications, e.g. TV broadcasts, internet video streaming services and video conferencing.
Video sequences in a raw format require a very large amount of data to be represented, as each second of a sequence may consist of tens of individual frames, each frame may contain hundreds of thousands of pixels or more, and each pixel is typically represented by at least 8 bits. In order to minimise the transmission and storage costs, video compression is applied to the raw video data. The aim is to represent the original information with as little capacity as possible, i.e. with as few bits as possible. Reducing the capacity needed to represent a video sequence will affect the video quality of the compressed sequence, i.e. its similarity to the original uncompressed video sequence.
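The size of the raw representation can be illustrated with a short calculation. The following is a minimal sketch; the frame size, frame rate and bit depth are hypothetical example values and are not taken from the text.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel):
    """Return the uncompressed bit rate in bits per second."""
    return width * height * frames_per_second * bits_per_pixel

# Hypothetical example values (assumptions, not from the text):
# 1920x1080 pixels per frame, 25 frames per second, 8 bits per pixel.
rate = raw_bit_rate(1920, 1080, 25, 8)
print(f"{rate / 1e6:.1f} Mbit/s")
```

Colour video carries additional chroma samples, so the figure for a real sequence would be higher still.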
State-of-the-art video encoders, such as those conforming to AVC/H.264, utilise four main processes to achieve the maximum level of video compression for a desired level of video quality of the compressed video sequence: prediction, transformation, quantisation and entropy coding. The prediction process exploits the temporal and spatial redundancy found in video sequences to greatly reduce the capacity required to represent the data. The mechanism used to predict data is known to both encoder and decoder, so only an error signal, or residual, must be sent to the decoder to reconstruct the original signal. This process is typically performed on blocks of data (e.g. 8×8 pixels) rather than on entire frames. The prediction is typically performed against already reconstructed frames or against blocks of reconstructed pixels belonging to the same frame.
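The relationship between a block, its prediction and the residual can be sketched as follows. The sample values and the function names are hypothetical and serve only to illustrate the principle; how the prediction itself is formed is not modelled here.

```python
def residual(block, prediction):
    """Error signal sent to the decoder: original minus prediction."""
    return [[b - p for b, p in zip(b_row, p_row)]
            for b_row, p_row in zip(block, prediction)]

def reconstruct(prediction, res):
    """Decoder side: the same prediction plus the received residual."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, res)]

# Hypothetical 4x4 block and a prediction that is close to it.
block = [[100, 102, 104, 106],
         [101, 103, 105, 107],
         [102, 104, 106, 108],
         [103, 105, 107, 109]]
prediction = [[100, 101, 104, 105],
              [101, 102, 105, 108],
              [101, 104, 106, 108],
              [103, 106, 107, 110]]

res = residual(block, prediction)
restored = reconstruct(prediction, res)
```

When the prediction is good, the residual values are small compared with the pixel values. In this sketch the residual is passed on unchanged, so the reconstruction is exact; in a real codec the residual is transformed and quantised first.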
The transformation process aims to exploit the correlation present in the residual signals. It does so by concentrating the energy of the signal into a few coefficients. Thus the transform coefficients typically require fewer bits to be represented than the pixels of the residual. H.264 uses 4×4 and 8×8 integer transforms based on the Discrete Cosine Transform (DCT).
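Energy concentration can be illustrated with a floating-point, orthonormal DCT (type II) applied to one row of samples. This is a minimal sketch: the input row is a hypothetical smooth signal, and the integer transforms used in H.264 are not reproduced here.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, v in enumerate(x)))
    return out

# Hypothetical, highly correlated input row.
row = [8, 7, 6, 5, 4, 3, 2, 1]
coeffs = dct_1d(row)

# The transform is orthonormal, so the total energy is preserved.
total_energy = sum(c * c for c in coeffs)
# Fraction of the energy held by the first two coefficients.
share_first_two = sum(c * c for c in coeffs[:2]) / total_energy
print(total_energy, share_first_two)
```

For this input, nearly all of the energy ends up in the first two of the eight coefficients, which is the property the entropy coder later benefits from.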
The capacity required to represent the data output by the transformation process may still be too high for many applications. Moreover, it is not possible to modify the transformation process in order to achieve the desired level of capacity for the compressed signal. The quantisation process addresses this by allowing a further reduction of the capacity needed to represent the signal. It should be noted that this process is lossy, i.e. the reconstructed sequence will differ from the original.
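The lossy nature of quantisation can be seen in a simple uniform scalar quantiser. This is a sketch under simplifying assumptions: the step size and coefficient values are hypothetical, and real codecs use integer arithmetic with scaling derived from a quantisation parameter rather than the floating-point division shown here.

```python
import math

def quantise(coeffs, step):
    """Map each coefficient to the nearest integer multiple of step."""
    levels = []
    for c in coeffs:
        magnitude = int(math.floor(abs(c) / step + 0.5))
        levels.append(magnitude if c >= 0 else -magnitude)
    return levels

def dequantise(levels, step):
    """Inverse scaling at the decoder: level times step."""
    return [level * step for level in levels]

# Hypothetical transform coefficients and step size.
coeffs = [52.3, -7.8, 3.1, 0.4]
levels = quantise(coeffs, 4)        # [13, -2, 1, 0]
restored = dequantise(levels, 4)    # [52, -8, 4, 0]
```

The restored values are close to, but not equal to, the original coefficients, and small coefficients become zero. A larger step reduces the capacity further at the cost of a larger reconstruction error.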
The entropy coding process takes all the non-zero quantised transform coefficients and processes them so that they are efficiently represented as a stream of bits. This requires reading, or scanning, the transform coefficients in a certain order to minimise the capacity required to represent the compressed video sequence.
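One commonly used scanning order is the zig-zag scan, which visits low-frequency coefficients first so that the zeros tend to cluster at the end of the list. The following sketch generates such an order for a square block; the quantised block is a hypothetical example, and the entropy coder itself is not modelled.

```python
def zigzag_order(n):
    """Return (row, column) positions of an n x n block in zig-zag order."""
    positions = [(r, c) for r in range(n) for c in range(n)]
    # Walk anti-diagonals (constant r + c), alternating direction each time.
    return sorted(positions,
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def scan(block):
    """Read a block in zig-zag order and drop the trailing zeros."""
    values = [block[r][c] for r, c in zigzag_order(len(block))]
    while values and values[-1] == 0:
        values.pop()
    return values

# Hypothetical quantised 4x4 block: energy sits in the top-left corner.
quantised = [[13, -2, 0, 0],
             [1,  0, 0, 0],
             [0,  0, 0, 0],
             [0,  0, 0, 0]]
scanned = scan(quantised)   # [13, -2, 1]
```

Only three values remain to be coded for the sixteen positions of this block.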
The above description applies to a video encoder; a video decoder will perform all of the above processes in roughly reverse order. In particular, the transformation process on the decoder side will require the use of the inverse of the transform used in the encoder. Similarly, entropy coding becomes entropy decoding and the quantisation process becomes inverse scaling. The prediction process is typically performed in exactly the same fashion in both encoder and decoder.
The present invention relates to the transformation part of the coding, thus a more thorough review of the transform process is presented here.
The statistical properties of the residual affect the ability of the transform (e.g. the DCT) to compact the energy of the input signal into a small number of coefficients. The residual shows very different statistical properties depending on the quality of the prediction and on whether the prediction exploits spatial or temporal redundancy. Other factors affecting the quality of the prediction are the size of the blocks being used and the spatial/temporal characteristics of the sequence being processed.
It is well known that the DCT approaches maximum energy compaction performance for highly correlated first-order Markov (Markov-I) signals. The energy compaction performance of the DCT starts to drop as the signal correlation becomes weaker. For instance, it is possible to show that the Discrete Sine Transform (DST) can outperform the DCT for input signals with lower adjacent correlation.
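The dependence on correlation can be examined numerically. The sketch below rests on several assumptions that are not stated in the text: the signal is modelled as a stationary first-order Markov process with covariance rho^|i-j|, the DCT is taken to be the orthonormal type II, the DST is taken to be the orthonormal type I, and energy compaction is measured by the transform coding gain (ratio of the arithmetic to the geometric mean of the coefficient variances).

```python
import math

def dct2_matrix(n):
    """Orthonormal DCT-II basis, one basis vector per row."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def dst1_matrix(n):
    """Orthonormal DST-I basis, one basis vector per row."""
    return [[math.sqrt(2.0 / (n + 1))
             * math.sin(math.pi * (i + 1) * (k + 1) / (n + 1))
             for i in range(n)] for k in range(n)]

def coding_gain(t, rho):
    """Coding gain of transform t for covariance R[i][j] = rho**|i-j|."""
    n = len(t)
    # Variance of coefficient k is t_k * R * t_k^T.
    variances = [sum(t[k][i] * t[k][j] * rho ** abs(i - j)
                     for i in range(n) for j in range(n))
                 for k in range(n)]
    arithmetic_mean = sum(variances) / n
    geometric_mean = math.exp(sum(math.log(v) for v in variances) / n)
    return arithmetic_mean / geometric_mean

for rho in (0.95, 0.5, 0.2):
    print(rho, coding_gain(dct2_matrix(8), rho), coding_gain(dst1_matrix(8), rho))
```

A higher gain means better energy compaction. With strong correlation (rho = 0.95) the DCT-II gain is the larger of the two; the printed values for the weaker correlations allow the two transforms to be compared under this particular model.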
The DCT and DST in image and video coding are normally used on blocks, i.e. 2D signals; this means that a one-dimensional transform is first performed in one direction (e.g. horizontal), followed by a one-dimensional transform performed in the other direction. As already mentioned, the energy compaction ability of a transform depends on the statistics of the input signal. It is possible, and indeed common under some circumstances, for the two-dimensional signal input to the transform to display different statistics along the vertical and horizontal axes. In this case it would be desirable to choose the best performing transform on each axis. A similar approach has already been attempted within the new ISO and ITU video coding standard under development, High Efficiency Video Coding (HEVC). In particular, a combination of two one-dimensional separable transforms, a DCT-like transform [2] and a DST [3], has been used in the HEVC standard under development.
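A separable two-dimensional transform with a different one-dimensional transform on each axis can be sketched as two matrix products. The choice of floating-point DCT-II and DST-I matrices, the function names and the input block are assumptions made for illustration; the integer transforms of HEVC are not reproduced here.

```python
import math

def dct2_matrix(n):
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def dst1_matrix(n):
    return [[math.sqrt(2.0 / (n + 1))
             * math.sin(math.pi * (i + 1) * (k + 1) / (n + 1))
             for i in range(n)] for k in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_2d(block, vertical, horizontal):
    """Transform rows with `horizontal`, then columns with `vertical`."""
    rows_done = matmul(block, transpose(horizontal))
    return matmul(vertical, rows_done)

def inverse_2d(coeffs, vertical, horizontal):
    """Both matrices are orthonormal, so the inverse is the transpose."""
    return matmul(transpose(vertical), matmul(coeffs, horizontal))

# Hypothetical 4x4 residual block.
block = [[3, 1, 0, -1],
         [5, 2, 1, 0],
         [6, 3, 1, 0],
         [8, 4, 2, 1]]
dct, dst = dct2_matrix(4), dst1_matrix(4)
coeffs = forward_2d(block, vertical=dct, horizontal=dst)
restored = inverse_2d(coeffs, vertical=dct, horizontal=dst)
```

Because the two directions are handled by independent matrices, any pairing (DCT/DCT, DCT/DST, DST/DCT, DST/DST) can be obtained by changing the two arguments.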
While previous coding standards based on the DCT use a two-dimensional transform (2D DCT), newer solutions apply a combination of DCT and DST to intra predicted blocks, i.e. to blocks that are spatially predicted. It has been shown that the DST is a better choice than the DCT for the transformation of rows when the directional prediction is from a direction that is closer to horizontal than vertical and, similarly, is a better choice for the transformation of columns when the directional prediction is closer to vertical. In the remaining direction (e.g. on rows, when the DST is applied on columns), the DCT is used.
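The selection rule described above can be written as a small lookup. This is only a sketch: the function name and the string labels are hypothetical, and the caller is assumed to have already classified the prediction direction as nearer to horizontal or nearer to vertical.

```python
def select_transforms(nearest_axis):
    """Choose the 1D transform for rows and columns of an intra block.

    nearest_axis: "horizontal" or "vertical", i.e. the axis that the
    directional prediction is closer to (hypothetical labels).
    """
    if nearest_axis == "horizontal":
        return {"rows": "DST", "columns": "DCT"}
    if nearest_axis == "vertical":
        return {"rows": "DCT", "columns": "DST"}
    raise ValueError("nearest_axis must be 'horizontal' or 'vertical'")

choice = select_transforms("vertical")   # {'rows': 'DCT', 'columns': 'DST'}
```

Cases that the paragraph does not cover, such as an exactly diagonal prediction, are deliberately left out of the sketch.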
For implementation purposes, it is common in video coding to use integer approximations of the DCT and DST, which in the rest of this text will simply be referred to as DCT and DST. One solution for an integer DCT-like transform uses a 16-bit intermediate data representation and is known as the partial butterfly. Its main properties are: the same (anti)symmetry properties as the DCT, almost orthogonal basis vectors, 16-bit data representation before and after each transform stage, 16-bit multipliers for all internal multiplications, and no need to correct for different norms of the basis vectors during (de)quantisation.
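These properties can be checked on a small example. The 4×4 matrix below is the one widely published for the HEVC core transform; its use here is an assumption, since the text does not list any coefficients. The sketch also omits the rounding and shifting stages that keep intermediate values within 16 bits.

```python
# Assumed 4x4 integer DCT-like matrix, one basis vector per row.
T4 = [[64,  64,  64,  64],
      [83,  36, -36, -83],
      [64, -64, -64,  64],
      [36, -83,  83, -36]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward_matrix(x):
    """Direct matrix multiplication: 16 multiplications."""
    return [dot(row, x) for row in T4]

def forward_partial_butterfly(x):
    """Same result, exploiting the (anti)symmetry of the basis vectors."""
    e0, e1 = x[0] + x[3], x[1] + x[2]   # even part (symmetric rows)
    o0, o1 = x[0] - x[3], x[1] - x[2]   # odd part (antisymmetric rows)
    return [64 * (e0 + e1),
            83 * o0 + 36 * o1,
            64 * (e0 - e1),
            36 * o0 - 83 * o1]

# Squared norms of the basis vectors and their mutual dot products.
row_norms = [dot(r, r) for r in T4]
cross_products = [dot(T4[i], T4[j]) for i in range(4) for j in range(i + 1, 4)]

sample = [10, -3, 7, 2]   # hypothetical input row
assert forward_matrix(sample) == forward_partial_butterfly(sample)
```

The squared norms differ only slightly from row to row, which is why no per-coefficient norm correction is needed during (de)quantisation, and the butterfly form needs fewer multiplications than the direct matrix product.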