Transform coding is a compression technique used in many audio, image and video compression systems. Uncompressed digital images and video are typically represented or captured as samples of picture elements or colors at locations in an image or video frame arranged in a two-dimensional (2D) grid. This is referred to as a spatial-domain representation of the image or video. For example, a typical format for a rectangular-shaped image consists of three two-dimensional arrays of 8-bit color samples. Each sample is a number representing the value of a color component at a spatial location in the grid, where each color component represents an amplitude along an axis within a color space such as RGB or YUV, among others. An individual sample in one of these arrays may be referred to as a pixel. (In other common usage, the term pixel also often refers to an n-tuple of n spatially co-located color component samples, for example a 3-tuple grouping of the R, G, and B color component values for a given spatial location; here, however, the term refers to a single scalar-valued sample.) Various image and video systems may use different color, spatial and temporal sampling resolutions. Similarly, digital audio is typically represented as a time-sampled audio signal stream. For example, a typical audio format consists of a stream of 16-bit samples representing audio signal amplitudes at regularly spaced time instants.
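As a rough illustration of the sample layouts described above, the following Python sketch (using NumPy) computes the raw size of a three-plane image of 8-bit samples and of a 16-bit mono audio stream. The 640x480 frame size, the 44.1 kHz sample rate and the helper names are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def raw_image_bytes(height: int, width: int, components: int = 3, bits_per_sample: int = 8) -> int:
    """Bytes needed for `components` uncompressed 2D arrays of samples."""
    return height * width * components * bits_per_sample // 8

def raw_audio_bytes(num_samples: int, channels: int = 1, bits_per_sample: int = 16) -> int:
    """Bytes needed for an uncompressed stream of amplitude samples."""
    return num_samples * channels * bits_per_sample // 8

# Hypothetical 640x480 image: three 2D arrays (e.g. R, G and B planes) of 8-bit samples.
# Each scalar entry of a plane is a "pixel" in the sense used above.
height, width = 480, 640
planes = [np.zeros((height, width), dtype=np.uint8) for _ in range(3)]
assert sum(p.nbytes for p in planes) == raw_image_bytes(height, width)

# Hypothetical one second of mono audio at 44.1 kHz: a stream of 16-bit amplitude samples.
audio = np.zeros(44_100, dtype=np.int16)
assert audio.nbytes == raw_audio_bytes(len(audio))
```

At these assumed sizes a single uncompressed frame occupies 921,600 bytes and one second of audio occupies 88,200 bytes.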
Uncompressed digital audio, image and video signals can consume considerable storage and transmission capacity. Transform coding reduces the quantity of data needed to represent such signals by transforming their spatial-domain (or time-domain) representation into a frequency-domain (or other transform-domain) representation, which enables the quantity of data to be reduced. The reduction is typically accomplished by quantization, by selectively discarding certain frequency components of the transform-domain representation, or by a combination of the two, followed by entropy encoding techniques such as adaptive Huffman encoding or adaptive arithmetic encoding. Quantization may be applied selectively, based on the estimated perceptual sensitivity of the individual frequency components or on other criteria. Appropriately applied, transform coding generally produces much less perceptible degradation of the digital signal than reducing the color sample fidelity or spatial resolution of images or video directly in the spatial domain, or of audio in the time domain.
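A minimal sketch of selective quantization followed by a rate estimate, assuming an 8x8 set of transform coefficients and a hypothetical weighting rule in which the quantizer step size grows with the frequency index. The coefficients are synthetic, and instead of a real adaptive Huffman or arithmetic coder the sketch only computes the zeroth-order empirical entropy of the quantized levels, a rough guide to the bits per coefficient an ideal memoryless entropy coder could reach. All function names are illustrative.

```python
import numpy as np
from collections import Counter

def quantize(coeffs: np.ndarray, steps: np.ndarray) -> np.ndarray:
    """Uniform scalar quantization with a per-frequency step size (round to nearest level)."""
    return np.round(coeffs / steps).astype(np.int64)

def dequantize(levels: np.ndarray, steps: np.ndarray) -> np.ndarray:
    """Decoder-side reconstruction: each level maps back to a multiple of its step size."""
    return levels * steps

def empirical_entropy_bits(symbols: np.ndarray) -> float:
    """Zeroth-order entropy in bits/symbol of the observed symbol frequencies."""
    counts = np.array(list(Counter(symbols.ravel().tolist()).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Hypothetical weighting for an 8x8 block: the step size grows with the frequency
# index u + v, assuming higher frequencies are less perceptually important.
u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
steps = 4.0 + 4.0 * (u + v)

# Synthetic stand-in for 1000 blocks of coefficients whose spread shrinks with frequency.
rng = np.random.default_rng(0)
coeffs = rng.normal(scale=100.0 / (1 + u + v), size=(1000, 8, 8))

levels = quantize(coeffs, steps)
print("fraction of zero levels:", np.mean(levels == 0))
print("entropy of levels (bits/coefficient):", empirical_entropy_bits(levels))
```

The reconstruction error of each coefficient is at most half its step size, so coarser steps at high frequencies trade fidelity where it is assumed to matter least for more zero levels and a lower entropy.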
More specifically, a typical block transform-based coding technology divides the uncompressed pixels of the digital image into fixed-size two-dimensional blocks ($X_1, \ldots, X_n$). A linear transform that performs spatial-frequency analysis is applied to each block, converting the spatial-domain samples within the block into a set of frequency (or transform) coefficients that generally represent the strength of the digital signal in the corresponding frequency bands over the block interval. For compression, the transform coefficients may be quantized (i.e., reduced in precision, such as by dropping least significant bits of the coefficient values or otherwise mapping values in a higher-precision number set to a lower-precision one) and then entropy or variable-length coded into a compressed data stream. At decoding, the transform coefficients are inverse-quantized and inversely transformed back into the spatial domain to approximately reconstruct the original color/spatially sampled image or video signal (reconstructed blocks $\hat{X}_1, \ldots, \hat{X}_n$).
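A minimal Python/NumPy sketch of this pipeline, assuming 8x8 blocks, an orthonormal DCT-II as the linear transform, a single uniform quantizer step size, and image dimensions that are multiples of 8. The function names are hypothetical, and entropy coding is left out because it is lossless and does not change the reconstruction.

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def to_blocks(img: np.ndarray, n: int) -> np.ndarray:
    """Split an H x W array (H and W multiples of n) into an (H/n, W/n, n, n) array of blocks."""
    h, w = img.shape
    return img.reshape(h // n, n, w // n, n).swapaxes(1, 2)

def from_blocks(blocks: np.ndarray) -> np.ndarray:
    """Inverse of to_blocks: reassemble the blocks into one H x W array."""
    bh, bw, n, _ = blocks.shape
    return blocks.swapaxes(1, 2).reshape(bh * n, bw * n)

def code_and_decode(img: np.ndarray, step: float, n: int = 8) -> np.ndarray:
    c = dct_matrix(n)
    x = to_blocks(img.astype(float), n)
    coeffs = c @ x @ c.T                 # forward 2D DCT of every block X_i
    levels = np.round(coeffs / step)     # quantization: the only lossy step
    # Entropy coding of `levels` would go here; it is lossless, so it is omitted.
    x_hat = c.T @ (levels * step) @ c    # inverse quantization, then inverse 2D DCT
    return from_blocks(x_hat)            # reconstructed blocks assembled into an image

img = np.random.default_rng(1).integers(0, 256, size=(64, 64))
reconstructed = code_and_decode(img, step=16.0)
```

Because the DCT matrix is orthonormal, the squared reconstruction error of a block equals the squared quantization error of its coefficients, so no reconstructed sample can be off by more than n * step / 2 (4 * step for 8x8 blocks).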
The ability to exploit the correlation of samples in a block and thus maximize compression capability is a major requirement in transform design. In many block transform-based coding applications, the transform should be reversible to support both lossy and lossless compression, depending on the quantization operation applied in the transformed domain. With no quantization applied, for example, an encoding technology utilizing a reversible transform can enable the exact reproduction of the input data upon application of the corresponding decoding process. However, the requirement of reversibility in these applications constrains the choice of transforms upon which the coding technology can be designed. The implementation complexity of a transform is another important design constraint. Thus, transform designs are often chosen so that the application of the forward and inverse transforms involves only multiplications by small integers and other simple mathematical operations such as additions, subtractions, and shift operations, so that fast integer implementations with minimal dynamic range expansion can be obtained.
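To make "reversible" and "additions, subtractions, and shift operations" concrete, the sketch below uses the well-known S-transform, an integer version of the 2-point Haar transform built from lifting steps. It is chosen purely as an illustration of the design constraints above and is not the transform of this document; the function names are hypothetical, and the input length is assumed to be even.

```python
def s_transform_forward(a: int, b: int) -> tuple[int, int]:
    """Integer Haar ('S') transform of a sample pair using one lifting step.
    Returns (s, d): s == floor((a + b) / 2) is a low-pass term, d == a - b a high-pass term."""
    d = a - b
    s = b + (d >> 1)      # arithmetic shift = floor division by 2, also for negative d
    return s, d

def s_transform_inverse(s: int, d: int) -> tuple[int, int]:
    """Undo the lifting steps in reverse order; exact for all integer inputs."""
    b = s - (d >> 1)
    a = d + b
    return a, b

def forward_1d(x: list[int]) -> list[int]:
    """Transform consecutive sample pairs; output is [s0, s1, ..., d0, d1, ...]."""
    pairs = [s_transform_forward(x[i], x[i + 1]) for i in range(0, len(x), 2)]
    return [s for s, _ in pairs] + [d for _, d in pairs]

def inverse_1d(y: list[int]) -> list[int]:
    """Rebuild the original samples pair by pair from the low-pass and high-pass halves."""
    half = len(y) // 2
    out: list[int] = []
    for s, d in zip(y[:half], y[half:]):
        out.extend(s_transform_inverse(s, d))
    return out

samples = [12, 10, 200, 201, 0, 255, -7, 3]
coeffs = forward_1d(samples)
assert inverse_1d(coeffs) == samples   # lossless when no quantization is applied
```

For 8-bit input, s stays within 0..255 while d spans -255..255, so the dynamic range grows by only one bit, and only in the high-pass term.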
Many image and video compression systems, such as those based on the JPEG (ITU-T T.81 | ISO/IEC 10918-1) and MPEG-2 (ITU-T H.262 | ISO/IEC 13818-2) standards, among others, use transforms based on the Discrete Cosine Transform (DCT). The DCT is known to have favorable energy compaction properties: for highly correlated input, such as typical image data, it concentrates most of the signal energy into a small number of low-frequency coefficients. The DCT is described in N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform,” IEEE Transactions on Computers, vol. C-23, pp. 90-93, January 1974.
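The energy compaction property can be checked numerically. This sketch (Python/NumPy, with hypothetical helper names) applies an orthonormal 8-point DCT-II to a linear ramp, used here as an assumed stand-in for a smooth, highly correlated block, and compares the share of total energy in the first two samples with the share in the first two coefficients.

```python
import numpy as np

def dct_ii(x: np.ndarray) -> np.ndarray:
    """Orthonormal 1D DCT-II, computed directly from its definition (O(N^2), for clarity)."""
    n = len(x)
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)
    return basis @ x

def energy_in_first(v: np.ndarray, m: int) -> float:
    """Fraction of total energy (sum of squares) held by the first m entries."""
    return float(np.sum(v[:m] ** 2) / np.sum(v ** 2))

ramp = np.arange(8, dtype=float)       # smooth, highly correlated 8-sample block
coeffs = dct_ii(ramp)
print("first 2 samples:", energy_in_first(ramp, 2))
print("first 2 DCT coefficients:", energy_in_first(coeffs, 2))
```

For the ramp, the first two samples hold under 1% of the energy, while the first two DCT coefficients hold over 99% of it; since the transform is orthonormal, the total energy is unchanged.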
When compressing a still image (or an intra-coded frame in a video sequence), most common standards, such as JPEG and MPEG-2, partition the arrays representing the image into 8×8 areas and apply a block transform to each such area. In these designs, the transform coefficients in a given partition (commonly known as a block) are influenced only by the sample values within the block region. In image and video coding, quantizing these independently transformed blocks can result in discontinuities at block boundaries, producing visually annoying artifacts known as blocking artifacts or blocking effects. Similarly, for audio data, when non-overlapping blocks are independently transform coded, quantization errors produce discontinuities in the signal at the block boundaries when the decoder reconstructs the audio signal; these may be heard as a periodic clicking effect.
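A small numerical sketch of how independent block coding creates a boundary discontinuity: a smooth 16-sample ramp is split into two non-overlapping 8-sample blocks, each block is DCT-coded with an assumed coarse quantizer step of 16, and the reconstruction is inspected. The signal, the step size and the function names are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def code_blocks_independently(x: np.ndarray, n: int, step: float) -> np.ndarray:
    """Transform, quantize and reconstruct each length-n block on its own (no overlap)."""
    c = dct_matrix(n)
    blocks = x.reshape(-1, n)                   # one row per block
    coeffs = blocks @ c.T                       # forward DCT of every block
    rec = (np.round(coeffs / step) * step) @ c  # quantize, dequantize, inverse DCT
    return rec.ravel()

signal = np.arange(16, dtype=float)    # smooth ramp spanning two 8-sample blocks
rec = code_blocks_independently(signal, n=8, step=16.0)
diffs = np.diff(rec)
print("jump across the block boundary:", diffs[7])
print("largest jump inside a block:", np.abs(np.delete(diffs, 7)).max())
```

With this step size every AC coefficient quantizes to zero, so each block is reconstructed as a flat segment: the unit-slope ramp becomes a staircase with a jump of about 5.7 at the block boundary, while samples inside each block no longer differ at all.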
Techniques used to mitigate blocking artifacts include deblocking filters, which smooth the signal values across inter-block edge boundaries, and spatial extrapolation, which encodes the differences between the raw input data and a prediction from neighboring block edges. These techniques are not without their flaws. For instance, the deblocking filter approach is “open loop”: the forward transform process does not ordinarily take into account the fact that the decoder will perform deblocking after the inverse transform. Also, both techniques require significant computational resources to implement.
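As a hedged illustration of the deblocking-filter idea (not any particular standard's filter), this sketch smooths the two samples on either side of each block boundary of a decoded 1D signal with a [1, 2, 1]/4 kernel. The function name and kernel are assumptions. The filter runs purely as post-processing on the decoder output, which is the "open loop" property described above.

```python
import numpy as np

def deblock_1d(x: np.ndarray, n: int) -> np.ndarray:
    """Smooth the samples straddling each boundary between length-n blocks (n >= 2).

    Each boundary sample is replaced by a [1, 2, 1] / 4 weighted average of itself and
    its two neighbors, computed from the unfiltered input. This is a deliberately
    simple, hypothetical filter; practical codecs use adaptive ones.
    """
    y = x.astype(float)                  # astype returns a copy, so x is left unchanged
    for b in range(n, len(x), n):        # b: index of the first sample of a block
        p, q = b - 1, b                  # last sample of the left block, first of the right
        y[p] = (x[p - 1] + 2 * x[p] + x[q]) / 4
        y[q] = (x[p] + 2 * x[q] + x[q + 1]) / 4
    return y

decoded = np.array([0.0] * 8 + [8.0] * 8)    # decoded signal with a blocking step
smoothed = deblock_1d(decoded, n=8)
```

On this hard step from 0 to 8, the filtered boundary samples become 2 and 6, halving the jump, while samples away from the boundary are untouched.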
Another approach to reducing blocking effects is to use a lapped transform, as described in H. Malvar, “Signal Processing with Lapped Transforms,” Artech House, Norwood, Mass., 1992. A lapped transform is a transform whose input region spans, besides the data samples in the current block, some adjacent samples in neighboring blocks. Likewise, on the reconstruction side, the inverse lapped transform influences some decoded data samples in neighboring blocks as well as the data samples of the current block. Thus, the inverse transform can preserve continuity across block boundaries even in the presence of quantization, which reduces blocking effects. Another advantage of a lapped transform is that it can exploit cross-block correlation, which yields greater compression capability.
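One well-known lapped transform is the modulated lapped transform (MLT), which can be computed as a modified discrete cosine transform (MDCT) with a sine window. The sketch below (Python/NumPy, 1D only, hypothetical function names, even block size N assumed) computes N coefficients from each 2N-sample frame and advances N samples per frame, so every frame reaches into its neighboring blocks. Without quantization, windowed overlap-add of the inverse reconstructs the input exactly. It illustrates lapping in general and is not the 2D lapped biorthogonal transform.

```python
import numpy as np

def mdct_basis(n: int) -> np.ndarray:
    """N x 2N MDCT cosine kernel: N coefficients from 2N input samples (n even)."""
    k = np.arange(n)[:, None]
    t = np.arange(2 * n)[None, :]
    return np.cos(np.pi / n * (t + 0.5 + n / 2) * (k + 0.5))

def sine_window(n: int) -> np.ndarray:
    """Symmetric window with w[t]^2 + w[t+N]^2 = 1 (Princen-Bradley condition)."""
    t = np.arange(2 * n)
    return np.sin(np.pi * (t + 0.5) / (2 * n))

def lapped_forward(x: np.ndarray, n: int) -> np.ndarray:
    """MDCT of windowed 2N-sample frames taken every N samples; len(x) % n == 0.
    Returns an array of shape (len(x) // n + 1, n)."""
    padded = np.concatenate([np.zeros(n), x, np.zeros(n)])
    frames = np.stack([padded[j:j + 2 * n] for j in range(0, len(padded) - n, n)])
    return (frames * sine_window(n)) @ mdct_basis(n).T

def lapped_inverse(coeffs: np.ndarray, n: int) -> np.ndarray:
    """Inverse MDCT of every frame, window again, then overlap-add: the time-domain
    aliasing of adjacent frames cancels in the shared half (TDAC)."""
    frames = (2.0 / n) * (coeffs @ mdct_basis(n)) * sine_window(n)
    out = np.zeros((len(coeffs) + 1) * n)
    for j, frame in enumerate(frames):
        out[j * n:j * n + 2 * n] += frame
    return out[n:-n]    # drop the zero padding added by lapped_forward

signal = np.random.default_rng(2).normal(size=64)
coeffs = lapped_forward(signal, n=8)       # 9 frames x 8 coefficients
restored = lapped_inverse(coeffs, n=8)
print(np.max(np.abs(restored - signal)))   # floating-point round-off only
```

Each inverse frame is tapered by the window toward its ends and overlaps its neighbors, so quantization error in one frame's coefficients is spread across the shared region rather than appearing as a step at a block edge.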
For the case of 2-dimensional (2D) data, the lapped 2D transform is a function of the current block, together with select elements of blocks to the left, above, right, below and possibly of the above-left, above-right, below-left and below-right blocks. The number of data samples in neighboring blocks that are used to compute the current transform is referred to as the amount of overlap.
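To make the "amount of overlap" concrete for 2D data, the sketch below computes which input samples a lapped transform of block (bi, bj) would read. It assumes square n x n blocks and a hypothetical parameter m giving how many samples are borrowed from each neighboring side (which also pulls in the diagonal neighbors at the corners), clipped at the image border. The overlap is the number of those samples lying outside the current block.

```python
def support_region(bi: int, bj: int, n: int, m: int, height: int, width: int):
    """Half-open row and column ranges [start, stop) of the samples read when transforming
    block (bi, bj), with m samples borrowed from each side and clipping at image borders."""
    rows = (max(bi * n - m, 0), min((bi + 1) * n + m, height))
    cols = (max(bj * n - m, 0), min((bj + 1) * n + m, width))
    return rows, cols

def overlap_amount(bi: int, bj: int, n: int, m: int, height: int, width: int) -> int:
    """Number of samples from neighboring blocks used by the current block's transform."""
    (r0, r1), (c0, c1) = support_region(bi, bj, n, m, height, width)
    return (r1 - r0) * (c1 - c0) - n * n

# 4x4 blocks in a 16x16 image, borrowing 2 samples from each side.
print(support_region(1, 1, n=4, m=2, height=16, width=16))  # ((2, 10), (2, 10))
print(overlap_amount(1, 1, n=4, m=2, height=16, width=16))  # 48: 32 from edge neighbors, 16 from corners
print(overlap_amount(0, 0, n=4, m=2, height=16, width=16))  # 20: clipped at the image corner
```

With m = 0 the overlap is zero and the scheme reduces to an ordinary non-overlapping block transform.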
For picture (image) compression, one of the best-performing transforms in terms of rate-distortion performance is the lapped biorthogonal transform (LBT). See, H. S. Malvar, “Biorthogonal And Nonuniform Lapped Transforms For Transform Coding With Reduced Blocking And Ringing Artifacts,” IEEE Trans. on Signal Processing, vol. 46, pp. 1043-1053, April 1998.