Block Transform-Based Coding
Transform coding is a compression technique used in many audio, image and video compression systems. Uncompressed digital image and video is typically represented or captured as samples of picture elements or colors at locations in an image or video frame arranged in a two-dimensional (2D) grid. This is referred to as a spatial-domain representation of the image or video. For example, a typical format for images consists of a stream of 24-bit color picture element samples arranged as a grid. Each sample is a number representing color components at a pixel location in the grid within a color space, such as RGB or YIQ, among others. Various image and video systems may use different color, spatial and time resolutions of sampling. Similarly, digital audio is typically represented as a time-sampled audio signal stream. For example, a typical audio format consists of a stream of 16-bit amplitude samples of an audio signal taken at regular time intervals.
Uncompressed digital audio, image and video signals can consume considerable storage and transmission capacity. Transform coding reduces the size of digital audio, images and video by transforming the spatial-domain representation of the signal into a frequency-domain (or other like transform domain) representation, and then reducing resolution of certain generally less perceptible frequency components of the transform-domain representation. This generally produces much less perceptible degradation of the digital signal compared to reducing color or spatial resolution of images or video in the spatial domain, or of audio in the time domain.
More specifically, a typical block transform-based codec 100 shown in FIG. 1 divides the uncompressed digital image's pixels into fixed-size two-dimensional blocks (X1, . . . Xn), each block possibly overlapping with other blocks. A linear transform 120-121 that does spatial-frequency analysis is applied to each block, which converts the spaced samples within the block to a set of frequency (or transform) coefficients generally representing the strength of the digital signal in corresponding frequency bands over the block interval. For compression, the transform coefficients may be selectively quantized 130 (i.e., reduced in resolution, such as by dropping least significant bits of the coefficient values or otherwise mapping values in a higher resolution number set to a lower resolution), and also entropy or variable-length coded 130 into a compressed data stream. At decoding, the transform coefficients are inversely transformed 170-171 to nearly reconstruct the original color/spatial sampled image/video signal (reconstructed blocks {circumflex over (X)}1, . . . {circumflex over (X)}n).
The block transform 120-121 can be defined as a mathematical operation on a vector x of size N. Most often, the operation is a linear multiplication, producing the transform domain output y=Mx, M being the transform matrix. When the input data is arbitrarily long, it is segmented into N sized vectors and a block transform is applied to each segment. For the purpose of data compression, reversible block transforms are chosen. In other words, the matrix M is invertible. In multiple dimensions (e.g., for image and video), block transforms are typically implemented as separable operations. The matrix multiplication is applied separably along each dimension of the data (i.e., both rows and columns).
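As an illustrative sketch (in Python with NumPy; the 2-point matrix M and the input values here are arbitrary choices, not taken from any particular codec), the forward block transform segments the input into N-sized vectors and computes y = Mx for each block, and the inverse matrix recovers the data because M is invertible:

```python
import numpy as np

def block_transform(data, M):
    """Segment a 1D signal into N-sized blocks and apply y = M x to each.

    data: 1D array whose length is a multiple of N (N = M.shape[0]).
    Returns one column of transform coefficients per block.
    """
    N = M.shape[0]
    blocks = data.reshape(-1, N).T     # each column is one block x
    return M @ blocks                  # y = M x for every block at once

# Illustrative 2-point transform (a scaled Hadamard matrix); any invertible
# M works, and invertibility is what makes the transform usable for coding.
M = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

x = np.array([3.0, 1.0, 4.0, 1.0])
y = block_transform(x, M)

# Because M is invertible, applying M^{-1} recovers the original blocks.
x_rec = (np.linalg.inv(M) @ y).T.reshape(-1)
```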
For compression, the transform coefficients (components of vector y) may be selectively quantized (i.e., reduced in resolution, such as by dropping least significant bits of the coefficient values or otherwise mapping values in a higher resolution number set to a lower resolution), and also entropy or variable-length coded into a compressed data stream.
At decoding, the inverse of these operations (dequantization/entropy decoding 160 and inverse block transform 170-171) is applied on the decoder 150 side, as shown in FIG. 1. While reconstructing the data, the inverse matrix M−1 (inverse transform 170-171) is applied as a multiplier to the transform domain data, nearly reconstructing the original time-domain or spatial-domain digital media.
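The quantize/dequantize/inverse-transform path can be sketched as follows (the uniform step size and the 2-point orthonormal matrix are illustrative assumptions, not parameters of any actual codec):

```python
import numpy as np

# Illustrative orthonormal 2-point transform; since M is orthonormal,
# the inverse M^{-1} is simply the transpose M.T.
M = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

x = np.array([100.0, 96.0])     # one spatial-domain block
y = M @ x                       # forward transform

step = 4                        # quantization factor (step = 1 would be lossless here)
q = np.rint(y / step)           # quantize: map to a lower-resolution number set
y_hat = q * step                # dequantize on the decoder side
x_hat = M.T @ y_hat             # inverse transform: nearly reconstructs x
```

The reconstruction error is bounded by the quantization step, illustrating "nearly reconstructs": coarser steps trade fidelity for fewer bits.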
In many block transform-based coding applications, the transform is desirably reversible to support both lossy and lossless compression depending on the quantization factor. With no quantization (generally represented as a quantization factor of 1) for example, a codec utilizing a reversible transform can exactly reproduce the input data at decoding. However, the requirement of reversibility in these applications constrains the choice of transforms upon which the codec can be designed.
Many image and video compression systems, such as MPEG and Windows Media, among others, utilize transforms based on the Discrete Cosine Transform (DCT). The DCT is known to have favorable energy compaction properties that result in near-optimal data compression. In these compression systems, the inverse DCT (IDCT) is employed in the reconstruction loops in both the encoder and the decoder of the compression system for reconstructing individual image blocks. The DCT is described by N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform,” IEEE Transactions on Computers, C-23 (January 1974), pp. 90-93. An exemplary implementation of the IDCT is described in “IEEE Standard Specification for the Implementations of 8×8 Inverse Discrete Cosine Transform,” IEEE Std. 1180-1990, Dec. 6, 1990.
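The energy compaction property can be illustrated numerically. The sketch below builds an orthonormal 8-point DCT-II matrix from its textbook definition and applies it to a smooth ramp (a stand-in for typical low-frequency image content):

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal N-point DCT-II matrix: entry (k, n) = c_k cos(pi (2n+1) k / 2N)."""
    n = np.arange(N)
    M = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N)) * np.sqrt(2.0 / N)
    M[0, :] = np.sqrt(1.0 / N)      # DC row gets the smaller scale factor
    return M

D = dct_matrix(8)
x = np.arange(8, dtype=float)       # a smooth ramp signal
y = D @ x

# Most of the signal energy compacts into the first few coefficients,
# which is what makes coarse quantization of the rest cheap.
energy_fraction = np.sum(y[:2] ** 2) / np.sum(y ** 2)
```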
Conventional data transforms used to implement a reversible 2D data compressor have generally suffered one or more of the following primary disadvantages:
1. Unequal norms between transform coefficients, requiring complicated entropy coding schemes;
2. Poor approximations to optimal transforms, such as the DCT; and
3. High computational complexity.
Conventional Implementation of 2D Transform
A separable 2D transform is typically implemented by performing 1D transforms on the rows of the data, followed by 1D transforms on its columns (or vice versa). See, A. K. Jain, Fundamentals of Digital Image Processing, Prentice Hall, 1989. In matrix notation, let T represent the transform matrix and X be the 2D data. The separable 2D transform with T is defined by Y in the following equation:

Y = T X T′  (1)
Indeed, the row-wise and column-wise transforms may be distinct. For instance, the data matrix could be non-square (say of size 4×8), or the row-wise and column-wise transforms could be the DCT and discrete sine transform (DST) respectively. In this case, the pre- and post-multipliers are different (say T1 and T2) and the transform Y is given by

Y = T1 X T2′  (2)
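Equations (1) and (2) can be checked numerically. In the sketch below, the 4×8 data block and the random stand-in matrices T1 and T2 are purely illustrative; the point is that pre-multiplying transforms the columns, post-multiplying transforms the rows, and the order of the two 1D passes does not matter:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # non-square 2D data, as in the 4x8 example

T1 = rng.standard_normal((4, 4))   # stand-in column-wise transform (e.g., a DCT)
T2 = rng.standard_normal((8, 8))   # stand-in row-wise transform (e.g., a DST)

Y = T1 @ X @ T2.T                  # equation (2): Y = T1 X T2'

# Same result computed as two separable 1D passes, in either order:
cols_then_rows = (T1 @ X) @ T2.T   # transform columns first, then rows
rows_then_cols = T1 @ (X @ T2.T)   # transform rows first, then columns
```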
For example, FIG. 2 shows a 2D 4×4 DCT implemented in two stages. In the first stage, the columns of the data matrix are transformed using a 4 point 1D DCT. In the second stage, 4 point 1D DCTs are applied along the rows. With infinite arithmetic accuracy, this ordering may be switched with no change in the output.
The 4 point 1D DCT can be implemented as a sequence of multiplication and addition operations on the 4 input data values, as represented in the signal flow graph shown in FIG. 3. The values c and s in this diagram are respectively cosine and sine of π/8. The separable transform approach works well for a lossy codec. Lossless codecs are more challenging to realize. Even with unit quantization, the separable 2D DCT described above in conjunction with its separable inverse DCT or IDCT is not guaranteed to produce a bit exact match to the original input. This is because the divisors in FIG. 3 give rise to rounding errors that may not cancel out between the encoder and decoder.
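This failure of bit exactness is easy to demonstrate. The sketch below rounds the coefficients of an orthonormal 4-point DCT to integers on both the forward and inverse paths (a stand-in for the fixed-point divisions of FIG. 3, not a reproduction of that flow graph):

```python
import numpy as np

def dct4():
    """Orthonormal 4-point DCT-II matrix."""
    n = np.arange(4)
    M = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / 8.0) * np.sqrt(0.5)
    M[0, :] = 0.5
    return M

M = dct4()
x = np.array([1.0, 0.0, 0.0, 0.0])  # an input for which rounding errors do not cancel

y = np.rint(M @ x)                  # forward DCT, coefficients rounded to integers
x_hat = np.rint(M.T @ y)            # inverse DCT, also rounded to integers

exact = np.array_equal(x_hat, x)    # the round trip is not bit exact
```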
Lifting
In order to achieve lossless compression with a block transform-based codec, it is necessary to replace the above-described 4×4 2D DCT with a lossless transform. A separable transform may be used only if each 1D transform is lossless or reversible. Although multiple choices exist for reversible 1D transforms, those based on “lifting” are by far the most desirable. Lifting is a process of performing a matrix-vector multiplication using successive “shears.” A shear is defined as a multiplication of the operand vector with a matrix which is an identity matrix plus one non-zero off-diagonal element. Sign inversion of one or more vector coefficients may occur anywhere during this process, without loss of generality.
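The reason lifting yields reversible transforms is that a shear stays exactly invertible even when its multiplication is rounded: the inverse subtracts the identically rounded product, so the rounding cancels bit for bit. A sketch (the coefficient 3/8 and the rounding rule are arbitrary illustrations):

```python
def shear_forward(x0, x1):
    """One lifting step: multiply by [[1, 0], [3/8, 1]], with integer rounding."""
    return x0, x1 + (3 * x0 + 4) // 8    # rounded approximation of x1 + (3/8) x0

def shear_inverse(x0, y1):
    """Exact inverse: subtract the identically computed rounded product."""
    return x0, y1 - (3 * x0 + 4) // 8

# Reversible for every integer input, despite the rounding: the same rounded
# term is added on the forward side and subtracted on the inverse side.
ok = all(shear_inverse(*shear_forward(a, b)) == (a, b)
         for a in range(-50, 50) for b in range(-50, 50))
```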
Lifting has been implemented through ladder or lattice filter structures in the past. Lifting or successive shear based techniques have been used in graphics. See, A. Tanaka, M. Kameyama, S. Kazama, and O. Watanabe, “A rotation method for raster image using skew transformation,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 272-277, June 1986; and A. W. Paeth, “A fast algorithm for general raster rotation,” Proceedings of Graphics Interface '86, pages 77-81, May 1986. In fact, it can be argued that Gauss-Jordan elimination is a manifestation of lifting.
One simple 2 point operation is the Hadamard transform, given by the transform matrix

TH = (1/√2)·[1 1; 1 −1].

Two approaches are commonly employed for implementing a lifting-based (reversible) 1D Hadamard transform. The first is to implement the normalized or scale-free Hadamard transform in lifting steps, as shown in FIG. 4. The second approach is to allow the scales to differ between the two transform coefficients, as shown in FIG. 5.
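The second approach can be sketched with two trivial lifting steps in the style of the classic S-transform (an assumed construction; FIG. 5 itself may differ in details such as signs):

```python
def hadamard_forward(a, b):
    """Reversible Hadamard via two trivial lifting steps (S-transform style).

    h carries the high-pass term scaled up by sqrt(2) relative to the
    normalized Hadamard; l carries the low-pass term scaled down by 1/sqrt(2),
    so the two outputs differ in resolution by one bit.
    """
    h = a - b              # trivial shear: a subtraction
    l = b + (h >> 1)       # trivial shear: add half (floor shift), no multiply
    return l, h

def hadamard_inverse(l, h):
    b = l - (h >> 1)       # the identical floored term cancels exactly
    a = h + b
    return a, b

# Exactly reversible over a range of integer inputs, including negatives.
ok = all(hadamard_inverse(*hadamard_forward(a, b)) == (a, b)
         for a in range(-40, 40) for b in range(-40, 40))
```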
Problems with Lifting
Lifting is not without its problems. In the first Hadamard transform approach shown in FIG. 4, the two transform coefficients are normalized. This is desirable for realizing multi-stage transforms, such as the 4 or 8 point DCT. However, this implementation suffers from two major drawbacks: first, each 2 point Hadamard transform requires three non-trivial (i.e., computationally expensive) lifting steps, and second, rounding errors in the lifting steps cause low pass energy to “leak” into the high frequency term, leading to reduced compression efficiency. In this first approach, using the approximations

tan(π/8) ≈ 3/8 and cos(π/4) ≈ 3/4

results in the AC basis function [0.75 −0.7188]. While the discrepancy from the required [0.7071 0.7071] does not seem overly large, a DC signal of amplitude 64 produces an AC response of 2 units, which leaks into the expensive-to-encode high frequency band.
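That 2-unit leakage can be reproduced numerically. The sketch below assumes a standard three-shear rotation structure for the normalized lifting implementation; the step ordering and signs may differ from FIG. 4, but the approximated basis magnitudes come out as quoted above:

```python
# Normalized lifting Hadamard using the stated approximations
# tan(pi/8) ~ 3/8 and cos(pi/4) ~ 3/4 (an assumed three-shear rotation;
# signs may differ from FIG. 4, magnitudes do not).
def lifted_hadamard(x0, x1, t=3.0 / 8.0, s=3.0 / 4.0):
    x0 = x0 + t * x1       # shear 1: tan(pi/8) approximation
    x1 = x1 - s * x0       # shear 2: cos(pi/4) approximation
    x0 = x0 + t * x1       # shear 3: tan(pi/8) approximation
    return x0, x1          # (low pass, high pass)

# The AC (high-pass) basis works out to [-0.75, 0.71875], matching the
# quoted [0.75, -0.7188] magnitudes up to an overall sign flip, so a DC
# input of amplitude 64 leaks 2 units into the high frequency band.
_, ac_of_dc = lifted_hadamard(64.0, 64.0)
```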
The second approach (FIG. 5) uses trivial lifting steps. However, the low pass term is scaled up by a factor of √2, whereas the high pass term is scaled down by 1/√2 (or vice versa). The resolution of the two coefficients differs by one bit. In two dimensions, the high-high term is lower in resolution by 2 bits compared to the low-low term. Cascaded transform stages only increase this discrepancy. Entropy coding is more difficult to implement due to the differing ranges of the coefficients.
In summary, the problems with lifting based lossless transforms are:
1. Possible unequal scaling between transform coefficients, making for more complex entropy coding mechanisms.
2. Poor approximations to desired transform basis functions, which may cause undesirable effects such as the leakage of DC into AC bands.
3. Potentially high computational complexity, especially if the lifting based implementation is designed to closely approximate the desired transform.