Overview of Block Transform-Based Coding
Transform coding is a compression technique used in many audio, image and video compression systems. Uncompressed digital image and video is typically represented or captured as samples of picture elements or colors at locations in an image or video frame arranged in a two-dimensional (2D) grid. This is referred to as a spatial-domain representation of the image or video. For example, a typical format for images consists of a stream of 24-bit color picture element samples arranged as a grid. Each sample is a number representing color components at a pixel location in the grid within a color space, such as RGB or YIQ, among others. Different image and video systems may use different color, spatial and time resolutions of sampling. Similarly, digital audio is typically represented as a time-sampled audio signal stream. For example, a typical audio format consists of a stream of 16-bit amplitude samples of an audio signal taken at regular time intervals.
Uncompressed digital audio, image and video signals can consume considerable storage and transmission capacity. Transform coding reduces the size of digital audio, images and video by transforming the spatial-domain (or, for audio, time-domain) representation of the signal into a frequency-domain (or other like transform domain) representation, and then reducing the resolution of certain generally less perceptible frequency components of the transform-domain representation. This generally produces much less perceptible degradation of the digital signal compared to reducing color or spatial resolution of images or video in the spatial domain, or of audio in the time domain.
More specifically, a typical block transform-based codec 100 shown in FIG. 1 divides the uncompressed digital image's pixels into fixed-size two-dimensional blocks (X1, . . . Xn), each block possibly overlapping with other blocks. A linear transform 120-121 that does spatial-frequency analysis is applied to each block, which converts the spaced samples within the block to a set of frequency (or transform) coefficients generally representing the strength of the digital signal in corresponding frequency bands over the block interval. For compression, the transform coefficients may be selectively quantized 130 (i.e., reduced in resolution, such as by dropping least significant bits of the coefficient values or otherwise mapping values in a higher resolution number set to a lower resolution), and also entropy or variable-length coded 130 into a compressed data stream. At decoding, the transform coefficients are inverse transformed 170-171 to nearly reconstruct the original color/spatial sampled image/video signal (reconstructed blocks X̂1, . . . X̂n).
The block transform 120-121 can be defined as a mathematical operation on a vector x of size N. Most often, the operation is a linear multiplication, producing the transform domain output y = Mx, where M is the transform matrix. When the input data is arbitrarily long, it is segmented into N-sized vectors and a block transform is applied to each segment. For the purpose of data compression, reversible block transforms are chosen. In other words, the matrix M is invertible. In multiple dimensions (e.g., for image and video), block transforms are typically implemented as separable operations.
The matrix multiplication is applied separably along each dimension of the data (i.e., both rows and columns).
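As an illustration of the separable operation just described, the following minimal sketch applies a transform matrix M along the columns and then the rows of a 4x4 block, and then inverts the result. It assumes an orthonormal DCT-II as the choice of M, so that the inverse of M is simply its transpose; the function names, the block size and the sample values are hypothetical and are not taken from any particular codec.

```python
import math

def dct_matrix(n):
    """Build an n x n orthonormal DCT-II matrix (an assumed choice of M)."""
    m = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        m.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return m

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def forward_2d(block, m):
    # M * X transforms every column; multiplying by M^T then transforms every row.
    return matmul(matmul(m, block), transpose(m))

def inverse_2d(coeffs, m):
    # For an orthonormal M the inverse is the transpose, so X = M^T * Y * M.
    return matmul(matmul(transpose(m), coeffs), m)

M = dct_matrix(4)
X = [[52, 55, 61, 66],      # illustrative sample values only
     [70, 61, 64, 73],
     [63, 59, 55, 90],
     [67, 61, 68, 104]]
Y = forward_2d(X, M)        # transform coefficients; Y[0][0] is the DC term
X_rec = inverse_2d(Y, M)    # matches X up to floating-point rounding
```

Without quantization the round trip is exact up to rounding error; in a codec, the loss is introduced by the quantization step, not by the transform itself.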
For compression, the transform coefficients (components of vector y) may be selectively quantized (i.e., reduced in resolution, such as by dropping least significant bits of the coefficient values or otherwise mapping values in a higher resolution number set to a lower resolution), and also entropy or variable-length coded into a compressed data stream.
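The "dropping least significant bits" form of quantization mentioned above can be sketched as follows. This is only one possible quantizer, shown under the assumption of integer coefficients and a fixed shift applied to the magnitude; the names quantize and dequantize, the shift of 3 bits and the coefficient values are hypothetical.

```python
def quantize(c, shift):
    """Drop `shift` least significant bits of the magnitude of integer c."""
    sign = -1 if c < 0 else 1
    return sign * (abs(c) >> shift)

def dequantize(q, shift):
    """Scale a quantized level back up; the dropped bits are not recovered."""
    sign = -1 if q < 0 else 1
    return sign * (abs(q) << shift)

coeffs = [267, -41, 18, -7, 3, -2, 1, 0]   # illustrative coefficient values
shift = 3                                   # assumed quantizer setting

levels = [quantize(c, shift) for c in coeffs]
recon = [dequantize(q, shift) for q in levels]

print(levels)  # [33, -5, 2, 0, 0, 0, 0, 0]
print(recon)   # [264, -40, 16, 0, 0, 0, 0, 0]
```

The small coefficients collapse to zero and each reconstructed value differs from the original by less than 2^shift, which is why the decoder only nearly reconstructs the input.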
At decoding, the inverse of these operations (dequantization/entropy decoding 160 and inverse block transform 170-171) is applied on the decoder 150 side, as shown in FIG. 1. While reconstructing the data, the inverse matrix M⁻¹ (inverse transform 170-171) is applied as a multiplier to the transform domain data. When applied to the transform domain data, the inverse transform nearly reconstructs the original time-domain or spatial-domain digital media.
The transform used may be a simple DPCM type predictor/corrector, or it may be a more complicated structure such as a wavelet or a DCT (Discrete Cosine Transform). The commonly used standards JPEG/MPEG2/MPEG4, JPEG2000, and Windows Media Video (WMV) use the DCT, wavelet, and integerized-DCT respectively. In addition, WMV uses a lapped smoothing operator that provides visual and rate-distortion benefit for intra blocks and intra frames. The lapped smoothing operator, in conjunction with the block transform, tries to mimic a lapped transform of the type described in H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Boston, Mass., 1992.
Coefficient Scan Patterns
Many block transform-based codecs including JPEG, MPEG2, MPEG4 and WMV use a run length coding technique to encode the quantized coefficients corresponding to a particular block. (See, e.g., W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Compression Standard, Van Nostrand Reinhold, New York, 1993.) Run length coding proceeds by scanning a block of quantized transform coefficients according to a pre-determined pattern. One such example is the continuous “zigzag” scan pattern shown in FIG. 3. There is no inherent requirement for a scan pattern to be continuous, although a similar continuous zigzag scan pattern is used widely in JPEG and MPEG2/4.
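A continuous zigzag scan order can be generated programmatically, as in the sketch below. It assumes the JPEG-style convention of walking the anti-diagonals of a square block in alternating directions, starting from the top left; the exact pattern of FIG. 3 may differ, and the function names are hypothetical.

```python
def zigzag_order(n):
    """Return (row, col) positions of an n x n block in JPEG-style zigzag order."""
    order = []
    for d in range(2 * n - 1):                  # d = row + col indexes one anti-diagonal
        rows = range(max(0, d - n + 1), min(d, n - 1) + 1)
        if d % 2 == 0:
            rows = reversed(rows)               # even diagonals run bottom-left to top-right
        order.extend((r, d - r) for r in rows)
    return order

def scan(block, order):
    """Flatten a 2D block into a 1D list following the given scan order."""
    return [block[r][c] for r, c in order]

order4 = zigzag_order(4)
# Raster indices (row * 4 + col) visited by the 4x4 zigzag scan:
print([r * 4 + c for r, c in order4])
# [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]
```

Because the order is just a list of positions, a non-continuous or otherwise different pre-determined pattern can be substituted by supplying a different list to scan.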
The run length coding technique exploits the statistics of the underlying transform. Typically, larger coefficients occur towards the “DC” value (which is conventionally represented at the top left corner), and smaller, less frequent coefficients occur at larger distances from DC. For example, it is common for most of the transform coefficients of a transform block to have a value of zero after quantization. Many scan patterns give higher priority to coefficients that are more likely to have non-zero values. In other words, such coefficients are scanned earlier in the scan pattern. In this way, the non-zero coefficients are more likely to be bunched together, followed by one or more long groups of zero value coefficients. In particular, this leads to more efficient run/level/last coding, but other forms of entropy coding also benefit from the reordering.
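The effect of the reordering on run/level/last coding can be seen in a small sketch. It assumes one common form of the symbol, namely a triple of (number of zeros preceding the coefficient, the non-zero coefficient value, a flag marking the last non-zero coefficient in the block); the function names and the sample sequence are hypothetical, and the subsequent entropy coding of the symbols is omitted.

```python
def run_level_last(scanned):
    """Convert a scanned coefficient list into (run, level, last) symbols."""
    nonzero = [i for i, v in enumerate(scanned) if v != 0]
    symbols = []
    prev = -1
    for i in nonzero:
        run = i - prev - 1                       # zeros skipped since the previous non-zero
        symbols.append((run, scanned[i], i == nonzero[-1]))
        prev = i
    return symbols

def decode_run_level_last(symbols, length):
    """Rebuild the scanned list; trailing zeros are implied by the last flag."""
    out = [0] * length
    pos = -1
    for run, level, _last in symbols:
        pos += run + 1
        out[pos] = level
    return out

# Illustrative scanned block: non-zero values bunched at the start, then zeros.
scanned = [33, -5, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
symbols = run_level_last(scanned)
print(symbols)
# [(0, 33, False), (0, -5, False), (0, 2, False), (2, 1, True)]
```

Sixteen coefficients are represented by four symbols here, and the ten trailing zeros cost nothing beyond the last flag, which is the benefit of scanning likely non-zero coefficients first.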
A video compression system that selects between a limited set of pre-determined or static scan patterns, such as depending upon the block dimensions and whether the image is in interlaced or progressive format, is described in Lin et al., “Scan Patterns For Interlaced Video Content,” U.S. patent application Ser. No. 10/989,844, filed Nov. 15, 2004; and Liang et al., “Scan Patterns For Progressive Video Content,” U.S. patent application Ser. No. 10/989,594, also filed Nov. 15, 2004.