Audio compression techniques have been developed to transmit audio signals over bandwidth-constrained channels and to store such signals on media with limited capacity. In general audio compression, no assumptions can be made about the source or characteristics of the sound. Algorithms must be general enough to deal with arbitrary types of audio signals, which in turn poses a substantial constraint on viable approaches. (In this document, the term "audio" refers to a signal that can be any sound in general, such as music of any type, speech, or a mixture of music and voice.) General audio compression thus differs from speech coding in one significant aspect: in speech coding, where the source is known a priori, model-based algorithms are practical.
Many audio compression techniques rely upon a "psychoacoustic model" to achieve substantial compression. Psychoacoustics describes the relationship between acoustic events and the resulting perceived sounds. Thus, in a psychoacoustic model, the response of the human auditory system is taken into account in order to remove audio signal components that are imperceptible to human ears. Spectral "masking" is one of the most frequently exploited psychoacoustic phenomena. "Masking" describes the effect by which a fainter signal that would be distinctly audible on its own becomes inaudible when a louder signal occurs simultaneously with, or within a very short time of, the lower amplitude signal. Masking depends on the spectral composition of both the masking signal and the masked signal, and on their variations with time. For example, FIG. 1 is a plot of the spectrum for a typical signal (trumpet) 10 and of the human perceptual threshold 12. The perceptual threshold 12 varies with frequency and power. Note that a great deal of the signal 10 is below the perceptual threshold 12 and is therefore perceptually redundant. Thus, this part of the audio signal may be discarded.
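The discarding step described above can be illustrated with a minimal sketch. It assumes the perceptual threshold has already been computed and is supplied per frequency bin. A real psychoacoustic model derives that threshold from the signal itself, which is not shown here. The function name `discard_inaudible` and all numeric levels are hypothetical and chosen only for illustration.

```python
def discard_inaudible(spectrum_db, threshold_db):
    """Keep only spectral components at or above the perceptual threshold.

    spectrum_db  -- signal level per frequency bin, in dB (hypothetical values)
    threshold_db -- perceptual threshold per frequency bin, in dB, assumed given
    Returns a list in which discarded components are marked as None.
    """
    if len(spectrum_db) != len(threshold_db):
        raise ValueError("spectrum and threshold must cover the same bins")
    kept = []
    for level, limit in zip(spectrum_db, threshold_db):
        # A component below the threshold is treated as inaudible and dropped.
        kept.append(level if level >= limit else None)
    return kept


# Illustrative levels only; they are not taken from FIG. 1.
example = discard_inaudible([60, 20, 45, 10], [30, 30, 40, 40])
```

In this example the second and fourth bins fall below the threshold and are dropped, so only two of the four components would need to be coded.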
One well-known technique that utilizes a psychoacoustic model is embodied in the MPEG-Audio standard (ISO/IEC 11172-3:1993(E)) (here, simply "MPEG"). FIG. 2 is a block diagram of a conventional MPEG audio encoder. A digitized audio signal (e.g., a 16-bit pulse code modulated, or PCM, signal) is input into one or more filter banks 20 and into a psychoacoustic "model" 22. The filter banks 20 perform a time-to-frequency mapping, generating multiple subbands (e.g., 32). The filter banks 20 are "critically" sampled so that there are as many samples in the analyzed domain as there are in the time domain. The filter banks 20 provide the primary frequency separation for the encoder; a similar set of filter banks 20 serves as the reconstruction filters for the corresponding decoder. The output samples of the filter banks 20 are then quantized by a bit or noise allocation function 24.
The parallel psychoacoustic model 22 calculates a "just noticeable" noise level for each band of the filter banks 20, in the form of a "signal-to-mask" ratio. This noise level is used in the bit or noise allocation function 24 to determine the actual quantizer and quantizer levels. The quantized samples from the bit or noise allocation function 24 are then applied to a bitstream formatting function 26, which outputs the final encoded (compressed) bitstream. The output of the psychoacoustic model 22 may be used to adjust bit allocations in the bitstream formatting function 26, in known fashion.
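As a rough illustration of how a signal-to-mask ratio can drive quantizer selection, the sketch below uses the common rule of thumb that each additional bit of a uniform scalar quantizer improves the signal-to-noise ratio by about 6.02 dB. This is a simplification and not the allocation procedure of the MPEG standard, which distributes a limited pool of bits iteratively. The function name `bits_for_bands`, the `max_bits` cap, and the example ratios are assumptions made for illustration.

```python
import math

# Approximate SNR gain per bit for a uniform scalar quantizer (rule of thumb).
DB_PER_BIT = 6.02


def bits_for_bands(smr_db, max_bits=15):
    """Estimate quantizer bits per subband from signal-to-mask ratios in dB.

    A band whose signal lies at or below the masking level (SMR <= 0 dB)
    gets no bits, since its quantization noise would be inaudible anyway.
    max_bits is a hypothetical cap, not a value taken from the standard.
    """
    bits = []
    for smr in smr_db:
        if smr <= 0:
            bits.append(0)
        else:
            # Enough bits to push quantization noise below the mask.
            bits.append(min(max_bits, math.ceil(smr / DB_PER_BIT)))
    return bits


# Hypothetical signal-to-mask ratios for four subbands.
allocation = bits_for_bands([-3.0, 6.0, 12.5, 30.0])
```

Bands with a larger signal-to-mask ratio receive more bits, which keeps the quantization noise in each band near or below the just noticeable level.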
Most approaches to audio compression can be broadly divided into two major categories: time domain quantization and frequency domain quantization. An MPEG coder/decoder ("codec") is an example of an approach employing time domain scalar quantization. In particular, MPEG employs scalar quantization of the time domain signal in individual subbands (typically 32 subbands), while bit allocation in the scalar quantizer is based on a psychoacoustic model that is implemented separately in the frequency domain (a dual-path approach).
MPEG audio compression is limited to applications with higher bit-rates, 1.5 bits per sample and higher. At 1.5 bits per sample, MPEG audio does not preserve the full range of frequency content. Instead, frequency components at or near the Nyquist limit are thrown away in the compression process. In a sense, MPEG audio does not truly achieve compression at the rate of 1.5 bits per sample.
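The bits-per-sample figures above translate into channel bit-rates by simple arithmetic. The sketch below assumes a 44.1 kHz sampling rate, which is not stated in this document, and compares against the 16-bit PCM input mentioned earlier. The function names are hypothetical.

```python
def bit_rate(bits_per_sample, sample_rate_hz, channels=1):
    """Return the coded bit-rate in bits per second."""
    return bits_per_sample * sample_rate_hz * channels


def compression_ratio(pcm_bits_per_sample, coded_bits_per_sample):
    """Return the ratio of uncompressed PCM size to coded size."""
    return pcm_bits_per_sample / coded_bits_per_sample


# Assumed 44.1 kHz sampling rate; one channel.
rate_at_1_5 = bit_rate(1.5, 44100)       # bits per second at 1.5 bits/sample
rate_at_1_0 = bit_rate(1.0, 44100)       # bits per second at 1 bit/sample
ratio_at_1_5 = compression_ratio(16, 1.5)
ratio_at_1_0 = compression_ratio(16, 1.0)
```

Under these assumptions, 1.5 bits per sample corresponds to 66,150 bits per second per channel (roughly 10.7:1 against 16-bit PCM), while 1 bit per sample corresponds to 44,100 bits per second (16:1).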
Quantization is one of the most common and direct techniques to achieve data compression. There are two basic quantization types: scalar and vector. Scalar quantization encodes data points individually, while vector quantization groups input data into vectors, each of which is encoded as a whole. Vector quantization typically searches a codebook (a collection of vectors) for the closest match to an input vector, yielding an output index. A de-quantizer simply performs a table lookup in an identical codebook to reconstruct an approximation of the original vector. Other approaches that do not involve codebooks are known, such as closed-form solutions.
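The codebook search and table lookup described above can be sketched as follows. This is a minimal illustration using an exhaustive nearest-neighbor search with squared Euclidean distance. The four-entry, two-dimensional codebook and the function names `vq_encode` and `vq_decode` are hypothetical. Practical codebooks are much larger and are usually trained on representative data.

```python
def vq_encode(vector, codebook):
    """Return the index of the codeword closest to the input vector."""
    best_index = 0
    best_distance = float("inf")
    for index, codeword in enumerate(codebook):
        # Squared Euclidean distance between the input and this codeword.
        distance = sum((a - b) ** 2 for a, b in zip(vector, codeword))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def vq_decode(index, codebook):
    """De-quantize by table lookup in an identical codebook."""
    return codebook[index]


# Hypothetical codebook: 4 entries, so each 2-sample vector costs 2 bits,
# which corresponds to 1 bit per sample.
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)]

index = vq_encode((0.9, 0.8), CODEBOOK)
reconstructed = vq_decode(index, CODEBOOK)
```

The decoder returns the stored codeword rather than the exact input, so the reconstruction error is the distance between the input vector and its nearest codeword.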
It is well known that scalar quantization is not optimal with respect to rate/distortion tradeoffs. Scalar quantization cannot exploit correlations among adjacent data points and thus scalar quantization yields higher distortion levels than vector quantization for a given bit rate. Vector quantization schemes usually can achieve far better compression ratios at a given distortion level. Thus, time domain scalar quantization limits the degree of compression, resulting in higher bit-rates. Further, human ears are sensitive to the distortion associated with zeroing even a single time domain sample. This phenomenon makes direct application of traditional vector quantization techniques on a time domain audio signal an unattractive proposition, since vector quantization at the rate of 1 bit per sample or lower often leads to zeroing of some vector components (that is, time domain samples).
Frequency domain quantization based audio compression is an alternative to time domain quantization based audio compression. However, there is a significant difficulty that needs to be resolved in frequency domain quantization based audio compression. The input audio signal is continuous, with no practical limits on the total time duration. It is thus necessary to encode the audio signal in a piecewise manner. Each piece is called an audio encode or decode frame. Performing quantization in the frequency domain on a per frame basis generally leads to discontinuities at the frame boundaries. Such discontinuities result in objectionable audible artifacts (e.g., "clicks" and "pops"). One remedy to this discontinuity problem is to use overlapped frames, which results in proportionally lower compression ratios and higher computational complexity. A more popular approach is to use "critically filtered" subband filter banks, which employ a history buffer that maintains continuity at frame boundaries, but at a cost of latency in the codec-reconstructed audio signal. Another, more complex, approach is to enforce boundary conditions as constraints in the audio encode and decode processes.
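The cost of the overlapped-frame remedy can be made concrete with a short sketch. It splits a signal into frames that share a fixed number of samples with their neighbors and computes how many samples must be coded per input sample. The function names, frame length, and overlap are hypothetical and chosen only for illustration.

```python
def split_frames(samples, frame_len, overlap):
    """Split a sample sequence into frames sharing `overlap` samples.

    Trailing samples that do not fill a complete frame are dropped here
    for simplicity; a real codec would pad or buffer them.
    """
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    hop = frame_len - overlap  # new samples contributed by each frame
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def overlap_overhead(frame_len, overlap):
    """Return coded samples per input sample for a given overlap."""
    return frame_len / (frame_len - overlap)


# Ten input samples, frames of four samples, two samples shared per boundary.
frames = split_frames(list(range(10)), 4, 2)
overhead = overlap_overhead(4, 2)
```

With a 50% overlap every input sample is coded twice, so the achievable compression ratio is halved relative to non-overlapped frames of the same length, and the per-sample computation doubles.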
The inventors have determined that it would be desirable to provide an audio compression technique suitable for real-time applications while having reduced computational complexity. The technique should provide low bit-rate compression (about 1 bit per sample) for music and speech, while being applicable to higher bit-rate audio compression. The present invention provides such a technique.