There is considerable interest among those in the fields of audio- and video-signal processing in minimizing the amount of information required to represent a signal without perceptible loss in signal quality. Signals with reduced information requirements impose lower information capacity requirements upon communication channels and storage media.
Analog signals which have been subject to compression or dynamic range reduction, for example, impose lower information capacity requirements than such signals without compression. Digital signals encoded with fewer binary bits impose lower information capacity requirements than coded signals using a greater number of bits to represent the signal. Of course, there are limits to the amount of reduction which can be realized without degrading the perceived signal quality. The following discussion is directed more particularly to digital techniques, but it should be realized that corresponding considerations apply to analog techniques as well.
The number of bits available for representing each sample of a digital signal establishes the accuracy of the digital signal representation. Lower bit rates mean that fewer bits are available to represent each sample; therefore, lower bit rates imply greater quantizing inaccuracies or quantizing errors. In many applications, quantizing errors are manifested as quantizing noise, and if the errors are of sufficient magnitude, the quantizing noise will degrade the subjective quality of the coded signal.
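The relationship between bits per sample and quantizing error can be illustrated with a minimal sketch. The uniform quantizer, the function names, and the test tone below are assumptions chosen for illustration; they are not taken from any particular coder.

```python
import math

def quantize(x, bits):
    """Uniformly quantize x (assumed to lie in [-1, 1)) to a signed integer code."""
    levels = 1 << (bits - 1)              # e.g. 128 levels per polarity for 8 bits
    code = round(x * levels)
    return max(-levels, min(levels - 1, code))

def dequantize(code, bits):
    """Map a signed integer code back to an amplitude in [-1, 1)."""
    return code / (1 << (bits - 1))

def rms_error(samples, bits):
    """Root-mean-square quantizing error over a list of samples."""
    sq = [(s - dequantize(quantize(s, bits), bits)) ** 2 for s in samples]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical test signal: a sine tone that stays inside the quantizer range.
samples = [0.9 * math.sin(2 * math.pi * 0.013 * n) for n in range(2000)]

for bits in (4, 8, 12, 16):
    print(bits, "bits -> RMS error", rms_error(samples, bits))
```

Each additional bit halves the quantizer step size, so the error printed for each word length falls steadily as the number of bits grows.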
Various "split-band" coding techniques attempt to reduce information requirements without any perceptible degradation by exploiting various psycho-perceptual effects. In audio applications, for example, the human auditory system displays frequency-analysis properties resembling those of highly asymmetrical tuned filters having variable center frequencies and bandwidths that vary as a function of the center frequency. The ability of the human auditory system to detect distinct tones generally increases as the difference in frequency between the tones increases; however, the resolving ability of the human auditory system remains substantially constant for frequency differences less than the bandwidth of the above-mentioned filters. Thus, the frequency-resolving ability of the human auditory system varies according to the bandwidth of these filters throughout the audio spectrum. The effective bandwidth of such an auditory filter is referred to as a "critical band." A dominant signal within a critical band is more likely to mask the audibility of other signals anywhere within that critical band than it is likely to mask other signals at frequencies outside that critical band. See generally, the Audio Engineering Handbook, K. Blair Benson ed., McGraw-Hill, San Francisco, 1988, pages 1.40-1.42 and 4.8-4.10.
Audio split-band coding techniques which divide the useful signal bandwidth into frequency bands with bandwidths approximating the critical bands of the human auditory system can better exploit psychoacoustic effects than wider band techniques. Such split-band coding techniques, in concept, generally comprise dividing the signal bandwidth with a filter bank, reducing the information requirements of the signal passed by each filter band to such an extent that signal degradation is just inaudible, and reconstructing a replica of the original signal with an inverse process. Two such techniques are subband coding and transform coding. Audio subband and transform coders can reduce information requirements in particular frequency bands where the resulting artifacts are psychoacoustically masked by one or more spectral components and, therefore, do not degrade the subjective quality of the encoded signal.
Subband coders may use any of various techniques to implement a filter bank with analog or digital filters. In digital subband coders, an input signal comprising signal samples is passed through a bank of digital filters. Each subband signal passed by a respective filter in the filter bank is downsampled according to the bandwidth of that subband's filter. The coder attempts to quantize each subband signal using just enough bits to render the quantizing noise imperceptible. Each subband signal comprises samples which represent a portion of the input signal spectrum.
Transform coders may use any of various so-called time-domain to frequency-domain transforms to implement a bank of digital filters. Individual coefficients obtained from the transform, or two or more adjacent coefficients grouped together, define "subbands" having effective bandwidths which are sums of individual transform coefficient bandwidths. The coefficients in a subband constitute a respective subband signal. The coder attempts to quantize the coefficients in each subband using just enough bits to render the quantizing noise imperceptible.
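The grouping of transform coefficients into subbands can be sketched as follows. A plain discrete Fourier transform stands in here for whatever time-domain to frequency-domain transform a given coder uses, and the band edges are hypothetical values chosen only so that bands widen with frequency.

```python
import cmath
import math

def dft(block):
    """Naive DFT returning coefficients 0..N/2 of a real-valued block."""
    n = len(block)
    return [sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

# Hypothetical band edges (coefficient indices) for a 32-sample block:
# single-coefficient bands at low frequencies, wider bands higher up.
BAND_EDGES = [0, 1, 2, 3, 4, 6, 8, 12, 17]

def group_into_subbands(coeffs, edges=BAND_EDGES):
    """Split a coefficient list into adjacent groups, one group per subband."""
    return [coeffs[lo:hi] for lo, hi in zip(edges, edges[1:])]

# A cosine centred on coefficient 5 should put its energy into the 4..5 band.
block = [math.cos(2 * math.pi * 5 * t / 32) for t in range(32)]
subbands = group_into_subbands(dft(block))
energies = [sum(abs(c) ** 2 for c in band) for band in subbands]
print(energies)
```

Each inner list is one subband signal in the sense used above; its effective bandwidth is the sum of the bandwidths of the coefficients it contains.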
Throughout the following discussion, the term "split-band coder" shall refer to subband coders, transform coders, and other split-band coding techniques which operate upon portions of the useful signal bandwidth. The term "subband" shall refer to these portions of the useful signal bandwidth, whether implemented by a true subband coder, a transform coder, or other technique.
As discussed above, many digital split-band coders utilizing psycho-perceptual principles provide high-quality coding at low bit rates by applying a filter bank to an input signal to generate subband signals, generating quantized information by attempting to quantize the subband signals using a number of bits such that resulting quantizing noise is just imperceptible due to psycho-perceptual masking effects, and assembling the quantized information into a form suitable for transmission or storage.
A complementary digital split-band decoder recovers a replica of the original input signal by extracting quantized information from an encoded signal, dequantizing the quantized information to obtain subband signals, and applying an inverse or synthesis filter bank to the subband signals to generate the replica of the original input signal.
The number of bits allocated to quantize the subband signals must be available to the decoder to permit accurate dequantization. A "forward-adaptive" encoder uses an allocation function to establish allocation values and explicitly passes these allocation values as "side information" to a decoder. A "backward-adaptive" encoder establishes allocation values by applying an allocation function to selected information and passes the selected information in the encoded signal rather than explicitly passing the allocation values. A backward-adaptive decoder reestablishes the allocation values by applying an allocation function to the selected information which it extracts from the encoded signal.
In one embodiment of a backward-adaptive encoder/decoder system, an encoder prepares an estimate of the input signal spectral envelope, establishes allocation values by applying an allocation function to the envelope estimate, scales signal information using elements of the envelope estimate as scale factors, quantizes the scaled signal information according to the established allocation values, and assembles the quantized information and the envelope estimate into an encoded signal. A backward-adaptive decoder extracts the envelope estimate and quantized information from the encoded signal, establishes allocation values by applying an allocation function to the envelope estimate, dequantizes the quantized information, and reverses the scaling of the signal information. Scaling is used to increase the dynamic range of information which can be represented by the limited number of bits available for quantizing. Two examples of a backward-adaptive audio encoder/decoder system are disclosed in U.S. Pat. Nos. 4,790,016 and 5,109,417, which are incorporated herein by reference in their entirety.
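The backward-adaptive arrangement described above can be sketched in a few lines. Everything specific in this sketch is an assumption made for illustration: the envelope estimate is taken to be one power-of-two exponent per subband, and allocate is a deliberately simple stand-in rather than a perceptual allocation function. The point shown is that the encoder and decoder call the same function on the same envelope estimate, so the allocation values themselves are never transmitted.

```python
import math

def allocate(exponents, max_bits=8, min_bits=2):
    """Hypothetical allocation function: fewer bits for subbands with larger exponents."""
    return [max(min_bits, max_bits - e) for e in exponents]

def scale_exponent(band):
    """Exponent e such that peak * 2**e lies in [0.5, 1); inputs assumed below 1.0."""
    peak = max(abs(v) for v in band)
    if peak == 0.0:
        return 0
    return min(15, max(0, -math.frexp(peak)[1]))

def encode(subbands):
    """Return (envelope estimate, quantized codes); no allocation values are sent."""
    exponents = [scale_exponent(band) for band in subbands]
    bits = allocate(exponents)
    codes = []
    for band, e, b in zip(subbands, exponents, bits):
        levels = 1 << (b - 1)
        codes.append([max(-levels, min(levels - 1, round(v * 2 ** e * levels)))
                      for v in band])
    return exponents, codes

def decode(exponents, codes):
    """Re-derive the allocation from the envelope, dequantize, and undo the scaling."""
    bits = allocate(exponents)            # same function the encoder applied
    out = []
    for band_codes, e, b in zip(codes, exponents, bits):
        levels = 1 << (b - 1)
        out.append([c / levels / 2 ** e for c in band_codes])
    return out

bands = [[0.5, -0.25, 0.1], [0.01, -0.02, 0.015], [0.001, 0.0005, -0.0008]]
exponents, codes = encode(bands)
print(exponents, allocate(exponents))
print(decode(exponents, codes))
```

If the decoder were to use a different allocate, it would read the quantized codes with the wrong word lengths, which is why the two functions must be identical or exactly equivalent.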
Backward-adaptive techniques are attractive in many low bit-rate coding systems because no bits are required to pass explicit allocation values. The decoder recreates the allocation values by applying an allocation function to information extracted from the encoded signal. A backward-adaptive decoder must use an allocation function which is identical, or at least exactly equivalent, to that utilized by the encoder; otherwise, accurate dequantization in the decoder is not guaranteed. As a result, the complexity or implementation cost of the decoder is similar to that of the encoder. Any restriction upon decoder complexity usually imposes restrictions upon the complexity of the allocation function in both the encoder and decoder, thereby limiting overall performance of the encoder/decoder system.
Generally speaking, it is desirable to use allocation functions based upon perceptual models which are as sophisticated as can be implemented practically. This is because complex allocation functions based upon sophisticated psycho-perceptual models are usually able to establish allocation values which achieve equivalent subjective coding quality at lower bit rates than the allocation values established by less complex allocation functions based upon simpler models. In addition to using better perceptual models, an allocation function can further improve coding performance by making proper allowance for spectral distortions introduced by the decoding process. These distortions generally arise from synthesis filter bank imperfections. Because of practical considerations for the decoder, however, many backward-adaptive coding systems cannot utilize allocation functions based upon such computationally intensive models.
Forward-adaptive techniques are attractive in many high-quality coding systems because overall system performance is not constrained by restrictions upon allocation function complexity in the decoder; the decoder does not need to perform an allocation function to establish allocation values. A forward-adaptive decoder can be computationally less complex and need not impose any restrictions upon the allocation function performed by the encoder. In addition, improved allocation functions may be incorporated into the encoders of forward-adaptive coding systems while maintaining compatibility with existing decoders. The allocation function used in an encoder can be the result of an independent design choice.
The ability to improve the allocation function in an encoder is significant. As advances are made in the arts of signal coding and signal processing, increasingly sophisticated allocation functions become economically practical. By increasing the sophistication of allocation functions, bit rates may be decreased for a given signal quality, or signal quality may be increased for a given bit rate.
Despite these advantages, however, forward-adaptive coding systems are unsuitable for many low bit-rate applications because they require a significant number of bits to convey side information. Generally, even more bits are required to convey side information as allocation functions seek to improve coding performance by dividing the spectrum into narrower, and therefore more numerous, subbands. Furthermore, the number of bits required to carry this side information will represent a larger proportion of the coded signal as improved coding techniques decrease the number of bits required to carry the remainder of the coded signal.
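The arithmetic behind this observation can be sketched with hypothetical figures. The subband counts, the 4-bit allocation word, and the block sizes below are illustrative assumptions only and do not describe any particular coding system.

```python
def side_info_fraction(num_subbands, bits_per_allocation, block_bits):
    """Fraction of an encoded block spent on explicit allocation values."""
    side_bits = num_subbands * bits_per_allocation
    return side_bits / block_bits

# Hypothetical forward-adaptive coder sending one 4-bit allocation value per subband.
baseline = side_info_fraction(20, 4, 1536)
narrower_bands = side_info_fraction(40, 4, 1536)   # twice as many subbands
lower_bit_rate = side_info_fraction(40, 4, 768)    # same subbands, half the bits

print(baseline, narrower_bands, lower_bit_rate)
```

Doubling the number of subbands doubles the share of the block taken by side information, and halving the block size doubles it again, which illustrates both effects noted above.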
There is, therefore, a desire to develop efficient allocation functions based upon more sophisticated perceptual models which are suitable for low-cost implementation of coding systems, and which properly allow for spectral distortions produced by the decoding process.
One fairly sophisticated psychoacoustic model based upon the mechanics of human hearing is described by Schroeder, Atal and Hall, "Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear," J. Acoust. Soc. Am., December 1979, pp. 1647-1652. The model comprises (1) performing a short-time spectral analysis of an input signal by applying a short-time Fourier transform, (2) obtaining the input signal critical-band densities by mapping the resulting spectral coefficients into critical bands x, and (3) generating a basilar-membrane "excitation pattern" by convolving the critical-band densities with a basilar-membrane "spreading function." This model is applied to the input signal and to a noise signal representing quantizing errors to generate a "signal excitation pattern" and a "noise excitation pattern," respectively. The loudness of the input signal and the loudness of the noise signal are calculated by integrating functions of the respective excitation patterns. The loudness of a signal, whether the input signal or the noise signal, whose excitation pattern falls below a masking threshold is zero; that is, the signal is inaudible. The masking threshold is obtained from the product of the signal excitation pattern and a "sensitivity function" which defines the threshold of masking. An objective measure of coding performance is a ratio obtained by dividing the loudness of the noise signal by the loudness of the input signal. The model is straightforward and provides reasonably good results for spectral energy below about 5 kHz, but it is computationally intensive and makes no allowance for decoder spectral distortions.
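Steps (2) and (3) of the model can be sketched as follows. The critical-band mapping x = 7 asinh(f/650) and the spreading-function expression are the forms commonly attributed to the cited paper and are stated here as assumptions; the one-band-per-unit-of-x resolution and the function names are illustrative choices. The sensitivity function and the loudness integration are omitted because the text above gives no values for them.

```python
import math

def hz_to_bark(f):
    """Assumed critical-band mapping: frequency in Hz to critical-band number x."""
    return 7.0 * math.asinh(f / 650.0)

def spreading(dx):
    """Assumed spreading function, returned as a linear power ratio.

    dx is the distance in critical bands from the masker to the point excited.
    """
    db = 15.81 + 7.5 * (dx + 0.474) - 17.5 * math.sqrt(1.0 + (dx + 0.474) ** 2)
    return 10.0 ** (db / 10.0)

def critical_band_densities(power, sample_rate):
    """Step (2): accumulate power-spectrum bins 0..N/2 into unit-wide critical bands."""
    nyquist = sample_rate / 2.0
    bands = [0.0] * (int(hz_to_bark(nyquist)) + 1)
    for k, p in enumerate(power):
        f = k * nyquist / (len(power) - 1)
        bands[int(hz_to_bark(f))] += p
    return bands

def excitation_pattern(bands):
    """Step (3): convolve the critical-band densities with the spreading function."""
    n = len(bands)
    return [sum(bands[j] * spreading(i - j) for j in range(n)) for i in range(n)]

# A single masker in band 10 excites its upper neighbours more than its lower ones.
masker = [0.0] * 20
masker[10] = 1.0
print(excitation_pattern(masker))
```

The convolution in excitation_pattern costs on the order of the square of the number of bands for every block, which gives some sense of why the model is described as computationally intensive.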