There is considerable interest in the field of audio signal processing in minimizing the amount of information required to represent an audio signal without perceptible loss in signal quality. By reducing the amount of information required, signal representations impose lower information capacity requirements upon communication paths and storage media.
There is particular interest in developing ways to convey in real time multiple channels of high-quality digital audio signals over relatively low-bandwidth communication paths such as conventional residential telephone lines. This type of communication path is commonly used to connect personal computers to public networks and, at present, is capable of no more than about 50 k-bits per sec. By conveying audio signals in real time, the audio information represented by the signals can be presented or played back without interruption as the signals are received.
Information capacity requirements can be reduced by applying either or both of two data compression techniques. One type, sometimes referred to as "lossy" compression, reduces information capacity requirements in a manner which does not assure, and generally prevents, perfect recovery of the original signal. Another type, sometimes referred to as "lossless" compression, reduces information capacity requirements in a manner that permits perfect recovery of the original signal.
Quantization is one well-known lossy compression technique. Quantization can reduce information capacity requirements by reducing the number of bits used to represent each sample of a digital signal, thereby reducing the accuracy of the digital signal representation. In audio coding applications, the reduced accuracy or quantizing error is manifested as quantizing noise. If the errors are of sufficient magnitude, the quantizing noise will degrade the subjective quality of the coded signal.
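The effect described above can be illustrated with a minimal hypothetical sketch, not taken from any particular coder: re-quantizing 16-bit samples to fewer bits discards low-order information, and the difference between the original and re-quantized samples is the quantizing error that is heard as quantizing noise.

```python
def quantize(samples, bits):
    """Re-quantize signed 16-bit integer samples to `bits` bits of
    accuracy by discarding the low-order bits (a uniform quantizer
    sketch for illustration only)."""
    shift = 16 - bits
    return [(s >> shift) << shift for s in samples]

original = [12345, -4567, 890, -23456]
coarse = quantize(original, 8)          # keep only the 8 most significant bits
errors = [o - c for o, c in zip(original, coarse)]
# every quantizing error magnitude is bounded by the step size 2**(16 - 8)
```

Halving the number of bits roughly halves the capacity requirement but raises the quantizing noise floor; a perceptual coder spends bits only where that noise would be audible.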
Various audio coding techniques attempt to apply lossy compression techniques to an input signal without suffering any perceptible degradation by removing components of information which are imperceptible or irrelevant to perceived coding quality. A complementary decoding technique can recover a replica of the input signal which is perceptually indistinguishable from the input signal provided the removed component is truly irrelevant. For example, split-band encoding splits an input signal into several narrow-band signals and adaptively quantizes each narrow-band signal according to psychoacoustic principles.
Psychoacoustic principles are based on the frequency-analysis properties of the human auditory system, which resemble highly asymmetrical tuned filters having variable center frequencies and bandwidths that vary as a function of the center frequency. The ability of the human auditory system to detect distinct tones generally increases as the difference in frequency between the tones increases; however, the resolving ability of the human auditory system remains substantially constant for frequency differences less than the bandwidth of the filtering behavior mentioned above. This bandwidth varies throughout the audio spectrum and is referred to as a "critical bandwidth." A dominant signal is more likely to mask the audibility of other signals anywhere within a critical bandwidth than it is to mask signals at frequencies outside that critical bandwidth. A dominant signal may mask other signals which occur not only at the same time as the masking signal, but also before and after it. The durations of pre- and postmasking effects depend upon the magnitude of the masking signal, but premasking effects are usually of much shorter duration than postmasking effects. The premasking interval can extend beyond 100 msec. but is generally regarded to be limited to less than 5 msec. The postmasking interval can extend beyond 500 msec. but is generally regarded to be limited to about 50 msec. A masked component of a signal is irrelevant and can be removed without changing the perceptual experience of a human listener.
Split-band audio encoding often comprises using a forward or "analysis" filter bank to divide an audio signal bandwidth into several subband signals each having a bandwidth commensurate with the critical bandwidths of the human auditory system. Each subband signal is quantized using just enough bits to ensure that the quantizing noise in each subband is masked by the spectral component in that subband and possibly adjacent subbands. Split-band audio decoding comprises reconstructing a replica of the original signal using an inverse or "synthesis" filter bank. If the bandwidths of the filters in the filter banks and the quantizing accuracy of the subband signals are chosen properly, the reconstructed replica can be perceptually indistinguishable from the original signal.
Two such coding techniques are subband coding and transform coding. Subband coding may use various analog and/or digital filtering techniques to implement the filter banks. Transform coding uses various time-domain to frequency-domain transforms to implement the filter banks. Adjacent frequency-domain transform coefficients may be grouped to define "subbands" having effective bandwidths which are sums of individual transform coefficient bandwidths.
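As a sketch of the transform-coding variant described above, the following hypothetical code implements the analysis step as a plain DFT and groups adjacent coefficients into progressively wider "subbands." The subband edges used here are illustrative assumptions, not taken from any standard; a real coder derives them from critical-bandwidth considerations.

```python
import cmath

def analysis(block):
    """Toy 'analysis filter bank': the DFT of one block of samples."""
    N = len(block)
    return [sum(block[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def group_subbands(coeffs, edges):
    """Group adjacent transform coefficients into subbands whose
    effective bandwidths are sums of coefficient bandwidths."""
    return [coeffs[lo:hi] for lo, hi in zip(edges, edges[1:])]

block = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]   # a tone at bin 2
coeffs = analysis(block)
# Hypothetical subband edges that widen with frequency, loosely
# mimicking critical bandwidths (an assumption for illustration).
subbands = group_subbands(coeffs, [0, 1, 2, 4, 8])
```

Each subband would then be quantized with its own bit allocation, with just enough bits to keep its quantizing noise below the masking threshold.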
Throughout the following discussion, the term "split-band coding" and the like refers to subband encoding and decoding, transform encoding and decoding, and other encoding and decoding techniques which operate upon portions of the useful signal bandwidth. The term "subband" refers to these portions of the useful signal bandwidth, whether implemented by a true subband coder, a transform coder, or other technique. The term "subband signal" refers to a split-band filtered signal representation within a respective subband.
Lossy compression may include scaling. Many coding techniques including split-band coding convey signals using a scaled representation to extend the dynamic range of encoded information represented by a limited number of bits. A scaled representation comprises one or more "scaling factors" associated with "scaled values" corresponding to elements of the encoded signals. Many forms of scaled representation are known. By sacrificing some accuracy in the scaled values, even fewer bits may be used to convey information using a "block-scaled representation." A block-scaled representation comprises a group or block of scaled values associated with a common scaling factor.
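A block-scaled representation can be sketched as block floating point, in which one shared scaling factor serves an entire block of coarsely quantized values. The 8-bit scaled-value size below is an illustrative assumption.

```python
def block_scale(values, mantissa_bits=8):
    """Encode a block of values as one shared scaling factor plus
    coarse scaled values (a block floating-point sketch)."""
    q = (1 << (mantissa_bits - 1)) - 1      # e.g. 127 for 8-bit values
    peak = max(abs(v) for v in values)
    scale = peak if peak > 0 else 1.0
    scaled = [round(v / scale * q) for v in values]
    return scale, scaled

def block_unscale(scale, scaled, mantissa_bits=8):
    """Recover approximate values from the shared factor and scaled values."""
    q = (1 << (mantissa_bits - 1)) - 1
    return [s * scale / q for s in scaled]

scale, scaled = block_scale([0.5, -0.25, 0.125, 0.5])
decoded = block_unscale(scale, scaled)
# the decoded values approximate the originals; the cost of the single
# scaling factor is amortized over the whole block
```

Because one scaling factor is shared, small values in a block dominated by a large peak are represented less accurately, which is the accuracy sacrifice referred to above.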
A lossless type of compression reduces information capacity requirements without degradation by reducing or eliminating components of the signal which are redundant. A complementary decompression technique can recover the original signal perfectly by providing the redundant component removed during compression. Examples of lossless compression techniques include run-length encoding, differential coding, linear predictive coding, and transform coding. Variations, combinations and adaptive forms of these compression techniques are also known.
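Of the lossless techniques listed, run-length encoding is the simplest to illustrate; this hypothetical sketch shows the defining property that decompression recovers the original exactly.

```python
def rle_encode(data):
    """Run-length encoding: collapse each run of a repeated value
    into a (value, count) pair."""
    runs = []
    for v in data:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, n in pairs for _ in range(n)]

data = [0, 0, 0, 5, 5, 1]
packed = rle_encode(data)               # [(0, 3), (5, 2), (1, 1)]
assert rle_decode(packed) == data       # lossless: perfect recovery
```

The technique only pays off when the input contains long runs of identical values, i.e., when the redundant component being removed actually exists.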
Hybrid techniques combining lossless and lossy compression techniques are also known. For example, split-band coding using a transform-based filter bank combines lossless transform coding with lossy psychoacoustic perceptual coding.
Single-channel coding techniques such as those discussed above do not provide a sufficient reduction in information requirements to permit multiple channels of high-quality audio to be conveyed over low-bandwidth paths, e.g., conventional telephone lines, for real-time playback. Various high-performance coding systems require on the order of 64 k-bits per second or more to convey in real time audio signals having a bandwidth of 15 kHz. Because multiples of these bit rates are required to convey multiple audio channels, improvements in the performance of single-channel coding systems far larger than can reasonably be expected would be needed to allow multiple channels of audio to be conveyed in real time over limited-bandwidth communication paths such as conventional residential telephone lines. The needed additional reduction in information capacity requirements is addressed by multiple-channel coding techniques referred to herein as spatial coding techniques.
One form of spatial coding combines multiple signals according to an encoding matrix and recovers a replica of the original signals using a complementary decoding matrix. Many 4:2:4 matrixing techniques are known that combine four signals into two signals for transmission or storage and subsequently recover a replica of the four original signals from the two encoded signals. This coding technique suffers from high levels of crosstalk between signals. A number of adaptive matrixing techniques have been developed to reduce the level of crosstalk but neither the reduction in crosstalk nor the reduction in information capacity requirements is sufficient.
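The crosstalk inherent in passive matrixing can be seen in a minimal 4:2:4 sketch. The 0.707 mixing coefficient is a common choice, assumed here for illustration; it is not specified by the text above.

```python
import math

g = math.sqrt(0.5)      # ~0.707, a common matrixing coefficient (assumed)

def encode_424(L, C, R, S):
    """Fold four channels (left, center, right, surround) into two."""
    Lt = L + g * C + g * S
    Rt = R + g * C - g * S
    return Lt, Rt

def decode_424(Lt, Rt):
    """Passive decode of the two encoded channels back into four."""
    return Lt, g * (Lt + Rt), Rt, g * (Lt - Rt)   # L, C, R, S

# A center-only input leaks into the left and right outputs (crosstalk):
L, C, R, S = decode_424(*encode_424(0.0, 1.0, 0.0, 0.0))
```

Here C is recovered at full level, but L and R carry the center signal at about 3 dB down; this leakage is the crosstalk that adaptive matrix decoders attempt to suppress.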
Another form of spatial coding splits multiple input signals into subband signals, generates a vector of steering information representing spectral levels of the channels in each subband, combines the subband signals for all channels in a given frequency subband to produce a summation or composite subband signal, perceptually encodes the composite subband signals, and assembles the encoded composite subband signals and the steering vectors into an encoded signal. A complementary decoder generates a subband signal in a respective frequency subband for each output signal by scaling the appropriate composite subband signal according to the steering vector for that subband, and generates an output signal by passing the scaled subband signals through an inverse filter bank. Two examples of such a coding system are disclosed in Davis, et al., U.S. Pat. No. 5,583,962, and in "Coding of Moving Pictures and Associated Audio for Digital Storage Media At Up To About 1.5 Mbit/s," International Organization for Standardization, CD 11172-3, Part 3 (Audio), Annex 3-G (Joint Stereo Coding), pp. G-1 to G-4.
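The composite-plus-steering scheme described above can be sketched for a single frequency subband as follows. The per-channel level measure and the normalization used here are simplifying assumptions; the systems cited also perceptually code the composite subband signal, quantize the steering vector, and pass the scaled subband signals through an inverse filter bank, all of which this sketch omits.

```python
def encode_subband(channel_subbands):
    """For one subband: combine the subband signals of all channels
    into a composite signal plus a steering vector of per-channel
    levels (a simplified sketch)."""
    composite = [sum(samples) for samples in zip(*channel_subbands)]
    def level(sig):
        return max(abs(s) for s in sig) or 1.0
    comp_level = level(composite)
    steering = [level(ch) / comp_level for ch in channel_subbands]
    return composite, steering

def decode_subband(composite, steering):
    """Regenerate each channel's subband signal by scaling the
    composite according to its steering-vector element."""
    return [[s * w for s in composite] for w in steering]

left = [1.0, 2.0]
right = [0.5, 1.0]
composite, steering = encode_subband([left, right])
replicas = decode_subband(composite, steering)
```

Only one composite signal plus a short steering vector is conveyed per subband instead of one subband signal per channel, which is the source of the additional capacity reduction.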
Unfortunately, these spatial coding techniques, even when combined with perceptual coding, do not permit multiple channels of high-quality audio to be conveyed over low-bandwidth paths at a bit rate low enough for real-time playback. When the bit rate is reduced sufficiently, these techniques reproduce replicas of the original input signals with undesirable artifacts such as chirps, clicks and sounds that resemble a zipper being opened or closed ("zipper noise").