1. Field of the Invention
This invention relates to high quality encoding and decoding of multi-channel audio signals and more specifically to a subband encoder that employs perfect/non-perfect reconstruction filters, predictive/non-predictive subband encoding, transient analysis, and psycho-acoustic/minimum mean-square-error (mmse) bit allocation over time, frequency and the multiple audio channels to generate a data stream with a constrained decoding computational load.
2. Description of the Related Art
Pulse code modulation (PCM) based speech coders were first developed in the 1960's. In the early 1970's, low bit-rate speech coders were developed for use with the digital telephone networks, which had a restricted bandwidth of approximately 3.5 kHz. In 1979 Johnston outlined a 7.5 kHz sub-band differential PCM (DPCM) coder that was suitable for speech and music signals. In the early 1980's this work was developed using more sophisticated adaptive DPCM techniques (ADPCM), but it was not until 1988 that a true wideband high quality ADPCM coder was discussed.
In the mid-to-late 1980's new methods for coding very high quality audio signals were developed based on high resolution filter-banks and/or transform coders, in which the quantizer bit-allocations were determined by a psychoacoustic masking model. In general, the psychoacoustic masking model tries to establish a quantization noise audibility threshold at all frequencies. The threshold is used to allocate quantization bits to reduce the likelihood that the quantization noise will become audible. The quantization noise threshold is calculated in the frequency domain from the absolute energy of the frequency-transformed audio signal. The dominant frequency components of the audio signal tend to mask the audibility of other components which are close in the Bark scale (human auditory frequency scale) to the dominant signal.
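As a rough illustration of how a masking threshold can drive bit allocation, the following minimal Python sketch assigns bits to each frequency band from its signal-to-mask ratio. It is not taken from any particular coder: the function name `allocate_bits`, the band levels, the mask levels, and the use of the familiar rule of thumb of about 6.02 dB of signal-to-noise ratio per quantizer bit are all assumptions made for the example.

```python
import math

DB_PER_BIT = 6.02  # rule of thumb: SNR gained per bit of a uniform quantizer

def allocate_bits(signal_db, mask_db, max_bits=16):
    """Return bits per band so quantization noise sits at or below the mask."""
    bits = []
    for level, mask in zip(signal_db, mask_db):
        smr = level - mask  # signal-to-mask ratio in dB
        if smr <= 0:
            bits.append(0)  # band lies under the threshold, so no bits are spent
        else:
            bits.append(min(max_bits, math.ceil(smr / DB_PER_BIT)))
    return bits

# Hypothetical band energies and masking thresholds, in dB.
band_levels = [60.0, 72.0, 40.0, 20.0]
band_masks = [40.0, 30.0, 45.0, 25.0]
allocation = allocate_bits(band_levels, band_masks)
print(allocation)
```

Bands whose energy falls below the threshold receive no bits, which is where a psychoacoustic allocation saves bit rate relative to a uniform one.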
Thus, the known high quality audio and music coders can be divided into two broad classes of schemes.
1) Medium to high frequency resolution subband/transform coders which adaptively quantize the subband or coefficient samples within the analysis window according to a psychoacoustic mask calculation.
These coders exploit the large short-term spectral variances of general music signals by allowing the bit-allocations to adapt according to the spectral energy of the signal. The high resolution of these coders allows the frequency transformed signal to be applied directly to the psychoacoustic model, which is based on a critical band theory of hearing. Dolby's AC-3 audio coder, Todd et al., "AC-3: Flexible Perceptual Coding for Audio Transmission and Storage," Convention of the Audio Engineering Society, February, 1994, typically computes 1024-point FFTs on the respective PCM signals and applies a psychoacoustic model to the 1024 frequency coefficients in each channel to determine the bit rate for each coefficient. The Dolby system uses a transient analysis that reduces the window size to 256 samples to isolate the transients. The AC-3 coder uses a proprietary backward adaptation algorithm to decode the bit allocation. This reduces the amount of bit allocation information that is sent alongside the encoded audio data. As a result, the bandwidth available to audio is increased over forward adaptive schemes, which leads to an improvement in sound quality.
2) Low resolution subband coders which make up for their poor frequency resolution by processing the subband samples using ADPCM. The quantization of the differential subband signals is either fixed or adapts to minimize the quantization noise power across all or some of the subbands, without any explicit reference to psychoacoustic masking theory. It is commonly accepted that a direct psychoacoustic distortion threshold cannot be applied to predictive/differential subband signals because of the difficulty in estimating the predictor performance ahead of the bit allocation process. The problem is further compounded by the interaction of quantization noise on the prediction process.
These coders work because perceptually critical audio signals are generally periodic over long periods of time. This periodicity is exploited by predictive differential quantization. Splitting the signal into a small number of sub-bands reduces the audible effects of noise modulation and allows the exploitation of long-term spectral variances in audio signals. If the number of subbands is increased, the prediction gain within each sub-band is reduced and at some point the prediction gain will tend to zero.
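The benefit of predictive differential quantization on a periodic signal can be sketched in a few lines of Python. The example below is a plain closed-loop DPCM with a fixed first-order predictor, not the adaptive (ADPCM) schemes discussed above; the predictor coefficient, step size, test tone, and function names are assumptions chosen for illustration.

```python
import math

def dpcm_encode(samples, step, coeff=0.9):
    """Quantize the prediction error of a fixed first-order predictor."""
    codes = []
    prev = 0.0  # previous reconstructed sample, identical to the decoder's state
    for x in samples:
        prediction = coeff * prev
        code = round((x - prediction) / step)  # quantized prediction error
        codes.append(code)
        prev = prediction + code * step  # track what the decoder will rebuild
    return codes

def dpcm_decode(codes, step, coeff=0.9):
    """Rebuild the samples by adding each dequantized error to the prediction."""
    out = []
    prev = 0.0
    for code in codes:
        prev = coeff * prev + code * step
        out.append(prev)
    return out

# A slowly varying periodic test tone (64 samples per cycle).
signal = [math.sin(2 * math.pi * n / 64) for n in range(256)]
step = 0.01
codes = dpcm_encode(signal, step)
decoded = dpcm_decode(codes, step)

max_error = max(abs(a - b) for a, b in zip(signal, decoded))
peak_code = max(abs(c) for c in codes)                   # range the DPCM codes need
peak_pcm = max(abs(round(x / step)) for x in signal)     # range direct PCM would need
print(max_error, peak_code, peak_pcm)
```

With the same step size, and therefore the same reconstruction error bound of half a step, the differential codes span a much smaller range than direct quantization of the samples, so fewer bits per sample are needed.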
Digital Theater Systems, L.P. (DTS) makes use of an audio coder in which each PCM audio channel is filtered into four subbands and each subband is encoded using a backward ADPCM encoder that adapts the predictor coefficients to the sub-band data. The bit allocation is fixed and the same for each channel, with the lower frequency subbands being assigned more bits than the higher frequency subbands. The bit allocation provides a fixed compression ratio, for example, 4:1. The DTS coder is described by Mike Smyth and Stephen Smyth, "APT-X100: A LOW-DELAY, LOW BIT-RATE, SUB-BAND ADPCM AUDIO CODER FOR BROADCASTING," Proceedings of the 10th International AES Conference 1991, pp. 41-56.
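The relationship between a fixed bit allocation and a fixed compression ratio can be checked with simple arithmetic. The Python sketch below assumes 16-bit PCM input and a critically sampled four-band filter bank, so that every four input samples yield one sample per subband; the specific allocation `[7, 4, 3, 2]` is hypothetical and is chosen only because it gives more bits to the lower bands and sums to 16.

```python
PCM_BITS = 16                # assumed input word length
SUBBAND_BITS = [7, 4, 3, 2]  # hypothetical fixed allocation, low band first

def compression_ratio(pcm_bits, subband_bits):
    """Ratio of input bits to coded bits for a critically sampled filter bank."""
    num_bands = len(subband_bits)
    input_bits = pcm_bits * num_bands  # num_bands PCM samples enter the filter bank
    output_bits = sum(subband_bits)    # one coded sample leaves each subband
    return input_bits / output_bits

ratio = compression_ratio(PCM_BITS, SUBBAND_BITS)
print(ratio)
```

Because the allocation never changes, the ratio is the same for every block and every channel, regardless of the signal content.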
Both types of audio coders have other common limitations. First, known audio coders encode/decode with a fixed frame size, i.e. the number of samples or period of time represented by a frame is fixed. As a result, as the encoded transmission rate increases relative to the sampling rate, the amount of data (bytes) in the frame also increases. Thus, the decoder buffer size must be designed to accommodate the worst case scenario to avoid data overflow. This increases the amount of RAM, which is a primary cost component of the decoder. Secondly, the known audio coders are not easily expandable to sampling frequencies greater than 48 kHz. To do so would make the existing decoders incompatible with the format required for the new encoders. This lack of future compatibility is a serious limitation. Furthermore, the known formats used to encode the PCM data require that the entire frame be read in by the decoder before playback can be initiated. This requires that the buffer size be limited to approximately 100 ms blocks of data such that the delay or latency does not annoy the listener.
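The dependence of frame size on transmission rate when the frame spans a fixed number of samples can be worked through numerically. The following Python sketch uses assumed values (a 1024-sample frame at 48 kHz and three illustrative bit rates) and a hypothetical helper `frame_bytes`; none of these figures comes from a specific coder.

```python
def frame_bytes(bit_rate_bps, frame_samples, sample_rate_hz):
    """Bytes in one frame that always spans frame_samples samples (rounded up)."""
    numerator = bit_rate_bps * frame_samples
    return -(-numerator // (sample_rate_hz * 8))  # integer ceiling division

FRAME_SAMPLES = 1024     # assumed fixed frame length in samples
SAMPLE_RATE_HZ = 48_000  # assumed sampling rate

sizes = {
    rate: frame_bytes(rate, FRAME_SAMPLES, SAMPLE_RATE_HZ)
    for rate in (384_000, 768_000, 1_536_000)
}
worst_case_buffer = max(sizes.values())  # the decoder must hold the largest frame
print(sizes, worst_case_buffer)
```

The frame grows in direct proportion to the bit rate, so the decoder buffer has to be sized for the highest rate the format permits even if most streams use a lower one.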
In addition, although these coders have encoding capability up to 24 kHz, the higher subbands are often dropped. This reduces the high frequency fidelity or ambiance of the reconstructed signal. Known encoders typically employ one of two types of error detection schemes. The most common is Reed-Solomon coding, in which the encoder adds error detection bits to the side information in the data stream. This facilitates the detection and correction of any errors in the side information. However, errors in the audio data go undetected. Another approach is to check the frame and audio headers for invalid code states. For example, a particular 3-bit parameter may have only 3 valid states. If one of the other 5 states is identified then an error must have occurred. This only provides detection capability and does not detect errors in the audio data.
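The invalid-code-state check described above can be illustrated with a short Python sketch. The three valid codes and the function name are hypothetical; the example only shows the principle that a 3-bit field with 3 valid states leaves 5 states that signal an error.

```python
# Hypothetical 3-bit header parameter: only 3 of the 8 possible codes are valid.
VALID_STATES = frozenset({0b000, 0b001, 0b010})

def header_field_is_valid(field):
    """Return True if the 3-bit field holds one of the valid code states."""
    return field in VALID_STATES

all_states = range(8)
invalid_states = [s for s in all_states if not header_field_is_valid(s)]

# A bit error that turns one valid code into another valid code is not caught.
corrupted = 0b000 ^ 0b001
undetected = header_field_is_valid(corrupted)
print(invalid_states, undetected)
```

The check flags an error only when corruption happens to produce one of the unused codes; it cannot correct the field, and a corrupted value that lands on another valid code passes unnoticed.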