1. Field of the Invention
This invention relates to high quality encoding and decoding of multi-channel audio signals and more specifically to a subband encoder that employs perfect/non-perfect reconstruction filters, predictive/non-predictive subband encoding, transient analysis, and psychoacoustic/minimum mean-square-error (MMSE) bit allocation over time, frequency and the multiple audio channels to generate a data stream with a constrained decoding computational load.
2. Description of the Related Art
Pulse code modulation (PCM) based speech coders were first developed in the 1960's. In the early 1970's, low bit-rate speech coders were developed for use with the digital telephone networks, which had a restricted bandwidth of approximately 3.5 kHz. In 1979 Johnston outlined a 7.5 kHz sub-band differential PCM (DPCM) that was suitable for speech and music signals. In the early 1980's this work was developed using more sophisticated adaptive DPCM techniques (ADPCM), but it was not until 1988 that a true wideband high quality ADPCM coder was discussed.
In the mid-to-late 1980's new methods for coding very high quality audio signals were developed based on high resolution filter-banks and/or transform coders, in which the quantizer bit-allocations were determined by a psychoacoustic masking model. In general, the psychoacoustic masking model tries to establish a quantization noise audibility threshold at all frequencies. The threshold is used to allocate quantization bits to reduce the likelihood that the quantization noise will become audible. The quantization noise threshold is calculated in the frequency domain from the absolute energy of the frequency-transformed audio signal. The dominant frequency components of the audio signal tend to mask the audibility of other components which are close on the Bark scale (the human auditory frequency scale) to the dominant signal.
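As an illustration of what "close on the Bark scale" means, the following minimal Python sketch converts a frequency in hertz to an approximate critical-band rate. It assumes the widely used Zwicker-Terhardt closed-form approximation, which is not taken from any of the coders discussed here; the names `hz_to_bark` and `bark_distance` are hypothetical.

```python
import math


def hz_to_bark(freq_hz):
    """Approximate critical-band rate (in Bark) for a frequency in Hz.

    Assumption: Zwicker-Terhardt approximation, chosen for illustration only.
    """
    if freq_hz < 0:
        raise ValueError("frequency must be non-negative")
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def bark_distance(f1_hz, f2_hz):
    """Distance between two frequencies on the Bark scale."""
    return abs(hz_to_bark(f1_hz) - hz_to_bark(f2_hz))
```

Under this approximation, two components 100 Hz apart are much closer on the Bark scale near 5 kHz than near 200 Hz, so a dominant high frequency component would be expected to mask a wider range (in hertz) of neighboring components.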
Thus, the known high quality audio and music coders can be divided into two broad classes of schemes.
1) Medium to high frequency resolution subband/transform coders which adaptively quantize the subband or coefficient samples within the analysis window according to a psychoacoustic mask calculation.
These coders exploit the large short-term spectral variances of general music signals by allowing the bit-allocations to adapt according to the spectral energy of the signal. The high resolution of these coders allows the frequency transformed signal to be applied directly to the psychoacoustic model, which is based on a critical band theory of hearing. Dolby's AC-3 audio coder, Todd et al., "AC-3: Flexible Perceptual Coding for Audio Transmission and Storage," Convention of the Audio Engineering Society, February, 1994, typically computes 1024-point FFTs on the respective PCM signals and applies a psychoacoustic model to the 1024 frequency coefficients in each channel to determine the bit rate for each coefficient. The Dolby system uses a transient analysis that reduces the window size to 256 samples to isolate the transients. The AC-3 coder uses a proprietary backward adaptation algorithm to decode the bit allocation. This reduces the amount of bit allocation information that is sent alongside the encoded audio data. As a result, the bandwidth available to audio is increased over forward adaptive schemes, which leads to an improvement in sound quality.
2) Low resolution subband coders which make up for their poor frequency resolution by processing the subband samples using ADPCM. The quantization of the differential subband signals is either fixed or adapts to minimize the quantization noise power across all or some of the subbands, without any explicit reference to psychoacoustic masking theory. It is commonly accepted that a direct psychoacoustic distortion threshold cannot be applied to predictive/differential subband signals because of the difficulty in estimating the predictor performance ahead of the bit allocation process. The problem is further compounded by the interaction of quantization noise with the prediction process.
These coders work because perceptually critical audio signals are generally periodic over long periods of time. This periodicity is exploited by predictive differential quantization. Splitting the signal into a small number of sub-bands reduces the audible effects of noise modulation and allows the exploitation of long-term spectral variances in audio signals. If the number of subbands is increased, the prediction gain within each sub-band is reduced and at some point the prediction gain will tend to zero.
Digital Theater Systems, L. P. (DTS) makes use of an audio coder in which each PCM audio channel is filtered into four subbands and each subband is encoded using a backward ADPCM encoder that adapts the predictor coefficients to the sub-band data. The bit allocation is fixed and the same for each channel, with the lower frequency subbands being assigned more bits than the higher frequency subbands. The bit allocation provides a fixed compression ratio, for example, 4:1. The DTS coder is described by Mike Smyth and Stephen Smyth, "APT-X100: A LOW-DELAY, LOW BIT-RATE, SUB-BAND ADPCM AUDIO CODER FOR BROADCASTING," Proceedings of the 10th International AES Conference 1991, pp. 41-56.
Both types of audio coders have other common limitations. First, known audio coders encode/decode with a fixed frame size, i.e. the number of samples or period of time represented by a frame is fixed. As a result, as the encoded transmission rate increases relative to the sampling rate, the amount of data (bytes) in the frame also increases. Thus, the decoder buffer size must be designed to accommodate the worst case scenario to avoid data overflow. This increases the amount of RAM, which is a primary cost component of the decoder. Secondly, the known audio coders are not easily expandable to sampling frequencies greater than 48 kHz. To do so would make the existing decoders incompatible with the format required for the new encoders. This lack of future compatibility is a serious limitation. Furthermore, the known formats used to encode the PCM data require that the entire frame be read in by the decoder before playback can be initiated. This requires that the buffer size be limited to approximately 100 ms blocks of data such that the delay or latency does not annoy the listener.
In addition, although these coders have encoding capability up to 24 kHz, oftentimes the higher subbands are dropped. This reduces the high frequency fidelity or ambiance of the reconstructed signal. Known encoders typically employ one of two types of error detection schemes. The most common is Reed-Solomon coding, in which the encoder adds error detection bits to the side information in the data stream. This facilitates the detection and correction of any errors in the side information. However, errors in the audio data go undetected. Another approach is to check the frame and audio headers for invalid code states. For example, a particular 3-bit parameter may have only 3 valid states. If one of the other 5 states is identified then an error must have occurred. This approach provides detection capability only, with no correction, and does not detect errors in the audio data.
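The invalid-code-state check described above can be sketched in a few lines of Python. This is a minimal illustration only: the particular set of valid states and the function name `header_field_ok` are hypothetical and do not correspond to any real header format.

```python
# Hypothetical 3-bit header parameter for which only 3 of the 8 codes are defined.
VALID_STATES = frozenset({0b000, 0b001, 0b010})


def header_field_ok(value):
    """Return True if a 3-bit header field holds one of the valid states.

    A False result signals that an error occurred, but gives no way to
    correct it and says nothing about errors in the audio data.
    """
    if not 0 <= value <= 0b111:
        raise ValueError("field must fit in 3 bits")
    return value in VALID_STATES


# Codes that would be flagged as errors: the 5 undefined states.
invalid_codes = [v for v in range(8) if not header_field_ok(v)]
```

Note that a corrupted field that happens to land on another valid state passes this check, which is one reason the scheme offers only partial detection.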
In view of the above problems, the present invention provides a multi-channel audio coder with the flexibility to accommodate a wide range of compression levels with better than CD quality at high bit rates and improved perceptual quality at low bit rates, with reduced playback latency, simplified error detection, improved pre-echo distortion, and future expandability to higher sampling rates.
This is accomplished with a subband coder that windows each audio channel into a sequence of audio frames, filters the frames into baseband and high frequency ranges, and decomposes each baseband signal into a plurality of subbands. The subband coder normally selects a non-perfect filter to decompose the baseband signal when the bit rate is low, but selects a perfect filter when the bit rate is sufficiently high. A high frequency coding stage encodes the high frequency signal independently of the baseband signal. A baseband coding stage includes a VQ and an ADPCM coder that encode the higher and lower frequency subbands, respectively. Each subband frame includes at least one subframe, each of which is further subdivided into a plurality of sub-subframes. Each subframe is analyzed to estimate the prediction gain of the ADPCM coder, where the prediction capability is disabled when the prediction gain is low, and to detect transients to adjust the pre- and post-transient scale factors (SFs).
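The prediction-gain test described above can be illustrated with a minimal Python sketch. It assumes an open-loop estimate on unquantized subband samples with fixed linear-predictor coefficients; the function names, the predictor form and the 3 dB threshold are all hypothetical and are not taken from the invention.

```python
import math


def prediction_gain_db(samples, coeffs):
    """Estimate prediction gain as 10*log10(signal energy / residual energy).

    `coeffs[k]` weights the sample k+1 steps in the past (assumed linear predictor).
    """
    order = len(coeffs)
    signal_energy = 0.0
    residual_energy = 0.0
    for n in range(order, len(samples)):
        predicted = sum(c * samples[n - 1 - k] for k, c in enumerate(coeffs))
        error = samples[n] - predicted
        signal_energy += samples[n] ** 2
        residual_energy += error ** 2
    if signal_energy == 0.0:
        return 0.0            # silent subframe: prediction gives no benefit
    if residual_energy == 0.0:
        return math.inf       # perfectly predictable
    return 10.0 * math.log10(signal_energy / residual_energy)


def use_prediction(samples, coeffs, threshold_db=3.0):
    """Enable prediction only when the estimated gain exceeds a threshold (assumed value)."""
    return prediction_gain_db(samples, coeffs) > threshold_db
```

For example, a slowly varying sinusoid is well predicted by a first-order predictor and keeps prediction enabled, whereas a sign-alternating sequence yields a negative gain and would have prediction switched off for that subframe.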
A global bit management (GBM) system allocates bits to each subframe by taking advantage of the differences between the multiple audio channels, the multiple subbands, and the subframes within the current frame. The GBM system initially allocates bits to each subframe by calculating its signal-to-mask ratio (SMR) modified by the prediction gain to satisfy a psychoacoustic model. The GBM system then allocates any remaining bits according to an MMSE approach to either immediately switch to an MMSE allocation, lower the overall noise floor, or gradually morph to an MMSE allocation.
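The two-stage allocation described above can be sketched roughly in Python. This is one possible reading under stated assumptions, not the GBM system itself: it assumes about 6.02 dB of noise reduction per quantizer bit, represents each subband by hypothetical dB values, and spends the leftover bits greedily on the subband with the highest remaining noise power as a simple MMSE-style pass. All names are hypothetical.

```python
import math

DB_PER_BIT = 6.02  # assumed noise reduction per added quantizer bit


def allocate_bits(smr_db, pred_gain_db, signal_power_db, bit_pool):
    """Allocate `bit_pool` bits across subbands in two passes.

    Pass 1 (psychoacoustic): give each subband enough bits to cover its SMR
    reduced by the prediction gain.
    Pass 2 (MMSE-style, assumed greedy): hand out remaining bits one at a
    time to the subband whose estimated noise power is currently largest.
    """
    bits = [math.ceil(max(0.0, smr - gain) / DB_PER_BIT)
            for smr, gain in zip(smr_db, pred_gain_db)]
    used = sum(bits)
    if used > bit_pool:
        raise ValueError("bit pool too small to satisfy the psychoacoustic pass")

    for _ in range(bit_pool - used):
        noise_db = [power - gain - DB_PER_BIT * b
                    for power, gain, b in zip(signal_power_db, pred_gain_db, bits)]
        worst = max(range(len(bits)), key=noise_db.__getitem__)
        bits[worst] += 1
    return bits


example = allocate_bits(
    smr_db=[18.0, 6.0, 0.0],
    pred_gain_db=[6.0, 0.0, 0.0],
    signal_power_db=[60.0, 40.0, 20.0],
    bit_pool=10,
)
```

In this example the psychoacoustic pass needs only 3 bits, and the remaining 7 go to the two loudest subbands, which lowers the overall noise floor rather than spending bits on a subband that is already masked.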
A multiplexer generates output frames that include a sync word, a frame header, an audio header and at least one subframe, and which are multiplexed into a data stream at a transmission rate. The frame header includes the window size and the size of the current output frame. The audio header indicates a packing arrangement and a coding format for the audio frame. Each audio subframe includes side information for decoding the audio subframe without reference to any other subframe, high frequency VQ codes, a plurality of baseband audio sub-subframes, in which audio data for each channel's lower frequency subbands is packed and multiplexed with the other channels, a high frequency audio block, in which audio data in the high frequency range for each channel is packed and multiplexed with the other channels so that the multi-channel audio signal is decodable at a plurality of decoding sampling rates, and an unpack sync for verifying the end of the subframe.
The window size is selected as a function of the ratio of the transmission rate to the encoder sampling rate so that the size of the output frame is constrained to lie in a desired range. When the amount of compression is relatively low, the window size is reduced so that the frame size does not exceed a maximum. As a result, a decoder can use an input buffer with a fixed and relatively small amount of RAM. When the amount of compression is relatively high, the window size is increased. As a result, the GBM system can distribute bits over a larger time window, thereby improving encoder performance.
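The relationship between window size, transmission rate, sampling rate and frame size can be made concrete with a short Python sketch. The candidate window sizes and the 8192-byte frame limit below are assumptions chosen for illustration, and the function names are hypothetical.

```python
def frame_bytes(window_samples, bit_rate_bps, sample_rate_hz):
    """Bytes in one output frame that spans `window_samples` samples per channel.

    The frame lasts window_samples / sample_rate_hz seconds, so it carries
    bit_rate_bps * window_samples / sample_rate_hz bits (rounded up to bytes).
    """
    bits = window_samples * bit_rate_bps
    return -(-bits // (8 * sample_rate_hz))  # ceiling division


def select_window(bit_rate_bps, sample_rate_hz,
                  candidates=(256, 512, 1024, 2048, 4096),
                  max_frame_bytes=8192):
    """Pick the largest candidate window whose frame fits the decoder buffer."""
    fitting = [w for w in candidates
               if frame_bytes(w, bit_rate_bps, sample_rate_hz) <= max_frame_bytes]
    if not fitting:
        raise ValueError("no candidate window keeps the frame within the limit")
    return max(fitting)
```

With these assumed limits, a 384 kbit/s stream at 48 kHz can use the largest window (4096 samples, 4096 bytes per frame), while a 4096 kbit/s stream at the same sampling rate is forced down to a 512-sample window to stay within the buffer.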
These and other features and advantages of the invention will be apparent to those skilled in the art from the following detailed description of preferred embodiments, taken together with the accompanying drawings and tables, in which: