1. Field of the Invention
The invention pertains to audio signal processing, and more particularly to multichannel audio encoding (e.g., encoding of data indicative of a multichannel audio signal) and decoding. In typical embodiments, a downmix of low frequency components of individual channels of multichannel input audio undergoes waveform coding and the other (higher frequency) components of the input audio undergo parametric coding. Some embodiments encode multichannel audio data in accordance with one of the formats known as AC-3 and E-AC-3 (Enhanced AC-3), or in accordance with another encoding format.
2. Background of the Invention
Dolby Laboratories provides proprietary implementations of AC-3 and E-AC-3 known as Dolby Digital and Dolby Digital Plus, respectively. Dolby, Dolby Digital, and Dolby Digital Plus are trademarks of Dolby Laboratories Licensing Corporation.
Although the invention is not limited to use in encoding audio data in accordance with the E-AC-3 (or AC-3) format, for convenience it will be described in embodiments in which it encodes an audio bitstream in accordance with the E-AC-3 format.
An AC-3 or E-AC-3 encoded bitstream comprises metadata and can comprise one to six channels of audio content. The audio content is audio data that has been compressed using perceptual audio coding. Details of AC-3 coding are well known and are set forth in many published references including the following:
ATSC Standard A/52A: Digital Audio Compression Standard (AC-3), Revision A, Advanced Television Systems Committee, 20 Aug. 2001; and
U.S. Pat. Nos. 5,583,962; 5,632,005; 5,633,981; 5,727,119; and 6,021,386.
Details of Dolby Digital Plus (E-AC-3) coding are set forth in, for example, “Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System,” AES Convention Paper 6196, 117th AES Convention, Oct. 28, 2004.
Each frame of an AC-3 encoded audio bitstream contains audio content and metadata for 1536 samples of digital audio. For a sampling rate of 48 kHz, this represents 32 milliseconds of digital audio or a rate of 31.25 frames per second of audio.
Each frame of an E-AC-3 encoded audio bitstream contains audio content and metadata for 256, 512, 768 or 1536 samples of digital audio, depending on whether the frame contains one, two, three or six blocks of audio data respectively.
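The frame sizes and timing stated above follow from simple arithmetic, which can be sketched as below. The function names are hypothetical, and the constant of 256 samples per block is inferred from the figures given in the text (one block corresponds to 256 samples).

```python
# Illustrative arithmetic only; names are hypothetical, not from any codec API.
SAMPLES_PER_BLOCK = 256  # inferred from the text: one block <-> 256 samples
ALLOWED_BLOCKS_PER_FRAME = (1, 2, 3, 6)


def frame_samples(num_blocks: int) -> int:
    """Number of audio samples covered by a frame with the given block count."""
    if num_blocks not in ALLOWED_BLOCKS_PER_FRAME:
        raise ValueError("a frame holds one, two, three or six blocks")
    return num_blocks * SAMPLES_PER_BLOCK


def frame_duration_ms(num_blocks: int, sample_rate_hz: int = 48_000) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_samples(num_blocks) / sample_rate_hz


def frames_per_second(num_blocks: int, sample_rate_hz: int = 48_000) -> float:
    """Frame rate implied by the frame length and sampling rate."""
    return sample_rate_hz / frame_samples(num_blocks)


if __name__ == "__main__":
    # A six-block frame at 48 kHz: 1536 samples, 32 ms, 31.25 frames/s.
    print(frame_samples(6), frame_duration_ms(6), frames_per_second(6))
```

With six blocks per frame at 48 kHz, this reproduces the 32 millisecond frame duration and the rate of 31.25 frames per second given for AC-3.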
The audio content encoding performed by typical implementations of E-AC-3 encoding includes waveform encoding and parametric encoding.
Waveform encoding of an audio input signal (typically performed to compress the signal so that the encoded signal comprises fewer bits than the input signal) encodes the input signal in a manner which preserves the input signal's waveform as much as possible subject to applicable constraints (e.g., so that the waveform of the encoded signal matches that of the input signal to the extent possible). For example, in conventional E-AC-3 encoding, waveform encoding is performed on the low frequency components (typically, up to 3.5 kHz or 4.6 kHz) of each channel of a multichannel input signal to compress such low frequency content of the input signal, by generating (in the frequency domain) a quantized representation (quantized mantissa and exponent) of each sample (which is a frequency component) of each low frequency band of each channel of the input signal.
More specifically, typical implementations of E-AC-3 encoders (and some other conventional audio encoders) implement a psychoacoustic model to analyze frequency domain data indicative of the input signal on a banded basis (i.e., typically 50 nonuniform bands approximating the frequency bands of the well-known psychoacoustic scale known as the Bark scale) to determine an optimal allocation of bits to each mantissa. To perform waveform encoding on the low frequency components of the input signal, the mantissa data (indicative of the low frequency content) are quantized to a number of bits corresponding to the determined bit allocation. The quantized mantissa data (and corresponding exponent data and typically also corresponding metadata) are then formatted into an encoded output bitstream.
Parametric encoding, another well-known type of audio signal encoding, extracts and encodes feature parameters of the input audio signal, such that the reconstructed signal (after encoding and subsequent decoding) has as much intelligibility as possible (subject to applicable constraints), but such that the waveform of the encoded signal may be very different from that of the input signal.
For example, PCT International Application Publication No. WO 03/083834 A1, published Oct. 9, 2003 and PCT International Application Publication No. WO 2004/102532 A1, published Nov. 25, 2004, describe a type of parametric coding known as spectral extension coding. In spectral extension coding, the frequency components of a full frequency range audio input signal are encoded as a sequence of frequency components of a limited frequency range signal (a baseband signal) and a corresponding sequence of encoding parameters (indicative of a residual signal) which determine (with the baseband signal) an approximated version of the full frequency range input signal.
Another well-known type of parametric encoding is channel coupling coding. In channel coupling coding, a monophonic downmix of the channels of an audio input signal is constructed. The input signal is encoded as this downmix (a sequence of frequency components) and a corresponding sequence of coupling parameters. The coupling parameters are level parameters which determine (with the downmix) an approximated version of each of the channels of the input signal. The coupling parameters are frequency-banded metadata that match the energy of the monophonic downmix to the energy of each channel of the input signal.
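The energy-matching role of the coupling parameters can be illustrated with a minimal sketch. It assumes real-valued frequency components, a plain average as the monophonic downmix, and one gain per band per channel; the function names and band layout are hypothetical, and an actual E-AC-3 encoder handles phase, banding, and quantization in ways not shown here.

```python
import math

# Simplified illustration of channel coupling; not the E-AC-3 algorithm itself.


def encode_coupling(channels, band_edges):
    """Return (downmix, gains), where gains[c][b] scales the downmix so that its
    energy in band b matches the energy of channel c in that band."""
    num_bins = len(channels[0])
    # Assumption: the mono downmix is the plain average of the channels.
    downmix = [sum(ch[k] for ch in channels) / len(channels) for k in range(num_bins)]
    gains = []
    for ch in channels:
        ch_gains = []
        for lo, hi in zip(band_edges[:-1], band_edges[1:]):
            e_ch = sum(x * x for x in ch[lo:hi])
            e_dm = sum(x * x for x in downmix[lo:hi])
            ch_gains.append(math.sqrt(e_ch / e_dm) if e_dm > 0.0 else 0.0)
        gains.append(ch_gains)
    return downmix, gains


def decode_coupling(downmix, ch_gains, band_edges):
    """Approximate one channel by applying its per-band gains to the downmix."""
    out = [0.0] * len(downmix)
    for gain, lo, hi in zip(ch_gains, band_edges[:-1], band_edges[1:]):
        for k in range(lo, hi):
            out[k] = gain * downmix[k]
    return out


if __name__ == "__main__":
    chans = [[1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 1.0, 1.0]]
    edges = [0, 2, 4]  # two bands of two bins each (hypothetical layout)
    dm, g = encode_coupling(chans, edges)
    print(decode_coupling(dm, g[0], edges))
```

The reconstructed channel has the same energy per band as the original channel, but its waveform generally differs, which is what distinguishes this parametric approach from waveform coding.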
For example, conventional E-AC-3 encoding of a 5.1 channel input signal (with an available bitrate of 192 kbps for delivery of the encoded signal) typically implements channel coupling coding to encode the intermediate frequency components (in the range F1<f≦F2, where F1 is typically equal to 3.5 kHz or 4.6 kHz, and F2 is typically equal to 10 kHz or 10.2 kHz) of each channel of the input signal, and spectral extension coding to encode the high frequency components (in the range F2<f≦F3, where F2 is typically equal to 10 kHz or 10.2 kHz, and F3 is typically equal to 14.8 kHz or 16 kHz) of each channel of the input signal. The monophonic downmix determined during performance of the channel coupling encoding is waveform coded, and the waveform coded downmix is delivered (in the encoded output signal) along with the coupling parameters. The downmix determined during performance of the channel coupling encoding is employed as the baseband signal for the spectral extension coding. The spectral extension coding determines (from the baseband signal and the high frequency components of each channel of the input signal) another set of encoding parameters (SPX parameters). The SPX parameters are included in and delivered with the encoded output signal.
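The division of the spectrum among the three coding tools described above can be summarized in a short sketch. The default values of F1, F2 and F3 are taken from the typical figures in the text; the function name is hypothetical, and treating components above F3 as not coded is an assumption, since the text does not say how they are handled.

```python
# Illustrative mapping of a frequency to the coding tool applied to it in the
# conventional 192 kbps example. Names are hypothetical.


def coding_method(freq_hz: float,
                  f1_hz: float = 4600.0,
                  f2_hz: float = 10200.0,
                  f3_hz: float = 16000.0) -> str:
    """Return the coding tool for a frequency component at freq_hz."""
    if freq_hz <= f1_hz:
        return "waveform"            # discrete waveform coding of each channel
    if freq_hz <= f2_hz:
        return "coupling"            # F1 < f <= F2: channel coupling
    if freq_hz <= f3_hz:
        return "spectral extension"  # F2 < f <= F3: SPX
    return "not coded"               # assumption: nothing is coded above F3


if __name__ == "__main__":
    for f in (1000.0, 8000.0, 12000.0, 20000.0):
        print(f, coding_method(f))
```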
In another type of parametric coding sometimes referred to as spatial audio coding, a downmix (e.g., a mono or stereo downmix) of the channels of a multichannel audio input signal is generated. The input signal is encoded as an output signal including this downmix (a sequence of frequency components) and a corresponding sequence of spatial parameters (or as a waveform coded version of each channel of the downmix, with a corresponding sequence of spatial parameters). The spatial parameters allow for restoration of both the amplitude envelope of each channel of the audio input signal and the interchannel correlations between the channels of the audio input signal from the downmix of the input signal. This type of parametric coding may be performed on all frequency components of the input signal (i.e., over the full frequency range of the input signal) rather than on just the frequency components in a subrange of the input signal's full frequency range (i.e., so that the encoded version of the input signal includes the downmix and spatial parameters for all frequencies of the input signal's full frequency range, rather than just a subset thereof).
In E-AC-3 or AC-3 encoding of an audio bitstream, blocks of input audio samples to be encoded undergo time-to-frequency domain transformation resulting in blocks of frequency domain data, commonly referred to as transform coefficients (or frequency coefficients or frequency components) located in uniformly spaced frequency bins. The frequency coefficient in each bin is then converted (e.g., in BFPE stage 7 of the FIG. 1 system) into a floating point format comprising an exponent and a mantissa.
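The conversion of a frequency coefficient into an exponent and a mantissa can be sketched as follows. This is a simplified illustration, assuming coefficients normalized to magnitude below 1 and an exponent that counts the power-of-two scaling needed to normalize the mantissa; the function name and the exponent cap of 24 are assumptions made for the example rather than details stated in the text.

```python
import math

# Hypothetical helper: split a coefficient into (exponent, mantissa) such that
# coeff == mantissa * 2**(-exponent).


def to_exponent_mantissa(coeff: float, max_exponent: int = 24):
    """Assumes |coeff| < 1. Larger exponents correspond to smaller coefficients."""
    if coeff == 0.0:
        return max_exponent, 0.0
    _, e = math.frexp(abs(coeff))        # abs(coeff) = m * 2**e with 0.5 <= m < 1
    exponent = min(max(-e, 0), max_exponent)
    mantissa = coeff * 2.0 ** exponent   # scaling by a power of two is exact
    return exponent, mantissa


if __name__ == "__main__":
    for c in (0.25, -0.1, 0.0):
        print(c, to_exponent_mantissa(c))
```

The exponents carry the coarse spectral shape, while the mantissas carry the fine detail that is later quantized according to the bit allocation.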
Typically, the mantissa bit assignment is based on the difference between a fine-grain signal spectrum (represented by a power spectral density (“PSD”) value for each frequency bin) and a coarse-grain masking curve (represented by a mask value for each frequency band).
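The idea of assigning mantissa bits from the difference between the per-bin PSD and the per-band mask can be illustrated with a rough sketch. It assumes levels expressed in dB and uses the common rule of thumb of roughly 6 dB of signal-to-noise ratio per bit; an actual E-AC-3 encoder uses its own tables and parameters, so the function name, the 6.02 dB figure, and the 16-bit cap are all assumptions made for illustration.

```python
import math

# Rough illustration of PSD-minus-mask bit assignment; not the E-AC-3 procedure.


def allocate_bits(psd_db, mask_db_per_band, band_of_bin,
                  max_bits: int = 16, db_per_bit: float = 6.02):
    """Return a mantissa bit count for each frequency bin.

    psd_db[k]           -- fine-grain PSD value of bin k (dB)
    mask_db_per_band[b] -- coarse-grain mask value of band b (dB)
    band_of_bin[k]      -- index of the band that contains bin k
    """
    bits = []
    for k, psd in enumerate(psd_db):
        margin = psd - mask_db_per_band[band_of_bin[k]]
        if margin <= 0.0:
            bits.append(0)  # component lies under the mask: spend no bits
        else:
            bits.append(min(max_bits, math.ceil(margin / db_per_bit)))
    return bits


if __name__ == "__main__":
    print(allocate_bits([60.0, 30.0, 45.0, 10.0], [40.0, 20.0], [0, 0, 1, 1]))
```

Bins whose PSD falls at or below the mask of their band receive no bits, reflecting the masking principle that such components are inaudible.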
FIG. 1 is an encoder configured to perform conventional E-AC-3 encoding on time-domain input audio data 1. Analysis filter bank 2 of the encoder converts the time-domain input audio data 1 into frequency-domain audio data 3, and block floating point encoding (BFPE) stage 7 generates a floating point representation of each frequency component of data 3, comprising an exponent and mantissa for each frequency bin. The frequency-domain data output from stage 7 will sometimes also be referred to herein as frequency domain audio data 3. The frequency domain audio data output from stage 7 are then encoded, including by performing waveform coding (in elements 4, 6, 10, and 11 of the FIG. 1 system) on the low frequency components (having frequency less than or equal to “F1”, where F1 is typically equal to 3.5 kHz or 4.6 kHz) of the frequency domain data output from stage 7, and by performing parametric coding (in parametric encoding stage 12) on the other frequency components (those having frequency greater than F1) of the frequency domain data output from stage 7.
The waveform encoding includes quantization of the mantissas (of the low frequency components output from stage 7) in quantizer 6 and tenting of the exponents (of the low frequency components output from stage 7) in tenting stage 10 and encoding (in exponent coding stage 11) of the tented exponents generated in stage 10. Formatter 8 generates an E-AC-3 encoded bitstream 9 in response to the quantized data output from quantizer 6, the coded differential exponent data output from stage 11, and the parametrically encoded data output from stage 12.
Quantizer 6 performs bit allocation and quantization based upon control data (including masking data) generated by controller 4. The masking data (determining a masking curve) is generated from the frequency domain data 3, on the basis of a psychoacoustic model (implemented by controller 4) of human hearing and aural perception. The psychoacoustic modeling takes into account the frequency-dependent thresholds of human hearing, and a psychoacoustic phenomenon referred to as masking, whereby a strong frequency component close to one or more weaker frequency components tends to mask the weaker components, rendering them inaudible to a human listener. This makes it possible to omit the weaker frequency components when encoding audio data, and thereby achieve a higher degree of compression, without adversely affecting the perceived quality of the encoded audio data (bitstream 9). The masking data comprises a masking curve value for each frequency band of the frequency domain audio data 3. These masking curve values represent the level of signal masked by the human ear in each frequency band. Quantizer 6 uses this information to decide how best to use the available number of data bits to represent the frequency domain data of each frequency band of the input audio signal.
It is known that in conventional E-AC-3 encoding, differential exponents (i.e., the difference between consecutive exponents) are coded instead of absolute exponents. The differential exponents can only take on one of five values: 2, 1, 0, −1, and −2. If a differential exponent outside this range is found, one of the exponents being subtracted is modified so that the differential exponent (after the modification) is within the noted range (this conventional method is known as “exponent tenting” or “tenting”). Tenting stage 10 of the FIG. 1 encoder generates tented exponents in response to the raw exponents asserted thereto, by performing such a tenting operation.
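A tenting operation of this kind can be sketched with a forward pass and a backward pass over the exponents. The sketch assumes that exponents are only ever lowered, never raised, to bring each difference into range (lowering an exponent costs mantissa precision but, under the usual convention that a smaller exponent denotes a larger magnitude, cannot cause overflow). The text does not specify which exponent is modified, so this choice and the function name are assumptions.

```python
# Hypothetical sketch of exponent tenting: limit consecutive differences to +/-2.


def tent_exponents(exponents, max_step: int = 2):
    """Return a copy of the exponents in which every consecutive difference
    lies in [-max_step, +max_step], obtained only by lowering values."""
    tented = list(exponents)
    # Forward pass: cap rises (exp[i + 1] - exp[i] > max_step).
    for i in range(len(tented) - 1):
        if tented[i + 1] > tented[i] + max_step:
            tented[i + 1] = tented[i] + max_step
    # Backward pass: cap falls (exp[i] - exp[i + 1] > max_step).
    for i in range(len(tented) - 2, -1, -1):
        if tented[i] > tented[i + 1] + max_step:
            tented[i] = tented[i + 1] + max_step
    return tented


def differential(exponents):
    """Differences between consecutive exponents, as would be coded."""
    return [b - a for a, b in zip(exponents, exponents[1:])]


if __name__ == "__main__":
    raw = [5, 9, 2, 3, 8]
    tented = tent_exponents(raw)
    print(tented, differential(tented))
```

For the raw exponents 5, 9, 2, 3, 8 the sketch yields 5, 4, 2, 3, 5, whose differentials −1, −2, 1, 2 all fall within the five permitted values.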
In a typical embodiment of E-AC-3 coding, a 5 or 5.1 channel audio signal is encoded at a bit rate in the range from about 96 kbps to about 192 kbps. Currently, at 192 kbps a typical E-AC-3 encoder encodes a 5-channel (or 5.1 channel) input signal using a combination of discrete waveform coding for the lower frequency components (e.g., up to 3.5 kHz or 4.6 kHz) of each channel of the signal, channel coupling for the intermediate frequency components (e.g., from 3.5 kHz to about 10 kHz or from 4.6 kHz to about 10 kHz) of each channel of the signal, and spectral extension for the higher frequency components (e.g., from about 10 kHz to 16 kHz or from about 10 kHz to 14.8 kHz) of each channel of the signal. While this yields acceptable quality, as the maximum bitrate available for delivering the encoded output signal is reduced below 192 kbps, the quality (of a decoded version of the encoded output signal) degrades rapidly. For example, when using E-AC-3 to encode 5.1 channel audio for streaming, temporary data bandwidth limitations may require a data rate lower than 192 kbps (e.g., 64 kbps). However, using E-AC-3 to encode a 5.1 channel signal for delivery at a bitrate below 192 kbps does not produce “broadcast quality” encoded audio. In order to code a signal (using E-AC-3 encoding) for delivery at a bitrate substantially below 192 kbps (e.g., 96 kbps, or 128 kbps, or 160 kbps), the best available tradeoff between audio bandwidth (available for delivering the encoded audio signal), coding artifacts, and spatial collapse must be found. More generally, the inventors have recognized that the best tradeoff between audio bandwidth, coding artifacts, and spatial collapse must also be found when otherwise encoding multichannel input audio for delivery at low (or less than typical) bitrates.
One naive solution is to downmix the multichannel input audio to the number of channels that can be produced at adequate quality (e.g., “broadcast quality” if this is the minimum adequate quality) for the available bitrate, and then perform conventional encoding of each channel of the downmix. For example, one might downmix a five-channel input signal to a three-channel downmix (where the available bitrate is 128 kbps) or to a two-channel downmix (where the available bitrate is 96 kbps). However, this solution maintains coding quality and audio bandwidth at the expense of severe spatial collapse.
Another naive solution is to avoid downmixing (e.g., to produce a full 5.1 channel encoded output signal in response to a 5.1 channel input signal), and instead push the codec to its limit. However, this solution would introduce more coding artifacts and sacrifice audio bandwidth, although it would maintain as much spaciousness as possible.