1. Field of the Invention
The invention pertains to audio signal processing, and more particularly, to encoding of audio data with adaptive low frequency compensation. Some embodiments of the invention are useful for encoding audio data in accordance with one of the formats known as Dolby Digital (AC-3) and Dolby Digital Plus (E-AC-3), or in accordance with another encoding format. Dolby, Dolby Digital, and Dolby Digital Plus are trademarks of Dolby Laboratories Licensing Corporation.
2. Background of the Invention
Although the invention is not limited to use in encoding audio data in accordance with the AC-3 (Dolby Digital) format (or the Dolby Digital Plus format), for convenience it will be described in embodiments in which it encodes an audio bitstream in accordance with the AC-3 format. An AC-3 encoded bitstream comprises one to six channels of audio content, and metadata indicative of at least one characteristic of the audio content. The audio content is audio data that has been compressed using perceptual audio coding.
Details of AC-3 (also known as Dolby Digital) coding are well known and are set forth in many published references including the following:
ATSC Standard A52/A: Digital Audio Compression Standard (AC-3), Revision A, Advanced Television Systems Committee, 20 Aug. 2001;
“Flexible Perceptual Coding for Audio Transmission and Storage,” by Craig C. Todd, et al, 96th Convention of the Audio Engineering Society, Feb. 26, 1994, Preprint 3796;
“Design and Implementation of AC-3 Coders,” by Steve Vernon, IEEE Trans. Consumer Electronics, Vol. 41, No. 3, August 1995;
“Dolby Digital Audio Coding Standards,” book chapter by Robert L. Andersen and Grant A. Davidson in The Digital Signal Processing Handbook, Second Edition, Vijay K. Madisetti, Editor-in-Chief, CRC Press, 2009;
“High Quality, Low-Rate Audio Transform Coding for Transmission and Multimedia Applications,” by Bosi et al, Audio Engineering Society Preprint 3365, 93rd AES Convention, October, 1992; and
U.S. Pat. Nos. 5,583,962; 5,632,005; 5,633,981; 5,727,119; and 6,021,386.
Details of Dolby Digital (AC-3) and Dolby Digital Plus (sometimes referred to as Enhanced AC-3 or “E-AC-3”) coding are set forth in “Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System,” AES Convention Paper 6196, 117th AES Convention, Oct. 28, 2004, and in the Dolby Digital/Dolby Digital Plus Specification (ATSC A/52:2010), available at http://www.atsc.org/cms/index.php/standards/published-standards.
In AC-3 encoding of an audio bitstream, blocks of input audio samples to be encoded undergo time-to-frequency domain transformation resulting in blocks of frequency domain data, commonly referred to as transform coefficients, frequency coefficients, or frequency components, located in uniformly spaced frequency bins. The frequency coefficient in each bin is then converted (e.g., in BFPE stage 7 of the FIG. 1 system) into a floating point format comprising an exponent and a mantissa.
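The exponent and mantissa split described above can be illustrated with a minimal Python sketch. It treats a single coefficient in isolation, whereas the BFPE stage works on blocks. The function name `to_exponent_mantissa`, the assumption that coefficients have magnitude below 1, and the exponent ceiling of 24 (taken from the exponent range mentioned later in this document) are illustrative choices, not details of any particular encoder.

```python
import math

MAX_EXP = 24  # assumed ceiling, matching the 0-24 exponent range cited in this document


def to_exponent_mantissa(coeff):
    """Split coeff (assumed |coeff| < 1) so that coeff == mantissa * 2**-exponent."""
    if coeff == 0.0:
        return MAX_EXP, 0.0
    # math.frexp returns (m, e) with coeff == m * 2**e and 0.5 <= |m| < 1.
    _, e = math.frexp(coeff)
    # The exponent counts how many binary places the value sits below full scale.
    exponent = min(max(-e, 0), MAX_EXP)
    mantissa = coeff * 2.0 ** exponent
    return exponent, mantissa


exp, man = to_exponent_mantissa(0.1)
# 0.1 == 0.8 * 2**-3, so the exponent is 3 and the mantissa is 0.8
```

Larger exponents therefore correspond to quieter coefficients, which is why the exponents alone give a coarse picture of the signal spectrum.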
Typical embodiments of AC-3 (and Dolby Digital Plus) encoders (and other audio data encoders) implement a psychoacoustic model to analyze the frequency domain data on a banded basis (i.e., typically 50 nonuniform bands approximating the frequency bands of the well known psychoacoustic scale known as the Bark scale) to determine an optimal allocation of bits to each mantissa. The mantissa data is then quantized (e.g., in quantizer 6 of the FIG. 1 system) to a number of bits corresponding to the determined bit allocation. The quantized mantissa data is then formatted (e.g., in formatter 8 of the FIG. 1 system) into an encoded output bitstream.
Typically, the mantissa bit assignment is based on the difference between a fine-grain signal spectrum (represented by a power spectral density (“PSD”) value for each frequency bin) and a coarse-grain masking curve (represented by a mask value for each frequency band). Typically also, the psychoacoustic model implements low frequency compensation (sometimes referred to as “lowcomp” compensation or “lowcomp”) to determine correction values (sometimes referred to herein as “lowcomp” parameter values) for correcting the masking curve values for low frequency bands. Each lowcomp parameter value may be subtracted from (or otherwise applied to) a preliminary masking curve value for a different one of the low frequency bands, in order to generate a final masking curve value for the band.
As noted, mantissa bit assignment in audio encoding can be based on the difference between signal spectrum and a masking curve. A simple algorithm for implementing such bit assignment may assume that quantization noise in one particular frequency band is independent of bit assignments in neighboring bands. However, this is typically not a reasonable assumption, especially at lower frequencies, due to finite frequency selectivity and high degree of overlap between bands in the decoder filter-bank, and due to leakage from one band into neighboring bands at low frequencies, where the slope of the masking curve can equal or exceed the slope of the filter-bank transition skirts.
Thus, the mantissa bit assignment process in audio encoding often includes a low frequency compensation process which determines a corrected masking curve. The corrected masking curve is then used to determine a signal-to-mask ratio value for each frequency component of the audio data. Low frequency compensation is a decoder selectivity compensation process for improved coding performance at low frequencies for signals with prominent low-frequency tonal components. Typically, low frequency compensation is a filter-bank response correction that, for convenience, may be incorporated into the computation of the excitation function which is used to determine the signal-to-mask values. As will be explained in greater detail below, a typical implementation of low frequency compensation searches for prominent low frequency signal components by looking for frequency bands with a PSD value that is 12-dB less than the PSD value for the next (higher frequency) band. When such a PSD value is found, the excitation function value for the band is immediately reduced by 18 dB (or an amount up to 18 dB). This reduction is then slowly backed out by 3 dB per subsequent band.
FIG. 1 is an encoder configured to perform AC-3 (or enhanced AC-3) encoding on time-domain input audio data 1. Analysis filter bank 2 converts the time-domain input audio data 1 into frequency domain audio data 3, and block floating point encoding (BFPE) stage 7 generates a floating point representation of each frequency component of data 3, comprising an exponent and mantissa for each frequency bin. The frequency-domain data output from stage 7 will sometimes also be referred to herein as frequency domain audio data 3. The frequency domain audio data output from stage 7 are then encoded, including by quantization of its mantissas in quantizer 6 and tenting of its exponents (in tenting stage 10) and encoding (in exponent coding stage 11) of the tented exponents generated in stage 10. Formatter 8 generates an AC-3 (or enhanced AC-3) encoded bitstream 9 in response to the quantized data output from quantizer 6 and coded differential exponent data output from stage 11.
Quantizer 6 performs bit allocation and quantization based upon control data (including masking data) generated by controller 4. The masking data (determining a masking curve) is generated from the frequency domain data 3, on the basis of a psychoacoustic model (implemented by controller 4) of human hearing and aural perception. The psychoacoustic modeling takes into account the frequency-dependent thresholds of human hearing, and a psychoacoustic phenomenon referred to as masking, whereby a strong frequency component close to one or more weaker frequency components tends to mask the weaker components, rendering them inaudible to a human listener. This makes it possible to omit the weaker frequency components when encoding audio data, and thereby achieve a higher degree of compression, without adversely affecting the perceived quality of the encoded audio data (bitstream 9). The masking data comprises a masking curve value for each frequency band of the frequency domain audio data 3. These masking curve values represent the level of signal masked by the human ear in each frequency band. Quantizer 6 uses this information to decide how best to use the available number of data bits to represent the frequency domain data of each frequency band of the input audio signal.
Controller 4 may implement a conventional low frequency compensation process (sometimes referred to herein as “lowcomp” compensation) to generate lowcomp parameter values for correcting the masking curve values for the low frequency bands. The corrected masking curve values are used to generate the signal-to-mask ratio value for each frequency component of the frequency-domain audio data 3. Low frequency compensation is a feature of the psychoacoustic model typically implemented during AC-3 (and Dolby Digital Plus) encoding of audio data. Lowcomp compensation improves the encoding of highly tonal low-frequency components (of the input audio data to be encoded) by preferentially reducing the mask in the relevant frequency region, and in consequence allocating more bits to the code words employed to encode such components.
Lowcomp compensation determines a lowcomp parameter for each low frequency band. The lowcomp parameter for each band is effectively subtracted from an “excitation” value (which is determined in a well-known manner) for the band, and the resulting difference values are used to determine the corrected masking curve values. Reducing the excitation value for a band (e.g., by subtracting a lowcomp parameter therefrom, or increasing the value of a lowcomp parameter that is subtracted therefrom) results in increasing the number of bits allocated to the encoded version of the audio in the band for the following reason. While the excitation value for a band is not necessarily equal to the final (corrected) mask value (which is effectively subtracted from the audio data value for the band), it is used in the calculation of the final mask value (the final mask value takes into account absolute hearing thresholds and potentially other wideband and/or banded adjustments). Since the number of coding bits allocated to audio in a band is greater if the “signal to mask” ratio for the band is greater, reducing the mask value for a band would increase the number of bits allocated to the encoded version of the audio in that band. Therefore, reducing the excitation value for a band generally leads to a reduced mask value for the band, and consequently, an increase in the number of allocated bits for that band.
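The chain of reasoning above (lower excitation, lower mask, higher signal-to-mask ratio) can be made concrete with a deliberately simplified Python sketch. The helper names `corrected_mask` and `signal_to_mask` are hypothetical, and the sketch assumes the final mask is simply the larger of the lowcomp-reduced excitation and an absolute hearing threshold, all in the same log-domain units; a real encoder applies further adjustments that are omitted here.

```python
def corrected_mask(excitation, lowcomp, hearing_threshold):
    """Assumed simplification: subtract lowcomp, then floor at the hearing threshold."""
    return max(excitation - lowcomp, hearing_threshold)


def signal_to_mask(psd, mask):
    """Log-domain signal-to-mask ratio: a larger value calls for more mantissa bits."""
    return psd - mask


# Illustrative values only (not taken from any real signal).
psd, excitation, threshold = 2500, 2000, 1000
smr_without = signal_to_mask(psd, corrected_mask(excitation, 0, threshold))
smr_with = signal_to_mask(psd, corrected_mask(excitation, 384, threshold))
# smr_with exceeds smr_without by 384, so the band would receive more bits
```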
We next describe in more detail the manner in which conventional lowcomp compensation would typically be performed by the psychoacoustic model (e.g., the model implemented by controller 4 of FIG. 1). Controller 4 would scan through the low frequency bands (in the range from 0 Hz to 2.05 kHz, at 48 kHz sampling frequency) to look for a steep (12 dB) increase in power spectral density (PSD) between the current frequency band and the following (higher frequency) band, which is one characteristic of a strong tonal component. In response to identifying a PSD in a low frequency band as being indicative of a strong tonal component, lowcomp compensation is applied to cause more bits to be allocated to the data employed to encode the identified strong low frequency tonal component.
It will be understood that in AC-3 and Dolby Digital Plus encoding, each component of the frequency-domain audio data 3 (i.e., the contents of each transform bin) has a floating point representation comprising a mantissa and an exponent. To simplify the calculation of the masking curve, the Dolby Digital family of coders uses only the exponents to derive the masking curve. Or, stated alternately, the masking curve depends on the transform coefficient exponent values but is independent of the transform coefficient mantissa values. Because the range of exponents is rather limited (generally, integer values from 0-24), the exponent values are mapped onto a PSD scale with a larger range (generally, integer values from 0-3072) for the purposes of computing the masking curve. Thus, the loudest frequency components (i.e., those with an exponent of 0) are mapped to a PSD value of 3072, while the softest frequency-domain data components (i.e., those with an exponent of 24) are mapped to a PSD value of 0.
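As a small illustration of this mapping, the Python sketch below assumes the mapping is linear between the two endpoints given above (exponent 0 maps to 3072, exponent 24 maps to 0), which gives a step of 128 PSD units per exponent value. The function name `exponent_to_psd` is illustrative.

```python
PSD_MAX = 3072   # PSD value for the loudest components (exponent 0)
EXP_MAX = 24     # exponent of the softest components
PSD_PER_EXP = PSD_MAX // EXP_MAX  # 128 PSD units per exponent step (linear mapping assumed)


def exponent_to_psd(exponent):
    """Map a transform coefficient exponent (0..24) onto the 0..3072 PSD scale."""
    if not 0 <= exponent <= EXP_MAX:
        raise ValueError("exponent must be in the range 0..24")
    return PSD_MAX - PSD_PER_EXP * exponent
```

Each exponent step halves the coefficient magnitude (about 6 dB), so under this assumption 128 PSD units correspond to roughly 6 dB. That is consistent with the 12 dB and 18 dB figures that this document associates with PSD values of 256 and 384.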
It is known that in conventional Dolby Digital (or Dolby Digital Plus) encoding, differential exponents (i.e., the difference between consecutive exponents) are coded instead of absolute exponents. The differential exponents can only take on one of five values: 2, 1, 0, −1, and −2. If a differential exponent outside this range is found, one of the exponents being subtracted is modified so that the differential exponent (after the modification) is within the noted range (this conventional method is known as “exponent tenting” or “tenting”). Tenting stage 10 of the FIG. 1 encoder generates tented exponents in response to the raw exponents asserted thereto, by performing such a tenting operation.
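One possible way to realize such a tenting operation is sketched below in Python. It is an assumption, not a description of tenting stage 10: the sketch only ever lowers exponents (so that the corresponding mantissas cannot overflow), using one forward pass and one backward pass. The name `tent_exponents` is illustrative.

```python
MAX_STEP = 2  # differential exponents are limited to the range -2..+2


def tent_exponents(exponents):
    """Return a copy in which adjacent exponents differ by at most MAX_STEP.

    Exponents are only ever lowered, never raised (an assumed design choice).
    """
    tented = list(exponents)
    # Forward pass: limit how far an exponent may rise above its lower neighbor.
    for i in range(1, len(tented)):
        tented[i] = min(tented[i], tented[i - 1] + MAX_STEP)
    # Backward pass: limit how far an exponent may sit above its upper neighbor.
    for i in range(len(tented) - 2, -1, -1):
        tented[i] = min(tented[i], tented[i + 1] + MAX_STEP)
    return tented


tented = tent_exponents([10, 0, 10])
# -> [2, 0, 2]; the differentials are now -2 and +2
```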
Consider an example of a typical implementation of lowcomp compensation in which the psychoacoustic model (e.g., the model implemented by controller 4 of FIG. 1) scans through the low frequency bands, with band “N+1” being the next band, and the current band, “N,” having lower frequency than the next band. The scan may be from the lowest frequency band until band number 22, and typically does not include the last band of a LFE (low-frequency effects) channel. If it is determined that the PSD value for band N+1 minus the PSD value for band N is equal to 256 (which is indicative of a steep increase (12 dB) in PSD from the current band, N, to the next (higher frequency) band, N+1), lowcomp compensation is performed by immediately reducing the excitation function calculation for the current band (i.e., reducing the excitation value for the band) by 18 dB. The excitation value for the band is reduced by subtracting a lowcomp parameter equal to 384 from the excitation value that would otherwise be determined for the band. This excitation value reduction is slowly backed out (e.g., by up to 3 dB per subsequent band).
For subsequent bands, i.e., bands higher in frequency than a band for which lowcomp is initially enabled, if it is determined that the difference in PSD between one band and the next band is less than 256, the lowcomp parameter (that is subtracted from the excitation value for the band) is either maintained at the same value as for the previous band or reduced to a lower value. Until it is first determined (during a scan through all the low frequency bands) that the difference in PSD between two adjacent bands is equal to 256, lowcomp compensation is not performed (i.e., a lowcomp parameter having the value zero is “subtracted” from excitation values for the bands).
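The scan described above can be sketched in Python as follows. This is a simplified illustration, not the conventional implementation itself. It takes 256 PSD units as 12 dB and 384 as 18 dB, as stated in the text, and derives 64 units for 3 dB by proportion. The text does not say when the parameter is held and when it is reduced, so the rule used here (reduce by 64 when the PSD falls toward the next band, otherwise hold) is an assumption, as are the name `lowcomp_values` and the `last_band` parameter.

```python
TRIGGER = 256      # PSD rise of 12 dB between band N and band N+1
LOWCOMP_MAX = 384  # 18 dB reduction applied when the trigger is met
DECAY = 64         # 3 dB back-off per band (assumed from 384 units == 18 dB)


def lowcomp_values(psd, last_band=22):
    """Return the lowcomp parameter for each scanned low frequency band."""
    lowcomp = 0  # no compensation until the trigger is first met
    values = []
    n_scan = min(last_band, len(psd) - 1)
    for n in range(n_scan):
        rise = psd[n + 1] - psd[n]
        if rise == TRIGGER:
            lowcomp = LOWCOMP_MAX                # steep rise: full reduction
        elif rise < 0:
            lowcomp = max(0, lowcomp - DECAY)    # assumed rule: back off when PSD falls
        # otherwise the previous value is maintained
        values.append(lowcomp)
    return values


values = lowcomp_values([1000, 1256, 1200, 1200, 1100])
# -> [384, 320, 320, 256]
```

Each returned value would then be subtracted from the excitation value of the corresponding band.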
While the conventional lowcomp process is beneficial for tonal signals with prominent low-frequency components, a drawback is that the 12 dB PSD difference criterion that triggers mask reduction is frequently met by a large number of non-tonal signals having low-frequency content. Audio data indicative of applause by a crowd is a well-known example of such a non-tonal signal, and will be referred to herein as representative of the type of non-tonal signal that is distinguished from a tonal signal in typical embodiments of the present invention. The inventors have recognized that redistributing coding bits from low to mid/high frequencies (relative to the coding bit distribution that would be employed in conventional AC-3 or E-AC-3 encoding with conventional lowcomp compensation) improves the perceived quality of applause and other non-tonal signals reproduced following the decoding of AC-3 (or E-AC-3) encoded versions of the signals, and thus that it would be desirable to disable lowcomp compensation of such non-tonal signals during AC-3 or E-AC-3 encoding of them (i.e., it would be desirable to switch lowcomp OFF during encoding of such signals). The inventors have also recognized that disabling lowcomp compensation during AC-3 (or E-AC-3) encoding of tonal signals having low frequency content (e.g., signals produced by pitch pipes) degrades the perceived quality of the tonal signals when they are reproduced following the decoding of AC-3 (or E-AC-3) encoded versions thereof.
Thus, the inventors have recognized that it would be desirable to implement an encoder that can adaptively apply low frequency compensation during encoding of audio signals having prominent low-frequency tonal components, but not during encoding of audio signals that do not have prominent low-frequency tonal components (e.g., applause signals, or other audio signals having low-frequency non-tonal content but not prominent tonal low-frequency content), and to do so in a manner that requires no decoder changes (i.e., in a manner allowing a conventional decoder to decode encoded audio that has been generated by the inventive encoder).
Some conventional audio encoding methods, in which mantissa bit assignment is based on the difference between signal spectrum and a masking curve, perform at least one masking value correction process, in addition to low frequency compensation, during generation of masking values for banded, frequency domain audio data to be encoded.
For example, some conventional audio encoders (e.g., AC-3 and E-AC-3 encoders) implement delta bit allocation, which is a provision for parametrically adjusting the masking curve for each audio channel to be encoded, in accordance with an additional improved psychoacoustic analysis. The encoder transmits additional bit stream codes designated as deltas, which convey differences between the masking curve employed and a default masking curve (i.e., the difference between the masking value determined by the default masking model at each frequency and the masking value determined by the improved masking model actually employed at the same frequency).
The delta bit allocation function is typically constrained to be a stair step function (e.g., ±6 dB steps up to ±18 dB). Each tread of the stair step corresponds to a masking level adjustment for an integral number of adjoining one-half Bark bands. Stair steps comprise a number of non-overlapping variable-length segments. The segments are run-length coded for transmission efficiency.
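A stair step adjustment of this kind can be sketched in Python as follows. The segment representation (an offset from the end of the previous segment, a length in bands, and a signed count of 6 dB steps), the name `apply_delta_segments`, and the conversion of 6 dB to 128 mask units (by proportion from the 256 units for 12 dB cited in this document) are all assumptions made for illustration, not the bitstream syntax.

```python
UNITS_PER_STEP = 128  # assumed: 6 dB expressed on the same scale as the PSD values
MAX_STEPS = 3         # +/-6 dB steps up to +/-18 dB


def apply_delta_segments(mask, segments):
    """Apply non-overlapping stair step segments to a banded masking curve.

    Each segment is (offset, length, steps): skip `offset` bands after the
    previous segment, then shift `length` bands by `steps` * 6 dB.
    """
    adjusted = list(mask)
    band = 0
    for offset, length, steps in segments:
        if steps == 0 or abs(steps) > MAX_STEPS:
            raise ValueError("steps must be a nonzero value in -3..3")
        band += offset
        for _ in range(length):
            adjusted[band] += steps * UNITS_PER_STEP
            band += 1
    return adjusted


adjusted = apply_delta_segments([100] * 6, [(1, 2, 1), (1, 1, -2)])
# -> [100, 228, 228, 100, -156, 100]
```

Because each segment is located relative to the end of the previous one, segments cannot overlap, and long flat treads cost only one (offset, length, steps) triple.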
A conventional application of delta bit allocation is the conventional BABNDNORM process for masking level correction. In the BABNDNORM process (an example of a masking value correction process), for perceptual bands number 29 and above (of the Bark frequency bands employed in AC-3 and Enhanced AC-3 encoding), the signal energy in each perceptual band used to derive the excitation function is scaled by a value proportional to the inverse of the perceptual band width. Because all perceptual bands below band 29 have unit bandwidth (i.e., include only a single frequency bin), there is no need to scale signal energies for bands below 29. At progressively higher frequencies, the excitation function and hence the masking threshold estimate is lowered. This increases bit allocation at higher frequencies, particularly in the coupling channel. Some audio encoders which implement AC-3 (or E-AC-3) encoding are configured to implement the BABNDNORM process as a step of the encoding.
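The band width scaling described above can be sketched in Python as follows. The sketch works on linear (not logarithmic) band energies, assumes a proportionality constant of 1, and uses the illustrative name `babndnorm_scale`; none of these details is taken from an actual encoder.

```python
def babndnorm_scale(band_energy, band_width, first_scaled_band=29):
    """Scale each band's linear energy by 1/width for bands at or above first_scaled_band.

    Bands below first_scaled_band are returned unchanged (they hold a single bin).
    """
    scaled = []
    for band, (energy, width) in enumerate(zip(band_energy, band_width)):
        if band >= first_scaled_band:
            scaled.append(energy / width)
        else:
            scaled.append(energy)
    return scaled


# Toy example with scaling starting at band 2 instead of band 29.
scaled = babndnorm_scale([8.0, 8.0, 8.0, 8.0], [1, 1, 2, 4], first_scaled_band=2)
# -> [8.0, 8.0, 4.0, 2.0]
```

Wider bands are attenuated more, which is why the excitation function, and hence the masking threshold estimate, falls progressively at higher frequencies.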
FIG. 5 is a graph of banded PSD (perceptual energy) values (the top curve) of banded, frequency domain audio data, a graph of scaled banded PSD values (the second curve from the top) generated by applying a conventional BABNDNORM process to the audio data, a graph of an excitation function (the third curve from the top) generated (e.g., by a conventional AC-3 or E-AC-3 encoder) for use in masking the audio data, and a graph of a scaled version of the excitation function (the bottom curve) generated (e.g., by a conventional AC-3 or E-AC-3 encoder) by applying a conventional BABNDNORM process to the excitation function. Each of the four curves is represented on a perceptual band (Bark frequency) scale. It is apparent that the top two curves begin to diverge from each other at band 29, and that the bottom two curves also begin to diverge from each other at band 29.
FIG. 6 is a graph of a frequency spectrum of an audio signal (the curve of FIG. 6 having widest dynamic range), a graph of a default masking curve for masking the audio signal (the second curve from the bottom), and a graph of a scaled version of the masking curve (the bottom curve) generated (e.g., by a conventional AC-3 or E-AC-3 encoder) by applying a conventional BABNDNORM process to the masking curve. It is apparent from FIG. 6 that at progressively higher frequencies, the BABNDNORM process lowers the masking curve by greater amounts.