The present invention relates generally to audio compression techniques, and more particularly to audio compression techniques which utilize psychoacoustic models or other types of perceptual models.
Perceptual audio coding techniques have been proposed for use in numerous digital communication systems, such as terrestrial AM or FM in-band on-channel (IBOC) digital audio broadcasting (DAB) systems, satellite broadcasting systems, and Internet audio streaming systems. Perceptual audio coding devices, such as the perceptual audio coder (PAC) described in D. Sinha, J. D. Johnston, S. Dorward and S. R. Quackenbush, "The Perceptual Audio Coder," in Digital Audio, Section 42, pp. 42-1 to 42-18, CRC Press, 1998, which is incorporated by reference herein, perform audio coding using a noise allocation strategy whereby for each audio frame the bit requirement is computed based on a psychoacoustic model. PACs and other audio coding devices incorporating similar compression techniques are inherently packet-oriented, i.e., audio information for a fixed interval (frame) of time is represented by a variable bit length packet. Each packet includes certain control information followed by a quantized spectral/subband description of the audio frame. For stereo signals, the packet may contain the spectral description of two or more audio channels separately or differentially, as a center channel and side channels (e.g., a left channel and a right channel).
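As an illustration of the differential (sum/difference) channel representation mentioned above, the following minimal Python sketch converts one stereo frame to a sum ("mid") channel and a difference ("side") channel and back. The 0.5 scaling and the function names `lr_to_ms` and `ms_to_lr` are assumptions chosen for illustration, not the PAC bitstream definition.

```python
import numpy as np

def lr_to_ms(left, right):
    """Sum/difference transform of one stereo frame (scaling of 0.5 is assumed)."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mid = 0.5 * (left + right)    # sum channel, carries the common (center) content
    side = 0.5 * (left - right)   # difference channel, small when L and R are similar
    return mid, side

def ms_to_lr(mid, side):
    """Inverse transform: recovers the left and right channels exactly."""
    return mid + side, mid - side
```

When the two channels are nearly identical, the side channel is close to zero and typically needs few bits, which is why differential coding can pay off for stereo material.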
PAC encoding as described in the above-cited reference may be viewed as a perceptually-driven adaptive filter bank or transform coding algorithm. It incorporates advanced signal processing and psychoacoustic modeling techniques to achieve a high level of signal compression. More particularly, PAC encoding uses a signal adaptive switched filter bank which switches between a Modified Discrete Cosine Transform (MDCT) and a wavelet transform to obtain a compact description of the audio signal. The filter bank output is quantized using non-uniform vector quantizers. For the purpose of quantization, the filter bank outputs are grouped into so-called "coderbands" so that quantizer parameters, e.g., quantizer step sizes, may be independently chosen for each coderband. These step sizes are generated in accordance with a psychoacoustic model. Quantized coefficients are further compressed using an adaptive Huffman coding technique. PAC employs, e.g., a total of 15 different codebooks, and for each coderband, the best codebook may be chosen independently. For stereo and multichannel audio material, sum/difference or other forms of multichannel combinations may be encoded.
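The role of coderbands can be sketched as follows: each contiguous group of filter-bank outputs shares one step size, and different groups may use different step sizes. This minimal Python sketch uses a uniform scalar quantizer as a deliberate simplification (PAC itself uses non-uniform vector quantizers), and the band layout passed in `band_edges` is a hypothetical parameter, not PAC's actual coderband table.

```python
import numpy as np

def quantize_coderbands(coeffs, band_edges, step_sizes):
    """Quantize filter-bank outputs with one step size per coderband.

    Coderband k spans coeffs[band_edges[k]:band_edges[k + 1]].
    Uniform scalar rounding stands in for PAC's vector quantizers.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    if len(step_sizes) != len(band_edges) - 1:
        raise ValueError("need exactly one step size per coderband")
    q = np.zeros(len(coeffs), dtype=int)
    for k, step in enumerate(step_sizes):
        lo, hi = band_edges[k], band_edges[k + 1]
        q[lo:hi] = np.round(coeffs[lo:hi] / step).astype(int)
    return q

def dequantize_coderbands(q, band_edges, step_sizes):
    """Reconstruct coefficients; the error in band k is at most step_sizes[k] / 2."""
    x = np.zeros(len(q), dtype=float)
    for k, step in enumerate(step_sizes):
        lo, hi = band_edges[k], band_edges[k + 1]
        x[lo:hi] = np.asarray(q[lo:hi], dtype=float) * step
    return x
```

A larger step size in a coderband yields smaller integers to entropy-code, and therefore fewer bits, at the cost of more quantization noise in that band. That trade-off is what the psychoacoustic model steers.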
PAC encoding formats the compressed audio information into a packetized bitstream using a block sampling algorithm. At a 44.1 kHz sampling rate, each packet corresponds to 1024 input samples from each channel, regardless of the number of channels. The Huffman encoded filter bank outputs, codebook selection, quantizers and channel combination information for one 1024 sample block are arranged in a single packet. Although the size of the packet corresponding to each 1024 input audio sample block is variable, a long-term constant average packet length may be maintained as will be described below.
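The arithmetic behind a constant long-term average packet length is simple. Each packet covers 1024 samples per channel, so at 44.1 kHz a packet spans 1024 / 44100 ≈ 23.2 ms, and a target bit rate R fixes the average packet size at R × 1024 / 44100 bits. The sketch below computes this. The 96 kbit/s figure in the usage note is only an example value, not one taken from the text.

```python
FRAME_SAMPLES = 1024  # input samples per channel per packet, as stated in the text

def frame_duration_s(sample_rate_hz, frame_samples=FRAME_SAMPLES):
    """Time interval represented by one packet, in seconds."""
    return frame_samples / sample_rate_hz

def average_packet_bits(bit_rate_bps, sample_rate_hz, frame_samples=FRAME_SAMPLES):
    """Average packet length (bits) that sustains a constant long-term bit rate."""
    return bit_rate_bps * frame_duration_s(sample_rate_hz, frame_samples)
```

For example, at 96 kbit/s and 44.1 kHz the average packet length is about 2229 bits. Individual packets may be longer or shorter, as long as the running average stays on target.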
Depending on the application, various additional information may be added to the first frame or to every frame. For unreliable transmission channels, such as those in DAB applications, a header is added to each frame. This header contains critical PAC packet synchronization information for error recovery and may also contain other useful information such as sample rate, transmission bit rate, audio coding modes, etc. The critical control information is further protected by repeating it in two consecutive packets.
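The repetition of critical control information in two consecutive packets can be illustrated with a toy packet structure. In this Python sketch each packet carries its own control information plus a copy of the previous packet's, so the control information for a lost packet can still be recovered from its successor. The `Packet` fields and the use of a sequence number in place of the synchronization header are assumptions for illustration, not the PAC header format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Packet:
    seq: int                      # stands in for the synchronization/header field
    control: dict                 # critical control information for this frame
    prev_control: Optional[dict]  # repeated copy of the previous frame's control info
    payload: bytes                # encoded audio data (opaque here)

def packetize(frames: List[Tuple[dict, bytes]]) -> List[Packet]:
    """Build packets from (control, payload) pairs, one pair per audio frame."""
    packets, prev = [], None
    for seq, (control, payload) in enumerate(frames):
        packets.append(Packet(seq, control, prev, payload))
        prev = control
    return packets

def recover_control(received: Dict[int, Packet], seq: int) -> Optional[dict]:
    """Return the control info for frame `seq` from whichever copy arrived."""
    if seq in received:
        return received[seq].control
    nxt = received.get(seq + 1)
    return nxt.prev_control if nxt is not None else None
```

Control information is lost only if two consecutive packets are lost. That is the protection the repetition buys on an unreliable channel.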
It is clear from the above description that the PAC bit demand depends primarily on the quantizer step sizes, as determined in accordance with the psychoacoustic model. However, due to the use of Huffman coding, it is generally not possible to predict the precise bit demand in advance, i.e., prior to the quantization and Huffman coding steps, and the bit demand varies from frame to frame. Conventional PAC encoders therefore utilize a buffering mechanism and a rate loop to meet long-term bit rate constraints. The size of the buffer in the buffering mechanism is determined by the allowable system delay.
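A buffering mechanism of this kind is often modeled as a "bit reservoir". Each frame earns the average packet length, unused bits are saved for later frames, and the saved amount is capped by a capacity that follows from the allowable delay (roughly delay × bit rate). The following Python sketch is a simplified model under those assumptions. The class name and methods are hypothetical and are not taken from a PAC implementation.

```python
class BitReservoir:
    """Simplified buffer-control model for a variable-length packet stream."""

    def __init__(self, mean_bits, capacity):
        self.mean_bits = mean_bits  # average packet length per frame
        self.capacity = capacity    # max saved bits, set by the allowable delay
        self.fullness = 0           # bits saved from earlier frames

    def grant(self, requested):
        """Return the maximum number of bits the current frame may use."""
        return min(requested, self.mean_bits + self.fullness)

    def commit(self, used):
        """Record the bits actually used; savings beyond capacity are dropped
        (a real stream would pad them with fill bits; assumption)."""
        available = self.mean_bits + self.fullness
        if used > available:
            raise ValueError("frame exceeded the bits available to it")
        self.fullness = min(self.capacity, available - used)
```

A larger capacity lets the coder absorb longer bursts of demanding material, but only at the cost of more end-to-end delay. This is why the buffer size is tied to the allowable system delay.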
In conventional PAC bit allocation, the encoder issues a request to a buffer control mechanism for a certain number of bits for a particular audio frame. Depending upon the state of the buffer and the average bit rate, the buffer control mechanism then returns the maximum number of bits which can actually be allocated to the current frame. It should be noted that this bit assignment can be significantly lower than the initial bit allocation request. In that case, it may not be possible to encode the current frame at the accuracy needed for perceptually transparent coding, i.e., the accuracy implied by the initial psychoacoustic model step sizes. It is the function of the rate loop to adjust the step sizes so that the bit demand with the modified step sizes is less than, and close to, the actual bit allocation.
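A rate loop can be sketched as an iteration that coarsens the step sizes until the bit demand fits the granted allocation. In this Python sketch, `estimate_bits` is a hypothetical stand-in for the real quantization-plus-Huffman cost, since the actual demand is only known after coding. Scaling all coderband step sizes by a common factor is one simple policy assumed here, not necessarily the one PAC uses. A small factor keeps the final demand close to the allocation.

```python
import numpy as np

def estimate_bits(coeffs, steps, band_edges):
    """Hypothetical bit-demand model standing in for quantization + Huffman coding."""
    coeffs = np.asarray(coeffs, dtype=float)
    bits = 0.0
    for k, step in enumerate(steps):
        q = np.round(coeffs[band_edges[k]:band_edges[k + 1]] / step)
        bits += np.sum(1.0 + 2.0 * np.log2(1.0 + np.abs(q)))  # >= 1 bit per value
    return bits

def rate_loop(coeffs, steps, band_edges, allocation, factor=2 ** 0.25, max_iter=64):
    """Coarsen all step sizes uniformly until the bit demand fits the allocation."""
    coeffs = np.asarray(coeffs, dtype=float)
    steps = np.asarray(steps, dtype=float)
    for _ in range(max_iter):
        demand = estimate_bits(coeffs, steps, band_edges)
        if demand <= allocation:
            return steps, demand          # modified step sizes and their bit demand
        steps = steps * factor            # coarser quantization -> fewer bits
    raise RuntimeError("allocation not reachable within max_iter iterations")
```

If the first estimate already fits, the psychoacoustic step sizes are returned unchanged. Otherwise the returned steps are coarser than the model requested, which is exactly the perceptual compromise described above.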
Despite the above-described advances provided by PAC coding, a need remains for further improvements in techniques for digital audio compression, so as to provide enhanced performance capabilities in DAB systems and other digital audio compression applications. In all of these applications, one generally strives to deliver the best audio playback quality given the bandwidth constraint. Conventional audio coding techniques such as PAC attempt to maximize audio quality for a wide range of audio signals. For non-real-time applications it is possible to tune the encoder separately for each audio track so that playback quality is maximized. Such tuning can significantly enhance the playback quality. However, in digital broadcasting and other real-time applications it is generally not possible to change the encoder "on the fly." As a result, given the richness and diversity of available audio material, the playback quality is somewhat compromised when a single psychoacoustic model is used for all of the different types of available audio material. More particularly, since different types of audio material, such as rock, jazz, classical, voice, etc., can have significantly different characteristics, the typical conventional approach of applying a single psychoacoustic model to all types of audio material inevitably results in less than optimal encoding performance for one or more particular types of audio material.
Another problem with conventional PAC coding relates to the audio processor which typically precedes the PAC audio encoder in a DAB system or other type of system. The audio processor performs processing functions such as attempting to reduce the dynamic range, stereo separation or bandwidth of an audio signal to be encoded. Like the PAC encoder itself, the settings or other parameters of the audio processor are typically not optimized for particular types of audio material in real-time applications.
A need therefore exists for a technique for preclassification of audio material so as to facilitate determination of an appropriate psychoacoustic model, audio processor setting or other coding-related parameter for use in perceptual audio coding of such material.
The present invention provides methods and apparatus for preclassification of audio material in digital audio compression applications. Advantageously, the invention ensures that appropriate psychoacoustic models, audio processor settings or other coding-related parameters are used for particular types of audio material, and thus improves the playback quality associated with the audio compression process.
In accordance with one aspect of the invention, audio tracks or other portions of a particular type of audio material to be encoded are analyzed to determine a value of at least one coding-related parameter suitable for providing a desired level of audio playback quality, e.g., an optimal encoding of the particular type of audio material. When a given portion of the particular type of audio material is to be encoded for transmission in a perceptual audio coder of a communication system, the value of the coding-related parameter is identified and then utilized in conjunction with the encoding of the given portion. The given portion of the particular type of audio material may be analyzed to determine the value of the coding-related parameter prior to encoding of the given portion in the perceptual audio coder. As another example, the given portion of the particular type of audio material may be analyzed to determine the value of the coding-related parameter at least in part during the encoding of the given portion in the perceptual audio coder.
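The identify-then-use step can be sketched as a table lookup keyed by material type, with two paths: a type determined before encoding, or a type determined by analysis while encoding proceeds. In this Python sketch, the material types, model names, and processor presets in `PARAMS_BY_TYPE` are placeholders, and `analyze` is a hypothetical callable. None of them are taken from the invention's actual parameter sets.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class CodingParams:
    model_name: str        # which psychoacoustic model to use (placeholder names)
    processor_preset: str  # audio-processor setting (placeholder names)

# Placeholder table, assumed to be built offline by analyzing each material type.
PARAMS_BY_TYPE: Dict[str, CodingParams] = {
    "classical": CodingParams("model_tonal", "wide_dynamics"),
    "rock": CodingParams("model_dense", "compressed"),
    "voice": CodingParams("model_speech", "narrowband"),
}
DEFAULT_PARAMS = CodingParams("model_generic", "default")

def select_params(track_id: str,
                  preclassified: Dict[str, str],
                  analyze: Callable[[str], str]) -> CodingParams:
    """Identify the coding parameters for one track.

    A type already in `preclassified` (analysis done before encoding) is used
    directly; otherwise `analyze` classifies the track, e.g. during encoding.
    """
    material_type = preclassified.get(track_id)
    if material_type is None:
        material_type = analyze(track_id)
        preclassified[track_id] = material_type  # reuse on later broadcasts
    return PARAMS_BY_TYPE.get(material_type, DEFAULT_PARAMS)
```

Falling back to a default parameter set keeps the encoder usable for material whose type has no tuned entry.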
The coding-related parameter in an illustrative embodiment comprises a psychoacoustic model specified at least in part as a combination of one or more of a tone masking noise ratio, a noise masking tone ratio, and a frequency spreading function. The value of the coding-related parameter in this case may be determined at least in part based on analysis which includes a determination of at least one of an average spectral flatness measure, an average energy entropy measure, and a coding criticality measure.
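Two of the analysis measures named above have widely used textbook forms, sketched below in Python. The spectral flatness measure is the ratio of the geometric mean to the arithmetic mean of the power spectrum: it is near 0 for tonal frames and near 0.56 for white noise. The energy entropy is computed here as the entropy of the energy distribution over sub-blocks of a frame, which is one common definition and is assumed here. The frame length and the 8 sub-blocks are illustrative choices. The coding criticality measure is not defined in this section and is not sketched, nor is any mapping from these averages to particular tone-masking-noise or noise-masking-tone values.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum (0..1)."""
    power = np.abs(np.fft.rfft(np.asarray(frame, dtype=float))) ** 2 + eps
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def energy_entropy(frame, n_sub=8, eps=1e-12):
    """Entropy in bits of the energy distribution over n_sub sub-blocks."""
    sub = np.array_split(np.asarray(frame, dtype=float), n_sub)
    energies = np.array([np.sum(s ** 2) for s in sub]) + eps
    p = energies / energies.sum()
    return -np.sum(p * np.log2(p))   # log2(n_sub) when energy is evenly spread

def average_features(signal, frame_len=1024):
    """Average spectral flatness and energy entropy over non-overlapping frames."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        raise ValueError("signal shorter than one frame")
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    sfm = np.mean([spectral_flatness(f) for f in frames])
    ent = np.mean([energy_entropy(f) for f in frames])
    return sfm, ent
```

Low average flatness suggests strongly tonal material (e.g., much classical music), and low energy entropy suggests transient, attack-heavy material. Either trait could plausibly favor a different psychoacoustic model.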
In accordance with a further aspect of the invention, the value of the coding-related parameter may comprise a setting of an audio processor utilized to process the given portion of the particular type of audio material prior to encoding the given portion in the perceptual audio coder. In this case, the value of the coding-related parameter may be determined based at least in part on an undercoding measure generated by analyzing at least part of the given portion of the particular type of audio material. Again, this analysis can be performed prior to or during the encoding of the audio material.
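The text does not define the undercoding measure. One plausible reading, assumed here purely for illustration, is the average relative shortfall between the bits the psychoacoustic model requested and the bits the buffer control actually granted. This Python sketch computes that quantity and maps it to a hypothetical audio-processor preset through thresholds. The thresholds and preset names are placeholders.

```python
def undercoding_measure(requested, granted):
    """Hypothetical measure: mean relative bit shortfall per frame (0 = none)."""
    if len(requested) != len(granted) or not requested:
        raise ValueError("need matching, non-empty per-frame bit lists")
    shortfalls = [max(0.0, (r - g) / r) if r > 0 else 0.0
                  for r, g in zip(requested, granted)]
    return sum(shortfalls) / len(shortfalls)

def processor_setting(measure, thresholds=((0.05, "light"), (0.20, "moderate"))):
    """Map the measure to a placeholder preset; heavier processing when starved."""
    for limit, preset in thresholds:
        if measure <= limit:
            return preset
    return "heavy"
```

The rationale for this direction is that material which chronically exceeds its bit allocation might benefit from stronger pre-processing, for example reduced bandwidth or dynamic range. That pre-processing leaves the coder less to represent, rather than letting the rate loop introduce audible quantization noise.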
The invention can be utilized in a wide variety of digital audio compression applications, including, for example, AM or FM in-band on-channel (IBOC) digital audio broadcasting (DAB) systems, satellite broadcasting systems, Internet audio streaming, systems for simultaneous delivery of audio and data, etc.