Audio coding, or audio compression, algorithms are used to obtain compact digital representations of high-fidelity (i.e., wideband) audio signals for the purpose of efficient transmission and/or storage. The central objective in audio coding is to represent the signal with a minimum number of bits while achieving transparent signal reproduction, i.e., while generating output audio which cannot be humanly distinguished from the original input, even by a sensitive listener.
Advanced Audio Coding (“AAC”) is a wideband audio coding algorithm that exploits two primary coding strategies to dramatically reduce the amount of data needed to convey high-quality digital audio. Signal components that are “perceptually irrelevant” and can be discarded without a perceived loss of audio quality are removed. Further, redundancies in the coded audio signal are eliminated. Hence, efficient audio compression is achieved by a variety of perceptual audio coding and data compression tools, which are combined in the MPEG-4 AAC specification. The MPEG-4 AAC standard incorporates MPEG-2 AAC, forming the basis of the MPEG-4 audio compression technology for data rates above 32 kbps per channel. Additional tools increase the effectiveness of AAC at lower bit rates, and add scalability or error resilience characteristics. These additional tools extend AAC into its MPEG-4 incarnation (ISO/IEC 14496-3, Subpart 4).
AAC is referred to as a perceptual audio coder, or lossy coder, because it is based on a listener perceptual model, i.e., what a listener can actually hear, or perceive. The two basic bitrate modes for audio coding, such as AAC, are CBR (constant bitrate) and VBR (variable bitrate). A related mode is ABR (average bitrate). Unlike CBR, in which the bitrate is strictly constant at each instant, ABR allows a small variation in bitrate from instant to instant while maintaining a certain average bitrate for the entire track, thereby resulting in a reasonably predictable size for the finished files.
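The difference between CBR and ABR allocation described above can be illustrated with a minimal sketch. The function names, the per-frame "demand" measure, and the `max_deviation` parameter are hypothetical and chosen only for illustration; they are not taken from the AAC specification or from any particular encoder.

```python
def cbr_allocation(num_frames, bits_per_frame):
    """CBR: every frame receives exactly the same number of bits."""
    return [float(bits_per_frame)] * num_frames


def abr_allocation(demands, target_bits_per_frame, max_deviation=0.1):
    """ABR sketch: vary bits per frame around a target while keeping the mean.

    demands: hypothetical per-frame difficulty scores (any positive scale).
    max_deviation: assumed maximum relative deviation from the target.
    """
    mean = sum(demands) / len(demands)
    spread = max(abs(d - mean) for d in demands)
    if spread == 0:
        # All frames are equally difficult, so ABR degenerates to CBR.
        return [float(target_bits_per_frame)] * len(demands)
    # Deviations from the mean sum to zero, so the average stays on target.
    return [
        target_bits_per_frame * (1 + max_deviation * (d - mean) / spread)
        for d in demands
    ]


if __name__ == "__main__":
    demands = [1.0, 2.0, 3.0, 2.0]
    print(cbr_allocation(len(demands), 1000))
    print(abr_allocation(demands, 1000))
```

In this sketch the harder frames receive more bits and the easier frames fewer, but the total over the track matches the CBR total, which is why the finished file size remains predictable.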
A CBR codec is constant in bitrate along an audio time signal, but variable in sound quality. For example, for stereo encoding at a bitrate of 96 kb/s, an encoded speech track, which is “easy” to encode due to its relatively narrow frequency bandwidth, sounds indistinguishable from the original source of the track. However, noticeable artifacts could be heard in similarly encoded complex classical music, which is “difficult” to encode due to a typically broad frequency bandwidth and, therefore, more data to encode. CBR is important to bitrate-critical applications, such as audio streaming, but the variable sound quality produced makes CBR undesirable for other, offline applications.
A VBR codec aims to produce audio of constant quality by using as many bits for encoding as are needed to meet a sound quality target. In other words, the bitrate varies depending on the difficulty associated with encoding a given audio track, with a goal of constant perceived sound quality along the entirety of the audio stream. With VBR, the sound quality target is typically defined by the Noise-to-Masking Ratio (“NMR”), which is calculated for each block of audio data based on the psychoacoustic model used in the coder. Because the coding bitrate of a VBR codec may vary significantly, VBR is not always suitable for bitrate-critical applications.
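As a minimal sketch of the NMR measure mentioned above, the ratio can be expressed in decibels as ten times the base-10 logarithm of the quantization noise energy divided by the masking threshold energy. The function names and the choice to average the per-band ratios over a block are assumptions made for illustration, not the computation of any specific coder.

```python
import math


def nmr_db(noise_energy, mask_threshold_energy):
    """Noise-to-Masking Ratio of one band, in dB.

    Negative values mean the noise lies below the masking threshold.
    """
    return 10.0 * math.log10(noise_energy / mask_threshold_energy)


def block_nmr_db(noise_energies, mask_threshold_energies):
    """Assumed block-level NMR: mean of the per-band linear ratios, in dB."""
    ratios = [n / m for n, m in zip(noise_energies, mask_threshold_energies)]
    return 10.0 * math.log10(sum(ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical energies for three bands of one block.
    noise = [1.0, 0.5, 2.0]
    mask = [10.0, 5.0, 20.0]
    print(nmr_db(noise[0], mask[0]))
    print(block_nmr_db(noise, mask))
```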
Simultaneous masking is a frequency-domain phenomenon in which a low-level signal, e.g., a narrowband noise (the maskee), can be made inaudible by a simultaneously occurring stronger signal (the masker). A masking threshold can be measured below which any signal will not be audible. The masking threshold depends on the sound pressure level (SPL) and the frequency of the masker, and on the characteristics of the masker and maskee. If the source signal consists of many simultaneous maskers, a global masking threshold can be computed that describes the threshold of just noticeable distortions as a function of frequency. The most common way of calculating the global masking threshold is based on the high-resolution, short-term amplitude spectrum of the audio or speech signal.
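The idea of combining several maskers into a global masking threshold can be sketched as follows. This is an illustrative simplification, not the psychoacoustic model of the AAC standard: it assumes a triangular spreading function on a Bark (critical band) scale, placeholder slope and offset values, and power-domain addition of the individual thresholds together with a flat threshold in quiet. All names and default values are hypothetical.

```python
import math


def masker_threshold_db(masker_spl_db, masker_bark, bark,
                        offset_db=10.0, lower_slope=27.0, upper_slope=10.0):
    """Threshold (dB) produced at position `bark` by a single masker.

    offset_db and the two slopes (dB per Bark) are placeholder values.
    """
    dz = bark - masker_bark
    slope = upper_slope if dz >= 0 else lower_slope
    return masker_spl_db - offset_db - slope * abs(dz)


def global_threshold_db(maskers, bark, quiet_db=0.0):
    """Combine all maskers and the threshold in quiet by adding powers."""
    power = 10.0 ** (quiet_db / 10.0)
    for spl_db, z in maskers:
        power += 10.0 ** (masker_threshold_db(spl_db, z, bark) / 10.0)
    return 10.0 * math.log10(power)


def is_audible(signal_spl_db, maskers, bark):
    """A component is treated as audible if it exceeds the global threshold."""
    return signal_spl_db > global_threshold_db(maskers, bark)


if __name__ == "__main__":
    maskers = [(70.0, 10.0), (60.0, 14.0)]  # (SPL in dB, position in Bark)
    for z in (8.0, 10.0, 12.0, 14.0):
        print(z, round(global_threshold_db(maskers, z), 2))
```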
A coder based on the psychoacoustic model encodes only those audio signal components that lie above the masking threshold, one block of audio at a time. Therefore, if distortion (typically referred to as quantization noise), which is inherent to an amplitude quantization process, is under the masking threshold, a typical human cannot hear the noise. A sound quality target is based on a subjective perceptual quality scale (e.g., from 0 to 5, with 5 being best quality). From an audio quality target on this perceptual quality scale, a noise profile, i.e., an offset from the applicable masking threshold, is determinable. This noise profile represents the level at which quantization noise can be masked, while achieving the desired quality target. From the noise profile, an appropriate coding quantization step is determinable. The quantization step is directly related to the coding bitrate.
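The chain from quality target to noise profile to quantization step to bitrate can be sketched under simplifying assumptions. The sketch below assumes a uniform quantizer, whose noise power is approximately the square of the step size divided by 12, and a hypothetical linear mapping from the 0 to 5 quality scale to an offset in dB relative to the masking threshold. The function names and the default offsets are illustrative only and do not come from the text above.

```python
import math


def offset_db_for_quality(quality, worst_offset_db=12.0, best_offset_db=-6.0):
    """Hypothetical linear map from a 0..5 quality target to a noise offset (dB)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be in the range 0..5")
    return worst_offset_db + (best_offset_db - worst_offset_db) * quality / 5


def step_size(mask_threshold_power, offset_db):
    """Step of a uniform quantizer whose noise power (step**2 / 12)
    equals the masking threshold shifted by offset_db."""
    allowed_noise = mask_threshold_power * 10.0 ** (offset_db / 10.0)
    return math.sqrt(12.0 * allowed_noise)


def bits_per_coefficient(peak_amplitude, step):
    """Approximate bits needed to index the levels spanning +/- peak_amplitude."""
    levels = 2.0 * peak_amplitude / step
    return max(0.0, math.log2(levels))


if __name__ == "__main__":
    mask_power = 1e-4  # hypothetical masking threshold power for one band
    for q in (1, 3, 5):
        step = step_size(mask_power, offset_db_for_quality(q))
        print(q, round(step, 5), round(bits_per_coefficient(1.0, step), 2))
```

In this sketch a higher quality target yields a lower noise offset, hence a smaller step and more bits per coefficient; halving the step adds one bit per coefficient.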
A practical problem with a VBR codec is that the bitrate used to encode some tracks will be either too high (i.e., bits wasted) or too low (i.e., diminished perceptual quality). This phenomenon is due in part to the nature of the track, i.e., the ease or difficulty of encoding the track. However, it is mainly due to the fact that current technology has not achieved a perfect psychoacoustic model, because the understanding of human hearing is still limited. A consequence is inaccurate masking thresholds for targeting sound quality. In addition, the perceived sound quality is not solely dependent on the masking thresholds. Hence, even if a perfect psychoacoustic model existed for generating accurate masking thresholds, the sound quality target derived from the masking threshold (e.g., NMR) still could not perfectly match what is actually perceived.
Based on the foregoing, there is room for improvement in audio coding techniques.
The techniques described in this section are techniques that could be pursued, but not necessarily techniques that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the techniques described in this section qualify as prior art merely by virtue of their inclusion in this section.