The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it is not to be assumed that any of the approaches described in this section qualify as prior art, merely by virtue of their inclusion in this section.
Audio coding, or audio compression, algorithms are used to obtain compact digital representations of high-fidelity (i.e., wideband) audio signals for the purpose of efficient transmission and/or storage. A central objective in audio coding is to represent the signal with a minimum number of bits while achieving transparent signal reproduction, i.e., while generating output audio which cannot be humanly distinguished from the original input, even by a sensitive listener.
Advanced Audio Coding (“AAC”) is a wideband audio coding algorithm that exploits two primary coding strategies to dramatically reduce the amount of data needed to convey high-quality digital audio. Signal components that are “perceptually irrelevant” and can be discarded without a perceived loss of audio quality are removed. Further, redundancies in the coded audio signal are eliminated. Hence, efficient audio compression is achieved by a variety of perceptual audio coding and data compression tools, which are combined in the MPEG-4 AAC specification. The MPEG-4 AAC standard incorporates MPEG-2 AAC, forming the basis of the MPEG-4 audio compression technology for data rates above 32 kbps per channel. Additional tools increase the effectiveness of AAC at lower bit rates, and add scalability or error resilience characteristics. These additional tools extend AAC into its MPEG-4 incarnation (ISO/IEC 14496-3, Subpart 4).
AAC is referred to as a perceptual audio coder, or lossy coder, because it is based on a listener perceptual model, i.e., what a listener can actually hear, or perceive. A common problem in perceptual audio coding is bitrate control. According to the concept of Perceptual Entropy, the information content of an audio signal varies dependent on the signal properties. Thus, the required bitrate to encode this information generally varies over time. For some applications bitrate variations are not an issue. However, for many applications a firm control of the instantaneous and/or average bitrate is desired.
The three basic bitrate modes for audio coding are CBR (constant bitrate), ABR (average bitrate) and VBR (variable bitrate). CBR is important to bitrate-critical applications, such as audio streaming. Unlike CBR, in which bitrates are strictly constant at each instance, ABR allows a variation of bitrates for each instance while maintaining a certain average bitrate for the entire track, thereby resulting in a reasonably predictable size to the finished files. As the name indicates, VBR allows the bitrate to vary significantly; however, the sound quality is consistent.
A CBR codec is constant in bitrate along an audio time signal, but is typically variable in sound quality. For example, for stereo encoding at a bitrate of 96 kb/s, an encoded speech track, which is “easy” to encode due to its relatively narrow frequency bandwidth, sounds indistinguishable from the original source of the track. However, noticeable artifacts could be heard in similarly encoded complex classical music, which is “difficult” to encode due to a typically broad frequency bandwidth and, therefore, more data to encode.
Simultaneous Masking is a frequency domain phenomenon where a low level signal, e.g., a narrow-band noise (the maskee) can be made inaudible by a simultaneously occurring stronger signal (the masker). A masked threshold can be measured below which any signal will not be audible. The masked threshold depends on the sound pressure level (SPL) and the frequency of the masker, and on the characteristics of the masker and maskee. If the source signal consists of many simultaneous maskers, a global masked threshold can be computed that describes the threshold of just noticeable distortions as a function of frequency. The most common way of calculating the global masked threshold is based on the high resolution short term energy spectrum of the audio or speech signal.
Coding audio based on a psychoacoustic model encodes audio signals above a masked threshold block by block. Therefore, if distortion (typically referred to as quantization noise), which is inherent to an amplitude quantization process, is under the masked threshold, a typical human cannot hear the noise. A sound quality target is based on a subjective perceptual quality scale (e.g., from 0-5, with 5 being best quality). From an audio quality target on this perceptual quality scale, a noise profile, i.e., an offset from the applicable masked threshold, is determinable. This noise profile represents the level at which quantization noise can be masked, while achieving the desired quality target. From the noise profile, appropriate quantization step sizes are determinable. The quantization step sizes are a significant determining factor of the coding bitrate.
The more bits allocated for encoding a block of audio, the less noise may be generated during the quantization process. However current techniques for estimating how many bits to allocate are inefficient. For example, current techniques estimate audio quality based on an erroneous assumption of the noise-to-audio quality relationship. As another example, current techniques take into account all possible scale factor values at each scale factor band, which requires a significant number of calculations.
Based on the foregoing, there is room for improvement in estimating scale factor values when encoding audio data.