The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it is not to be assumed that any of the approaches described in this section qualify as prior art, merely by virtue of their inclusion in this section.
Digital media coding, or digital media compression, algorithms are used to obtain compact digital representations of high-fidelity (i.e., wideband) signals for the purpose of efficient transmission and/or storage. A central objective in (e.g. audio) coding is to represent the signal with a minimum number of bits while achieving transparent signal reproduction, i.e., while generating output digital media which cannot be humanly distinguished from the original input, even by a sensitive listener.
Advanced Audio Coding (“AAC”) is a wideband audio coding algorithm that exploits two primary coding strategies to dramatically reduce the amount of data needed to convey high-quality digital audio. Signal components that are “perceptually irrelevant” and can be discarded without a perceived loss of audio quality are removed. Further, redundancies in the coded audio signal are eliminated. Hence, efficient audio compression is achieved by a variety of perceptual audio coding and data compression tools, which are combined in the MPEG-4 AAC specification. The MPEG-4 AAC standard incorporates MPEG-2 AAC, forming the basis of the MPEG-4 audio compression technology for data rates above 32 kbps per channel. Additional tools increase the effectiveness of AAC at lower bit rates, and add scalability or error resilience characteristics. These additional tools extend AAC into its MPEG-4 incarnation (ISO/IEC 14496-3, Subpart 4).
AAC is referred to as a perceptual audio coder, or lossy coder, because it is based on a listener perceptual model, i.e., what a listener can actually hear, or perceive. A common problem in perceptual audio coding is bitrate control. According to the concept of Perceptual Entropy, the information content of an audio signal varies dependent on the signal properties. Thus, the required bitrate to encode this information generally varies over time. For some applications bitrate variations are not an issue. However, for many applications a firm control of the instantaneous and/or average bitrate is desired.
The three basic bitrate modes for audio coding are CBR (constant bitrate), ABR (average bitrate) and VBR (variable bitrate). CBR is important to bitrate-critical applications, such as audio streaming. Unlike CBR, in which bitrates are strictly constant at each instance, ABR allows a variation of bitrates for each instance while maintaining a certain average bitrate for the entire track, thereby resulting in a reasonably predictable size to the finished files. Although VBR allows the bitrate to vary significantly, the sound quality is consistent.
A CBR codec is constant in bitrate along an audio time signal, but is typically variable in sound quality. For example, for stereo encoding at a bitrate of 96 kb/s, an encoded speech track, which is “easy” to encode due to its relatively narrow frequency bandwidth, sounds indistinguishable from the original source of the track. However, noticeable artifacts could be heard in similarly encoded complex classical music, which is “difficult” to encode due to a typically broad frequency bandwidth and, therefore, more data to encode.
Simultaneous Masking is a frequency domain phenomenon where a low level signal, e.g., a narrow-band noise (the maskee) can be made inaudible by a simultaneously occurring stronger signal (the masker). A masked threshold can be measured below which any signal will not be audible. The masked threshold depends on the sound pressure level (SPL) and the frequency of the masker, and on the characteristics of the masker and maskee. If the source signal consists of many simultaneous maskers, a global masked threshold can be computed that describes the threshold of just noticeable distortions as a function of frequency. The most common way of calculating the global masked threshold is based on the high resolution short term energy spectrum of the audio or speech signal.
Coding audio based on an audio perceptual model (i.e. psychoacoustic model) encodes audio signals above a masked threshold block by block. Therefore, if distortion (typically referred to as quantization noise), which is inherent to an amplitude quantization process, is under the masked threshold, a typical human cannot hear the noise. A sound quality target is based on a subjective perceptual quality scale (e.g., from 0-5, with 5 being best quality). From an audio quality target on this perceptual quality scale, a noise profile, i.e., an offset from the applicable masked threshold, is determinable. This noise profile represents the level at which quantization noise can be masked, while achieving the desired quality target. From the noise profile, appropriate quantization step sizes are determinable. The quantization step sizes are a significant determining factor of the coding bitrate.
After a block of audio data has been encoded, a bit count for that block of audio data is determined. If the bit count is too high (i.e., given the particular CBR or ABR target bitrate), then one way to reduce the bit count is to increase the quantization step sizes uniformly across all frequency bands of the block of audio data. Although this adjustment may effectively reduce the bit count, the adjustment does not take into account how audio is perceived differently at different frequencies. This may cause unacceptable noise to be generated at certain frequencies when the encoded audio is decoded and subsequently played.
Based on the foregoing, there is room for improvement in audio coding techniques.
In the foregoing description, AAC has been described as an example audio coding algorithm. However, embodiments of the invention are not limited to AAC. Any audio or video coding algorithm that employs a perceptual model may be used, such as MP3, AC-3, and WMA.