With the introduction of compact disks, digital wireless telephone networks, and audio delivery over the Internet, digital audio has become commonplace. Engineers use a variety of techniques to control the quality and bitrate of digital audio. To understand these techniques, it helps to understand how audio information is represented in a computer and how humans perceive audio.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio information. For example, a single number can represent an audio sample, which is an amplitude (i.e., loudness) at a particular time. Several factors affect the quality of the audio information, including sample depth, sampling rate, and channel mode.
Sample depth (or precision) indicates the range of numbers used to represent a sample. The more values possible for the sample, the higher the quality because the number can capture more subtle variations in amplitude. For example, an 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values.
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second.
Mono and stereo are two common channel modes for audio. In mono mode, audio information is present in one channel. In stereo mode, audio information is present in two channels usually labeled the left and right channels. Other modes with more channels, such as 5-channel surround sound, are also possible. Table 1 shows several formats of audio with different quality levels, along with corresponding raw bitrate costs.
TABLE 1Bitrates for different quality audio informationSample DepthSampling RateRaw BitrateQuality(bits/sample)(samples/second)Mode(bits/second)Internet88,000mono64,000telephonytelephone811,025mono88,200CD audio1644,100stereo1,411,200high quality1648,000stereo1,536,000audio
As Table 1 shows, the cost of high quality audio information such as CD audio is high bitrate. High quality audio information consumes large amounts of computer storage and transmission capacity.
Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bitrate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form.
Quantization is a conventional lossy compression technique. There are many different kinds of quantization including uniform and non-uniform quantization, scalar and vector quantization, and adaptive and non-adaptive quantization. Quantization maps ranges of input values to single values. For example, with uniform, scalar quantization by a factor of 3.0, a sample with a value anywhere between −1.5 and 1.499 is mapped to 0, a sample with a value anywhere between 1.5 and 4.499 is mapped to 1, etc. To reconstruct the sample, the quantized value is multiplied by the quantization factor, but the reconstruction is imprecise. Continuing the example started above, the quantized value 1 reconstructs to 1×3=3; it is impossible to determine where the original sample value was in the range 1.5 to 4.499. Quantization causes a loss in fidelity of the reconstructed value compared to the original value. Quantization can dramatically improve the effectiveness of subsequent lossless compression, however, thereby reducing bitrate.
An audio encoder can use various techniques to provide the best possible quality for a given bitrate, including transform coding, modeling human perception of audio, and rate control. As a result of these techniques, an audio signal can be more heavily quantized at selected frequencies or times to decrease bitrate, yet the increased quantization will not significantly degrade perceived quality for a listener.
Transform coding techniques convert information into a form that makes it easier to separate perceptually important information from perceptually unimportant information. The less important information can then be quantized heavily, while the more important information is preserved, so as to provide the best perceived quality for a given bitrate. Transform coding techniques typically convert information into the frequency (or spectral) domain. For example, a transform coder converts a time series of audio samples into frequency coefficients. Transform coding techniques include Discrete Cosine Transform [“DCT”], Modulated Lapped Transform [“MLT”], and Fast Fourier Transform [“FFT”]. In practice, the input to a transform coder is partitioned into blocks, and each block is transform coded. Blocks may have varying or fixed sizes, and may or may not overlap with an adjacent block. After transform coding, a frequency range of coefficients may be grouped for the purpose of quantization, in which case each coefficient is quantized like the others in the group, and the frequency range is called a quantization band. For more information about transform coding and MLT in particular, see Gibson et al., Digital Compression for Multimedia, “Chapter 7: Frequency Domain Coding,” Morgan Kaufman Publishers, Inc., pp. 227–262 (1998); U.S. Pat. No. 6,115,689 to Malvar; H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Norwood, Mass., 1992; or Seymour Schein, “The Modulated Lapped Transform, Its Time-Varying Forms, and Its Application to Audio Coding Standards,” IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 4, pp. 359–66, July 1997.
In addition to the factors that determine objective audio quality, perceived audio quality also depends on how the human body processes audio information. For this reason, audio processing tools often process audio information according to an auditory model of human perception.
Typically, an auditory model considers the range of human hearing and critical bands. Humans can hear sounds ranging from roughly 20 Hz to 20 kHz, and are most sensitive to sounds in the 2–4 kHz range. The human nervous system integrates sub-ranges of frequencies. For this reason, an auditory model may organize and process audio information by critical bands. Aside from range and critical bands, interactions between audio signals can dramatically affect perception. An audio signal that is clearly audible if presented alone can be completely inaudible in the presence of another audio signal, called the masker or the masking signal. The human ear is relatively insensitive to distortion or other loss in fidelity (i.e., noise) in the masked signal, so the masked signal can include more distortion without degrading perceived audio quality. An auditory model typically incorporates other factors relating to physical or neural aspects of human perception of sound.
Using an auditory model, an audio encoder can determine which parts of an audio signal can be heavily quantized without introducing audible distortion, and which parts should be quantized lightly or not at all. Thus, the encoder can spread distortion across the signal so as to decrease the audibility of the distortion.
II. Controlling Rate and Quality of Audio Information
Different audio applications have different quality and bitrate requirements. Certain applications require constant quality over time for compressed audio information. Other applications require variable quality and bitrate. Still other applications require constant or relatively constant bitrate [collectively, “constant bitrate” or “CBR”]. One such CBR application is encoding audio for streaming over the Internet.
A CBR encoder outputs compressed audio information at a constant bitrate despite changes in the complexity of the audio information. Complex audio information is typically less compressible than simple audio information. For the CBR encoder to meet bitrate requirements, the CBR encoder can adjust how the audio information is quantized. The quality of the compressed audio information then varies, with lower quality for periods of complex audio information due to increased quantization and higher quality for periods of simple audio information due to decreased quantization.
While adjustment of quantization and audio quality is necessary at times to satisfy constant bitrate requirements, current CBR encoders can cause unnecessary changes in quality, which can result in thrashing between high quality and low quality around the appropriate, middle quality. Moreover, when changes in audio quality are necessary, current CBR encoders often cause abrupt changes, which are more noticeable and objectionable than smooth changes.
Microsoft Corporation's Windows Media Audio version 7.0 [“WMA7”] includes an audio encoder that can be used to compress audio information for streaming at a constant bitrate. The WMA7 encoder uses a virtual buffer and rate control to handle variations in bitrate due to changes in the complexity of audio information.
To handle short-term fluctuations around the constant bitrate (such as those due to brief variations in complexity), the WMA7 encoder uses a virtual buffer that stores some duration of compressed audio information. For example, the virtual buffer stores compressed audio information for 5 seconds of audio playback. The virtual buffer outputs the compressed audio information at the constant bitrate, so long as the virtual buffer does not underflow or overflow. Using the virtual buffer, the encoder can compress audio information at relatively constant quality despite variations in complexity, so long as the virtual buffer is long enough to smooth out the variations. In practice, virtual buffers must be limited in duration in order to limit system delay, however, and buffer underflow or overflow can occur unless the encoder intervenes.
To handle longer-term deviations from the constant bitrate (such as those due to extended periods of complexity or silence), the WMA7 encoder adjusts the quantization step size of a uniform, scalar quantizer in a rate control loop. The relation between quantization step size and bitrate is complex and hard to predict in advance, so the encoder tries one or more different quantization step sizes until the encoder finds one that results in compressed audio information with a bitrate sufficiently close to a target bitrate. The encoder sets the target bitrate to reach a desired buffer fullness, preventing buffer underflow and overflow. Based upon the complexity of the audio information, the encoder can also allocate additional bits for a block or deallocate bits when setting the target bitrate for the rate control loop.
The WMA7 encoder measures the quality of the reconstructed audio information for certain operations (e.g., deciding which bands to truncate). The WMA7 encoder does not use the quality measurement in conjunction with adjustment of the quantization step size in a quantization loop, however.
The WMA7 encoder controls bitrate and provides good quality for a given bitrate, but can cause unnecessary quality changes. Moreover, with the WMA7 encoder, necessary changes in audio quality are not as smooth as they could be in transitions from one level of quality to another.
Numerous other audio encoders use rate control strategies; for example, see U.S. Pat. No. 5,845,243 to Smart et al. Such rate control strategies potentially consider information other than or in addition to current buffer fullness, for example, the complexity of the audio information.
Several international standards describe audio encoders that incorporate distortion and rate control. The Motion Picture Experts Group, Audio Layer 3 [“MP3”] and Motion Picture Experts Group 2, Advanced Audio Coding [“AAC”] standards each describe techniques for controlling distortion and bitrate of compressed audio information.
In MP3, the encoder uses nested quantization loops to control distortion and bitrate for a block of audio information called a granule. Within an outer quantization loop for controlling distortion, the MP3 encoder calls an inner quantization loop for controlling bitrate.
In the outer quantization loop, the MP3 encoder compares distortions for scale factor bands to allowed distortion thresholds for the scale factor bands. A scale factor band is a range of frequency coefficients for which the encoder calculates a weight called a scale factor. Each scale factor starts with a minimum weight for a scale factor band. After an iteration of the inner quantization loop, the encoder amplifies the scale factors until the distortion in each scale factor band is less than the allowed distortion threshold for that scale factor band, with the encoder calling the inner quantization loop for each set of scale factors. In special cases, the encoder exits the outer quantization loop even if distortion exceeds the allowed distortion threshold for a scale factor band (e.g., if all scale factors have been amplified or if a scale factor has reached a maximum amplification).
In the inner quantization loop, the MP3 encoder finds a satisfactory quantization step size for a given set of scale factors. The encoder starts with a quantization step size expected to yield more than the number of available bits for the granule. The encoder then gradually increases the quantization step size until it finds one that yields fewer than the number of available bits.
The MP3 encoder calculates the number of available bits for the granule based upon the average number of bits per granule, the number of bits in a bit reservoir, and an estimate of complexity of the granule called perceptual entropy. The bit reservoir counts unused bits from previous granules. If a granule uses less than the number of available bits, the MP3 encoder adds the unused bits to the bit reservoir. When the bit reservoir gets too full, the MP3 encoder preemptively allocates more bits to granules or adds padding bits to the compressed audio information. The MP3 encoder uses a psychoacoustic model to calculate the perceptual entropy of the granule based upon the energy, distortion thresholds, and widths for frequency ranges called threshold calculation partitions. Based upon the perceptual entropy, the encoder can allocate more than the average number of bits to a granule.
For additional information about MP3 and AAC, see the MP3 standard (“ISO/IEC 11172-3, Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media at Up to About 1.5 Mbit/s—Part 3: Audio”) and the AAC standard.
Although MP3 encoding has achieved widespread adoption, it is unsuitable for some applications (for example, real-time audio streaming at very low to mid bitrates) for several reasons. First, the nested quantization loops can be too time-consuming. Second, the nested quantization loops are designed for high quality applications, and do not work as well for lower bitrates which require the introduction of some audible distortion. Third, the MP3 control strategy assumes predictable rate-distortion characteristics in the audio (in which distortion decreases with the number of bits allocated), and does not address situations in which distortion increases with the number of bits allocated.
Other audio encoders use a combination of filtering and zero tree coding to jointly control quality and bitrate. An audio encoder decomposes an audio signal into bands at different frequencies and temporal resolutions. The encoder formats band information such that information for less perceptually important bands can be incrementally removed from a bitstream, if necessary, while preserving the most information possible for a given bitrate. For more information about zero tree coding, see Srinivasan et al., “High-Quality Audio Compression Using an Adaptive Wavelet Packet Decomposition and Psychoacoustic Modeling,” IEEE Transactions on Signal Processing, Vol. 46, No. 4, pp. (April 1998).
While this strategy works for high quality, high complexity applications, it does not work as well for very low to mid-bitrate applications. Moreover, the strategy assumes predictable rate-distortion characteristics in the audio, and does not address situations in which distortion increases with the number of bits allocated.
Outside of the field of audio encoding, various joint quality and bitrate control strategies for video encoding have been published. For example, see U.S. Pat. No. 5,686,964 to Naveen et al.; U.S. Pat. No. 5,995,151 to Naveen et al.; Caetano et al., “Rate Control Strategy for Embedded Wavelet Video Coders,” IEEE Electronics Letters, pp 1815–17 (Oct. 14, 1999); and Ribas-Corbera et al., “Rate Control in DCT Video Coding for Low-Delay Communications,” IEEE Trans Circuits and Systems for Video Technology, Vol. 9, No 1, (February 1999).
As one might expect given the importance of quality and rate control to encoder performance, the fields of quality and rate control for audio and video applications are well developed. Whatever the advantages of previous quality and rate control strategies, however, they do not offer the performance advantages of the present invention.