With the introduction of compact discs, digital wireless telephone networks, and audio delivery over the Internet, digital audio has become commonplace. Engineers use a variety of techniques to process digital audio efficiently while maintaining its quality. To understand these techniques, it helps to understand how audio information is represented in a computer and how humans perceive audio.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio information. For example, a single number can represent an audio sample, which is an amplitude value (i.e., loudness) at a particular time. Several factors affect the quality of the audio information, including sample depth, sampling rate, and channel mode.
Sample depth (or precision) indicates the range of numbers used to represent a sample. The more values possible for the sample, the higher the quality because the number can capture more subtle variations in amplitude. For example, an 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values.
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second.
Mono and stereo are two common channel modes for audio. In mono mode, audio information is present in one channel. In stereo mode, audio information is present in two channels usually labeled the left and right channels. Other modes with more channels, such as 5-channel surround sound, are also possible. Table 1 shows several formats of audio with different quality levels, along with corresponding raw bitrate costs.
TABLE 1. Bitrates for different quality audio information

Quality | Sample Depth (bits/sample) | Sampling Rate (samples/second) | Mode | Raw Bitrate (bits/second)
Internet telephony | 8 | 8,000 | mono | 64,000
Telephone | 8 | 11,025 | mono | 88,200
CD audio | 16 | 44,100 | stereo | 1,411,200
High quality audio | 16 | 48,000 | stereo | 1,536,000
As Table 1 shows, the cost of high quality audio information such as CD audio is high bitrate. High quality audio information consumes large amounts of computer storage and transmission capacity.
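The raw bitrates in Table 1 follow from multiplying sample depth, sampling rate, and the number of channels. The following minimal sketch reproduces the table values; the function name and the dictionary layout are illustrative assumptions, not part of any described system.

```python
def raw_bitrate(sample_depth_bits, sampling_rate, channels):
    """Raw (uncompressed) bitrate in bits per second."""
    return sample_depth_bits * sampling_rate * channels

# (bits/sample, samples/second, channels) for each row of Table 1
formats = {
    "Internet telephony": (8, 8000, 1),
    "Telephone": (8, 11025, 1),
    "CD audio": (16, 44100, 2),
    "High quality audio": (16, 48000, 2),
}

bitrates = {name: raw_bitrate(*fmt) for name, fmt in formats.items()}
```

At the CD audio rate, one minute of uncompressed audio occupies 1,411,200 × 60 / 8 = 10,584,000 bytes.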
Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bitrate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form.
Quantization is a conventional lossy compression technique. There are many different kinds of quantization including uniform and non-uniform quantization, scalar and vector quantization, and adaptive and non-adaptive quantization. Quantization maps ranges of input values to single values. For example, with uniform, scalar quantization by a factor of 3.0, a sample with a value anywhere between −1.5 and 1.499 is mapped to 0, a sample with a value anywhere between 1.5 and 4.499 is mapped to 1, etc. To reconstruct the sample, the quantized value is multiplied by the quantization factor, but the reconstruction is imprecise. Continuing the example started above, the quantized value 1 reconstructs to 1×3=3; it is impossible to determine where the original sample value was in the range 1.5 to 4.499. Quantization causes a loss in fidelity of the reconstructed value compared to the original value. Quantization can, however, dramatically improve the effectiveness of subsequent lossless compression, thereby reducing bitrate.
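The uniform, scalar quantization example above can be sketched as follows. The function names are illustrative assumptions; rounding is done with floor(x + 0.5) so that a value exactly on a range boundary (such as 1.5) maps to the upper range, as in the example.

```python
import math

def quantize(value, factor):
    """Map a value to the index of the range that contains it."""
    return math.floor(value / factor + 0.5)

def reconstruct(index, factor):
    """Reconstruct an approximate value from a quantized index."""
    return index * factor

factor = 3.0
samples = [-1.5, 1.499, 1.5, 4.499]
indices = [quantize(s, factor) for s in samples]          # two ranges: 0 and 1
rebuilt = [reconstruct(i, factor) for i in indices]       # only 0.0 and 3.0 survive
```

Every sample in the range 1.5 to 4.499 reconstructs to the same value, 3.0, which is the loss in fidelity described above.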
An audio encoder can use various techniques to provide the best possible quality for a given bitrate, including transform coding, rate control, and modeling human perception of audio. As a result of these techniques, an audio signal can be more heavily quantized at selected frequencies or times to decrease bitrate, yet the increased quantization will not significantly degrade perceived quality for a listener.
Transform coding techniques convert data into a form that makes it easier to separate perceptually important information from perceptually unimportant information. The less important information can then be quantized heavily, while the more important information is preserved, so as to provide the best perceived quality for a given bitrate. Transform coding techniques typically convert data into the frequency (or spectral) domain. For example, a transform coder converts a time series of audio samples into frequency coefficients. Transform coding techniques include Discrete Cosine Transform [“DCT”], Modulated Lapped Transform [“MLT”], and Fast Fourier Transform [“FFT”]. In practice, the input to a transform coder is partitioned into blocks, and each block is transform coded. Blocks may have varying or fixed sizes, and may or may not overlap with an adjacent block. For more information about transform coding and MLT in particular, see Gibson et al., Digital Compression for Multimedia, “Chapter 7: Frequency Domain Coding,” Morgan Kaufmann Publishers, Inc., pp. 227-262 (1998); U.S. Pat. No. 6,115,689 to Malvar; H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Norwood, Mass., 1992; or Seymour Shlien, “The Modulated Lapped Transform, Its Time-Varying Forms, and Its Application to Audio Coding Standards,” IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 4, pp. 359-66, July 1997.
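As a rough illustration of block-based transform coding, the sketch below partitions samples into fixed-size, non-overlapping blocks and applies an unnormalized DCT-II to each block. The choice of DCT-II, the block size, and the function names are assumptions made for illustration; a practical coder would typically use a fast algorithm and possibly overlapping blocks.

```python
import math

def split_blocks(samples, size):
    """Partition samples into fixed-size, non-overlapping blocks (tail dropped)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def dct_ii(block):
    """Unnormalized DCT-II: time-domain samples to frequency coefficients."""
    n = len(block)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(block))
        for k in range(n)
    ]

# A constant (DC) signal puts all of its energy into coefficient 0.
samples = [1.0] * 16
coefficient_blocks = [dct_ii(b) for b in split_blocks(samples, 8)]
```

Because the energy of the constant signal collects in a single coefficient, the remaining coefficients can be quantized to zero at no cost in quality, which is the separation of important from unimportant information that transform coding aims for.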
With rate control, an encoder adjusts quantization to regulate bitrate. For audio information at a constant quality, complex information typically has a higher bitrate (is less compressible) than simple information. So, if the complexity of audio information changes in a signal, the bitrate may change. In addition, changes in transmission capacity (such as those due to Internet traffic) affect available bitrate in some applications. The encoder can decrease bitrate by increasing quantization, and vice versa. Because the relation between degree of quantization and bitrate is complex and hard to predict in advance, the encoder can try different degrees of quantization to get the best quality possible for some bitrate, which is an example of a quantization loop.
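A quantization loop of the kind described above can be sketched as follows. The bit-cost model in estimate_bits, the growth factor, and all names are hypothetical stand-ins chosen for illustration; a real encoder would measure the actual size of the entropy-coded output.

```python
import math

def quantize_block(coeffs, step):
    """Uniform, scalar quantization of every coefficient in a block."""
    return [math.floor(c / step + 0.5) for c in coeffs]

def estimate_bits(indices):
    """Hypothetical cost model: larger magnitudes cost more bits."""
    return sum(1 + 2 * abs(q).bit_length() for q in indices)

def quantization_loop(coeffs, bit_budget, start_step=1.0, growth=1.25):
    """Try coarser and coarser step sizes until the block fits the budget."""
    if bit_budget < len(coeffs):
        raise ValueError("budget is below the cost of an all-zero block")
    step = start_step
    while True:
        indices = quantize_block(coeffs, step)
        if estimate_bits(indices) <= bit_budget:
            return step, indices
        step *= growth  # more quantization, lower bitrate

coeffs = [40.0, -12.5, 7.0, 3.2, -1.1, 0.4, 0.0, -0.2]
step, indices = quantization_loop(coeffs, bit_budget=30)
```

The loop returns the first (finest) step size it tries that meets the budget, so quality is as high as this simple search allows for the given bitrate.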
II. Human Perception of Audio Information
In addition to the factors that determine objective audio quality, perceived audio quality also depends on how the human body processes audio information. For this reason, audio processing tools often process audio information according to an auditory model of human perception.
Typically, an auditory model considers the range of human hearing and critical bands. Humans can hear sounds ranging from roughly 20 Hz to 20 kHz, and are most sensitive to sounds in the 2-4 kHz range. The human nervous system integrates sub-ranges of frequencies. For this reason, an auditory model may organize and process audio information by critical bands. For example, one critical band scale groups frequencies into 24 critical bands with upper cut-off frequencies (in Hz) at 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, and 15500. Different auditory models use a different number of critical bands (e.g., 25, 32, 55, or 109) and/or different cut-off frequencies for the critical bands. Bark bands are a well-known example of critical bands.
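Grouping frequencies by the 24-band scale listed above amounts to a lookup against the upper cut-off frequencies. In this sketch, the function name and the convention that a frequency equal to a cut-off belongs to the band it closes are assumptions.

```python
from bisect import bisect_left

# Upper cut-off frequencies (Hz) of the 24 critical bands listed above
UPPER_CUTOFFS_HZ = [
    100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
    2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500,
]

def critical_band(freq_hz):
    """Return the 0-based band index, or None above the highest cut-off."""
    if freq_hz < 0:
        raise ValueError("frequency must be non-negative")
    index = bisect_left(UPPER_CUTOFFS_HZ, freq_hz)
    return index if index < len(UPPER_CUTOFFS_HZ) else None

band_of_1khz = critical_band(1000)  # falls between the 920 and 1080 cut-offs
```

The bands widen with frequency: the first band spans 100 Hz, while the last spans 3,500 Hz.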
Aside from range and critical bands, interactions between audio signals can dramatically affect perception. An audio signal that is clearly audible if presented alone can be completely inaudible in the presence of another audio signal, called the masker or the masking signal. The human ear is relatively insensitive to distortion or other loss in fidelity (i.e., noise) in the masked signal, so the masked signal can include more distortion without degrading perceived audio quality. Table 2 lists various factors and how the factors relate to perception of an audio signal.
TABLE 2. Various factors that relate to perception of audio

Factor | Relation to Perception of an Audio Signal
outer and middle ear transfer | Generally, the outer and middle ear attenuate higher frequency information and pass middle frequency information. Noise is less audible in higher frequencies than middle frequencies.
noise in the auditory nerve | Noise present in the auditory nerve, together with noise from the flow of blood, increases for low frequency information. Noise is less audible in lower frequencies than middle frequencies.
perceptual frequency scales | Depending on the frequency of the audio signal, hair cells at different positions in the inner ear react, which affects the pitch that a human perceives. Critical bands relate frequency to pitch.
excitation | Hair cells typically respond several milliseconds after the onset of the audio signal at a frequency. After exposure, hair cells and neural processes need time to recover full sensitivity. Moreover, loud signals are processed faster than quiet signals. Noise can be masked when the ear will not sense it.
detection | Humans are better at detecting changes in loudness for quieter signals than louder signals. Noise can be masked in louder signals.
simultaneous masking | For a masker and maskee present at the same time, the maskee is masked at the frequency of the masker but also at frequencies above and below the masker. The amount of masking depends on the masker and maskee structures and the masker frequency.
temporal masking | The masker has a masking effect before and after the masker itself. Generally, forward masking is more pronounced than backward masking. The masking effect diminishes further away from the masker in time.
loudness | Perceived loudness of a signal depends on frequency, duration, and sound pressure level. The components of a signal partially mask each other, and noise can be masked as a result.
cognitive processing | Cognitive effects influence perceptual audio quality. Abrupt changes in quality are objectionable. Different components of an audio signal are important in different applications (e.g., speech vs. music).
An auditory model can consider any of the factors shown in Table 2 as well as other factors relating to physical or neural aspects of human perception of sound. For more information about auditory models, see:
1) Zwicker and Feldtkeller, “Das Ohr als Nachrichtenempfänger,” Hirzel-Verlag, Stuttgart, 1967;
2) Terhardt, “Calculating Virtual Pitch,” Hearing Research, 1:155-182, 1979;
3) Lutfi, “Additivity of Simultaneous Masking,” Journal of the Acoustical Society of America, 73:262-267, 1983;
4) Jesteadt et al., “Forward Masking as a Function of Frequency, Masker Level, and Signal Delay,” Journal of the Acoustical Society of America, 71:950-962, 1982;
5) ITU, Recommendation ITU-R BS 1387, Method for Objective Measurements of Perceived Audio Quality, 1998;
6) Beerends, “Audio Quality Determination Based on Perceptual Measurement Techniques,” Applications of Digital Signal Processing to Audio and Acoustics, Chapter 1, Ed. Mark Kahrs, Karlheinz Brandenburg, Kluwer Acad. Publ., 1998; and
7) Zwicker, Psychoakustik, Springer-Verlag, Berlin Heidelberg, New York, 1982.
III. Generating Quantization Matrices
Quantization and other lossy compression techniques introduce potentially audible noise into an audio signal. The audibility of the noise depends on 1) how much noise there is and 2) how much of the noise the listener perceives. The first factor relates mainly to objective quality, while the second factor depends on human perception of sound.
Distortion is one measure of how much noise is in reconstructed audio. Distortion D can be calculated as the square of the differences between original values and reconstructed values:
$$D = \left(u - q(u)\,Q\right)^2 \tag{1}$$
where u is an original value, q(u) is a quantized value, and Q is a quantization factor. The distribution of noise in the reconstructed audio depends on the quantization scheme used in the encoder.
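Equation (1) can be evaluated directly for a uniform, scalar quantizer. In the sketch below, the rounding rule for q(u) and the function names are assumptions; the variable names u and Q follow the equation.

```python
import math

def q(u, Q):
    """Quantized value for original value u and quantization factor Q."""
    return math.floor(u / Q + 0.5)

def distortion(u, Q):
    """Equation (1): squared difference between original and reconstruction."""
    return (u - q(u, Q) * Q) ** 2

def block_distortion(values, Q):
    """Total distortion over a block of original values."""
    return sum(distortion(u, Q) for u in values)

values = [4.0, 3.0, 1.5, -2.0]
per_value = [distortion(u, 3.0) for u in values]
```

With this rounding rule, the distortion of a single value never exceeds (Q/2)², so halving Q cuts the worst-case distortion by a factor of four.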
For example, if an audio encoder uses uniform, scalar quantization for each frequency coefficient of spectral audio data, noise is spread equally across the frequency spectrum of the reconstructed audio, and different levels are quantized at the same accuracy. Uniform, scalar quantization is relatively simple computationally, but can result in the complete loss of small values at moderate levels of quantization. Uniform, scalar quantization also fails to account for the varying sensitivity of the human ear to noise at different frequencies and levels of loudness, interaction with other sounds present in the signal (i.e., masking), or the physical limitations of the human ear (i.e., the need to recover sensitivity).
Power-law quantization (e.g., α-law) is a non-uniform quantization technique that varies quantization step size as a function of amplitude. Low levels are quantized with greater accuracy than high levels, which tends to preserve low levels along with high levels. Power-law quantization still fails to fully account for the audibility of noise, however.
Another non-uniform quantization technique uses quantization matrices. A quantization matrix is a set of weighting factors for series of values called quantization bands. Each value within a quantization band is weighted by the same weighting factor. A quantization matrix spreads distortion in unequal proportions, depending on the weighting factors. For example, if quantization bands are frequency ranges of frequency coefficients, a quantization matrix can spread distortion across the spectrum of reconstructed audio data in unequal proportions. Some parts of the spectrum can have more severe quantization and hence more distortion; other parts can have less quantization and hence less distortion.
Microsoft Corporation's Windows Media Audio version 7.0 [“WMA7”] generates quantization matrices for blocks of frequency coefficient data. In WMA7, an audio encoder uses an MLT to transform audio samples into frequency coefficients in variable-size transform blocks. For stereo mode audio data, the encoder can code left and right channels into sum and difference channels. The sum channel is the average of the left and right channels; the difference channel is the difference between the left and right channels divided by two. The encoder computes a quantization matrix for each channel:
 Q[c][d]=E[d]  (2),
where c is a channel, d is a quantization band, and E[d] is an excitation pattern for the quantization band d. The WMA7 encoder calculates an excitation pattern for a quantization band by squaring coefficient values to determine energies and then summing the energies of the coefficients within the quantization band.
Since the quantization bands can have different sizes, the encoder adjusts the quantization matrix Q[c][d] by the quantization band sizes:
$$Q[c][d] \leftarrow \left(\frac{Q[c][d]}{\mathrm{Card}\{B[d]\}}\right)^{u} \tag{3}$$
where Card{B[d]} is the number of coefficients in the quantization band d, and where u is an experimentally derived exponent (in listening tests) that affects relative weights of bands of different energies. For stereo mode audio data, whether the data is in independently (i.e., left and right) or jointly (i.e., sum and difference) coded channels, the WMA7 encoder uses the same technique to generate quantization matrices for two individual coded channels.
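Equations (2) and (3) can be combined into a short computation for one channel. This is a sketch under stated assumptions: the band boundaries, the function name, and the exponent value 0.5 are illustrative only, since the text does not give the experimentally derived value of u.

```python
def quantization_matrix(coeffs, band_edges, u):
    """Weights for one channel: band energy, normalized by band size, raised to u.

    band_edges[d] and band_edges[d + 1] delimit quantization band d.
    """
    weights = []
    for start, end in zip(band_edges[:-1], band_edges[1:]):
        band = coeffs[start:end]
        energy = sum(c * c for c in band)           # excitation pattern E[d], eq. (2)
        weights.append((energy / len(band)) ** u)   # band-size adjustment, eq. (3)
    return weights

# Hypothetical block: one 2-coefficient band and one 4-coefficient band
coeffs = [2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
weights = quantization_matrix(coeffs, band_edges=[0, 2, 6], u=0.5)
```

Without the division by band size, the wider band would receive a larger weight simply because it sums more coefficients, not because its coefficients are stronger.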
The quantization matrices in WMA7 spread distortion between bands in proportion to the energies of the bands. Higher energy leads to a higher weight and more quantization; lower energy leads to a lower weight and less quantization. WMA7 still fails to account for the audibility of noise in several respects, however, including the varying sensitivity of the human ear to noise at different frequencies and times, temporal masking, and the physical limitations of the human ear.
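The sum and difference channels described above for jointly coded stereo data can be computed and inverted as follows. The function names are illustrative assumptions; the arithmetic follows the definitions given for WMA7 (average, and difference divided by two).

```python
def to_sum_difference(left, right):
    """Convert left/right channels to sum (average) and difference channels."""
    sum_ch = [(l + r) / 2 for l, r in zip(left, right)]
    diff_ch = [(l - r) / 2 for l, r in zip(left, right)]
    return sum_ch, diff_ch

def to_left_right(sum_ch, diff_ch):
    """Invert the conversion: left = sum + diff, right = sum - diff."""
    left = [s + d for s, d in zip(sum_ch, diff_ch)]
    right = [s - d for s, d in zip(sum_ch, diff_ch)]
    return left, right

left = [1.0, 0.5, -0.25]
right = [1.0, 0.25, 0.25]
sum_ch, diff_ch = to_sum_difference(left, right)
```

When the two channels are similar, the difference channel carries little energy, so the per-channel excitation patterns (and hence the quantization matrices) of the sum and difference channels differ considerably from each other.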
In order to reconstruct audio data, a WMA7 decoder needs the quantization matrices used to compress the audio data. For this reason, the WMA7 encoder sends the quantization matrices to the decoder as side information in the bitstream of compressed output. To reduce bitrate, the encoder compresses the quantization matrices using a technique such as the direct compression technique (100) shown in FIG. 1.
In the direct compression technique (100), the encoder uniformly quantizes (110) each element of a quantization matrix (105). The encoder then differentially codes (120) the quantized elements, and Huffman codes (130) the differentially coded elements. The technique (100) is computationally simple and effective, but the resulting bitrate for the quantization matrix is not low enough for very low bitrate coding.
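The first two stages of the direct compression technique, uniform quantization and differential coding, can be sketched as follows. Huffman coding of the differences is omitted here, and the step size, function names, and sample weights are assumptions made for illustration.

```python
import math

def compress_matrix(weights, step):
    """Uniformly quantize the elements, then code each as a difference
    from the previous quantized element (the first is coded as-is)."""
    quantized = [math.floor(w / step + 0.5) for w in weights]
    if not quantized:
        return []
    return [quantized[0]] + [b - a for a, b in zip(quantized, quantized[1:])]

def decompress_matrix(deltas, step):
    """Undo the differential coding and rescale by the step size."""
    elements, running = [], 0
    for delta in deltas:
        running += delta
        elements.append(running * step)
    return elements

weights = [10.2, 10.9, 11.1, 9.8]
deltas = compress_matrix(weights, step=1.0)
rebuilt = decompress_matrix(deltas, step=1.0)
```

Neighboring elements of a quantization matrix tend to be similar, so the differences cluster around zero, which is what makes a subsequent Huffman code effective.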
Aside from WMA7, several international standards describe audio encoders that spread distortion in unequal proportions across bands. The Moving Picture Experts Group, Audio Layer 3 [“MP3”] and Moving Picture Experts Group 2, Advanced Audio Coding [“AAC”] standards each describe scale factors used when quantizing spectral audio data.
In MP3, the scale factors are weights for ranges of frequency coefficients called scale factor bands. Each scale factor starts with a minimum weight for a scale factor band. The number of scale factor bands depends on sampling rate and block size (e.g., 21 scale factor bands for a long block of 48 kHz input). For the starting set of scale factors, the encoder finds a satisfactory quantization step size in an inner quantization loop. In an outer quantization loop, the encoder amplifies the scale factors until the distortion in each scale factor band is less than the allowed distortion threshold for that scale factor band, with the encoder repeating the inner quantization loop for each adjusted set of scale factors. In special cases, the encoder exits the outer quantization loop even if distortion exceeds the allowed distortion threshold for a scale factor band (e.g., if all scale factors have been amplified or if a scale factor has reached a maximum amplification). The MP3 encoder transmits the scale factors as side information using ad hoc differential coding and, potentially, entropy coding.
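The nested quantization loops described above can be illustrated with a heavily simplified toy. This is not the MP3 algorithm: the bit-cost model, the uniform quantizer, the amplification factor of 1.25 per scale factor step, and every name below are hypothetical, chosen only to show the control flow of the inner loop, the outer loop, and the special exit cases.

```python
import math

def quantize(value, step):
    return math.floor(value / step + 0.5)

def estimate_bits(indices):
    """Hypothetical cost model: larger magnitudes cost more bits."""
    return sum(1 + 2 * abs(q).bit_length() for q in indices)

def inner_loop(coeffs, band_of, gains, bit_budget):
    """Coarsen the global step size until the amplified block fits the budget."""
    step = 1.0
    while True:
        indices = [quantize(c * gains[band_of[i]], step) for i, c in enumerate(coeffs)]
        if estimate_bits(indices) <= bit_budget:
            return step, indices
        step *= 1.25

def outer_loop(coeffs, band_of, allowed, bit_budget, max_sf=8):
    """Amplify scale factors of bands whose distortion exceeds the allowed threshold."""
    if bit_budget < len(coeffs):
        raise ValueError("budget is below the cost of an all-zero block")
    sf = [0] * len(allowed)  # each scale factor starts at the minimum weight
    while True:
        gains = [1.25 ** s for s in sf]
        step, indices = inner_loop(coeffs, band_of, gains, bit_budget)
        dist = [0.0] * len(allowed)
        for i, (c, q) in enumerate(zip(coeffs, indices)):
            b = band_of[i]
            dist[b] += (c - q * step / gains[b]) ** 2
        noisy = [b for b, d in enumerate(dist) if d > allowed[b]]
        # Normal exit, or special exits: all bands amplified / maximum reached
        if not noisy or all(s > 0 for s in sf) or max(sf) >= max_sf:
            return sf, step, indices
        for b in noisy:
            sf[b] += 1

coeffs = [40.0, -12.5, 7.0, 3.2, -1.1, 0.4, 0.0, -0.2]
band_of = [0, 0, 1, 1, 2, 2, 3, 3]  # scale factor band of each coefficient
sf, step, indices = outer_loop(coeffs, band_of, [1.0, 1.0, 0.5, 0.5], bit_budget=30)
```

Each pass of the outer loop reruns the entire inner loop, which is why the iterative refinement of scale factors can be costly.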
Before the quantization loops, the MP3 encoder can switch between long blocks of 576 frequency coefficients and short blocks of 192 frequency coefficients (sometimes called long windows or short windows). Instead of a long block, the encoder can use three short blocks for better time resolution. The number of scale factor bands is different for short blocks and long blocks (e.g., 12 scale factor bands vs. 21 scale factor bands).
The MP3 encoder can use any of several different coding channel modes, including single channel, two independent channels (left and right channels), or two jointly coded channels (sum and difference channels). If the encoder uses jointly coded channels, the encoder computes and transmits a set of scale factors for each of the sum and difference channels using the same techniques that are used for left and right channels. Or, if the encoder uses jointly coded channels, the encoder can instead use intensity stereo coding. Intensity stereo coding changes how scale factors are determined for higher frequency scale factor bands and changes how sum and difference channels are reconstructed, but the encoder still computes and transmits two sets of scale factors for the two channels.
The MP3 encoder incorporates a psychoacoustic model when determining the allowed distortion thresholds for scale factor bands. In a path separate from the rest of the encoder, the encoder processes the original audio data according to the psychoacoustic model. The psychoacoustic model uses a different frequency transform than the rest of the encoder (FFT vs. hybrid polyphase/MDCT filter bank) and uses separate computations for energy and other parameters. In the psychoacoustic model, the MP3 encoder processes the blocks of frequency coefficients according to threshold calculation partitions at sub-Bark band resolution (e.g., 62 partitions for a long block of 48 kHz input). The encoder calculates a Signal to Mask Ratio [“SMR”] for each partition, and then converts the SMRs for the partitions into SMRs for the scale factor bands. The MP3 encoder later converts the SMRs for scale factor bands into the allowed distortion thresholds for the scale factor bands. The encoder runs the psychoacoustic model twice (in parallel, once for long blocks and once for short blocks) using different techniques to calculate SMR depending on the block size.
For additional information about MP3 and AAC, see the MP3 standard (“ISO/IEC 11172-3, Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media at Up to About 1.5 Mbit/s—Part 3: Audio”) and the AAC standard.
Although MP3 encoding has achieved widespread adoption, it is unsuitable for some applications (for example, real-time audio streaming at very low to mid bitrates) for several reasons. First, MP3's iterative refinement of scale factors in the outer quantization loop consumes too many resources for some applications. Repeated iterations of the outer quantization loop consume time and computational resources. On the other hand, if the outer quantization loop exits quickly (i.e., with minimum scale factors and a small quantization step size), the MP3 encoder can waste bitrate encoding audio information with distortion well below the allowed distortion thresholds. Second, computing SMR with a psychoacoustic model separate from the rest of the MP3 encoder (e.g., separate frequency transform, computations of energy, etc.) consumes too much time and computational resources for some applications. Third, computing SMRs in parallel for long blocks as well as short blocks consumes more resources than is necessary when the encoder switches between long blocks or short blocks in the alternative. Computing SMRs in separate tracks also does not allow direct comparisons between blocks of different sizes for operations like temporal spreading. Fourth, the MP3 encoder does not adequately exploit differences between independently coded channels and jointly coded channels when computing and transmitting quantization matrices. Fifth, ad hoc differential coding and entropy coding of scale factors in MP3 gives good quality for the scale factors, but the bitrate for the scale factors is not low enough for very low bitrate applications.
IV. Parametric Coding of Audio Information
Parametric coding is an alternative to transform coding, quantization, and lossless compression in applications such as speech compression. With parametric coding, an encoder converts a block of audio samples into a set of parameters describing the block (rather than coded versions of the audio samples themselves). A decoder later synthesizes the block of audio samples from the set of parameters. Both the bitrate and the quality for parametric coding are typically lower than for other compression methods.
One technique for parametrically compressing a block of audio samples uses Linear Predictive Coding [“LPC”] parameters and Line-Spectral Frequency [“LSF”] values. First, the audio encoder computes the LPC parameters. For example, the audio encoder computes autocorrelation values for the block of audio samples itself, which are short-term correlations between samples within the block. From the autocorrelation values, the encoder computes the LPC parameters using a technique such as Levinson recursion. Other techniques for determining LPC parameters use a covariance method or a lattice method.
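The autocorrelation and Levinson recursion steps can be sketched as follows. The sketch assumes the predictor convention x[n] ≈ a[1]·x[n−1] + ... + a[p]·x[n−p]; sign conventions for LPC parameters and reflection coefficients vary between texts, and the function names are illustrative.

```python
def autocorrelation(samples, max_lag):
    """Short-term autocorrelation values r[0..max_lag] for one block."""
    n = len(samples)
    return [
        sum(samples[i] * samples[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]

def levinson(r, order):
    """Levinson recursion: returns (lpc, reflection_coeffs, prediction_error)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    reflection = []
    for m in range(1, order + 1):
        if error <= 0.0:
            break  # silent or perfectly predictable block: stop early
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / error
        reflection.append(k_m)
        updated = a[:]
        updated[m] = k_m
        for k in range(1, m):
            updated[k] = a[k] - k_m * a[m - k]
        a = updated
        error *= 1.0 - k_m * k_m
    return a[1:], reflection, error

# Autocorrelation values of a first-order process with coefficient 0.5
lpc, reflection, error = levinson([1.0, 0.5, 0.25], order=2)
```

For these autocorrelation values, the recursion finds that a single predictor coefficient explains the signal and the second-order term adds nothing.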
Next, the encoder converts the LPC parameters to LSF values, which capture spectral information for the block of audio samples. LSF values have greater intra-block and inter-block correlation than LPC parameters, and are better suited for subsequent quantization. For example, the encoder computes partial correlation [“PARCOR”] or reflection coefficients from the LPC parameters. The encoder then computes the LSF values from the PARCOR coefficients using a method such as complex root, real root, ratio filter, Chebyshev, or adaptive sequential LMS. Finally, the encoder quantizes the LSF values. Instead of LSF values, different techniques convert LPC parameters to a log area ratio, inverse sine, or other representation. For more information about parametric coding, LPC parameters, and LSF values, see A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communications Systems, “Chapter 3.3: Linear Predictive Modeling of Speech Signals” and “Chapter 4: LPC Parameter Quantisation Using LSFs,” John Wiley & Sons (1994).
WMA7 allows a parametric coding mode in which the audio encoder parametrically codes the spectral shape of a block of audio samples. The resulting parameters represent the quantization matrix for the block, rather than the more conventional application of representing the audio signal itself. The parameters used in WMA7 represent spectral shape of the audio block, but do not adequately account for human perception of audio information.