The digital transmission and storage of audio signals are increasingly based on data reduction algorithms, which are adapted to the properties of the human auditory system and particularly rely on masking effects. Such algorithms do not primarily aim to minimize distortion; rather, they shape the distortion so that it is perceived as little as possible.
To understand these audio encoding techniques, it helps to understand how audio information is represented in a computer and how humans perceive audio.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio information. For example, a single number can represent an audio sample, which is an amplitude (i.e., loudness) at a particular time. Several factors affect the quality of the audio information, including sample depth, sampling rate, and channel mode.
Sample depth (or precision) indicates the range of numbers used to represent a sample. The more values possible for the sample, the higher the quality is because the number can capture more subtle variations in amplitude. For example, an 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values.
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second.
Mono and stereo are two common channel modes for audio. In mono mode, audio information is present in one channel. In stereo mode, audio information is present in two channels, usually labeled the left and right channels. Other modes with more channels, such as 5-channel surround sound, are also possible. Table 1 shows several formats of audio with different quality levels, along with corresponding raw bit rate costs.
TABLE 1. Bit rates for different quality audio information

Quality | Sample Depth (bits/sample) | Sampling Rate (samples/second) | Mode | Raw Bit Rate (bits/second)
Internet telephony | 8 | 8,000 | mono | 64,000
telephone | 8 | 11,025 | mono | 88,200
CD audio | 16 | 44,100 | stereo | 1,411,200
high quality audio | 16 | 48,000 | stereo | 1,536,000
As Table 1 shows, the cost of high quality audio information such as CD audio is a high bit rate. High quality audio information consumes large amounts of computer storage and transmission capacity.
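The raw bit rates in Table 1 follow from multiplying sample depth, sampling rate, and channel count. A minimal sketch of that arithmetic, where the function name `raw_bit_rate` is chosen here for illustration only:

```python
def raw_bit_rate(bits_per_sample, samples_per_second, channels):
    """Raw (uncompressed) bit rate in bits/second."""
    return bits_per_sample * samples_per_second * channels


# CD audio: 16-bit samples, 44,100 samples/second, stereo (2 channels)
cd_audio = raw_bit_rate(16, 44_100, 2)
# Internet telephony: 8-bit samples, 8,000 samples/second, mono
telephony = raw_bit_rate(8, 8_000, 1)
```

Under these inputs, CD audio costs roughly 22 times the raw bit rate of Internet telephony.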
Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bit rate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form.
Quantization is a conventional lossy compression technique. There are many different kinds of quantization including uniform and non-uniform quantization, scalar and vector quantization, and adaptive and non-adaptive quantization. Quantization maps ranges of input values to single values. For example, with uniform, scalar quantization by a factor of 3.0, a sample with a value anywhere between −1.5 and 1.499 is mapped to 0, a sample with a value anywhere between 1.5 and 4.499 is mapped to 1, etc. To reconstruct the sample, the quantized value is multiplied by the quantization factor, but the reconstruction is imprecise. Continuing the example started above, the quantized value 1 reconstructs to 1×3=3; it is impossible to determine where the original sample value was in the range 1.5 to 4.499. Quantization causes a loss in fidelity of the reconstructed value compared to the original value. Quantization can dramatically improve the effectiveness of subsequent lossless compression, however, thereby reducing bit rate.
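The uniform, scalar quantization example above can be sketched as follows. The function names are illustrative, and the rounding rule (round half up, so that 1.5 maps to 1 and −1.5 maps to 0) is an assumption chosen to match the ranges given in the example:

```python
import math


def quantize(sample, factor):
    """Map a sample to an integer level (uniform scalar quantization)."""
    # floor(x + 0.5) rounds half up; Python's round() rounds half to even,
    # which would not reproduce the ranges in the example.
    return math.floor(sample / factor + 0.5)


def reconstruct(level, factor):
    """Recover an approximation of the sample from its quantized level."""
    return level * factor


level = quantize(4.2, 3.0)        # anywhere in [1.5, 4.5) gives level 1
approx = reconstruct(level, 3.0)  # 3.0, regardless of where 4.2 fell in the range
```

The reconstruction error here is 4.2 − 3.0 = 1.2, which illustrates the loss in fidelity that quantization introduces.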
An audio encoder can use various techniques to provide the best possible quality for a given bit rate, including transform coding, rate control, and modeling human perception of audio. As a result of these techniques, an audio signal can be more heavily quantized at selected frequencies or times to decrease bit rate, yet the increased quantization will not significantly degrade perceived quality for a listener.
Transform coding techniques convert information into a form that makes it easier to separate perceptually important information from perceptually unimportant information. The less important information can then be quantized heavily, while the more important information is preserved, so as to provide the best perceived quality for a given bit rate. Transform coding techniques typically convert information into the frequency (or spectral) domain. For example, a transform coder converts a time series of audio samples into frequency coefficients. Transform coding techniques include Discrete Cosine Transform [“DCT”], Modulated Lapped Transform [“MLT”], and Fast Fourier Transform [“FFT”]. In practice, the input to a transform coder is partitioned into blocks, and each block is transform coded. Blocks may have varying or fixed sizes, and may or may not overlap with an adjacent block. After transform coding, a frequency range of coefficients may be grouped for the purpose of quantization, in which case each coefficient is quantized like the others in the group, and the frequency range is called a quantization band. For more information about transform coding and MLT in particular, see Gibson et al., Digital Compression for Multimedia, “Chapter 7: Frequency Domain Coding,” Morgan Kaufmann Publishers, Inc., pp. 227-262 (1998); U.S. Pat. No. 6,115,689 to Malvar; H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Norwood, Mass., 1992; or Seymour Shlien, “The Modulated Lapped Transform, Its Time-Varying Forms, and Its Application to Audio Coding Standards,” IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 4, pp. 359-66, July 1997.
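As a rough illustration of converting a block of time samples into frequency coefficients, the sketch below computes an unnormalized DCT-II directly from its definition. It is a direct O(N²) evaluation for clarity, not the MLT or any fast algorithm an encoder would use, and the function name is illustrative:

```python
import math


def dct_ii(block):
    """Unnormalized DCT-II of one block of samples.

    X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)
    """
    n = len(block)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(block))
        for k in range(n)
    ]


# A constant block has all of its energy in the lowest coefficient.
coeffs = dct_ii([1.0, 1.0, 1.0, 1.0])
```

For the constant block, only the first coefficient is non-zero (up to floating-point rounding), so the remaining coefficients could be quantized heavily at no cost in quality.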
With rate control, an encoder adjusts quantization to regulate bit rate. For audio information at a constant quality, complex information typically has a higher bit rate (is less compressible) than simple information. So, if the complexity of audio information changes in a signal, the bit rate may change. In addition, changes in transmission capacity (such as those due to Internet traffic) affect available bit rate in some applications. The encoder can decrease bit rate by increasing quantization, and vice versa. Because the relation between degree of quantization and bit rate is complex and hard to predict in advance, the encoder can try different degrees of quantization to get the best quality possible for some bit rate, which is an example of a quantization loop.
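A quantization loop of this kind can be sketched as a search over step sizes. Everything below is a toy under stated assumptions: the bit-cost model in `estimate_bits` is hypothetical (a real encoder would measure the output of its entropy coder), and the loop only coarsens the step by a fixed growth factor until the budget is met:

```python
import math


def quantize_all(values, step):
    """Uniform scalar quantization of a list of coefficients."""
    return [math.floor(v / step + 0.5) for v in values]


def estimate_bits(levels):
    """Hypothetical cost model: one flag bit plus the magnitude's bit length."""
    return sum(1 + abs(q).bit_length() for q in levels)


def fit_step(values, bit_budget, step=1.0, growth=2.0, max_iter=64):
    """Increase the step size until the estimated bit count fits the budget."""
    for _ in range(max_iter):
        levels = quantize_all(values, step)
        if estimate_bits(levels) <= bit_budget:
            return step, levels
        step *= growth  # coarser quantization lowers the bit rate
    raise ValueError("bit budget not reachable within max_iter steps")


step, levels = fit_step([10.0, -7.0, 3.0, 0.5], bit_budget=10)
```

The loop returns the finest step size it tried that meets the budget, which corresponds to the best quality available under this simple search.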
II. Human Perception of Audio Information
In addition to the factors that determine objective audio quality, perceived audio quality also depends on how the human body processes audio information. For this reason, audio processing tools often process audio information according to an auditory model of human perception.
Typically, an auditory model considers the range of human hearing and critical bands. Humans can hear sounds ranging from roughly 20 Hz to 20 kHz, and are most sensitive to sounds in the 2-4 kHz range. The human nervous system integrates sub-ranges of frequencies. For this reason, an auditory model may organize and process audio information by critical bands. For example, one critical band scale groups frequencies into 24 critical bands with upper cut-off frequencies (in Hz) at 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, and 15500. Different auditory models use a different number of critical bands (e.g., 25, 32, 55, or 109) and/or different cutoff frequencies for the critical bands. Bark bands are a well-known example of critical bands.
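Using the 24 upper cut-off frequencies listed above, a frequency can be assigned to a critical band with a binary search. This is a minimal sketch: the zero-based band numbering and the choice to treat each upper cut-off as belonging to the band it closes are assumptions made here:

```python
from bisect import bisect_left

# Upper cut-off frequencies (Hz) of the 24 critical bands listed above.
UPPER_CUTOFFS_HZ = [
    100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
    2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500,
]


def critical_band(frequency_hz):
    """Return the zero-based critical band index, or None above the last band."""
    if frequency_hz < 0:
        raise ValueError("frequency must be non-negative")
    index = bisect_left(UPPER_CUTOFFS_HZ, frequency_hz)
    return index if index < len(UPPER_CUTOFFS_HZ) else None


band_of_1khz = critical_band(1000)  # falls between 920 Hz and 1080 Hz
```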
Aside from range and critical bands, interactions between audio signals can dramatically affect perception. An audio signal that is clearly audible if presented alone can be completely inaudible in the presence of another audio signal, called the masker or the masking signal. The human ear is relatively insensitive to distortion or other loss in fidelity (i.e., noise) in the masked signal, so the masked signal can include more distortion without degrading perceived audio quality. Table 2 lists various factors and how the factors relate to perception of an audio signal.
TABLE 2. Various factors that relate to perception of audio

Factor | Relation to Perception of an Audio Signal
outer and middle ear transfer | Generally, the outer and middle ear attenuate higher frequency information and pass middle frequency information. Noise is less audible in higher frequencies than middle frequencies.
noise in the auditory nerve | Noise present in the auditory nerve, together with noise from the flow of blood, increases for low frequency information. Noise is less audible in lower frequencies than middle frequencies.
perceptual frequency scales | Depending on the frequency of the audio signal, hair cells at different positions in the inner ear react, which affects the pitch that a human perceives. Critical bands relate frequency to pitch.
excitation | Hair cells typically respond several milliseconds after the onset of the audio signal at a frequency. After exposure, hair cells and neural processes need time to recover full sensitivity. Moreover, loud signals are processed faster than quiet signals. Noise can be masked when the ear will not sense it.
detection | Humans are better at detecting changes in loudness for quieter signals than louder signals. Noise can be masked in quieter signals.
simultaneous masking | For a masker and maskee present at the same time, the maskee is masked at the frequency of the masker but also at frequencies above and below the masker. The amount of masking depends on the masker and maskee structures and the masker frequency.
temporal masking | The masker has a masking effect before and after the masker itself. Generally, forward masking is more pronounced than backward masking. The masking effect diminishes further away from the masker in time.
loudness | Perceived loudness of a signal depends on frequency, duration, and sound pressure level. The components of a signal partially mask each other, and noise can be masked as a result.
cognitive processing | Cognitive effects influence perceptual audio quality. Abrupt changes in quality are objectionable. Different components of an audio signal are important in different applications (e.g., speech vs. music).

III. Measuring Audio Quality
In various applications, engineers measure audio quality. For example, quality measurement can be used to evaluate the performance of different audio encoders or other equipment, or the degradation introduced by a particular processing step. For some applications, speed is emphasized over accuracy. For other applications, quality is measured off-line and more rigorously.
Subjective listening tests are one way to measure audio quality. Different people evaluate quality differently, however, and even the same person can be inconsistent over time. By standardizing the evaluation procedure and quantifying the results of evaluation, subjective listening tests can be made more consistent, reliable, and reproducible. In many applications, however, quality must be measured quickly or results must be very consistent over time, so subjective listening tests are inappropriate.
Conventional measures of objective audio quality include signal to noise ratio [“SNR”] and distortion of the reconstructed audio signal compared to the original audio signal. SNR is the ratio of the amplitude of the signal to the amplitude of the noise, and is usually expressed in terms of decibels. Distortion D can be calculated as the square of the difference between an original value and its reconstructed value:
$$D = \left(u - q(u)\,Q\right)^2 \qquad (1)$$
where u is an original value, q(u) is a quantized version of the original value, and Q is a quantization factor. Both SNR and distortion are simple to calculate, but fail to account for the audibility of noise. Namely, SNR and distortion fail to account for the varying sensitivity of the human ear to noise at different frequencies and levels of loudness, interaction with other sounds present in the signal (i.e., masking), or the physical limitations of the human ear (i.e., the need to recover sensitivity). Both SNR and distortion fail to accurately predict perceived audio quality in many cases.
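Both conventional measures take only a few lines to compute, which is part of their appeal. The sketch below makes two assumptions: SNR is computed from summed signal and noise energies (equivalent to the squared ratio of RMS amplitudes) and reported in decibels, and q(u) in equation (1) is taken to be round-half-up quantization by the factor Q:

```python
import math


def snr_db(original, reconstructed):
    """Signal to noise ratio in dB, from signal and noise energies."""
    signal_energy = sum(x * x for x in original)
    noise_energy = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    if noise_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / noise_energy)


def distortion(u, q_factor):
    """Equation (1): D = (u - q(u) * Q)^2, with q(u) assumed to round half up."""
    q_u = math.floor(u / q_factor + 0.5)
    return (u - q_u * q_factor) ** 2


d = distortion(4.0, 3.0)                 # 4.0 reconstructs to 3.0
snr = snr_db([3.0, 4.0], [3.0, 3.5])     # energy ratio of 100
```

Neither function takes frequency, loudness, or masking into account, which is the limitation the paragraph above describes.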
ITU-R BS 1387 is an international standard for objectively measuring perceived audio quality. The standard describes several quality measurement techniques and auditory models. The techniques measure the quality of a test audio signal compared to a reference audio signal, in mono or stereo mode.
FIG. 1 shows a masked threshold approach (100) to measuring audio quality described in ITU-R BS 1387, Annex 1, Appendix 4, Sections 2, 3, and 4.2. In the masked threshold approach (100), a first time to frequency mapper (110) maps a reference signal (102) to frequency data, and a second time to frequency mapper (120) maps a test signal (104) to frequency data. A subtractor (130) determines an error signal from the difference between the reference signal frequency data and the test signal frequency data. An auditory modeler (140) processes the reference signal frequency data, including calculation of a masked threshold for the reference signal. The error to threshold comparator (150) then compares the error signal to the masked threshold, generating an audio quality estimate (152), for example, based upon the differences in levels between the error signal and the masked threshold.
ITU-R BS 1387 describes in greater detail several other quality measures and auditory models. In an FFT-based ear model, reference and test signals at 48 kHz are each split into windows of 2048 samples such that there is 50% overlap across consecutive windows. A Hann window function and FFT are applied, and the resulting frequency coefficients are filtered to model the filtering effects of the outer and middle ear. An error signal is calculated as the difference between the frequency coefficients of the reference signal and those of the test signal. For each of the error signal, the reference signal, and the test signal, the energy is calculated by squaring the signal values. The energies are then mapped to critical bands/pitches. For each critical band, the energies of the coefficients contributing to (e.g., within) that critical band are added together. For the reference signal and the test signal, the energies for the critical bands are then smeared across frequencies and time to model simultaneous and temporal masking. The outputs of the smearing are called excitation patterns. A masking threshold can then be calculated for an excitation pattern:
$$M[k,n] = \frac{E[k,n]}{10^{\,m[k]/10}} \qquad (2)$$
for m[k] = 3.0 if k*res ≤ 12 and m[k] = k*res if k*res > 12, where k is the critical band, res is the resolution of the band scale in terms of Bark bands, n is the frame, and E[k,n] is the excitation pattern.
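Equation (2) lowers each band's excitation by an offset m[k] expressed in decibels. The sketch below takes the per-band offsets as an input rather than deriving them, so it makes no assumption about how m[k] is chosen; the function and parameter names are illustrative:

```python
def masking_threshold(excitation, mask_offset_db):
    """Equation (2) for one frame: M[k] = E[k] / 10^(m[k] / 10).

    excitation     -- E[k, n] for each critical band k of frame n
    mask_offset_db -- m[k] for each critical band, in dB
    """
    return [e / 10.0 ** (m / 10.0) for e, m in zip(excitation, mask_offset_db)]


# A 10 dB offset places the threshold at one tenth of the excitation;
# a 3 dB offset places it at roughly half.
thresholds = masking_threshold([100.0, 100.0], [10.0, 3.0])
```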
From the excitation patterns, error signal, and other outputs of the ear model, ITU-R BS 1387 describes calculating Model Output Variables [“MOVs”]. One MOV is the average noise to mask ratio [“NMR”] for a frame:
$$\mathrm{NMR}_{\mathrm{local}}[n] = 10 \log_{10}\left( \frac{1}{Z} \sum_{k=0}^{Z-1} \frac{P_{\mathrm{noise}}[k,n]}{M[k,n]} \right) \qquad (3)$$
where n is the frame number, Z is the number of critical bands per frame, P_noise[k,n] is the noise pattern, and M[k,n] is the masking threshold. NMR can also be calculated for a whole signal as a combination of NMR values for frames.
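Equation (3) averages the per-band ratio of noise to masking threshold and expresses the result in decibels. A minimal sketch for a single frame, with illustrative names and the assumption that every threshold value is positive:

```python
import math


def nmr_local_db(noise_pattern, mask_threshold):
    """Equation (3): 10 * log10 of the mean of P_noise[k] / M[k] over Z bands."""
    z = len(noise_pattern)
    mean_ratio = sum(p / m for p, m in zip(noise_pattern, mask_threshold)) / z
    return 10.0 * math.log10(mean_ratio)


# Noise one tenth of the threshold in every band: 10 dB below the mask.
nmr = nmr_local_db([1.0, 1.0], [10.0, 10.0])
```

Negative values indicate that, on average, the noise sits below the masking threshold for the frame.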
In ITU-R BS 1387, NMR and other MOVs are weighted and aggregated to give a single output quality value. The weighting ensures that the single output value is consistent with the results of subjective listening tests. For stereo signals, the linear average of MOVs for the left and right channels is taken. For more information about the FFT-based ear model and calculation of NMR and other MOVs, see ITU-R BS 1387, Annex 2, Sections 2.1 and 4-6. ITU-R BS 1387 also describes a filter bank-based ear model. The Beerends reference also describes audio quality measurement, as does Solari, Digital Video and Audio Compression, “Chapter 8: Sound and Audio,” McGraw-Hill, Inc., pp. 187-212 (1997).
Compared to subjective listening tests, the techniques described in ITU-R BS 1387 are more consistent and reproducible. Nonetheless, the techniques have several shortcomings. First, the techniques are complex and time-consuming, which limits their usefulness for real-time applications. For example, the techniques are too complex to be used effectively in a quantization loop in an audio encoder. Second, the NMR of ITU-R BS 1387 measures perceptible degradation compared to the masking threshold for the original signal, which can inaccurately estimate the perceptible degradation for a listener of the reconstructed signal. For example, the masking threshold of the original signal can be higher or lower than the masking threshold of the reconstructed signal due to the effects of quantization. A masking component in the original signal might not even be present in the reconstructed signal. Third, the NMR of ITU-R BS 1387 fails to adequately weight NMR on a per-band basis, which limits its usefulness and adaptability. Aside from these shortcomings, the techniques described in ITU-R BS 1387 present several practical problems for an audio encoder. The techniques presuppose input at a fixed rate (48 kHz). The techniques assume fixed transform block sizes, and use a transform and window function (in the FFT-based ear model) that can be different than the transform used in the encoder, which is inefficient. Finally, the number of quantization bands used in the encoder is not necessarily equal to the number of critical bands in an auditory model of ITU-R BS 1387.
Microsoft Corporation's Windows Media Audio version 7.0 [“WMA7”] partially addresses some of the problems with implementing quality measurement in an audio encoder. In WMA7, the encoder may jointly code the left and right channels of stereo mode audio into a sum channel and a difference channel. The sum channel is the average of the left and right channels; the difference channel is the difference between the left and right channels divided by two. The encoder calculates a noise signal for each of the sum channel and the difference channel, where the noise signal is the difference between the original channel and the reconstructed channel. The encoder then calculates the maximum Noise to Excitation Ratio [“NER”] of all quantization bands in the sum channel and difference channel:
$$\mathrm{NER}_{\max\,\mathrm{of\,all}\,d} = \max\left( \max_{d}\left( \frac{F_{\mathrm{Diff}}[d]}{E_{\mathrm{Diff}}[d]} \right),\ \max_{d}\left( \frac{F_{\mathrm{Sum}}[d]}{E_{\mathrm{Sum}}[d]} \right) \right) \qquad (4)$$
where d is the quantization band number, max_d is the maximum value across all d, and E_Diff[d], E_Sum[d], F_Diff[d], and F_Sum[d] are the excitation pattern for the difference channel, the excitation pattern for the sum channel, the noise pattern of the difference channel, and the noise pattern of the sum channel, respectively, for quantization bands. In WMA7, calculating an excitation or noise pattern includes squaring values to determine energies, and then, for each quantization band, adding the energies of the coefficients within that quantization band. If WMA7 does not use jointly coded channels, the same equation is used to measure the quality of left and right channels. That is,
$$\mathrm{NER}_{\max\,\mathrm{of\,all}\,d} = \max\left( \max_{d}\left( \frac{F_{\mathrm{Left}}[d]}{E_{\mathrm{Left}}[d]} \right),\ \max_{d}\left( \frac{F_{\mathrm{Right}}[d]}{E_{\mathrm{Right}}[d]} \right) \right) \qquad (5)$$
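The maximum NER of equations (4) and (5) can be sketched from the description above: square the coefficients, sum the energies within each quantization band, and take the largest noise-to-excitation ratio over all bands and both channels. The function names and the `band_edges` representation are assumptions made for illustration, and bands with zero excitation are skipped to avoid division by zero:

```python
def to_sum_diff(left, right):
    """Joint coding: sum = (L + R) / 2, difference = (L - R) / 2."""
    total = [(l + r) / 2.0 for l, r in zip(left, right)]
    diff = [(l - r) / 2.0 for l, r in zip(left, right)]
    return total, diff


def band_energies(coeffs, band_edges):
    """Sum of squared coefficients within each quantization band."""
    return [
        sum(c * c for c in coeffs[lo:hi])
        for lo, hi in zip(band_edges, band_edges[1:])
    ]


def max_ner(channels, band_edges):
    """Largest F[d] / E[d] over all bands d and all (original, reconstructed) channels."""
    worst = 0.0
    for original, reconstructed in channels:
        noise = [o - r for o, r in zip(original, reconstructed)]
        excitation = band_energies(original, band_edges)
        noise_pattern = band_energies(noise, band_edges)
        ratios = [f / e for f, e in zip(noise_pattern, excitation) if e > 0]
        worst = max(worst, max(ratios, default=0.0))
    return worst


edges = [0, 2, 4]  # two quantization bands of two coefficients each
ner = max_ner(
    [
        ([2.0, 2.0, 1.0, 1.0], [2.0, 1.0, 1.0, 1.0]),  # first channel
        ([1.0, 0.0, 0.0, 1.0], [0.5, 0.0, 0.0, 1.0]),  # second channel
    ],
    edges,
)
```

The same `max_ner` call applies to either channel pairing: pass the sum and difference channels for equation (4), or the left and right channels for equation (5).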
WMA7 works in real time and measures audio quality for input with rates other than 48 kHz. WMA7 uses an MLT with variable transform block sizes, and measures audio quality using the same frequency coefficients used in compression. WMA7 does not address several of the problems of ITU-R BS 1387, however, and WMA7 has several other shortcomings as well, each of which decreases the accuracy of the measurement of perceptual audio quality. First, although the quality measurement of WMA7 is simple enough to be used in a quantization loop of the audio encoder, it does not adequately correlate with actual human perception. As a result, changes in quality in order to keep constant bit rate can be dramatic and perceptible. Second, the NER of WMA7 measures perceptible degradation compared to the excitation pattern of the original information (as opposed to reconstructed information), which can inaccurately estimate perceptible degradation for a listener of the reconstructed signal. Third, the NER of WMA7 fails to adequately weight NER on a per-band basis, which limits its usefulness and adaptability. Fourth, although WMA7 works with variable-size transform blocks, WMA7 is unable to perform operations such as temporal masking between blocks due to the variable sizes. Fifth, WMA7 measures quality with respect to excitation and noise patterns for quantization bands, which are not necessarily related to a model of human perception with critical bands, and which can be different in different variable-size blocks, preventing comparisons of results. Sixth, WMA7 measures the maximum NER for all quantization bands of a channel, which can inappropriately ignore the contribution of NERs for other quantization bands. Seventh, WMA7 applies the same quality measurement techniques whether independently or jointly coded channels are used, which ignores differences between the two channel modes.
Aside from WMA7, several international standards describe audio encoders that incorporate an auditory model. The Moving Picture Experts Group, Audio Layer 3 [“MP3”] and Moving Picture Experts Group 2, Advanced Audio Coding [“AAC”] standards each describe techniques for measuring distortion in a reconstructed audio signal against thresholds set with an auditory model.
In MP3, the encoder incorporates a psychoacoustic model to calculate Signal to Mask Ratios [“SMRs”] for frequency ranges called threshold calculation partitions. In a path separate from the rest of the encoder, the encoder processes the original audio information according to the psychoacoustic model. The psychoacoustic model uses a different frequency transform than the rest of the encoder (FFT vs. hybrid polyphase/MDCT filter bank) and uses separate computations for energy and other parameters. In the psychoacoustic model, the MP3 encoder processes blocks of frequency coefficients according to the threshold calculation partitions, which have sub-Bark band resolution (e.g., 62 partitions for a long block of 48 kHz input). The encoder calculates an SMR for each partition. The encoder converts the SMRs for the partitions into SMRs for scale factor bands. A scale factor band is a range of frequency coefficients for which the encoder calculates a weight called a scale factor. The number of scale factor bands depends on sampling rate and block size (e.g., 21 scale factor bands for a long block of 48 kHz input). The encoder later converts the SMRs for the scale factor bands into allowed distortion thresholds for the scale factor bands.
In an outer quantization loop, the MP3 encoder compares distortions for scale factor bands to the allowed distortion thresholds for the scale factor bands. Each scale factor starts with a minimum weight for a scale factor band. For the starting set of scale factors, the encoder finds a satisfactory quantization step size in an inner quantization loop. In the outer quantization loop, the encoder amplifies the scale factors until the distortion in each scale factor band is less than the allowed distortion threshold for that scale factor band, with the encoder repeating the inner quantization loop for each adjusted set of scale factors. In special cases, the encoder exits the outer quantization loop even if distortion exceeds the allowed distortion threshold for a scale factor band (e.g., if all scale factors have been amplified or if a scale factor has reached a maximum amplification).
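The control flow of the outer loop can be illustrated with a toy sketch. It is not the procedure defined in the MP3 standard: the inner loop that searches for a step size is replaced by a fixed `step` argument, the amplification of 2^(sf/2) per scale factor increment is an assumption, and all names are hypothetical. What it does show is the per-band comparison against allowed distortion and the two special-case exits described above:

```python
import math


def reconstruct_band(coeffs, step, scale):
    """Amplify by scale, quantize with step (round half up), then undo both."""
    return [math.floor(c * scale / step + 0.5) * step / scale for c in coeffs]


def band_distortion(coeffs, recon):
    """Sum of squared errors within one scale factor band."""
    return sum((c - r) ** 2 for c, r in zip(coeffs, recon))


def outer_loop(bands, allowed, step, max_sf=8):
    """Return (scale_factors, all_bands_ok) for a fixed quantization step."""
    scale_factors = [0] * len(bands)  # each band starts at the minimum weight
    while True:
        over = []
        for i, band in enumerate(bands):
            scale = 2.0 ** (scale_factors[i] / 2.0)  # assumed amplification rule
            recon = reconstruct_band(band, step, scale)
            if band_distortion(band, recon) > allowed[i]:
                over.append(i)
        if not over:
            return scale_factors, True
        # Special-case exits: every band already amplified, or a band at maximum.
        all_amplified = all(sf > 0 for sf in scale_factors)
        hit_max = any(scale_factors[i] >= max_sf for i in over)
        if all_amplified or hit_max:
            return scale_factors, False
        for i in over:
            scale_factors[i] += 1  # amplify only the bands that are too noisy


result = outer_loop([[0.4, -0.4], [2.0, 3.0]], allowed=[0.1, 0.1], step=1.0)
```

In this example the first band needs two amplification steps before its distortion drops below the threshold, while the second band quantizes exactly and is never amplified.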
Before the quantization loops, the MP3 encoder can switch between long blocks of 576 frequency coefficients and short blocks of 192 frequency coefficients (sometimes called long windows or short windows). Instead of a long block, the encoder can use three short blocks for better time resolution. The number of scale factor bands is different for short blocks and long blocks (e.g., 12 scale factor bands vs. 21 scale factor bands). The MP3 encoder runs the psychoacoustic model twice (in parallel, once for long blocks and once for short blocks) using different techniques to calculate SMR depending on the block size.
The MP3 encoder can use any of several different coding channel modes, including single channel, two independent channels (left and right channels), or two jointly coded channels (sum and difference channels). If the encoder uses jointly coded channels, the encoder computes a set of scale factors for each of the sum and difference channels using the same techniques that are used for left and right channels. Or, if the encoder uses jointly coded channels, the encoder can instead use intensity stereo coding. Intensity stereo coding changes how scale factors are determined for higher frequency scale factor bands and changes how sum and difference channels are reconstructed, but the encoder still computes two sets of scale factors for the two channels.
For additional information about MP3 and AAC, see the MP3 standard (“ISO/IEC 11172-3, Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media at Up to About 1.5 Mbit/s—Part 3: Audio”) and the AAC standard.
Although MP3 encoding has achieved widespread adoption, it is unsuitable for some applications (for example, real-time audio streaming at very low to mid bit rates) for several reasons. First, calculating SMRs and allowed distortion thresholds with MP3's psychoacoustic model occurs outside of the quantization loops. The psychoacoustic model is too complex for some applications, and cannot be integrated into a quantization loop for such applications. At the same time, as the psychoacoustic model is outside of the quantization loops, it works with original audio information (as opposed to reconstructed audio information), which can lead to inaccurate estimation of perceptible degradation for a listener of the reconstructed signal at lower bit rates. Second, the MP3 encoder fails to adequately weight SMRs and allowed distortion thresholds on a per-band basis, which limits the usefulness and adaptability of the MP3 encoder. Third, computing SMRs and allowed distortion thresholds in separate tracks for long blocks and short blocks prevents or complicates operations such as temporal spreading or comparing measures for blocks of different sizes. Fourth, the MP3 encoder does not adequately exploit differences between independently coded channels and jointly coded channels when calculating SMRs and allowed distortion thresholds.