With the introduction of compact discs, digital wireless telephone networks, and audio delivery over the Internet, digital audio has become commonplace. Engineers use a variety of techniques to process digital audio efficiently while still maintaining its quality. To understand these techniques, it helps to understand how audio information is represented and processed in a computer.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio information. For example, a single number can represent an audio sample, which is an amplitude value (i.e., loudness) at a particular time. Several factors affect the quality of the audio information, including sample depth, sampling rate, and channel mode.
Sample depth (or precision) indicates the range of numbers used to represent a sample. The more values possible for the sample, the higher the quality because the number can capture more subtle variations in amplitude. For example, an 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values. A 24-bit sample can capture normal loudness variations very finely, and can also capture unusually high loudness.
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second.
Mono and stereo are two common channel modes for audio. In mono mode, audio information is present in one channel. In stereo mode, audio information is present in two channels usually labeled the left and right channels. Other modes with more channels such as 5.1 channel, 7.1 channel, or 9.1 channel surround sound (the “1” indicates a sub-woofer or low-frequency effects channel) are also possible. Table 1 shows several formats of audio with different quality levels, along with corresponding raw bitrate costs.
TABLE 1
Bitrates for different quality audio information

Quality              Sample Depth     Sampling Rate       Mode      Raw Bitrate
                     (bits/sample)    (samples/second)              (bits/second)
Internet telephony   8                8,000               mono      64,000
Telephone            8                11,025              mono      88,200
CD audio             16               44,100              stereo    1,411,200
Surround sound audio typically has even higher raw bitrate. As Table 1 shows, the cost of high quality audio information is high bitrate. High quality audio information consumes large amounts of computer storage and transmission capacity. Companies and consumers increasingly depend on computers, however, to create, distribute, and play back high quality multi-channel audio content.
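The raw bitrates in Table 1 follow from multiplying sample depth, sampling rate, and number of channels. The following minimal sketch reproduces the three rows of the table. The function name raw_bitrate and the 5.1-channel example (24-bit, 48,000 samples/second, six channels) are illustrative assumptions and are not taken from Table 1.

```python
def raw_bitrate(sample_depth_bits: int, sampling_rate_hz: int, channels: int) -> int:
    """Raw (uncompressed) bitrate in bits/second."""
    return sample_depth_bits * sampling_rate_hz * channels


# Rows of Table 1
print(raw_bitrate(8, 8_000, 1))    # 64000   (Internet telephony)
print(raw_bitrate(8, 11_025, 1))   # 88200   (telephone)
print(raw_bitrate(16, 44_100, 2))  # 1411200 (CD audio)

# Assumed surround example, not from Table 1: 24-bit, 48 kHz, 5.1 (six channels)
print(raw_bitrate(24, 48_000, 6))  # 6912000
```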
II. Processing Audio Information in a Computer
Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bitrate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers but bitrate reduction from subsequent lossless compression is more dramatic). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form.
A. Standard Perceptual Audio Encoders and Decoders
Generally, the goal of audio compression is to represent audio signals digitally with maximum signal quality using the fewest possible bits. A conventional audio encoder/decoder [“codec”] system uses subband/transform coding, quantization, rate control, and variable length coding to achieve its compression. The quantization and other lossy compression techniques introduce potentially audible noise into an audio signal. The audibility of the noise depends on how much noise there is and how much of the noise the listener perceives. The first factor relates mainly to objective quality, while the second factor depends on human perception of sound.
FIG. 1 shows a generalized diagram of a transform-based, perceptual audio encoder (100) according to the prior art. FIG. 2 shows a generalized diagram of a corresponding audio decoder (200) according to the prior art. Though the codec system shown in FIGS. 1 and 2 is generalized, it has characteristics found in several real world codec systems, including versions of Microsoft Corporation's Windows Media Audio [“WMA”] encoder and decoder. Other codec systems are provided or specified by the Moving Picture Experts Group, Audio Layer 3 [“MP3”] standard, the Moving Picture Experts Group 2, Advanced Audio Coding [“AAC”] standard, and Dolby AC3. For additional information about the codec systems, see the respective standards or technical publications.
1. Perceptual Audio Encoder
Overall, the encoder (100) receives a time series of input audio samples (105), compresses the audio samples (105), and multiplexes information produced by the various modules of the encoder (100) to output a bitstream (195). The encoder (100) includes a frequency transformer (110), a multi-channel transformer (120), a perception modeler (130), a weighter (140), a quantizer (150), an entropy encoder (160), a controller (170), and a bitstream multiplexer [“MUX”] (180).
The frequency transformer (110) receives the audio samples (105) and converts them into data in the frequency domain. For example, the frequency transformer (110) splits the audio samples (105) into blocks, which can have variable size to allow variable temporal resolution. Small blocks allow for greater preservation of time detail at short but active transition segments in the input audio samples (105), but sacrifice some frequency resolution. In contrast, large blocks have better frequency resolution and worse time resolution, and usually allow for greater compression efficiency at longer and less active segments. Blocks can overlap to reduce perceptible discontinuities between blocks that could otherwise be introduced by later quantization. For multi-channel audio, the frequency transformer (110) uses the same pattern of windows for each channel in a particular frame. The frequency transformer (110) outputs blocks of frequency coefficient data to the multi-channel transformer (120) and outputs side information such as block sizes to the MUX (180).
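The trade-off between time resolution and frequency resolution described above can be illustrated numerically. The sketch below assumes a DFT-style transform in which a block of N samples spans N/fs seconds and yields frequency bins spaced fs/N apart. The actual transform used by the frequency transformer (110) is not specified here, and the block sizes of 256 and 2048 samples are illustrative assumptions.

```python
def block_resolution(block_size: int, sampling_rate_hz: int):
    """Return (block duration in ms, bin spacing in Hz) for one block.

    Assumes a DFT-style transform: N samples give bins spaced fs / N apart.
    """
    duration_ms = 1000.0 * block_size / sampling_rate_hz
    bin_spacing_hz = sampling_rate_hz / block_size
    return duration_ms, bin_spacing_hz


# Illustrative block sizes at 44,100 samples/second
for n in (256, 2048):
    duration_ms, spacing_hz = block_resolution(n, 44_100)
    print(f"{n:5d} samples: {duration_ms:6.2f} ms per block, {spacing_hz:7.2f} Hz per bin")
```

Under these assumptions the small block localizes a transient to within about 6 ms but resolves frequency only coarsely, while the large block does the opposite.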
For multi-channel audio data, the multiple channels of frequency coefficient data produced by the frequency transformer (110) often correlate. To exploit this correlation, the multi-channel transformer (120) can convert the multiple original, independently coded channels into jointly coded channels. For example, if the input is stereo mode, the multi-channel transformer (120) can convert the left and right channels into sum and difference channels:
$$X_{\mathrm{Sum}}[k] = \frac{X_{\mathrm{Left}}[k] + X_{\mathrm{Right}}[k]}{2}, \qquad (1)$$
$$X_{\mathrm{Diff}}[k] = \frac{X_{\mathrm{Left}}[k] - X_{\mathrm{Right}}[k]}{2}. \qquad (2)$$
Or, the multi-channel transformer (120) can pass the left and right channels through as independently coded channels. The decision to use independently or jointly coded channels is predetermined or made adaptively during encoding.
For example, the encoder (100) determines whether to code stereo channels jointly or independently with an open loop selection decision that considers (a) the energy separation between coding channels with and without the multi-channel transform and (b) the disparity in excitation patterns between the left and right input channels. Such a decision can be made on a window-by-window basis or only once per frame to simplify the decision. The multi-channel transformer (120) produces side information to the MUX (180) indicating the channel mode used.
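The sum and difference transform of equations (1) and (2), together with its inverse, can be sketched as follows. The function names to_sum_diff and from_sum_diff are hypothetical, and the inverse shown (left = sum + difference, right = sum - difference) is simply the algebraic inverse of equations (1) and (2), not a statement about any particular decoder.

```python
def to_sum_diff(left, right):
    """Equations (1) and (2): per-coefficient sum and difference channels."""
    sum_ch = [(l + r) / 2 for l, r in zip(left, right)]
    diff_ch = [(l - r) / 2 for l, r in zip(left, right)]
    return sum_ch, diff_ch


def from_sum_diff(sum_ch, diff_ch):
    """Algebraic inverse: left = sum + diff, right = sum - diff."""
    left = [s + d for s, d in zip(sum_ch, diff_ch)]
    right = [s - d for s, d in zip(sum_ch, diff_ch)]
    return left, right


left = [1.0, 0.5, -2.0]
right = [1.0, 0.25, 2.0]
sum_ch, diff_ch = to_sum_diff(left, right)
print(sum_ch)   # [1.0, 0.375, 0.0]
print(diff_ch)  # [0.0, 0.125, -2.0]
```

Where the left and right coefficients are similar, the difference channel is near zero and costs few bits to code, which is the correlation the transform exploits.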
The encoder (100) can apply multi-channel rematrixing to a block of audio data after a multi-channel transform. For low bitrate, multi-channel audio data in jointly coded channels, the encoder (100) selectively suppresses information in certain channels (e.g., the difference channel) to improve the quality of the remaining channel(s) (e.g., the sum channel). For example, the encoder (100) scales the difference channel by a scaling factor ρ:
$$\tilde{X}_{\mathrm{Diff}}[k] = \rho \cdot X_{\mathrm{Diff}}[k], \qquad (3)$$
where the value of ρ is based on: (a) current average levels of a perceptual audio quality measure such as Noise to Excitation Ratio [“NER”], (b) current fullness of a virtual buffer, (c) bitrate and sampling rate settings of the encoder (100), and (d) the channel separation in the left and right input channels.
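The effect of the scaling factor on the reconstructed stereo image can be worked out under two assumptions: the sum and difference channels are defined as the half-sum and half-difference of the left and right channels, and the decoder inverts that transform without undoing the scaling. Quantization is ignored, and the hatted symbols denote reconstructed channels.

```latex
\hat{X}_{\mathrm{Left}}[k]  = X_{\mathrm{Sum}}[k] + \rho \, X_{\mathrm{Diff}}[k], \qquad
\hat{X}_{\mathrm{Right}}[k] = X_{\mathrm{Sum}}[k] - \rho \, X_{\mathrm{Diff}}[k],

\hat{X}_{\mathrm{Left}}[k] - \hat{X}_{\mathrm{Right}}[k]
  = 2 \rho \, X_{\mathrm{Diff}}[k]
  = \rho \left( X_{\mathrm{Left}}[k] - X_{\mathrm{Right}}[k] \right).
```

Under these assumptions, ρ = 1 preserves the original channel separation, ρ = 0 collapses the output to mono (both channels equal the sum channel), and intermediate values narrow the stereo image in exchange for spending fewer bits on the difference channel.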
The perception modeler (130) processes audio data according to a model of the human auditory system to improve the perceived quality of the reconstructed audio signal for a given bitrate. For example, an auditory model typically considers the range of human hearing and critical bands. The human nervous system integrates sub-ranges of frequencies. For this reason, an auditory model may organize and process audio information by critical bands. Different auditory models use a different number of critical bands (e.g., 25, 32, 55, or 109) and/or different cut-off frequencies for the critical bands. Bark bands are a well-known example of critical bands. Aside from range and critical bands, interactions between audio signals can dramatically affect perception. An audio signal that is clearly audible if presented alone can be completely inaudible in the presence of another audio signal, called the masker or the masking signal. The human ear is relatively insensitive to distortion or other loss in fidelity (i.e., noise) in the masked signal, so the masked signal can include more distortion without degrading perceived audio quality. In addition, an auditory model can consider a variety of other factors relating to physical or neural aspects of human perception of sound.
The perception modeler (130) outputs information that the weighter (140) uses to shape noise in the audio data to reduce the audibility of the noise. For example, using any of various techniques, the weighter (140) generates weighting factors (sometimes called scaling factors) for quantization matrices (sometimes called masks) based upon the received information. The weighting factors in a quantization matrix include a weight for each of multiple quantization bands in the audio data, where the quantization bands are frequency ranges of frequency coefficients. The number of quantization bands can be the same as or less than the number of critical bands. Thus, the weighting factors indicate proportions at which noise is spread across the quantization bands, with the goal of minimizing the audibility of the noise by putting more noise in bands where it is less audible, and vice versa. The weighting factors can vary in amplitudes and number of quantization bands from block to block. The weighter (140) then applies the weighting factors to the data received from the multi-channel transformer (120).
In one implementation, the weighter (140) generates a set of weighting factors for each window of each channel of multi-channel audio, or shares a single set of weighting factors for parallel windows of jointly coded channels. The weighter (140) outputs weighted blocks of coefficient data to the quantizer (150) and outputs side information such as the sets of weighting factors to the MUX (180).
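A minimal sketch of per-band weighting follows. It assumes that the weighter divides each coefficient by the weight of its quantization band, so that a larger weight lets a later uniform quantizer place proportionally more noise in that band, and that the inverse operation multiplies by the same weight. The source does not state this convention. The names apply_weights, remove_weights, and band_edges are hypothetical.

```python
def apply_weights(coeffs, band_edges, weights):
    """Divide the coefficients of each quantization band by that band's weight.

    band_edges[i] and band_edges[i + 1] delimit band i (assumed layout).
    """
    out = []
    for band, weight in enumerate(weights):
        start, end = band_edges[band], band_edges[band + 1]
        out.extend(c / weight for c in coeffs[start:end])
    return out


def remove_weights(weighted, band_edges, weights):
    """Inverse operation: multiply each band by its weight."""
    out = []
    for band, weight in enumerate(weights):
        start, end = band_edges[band], band_edges[band + 1]
        out.extend(c * weight for c in weighted[start:end])
    return out


coeffs = [8.0, 4.0, 6.0, 3.0]
band_edges = [0, 2, 4]   # two bands of two coefficients each
weights = [2.0, 1.0]     # band 0 tolerates twice the noise of band 1
weighted = apply_weights(coeffs, band_edges, weights)
print(weighted)          # [4.0, 2.0, 6.0, 3.0]
```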
A set of weighting factors can be compressed for more efficient representation using direct compression. In the direct compression technique, the encoder (100) uniformly quantizes each element of a quantization matrix. The encoder then differentially codes the quantized elements relative to preceding elements in the matrix, and Huffman codes the differentially coded elements. In some cases (e.g., when all of the coefficients of particular quantization bands have been quantized or truncated to a value of 0), the decoder (200) does not require weighting factors for all quantization bands. In such cases, the encoder (100) gives values to one or more unneeded weighting factors that are identical to the value of the next needed weighting factor in a series, which makes differential coding of elements of the quantization matrix more efficient.
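The direct compression steps described above (uniform quantization, replacement of unneeded weighting factors, and differential coding) can be sketched as follows. Huffman coding of the differences is omitted. All function names, the step size of 2.0, and the example values are illustrative assumptions.

```python
def quantize_mask(weights, step):
    """Uniformly quantize each element of a quantization matrix."""
    return [round(w / step) for w in weights]


def fill_unneeded(levels, needed):
    """Give each unneeded element the value of the next needed element."""
    out = list(levels)
    next_needed = None
    for i in range(len(out) - 1, -1, -1):   # scan from the end of the series
        if needed[i]:
            next_needed = out[i]
        elif next_needed is not None:
            out[i] = next_needed
    return out


def differential_encode(levels):
    """Code each element relative to the preceding element (first relative to 0)."""
    prev, diffs = 0, []
    for v in levels:
        diffs.append(v - prev)
        prev = v
    return diffs


def differential_decode(diffs):
    prev, out = 0, []
    for d in diffs:
        prev += d
        out.append(prev)
    return out


levels = quantize_mask([20.0, 24.0, 6.0, 14.0, 30.0], step=2.0)
needed = [True, True, False, False, True]   # bands 2 and 3 quantized to all zeros
print(differential_encode(levels))                         # [10, 2, -9, 4, 8]
print(differential_encode(fill_unneeded(levels, needed)))  # [10, 2, 3, 0, 0]
```

In this example, replacing the unneeded values turns two large differences into zeros, which a variable length code can represent with fewer bits.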
Or, for low bitrate applications, the encoder (100) can parametrically compress a quantization matrix to represent the quantization matrix as a set of parameters, for example, using Linear Predictive Coding [“LPC”] of pseudo-autocorrelation parameters computed from the quantization matrix.
The quantizer (150) quantizes the output of the weighter (140), providing quantized coefficient data to the entropy encoder (160) and side information including quantization step size to the MUX (180). Quantization maps ranges of input values to single values, introducing irreversible loss of information, but also allowing the encoder (100) to regulate the quality and bitrate of the output bitstream (195) in conjunction with the controller (170). In FIG. 1, the quantizer (150) is an adaptive, uniform, scalar quantizer. The quantizer (150) applies the same quantization step size to each frequency coefficient, but the quantization step size itself can change from one iteration of a quantization loop to the next to affect the bitrate of the entropy encoder (160) output. Other kinds of quantization include non-uniform quantization, vector quantization, and/or non-adaptive quantization.
The entropy encoder (160) losslessly compresses quantized coefficient data received from the quantizer (150). The entropy encoder (160) can compute the number of bits spent encoding audio information and pass this information to the controller (170).
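Uniform scalar quantization with a single step size can be sketched as below. The rounding rule (Python's built-in round) and the example coefficients are assumptions made for illustration. The sketch shows that a larger step size yields smaller integer levels, which are cheaper to code, at the cost of a larger reconstruction error bounded by half the step size.

```python
def quantize(coeffs, step):
    """Map each coefficient to the index of the nearest multiple of step."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    """Reconstruct each coefficient as level * step."""
    return [q * step for q in levels]


coeffs = [0.9, -2.6, 4.1, 0.2]
print(quantize(coeffs, 1.0))                   # [1, -3, 4, 0]
print(quantize(coeffs, 2.0))                   # [0, -1, 2, 0]
print(dequantize(quantize(coeffs, 2.0), 2.0))  # [0.0, -2.0, 4.0, 0.0]
```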
The controller (170) works with the quantizer (150) to regulate the bitrate and/or quality of the output of the encoder (100). The controller (170) receives information from other modules of the encoder (100) and processes the received information to determine a desired quantization step size given current conditions. The controller (170) outputs the quantization step size to the quantizer (150) with the goal of satisfying bitrate and quality constraints.
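One simple way to picture the interaction between the controller (170), the quantizer (150), and the entropy encoder (160) is a loop that enlarges the step size until the coded block fits a bit budget. The sketch below is a hypothetical illustration and not the control algorithm of the encoder (100). The bit estimate is a stand-in for a real entropy coder, and the growth factor of 1.5 is arbitrary.

```python
def estimate_bits(levels):
    """Hypothetical cost model standing in for the entropy encoder:
    1 bit for a zero, otherwise 2 * bit_length(|level|) + 1 bits."""
    return sum(1 if q == 0 else 2 * abs(q).bit_length() + 1 for q in levels)


def choose_step_size(coeffs, bit_budget, initial_step=1.0, growth=1.5, max_iters=50):
    """Grow the step size until the estimated bit count fits the budget."""
    step = initial_step
    for _ in range(max_iters):
        levels = [round(c / step) for c in coeffs]
        bits = estimate_bits(levels)
        if bits <= bit_budget:
            return step, levels, bits
        step *= growth
    raise ValueError("bit budget not reachable within max_iters")


coeffs = [12.0, -7.5, 3.2, 0.4, -0.1, 0.05]
step, levels, bits = choose_step_size(coeffs, bit_budget=20)
print(step, levels, bits)  # 2.25 [5, -3, 1, 0, 0, 0] 18
```

A practical controller would also weigh a quality measure against the bitrate constraint, which this sketch leaves out.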
The encoder (100) can apply noise substitution and/or band truncation to a block of audio data. At low and mid-bitrates, the audio encoder (100) can use noise substitution to convey information in certain bands. In band truncation, if the measured quality for a block indicates poor quality, the encoder (100) can completely eliminate the coefficients in certain (usually higher frequency) bands to improve the overall quality in the remaining bands.
The MUX (180) multiplexes the side information received from the other modules of the audio encoder (100) along with the entropy encoded data received from the entropy encoder (160). The MUX (180) outputs the information in a format that an audio decoder recognizes. The MUX (180) includes a virtual buffer that stores the bitstream (195) to be output by the encoder (100) in order to smooth over short-term fluctuations in bitrate due to complexity changes in the audio.
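The smoothing role of a virtual buffer can be pictured with a leaky-bucket model, in which each frame adds its coded bits and a constant number of bits drains out per frame. This model, the function name, and the numbers below are assumptions for illustration. The source does not describe how the virtual buffer of the MUX (180) is implemented.

```python
def simulate_virtual_buffer(frame_bits, drain_bits_per_frame, capacity_bits):
    """Track buffer fullness after each frame under a leaky-bucket model."""
    fullness = 0
    history = []
    for bits in frame_bits:
        # Add this frame's bits, drain at the constant channel rate, floor at empty.
        fullness = max(0, fullness + bits - drain_bits_per_frame)
        if fullness > capacity_bits:
            raise OverflowError("frame does not fit in the virtual buffer")
        history.append(fullness)
    return history


# A complex frame (3000 bits) is absorbed by the buffer and drained over later frames.
print(simulate_virtual_buffer([1000, 3000, 500, 1500], 1500, 4000))  # [0, 1500, 500, 500]
```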
2. Perceptual Audio Decoder
Overall, the decoder (200) receives a bitstream (205) of compressed audio information including entropy encoded data as well as side information, from which the decoder (200) reconstructs audio samples (295). The audio decoder (200) includes a bitstream demultiplexer [“DEMUX”] (210), an entropy decoder (220), an inverse quantizer (230), a noise generator (240), an inverse weighter (250), an inverse multi-channel transformer (260), and an inverse frequency transformer (270).
The DEMUX (210) parses information in the bitstream (205) and sends information to the modules of the decoder (200). The DEMUX (210) includes one or more buffers to compensate for short-term variations in bitrate due to fluctuations in complexity of the audio, network jitter, and/or other factors.
The entropy decoder (220) losslessly decompresses entropy codes received from the DEMUX (210), producing quantized frequency coefficient data. The entropy decoder (220) typically applies the inverse of the entropy encoding technique used in the encoder.
The inverse quantizer (230) receives a quantization step size from the DEMUX (210) and receives quantized frequency coefficient data from the entropy decoder (220). The inverse quantizer (230) applies the quantization step size to the quantized frequency coefficient data to partially reconstruct the frequency coefficient data.
From the DEMUX (210), the noise generator (240) receives information indicating which bands in a block of data are noise substituted as well as any parameters for the form of the noise. The noise generator (240) generates the patterns for the indicated bands, and passes the information to the inverse weighter (250).
The inverse weighter (250) receives the weighting factors from the DEMUX (210), patterns for any noise-substituted bands from the noise generator (240), and the partially reconstructed frequency coefficient data from the inverse quantizer (230). As necessary, the inverse weighter (250) decompresses the weighting factors, for example, entropy decoding, inverse differentially coding, and inverse quantizing the elements of the quantization matrix. The inverse weighter (250) applies the weighting factors to the partially reconstructed frequency coefficient data for bands that have not been noise substituted. The inverse weighter (250) then adds in the noise patterns received from the noise generator (240) for the noise-substituted bands.
The inverse multi-channel transformer (260) receives the reconstructed frequency coefficient data from the inverse weighter (250) and channel mode information from the DEMUX (210). If multi-channel audio is in independently coded channels, the inverse multi-channel transformer (260) passes the channels through. If multi-channel data is in jointly coded channels, the inverse multi-channel transformer (260) converts the data into independently coded channels.
The inverse frequency transformer (270) receives the frequency coefficient data output by the inverse multi-channel transformer (260) as well as side information such as block sizes from the DEMUX (210). The inverse frequency transformer (270) applies the inverse of the frequency transform used in the encoder and outputs blocks of reconstructed audio samples (295).
B. Disadvantages of Standard Perceptual Audio Encoders and Decoders
Although perceptual encoders and decoders as described above have good overall performance for many applications, they have several drawbacks, especially for compression and decompression of multi-channel audio. The drawbacks limit the quality of reconstructed multi-channel audio in some cases, for example, when the available bitrate is small relative to the number of input audio channels.
1. Inflexibility in Frame Partitioning for Multi-Channel Audio
In various respects, the frame partitioning performed by the encoder (100) of FIG. 1 is inflexible.
As previously noted, the frequency transformer (110) breaks a frame of input audio samples (105) into one or more overlapping windows for frequency transformation, where larger windows provide better frequency resolution and redundancy removal, and smaller windows provide better time resolution. The better time resolution helps control audible pre-echo artifacts introduced when the signal transitions from low energy to high energy, but using smaller windows reduces compressibility, so the encoder must balance these considerations when selecting window sizes. For multi-channel audio, the frequency transformer (110) partitions the channels of a frame identically (i.e., identical window configurations in the channels), which can be inefficient in some cases, as illustrated in FIGS. 3a-3c. 
FIG. 3a shows the waveforms (300) of an example stereo audio signal. The signal in channel 0 includes transient activity, whereas the signal in channel 1 is relatively stationary. The encoder (100) detects the signal transition in channel 0 and, to reduce pre-echo, divides the frame into smaller overlapping, modulated windows (301) as shown in FIG. 3b. For the sake of simplicity, FIG. 3c shows the overlapped window configuration (302) in boxes, with dotted lines delimiting frame boundaries. Later figures also follow this convention.
A drawback of forcing all channels to have an identical window configuration is that a stationary signal in one or more channels (e.g., channel 1 in FIGS. 3a-3c) may be broken into smaller windows, lowering coding gains. Alternatively, the encoder (100) might force all channels to use larger windows, introducing pre-echo into one or more channels that have transients. This problem is exacerbated when more than two channels are to be coded.
AAC allows pair-wise grouping of channels for multi-channel transforms. Among left, right, center, back left, and back right channels, for example, the left and right channels might be grouped for stereo coding, and the back left and back right channels might be grouped for stereo coding. Different groups can have different window configurations, but both channels of a particular group have the same window configuration if stereo coding is used. This limits the flexibility of partitioning for multi-channel transforms in the AAC system, as does the use of only pair-wise groupings.
2. Inflexibility in Multi-Channel Transforms
The encoder (100) of FIG. 1 exploits some inter-channel redundancy, but is inflexible in various respects in terms of multi-channel transforms. The encoder (100) allows two kinds of transforms: (a) an identity transform (which is equivalent to no transform at all) or (b) sum-difference coding of stereo pairs. These limitations constrain multi-channel coding of more than two channels. Even in AAC, which can work with more than two channels, a multi-channel transform is limited to only a pair of channels at a time.
Several groups have experimented with multi-channel transformations for surround sound channels. For example, see Yang et al., “An Inter-Channel Redundancy Removal Approach for High-Quality Multichannel Audio Compression,” AES 109th Convention, Los Angeles, September 2000 [“Yang”], and Wang et al., “A Multichannel Audio Coding Algorithm for Inter-Channel Redundancy Removal,” AES 110th Convention, Amsterdam, Netherlands, May 2001 [“Wang”]. The Yang system uses a Karhunen-Loeve Transform [“KLT”] across channels to decorrelate the channels for good compression factors. The Wang system uses an integer-to-integer Discrete Cosine Transform [“DCT”]. Both systems give some good results, but still have several limitations.
First, using a KLT on audio samples (whether across the time domain or frequency domain as in the Yang system) does not control the distortion introduced in reconstruction. The KLT in the Yang system is not used successfully for perceptual audio coding of multi-channel audio. The Yang system does not control the amount of leakage from one (e.g., heavily quantized) coded channel across to multiple reconstructed channels in the inverse multi-channel transform. This shortcoming is pointed out in Kuo et al., “A Study of Why Cross Channel Prediction Is Not Applicable to Perceptual Audio Coding,” IEEE Signal Proc. Letters, vol. 8, no. 9, September 2001. In other words, quantization noise that is “inaudible” in one coded channel may become audible when spread across multiple reconstructed channels, since inverse weighting is performed before the inverse multi-channel transform. The Wang system overcomes this problem by placing the multi-channel transform after weighting and quantization in the encoder (and placing the inverse multi-channel transform before inverse quantization and inverse weighting in the decoder). The Wang system, however, has various other shortcomings. Performing the quantization prior to multi-channel transformation means that the multi-channel transformation must be integer-to-integer, limiting the number of transformations possible and limiting redundancy removal across channels.
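The leakage effect can be illustrated with a two-channel sum-difference transform rather than the KLT of the Yang system, which is an assumption made to keep the example small. Suppose the sum and difference channels are the half-sum and half-difference of the left and right channels, and quantization adds errors e_Sum[k] and e_Diff[k] to the two coded channels. The inverse transform then gives:

```latex
\hat{X}_{\mathrm{Left}}[k]
  = \left( X_{\mathrm{Sum}}[k] + e_{\mathrm{Sum}}[k] \right)
  + \left( X_{\mathrm{Diff}}[k] + e_{\mathrm{Diff}}[k] \right)
  = X_{\mathrm{Left}}[k] + e_{\mathrm{Sum}}[k] + e_{\mathrm{Diff}}[k],

\hat{X}_{\mathrm{Right}}[k]
  = \left( X_{\mathrm{Sum}}[k] + e_{\mathrm{Sum}}[k] \right)
  - \left( X_{\mathrm{Diff}}[k] + e_{\mathrm{Diff}}[k] \right)
  = X_{\mathrm{Right}}[k] + e_{\mathrm{Sum}}[k] - e_{\mathrm{Diff}}[k].
```

Each reconstructed channel therefore carries the noise of both coded channels. Noise that was shaped to be masked by the content of one coded channel is not necessarily masked by the content of each reconstructed channel in which it appears.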
Second, the Yang system is limited to KLT transforms. While KLT transforms adapt to the audio data being compressed, the flexibility of the Yang system to use different kinds of transforms is limited. Similarly, the Wang system uses integer-to-integer DCT for multi-channel transforms, which is not as good as conventional DCTs in terms of energy compaction, and the flexibility of the Wang system to use different kinds of transforms is limited.
Third, in the Yang and Wang systems, there is no mechanism to control which channels get transformed together, nor is there a mechanism to selectively group different channels at different times for multi-channel transformation. Such control helps limit the leakage of content across totally incompatible channels. Moreover, even channels that are compatible overall may be incompatible over some periods.
Fourth, in the Yang system, the multi-channel transformer lacks control over whether to apply the multi-channel transform at the frequency band level. Even among channels that are compatible overall, the channels might not be compatible at some frequencies or in some frequency bands. Similarly, the multi-channel transform of the encoder (100) of FIG. 1 lacks control at the sub-channel level; it does not control which bands of frequency coefficient data are multi-channel transformed, which ignores the inefficiencies that may result when fewer than all frequency bands of the input channels correlate.
Fifth, even when source channels are compatible, there is often a need to control the number of channels transformed together, so as to limit data overflow and reduce memory accesses while implementing the transform. In particular, the KLT of the Yang system is computationally complex. On the other hand, reducing the transform size also potentially reduces the coding gain compared to bigger transforms.
Sixth, sending information specifying multi-channel transformations can be costly in terms of bitrate. This is particularly true for the KLT of the Yang system, because the coefficients of the covariance matrix that are sent are real numbers.
Seventh, for low bitrate multi-channel audio, the quality of the reconstructed channels is very limited. Aside from the requirements of coding for low bitrate, this is in part due to the inability of the system to selectively and gracefully cut down the number of channels for which information is actually encoded.
3. Inefficiencies in Quantization and Weighting
In the encoder (100) of FIG. 1, the weighter (140) shapes distortion across bands in audio data and the quantizer (150) sets quantization step sizes to change the amplitude of the distortion for a frame and thereby balance quality versus bitrate. While the encoder (100) achieves a good balance of quality and bitrate in most applications, the encoder (100) still has several drawbacks.
First, the encoder (100) lacks direct control over quality at the channel level. The weighting factors shape overall distortion across quantization bands for an individual channel. The uniform, scalar quantization step size affects the amplitude of the distortion across all frequency bands and channels for a frame. Short of imposing very high or very low quality on all channels, the encoder (100) lacks direct control over setting equal or at least comparable quality in the reconstructed output for all channels.
Second, when weighting factors are lossy compressed, the encoder (100) lacks control over the resolution of quantization of the weighting factors. For direct compression of a quantization matrix, the encoder (100) uniformly quantizes elements of the quantization matrix, then uses differential coding and Huffman coding. The uniform quantization of mask elements does not adapt to changes in available bitrate or signal complexity. As a result, in some cases quantization matrices are encoded with more resolution than is needed given the overall low quality of the reconstructed audio, and in other cases quantization matrices are encoded with less resolution than should be used given the high quality of the reconstructed audio.
Third, the direct compression of quantization matrices in the encoder (100) fails to exploit temporal redundancies in the quantization matrices. The direct compression removes redundancy within a particular quantization matrix, but ignores temporal redundancy in a series of quantization matrices.
C. Down-Mixing Audio Channels
Aside from multi-channel audio encoding and decoding, Dolby Pro-Logic and several other systems perform down-mixing of multi-channel audio to facilitate compatibility with speaker configurations with different numbers of speakers. In the Dolby Pro-Logic down-mixing, for example, four channels are mixed down to two channels, with each of the two channels having some combination of the audio data in the original four channels. The two channels can be output on stereo-channel equipment, or the four channels can be reconstructed from the two channels for output on four-channel equipment.
While down-mixing of this nature solves some compatibility problems, it is limited to certain set configurations, for example, four to two channel down-mixing. Moreover, the mixing formulas are pre-determined and do not allow changes over time to adapt to the signal.
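A fixed four-to-two channel down-mix of the kind described above can be sketched as a constant mixing formula applied to every sample. The coefficients below (a gain of 1/sqrt(2) on the center and surround channels, with the surround channel subtracted in the right output) are illustrative assumptions and are not the Dolby Pro-Logic specification, which also involves processing not shown here. The function name downmix_4_to_2 is hypothetical.

```python
import math

G = 1 / math.sqrt(2)  # assumed gain for the center and surround channels


def downmix_4_to_2(left, right, center, surround):
    """Mix four channels down to two using fixed, signal-independent coefficients."""
    left_total = [l + G * c + G * s for l, c, s in zip(left, center, surround)]
    right_total = [r + G * c - G * s for r, c, s in zip(right, center, surround)]
    return left_total, right_total


# A center-only sample appears equally in both output channels.
print(downmix_4_to_2([0.0], [0.0], [1.0], [0.0]))
```

Because the coefficients are constants, the same combination is applied regardless of the signal content, which is the limitation noted above.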