Perceptual Transform Coding
Audio coding uses techniques that exploit perceptual models of human hearing. For example, weaker tones near strong ones are often masked, so they do not need to be coded. In traditional perceptual audio coding, this is exploited through adaptive quantization of different frequency data: perceptually important frequency data are allocated more bits, and thus finer quantization, and vice versa.
For example, transform coding is conventionally known as an efficient scheme for the compression of audio signals. In transform coding, a block of the input audio samples is transformed (e.g., via the Modified Discrete Cosine Transform or MDCT, which is the most widely used), processed, and quantized. The transform coefficients are quantized according to their perceptual importance (e.g., masking effects and the frequency sensitivity of human hearing), for example with a scalar quantizer.
When a scalar quantizer is used, the importance is mapped to a relative weighting, and the quantizer resolution (step size) for each coefficient is derived from its weight and the global resolution. The global resolution can be determined from target quality, bit rate, etc. For a given step size, each coefficient is quantized into a level that is either zero or a non-zero integer value.
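The weighted scalar quantization described above can be sketched as follows. This is a minimal illustration, not the method of any particular codec: the text does not say how a step size is derived from a weight, so the rule `step = global_step / weight` and the names `quantize`, `dequantize`, and `global_step` are assumptions made here for illustration.

```python
def quantize(coeffs, weights, global_step):
    """Quantize each coefficient to an integer level.

    Assumption (not from the text): step = global_step / weight, so a
    larger weight (more perceptually important) gives a finer step.
    """
    levels = []
    for coeff, weight in zip(coeffs, weights):
        step = global_step / weight
        levels.append(round(coeff / step))  # zero or a non-zero integer
    return levels


def dequantize(levels, weights, global_step):
    """Reconstruct approximate coefficients from the integer levels."""
    return [lvl * (global_step / w) for lvl, w in zip(levels, weights)]


coeffs = [10.3, -0.4, 0.2, 3.1]
weights = [2.0, 1.0, 0.5, 1.0]   # hypothetical perceptual weights
levels = quantize(coeffs, weights, global_step=1.0)
print(levels)                    # [21, 0, 0, 3]
print(dequantize(levels, weights, global_step=1.0))
```

Raising `global_step` coarsens every step at once, which drives more levels to zero and lowers the bit rate at the cost of accuracy.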
At lower bitrates, there are typically many more zero-level coefficients than non-zero-level coefficients. The zero-level coefficients can be coded with great efficiency using run-length coding. In run-length coding, the coefficients are typically represented by value pairs, each consisting of a zero run (i.e., the length of a run of consecutive zero-level coefficients) and the level of the non-zero coefficient following the zero run. The resulting sequence is R0, L0, R1, L1, ..., where each R is a zero run and each L is a non-zero level.
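A short sketch may make the run-level representation concrete. The function names are illustrative, and returning the count of trailing zeros separately is an assumption made here so that the block can be decoded exactly; the text does not say how zeros after the last non-zero level are signaled.

```python
def run_level_encode(levels):
    """Convert quantized levels into (zero_run, level) pairs."""
    pairs = []
    run = 0
    for lvl in levels:
        if lvl == 0:
            run += 1                 # extend the current zero run
        else:
            pairs.append((run, lvl))  # run of zeros, then this level
            run = 0
    return pairs, run                # 'run' is now the trailing zero count


def run_level_decode(pairs, trailing_zeros):
    """Rebuild the level sequence from (zero_run, level) pairs."""
    levels = []
    for run, lvl in pairs:
        levels.extend([0] * run)
        levels.append(lvl)
    levels.extend([0] * trailing_zeros)
    return levels


levels = [5, 0, 0, -2, 0, 0, 0, 1, 0, 0]
pairs, trailing = run_level_encode(levels)
print(pairs, trailing)   # [(0, 5), (2, -2), (3, 1)] 2
```

Ten levels collapse into three pairs here, and the saving grows as the proportion of zero-level coefficients increases.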
By exploiting the redundancies between R and L, it is possible to further improve the coding performance. Run-level Huffman coding is a reasonable approach to achieve this, in which R and L are combined into a two-dimensional (R,L) symbol and Huffman-coded. Because of memory restrictions, the entries in the Huffman tables cannot cover all possible (R,L) combinations, which requires special handling of the outliers. A typical method for handling the outliers is to embed an escape code into the Huffman tables, such that an outlier is coded by transmitting the escape code along with the independently quantized R and L.
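The escape mechanism can be illustrated with a toy table. Everything specific below is hypothetical: the codewords in `CODE_TABLE` are hand-picked to be prefix-free rather than derived from real statistics, and the fixed-length fields (`RUN_BITS`, a sign bit, `LEVEL_BITS`) are only one assumed way of sending R and L after the escape code.

```python
ESC = "ESC"
# Hypothetical prefix-free table covering only a few (run, level) pairs.
CODE_TABLE = {(0, 1): "0", (0, -1): "10", (1, 1): "110",
              (1, -1): "1110", ESC: "1111"}
DECODE_TABLE = {code: sym for sym, code in CODE_TABLE.items()}
RUN_BITS = 6     # assumed fixed width for an escaped run
LEVEL_BITS = 8   # assumed fixed width for an escaped level magnitude


def encode_pair(run, level):
    """Return the bit string for one (run, level) pair."""
    code = CODE_TABLE.get((run, level))
    if code is not None:
        return code
    if run >= 2 ** RUN_BITS or abs(level) >= 2 ** LEVEL_BITS:
        raise ValueError("pair does not fit the escape fields")
    # Outlier: escape code, then run, sign bit, and magnitude.
    sign = "1" if level < 0 else "0"
    return (CODE_TABLE[ESC] + format(run, f"0{RUN_BITS}b")
            + sign + format(abs(level), f"0{LEVEL_BITS}b"))


def decode_pairs(bits):
    """Decode a bit string produced by concatenating encode_pair outputs."""
    pairs, pos = [], 0
    while pos < len(bits):
        end = pos + 1
        while bits[pos:end] not in DECODE_TABLE:
            end += 1
            if end > len(bits):
                raise ValueError("truncated or invalid bitstream")
        symbol = DECODE_TABLE[bits[pos:end]]
        pos = end
        if symbol == ESC:
            run = int(bits[pos:pos + RUN_BITS], 2)
            pos += RUN_BITS
            sign = -1 if bits[pos] == "1" else 1
            pos += 1
            level = sign * int(bits[pos:pos + LEVEL_BITS], 2)
            pos += LEVEL_BITS
            symbol = (run, level)
        pairs.append(symbol)
    return pairs


pairs = [(0, 1), (2, -2), (1, -1), (3, 5)]
bits = "".join(encode_pair(r, l) for r, l in pairs)
print(decode_pairs(bits) == pairs)   # True
```

Pairs found in the table cost between one and four bits in this toy example, while each escaped outlier costs 19 bits, which is why the table is normally reserved for the most frequent combinations.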
When transform coding at low bit rates, a large number of the transform coefficients tend to be quantized to zero to achieve a high compression ratio. This can leave large portions of the spectral data missing from the compressed bitstream. After decoding and reconstruction of the audio, these missing spectral portions can produce an unnatural and annoying distortion in the audio. Moreover, the distortion worsens as the missing portions of spectral data become larger. Further, a lack of high frequencies due to quantization makes the decoded audio sound muffled and unpleasant.
Wide-Sense Perceptual Similarity
Perceptual coding can also be taken in a broader sense. For example, some parts of the spectrum can be coded with appropriately shaped noise. When taking this approach, the coded signal may not aim to render an exact or near-exact version of the original. Rather, the goal is to make it sound similar and pleasant when compared with the original. For example, a wide-sense perceptual similarity technique may code a portion of the spectrum as a scaled version of a code-vector, where the code-vector may be chosen from either a fixed predetermined codebook (e.g., a noise codebook) or a codebook taken from a baseband portion of the spectrum (e.g., a baseband codebook).
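Coding a spectral portion as a scaled code-vector can be sketched as a search over a codebook for a shape and a gain. The least-squares criterion used below is an illustrative assumption; the text does not specify how the code-vector or scale is chosen, and a perceptually motivated coder might instead match energy or spectral envelope. The function name and the sample codebook are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def best_scaled_codevector(target, codebook):
    """Pick the (index, gain) whose scaled shape is closest to the target.

    Assumption: closeness is plain squared error, with the gain chosen
    as the least-squares optimum for each candidate shape.
    """
    best = None
    for index, shape in enumerate(codebook):
        shape_energy = dot(shape, shape)
        if shape_energy == 0:
            continue                      # an all-zero shape cannot be scaled
        gain = dot(target, shape) / shape_energy
        error = sum((t - gain * s) ** 2 for t, s in zip(target, shape))
        if best is None or error < best[2]:
            best = (index, gain, error)
    if best is None:
        raise ValueError("codebook has no usable code-vector")
    return best[0], best[1]


target = [2.0, -4.0, 2.0, 0.0]            # spectral portion to be coded
codebook = [[1, 1, 1, 1], [1, -2, 1, 0], [0, 1, 0, -1]]
index, gain = best_scaled_codevector(target, codebook)
print(index, gain)                        # 1 2.0
```

Only the index and the gain need to be transmitted, so the cost per spectral portion is a few bits regardless of how many coefficients the portion contains.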
All these perceptual effects can be used to reduce the bit-rate needed for coding of audio signals. This is because some frequency components do not need to be accurately represented as present in the original signal, but can be either not coded or replaced with something that gives the same perceptual effect as in the original.
In low bit rate coding, a recent trend is to exploit this wide-sense perceptual similarity and use vector quantization (e.g., as a gain and shape code-vector) to represent the high frequency components with very few bits, e.g., 3 kbps. This can alleviate the distortion and unpleasant muffled effect caused by missing high frequencies and other spectral “holes.” The transform coefficients of the “spectral holes” are encoded using the vector quantization scheme. It has been shown that this approach enhances the audio quality with a small increase in bit rate.
Multi-Channel Coding
Some audio encoder/decoders also provide the capability to encode multi-channel audio. Joint coding of audio channels involves coding information from more than one channel together to reduce bitrate. For example, mid/side coding (also called M/S coding or sum-difference coding) involves performing a matrix operation on the left and right stereo channels at an encoder, and sending the resulting “mid” and “side” channels (normalized sum and difference channels) to a decoder. The decoder reconstructs the actual physical channels from the “mid” and “side” channels. M/S coding is lossless, allowing perfect reconstruction if no other lossy techniques (e.g., quantization) are used in the encoding process.
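The M/S matrix operation and its inverse can be sketched in a few lines. The normalization by one half is an assumption made for this illustration; other normalizations (such as dividing by the square root of two) are equally consistent with the description above.

```python
def ms_encode(left, right):
    """Form mid (normalized sum) and side (normalized difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover the physical left and right channels."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left = [1.0, 0.5, -0.25]
right = [0.5, 0.5, 0.75]
mid, side = ms_encode(left, right)
print(mid, side)   # [0.75, 0.5, 0.25] [0.25, 0.0, -0.5]
print(ms_decode(mid, side) == (left, right))   # True
```

When the two channels are similar, the side channel is close to zero and costs few bits, which is where the bitrate saving comes from.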
Intensity stereo coding is an example of a lossy joint coding technique that can be used at low bitrates. Intensity stereo coding involves summing a left and right channel at an encoder and then scaling information from the sum channel at a decoder during reconstruction of the left and right channels. Typically, intensity stereo coding is performed at higher frequencies where the artifacts introduced by this lossy technique are less noticeable.
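A rough sketch of intensity stereo for a single frequency band follows. The text only says that the channels are summed and that the decoder scales the sum, so the energy-preserving scale factors computed here are an assumption, as are the function names.

```python
import math


def energy(band):
    """Sum of squared coefficients in a band."""
    return sum(x * x for x in band)


def intensity_encode(left_band, right_band):
    """Return the sum channel and one scale factor per output channel.

    Assumption: each scale factor is chosen so that the decoded channel
    has the same energy as the original channel in this band.
    """
    summed = [l + r for l, r in zip(left_band, right_band)]
    sum_energy = energy(summed)
    if sum_energy == 0.0:
        return summed, 0.0, 0.0
    scale_left = math.sqrt(energy(left_band) / sum_energy)
    scale_right = math.sqrt(energy(right_band) / sum_energy)
    return summed, scale_left, scale_right


def intensity_decode(summed, scale_left, scale_right):
    """Rebuild both channels as scaled copies of the sum channel."""
    return ([scale_left * x for x in summed],
            [scale_right * x for x in summed])


left = [0.4, -0.2, 0.1]
right = [0.1, 0.3, -0.2]
summed, s_left, s_right = intensity_encode(left, right)
dec_left, dec_right = intensity_decode(summed, s_left, s_right)
print(math.isclose(energy(dec_left), energy(left)))   # True
print(dec_left == left)                               # False: lossy
```

Both decoded channels share the waveform of the sum channel and differ only in level, which is why the technique is lossy and is usually confined to higher frequencies.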
In one prior audio coding technique that combined joint channel coding with vector quantization coding, the encoder/decoder coded a multi-channel sound source by coding a subset of the channels, along with parameters from which the decoder can reproduce a normalized version of a channel correlation matrix. Using the channel correlation matrix, the decoder could reconstruct the remaining channels from the coded subset of the channels. In brief, the decoder performed the following processing flow: decode parameters, produce a normalized complex channel correlation matrix from the parameters, derive a complex transform from the complex correlation matrix, perform complex scaling and rotation on complex spectral transform coefficients using values from the matrix, and perform complex post-processing using values from the matrix. However, this technique required a very high complexity decoder (in other words, very processing-intensive operations that place a high load on processor and memory resources).
More specifically, the technique used a complex rotation in the modulated complex lapped transform (MCLT) domain, followed by post-processing to reconstruct the individual channels from the coded channel subset. Further, the reconstruction of the channels required the decoder to perform a forward and inverse complex transform, again adding to the processing complexity. In addition, where other processing, such as vector quantization (which uses a real-only transform, such as the modulated lapped transform (MLT)), is also performed in the reconstruction domain, the complexity of the decoder increases even further. In such a case, the decoder's processing flow (in brief) becomes: apply the inverse MLT to reconstruct the base band, apply the forward MLT, perform inverse vector quantization to reconstruct the extension region, perform an MLT to MCLT conversion, perform the channel extension processing (as summarized briefly above), and apply the inverse MCLT. This processing flow adds the MLT to MCLT conversion as an extra step. Further, the MCLT has roughly twice the processing complexity of the inverse MLT.